On practical machines they aren't good for much. To access a value in a skip list you have to dereference way more pointers than in a b+ tree. On paper they're about the same, but in practice the binary tree will tend to outperform. You get way more work done per IO operation.
Skiplists are designed for fast intersection, not for single value lookup (assuming a sane design that's not based on linked lists, that's just an educational device that's never used in practice).
They are extremely good at intersections, as you can use the skip pointers in clever ways to skip ahead and eliminate whole swathes of values. You can kinda do that with b-trees[1] as well, but skip lists can beat them out in many cases.
It's highly dependent on the shape of the data though. For random numbers, it's probably an even match, but for postings lists and the like (where skiplists are often used), they perform extremely well as these often have runs of semiconsecutive values being intersected.
[1] I'll argue that if you squint a bit, a skiplist arguably is a sort of degenerate b-tree.
Redis sorted sets are probably the most widely deployed example. Redis uses a skiplist for range queries and ordered iteration paired with a hash table for O(1) lookups. Together they cover the full API at the right complexity for each operation
Skiplists also win over balanced BSTs when it comes to concurrent access. Lock-free implementations are much simplier to reason about and get right. ConcurrentSkipListMap has been in the standard library since Java 6 for exactly this reason and it holds up well under high contention
Skiplist operations are local for the most part, which makes it easier to write thread-safe code for than b-trees in practice. Anecdotally, they were a nice implementation problem for my Java class in uni. But I liked working with b-lists more.
Skip trees/graphs sound interesting, but I can't think of any use case for them off the top of my head.
Almost nothing. My friend and I used it once (in a rather obscure problem). Then used simple lists with some tricks with better performance because of the locality etc.
Could someone provide intuitive understanding for why the "express lanes" in a skip list are created probabilistically?
My first instinctive idea would be that there is an optimal distance, maybe based on absolute distance or by function of list size or frequency of access or whatever. Leaving the promotion to randomness is counter intuitive to me.
In the age of agentic programming and the ever increasing pressure to ship faster, I'm afraid this kind of knowledge will become more and more fringe, even moreso than it is today. Who has the time to think through the intricacies of parallel data structures? Clearly we'll just throw more hardware at problems, write yet another service/api/http endpoint and move on to the next hype. The LLM figures out the algorithms and we soon lose the skills to develop new ones. And tell each other the scifi BS myth that "AI" will invent new data structures in the future so we don't even beed humans in the loop.
AI is like a genie: be careful what you wish for or you'll get what you asked for.
Lately at work I've done C++ optimization tricks like inplace_map, inplace_string, placement new to inline map-like iterators inside a view adapter's iterators and putting that byte buffer as the first member of the class to not incur std::max_align_t padding with the other members. At a higher architecture level, I wrote a data model binding library that can serialize JSON, YAML and CBOR documents to an output iterator one byte at a time without incurring heap allocation in most cases.
This is because I work on an embedded system with 640 KiB of SRAM and given the sheer amount of run-time data it will have to handle and produce, I'm wary not only about heap usage, but also heap fragmentation.
AI will readily identify such tricks, it can even help implement them, but unless constrained otherwise AI will pick the most expedient solution that answers the question (note that I didn't say answers the problem).
I think the opposite is the case. We increasingly need to care more about performance and know how to leverage hardware better.
The market is telling us that through increased hardware prices.
LLMs being very powerful means that we need to start being smarter about allocating resources. Should chat apps really eat up gigabytes of RAM and be entitled to cores, when we could use that for inference?
Random access with similar performance to a balanced binary tree, and ordered iteration as simple as a linked list. It's a nice combination. (Of course, so is a binary search of a sorted array, which I lean more towards these days unless doing a lot of random insertions and deletions throughout the life of the mapping).
In my personal projects, I've used it to insert/delete transactions in a ledger. I wanted to be able to update/query the account balance fast. Like the article says, "fold operations".
For this problem, I’d consider a different approach. You have a fuzzer, and based on some seed it’s generating lots of records. You then need to query a specific record (or set of records) based on the leaf.
I’d just store a table of records with the leaf, associated with the seed. A good fuzzer is entirely deterministic. So you should be able to regenerate the entire run from simply knowing the seed. Just store a table of {leaf, seed}. Then gather all the seeds which generated the leaf you’re interested in and rerun the fuzzer for those seeds at query time to figure out what choices were made.
> Skiplists to the rescue! Or rather, a weird thing we invented called a “skiptree”…
I can't help but wonder. The article makes no mention of b-trees if any kind. To me, this sounded like the obvious first step.
If their main requirement was to do sequential access to load data, and their problem was how to speed up tree traversal on an ad-hoc tree data structure that was too deep, then I wonder if their problem was simply having tree nodes with too few children. A B+ tree specifically sounds to be right on the nose for the task.
IIUC, the tree structure is the very thing they're trying to store and query -- it's not merely an artefact of a data structure that they're using to store something else (the way a B+ tree or binary tree used to store, say, a set of product names would be). If their tree has long paths, it has long paths.
On practical machines they aren't good for much. To access a value in a skip list you have to dereference way more pointers than in a b+ tree. On paper they're about the same, but in practice the binary tree will tend to outperform. You get way more work done per IO operation.
Skiplists are designed for fast intersection, not for single value lookup (assuming a sane design that's not based on linked lists, that's just an educational device that's never used in practice).
They are extremely good at intersections, as you can use the skip pointers in clever ways to skip ahead and eliminate whole swathes of values. You can kinda do that with b-trees[1] as well, but skip lists can beat them out in many cases.
It's highly dependent on the shape of the data though. For random numbers, it's probably an even match, but for postings lists and the like (where skiplists are often used), they perform extremely well as these often have runs of semiconsecutive values being intersected.
[1] I'll argue that if you squint a bit, a skiplist arguably is a sort of degenerate b-tree.
Only somewhat related but there is supposedly a SIMD/GPU-friendly skiplist algo written about here: https://csaws.cs.technion.ac.il/~erez/Papers/GPUSkiplist.pdf
Redis sorted sets are probably the most widely deployed example. Redis uses a skiplist for range queries and ordered iteration paired with a hash table for O(1) lookups. Together they cover the full API at the right complexity for each operation
Skiplists also win over balanced BSTs when it comes to concurrent access. Lock-free implementations are much simplier to reason about and get right. ConcurrentSkipListMap has been in the standard library since Java 6 for exactly this reason and it holds up well under high contention
Skiplists have some nice properties - the code is fairly short and easy to understand, for one. Qt's QMap used to be skip list based, here's the rationale given for it: https://doc.qt.io/archives/qq/qq19-containers.html#associati...
It seems like Qt went from red-black tree to skip list in Qt4 and back to red-black tree in Qt5.
Skiplist operations are local for the most part, which makes it easier to write thread-safe code for than b-trees in practice. Anecdotally, they were a nice implementation problem for my Java class in uni. But I liked working with b-lists more.
Skip trees/graphs sound interesting, but I can't think of any use case for them off the top of my head.
Almost nothing. My friend and I used it once (in a rather obscure problem). Then used simple lists with some tricks with better performance because of the locality etc.
Could someone provide intuitive understanding for why the "express lanes" in a skip list are created probabilistically?
My first instinctive idea would be that there is an optimal distance, maybe based on absolute distance or by function of list size or frequency of access or whatever. Leaving the promotion to randomness is counter intuitive to me.
In the age of agentic programming and the ever increasing pressure to ship faster, I'm afraid this kind of knowledge will become more and more fringe, even moreso than it is today. Who has the time to think through the intricacies of parallel data structures? Clearly we'll just throw more hardware at problems, write yet another service/api/http endpoint and move on to the next hype. The LLM figures out the algorithms and we soon lose the skills to develop new ones. And tell each other the scifi BS myth that "AI" will invent new data structures in the future so we don't even beed humans in the loop.
AI is like a genie: be careful what you wish for or you'll get what you asked for.
Lately at work I've done C++ optimization tricks like inplace_map, inplace_string, placement new to inline map-like iterators inside a view adapter's iterators and putting that byte buffer as the first member of the class to not incur std::max_align_t padding with the other members. At a higher architecture level, I wrote a data model binding library that can serialize JSON, YAML and CBOR documents to an output iterator one byte at a time without incurring heap allocation in most cases.
This is because I work on an embedded system with 640 KiB of SRAM and given the sheer amount of run-time data it will have to handle and produce, I'm wary not only about heap usage, but also heap fragmentation.
AI will readily identify such tricks, it can even help implement them, but unless constrained otherwise AI will pick the most expedient solution that answers the question (note that I didn't say answers the problem).
This is also the reason why we have two polar opposite views on AI. “Slop generator” vs “Next best thing since sliced bread”.
With SOTA models it all depends on how you drive them.
I think the opposite is the case. We increasingly need to care more about performance and know how to leverage hardware better.
The market is telling us that through increased hardware prices.
LLMs being very powerful means that we need to start being smarter about allocating resources. Should chat apps really eat up gigabytes of RAM and be entitled to cores, when we could use that for inference?
Random access with similar performance to a balanced binary tree, and ordered iteration as simple as a linked list. It's a nice combination. (Of course, so is a binary search of a sorted array, which I lean more towards these days unless doing a lot of random insertions and deletions throughout the life of the mapping).
In my personal projects, I've used it to insert/delete transactions in a ledger. I wanted to be able to update/query the account balance fast. Like the article says, "fold operations".
skiplists form the basis of in-memory tables used by LSM trees, which are themselves the basis of most modern DBs (written post 2005).
A major global bank operated all trading, especially the complex stuff, off of a globally replicated skip list.
For this problem, I’d consider a different approach. You have a fuzzer, and based on some seed it’s generating lots of records. You then need to query a specific record (or set of records) based on the leaf.
I’d just store a table of records with the leaf, associated with the seed. A good fuzzer is entirely deterministic. So you should be able to regenerate the entire run from simply knowing the seed. Just store a table of {leaf, seed}. Then gather all the seeds which generated the leaf you’re interested in and rerun the fuzzer for those seeds at query time to figure out what choices were made.
FTA:
> Skiplists to the rescue! Or rather, a weird thing we invented called a “skiptree”…
I can't help but wonder. The article makes no mention of b-trees if any kind. To me, this sounded like the obvious first step.
If their main requirement was to do sequential access to load data, and their problem was how to speed up tree traversal on an ad-hoc tree data structure that was too deep, then I wonder if their problem was simply having tree nodes with too few children. A B+ tree specifically sounds to be right on the nose for the task.
IIUC, the tree structure is the very thing they're trying to store and query -- it's not merely an artefact of a data structure that they're using to store something else (the way a B+ tree or binary tree used to store, say, a set of product names would be). If their tree has long paths, it has long paths.
If you need a graph db, use a graph db.