I simply use SQLite for this. You can store the cache blocks in the SQLite database as blobs. One file, no sparse files. I don't think the "sparse file with separate metadata" approach is necessary here, and sparse files have hidden performance costs that grow with the number of populated extents. A sparse file is not all that different from a directory full of files. It might look like you're avoiding a filesystem lookup, but you're not; you've just moved it into the sparse extent lookup, which you'll pay for on every seek/read/write, not just once on open. You can simply use a regular file and let SQLite manage it entirely at the application level; this is no worse in performance and better for ops in a bunch of ways. Sparse files have a habit of becoming dense when they leave the filesystem they were created on.
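For concreteness, here's a minimal sketch of what that looks like, assuming fixed-size blocks keyed by (file_id, block_no). The table name, the 1 MiB block size, and the pragmas are illustrative choices, not a tuned design:

```python
# Cache chunks of a remote file as blobs in a single SQLite database.
import sqlite3

BLOCK_SIZE = 1 << 20  # 1 MiB cache blocks (arbitrary choice for this sketch)

db = sqlite3.connect("block_cache.db")
db.execute("PRAGMA journal_mode=WAL")  # concurrent readers, single writer
db.execute("""
    CREATE TABLE IF NOT EXISTS blocks (
        file_id  TEXT    NOT NULL,   -- which cached file this block belongs to
        block_no INTEGER NOT NULL,   -- offset // BLOCK_SIZE
        data     BLOB    NOT NULL,   -- raw bytes for that block
        PRIMARY KEY (file_id, block_no)
    ) WITHOUT ROWID
""")

def put_block(file_id: str, block_no: int, data: bytes) -> None:
    with db:
        db.execute(
            "INSERT OR REPLACE INTO blocks (file_id, block_no, data) VALUES (?, ?, ?)",
            (file_id, block_no, data),
        )

def get_block(file_id: str, block_no: int) -> bytes | None:
    row = db.execute(
        "SELECT data FROM blocks WHERE file_id = ? AND block_no = ?",
        (file_id, block_no),
    ).fetchone()
    return row[0] if row else None  # None == cache miss; caller fetches from origin
```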
I don't think the author could even use SQLite for this. NULL in SQLite is stored very compactly, not as pre-filled zeros. They must be talking about a columnar store.
I wonder if attaching a temporary db on fast storage, filled with results of the dense queries, would work without the big assumptions.
I think I did a poor job of explaining. SQLite is dealing with cached filesystem blocks here, and has nothing to do with their query engine. They aren't migrating their query engine to SQLite, they're migrating their sparse file cache to SQLite. The SQLite blobs will be holding ranges of RocksDB file data.
RocksDB has a pluggable filesystem layer (similar to SQLite's virtual filesystems), so they can read blocks from the SQLite cache layer directly without needing to fake a RocksDB file at all. This is how my solution works (I've implemented this before). Mine is SQLite in both places: one SQLite file (normal) holds the cached blocks, and another SQLite file (using a virtual filesystem) runs queries against the cache layer. They could do the same with SQLite holding the cache and RocksDB running the queries.
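As a rough illustration of the read path such a pluggable layer would delegate to (a real RocksDB plugin would be written against its C++ Env/FileSystem interface; this Python sketch reuses get_block/put_block/BLOCK_SIZE from the cache sketch above, and fetch_from_origin is a hypothetical callback, not a real API):

```python
# Translate an arbitrary (offset, length) read into block lookups against the
# SQLite cache, falling back to the origin store on a miss.
def read_range(file_id: str, offset: int, length: int, fetch_from_origin) -> bytes:
    out = bytearray()
    end = offset + length
    pos = offset
    while pos < end:
        block_no = pos // BLOCK_SIZE
        block_start = block_no * BLOCK_SIZE
        data = get_block(file_id, block_no)
        if data is None:
            # Cache miss: pull the whole block from the backing store (e.g. S3),
            # remember it, then keep serving from the cache.
            # (A real implementation would clamp the last block to the file size.)
            data = fetch_from_origin(file_id, block_start, BLOCK_SIZE)
            put_block(file_id, block_no, data)
        lo = pos - block_start
        hi = min(end, block_start + BLOCK_SIZE) - block_start
        out += data[lo:hi]
        pos = block_start + BLOCK_SIZE
    return bytes(out)
```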
IMO, a little more effort would have given them a better solution.
Ah, clever. Since they chose RocksDB, I wonder if Amazon supports zoned storage on NVMe. RocksDB has a zoned-storage plugin, which would be an alternative to yours.
I’ve used this technique in the past, and the problem is that the way some file systems perform the file‑offset‑to‑disk‑location mapping is not scalable. It might always be fine with 512 MB files, but I worked with large files and millions of extents, and it ran into issues, including out‑of‑memory errors on Linux with XFS.
The XFS issue has since been fixed (though you often have no control over which Linux version your program runs on), but in general I’d say it’s better to do such mapping in user space. In this case, there is a RocksDB present anyway, so this would come at no performance cost.
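To make the user-space alternative concrete, here is a small sketch of an extent map kept at the application level; in a real deployment the same mapping could be persisted in the RocksDB instance that is already there, keyed by (file_id, logical_offset). The types and field names are illustrative:

```python
# Do the offset -> location mapping in user space instead of leaning on the
# filesystem's sparse-extent lookup.
import bisect
from dataclasses import dataclass, field

@dataclass
class Extent:
    logical_off: int   # offset within the cached (virtual) file
    length: int        # number of bytes in this extent
    physical_off: int  # where those bytes live in the flat backing file

@dataclass
class ExtentMap:
    extents: list[Extent] = field(default_factory=list)  # sorted by logical_off

    def insert(self, extent: Extent) -> None:
        # Assumes non-overlapping extents for simplicity.
        keys = [e.logical_off for e in self.extents]
        self.extents.insert(bisect.bisect_left(keys, extent.logical_off), extent)

    def lookup(self, logical_off: int) -> Extent | None:
        keys = [e.logical_off for e in self.extents]
        i = bisect.bisect_right(keys, logical_off) - 1
        if i >= 0 and logical_off < self.extents[i].logical_off + self.extents[i].length:
            return self.extents[i]
        return None  # hole: that byte is not cached
```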
We could talk about an even more general idea for saving file space: compression. Ever heard of it being used across whole filesystems?
Most compressible file formats are already compressed, and with compression you lose efficient non-sequential IO.
Microsoft MS-DOS and Windows supported this in the 90s with DriveSpace, and modern file systems like btrfs and zfs also support transparent compression.
You introduce overhead on both read and write without it being a better solution to the OP's problem.
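A quick illustration of the non-sequential-IO point raised above: with whole-stream compression (e.g. gzip), reading a range in the middle still means decompressing everything before it, which is roughly what btrfs/zfs sidestep by compressing small extents/records independently. The helper below is just a sketch:

```python
# Reading a range out of a whole-stream gzip file: the seek is emulated by
# decompressing and discarding everything up to the target offset (and a
# backward seek rewinds to the start), so random reads get slower the deeper
# into the file you go.
import gzip

def read_range_from_gzip(path: str, offset: int, length: int) -> bytes:
    with gzip.open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```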
Sparse files are efficient, but they break NFS quota accounting. We run ~10k pods and found that usage reporting drifts and rehydration latency causes weird timeouts. Strict ext4 project quotas ended up being more reliable for us.
I am guessing the choice here is whether you want the kernel to handle this, and whether that is more performant than just managing a bunch of regular (initially empty) files and a home-grown file allocation table.
Or even just a bunch of little files representing segments of the larger files.
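Something like this, sketched with arbitrary paths and a made-up 4 MiB segment size, where the directory layout effectively becomes the allocation table:

```python
# Store each fixed-size segment of a large cached file as its own regular file.
import os
import hashlib

CACHE_ROOT = "/var/cache/segments"   # illustrative location
SEGMENT_SIZE = 4 << 20               # 4 MiB segments (arbitrary for this sketch)

def segment_path(file_id: str, offset: int) -> str:
    seg_no = offset // SEGMENT_SIZE
    # Hash the file id so one directory does not accumulate millions of entries.
    bucket = hashlib.sha256(file_id.encode()).hexdigest()[:2]
    return os.path.join(CACHE_ROOT, bucket, file_id, f"{seg_no:08d}.seg")

def write_segment(file_id: str, offset: int, data: bytes) -> None:
    path = segment_path(file_id, offset)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)

def read_segment(file_id: str, offset: int) -> bytes | None:
    try:
        with open(segment_path(file_id, offset), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None  # segment not cached
```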
AWS Storage Gateway's cached volume configuration does this automatically.