I'm genuinely surprised that there isn't column-level shared-dictionary string compression built into SQLite, MySQL/MariaDB or Postgres, like this post is describing.
SQLite has no built-in compression, MySQL/MariaDB have page-level compression that doesn't work well and that I've never seen anyone enable in production, and Postgres has per-value compression, which is good for extremely long strings but useless for short ones.
There are just so many string columns where values and substrings get repeated over and over, whether you're storing names, URLs, or just regular text. I have databases I know would shrink by at least half.
Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
It still seems like it would be worth it even if it were something you had to trigger manually. E.g., wait until your table has 100,000 rows, build a dictionary from those values, and then that dictionary is set in stone and used for the next 10,000,000 rows too, unless you rebuild it in the future (which would be an expensive operation).
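To make that concrete, here's a minimal sketch of what that manual workflow could look like if the application did the encoding itself, using Python's zstandard bindings (which support training a shared dictionary from sample values). The database file, table, and column names are made up for illustration, not taken from any real system:

    import sqlite3
    import zstandard  # pip install zstandard

    conn = sqlite3.connect("app.db")  # hypothetical database and schema

    # Step 1: once the table has ~100,000 rows, train a shared dictionary
    # on a sample of the column's values.
    samples = [row[0].encode() for row in
               conn.execute("SELECT url FROM events LIMIT 100000")]
    shared_dict = zstandard.train_dictionary(64 * 1024, samples)

    # Step 2: freeze the dictionary and compress every new value with it.
    compressor = zstandard.ZstdCompressor(dict_data=shared_dict)
    decompressor = zstandard.ZstdDecompressor(dict_data=shared_dict)

    def store(url: str) -> None:
        conn.execute("INSERT INTO events_compressed (url) VALUES (?)",
                     (compressor.compress(url.encode()),))

    def load(blob: bytes) -> str:
        return decompressor.decompress(blob).decode()

    # Rebuilding the dictionary later means re-compressing existing rows,
    # which is the expensive operation mentioned above.

The point is just that the dictionary is trained once and then treated as immutable, which sidesteps the hard problem of keeping it in sync with constant inserts and deletes.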
I wonder how one does LIKE queries.
After decompression, with the performance characteristics you'd expect. If the data has to come off disk, it's still a win, or at least usually breaks even, in their measurements: https://cedardb.com/blog/string_compression/#query-runtime
The paper suggests you could rework string matching to work directly on the compressed data, but they haven't done that yet.
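For what it's worth, here's a toy sketch of the general idea of pushing predicates down to encoded data, using plain integer dictionary codes rather than the FSST-style symbol table the post describes: equality becomes a code comparison, and a LIKE only has to scan the (much smaller) set of distinct strings. It's an illustration of the concept, not what CedarDB actually implements:

    import re

    # Toy illustration only: simple dictionary codes, not the FSST
    # compression from the post. Each distinct string gets an integer code.
    values = ["/home", "/login", "/home", "/cart", "/login", "/home"]
    dictionary: dict[str, int] = {}
    encoded = [dictionary.setdefault(v, len(dictionary)) for v in values]

    def where_equals(literal: str) -> list[int]:
        # Equality never touches the strings: encode the literal once,
        # then compare integer codes.
        code = dictionary.get(literal)
        return [] if code is None else [i for i, c in enumerate(encoded) if c == code]

    def where_like(pattern: str) -> list[int]:
        # LIKE '%foo%' scans only the distinct values in the dictionary,
        # then filters rows by code membership.
        regex = re.compile(pattern.replace("%", ".*"))
        hits = {c for s, c in dictionary.items() if regex.fullmatch(s)}
        return [i for i, c in enumerate(encoded) if c in hits]

    print(where_equals("/home"))   # [0, 2, 5]
    print(where_like("%log%"))     # [1, 4]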
s, jst cmprss ll qrs b rmvng vyls!
Never heard of CedarDB.
Seems to be another commercial cloud-hosted thing offering a Postgres API? https://dbdb.io/db/cedardb
https://cedardb.com/blog/ode_to_postgres/
I was evaluating it recently, but it's not FOSS, so buyer beware. I'm totally fine with commercialization, but I hesitate to build on top of data stores with no escape hatches or maintenance plans, especially when they're venture backed. It is self-hostable, but not OSS.
It's a startup founded by -- and built with tech coming out of research by -- some well-known people in the DB research community.
Successor to Umbra, I believe.
I know somebody (quite talented) working there. It's likely to kick ass in terms of performance.
But it's hard to get people to pay for a DB these days.