Efficient binary SST file format #11

Open
justinethier opened this issue Feb 4, 2022 · 2 comments

justinethier commented Feb 4, 2022

At some point we need a more efficient binary SST file format.

justinethier changed the title from "Efficient SST file format" to "Efficient binary SST file format" on Mar 4, 2022

justinethier commented Mar 4, 2022

Design considerations:

  • Create multiple files for each SST
    • Manifest contains header information, and possibly a flag indicating whether a file is scheduled for deletion
    • Index file contains a sparse set of keys and their locations within the SST file
    • Actual SST file contains entries
  • Header contains sequence number, possibly CRC, anything else?
  • Entry contains key length, key contents, data length, data contents, deleted flag
    • Data can be set to zero length as an optimization when the entry is deleted
  • All length and sequence number fields defined as 64-bit integers. Or could the length fields be smaller?
  • Can use single byte for deleted flag
  • Instead of caching a whole SST file, can just cache a range of the file specified by an index
    • We need to load this anyway when searching for a key
    • Allows us to cache a much smaller region of the file, may scale better
    • Still need to time out the cache. May want to have configurable caching behavior (criteria to cache, TTL, etc.)
  • Want command-line tools for dealing with binary data.
    • At a minimum, want a tool to convert from binary to a text (JSON?) format to inspect data
    • If we are going to do that, it would be handy to have a converter from that text format back to binary, so that changes can be made in a straightforward way
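
A minimal sketch of the entry layout and the binary/JSON converters described above, written in Python purely for illustration. Everything concrete here is an assumption, not a decision made in this issue: the field order, little-endian 64-bit lengths, a trailing one-byte deleted flag, the function names, and the choice to treat keys and data as UTF-8 text in the JSON form.

```python
import json
import struct

# Hypothetical on-disk entry layout (an assumption, not a settled format):
#   key length (u64) | key bytes | data length (u64) | data bytes | deleted flag (u8)
_LEN = struct.Struct("<Q")  # little-endian unsigned 64-bit integer


def encode_entry(key: bytes, data: bytes, deleted: bool = False) -> bytes:
    """Serialize one entry using the assumed layout."""
    if deleted:
        data = b""  # deleted entries carry zero-length data
    return b"".join([
        _LEN.pack(len(key)), key,
        _LEN.pack(len(data)), data,
        b"\x01" if deleted else b"\x00",
    ])


def decode_entry(buf: bytes, offset: int = 0):
    """Read one entry at offset; return (key, data, deleted, next_offset)."""
    (klen,) = _LEN.unpack_from(buf, offset)
    offset += _LEN.size
    key = buf[offset:offset + klen]
    offset += klen
    (dlen,) = _LEN.unpack_from(buf, offset)
    offset += _LEN.size
    data = buf[offset:offset + dlen]
    offset += dlen
    deleted = buf[offset] == 1
    return key, data, deleted, offset + 1


def entries_to_json(buf: bytes) -> str:
    """Binary -> text, for inspection. Assumes keys and data are UTF-8."""
    out = []
    offset = 0
    while offset < len(buf):
        key, data, deleted, offset = decode_entry(buf, offset)
        out.append({"key": key.decode("utf-8"),
                    "data": data.decode("utf-8"),
                    "deleted": deleted})
    return json.dumps(out, indent=2)


def json_to_entries(text: str) -> bytes:
    """Text -> binary, so that hand edits can be written back."""
    return b"".join(
        encode_entry(e["key"].encode("utf-8"),
                     e["data"].encode("utf-8"),
                     e["deleted"])
        for e in json.loads(text)
    )
```

With this layout, every entry costs 17 bytes of overhead on top of the key and data, which is one argument for the smaller length fields raised in the list. Arbitrary binary values would need an encoding such as base64 in the JSON form instead of UTF-8.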

@justinethier

An additional optimization is to gzip every segment of the SST file, that is, the entries between one sparse index key and the next. We can then read the whole block into memory, decompress it, and cache it. This saves I/O and space on disk.
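
A rough Python sketch of that idea, under stated assumptions: each segment is gzipped independently, the sparse index records the first key of each segment with its offset and compressed length, and a lookup decompresses and caches only the one segment that could hold the key. The function names, the `per_segment` parameter, and the tab/newline entry encoding are placeholders for illustration, not part of any agreed design. Cache time-outs are left out.

```python
import bisect
import gzip


def write_segments(entries, per_segment=2):
    """Pack sorted (key, value) byte pairs into gzipped segments.

    Returns (file_bytes, index), where index is a list of
    (first_key, offset, compressed_length) tuples: the sparse index.
    """
    body = bytearray()
    index = []
    for i in range(0, len(entries), per_segment):
        chunk = entries[i:i + per_segment]
        # Placeholder encoding: assumes keys and values contain no tab or newline.
        raw = b"".join(k + b"\t" + v + b"\n" for k, v in chunk)
        packed = gzip.compress(raw)
        index.append((chunk[0][0], len(body), len(packed)))
        body += packed
    return bytes(body), index


def lookup(file_bytes, index, key, cache):
    """Find key by decompressing only the segment that may contain it."""
    first_keys = [entry[0] for entry in index]
    pos = bisect.bisect_right(first_keys, key) - 1
    if pos < 0:
        return None  # key sorts before the first segment
    _, offset, length = index[pos]
    if offset not in cache:
        raw = gzip.decompress(file_bytes[offset:offset + length])
        cache[offset] = dict(
            line.split(b"\t", 1) for line in raw.split(b"\n") if line
        )
    return cache[offset].get(key)
```

Because each segment is compressed on its own, the cache holds decompressed blocks keyed by file offset, so repeated lookups in the same key range cost no further I/O or decompression.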
