Page MenuHomeSoftware Heritage

Using a custom Sorted String Table format
Closed, MigratedEdits Locked

Description

It is a format to store what is conceptually a Sorted String Table. There is no reference defining what a Sorted String Table is and the implementations varies depending on the context. It is often said to have been introduced in a paper from Google. It is a Key/Value map sorted by Key.

Format

The custom format is a header:

  • Format version
  • Number of entries in the index

followed by an index which is a sorted list of fixed size entries:

  • SHA256,offset,size

after the index the content of the objects is found.

Writing

It is assumed writing is done in batch, sequentially

Reading

  • Binary search for the SHA256 in the index
  • Seek to the object content to stream it to the caller in chunks of a given size

Event Timeline

dachary renamed this task from Using custom format for 1TB archive to Using a custom format for 1TB archive.Feb 15 2021, 5:41 PM
dachary changed the task status from Open to Work in Progress.
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.

followed sequence of:

Size of SHA256, SWHID, Content
SHA256
SWHID
Content

see my comment/question in T3049#58734 about the presence of both SHA256 and SWHID in the format proposal

Updated the description, even simpler.

T3050 is a better fit as it does not require any specification or development.

dachary reopened this task as Work in Progress.

Let's leave it open: although T3050 is a better fit, it is not ready yet and an interim solution may be required.

It turns out there are a number of suitable formats (SST from RocksDB for one), no need to re-invent this wheel.

dachary renamed this task from Using a custom format for 1TB archive to Using a custom Sorted String format.Feb 23 2021, 11:21 PM
dachary reopened this task as Work in Progress.
dachary updated the task description. (Show Details)

Reopening for benchmarking purposes because there does not seem to be anything ready to use T3068.

dachary renamed this task from Using a custom Sorted String format to Using a custom Sorted String Table format.Feb 23 2021, 11:24 PM
dachary updated the task description. (Show Details)
dachary updated the task description. (Show Details)
gitlab-migration changed the task status from Wontfix to Migrated.Jan 8 2023, 4:33 PM
gitlab-migration claimed this task.
gitlab-migration changed the status of subtask T3045: Using pixz for 1TB archives from Resolved to Migrated.
gitlab-migration changed the status of subtask T3046: Using xz-file-format for 1TB archive from Resolved to Migrated.
gitlab-migration added a subscriber: gitlab-migration.