
Using xz-file-format for 1TB archive
Closed, Migrated, Edits Locked

Description

How practical would it be to use the xz file format to create a 1TB archive of artifacts that have a 3KB median size and an 80KB average size?

  • Each artifact would be an individually compressed block
  • An uncompressed, sorted index with fixed-size records (SHA256 => block number) is stored in the last block

The index does not fit in memory and needs sorting.
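One way to deal with an index that does not fit in memory is an external merge sort followed by binary search directly on the file, which the fixed record size makes possible. The sketch below assumes a hypothetical record layout (32-byte SHA256 followed by an 8-byte block number); the names external_sort and lookup and the run_size parameter are illustrative, not part of any existing tool.

```python
import hashlib
import heapq
import io
import itertools
import struct
import tempfile

# Hypothetical fixed-size record: 32-byte SHA256 digest, 8-byte block number.
RECORD = struct.Struct(">32sQ")


def _read_run(run_file):
    """Yield the records of one sorted run, reading one record at a time."""
    run_file.seek(0)
    while chunk := run_file.read(RECORD.size):
        yield RECORD.unpack(chunk)


def external_sort(records, out, run_size=1_000_000):
    """Sort (digest, block) records with bounded memory and write them to out."""
    runs = []
    it = iter(records)
    # Sort at most run_size records at a time and spill each run to a temp file.
    while run := sorted(itertools.islice(it, run_size)):
        run_file = tempfile.TemporaryFile()
        run_file.writelines(RECORD.pack(*r) for r in run)
        runs.append(run_file)
    count = 0
    # heapq.merge keeps only one record per run in memory.
    for record in heapq.merge(*(_read_run(f) for f in runs)):
        out.write(RECORD.pack(*record))
        count += 1
    for run_file in runs:
        run_file.close()
    return count


def lookup(index, count, digest, base=0):
    """Binary search the sorted index file; return the block number or None."""
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        index.seek(base + mid * RECORD.size)
        key, block = RECORD.unpack(index.read(RECORD.size))
        if key == digest:
            return block
        if key < digest:
            lo = mid + 1
        else:
            hi = mid
    return None


# Usage: ten records sorted in runs of three, then looked up by digest.
records = [(hashlib.sha256(bytes([i])).digest(), i) for i in range(10)]
index = io.BytesIO()
count = external_sort(records, index, run_size=3)
assert lookup(index, count, hashlib.sha256(bytes([7])).digest()) == 7
```

Each lookup costs about log2(count) seeks. A very large number of runs would also run into the open file limit, so a real implementation would merge in several passes or pick a larger run_size.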

Event Timeline

dachary changed the task status from Open to Work in Progress. Feb 15 2021, 9:36 AM
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.

https://py7zr.readthedocs.io/en/latest/archive_format.html

The 7z format is more complex because it knows about files, directories, etc. It is not just a compressed data format.

https://github.com/facebook/zstd/blob/master/doc/zstd_compression_format.md

The zstd format is tightly associated with the compression algorithm and is therefore more complex. A zstd file can, however, be a sequence of independently compressed frames, so it could be used for the same purpose as xz.

Although simple and close to what is needed, xz is not an exact match: the SHA256 index would still need to be maintained on top of it.