
Using pixz for 1TB archives
Closed, Migrated


How practical would it really be to use pixz to create a 1TB archive of artifacts with a 3KB median size and an 80KB average size? Are there any blockers or performance drawbacks when using the pixz index to read a single artifact? What is the CPU cost of decompression? When reading a single artifact, what overhead does the fixed block size imply?

There can be up to 1TB / 3KB = ~300 million artifacts. Assuming the name of each artifact is a SHA256, i.e. 32 bytes, that is an index of roughly 10GB, i.e. ~1% of the archive size.
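For reference, the back-of-the-envelope numbers above can be reproduced with a quick calculation (a minimal sketch; the sizes are the ones assumed in this task, not measured values):

```python
# Rough sizing of the artifact index for a 1TB pixz archive (illustrative only).
TB = 10 ** 12
KB = 10 ** 3

archive_size = 1 * TB          # target archive size
median_artifact_size = 3 * KB  # median artifact size assumed above
sha256_size = 32               # one SHA256 digest per artifact, in bytes

artifact_count = archive_size // median_artifact_size
index_size = artifact_count * sha256_size

print(f"artifacts: ~{artifact_count / 1e6:.0f} million")
print(f"index:     ~{index_size / 1e9:.1f} GB "
      f"({100 * index_size / archive_size:.2f}% of the archive)")
```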

Event Timeline

dachary triaged this task as Normal priority. Feb 15 2021, 8:49 AM
dachary created this task.
dachary created this object in space S1 Public.

The index is located at the end of the file.
The content of the archive is compressed as successive blocks of a given size.
The index itself is compressed as a single block of unlimited size.

XZ file format:

Random-access reading: The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough.
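In other words, a reader that knows the block index can seek directly to the block containing the wanted offset and decompress only that block. A minimal sketch of that idea (the entry layout and names below are hypothetical, for illustration only, not pixz's actual data structures):

```python
import bisect
from dataclasses import dataclass

@dataclass
class BlockEntry:
    uncompressed_offset: int  # where the block starts in the uncompressed stream
    compressed_offset: int    # where the block starts in the .xz file
    compressed_size: int

def locate_block(index: list[BlockEntry], offset: int) -> BlockEntry:
    """Binary-search the block index for the block containing `offset`."""
    starts = [e.uncompressed_offset for e in index]
    i = bisect.bisect_right(starts, offset) - 1
    return index[i]

# Reading one artifact only costs decompressing the block(s) that contain it:
# the smaller the block size, the less wasted decompression per random read.
```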

When extracting a single file (-x file), the in-memory index is walked sequentially, looking for the file.

There are two blockers:

  • The index must be read entirely into memory (10GB is too big)
  • The index is searched sequentially, i.e. lookups are O(N) (see the sketch below)
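A sketch of why this is a problem at the scale discussed above, mirroring the sequential walk described earlier (the entry shape is hypothetical; pixz's actual in-memory layout differs):

```python
# With ~300 million (32-byte SHA256, block offset) entries, the file-name
# index alone is ~10GB before any per-entry overhead, and a lookup by name
# has to scan it from the start.

def find_artifact(index_entries, wanted_sha256: bytes):
    """Sequential walk of the in-memory index: O(N) per lookup."""
    for name, block_offset in index_entries:  # ~300 million iterations worst case
        if name == wanted_sha256:
            return block_offset
    return None
```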

pixz is not usable as is.