
Using pixz for 1TB archives
Closed, Migrated


How practical would it really be to use pixz to create a 1TB archive of artifacts with a 3KB median size and an 80KB average size? Are there any blockers or performance drawbacks when using the pixz index to read a single artifact? What is the CPU cost of decompression? When reading a single artifact, what overhead does the fixed block size imply?

There can be up to 1TB / 3KB = ~300 million artifacts. Assuming the name of each artifact is a SHA256, i.e. 32 bytes, that is an index of roughly 10GB, i.e. ~1% of the archive size.
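For reference, the back-of-the-envelope numbers above can be reproduced with a quick calculation (a minimal sketch; the sizes are the ones assumed in this task, not measured values):

```python
# Rough sizing of the artifact index for a 1TB pixz archive (illustrative only).
TB = 10 ** 12
KB = 10 ** 3

archive_size = 1 * TB          # target archive size
median_artifact_size = 3 * KB  # median artifact size assumed above
sha256_size = 32               # one SHA256 digest per artifact, in bytes

artifact_count = archive_size // median_artifact_size
index_size = artifact_count * sha256_size

print(f"artifacts: ~{artifact_count / 1e6:.0f} million")
print(f"index:     ~{index_size / 1e9:.1f} GB "
      f"({100 * index_size / archive_size:.2f}% of the archive)")
```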

Event Timeline

dachary triaged this task as Normal priority. Feb 15 2021, 8:49 AM
dachary created this task.
dachary created this object in space S1 Public.

The index is located at the end of the file.
The content of the archive is compressed as successive blocks of a given size.
The index itself is compressed as a single block of unlimited size.

XZ file format:

Random-access reading: The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough.
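In other words, a reader that knows the block index can seek directly to the block containing the wanted offset and decompress only that block. A minimal sketch of that idea (the entry layout and names below are hypothetical, for illustration only, not pixz's actual data structures):

```python
import bisect
from dataclasses import dataclass

@dataclass
class BlockEntry:
    uncompressed_offset: int  # where the block starts in the uncompressed stream
    compressed_offset: int    # where the block starts in the .xz file
    compressed_size: int

def locate_block(index: list[BlockEntry], offset: int) -> BlockEntry:
    """Binary-search the block index for the block containing `offset`."""
    starts = [e.uncompressed_offset for e in index]
    i = bisect.bisect_right(starts, offset) - 1
    return index[i]

# Reading one artifact only costs decompressing the block(s) that contain it:
# the smaller the block size, the less wasted decompression per random read.
```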

When extracting a single file (-x file), the in-memory index is walked sequentially, looking for the file.

There are two blockers:

  • The index must be read entirely into memory (10GB is too big)
  • The index is searched sequentially, i.e. lookups are O(N) (see the sketch below)
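A sketch of why this is a problem at the scale discussed above, mirroring the sequential walk described earlier (the entry shape is hypothetical; pixz's actual in-memory layout differs):

```python
# With ~300 million (32-byte SHA256, block offset) entries, the file-name
# index alone is ~10GB before any per-entry overhead, and a lookup by name
# has to scan it from the start.

def find_artifact(index_entries, wanted_sha256: bytes):
    """Sequential walk of the in-memory index: O(N) per lookup."""
    for name, block_offset in index_entries:  # ~300 million iterations worst case
        if name == wanted_sha256:
            return block_offset
    return None
```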

pixz is not usable as is.