Using pixz for 1TB archives
Closed, Migrated
Assigned To

Authored By

	dachary
	Feb 15 2021, 8:49 AM

Description

How practical would it be to use pixz to create a 1TB archive of artifacts with a 3KB median size and an 80KB average size? Are there any blockers or performance penalties when using the pixz index to read a single artifact? What is the CPU cost of decompression? When reading a single artifact, what overhead does the fixed block size imply?

There can be up to 1TB / 3KB = ~300 million artifacts. Assuming the name of each artifact is a SHA256, i.e. 32 bytes, that's an index of ~10GB, i.e. ~1% of the archive size.
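The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1TB = 10^12 bytes), counts nothing but the 32-byte names (no per-entry offsets, no compression of the index), and the function name `index_estimate` is made up for illustration. It also shows what the figure becomes if the 80KB average is used as the divisor instead of the 3KB median.

```python
# Back-of-the-envelope index size for a 1TB archive (decimal units assumed).
TOTAL_BYTES = 10**12             # 1TB
MEDIAN_ARTIFACT = 3 * 10**3      # 3KB median size
AVERAGE_ARTIFACT = 80 * 10**3    # 80KB average size
NAME_BYTES = 32                  # raw SHA256 digest, names only


def index_estimate(artifact_size):
    """Return (artifact count, index bytes) if every artifact had this size."""
    count = TOTAL_BYTES // artifact_size
    return count, count * NAME_BYTES


median_count, median_index = index_estimate(MEDIAN_ARTIFACT)
average_count, average_index = index_estimate(AVERAGE_ARTIFACT)

print(f"median-based:  {median_count:,} artifacts, {median_index / 1e9:.1f}GB of names")
print(f"average-based: {average_count:,} artifacts, {average_index / 1e9:.1f}GB of names")
```

Dividing by the median gives the pessimistic bound used in the description; dividing by the average gives the count implied by the stated mean size, which is considerably smaller.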

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T3116 Roll out at least one operational mirror
Migrated	gitlab-migration	T3054 Scale out object storage design
Migrated	gitlab-migration	T3048 Using a custom Sorted String Table format
Migrated	gitlab-migration	T3045 Using pixz for 1TB archives

Event Timeline

dachary triaged this task as Normal priority. Feb 15 2021, 8:49 AM

dachary created this task.

dachary created this object in space S1 Public.

https://github.com/vasi/pixz/blob/master/src/common.c#L115

The index is located at the end of the file.
The content of the archive is compressed as successive blocks of a given size.
The index is compressed as a single block of unlimited size.

XZ file format: https://tukaani.org/xz/format.html and https://tukaani.org/xz/xz-file-format.txt

Random-access reading: The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough.
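To get a feel for the overhead implied by the block size, here is a minimal sketch of block-wise random access. It is not the pixz or .xz block layout: it uses Python's `lzma` module to compress each chunk as a separate stream and keeps a hand-made offset index, and `BLOCK_SIZE`, `build_archive` and `read_range` are hypothetical names. The point it illustrates is that reading a small artifact requires decompressing every whole block it overlaps.

```python
import lzma

BLOCK_SIZE = 64 * 1024  # hypothetical uncompressed block size


def build_archive(payload, block_size=BLOCK_SIZE):
    """Compress payload in independent chunks; return (archive, index).

    index[i] is (compressed_offset, compressed_length) of chunk i.
    """
    archive = bytearray()
    index = []
    for start in range(0, len(payload), block_size):
        chunk = lzma.compress(payload[start:start + block_size])
        index.append((len(archive), len(chunk)))
        archive += chunk
    return bytes(archive), index


def read_range(archive, index, offset, length, block_size=BLOCK_SIZE):
    """Return (payload[offset:offset+length], bytes decompressed), length > 0.

    Only the chunks overlapping the requested range are decompressed.
    """
    first = offset // block_size
    last = (offset + length - 1) // block_size
    out = bytearray()
    for i in range(first, last + 1):
        pos, size = index[i]
        out += lzma.decompress(archive[pos:pos + size])
    skip = offset - first * block_size
    return bytes(out[skip:skip + length]), len(out)


payload = bytes(range(256)) * 1024          # 256KiB of sample data
archive, index = build_archive(payload)
data, decompressed = read_range(archive, index, 100_000, 3_000)
print(f"read {len(data)} bytes, decompressed {decompressed} bytes")
```

In this toy setup a 3KB read costs one full block of decompression; a range that straddles a block boundary would cost two.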

dachary updated the task description. Feb 15 2021, 9:02 AM

https://github.com/vasi/pixz/blob/master/src/read.c#L237

When extracting a single file (-x file), the in-memory index is walked sequentially to find the file.
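The cost of such a sequential walk can be sketched as follows. This is not pixz's code (pixz is written in C); the entry names, offsets and the `find_sequential` helper are invented for illustration. It simply counts how many entries are visited before a match, next to a hash-table lookup for comparison.

```python
def find_sequential(entries, name):
    """Walk (name, offset) pairs in order; return (offset, entries visited)."""
    for steps, (entry_name, offset) in enumerate(entries, start=1):
        if entry_name == name:
            return offset, steps
    return None, len(entries)


# Hypothetical in-memory index: 100,000 entries with made-up offsets.
entries = [(f"artifact-{i:08d}", i * 4096) for i in range(100_000)]

# Worst case for the walk: the wanted entry is the last one.
offset, steps = find_sequential(entries, "artifact-00099999")
print(f"sequential walk visited {steps:,} entries")

# For comparison only: a dict costs one pass to build, then each
# lookup no longer depends on the position of the entry.
by_name = dict(entries)
assert by_name["artifact-00099999"] == offset
```

With hundreds of millions of entries, the number of entries visited per lookup grows in proportion to the index size.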

There are two blockers: