
properly handle ingestion of archives within archives (recursive extraction)
Closed, Migrated

Description

Currently, HAL accepts software deposits in .zip or .tar.gz formats, but if the deposit is not in .zip format, it wraps it into a .zip before sending it to us. This fools our ingestion process: the Merkle tree built for the deposit ends up holding a single .tar.gz blob instead of the contents of that archive.

This is clearly an error on the HAL side, and we will try to have it fixed there. But as software deposits become more widespread, we may be confronted with zillions of mistakes like this one, and while we wait for them to be fixed, we pollute our archive.
We need to decide whether to fix this behavior on our side by recursively opening wrappers, or to simply reject the deposit when we see the double wrapping.

Wrappers should be easy to spot: a .zip file containing just a .tar.gz file, a .tar file containing just a .tar file, etc.
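As a rough illustration of the "easy to spot" rule above, here is a minimal sketch using only the Python standard library. The names `ARCHIVE_SUFFIXES`, `list_members` and `is_wrapper` are hypothetical and are not part of any existing loader; the sketch also assumes that looking at the file name suffix of the single member is good enough, which a real check might want to replace with content sniffing.

```python
import tarfile
import zipfile

# Assumption: suffixes we treat as "an archive of some sort".
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz")


def list_members(path):
    """Return the names of the regular files stored in a tar or zip archive."""
    # Tar is probed first: a tar file that ends with an uncompressed zip
    # member can otherwise be mistaken for a zip file.
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            return [m.name for m in tf.getmembers() if m.isfile()]
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return [i.filename for i in zf.infolist() if not i.is_dir()]
    raise ValueError("not a tar or zip archive: %r" % path)


def is_wrapper(path):
    """True if the archive holds exactly one file and that file looks like an archive."""
    members = list_members(path)
    return len(members) == 1 and members[0].lower().endswith(ARCHIVE_SUFFIXES)
```

With this, a deposit check could either reject the archive when `is_wrapper(path)` is true or unwrap one level before ingestion; which of the two is preferable is exactly the decision this task asks for.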

Event Timeline

zack triaged this task as Normal priority. Edited Jun 28 2018, 10:31 AM

The general problem (see below for the deposit-specific case) is indeed complex to deal with, both conceptually in a pure Merkle setting and practically due to the existence of zip bombs. I think a workable solution might be to ingest the archive as is and also ingest a separate directory corresponding to the archive content, with some metadata linking the two. That way, by default we will only return what we have ingested (without recursion), but we will offer ways to dig in recursively, e.g., in the web app. There will be plenty of devils in plenty of details for this, though.
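To make the zip bomb concern concrete, here is a minimal sketch of recursive descent with hard limits. It is only an illustration, not the ingestion code: it handles .zip only, works in memory, and the names `iter_nested_files`, `ExtractionLimitExceeded`, `max_depth` and `max_total_bytes` are all hypothetical. The byte budget is enforced on the bytes actually decompressed rather than on the sizes declared in the archive headers, since those can be forged.

```python
import io
import zipfile


class ExtractionLimitExceeded(Exception):
    """Raised when nesting depth or decompressed size goes over the limits."""


def iter_nested_files(blob, max_depth=3, max_total_bytes=50 * 1024 * 1024):
    """Yield (path, content) for each non-zip file, descending into nested zips."""
    remaining = [max_total_bytes]  # shared budget across all nesting levels

    def walk(data, prefix, depth):
        if depth > max_depth:
            raise ExtractionLimitExceeded("nesting too deep at %r" % prefix)
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                with zf.open(info) as member:
                    # Read at most one byte more than the budget allows.
                    content = member.read(remaining[0] + 1)
                if len(content) > remaining[0]:
                    raise ExtractionLimitExceeded("size budget exhausted")
                remaining[0] -= len(content)
                path = prefix + info.filename
                if zipfile.is_zipfile(io.BytesIO(content)):
                    # Nested archive: recurse, marking the boundary with "!/".
                    yield from walk(content, path + "!/", depth + 1)
                else:
                    yield path, content

    yield from walk(blob, "", 0)
```

Because this is a generator, the limits are checked lazily while iterating. Under the proposal above, the outer archive would still be ingested unchanged, and a walk like this would only feed the separate "contents" directory linked to it by metadata.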

For the deposit-specific case, I agree that we can treat as an "error" any archive sent to us that contains a single file which is itself an archive of some sort. We can easily add a check that makes the ingestion fail in those cases. I've filed T1123 about this aspect of the problem.

zack renamed this task from Decide how to handle software deposits containing double archive wrapping to properly handle ingestion of archives within archives (recursive extraction). Jun 28 2018, 10:31 AM