Page MenuHomeSoftware Heritage

tarball: Add support for tar.Z, tar.x, tar.lz files

Authored by ardumont on Nov 28 2019, 4:00 PM.



This will allow to deal with more tarballs (gnu, cran, nix, guix, etc...).

Implementation wise, this uses the shutil
high-level standard python library.

Feel free to propose a better way of doing this, I did not see any.

Note: I expected the ci build to fail. One test needs the "optional" lzip
dependency installed. It did not but...
I don't know how to idiomatically make this setup optional (depending on the
runtime's current setup). That is without hard-coding check on /usr/bin/lzip
for example, etc... And i'd rather we do that so any hint is welcome ;)

Related to T2110

Test Plan


Diff Detail

rDCORE Foundations and core functionalities
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont edited the summary of this revision. (Show Details)

Use a smaller .tar.lz sample file

  • core.tarballs: Improve docstring

No other clients of this relies on the actual 'nature' returned here.


same remark, nobody actually depends on is_tarball.

douardda added inline comments.

As always, naming... what does that 'specific' means here?


This example list should at least be terminated by a '...'. The description should mention that since it uses the tar command, it supports any compression format this later supports.


Why raise this exception from within the surrounding try block?
A non existing file should not raise a shutil specific exception IMHO.


This 'exist_ok' is a bit scary. Not sure we do want that.


no need for this "embedded" import statement, raise to the top of the module.


better dispatch the elements of the expected 3-tuple in 3 variables here. Like
for name, extenstions, function in ADDITIONAL...


Unless I'm mistaken, this is not true. The ValueError will be raised in case of a corrupted file for example, which is IMHO not what we want.

This revision now requires changes to proceed.Nov 29 2019, 11:01 AM

Adapt according to review


I will improve on this. It's explained in the docstring though.

It was initially named "unpack_tar_Z" as the first implementation was "specific" to tar.Z.
Then i realized the other unsupported tarballs by the standard python library (tar.x, tar.lz) could also be used by this function. So it was then specific to those tarballs.


I did not put the ellipsis because the use case foreseen at the time was only for those.
(See the register variable at the end of the module).

I agree we could use this for more though.
I now see that since the function is exposed by the module anyway, there is no limit in its us.


This implementation is oriented toward the shutil registration...
Thus raising the same error as the shutil mechanism expects.
(see the uncompress definition which now uses the shutil.unpack_archive)

I also want to avoid indirection in function to translate error (one function calling this one and trapping error the shutil expected way).

What should be a proper way to do so?

I could also make the _unpack_tar a private function.
But i don't know if that address totally your concern.
(Which i realize is what i aimed at initially...)


well, i want to make a dedicated directory containing only the content of the tarball.
I guess, we could let the client caller ensure that.
I will adapt in that direction.


oh yes, i did that and reverted it with something else.
I'll add it back.

Fix missing assertion adaptation

Quoting me on the diff's description for that.

Note: I expected the ci build to fail. One test needs the "optional" lzip
dependency installed.

And it did.

Fixed the ci build it by adding the lzip dependency on ci node.

Note: This means that we will need to add the lzip runtime dependency on a debian package somewhere (either swh-core's as it's the one exposing it or the client of this, loader-core comes to mind).

This revision is now accepted and ready to land.Dec 12 2019, 11:15 AM