Page MenuHomeSoftware Heritage

tarball: Add support for tar.Z, tar.x, tar.lz files
ClosedPublic

Authored by ardumont on Nov 28 2019, 4:00 PM.

Details

Summary

This will allow to deal with more tarballs (gnu, cran, nix, guix, etc...).

Implementation wise, this uses the shutil
high-level standard python library.

Feel free to propose a better way of doing this, I did not see any.

Note: I expected the ci build to fail. One test needs the "optional" lzip
dependency installed. It did not but...
I don't know how to idiomatically make this setup optional (depending on the
runtime's current setup). That is without hard-coding check on /usr/bin/lzip
for example, etc... And i'd rather we do that so any hint is welcome ;)

Related to T2110

Test Plan

tox

Diff Detail

Repository
rDCORE Foundations and core functionalities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Nov 28 2019, 4:00 PM
ardumont edited the summary of this revision. (Show Details)Nov 28 2019, 4:02 PM
ardumont edited the summary of this revision. (Show Details)Nov 28 2019, 4:03 PM
ardumont edited the summary of this revision. (Show Details)
ardumont updated this revision to Diff 8355.Nov 28 2019, 4:08 PM

Use a smaller .tar.lz sample file

ardumont updated this revision to Diff 8356.Nov 28 2019, 4:11 PM
  • core.tarballs: Improve docstring
ardumont added inline comments.Nov 28 2019, 5:18 PM
swh/core/tarball.py
164

No other clients of this relies on the actual 'nature' returned here.

swh/core/tests/test_tarball.py
11

same remark, nobody actually depends on is_tarball.

ardumont edited the summary of this revision. (Show Details)Nov 28 2019, 5:36 PM
douardda requested changes to this revision.Nov 29 2019, 11:01 AM
douardda added inline comments.
swh/core/tarball.py
17

As always, naming... what does that 'specific' means here?

18

This example list should at least be terminated by a '...'. The description should mention that since it uses the tar command, it supports any compression format this later supports.

80–81

Why raise this exception from within the surrounding try block?
A non existing file should not raise a shutil specific exception IMHO.

84

This 'exist_ok' is a bit scary. Not sure we do want that.

85

no need for this "embedded" import statement, raise to the top of the module.

85

better dispatch the elements of the expected 3-tuple in 3 variables here. Like
for name, extenstions, function in ADDITIONAL...

190

Unless I'm mistaken, this is not true. The ValueError will be raised in case of a corrupted file for example, which is IMHO not what we want.

This revision now requires changes to proceed.Nov 29 2019, 11:01 AM
ardumont updated this revision to Diff 8382.Nov 30 2019, 12:17 PM

Adapt according to review

swh/core/tarball.py
17

I will improve on this. It's explained in the docstring though.

It was initially named "unpack_tar_Z" as the first implementation was "specific" to tar.Z.
Then i realized the other unsupported tarballs by the standard python library (tar.x, tar.lz) could also be used by this function. So it was then specific to those tarballs.

18

I did not put the ellipsis because the use case foreseen at the time was only for those.
(See the register variable at the end of the module).

I agree we could use this for more though.
I now see that since the function is exposed by the module anyway, there is no limit in its us.

80–81

This implementation is oriented toward the shutil registration...
Thus raising the same error as the shutil mechanism expects.
(see the uncompress definition which now uses the shutil.unpack_archive)

I also want to avoid indirection in function to translate error (one function calling this one and trapping error the shutil expected way).

What should be a proper way to do so?

I could also make the _unpack_tar a private function.
But i don't know if that address totally your concern.
(Which i realize is what i aimed at initially...)

84

well, i want to make a dedicated directory containing only the content of the tarball.
I guess, we could let the client caller ensure that.
I will adapt in that direction.

85

oh yes, i did that and reverted it with something else.
I'll add it back.

ardumont updated this revision to Diff 8383.Nov 30 2019, 12:22 PM

Fix missing assertion adaptation

ardumont updated this revision to Diff 8405.Dec 3 2019, 12:31 PM

Plug to branch

Quoting me on the diff's description for that.

Note: I expected the ci build to fail. One test needs the "optional" lzip
dependency installed.

And it did.

Fixed the ci build it by adding the lzip dependency on ci node.

Note: This means that we will need to add the lzip runtime dependency on a debian package somewhere (either swh-core's as it's the one exposing it or the client of this, loader-core comes to mind).

douardda accepted this revision.Dec 12 2019, 11:15 AM
This revision is now accepted and ready to land.Dec 12 2019, 11:15 AM
ardumont updated this revision to Diff 8631.Dec 12 2019, 4:05 PM

Rebase and plug to master