
swh.loader.package: Implement GNU Loader

Authored by nahimilega on Aug 7 2019, 1:01 PM.

Diff Detail

rDLDBASE Generic VCS/Package Loader
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Revision Created:

{'author': {'email': b'', 'fullname': b'Software Heritage', 'name': b'Software Heritage'}, 
'committer': {'email': b'', 'fullname': b'Software Heritage', 'name': b'Software Heritage'}, 
'committer_date': {'negative_utc': False, 'offset': 0, 'timestamp': {...}}, 
'date': {'negative_utc': False, 'offset': 0, 'timestamp': {...}}, 
'directory': b':\xeb\xc2\x9e\xd1\...\x06\xb5(', 'id': b'D\x184\x88\xc0wL\x...x8aJB\xb3', 'message': b'swh-loader-package...n message', 
'metadata': {'package': {...}}, 
'parents': [], 
'synthetic': True, 
'type': 'tar'}

Snapshot Created:

{'branches': {b'8sync-0.1.0.tar.gz': {...}, b'8sync-0.2.0.tar.gz': {...}, b'HEAD': {...}}}

One of the branches:

{b'8sync-0.1.0.tar.gz': {'target': b'D\x184\x88\xc0wL\x...x8aJB\xb3', 'target_type': 'revision'}, b'8sync-0.2.0.tar.gz': {'target': b'\xc57\xd4\xad\xbeM...w\x13\x84', 'target_type': 'revision'}, b'HEAD': {'target': b'8sync-0.2.0.tar.gz', 'target_type': 'alias'}}
ardumont added inline comments.

Those need to be cleaned up: T1532 (resulting from multiple team discussions).

I know @anlambert said otherwise but still, that task explains it all.
I never got a chance to finish this but I won't accept new code using that deprecated behavior.

You can however reference the needed configuration as comments.


that comes from self.config['temp_directory']

That seems to go in the right direction.

There are some things to change as commented above.

Also, I'm missing, as mentioned on IRC:

  1. tests for the expected structure of the release, revision, snapshot
  2. the current tests need to also check the release, revision, snapshot [1]
  3. We use requests but I don't see a mock for that, so I expect the tests to actually trigger HTTP requests; this needs to be mocked.
  4. As we are writing to a temporary folder, make sure the tests clean up after they run.
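Point 3 could be addressed along these lines with `unittest.mock`; this is only a sketch, and `fetch_tarball` is a hypothetical stand-in for whatever function in the loader performs the `requests.get` call:

```python
from unittest import mock

def fetch_tarball(url):
    """Stand-in for the loader's HTTP fetch (the real loader uses requests);
    the actual function name in the loader differs."""
    raise RuntimeError("tests must not hit the network")

def test_fetch_is_mocked():
    # Patch the fetch so the test never triggers a real HTTP request.
    with mock.patch("%s.fetch_tarball" % __name__) as mock_fetch_tarball:
        mock_fetch_tarball.return_value = b"tarball bytes"
        result = fetch_tarball("https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz")
        assert result == b"tarball bytes"

test_fetch_is_mocked()
```

The same idea applies with the requests-mock pytest fixture; the point is that no test should depend on ftp.gnu.org being reachable.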

[1] On IRC, I mentioned something about logs, but that's because when I wrote some of the initial loaders, it was painful to do otherwise.
Now, you can spin up the docker-dev environment to help you with that:

  • scratch the db's data (so clean slate)
  • run the loader on your origin (here the tarball you want)
  • check the db's state after that (you'll have all you need in the different db tables)

The requisite stays the same though: you need to be sure you compute the revision, release, and snapshot correctly (thus the tests in 1. above).


Please remove as it's unused.


Please stop using indirection; it's hard to follow in plain text.
Keep your steps simple; only use methods when you need to.

When the need to refactor into submethods arises, let's do it, not before.


Please use a syntax checker (flycheck, flyspell, etc.).

Concealed this data into one dictionary to eliminate...

Also, more generally, when in a submethod, please use the docstring to mention those.

self.package_details = {
    'name': name,
    'origin_url': origin_url,
    'tarballs': kwargs['tarballs'],
}
Thus really inline this in the caller method.


Make that a warn instead.
That's more critical; all the more so since it triggers a partial visit.

By the way, I'm not sure we want to make that a partial visit [1].
But let's keep it that way for now.

[1] If an archive listed is 404, then we most probably never recover it.
So we'll stay in that partial visit forever.
On the other hand, having a full visit even though we missed some tarballs is not factually true either...
so ¯\_(ツ)_/¯


I don't think we want to keep the 'nature' key (which is not that great a name, my bad).

IIRC, it's the tarball's type 'tar' or 'zip' here.


Remove that comment.

If it's really necessary it's not in the right place.


Prefer a single return statement. Also add a warn log statement when detecting an invalid tarball.

uncompressed_path = None
self.tarball_invalid = False
try:
    uncompressed_path = tarball.uncompress...
except Exception:
    self.tarball_invalid = True
    self.log.warn('Invalid tarball detected...')
return uncompressed_path

Make that 'tar'.
We don't want to see 'zip' or something else here.


I'm not sure about this not being populated.
For information that's the revision parent list.
In dvcs, that usually reflects the commit/changeset history.

@olasd when you have a sec, what do you think?

That could be addressed in a future iteration, so I'm not blocking the diff on it.

139 ↗(On Diff #6151)

I'm thus curious about the snapshot here...

161 ↗(On Diff #6151)
In this test there is no change between visits, so the result is the same as the first visit.
174 ↗(On Diff #6151)

use 1 artifact release so 1 revision...

Because release is also part of the swh ontology, that's confusing ;)

186 ↗(On Diff #6151)

Also, I'd keep the assertions in the right order:

  • content
  • directory
  • revision
  • release
  • snapshot
ardumont requested changes to this revision.Aug 8 2019, 6:16 AM
This revision now requires changes to proceed.Aug 8 2019, 6:16 AM
20 ↗(On Diff #6151)

Make the arguments explicit; m is a mock.
Give it a better name: mock_<whatever-it-is>

100 ↗(On Diff #6151)

m is not clear; please improve its naming.

Tests can be seen as automated documentation, so they must be as readable (as good documentation is ;)


@anlambert I should have asked you also ^^


@ardumont, currently we do not populate the parents field in the PyPI and npm loader. We simply create a revision for each package version and reference them in a snapshot each time we visit the package.

Nevertheless, maybe it could be interesting to populate it with the revision pointing to the previous package version. This is semantically correct, but as our data model is based on VCS ones, I am not sure if we can do that. For instance, this could be useful to easily generate diffs between two successive package versions in the web application.


I do not think that indirection is necessary at the moment. Simply explaining in the docstring that generate_and_load_snapshot will gather package versions as branches and generate a snapshot from them is sufficient to understand what it is intended to do.


In the PyPI and npm loaders, branches are named as follows: releases/<package_version> (see here or here for instance).

We should keep that naming scheme for consistency. Here, this implies extracting the version number from the tarball name.
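A minimal sketch of that extraction, assuming GNU tarballs follow the usual <name>-<version>.<ext> convention; the helper name and the list of extensions are illustrative, not the actual loader API:

```python
import os

def release_branch_name(tarball_url):
    """Derive a releases/<version> branch name from a tarball URL.

    Assumes the filename looks like <name>-<version>.<ext>; anything
    more exotic would need special-casing.
    """
    stem = os.path.basename(tarball_url)  # e.g. "8sync-0.1.0.tar.gz"
    for ext in (".tar.gz", ".tar.bz2", ".tar.xz", ".tar.lz", ".zip"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
            break
    version = stem.rsplit("-", 1)[-1]  # "8sync-0.1.0" -> "0.1.0"
    return "releases/%s" % version

# release_branch_name("https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz")
# -> "releases/0.1.0"
```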

20 ↗(On Diff #6151)

My bad for that as that snippet comes from the npm loader I wrote ;-)

100 ↗(On Diff #6151)

Same here, I am the one to blame for the lazy naming.



nahimilega marked 16 inline comments as done.
  • Make recommended changes

I converted this into a separate method to ease understanding of this step. However, I will move the work of this method into prepare() and put an inline comment there.


I have added a FIXME here to address this issue ;)


With GNU, it is not possible to get the package_version, so here I used the basename of the URL (as shown in the docstring). Hope that is fine ;)

140 ↗(On Diff #6197)

we are writing in temporary folder, make sure the tests are cleaning up after you run those.

I suppose this line will ensure this ;)

27–32 ↗(On Diff #6197)

We use requests but i don't see mock for that, so i expect the tests to actually trigger http request, this needs to be mocked.

The tarball request is mocked here

  • Fix the bug which was preventing the loader from saving the snapshot
  • Restructure git commits

This is a problem; it should be temp_directory = self.config['temp_directory']

But while running, it gives a KeyError: temp_directory not found.

Hence I temporarily changed it to this.
How can I fix this? Is there something I need to do in the loader?

olasd requested changes to this revision.Aug 19 2019, 1:57 PM

Thanks for this code, this is a very good step forward!

There are a few bits of the code that don't feel right to me, but what I'll do is work towards accepting this code as-is, then I will file tasks to improve upon it.

Could you please reorder the test cases so that you check object types in the same order? This will make them easier to follow.

I suggest using the following order:

  • content (count and contains)
  • directory (count and contains)
  • revision (count and contains; also check the contents of the revision)
  • release (count)
  • snapshot (count and contains; also check the actual branches in the snapshot. The branch list you're providing in your tests is not being used right now, so please fix that.)
  • origin_visit (load status and visit status)

Once that's done I'll be able to accept the diff and file followup tasks.
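As an illustration of that ordering, a check helper might look like the following sketch; the `stats` dict, the counts, and the function name are all hypothetical, not the actual swh test fixtures:

```python
def check_load_results(stats, load_status, visit_status):
    """Assert on loaded objects in the suggested order.

    `stats` is a hypothetical summary of object counts written to storage.
    """
    assert stats["content"] > 0    # content (count)
    assert stats["directory"] > 0  # directory (count)
    assert stats["revision"] == 2  # one revision per tarball artifact
    assert stats["release"] == 0   # tarball loads create no release objects
    assert stats["snapshot"] == 1  # one snapshot per visit
    # Finally, the visit itself: load status and visit status.
    assert load_status == "eventful"
    assert visit_status == "full"

# Illustrative counts for a two-tarball origin such as 8sync above.
check_load_results(
    {"content": 8, "directory": 3, "revision": 2, "release": 0, "snapshot": 1},
    "eventful",
    "full",
)
```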


[this is an aside from the code review, just reacting to @anlambert's remark on pointing at previous package versions]

For a given package version, determining the logical "previous package version" is not that easy, for instance when you consider long term support branches. A practical example follows.

Let's say that on the first visit, you find versions 1.0, 1.1, 1.2, 2.0 and 2.1. Your heuristic may generate the following edges:

  • 2.1 → 2.0
  • 2.0 → 1.2
  • 1.2 → 1.1
  • 1.1 → 1.0
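For concreteness, a heuristic producing exactly those edges might look like this sketch (purely illustrative; it is the naive approach the rest of this comment argues against):

```python
def naive_parent_edges(versions):
    """Link each version to its immediate predecessor in version order.

    Versions are dotted numeric strings, e.g. "1.2". The ordering here is
    purely numeric and ignores branch structure -- which is the problem.
    """
    ordered = sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))
    # (child, parent) pairs: each version "comes from" the one just before it.
    return [(child, parent) for parent, child in zip(ordered, ordered[1:])]

# First visit: reproduces the edge list above.
# naive_parent_edges(["1.0", "1.1", "1.2", "2.0", "2.1"])
# -> [('1.1', '1.0'), ('1.2', '1.1'), ('2.0', '1.2'), ('2.1', '2.0')]
```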

On the second visit, you discover versions 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2 and 3.0. What edges should the heuristic generate then?

  • Version 2.0 is obviously not coming from 1.3, as we've seen that version before 1.3 appeared. This makes us wonder: did it really come from 1.2?
  • Version 2.2 could logically have two parents, 2.1 and 1.3 (if the project patches the oldest branch then forward-ports patches);
  • Vice versa, 1.3 could have logical parents 1.2 and 2.2 or 3.0 if the project backports patches.
  • Finally, we really have no indication where 3.0 branched from. It could even be a long-lived branch that spun off of 1.0!

We could refine the heuristic by parsing the changelog, if it exists, as that is a more explicit indication of parenthood. That's what the Debian loader did, originally: it's fairly easy to do there because there's a mandatory changelog file in a machine-parseable format (well, at least for "recent" packages), and tools in the ecosystem, for instance the bug tracking system, depend on the maintainers properly tracking the version lineage there.

The issue with tracking package version parenthood/history within the revisions themselves then becomes: what happens if a previous version is missing in the Software Heritage archive? This really makes the PID of a given package dependent on the previous contents of the archive, rather than being somewhat self-contained and only depending on the package data itself. So far we've erred on the side of keeping stuff as self-contained as possible.

If we find a decent heuristic to parse changelogs, we can build upon that to provide the information after the fact.

This revision now requires changes to proceed.Aug 19 2019, 1:57 PM

When you're running the worker, is the temp_directory key set in the config? If not then you need to add it.
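For instance, the worker's loader configuration file could then contain something like the following (the path is illustrative):

```yaml
temp_directory: /tmp/swh.loader.gnu/
```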

  • Reorder tests and add missing tests


After configuring my docker environment to be able to run this and submitting a task, I've managed to navigate through an origin loaded with this loader, and the data looks alright.

I'll submit followup tasks on stuff I'd like to be improved in this code (not right now though, likely tomorrow).

This revision is now accepted and ready to land.Aug 20 2019, 5:45 PM
This revision was automatically updated to reflect the committed changes.