Implement GNU Loader
Details
- Reviewers
olasd - Group Reviewers
Reviewers - Commits
- rDLDBASEcb253bdae866: swh.loader.package: Add tests
rDLDBASE054ef894d0ae: swh.lister.package: Implement GNU loader
Diff Detail
- Repository
- rDLDBASE Generic VCS/Package Loader
- Branch
- fin_gnu
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 7281 Build 10304: tox-on-jenkins Jenkins Build 10303: arc lint + arc unit
Event Timeline
Build is green
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tox/147/ for more details.
Revision Created:
{'author': {'email': b'robot@softwareheritage.org', 'fullname': b'Software Heritage', 'name': b'Software Heritage'}, 'committer': {'email': b'robot@softwareheritage.org', 'fullname': b'Software Heritage', 'name': b'Software Heritage'}, 'committer_date': {'negative_utc': False, 'offset': 0, 'timestamp': {...}}, 'date': {'negative_utc': False, 'offset': 0, 'timestamp': {...}}, 'directory': b':\xeb\xc2\x9e\xd1\...\x06\xb5(', 'id': b'D\x184\x88\xc0wL\x...x8aJB\xb3', 'message': b'swh-loader-package...n message', 'metadata': {'package': {...}}, 'parents': [], 'synthetic': True, 'type': 'tar'}
Snapshot Created:
{'branches': {b'8sync-0.1.0.tar.gz': {...}, b'8sync-0.2.0.tar.gz': {...}, b'HEAD': {...}}}
One of the branch
{b'8sync-0.1.0.tar.gz': {'target': b'D\x184\x88\xc0wL\x...x8aJB\xb3', 'target_type': 'revision'}, b'8sync-0.2.0.tar.gz': {'target': b'\xc57\xd4\xad\xbeM...w\x13\x84', 'target_type': 'revision'}, b'HEAD': {'target': b'8sync-0.2.0.tar.gz', 'target_type': 'alias'}}
swh/loader/package/loader.py | ||
---|---|---|
53 | Those needs to be cleaned up: T1532 (resulting from multiple team discussions). I know @anlambert said otherwise but still, that task explains it all. You can however reference the needed configuration as comments. | |
54 | that comes from self.config['temp_directory'] |
That seems to go in the right direction.
There are some things to change as commented above.
Also, I'm missing as mentioned in irc:
- tests for the expected structure of the release, revision, snapshot
- the current tests need to also check the release, revision, snapshot [1]
- We use requests but i don't see mock for that, so i expect the tests to actually trigger http request, this needs to be mocked.
- As we are writing in temporary folder, make sure the tests are cleaning up after you run those.
[1] On irc, i mentioned something about logs but that's because when i wrote some of the initial loaders, it was painful to do otherwise.
Now, you can lift the docker-dev to help you with that:
- scratch the db's data (so clean slate)
- run the loader on your origin (here the tarball you want)
- check the db's state after that (you'll have all you need in the different db tables)
The requisite stays the same though, you need to be sure you compute correctly the revision, release, snapshot (thus the tests in 1. above)
swh/loader/package/loader.py | ||
---|---|---|
47 | Please remove as it's unused. | |
126 | Please stop using indirection, it's hard to follow through in plain text. When the need to refactor into submethods arise, let's do it, not before. | |
129 | Please, use some syntax checker (flycheck, flyspell, etc...) Concealed this data into one dictionary to eliminate... Also, more generally, when in a submethod, please use the docstring to mention those. | |
134 | self.package_details = { 'name': name, 'origin_url': origin_url, 'tarballs': kwargs['tarballs'], } Thus really inline this in the caller method. | |
171 | Make that a warn instead. By the way, i'm not sure we want to make that a partial visit [1] [1] If an archive listed is 404, then we most probably never recover it. | |
231 | I don't think we want to keep 'nature' key (which is not that great of a name, my bad). IIRC, it's the tarball's type 'tar' or 'zip' here. | |
286 | Remove that comment. If it's really necessary it's not in the right place. | |
293 | Prefer one return statement. Also a warn log statement when detecting the invalid tarball. uncompressed_path = None self.tarball_invalid = False try: ... uncompressed_path = tarball.uncompress... except: self.log.warn('Invalid tarball detected...') self.tarball_invalid return path | |
364 | Make that 'tar'. | |
368 | I'm not sure about this not being populated. @olasd when you have a sec, what do you think? That could be iterated over in a future iteration step, so i'm not blocking the diff for it | |
swh/loader/package/tests/test_loader.py | ||
139 | I'm thus curious about the snapshot here... | |
161 | In this test there is no change between visit. So same result as first visit. | |
174 | use 1 artifact release so 1 revision... Because release is also part of the swh ontology, that's confusing ;) | |
186 | Also, i'd keep the assertions to go in the right order:
|
swh/loader/package/tests/common.py | ||
---|---|---|
20 | Explicit the arguments, m is a mock. | |
swh/loader/package/tests/test_loader.py | ||
100 | m is not clear, please improve on its naming. Tests can be seen as automated documentation, so it must be as readable (as good documentation do ;) |
swh/loader/package/loader.py | ||
---|---|---|
368 | @anlambert i should have asked you also ^^ |
swh/loader/package/loader.py | ||
---|---|---|
368 | @ardumont, currently we do not populate the parents field in the PyPI and npm loader. We simply create a revision for each package version and reference them in a snapshot each time we visit the package. Nevertheless, maybe it could be interesting to populate it with the revision pointing to the previous package version. This is semantically correct but as our data model is based on VCS ones I am not sure if we can do that. For instance, this could be useful to easily generate diffs between two successive package version in the web application. | |
388 | I do not think that indirection is necessary at the moment. Simply explaining in the docstring that the generate_and_load_snapshot will gather package versions as branches and generate a snapshot from them is sufficient to understand what is is intended to do. | |
400 | ||
swh/loader/package/tests/common.py | ||
20 | My bad for that as that snippet comes from the npm loader I wrote ;-) | |
swh/loader/package/tests/test_loader.py | ||
100 | Same here, I am the one to blame for the lazy naming. |
swh/loader/package/loader.py | ||
---|---|---|
400 | s/coherency/consistency/ |
Build is green
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tox/148/ for more details.
swh/loader/package/loader.py | ||
---|---|---|
400 | With gnu, it is not possible to get the package_version, so here I used the basename of url (as shown in docstring). Hope that is fine ;) | |
swh/loader/package/tests/test_loader.py | ||
140 |
I suppose this line will ensure this ;) |
swh/loader/package/tests/common.py | ||
---|---|---|
27–32 |
The tarball request is mocked here |
- Fix the bug which was preventing the the loader to save snapshot
- Restructure git commits
Build is green
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tox/149/ for more details.
swh/loader/package/loader.py | ||
---|---|---|
46 | This is a problem it should be temp_directory = self.config['temp_directory'] But while running it gives key error. Key temp_directory not found. Hence I temporarily changed it to this. |
Thanks for this code, this is a very good step forward!
There's a few bits of the code that don't feel right to me, but what I'll do is work towards accepting this code as-is, then I will file tasks to improve upon it.
Could you please reorder the test cases so that you check object types in the same order? This will make them easier to follow.
I suggest using the following order:
- content (count and contains)
- directory (count and contains)
- revision (count and contains; also check the contents of the revision)
- release (count)
- snapshot (count and contains; also check the actual branches in the snapshot. The branch list you're providing in your tests is not being used right now, so please fix that.)
- origin_visit (load status and visit status)
Once that's done I'll be able to accept the diff and file followup tasks.
swh/loader/package/loader.py | ||
---|---|---|
368 | [this is an aside from the code review, just reacting to @anlambert's remark on pointing at previous package versions] For a given package version, determining the logical "previous package version" is not that easy, for instance when you consider long term support branches. A practical example follows. Let's say that on the first visit, you find versions 1.0, 1.1, 1.2, 2.0 and 2.1. Your heuristic may generate the following edges:
On the second visit, you discover versions 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2 and 3.0. What edges should the heuristic generate then?
We could refine the heuristic by parsing the changelog, if it exists, as that is a more explicit indication of parenthood. That's what the Debian loader did, originally: it's fairly easy to do there because there's a mandatory changelog file in a machine-parseable format (well, at least for "recent" packages), and tools in the ecosystem, for instance the bug tracking system, depend on the maintainers properly tracking the version lineage there. The issue with tracking package version parenthood/history within the revisions themselves then becomes: what happens if a previous version is missing in the Software Heritage archive? This really makes the PID of a given package dependent on the previous contents of the archive, rather than being somewhat self-contained and only depending on the package data itself. So far we've erred on the side of keeping stuff as self-contained as possible. If we find a decent heuristic to parse changelogs, we can build upon that to provide the information after the fact. |
swh/loader/package/loader.py | ||
---|---|---|
46 | When you're running the worker, is the temp_directory key set in the config? If not then you need to add it. |
Build is green
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tox/150/ for more details.
Thanks!
After configuring my docker environment to be able to run this and submitting a task, I've managed to navigate through an origin loaded with this loader, and the data looks alright.
I'll submit followup tasks on stuff I'd like to be improved in this code (not right now though, likely tomorrow).