
swh.loader.package: Implement GNU Loader

Authored by nahimilega on Aug 7 2019, 1:01 PM.

Diff Detail

rDLDBASE Generic VCS/Package Loader
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Revision Created:

{'author': {'email': b'', 'fullname': b'Software Heritage', 'name': b'Software Heritage'}, 
'committer': {'email': b'', 'fullname': b'Software Heritage', 'name': b'Software Heritage'}, 
'committer_date': {'negative_utc': False, 'offset': 0, 'timestamp': {...}}, 
'date': {'negative_utc': False, 'offset': 0, 'timestamp': {...}}, 
'directory': b':\xeb\xc2\x9e\xd1\...\x06\xb5(', 'id': b'D\x184\x88\xc0wL\x...x8aJB\xb3', 'message': b'swh-loader-package...n message', 
'metadata': {'package': {...}}, 
'parents': [], 
'synthetic': True, 
'type': 'tar'}

Snapshot Created:

{'branches': {b'8sync-0.1.0.tar.gz': {...}, b'8sync-0.2.0.tar.gz': {...}, b'HEAD': {...}}}

One of the branches:

{b'8sync-0.1.0.tar.gz': {'target': b'D\x184\x88\xc0wL\x...x8aJB\xb3', 'target_type': 'revision'}, b'8sync-0.2.0.tar.gz': {'target': b'\xc57\xd4\xad\xbeM...w\x13\x84', 'target_type': 'revision'}, b'HEAD': {'target': b'8sync-0.2.0.tar.gz', 'target_type': 'alias'}}
ardumont added inline comments.

Those need to be cleaned up: T1532 (resulting from multiple team discussions).

I know @anlambert said otherwise but still, that task explains it all.
I never got a chance to finish this but I won't accept new code using that deprecated behavior.

You can however reference the needed configuration as comments.


that comes from self.config['temp_directory']

That seems to go in the right direction.

There are some things to change as commented above.

Also, I'm missing, as mentioned on IRC:

  1. tests for the expected structure of the release, revision, snapshot
  2. the current tests need to also check the release, revision, snapshot [1]
  3. We use requests but I don't see a mock for that, so I expect the tests to actually trigger HTTP requests; this needs to be mocked.
  4. As we are writing to a temporary folder, make sure the tests clean up after they run.
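Point 3 could be addressed along these lines with `unittest.mock`; this is only a sketch, and `fetch_tarball` is a hypothetical stand-in for whatever function in the loader performs the `requests.get` call:

```python
from unittest import mock

def fetch_tarball(url):
    """Stand-in for the loader's HTTP fetch (the real loader uses requests);
    the actual function name in the loader differs."""
    raise RuntimeError("tests must not hit the network")

def test_fetch_is_mocked():
    # Patch the fetch so the test never triggers a real HTTP request.
    with mock.patch("%s.fetch_tarball" % __name__) as mock_fetch_tarball:
        mock_fetch_tarball.return_value = b"tarball bytes"
        result = fetch_tarball("https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz")
        assert result == b"tarball bytes"

test_fetch_is_mocked()
```

The same idea applies with the requests-mock pytest fixture; the point is that no test should depend on ftp.gnu.org being reachable.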

[1] On IRC, I mentioned something about logs, but that's because when I wrote some of the initial loaders, it was painful to do otherwise.
Now, you can spin up the docker-dev environment to help you with that:

  • scratch the db's data (so clean slate)
  • run the loader on your origin (here the tarball you want)
  • check the db's state after that (you'll have all you need in the different db tables)

The requisite stays the same though: you need to be sure you compute the revision, release, and snapshot correctly (thus the tests in 1. above).


Please remove as it's unused.


Please stop using indirection; it's hard to follow in plain text.
Keep your steps simple; only use methods when you need to.

When the need to refactor into submethods arises, let's do it, not before.


Please use a syntax checker (flycheck, flyspell, etc.).

Concealed this data into one dictionary to eliminate...

Also, more generally, when in a submethod, please use the docstring to mention those.

self.package_details = {
    'name': name,
    'origin_url': origin_url,
    'tarballs': kwargs['tarballs'],
}
Thus really inline this in the caller method.


Make that a warn instead.
That's more critical; all the more so since it triggers a partial visit.

By the way, I'm not sure we want to make that a partial visit [1].
But let's keep it that way for now.

[1] If an archive listed is 404, then we most probably never recover it.
So we'll stay in that partial visit forever.
On the other hand, having a full visit even though we missed some tarballs is not factually true either...
so ¯\_(ツ)_/¯


I don't think we want to keep the 'nature' key (which is not that great a name, my bad).

IIRC, it's the tarball's type 'tar' or 'zip' here.


Remove that comment.

If it's really necessary it's not in the right place.


Prefer a single return statement. Also add a warn log statement when detecting an invalid tarball.

uncompressed_path = None
self.tarball_invalid = False
try:
    uncompressed_path = tarball.uncompress...
except Exception:
    self.tarball_invalid = True
    self.log.warn('Invalid tarball detected...')
return uncompressed_path

Make that 'tar'.
We don't want to see 'zip' or something else here.


I'm not sure about this not being populated.
For information that's the revision parent list.
In dvcs, that usually reflects the commit/changeset history.

@olasd when you have a sec, what do you think?

That could be addressed in a future iteration, so I'm not blocking the diff on it.

139 ↗(On Diff #6151)

I'm thus curious about the snapshot here...

161 ↗(On Diff #6151)
In this test there is no change between visits, so the result is the same as the first visit.
174 ↗(On Diff #6151)

use 1 artifact release so 1 revision...

Because release is also part of the swh ontology, that's confusing ;)

186 ↗(On Diff #6151)

Also, I'd keep the assertions in the right order:

  • content
  • directory
  • revision
  • release
  • snapshot
ardumont requested changes to this revision.Aug 8 2019, 6:16 AM
This revision now requires changes to proceed.Aug 8 2019, 6:16 AM
20 ↗(On Diff #6151)

Make the arguments explicit; m is a mock.
Give it a better name: mock_<whatever-it-is>

100 ↗(On Diff #6151)

m is not clear; please improve its naming.

Tests can be seen as automated documentation, so they must be as readable (as good documentation is ;)


@anlambert I should have asked you also ^^


@ardumont, currently we do not populate the parents field in the PyPI and npm loader. We simply create a revision for each package version and reference them in a snapshot each time we visit the package.

Nevertheless, maybe it could be interesting to populate it with the revision pointing to the previous package version. This is semantically correct, but as our data model is based on VCS ones, I am not sure if we can do that. For instance, this could be useful to easily generate diffs between two successive package versions in the web application.


I do not think that indirection is necessary at the moment. Simply explaining in the docstring that generate_and_load_snapshot will gather package versions as branches and generate a snapshot from them is sufficient to understand what it is intended to do.


In the PyPI and npm loaders, branches are named as follows: releases/<package_version> (see here or here for instance).

We should keep that naming scheme for consistency. Here, this implies extracting the version number from the tarball name.
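A minimal sketch of that extraction, assuming GNU tarballs follow the usual <name>-<version>.<ext> convention; the helper name and the list of extensions are illustrative, not the actual loader API:

```python
import os

def release_branch_name(tarball_url):
    """Derive a releases/<version> branch name from a tarball URL.

    Assumes the filename looks like <name>-<version>.<ext>; anything
    more exotic would need special-casing.
    """
    stem = os.path.basename(tarball_url)  # e.g. "8sync-0.1.0.tar.gz"
    for ext in (".tar.gz", ".tar.bz2", ".tar.xz", ".tar.lz", ".zip"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
            break
    version = stem.rsplit("-", 1)[-1]  # "8sync-0.1.0" -> "0.1.0"
    return "releases/%s" % version

# release_branch_name("https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz")
# -> "releases/0.1.0"
```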

20 ↗(On Diff #6151)

My bad for that as that snippet comes from the npm loader I wrote ;-)

100 ↗(On Diff #6151)

Same here, I am the one to blame for the lazy naming.



nahimilega marked 16 inline comments as done.
  • Make recommended changes

I converted this into a separate method to ease understanding of this step. However, I will move the work of this method into prepare() and put an inline comment there.


I have added a FIXME here to address this issue ;)


With GNU, it is not possible to get the package_version, so here I used the basename of the URL (as shown in the docstring). Hope that is fine ;)

140 ↗(On Diff #6197)

we are writing in temporary folder, make sure the tests are cleaning up after you run those.

I suppose this line will ensure this ;)

27–32 ↗(On Diff #6197)

We use requests but i don't see mock for that, so i expect the tests to actually trigger http request, this needs to be mocked.

The tarball request is mocked here

  • Fix the bug which was preventing the loader from saving the snapshot
  • Restructure git commits

This is a problem; it should be temp_directory = self.config['temp_directory']

But while running, it gives a KeyError: temp_directory not found.

Hence I temporarily changed it to this.
How can I fix this? Is there something I need to do in the loader?

olasd requested changes to this revision.Aug 19 2019, 1:57 PM

Thanks for this code, this is a very good step forward!

There are a few bits of the code that don't feel right to me, but what I'll do is work towards accepting this code as-is, then I will file tasks to improve upon it.

Could you please reorder the test cases so that you check object types in the same order? This will make them easier to follow.

I suggest using the following order:

  • content (count and contains)
  • directory (count and contains)
  • revision (count and contains; also check the contents of the revision)
  • release (count)
  • snapshot (count and contains; also check the actual branches in the snapshot. The branch list you're providing in your tests is not being used right now, so please fix that.)
  • origin_visit (load status and visit status)

Once that's done I'll be able to accept the diff and file followup tasks.
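As an illustration of that ordering, a check helper might look like the following sketch; the `stats` dict, the counts, and the function name are all hypothetical, not the actual swh test fixtures:

```python
def check_load_results(stats, load_status, visit_status):
    """Assert on loaded objects in the suggested order.

    `stats` is a hypothetical summary of object counts written to storage.
    """
    assert stats["content"] > 0    # content (count)
    assert stats["directory"] > 0  # directory (count)
    assert stats["revision"] == 2  # one revision per tarball artifact
    assert stats["release"] == 0   # tarball loads create no release objects
    assert stats["snapshot"] == 1  # one snapshot per visit
    # Finally, the visit itself: load status and visit status.
    assert load_status == "eventful"
    assert visit_status == "full"

# Illustrative counts for a two-tarball origin such as 8sync above.
check_load_results(
    {"content": 8, "directory": 3, "revision": 2, "release": 0, "snapshot": 1},
    "eventful",
    "full",
)
```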


[this is an aside from the code review, just reacting to @anlambert's remark on pointing at previous package versions]

For a given package version, determining the logical "previous package version" is not that easy, for instance when you consider long term support branches. A practical example follows.

Let's say that on the first visit, you find versions 1.0, 1.1, 1.2, 2.0 and 2.1. Your heuristic may generate the following edges:

  • 2.1 → 2.0
  • 2.0 → 1.2
  • 1.2 → 1.1
  • 1.1 → 1.0
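For concreteness, a heuristic producing exactly those edges might look like this sketch (purely illustrative; it is the naive approach the rest of this comment argues against):

```python
def naive_parent_edges(versions):
    """Link each version to its immediate predecessor in version order.

    Versions are dotted numeric strings, e.g. "1.2". The ordering here is
    purely numeric and ignores branch structure -- which is the problem.
    """
    ordered = sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))
    # (child, parent) pairs: each version "comes from" the one just before it.
    return [(child, parent) for parent, child in zip(ordered, ordered[1:])]

# First visit: reproduces the edge list above.
# naive_parent_edges(["1.0", "1.1", "1.2", "2.0", "2.1"])
# -> [('1.1', '1.0'), ('1.2', '1.1'), ('2.0', '1.2'), ('2.1', '2.0')]
```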

On the second visit, you discover versions 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2 and 3.0. What edges should the heuristic generate then?

  • Version 2.0 is obviously not coming from 1.3, as we've seen that version before 1.3 appeared. This makes us wonder: did it really come from 1.2?
  • Version 2.2 could logically have two parents, 2.1 and 1.3 (if the project patches the oldest branch then forward-ports patches);
  • Vice versa, 1.3 could have logical parents 1.2 and 2.2 or 3.0 if the project backports patches.
  • Finally, we really have no indication where 3.0 branched from. It could even be a long-lived branch that spun off of 1.0!

We could refine the heuristic by parsing the changelog, if it exists, as that is a more explicit indication of parenthood. That's what the Debian loader did, originally: it's fairly easy to do there because there's a mandatory changelog file in a machine-parseable format (well, at least for "recent" packages), and tools in the ecosystem, for instance the bug tracking system, depend on the maintainers properly tracking the version lineage there.

The issue with tracking package version parenthood/history within the revisions themselves then becomes: what happens if a previous version is missing in the Software Heritage archive? This really makes the PID of a given package dependent on the previous contents of the archive, rather than being somewhat self-contained and only depending on the package data itself. So far we've erred on the side of keeping stuff as self-contained as possible.

If we find a decent heuristic to parse changelogs, we can build upon that to provide the information after the fact.

This revision now requires changes to proceed.Aug 19 2019, 1:57 PM

When you're running the worker, is the temp_directory key set in the config? If not then you need to add it.
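For instance, the worker's loader configuration file could then contain something like the following (the path is illustrative):

```yaml
temp_directory: /tmp/swh.loader.gnu/
```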

  • Reorder tests and add missing tests


After configuring my docker environment to be able to run this and submitting a task, I've managed to navigate through an origin loaded with this loader, and the data looks alright.

I'll submit followup tasks on stuff I'd like to be improved in this code (not right now though, likely tomorrow).

This revision is now accepted and ready to land.Aug 20 2019, 5:45 PM
This revision was automatically updated to reflect the committed changes.