Page MenuHomeSoftware Heritage

debian.loader: Improve resolution computation and fix corner cases
ClosedPublic

Authored by ardumont on Dec 19 2019, 3:43 PM.

Details

Summary

+ loader.debian: Make sense of the actual revision resolution algo

Test Plan

tox

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 9904
Build 14624: tox-on-jenkinsJenkins
Build 14623: arc lint + arc unit

Event Timeline

ardumont created this revision.Dec 19 2019, 3:43 PM
ardumont retitled this revision from debian.loader: Deal with no revision resolution possible to debian.loader: Deal with no revision resolution possible scenario.Dec 19 2019, 3:44 PM
ardumont edited the summary of this revision. (Show Details)
ardumont added a project: Debian loader.
ardumont updated this revision to Diff 8813.Dec 19 2019, 4:05 PM

Squash commits

olasd requested changes to this revision.Dec 19 2019, 4:32 PM

The logic in that function is messing me up a little bit, probably because it's missing a docstring explaining what the arguments are. Could you add one?

I'm guessing artifact_metadata is one of the packages you're currently processing, and known_package_artifacts come from the previous snapshot?

I've also suggested a way to make the logic a little bit less mind-bending.

swh/loader/package/debian/loader.py
127–128

these lines are backwards

130–141

It looks like what we want is to match the artifacts of the current package with all packages referenced in the previous snapshot in turn.

I _think_ that the logic in this loop is inverted, but I have a hard time comprehending it. The comments are backwards too :(

The logic would be much clearer if we did the comparison using set logic:

  • build the set of (filename, (sha256, sha1, size, ...)) tuples for the package we're currently trying to load
  • build the set of (filename, (sha256, sha1, size, ...)) tuples for every revision found in the previous snapshot
  • check if these two sets are equal

We could even make these sets ordered by filename, which lets us turn them into (hashable) tuples, which lets us build a dict mapping ((filename, (hash, hash, size)), ...) to revision_ids directly when fetching the previous snapshot, making this function a plain dict lookup.

149

This try block feels overly broad.

Could you just do

if 'files' not in artifact_metadata:
    logger.warning(...)
    return None  # or even artifact_metadata.get('revision_id')

at the beginning of the function instead?

This revision now requires changes to proceed.Dec 19 2019, 4:32 PM
ardumont added inline comments.Dec 19 2019, 4:35 PM
swh/loader/package/debian/loader.py
149

that's not enough

that can explode a bit later.

vlorentz added inline comments.
swh/loader/package/debian/loader.py
149

I agree with @olasd that this try block is too long. There's a high risk of catching unrelated error.

ardumont added inline comments.Dec 19 2019, 4:40 PM
swh/loader/package/debian/loader.py
149

i agree the try is long.
But here, all this shenanigans is to solve a revision from metadata dict from previous revisions (revisions coming from the snapshot).

So it should fail only for that.
We do not do anything else.

ardumont added inline comments.Dec 19 2019, 4:45 PM
swh/loader/package/debian/loader.py
101

this i found backward...

149

that can explode a bit later

as in:

 8d073f5dc6c8b7a322618c35'}}, 'version': '0.3.9-2', 'revision_id': None}}
Dec 19 14:58:02 worker01 python3[557]: [2019-12-19 14:58:02,443: ERROR/ForkPoolWorker-1] Fail to load deb://Debian/packages/freshplayerplugin
                                       Traceback (most recent call last):
                                         File "/usr/lib/python3/dist-packages/swh/loader/package/loader.py", line 286, in load
                                           known_artifacts, p_info['raw'])
                                         File "/usr/lib/python3/dist-packages/swh/loader/package/debian/loader.py", line 104, in resolve_revision_from
                                           known_artifacts = known_artifacts['extrinsic']['raw']['files']
                                       KeyError: 'extrinsic'
olasd added inline comments.Dec 19 2019, 4:48 PM
swh/loader/package/debian/loader.py
149

One more argument for parsing these only once when fetching the old snapshot (see my other inline comment). ;)

ardumont added inline comments.Dec 19 2019, 4:48 PM
swh/loader/package/debian/loader.py
94–120

docstring is in the main class [1]

i do not know how we can deal with that properly without replicating stuff.

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/loader.py$0-158

ardumont updated this revision to Diff 8818.Dec 19 2019, 7:17 PM
  • debian.loader: Extract revision resolution into a function (+ add dedicated tests)
  • debian.loader: Improve and fix revision resolution
ardumont updated this revision to Diff 8819.Dec 19 2019, 7:18 PM

Remove duplicated step

ardumont updated this revision to Diff 8824.Dec 19 2019, 9:11 PM
  • Remove unneeded instruction
  • shuffle the keys a bit to ensure order does not matter
ardumont updated this revision to Diff 8825.Dec 19 2019, 9:15 PM

Add missing test case

ardumont retitled this revision from debian.loader: Deal with no revision resolution possible scenario to debian.loader: Improve resolution computation and fix corner cases.Dec 20 2019, 9:49 AM
olasd accepted this revision.Dec 20 2019, 11:26 AM
This revision is now accepted and ready to land.Dec 20 2019, 11:26 AM
ardumont updated this revision to Diff 8829.Dec 20 2019, 11:31 AM

Plug to master branch