Page MenuHomeSoftware Heritage

swh-scanner: add policy greedybfs
ClosedPublic

Authored by DanSeraf on Aug 6 2021, 10:47 AM.

Details

Summary

This new policy will query chunks of 1000 node at once, in order to fill the Web API rate limit.

Diff Detail

Repository
rDTSCN Code scanner
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6065 (id=21947)

Rebasing onto 98011a4226...

Current branch diff-target is up to date.
Changes applied before test
commit b18d2b87b54c7005f4185b32ef76e0dfe1306d0a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Aug 6 10:46:11 2021 +0200

    add policy greedybfs

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/146/ for more details.

vlorentz added inline comments.
swh/scanner/cli.py
157–158

I find "fill" a bit unclear. Do you mean using it to the max?

swh/scanner/policy.py
162–167
166

Doesn't this always take the

170–172

purely aesthetic (to prevent black from reformatting it with the weird newline placement)

swh/scanner/tests/conftest.py
81

why?

ardumont added inline comments.
swh/scanner/cli.py
157–158

I agree with val, hopefully the suggestion above improved it correctly.

swh/scanner/policy.py
166

What do you mean?

170–172

Ok, thanks for the suggestion!

swh/scanner/tests/conftest.py
81

This is used to avoid a possible RecursionError that could be generated by .mkdir.

swh/scanner/policy.py
166

nvm, I started writing this before noticing nodes is an iterator, and it looks like I forgot to delete it

Build is green

Patch application report for D6065 (id=21975)

Rebasing onto 98011a4226...

Current branch diff-target is up to date.
Changes applied before test
commit a25a63fd68311907a8273664ca6bc1c6e0474772
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Aug 6 10:46:11 2021 +0200

    add policy greedybfs

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/147/ for more details.

This revision is now accepted and ready to land.Aug 9 2021, 12:59 PM
This revision now requires review to proceed.Aug 9 2021, 1:44 PM
zack requested changes to this revision.Aug 9 2021, 1:56 PM

looks good!

I've only added a bunch of minor change requests on docs and comments

swh/scanner/cli.py
158

better: "greedybfs: same as "bfs" policy, but lookup the status of source code artifacts in chunks, in order to minimize the number of Web API round-trips with the archive."

swh/scanner/policy.py
70

docstring please, e.g.: "return the size of a source tree as the number of nodes it contains"

130

better: "query graph nodes in chunks (to maximize [...]"

rationale: you're not querying node chunks (i.e., node sub-parts), but batches of nodes together

156–157

s/1000/QUERY_LIMIT/

swh/scanner/tests/conftest.py
81

please add a comment in the source code here, explaining why the recursion limit is needed (you explained it to @vlorentz in the diff, but the diff will not remain in the code :-))

This revision now requires changes to proceed.Aug 9 2021, 1:56 PM
swh/scanner/policy.py
71

Oh I missed this. You can do it without allocating a list, like this ^

Build is green

Patch application report for D6065 (id=21985)

Rebasing onto 98011a4226...

Current branch diff-target is up to date.
Changes applied before test
commit d4a4e6c404bbba289b369dfcba6b2219b0695c8b
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Aug 6 10:46:11 2021 +0200

    add policy greedybfs

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/148/ for more details.

Thanks!

I'm approving this, but please fix the remaining occurrence of 1000 before landing, as per comment above.

swh/scanner/policy.py
157

you missed an occurrence of 1000 here :-), please fix it before landing

This revision is now accepted and ready to land.Aug 9 2021, 7:04 PM

Build is green

Patch application report for D6065 (id=21992)

Rebasing onto 98011a4226...

Current branch diff-target is up to date.
Changes applied before test
commit a9d0b9e2af0e5735608c1841d59e60662874b15e
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Aug 6 10:46:11 2021 +0200

    add policy greedybfs

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/149/ for more details.

This revision was automatically updated to reflect the committed changes.