Page MenuHomeSoftware Heritage

swh-scanner: add policy greedybfs
ClosedPublic

Authored by DanSeraf on Aug 6 2021, 10:47 AM.

Details

Summary

This new policy will query chunks of 1000 node at once, in order to fill the Web API rate limit.

Diff Detail

Repository
rDTSCN Code scanner
Branch
greedy-search
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22947
Build 35779: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 35778: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6065 (id=21947)

Rebasing onto 98011a4226...

Current branch diff-target is up to date.
Changes applied before test
commit b18d2b87b54c7005f4185b32ef76e0dfe1306d0a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Aug 6 10:46:11 2021 +0200

    add policy greedybfs

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/146/ for more details.

vlorentz added inline comments.
swh/scanner/cli.py
157–158

I find "fill" a bit unclear. Do you mean using it to the max?

swh/scanner/policy.py
162–167
166

Doesn't this always take the

170–172

purely aesthetic (to prevent black from reformatting it with the weird newline placement)

swh/scanner/tests/conftest.py
81

why?

ardumont added inline comments.
swh/scanner/cli.py
157–158

I agree with val, hopefully the suggestion above improved it correctly.

swh/scanner/policy.py
166

What do you mean?

170–172

Ok, thanks for the suggestion!

swh/scanner/tests/conftest.py
81

This is used to avoid a possible RecursionError that could be generated by .mkdir.

swh/scanner/policy.py
166

nvm, I started writing this before noticing nodes is an iterator, and it looks like I forgot to delete it

Build is green

Patch application report for D6065 (id=21975)

Rebasing onto 98011a4226...

Current branch diff-target is up to date.
Changes applied before test
commit a25a63fd68311907a8273664ca6bc1c6e0474772
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Aug 6 10:46:11 2021 +0200

    add policy greedybfs

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/147/ for more details.

This revision is now accepted and ready to land.Aug 9 2021, 12:59 PM
This revision now requires review to proceed.Aug 9 2021, 1:44 PM
zack requested changes to this revision.Aug 9 2021, 1:56 PM

looks good!

I've only added a bunch of minor change requests on docs and comments

swh/scanner/cli.py
158

better: "greedybfs: same as "bfs" policy, but lookup the status of source code artifacts in chunks, in order to minimize the number of Web API round-trips with the archive."

swh/scanner/policy.py
70

docstring please, e.g.: "return the size of a source tree as the number of nodes it contains"

130

better: "query graph nodes in chunks (to maximize [...]"

rationale: you're not querying node chunks (i.e., node sub-parts), but batches of nodes together

156–157

s/1000/QUERY_LIMIT/

swh/scanner/tests/conftest.py
81

please add a comment in the source code here, explaining why the recursion limit is needed (you explained it to @vlorentz in the diff, but the diff will not remain in the code :-))

This revision now requires changes to proceed.Aug 9 2021, 1:56 PM
swh/scanner/policy.py
71

Oh I missed this. You can do it without allocating a list, like this ^

Build is green

Patch application report for D6065 (id=21985)

Rebasing onto 98011a4226...

Current branch diff-target is up to date.
Changes applied before test
commit d4a4e6c404bbba289b369dfcba6b2219b0695c8b
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Aug 6 10:46:11 2021 +0200

    add policy greedybfs

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/148/ for more details.

Thanks!

I'm approving this, but please fix the remaining occurrence of 1000 before landing, as per comment above.

swh/scanner/policy.py
158

you missed an occurrence of 1000 here :-), please fix it before landing

This revision is now accepted and ready to land.Aug 9 2021, 7:04 PM

Build is green

Patch application report for D6065 (id=21992)

Rebasing onto 98011a4226...

Current branch diff-target is up to date.
Changes applied before test
commit a9d0b9e2af0e5735608c1841d59e60662874b15e
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Fri Aug 6 10:46:11 2021 +0200

    add policy greedybfs

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/149/ for more details.

This revision was automatically updated to reflect the committed changes.