Page MenuHomeSoftware Heritage

Add random directory sampling policy
Needs ReviewPublic

Authored by Alphare on Sep 26 2022, 6:52 PM.

Details

Summary

This makes use of the new discovery algorithm introduced in
`swh-loader-core`, which should help speed up large (think Linux
kernel or way larger) scans.

Most of the time is spend walking the on-disk directory and hashing,
which is where the new optimizations in `swh-model==6.5.0` should come
in handy. Python is close to its limit in that regard, some future
endeavor should look into setting up SWH for native extensions.

Diff Detail

Repository
rDTSCN Code scanner
Branch
discovery
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 31770
Build 49711: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 49710: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D8539 (id=30793)

Rebasing onto 90534ecae4...

Current branch diff-target is up to date.
Changes applied before test
commit c5714a6aa5f31f503bd188e48cfc59d2271ac60b
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:45:56 2022 +0200

    Add random directory sampling policy
    
    This makes use of the new discovery algorithm introduced in
    ``swh-loader-core``, which should help speed up large (think Linux
    kernel or way larger) scans.
    
    Most of the time is spend walking the on-disk directory and hashing,
    which is where the new optimizations in ``swh-model==6.5.0`` should come
    in handy. Python is close to its limit in that regard, some future
    endeavor should look into setting up SWH for native extensions.

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/163/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/163/console

Harbormaster returned this revision to the author for changes because remote builds failed.Sep 26 2022, 6:53 PM
Harbormaster failed remote builds in B31770: Diff 30793!

Build has FAILED

Patch application report for D8539 (id=30796)

Rebasing onto 90534ecae4...

Current branch diff-target is up to date.
Changes applied before test
commit cf90184235c442e42d9901870766e3f93248868b
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:45:56 2022 +0200

    Add random directory sampling policy
    
    This makes use of the new discovery algorithm introduced in
    ``swh-loader-core``, which should help speed up large (think Linux
    kernel or way larger) scans.
    
    Most of the time is spend walking the on-disk directory and hashing,
    which is where the new optimizations in ``swh-model==6.5.0`` should come
    in handy. Python is close to its limit in that regard, some future
    endeavor should look into setting up SWH for native extensions.

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/164/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/164/console

Harbormaster returned this revision to the author for changes because remote builds failed.Sep 26 2022, 10:58 PM
Harbormaster failed remote builds in B31773: Diff 30796!

Build has FAILED

Patch application report for D8539 (id=30878)

Rebasing onto 90534ecae4...

Current branch diff-target is up to date.
Changes applied before test
commit 99f70d6fe730bea88e482ff80effcc27800e7a96
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:45:56 2022 +0200

    Add random directory sampling policy
    
    This makes use of the new discovery algorithm introduced in
    ``swh-loader-core``, which should help speed up large (think Linux
    kernel or way larger) scans.
    
    Most of the time is spend walking the on-disk directory and hashing,
    which is where the new optimizations in ``swh-model==6.5.0`` should come
    in handy. Python is close to its limit in that regard, some future
    endeavor should look into setting up SWH for native extensions.

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/172/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/172/console

Harbormaster returned this revision to the author for changes because remote builds failed.Sep 28 2022, 11:01 AM
Harbormaster failed remote builds in B31850: Diff 30878!

Requirements hack for the CI

Build is green

Patch application report for D8539 (id=30916)

Rebasing onto f4f0917769...

First, rewinding head to replay your work on top of it...
Applying: Add random directory sampling policy
Changes applied before test
commit d77225e50a47fd884954bf41e0b762062ef30973
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:45:56 2022 +0200

    Add random directory sampling policy
    
    This makes use of the new discovery algorithm introduced in
    ``swh-loader-core``, which should help speed up large (think Linux
    kernel or way larger) scans.
    
    Most of the time is spend walking the on-disk directory and hashing,
    which is where the new optimizations in ``swh-model==6.5.0`` should come
    in handy. Python is close to its limit in that regard, some future
    endeavor should look into setting up SWH for native extensions.

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/182/ for more details.

ardumont added inline comments.
requirements-swh.txt
4 ↗(On Diff #30916)

I guess it's temporary to make the ci build ok.

Why is the discovery code part implemented in the loader.core and not in the storage code (client side)?
(since it's used both by scanner and loader so far, that would have made sense).

Yes, i've asked it twice ¯\_(ツ)_/¯

[1] D8538#222652

swh/scanner/policy.py
268–269

etc... below

(as far as i remember to .swhid does trigger computation, so might as well reduce that)

Build is green

Patch application report for D8539 (id=30929)

Rebasing onto f4f0917769...

First, rewinding head to replay your work on top of it...
Applying: Add random directory sampling policy
Changes applied before test
commit 00fa0e63578400b21d4df6e6b75c9fc733b57861
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:45:56 2022 +0200

    Add random directory sampling policy
    
    This makes use of the new discovery algorithm introduced in
    ``swh-loader-core``, which should help speed up large (think Linux
    kernel or way larger) scans.
    
    Most of the time is spend walking the on-disk directory and hashing,
    which is where the new optimizations in ``swh-model==6.5.0`` should come
    in handy. Python is close to its limit in that regard, some future
    endeavor should look into setting up SWH for native extensions.

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/186/ for more details.

requirements-swh.txt
4 ↗(On Diff #30916)

I guess it's temporary to make the ci build ok.

Yep

Why is the discovery code part implemented in the loader.core and not in the storage code (client side)?
(since it's used both by scanner and loader so far, that would have made sense).

I feel like these pieces of code are are implementation details of the archive consumer and not something the storage (even client-side) should know about. I don't feel super strongly about it, it's just my intuition about where the code should live that may make more sense as it grows.

swh/scanner/policy.py
268–269

Ah, good point. I sometimes forget there isn't an optimizer going over my code, heh.

requirements-swh.txt
4 ↗(On Diff #30929)

If that commit is landed, we can release it so you can have the standard version number requirement here.

4 ↗(On Diff #30916)

ok.

It feels strange to me to have the scanner consumer specifically depend on the loader.core.
Then again scanner does not seem to depend on the storage either currently so ¯\_(ツ)_/¯.

Maybe that means we need to push that code in swh.core which is the sole common part
between all swh components.

But same as you, i won't be too pushy.

ardumont added subscribers: vlorentz, douardda, anlambert, olasd.

Fine, i've one comment i'd like others to have a look at though [1] regarding where
that new discovery (interface) code should go. It feels currently a bit off to me that this code
is in loader-core. Loaders are not the sole archive consumers (scanner, webapp, cli, indexer, cooker, ...).

[1] https://forge.softwareheritage.org/D8539?id=30916#inline-60774

@vlorentz @douardda @olasd @anlambert ^

Fine, i've one comment i'd like others to have a look at though [1] regarding where
that new discovery (interface) code should go. It feels currently a bit off to me that this code
is in loader-core. Loaders are not the sole archive consumers (scanner, webapp, cli, indexer, cooker, ...).

[1] https://forge.softwareheritage.org/D8539?id=30916#inline-60774

@vlorentz @douardda @olasd @anlambert ^

If it's to be used generically, this discovery code should pretty definitely not be in swh.loader.core.

  • The generic discovery algorithm, and base abstract classes/protocols, should probably be in swh.model, as they're tied to that structure;
  • The swh.storage-based discovery mechanism could live in swh.storage.algorithms, and be used by swh.loader.core;
  • The REST API-based discovery mechanism could live in swh.web.client, or stay in swh.scanner.
In D8539#222941, @olasd wrote:

Fine, i've one comment i'd like others to have a look at though [1] regarding where
that new discovery (interface) code should go. It feels currently a bit off to me that this code
is in loader-core. Loaders are not the sole archive consumers (scanner, webapp, cli, indexer, cooker, ...).

[1] https://forge.softwareheritage.org/D8539?id=30916#inline-60774

@vlorentz @douardda @olasd @anlambert ^

If it's to be used generically, this discovery code should pretty definitely not be in swh.loader.core.

  • The generic discovery algorithm, and base abstract classes/protocols, should probably be in swh.model, as they're tied to that structure;
  • The swh.storage-based discovery mechanism could live in swh.storage.algorithms, and be used by swh.loader.core;
  • The REST API-based discovery mechanism could live in swh.web.client, or stay in swh.scanner.

I started to split the code to different modules with #D8937 and #D8938, does it looks good for you?
Also I'm not really sure for the destination of DIscoveryStorageConnection, can it stay in the model or should it go to storage.discovery?