Page MenuHomeSoftware Heritage

swh-scanner: new scan policies
ClosedPublic

Authored by DanSeraf on Jul 15 2021, 7:17 PM.

Details

Summary

The purpose of this diff is the creation of an interface to create new scan policies.
Some of the policies implemented in the benchmark branch are moved in the master branch in order to let the user select the scan approach (Related to T3420).

Diff Detail

Repository
rDTSCN Code scanner
Branch
policy
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22671
Build 35353: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 35352: arc lint + arc unit

Unit TestsFailed

TimeTest
27 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.scanner.tests.test_policy::test_directory_priority_policy
live_server = <[AttributeError("'LiveServer' object has no attribute 'url'") raised in repr()] LiveServer object at 0x7f91f360abe0> aiosession = <aiohttp.client.ClientSession object at 0x7f920dd4d550> event_loop = <_UnixSelectorEventLoop running=False closed=False debug=False>
20 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.scanner.tests.test_policy::test_file_priority_policy
live_server = <[AttributeError("'LiveServer' object has no attribute 'url'") raised in repr()] LiveServer object at 0x7f91f360abe0> aiosession = <aiohttp.client.ClientSession object at 0x7f91f359eef0> event_loop = <_UnixSelectorEventLoop running=False closed=False debug=False>
18 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.scanner.tests.test_policy::test_lazybfs_policy
live_server = <[AttributeError("'LiveServer' object has no attribute 'url'") raised in repr()] LiveServer object at 0x7f91f360abe0> aiosession = <aiohttp.client.ClientSession object at 0x7f920dd4d748> event_loop = <_UnixSelectorEventLoop running=False closed=False debug=False>
136 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.scanner.tests.test_backend::test_backend_endpoint_all_present
8 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.scanner.tests.test_backend::test_backend_endpoint_one_not_present
View Full Test Results (3 Failed · 31 Passed)

Event Timeline

Build is green

Patch application report for D5996 (id=21646)

Rebasing onto 33b1316bcd...

Current branch diff-target is up to date.
Changes applied before test
commit ddfccaea8f616fc806f81050fbd6b89509768ab8
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jul 15 19:07:33 2021 +0200

    new scan policies

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/136/ for more details.

zack requested changes to this revision.Jul 16 2021, 11:02 AM
zack added a subscriber: zack.

Looks great! I've noted down only some nits.

swh/scanner/policy.py
27

"SWHIDs"

115

why is this @no_type_check needed?

125

comment is unclear: "deepest" what? maybe "check deepest node first" ?

149

how come the "known" value can be the empty string here? Ideally it should always be a boolean, either Ture of False. If you need a tristate, maybe None would better?

swh/scanner/tests/test_policy.py
1–5

I'm confused by these two test files: is test_policy.py testing anything at all right now?

In test_scanner it looks like you're testing the results with each policy, but not actually the order in which things are queried from the backend (i.e., the policies themselves). Is that correct?

If so, we should test the various node ordering here in test_policy.py. It shouldn't be hard on a simple model object example.

This revision now requires changes to proceed.Jul 16 2021, 11:02 AM
swh/scanner/policy.py
115

Because when iterating over the merkle tree defined in swh.model.merkle mypy expects a MerkleNode while actually are subclasses such as model.from_disk.Directory/Content that have properties not defined in the base class.

swh/scanner/tests/test_policy.py
1–5

Actually it's not easy to test the order of the nodes queried to the backend since the order could be read only during the request to the backend.

If we want to test the behavior of the policies we could just save the known status of the nodes in swh.scanner.data.MerkleNodeInfo. This could be useful for both the unit tests and for the policies to know the status of a certain node.

The status can be defined as an enumeration:

  • NodeStatus.UNSET = the known value is not already set
  • NodeStatus.SET = the known value is set by the policy
  • NodeStatus.QUERIED = the known value is set by a policy after a backend request
swh/scanner/tests/test_policy.py
1–5

Here's what I had in mind to test policies.

To test policies you need a swh-scanner backend, which can either be mocked or the real thing that one can spawn during tests.

If you are mocking it, you can make it so at each received request, the mock code append to a list the received SWHIDs one by one. At the end of the test session you can access that list and it will give you the order in which SWHIDs have been queried.

If you are not mocking the backend, you can either subclass or modify the real backend so that it keeps a log of all queried SWHIDs in an analogous list, which you can check after the scan is done to check the order in it.

How does this sound?

test scan policies SWHIDs request to the backend

Build has FAILED

Patch application report for D5996 (id=21702)

Rebasing onto 33b1316bcd...

Current branch diff-target is up to date.
Changes applied before test
commit 879770845b0c592ed1c62acfe60d89879cd9b4f6
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jul 15 19:07:33 2021 +0200

    new scan policies

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/137/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/137/console

Build has FAILED

Patch application report for D5996 (id=21706)

Rebasing onto 33b1316bcd...

Current branch diff-target is up to date.
Changes applied before test
commit 80a42bf1fe71214c7528138d2a3b63940b149d59
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jul 15 19:07:33 2021 +0200

    new scan policies

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/138/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/138/console

Build is green

Patch application report for D5996 (id=21712)

Rebasing onto 33b1316bcd...

Current branch diff-target is up to date.
Changes applied before test
commit f92723a32bc25437f83d18ff16ce510918a6091d
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jul 15 19:07:33 2021 +0200

    new scan policies

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/139/ for more details.

Wonderful, thanks for adding the order tests! LGTM.

I'm accepting this, but please notice that you overlooked one of the comments in my previous review: it's a minor one, the one about "deepest" wording in a comment.

Please fix/improve it before landing.

This revision is now accepted and ready to land.Jul 21 2021, 9:59 AM

Build is green

Patch application report for D5996 (id=21717)

Rebasing onto 33b1316bcd...

Current branch diff-target is up to date.
Changes applied before test
commit 00c8ad617c92cc0718fb9f00579b524b609eba29
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jul 15 19:07:33 2021 +0200

    new scan policies

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/140/ for more details.

Build is green

Patch application report for D5996 (id=21718)

Rebasing onto 33b1316bcd...

Current branch diff-target is up to date.
Changes applied before test
commit d5a070e1429d54fd2679d68b71b19efb65d833dd
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jul 15 19:07:33 2021 +0200

    add scan policies
    
    - make scan policy selectable through --policy/-p option
    - abstract 'Policy' class to create new scan policies
    - integration of existing policies into the master branch
    
    Closes T3420

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/141/ for more details.

This revision was automatically updated to reflect the committed changes.