WIP: scanner benchmark
Closed, Public

Authored by DanSeraf on Dec 11 2020, 12:46 PM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Commits
rDTSCN7bd1939949dc: scanner experiments

Diff Detail

Repository
rDTSCN Code scanner
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D4721 (id=16722)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.
Changes applied before test
commit 60ace4ee71099e122ec1557edafb830ce0f0af5a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner benchmark

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/84/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/84/console

zack requested changes to this revision.Dec 14 2020, 3:03 PM
zack added a subscriber: zack.
zack added inline comments.
benchmark.py
32–34

As per our call today, here (and elsewhere) is the main thing we'd like to change: instead of having ports hardcoded as a range in the code, let's have a separate file mapping knowledge base (KB) names to URLs. (We said ports in the call, but now that I think of it, URLs are even more flexible.)
The content of the file could be something like:

$ cat kb-urls.txt
known0 http://localhost:6000/api/1/
known10 http://localhost:6001/api/1/
known20 http://localhost:6002/api/1/
# ignore this line
# ...
known100 http://localhost:6010/api/1/
swhknown http://localhost:6011/api/1/

This way we have a single place where we define all the KBs to test and the associated base URLs.

Having comment support as in the above example would enable easily switching on/off knowledge bases for the experiments.
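A minimal sketch of how such a mapping file could be read, assuming one whitespace-separated "NAME URL" pair per line and "#" at the start of a line as the comment marker. The function name parse_kb_urls is hypothetical and is not part of the diff under review.

```python
from typing import Dict, Iterable


def parse_kb_urls(lines: Iterable[str]) -> Dict[str, str]:
    """Map KB names to base URLs, skipping blank lines and '#' comments."""
    kbs: Dict[str, str] = {}
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # disabled or empty entry
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 'NAME URL', got {raw!r}")
        name, url = fields
        kbs[name] = url
    return kbs


# In-memory sample; a real run would pass an open file object instead.
sample = """\
known0 http://localhost:6000/api/1/
# known10 http://localhost:6001/api/1/
swhknown http://localhost:6011/api/1/
"""
kb_urls = parse_kb_urls(sample.splitlines())
```

Commenting out a line, as done for known10 in the sample, is enough to exclude that KB from a run.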

71–78

this is going away too, in favor of a KB-mapping file

82–84

subprocess.check_call is the standard way to avoid having to test for exit code by hand.

But, even better, since Python 3.5 you have subprocess.run which supports a check=True kwarg.

See: https://docs.python.org/3/library/subprocess.html#subprocess.run
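For illustration, a short sketch of the check=True behaviour. The child commands are stand-ins (the Python interpreter itself) and not the commands that benchmark.py actually runs.

```python
import subprocess
import sys

# With check=True a zero exit status returns normally...
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    check=True,
    capture_output=True,  # Python 3.7+
    text=True,
)

# ...and a non-zero exit status raises CalledProcessError,
# so there is no need to inspect returncode by hand.
try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"], check=True)
except subprocess.CalledProcessError as exc:
    failed_code = exc.returncode
```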

requirements.txt
13 ↗(On Diff #16722)

please drop this (see my other comment about repo info below)

run_backend.sh
9

if we have a mapping file, you can also switch to passing that as an argument to this script: gunicorn will start all (and only) the KB backends listed in the file (you just have to strip the "http://" prefix, and what you get will be a good value for gunicorn -b)
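One possible way to derive the bind address from a mapping-file URL, sketched in Python. It assumes gunicorn -b is given a HOST:PORT value, so it drops the URL path as well as the scheme. The helper name gunicorn_bind is hypothetical.

```python
from urllib.parse import urlsplit


def gunicorn_bind(url: str) -> str:
    """Return the HOST:PORT part of a URL, usable as a `gunicorn -b` value."""
    return urlsplit(url).netloc


bind = gunicorn_bind("http://localhost:6000/api/1/")
```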

run_benchmark.sh
14–17

this will go too, in favor of the mapping file name as input

swh/scanner/benchmark_algos.py
25–28

if you want to simplify this, there's: https://docs.python.org/3/library/collections.html#collections.Counter

from collections import Counter

c = Counter()  # missing keys default to 0
c['api_calls'] += 1
c['queries'] += 1

357–359

To keep disk usage manageable, we have cloned all repos with --depth 1, which means you will always obtain exactly one revision in the rev list. So this is pointless.

Also, I've stored repo info on granet already, like this:

$ cat repos-20000/99/13399.info.json 
{
  "origin": "https://github.com/mjsoltysiak/Python",
  "commit": "359a4e613084b053ee6b5ebb696ba0be7c284aa7"
}

So let's drop the usage of GitPython altogether. Just load the origin and commit from the repo's info.json file and add them to the tuples you emit.
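A sketch of reading that file, assuming it always carries the "origin" and "commit" keys shown above. The helper name load_repo_info is hypothetical, and a temporary file stands in for the real path on granet.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple


def load_repo_info(info_path: Path) -> Tuple[str, str]:
    """Return (origin, commit) from a repo's .info.json file."""
    with info_path.open() as f:
        info = json.load(f)
    return info["origin"], info["commit"]


# Demo on a temporary copy of the example content quoted above.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "13399.info.json"
    path.write_text(json.dumps({
        "origin": "https://github.com/mjsoltysiak/Python",
        "commit": "359a4e613084b053ee6b5ebb696ba0be7c284aa7",
    }))
    origin, commit = load_repo_info(path)
```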

363–375

add a trailing "else" that will fail if you get passed an unexpected value (or at least log it)

alternatively you can switch to a dictionary used as a dispatcher from algorithm names to Callables
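A rough sketch of the dictionary-dispatch variant. Only the "random" name appears in this review; "other", the two placeholder functions, and the list-of-strings tree type are hypothetical and do not reflect the real signatures in benchmark_algos.py.

```python
from typing import Callable, Dict, List


def run_random(tree: List[str]) -> str:
    # Placeholder body; the real algorithm lives in benchmark_algos.py.
    return f"random over {len(tree)} nodes"


def run_other(tree: List[str]) -> str:
    # Hypothetical second algorithm, for illustration only.
    return f"other over {len(tree)} nodes"


ALGORITHMS: Dict[str, Callable[[List[str]], str]] = {
    "random": run_random,
    "other": run_other,
}


def run_algorithm(name: str, tree: List[str]) -> str:
    """Look up and run an algorithm, failing loudly on unknown names."""
    try:
        algo = ALGORITHMS[name]
    except KeyError:
        raise ValueError(
            f"unknown algorithm {name!r}; expected one of {sorted(ALGORITHMS)}"
        ) from None
    return algo(tree)
```

The KeyError branch plays the role of the trailing "else": an unexpected name can no longer fall through silently.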

378–383

add to this the repo numerical ID, just in case

380

It would be nice to compare, on some repos, the result of len(source_tree) with the number of nodes I've independently counted for each repo.

You can obtain it like this on granet:

$ zstdcat repos-20000/99/01399.nodes.zst | wc -l
281

in theory, len(source_tree) should return the same value for the given repo.

swh/scanner/cli.py
277

why do you need a dedicated logger?
AFAICT the following should be enough:
logging.basicConfig(filename="benchmark.log")
and you can also customize the logging style with style=...

see: https://docs.python.org/3/library/logging.html#logging.basicConfig

or am I missing something?
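For reference, a sketch of what that could look like with a custom format. The format string and the "benchmark" logger name are assumptions, and the log goes to a temporary directory here rather than the working directory.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "benchmark.log")

logging.basicConfig(
    filename=log_path,
    level=logging.INFO,
    style="{",  # use str.format-style placeholders in `format`
    format="{asctime} {levelname} {name}: {message}",
    force=True,  # Python 3.8+: replace any handlers configured earlier
)

logging.getLogger("benchmark").info("started")
logging.shutdown()  # flush and close the file handler

with open(log_path) as f:
    log_text = f.read()
```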

This revision now requires changes to proceed.Dec 14 2020, 3:03 PM

requested changes
+ algorithms can be specified from run_benchmark.sh
+ if "random" algorithm is specified, benchmark.py will run three experiments using the default seeds (10, 20, 30)

The mapping file should contain:

  • the backend name
  • the backend api URL
  • the backend filepath

Build has FAILED

Patch application report for D4721 (id=16835)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.
Changes applied before test
commit 939e5412bdf119e41564474c46c61fc97d5230d2
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/85/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/85/console

remove git missing imports in mypy.ini

Build has FAILED

Patch application report for D4721 (id=16852)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.
Changes applied before test
commit 9f3753a472003f297381fb4248f14116e9c2b8c6
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/86/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/86/console

variable name in run_benchmark.sh

Build has FAILED

Patch application report for D4721 (id=16853)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.
Changes applied before test
commit 075d7fc88aa61deb145aeb1964f40ab98eb9a835
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/87/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/87/console

This revision is now accepted and ready to land.Dec 18 2020, 8:38 AM

wrong algorithm name in example

Build has FAILED

Patch application report for D4721 (id=16887)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.
Changes applied before test
commit 7bd1939949dcbcf0c52b8647f2b1750f2c9d2300
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/88/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/88/console

This revision was landed with ongoing or failed builds.Dec 19 2020, 4:46 PM
This revision was automatically updated to reflect the committed changes.