⚙ D4721 WIP: scanner benchmark

DanSeraf created this revision. Dec 11 2020, 12:46 PM

Herald added a reviewer: Reviewers. Dec 11 2020, 12:46 PM

Build has FAILED

Patch application report for D4721 (id=16722)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.

Changes applied before test

commit 60ace4ee71099e122ec1557edafb830ce0f0af5a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner benchmark

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/84/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/84/console

Harbormaster failed remote builds in B17887: Diff 16722! Dec 11 2020, 12:48 PM

zack requested changes to this revision. Dec 14 2020, 3:03 PM

zack added a subscriber: zack.

zack added inline comments.

benchmark.py
32–34	As per our call today, here (and elsewhere) is the main thing we'd like to change: instead of having ports hardcoded as a range in the code, let's have a separate file mapping knowledge base (KB) names to URLs. (We said ports in the call, but now that I think of it URLs is even more flexible.) The content of the file could be something like:

    $ cat kb-urls.txt
    known0 http://localhost:6000/api/1/
    known10 http://localhost:6001/api/1/
    known20 http://localhost:6002/api/1/
    # ignore this line
    # ...
    known100 http://localhost:6010/api/1/
    swhknown http://localhost:6011/api/1/

This way we have a single place where we define all the KBs to test and the associated base URLs. Having comment support as in the above example would enable easily switching on/off knowledge bases for the experiments.
71–78	this is going away too, in favor of a KB-mapping file
82–84	`subprocess.check_call` is the standard way to avoid having to test for exit code by hand. But, even better, since Python 3.5 you have `subprocess.run` which supports a `check=True` kwarg. See: https://docs.python.org/3/library/subprocess.html#subprocess.run
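A minimal sketch of the `subprocess.run(..., check=True)` suggestion for lines 82–84. The commands here are placeholders that call the Python interpreter itself; they are not the commands benchmark.py actually runs, and `run_checked` is a hypothetical helper name.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run cmd and raise CalledProcessError on a non-zero exit status,
    so the caller does not have to test the return code by hand."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# A command that succeeds: the result object carries the captured output.
ok = run_checked([sys.executable, "-c", "print('done')"])
print(ok.stdout.strip())

# A command that fails: check=True turns the exit status into an exception.
try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
except subprocess.CalledProcessError as exc:
    exit_code = exc.returncode
    print(f"command failed with exit code {exit_code}")
```

`capture_output` and `text` require Python 3.7 or later; `check=True` alone works from Python 3.5.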
requirements.txt
13 ↗	(On Diff #16722)	please drop this (see my other comment about repo info below)
run_backend.sh
9	if we have a mapping file, you can also switch to passing that as an argument to this script: gunicorn will start all (and only) the KB backends listed in the file (you just have to strip the "http://" prefix, and what you get will be a good value for `gunicorn -b`)
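A minimal sketch of deriving a bind address from a KB URL, written in Python for illustration even though run_backend.sh is a shell script. It assumes only the `host:port` part is wanted, so it drops the path as well as the scheme; `bind_address` is a hypothetical helper name, and the URLs are the ones from the kb-urls.txt example in this review.

```python
from urllib.parse import urlsplit


def bind_address(url):
    """Return the host:port part of a KB URL, usable as a bind address."""
    parts = urlsplit(url)
    if not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    return parts.netloc


kb_urls = {
    "known0": "http://localhost:6000/api/1/",
    "swhknown": "http://localhost:6011/api/1/",
}

# One bind address per knowledge base listed in the mapping.
binds = {name: bind_address(url) for name, url in kb_urls.items()}
for name, bind in binds.items():
    print(name, bind)
```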
run_benchmark.sh
14–17	this will go too, in favor of the mapping file name as input
swh/scanner/benchmark_algos.py
25–28	if you want to simplify this, there's: https://docs.python.org/3/library/collections.html#collections.Counter

    c = Counter()
    c['api_calls'] += 1
    c['queries'] += 1
357–359	To keep disk usage manageable, we have cloned all repos with `--depth 1`, which means you will always obtain exactly one revision in the rev list. So this is pointless. Also, I've stored repo info on granet already, like this:

    $ cat repos-20000/99/13399.info.json
    {
      "origin": "https://github.com/mjsoltysiak/Python",
      "commit": "359a4e613084b053ee6b5ebb696ba0be7c284aa7"
    }

So let's drop the usage of GitPython altogether. Just load the info available from the repo's info.json file and add it to the tuples you emit.
363–375	add a trailing "else" that will fail if you get passed an unexpected value (or at least log it). Alternatively, you can switch to a dictionary used as a dispatcher from algorithm names to Callables.
378–383	add to this the repo numerical ID, just in case
380	It would be nice to compare, on some repos, the result of `len(source_tree)` with the number of nodes I've independently counted for each repo. You can obtain it like this on granet:

    $ zstdcat repos-20000/99/01399.nodes.zst | wc -l
    281

In theory, `len(source_tree)` should return the same value for the given repo.
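Two of the suggestions for this file, the `Counter` for lines 25–28 and the dispatch dictionary for lines 363–375, can be combined in one small sketch. The algorithm names and function bodies below are hypothetical placeholders, not the ones defined in benchmark_algos.py.

```python
from collections import Counter
from typing import Callable, Dict


def run_algo_a(counters: Counter) -> None:
    # Placeholder body: pretend one API call carrying two queries was made.
    counters["api_calls"] += 1
    counters["queries"] += 2


def run_algo_b(counters: Counter) -> None:
    # Placeholder body: pretend three API calls with one query each.
    counters["api_calls"] += 3
    counters["queries"] += 3


# Dispatcher from algorithm names to callables.
ALGORITHMS: Dict[str, Callable[[Counter], None]] = {
    "algo_a": run_algo_a,
    "algo_b": run_algo_b,
}


def run(name: str) -> Counter:
    """Run the named algorithm, failing loudly on an unexpected name."""
    try:
        algo = ALGORITHMS[name]
    except KeyError:
        known = ", ".join(sorted(ALGORITHMS))
        raise ValueError(f"unknown algorithm {name!r}; expected one of: {known}") from None
    counters: Counter = Counter()  # missing keys start at 0
    algo(counters)
    return counters


result = run("algo_a")
print(dict(result))

try:
    run("no_such_algo")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

With a dictionary, adding an algorithm is a one-line change, and the unknown-name case is handled in a single place instead of a trailing `else`.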
swh/scanner/cli.py
277	why do you need a dedicated logger? AFAICT the following should be enough: `logging.basicConfig(filename="benchmark.log")`, and you can also customize the logging style with `style=...`; see: https://docs.python.org/3/library/logging.html#logging.basicConfig. Or am I missing something?
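A minimal sketch of the `logging.basicConfig` suggestion. The log file is written to a temporary directory here so the example is self-contained (the review suggests `benchmark.log`), and the format string is an assumption chosen only to show `style="{"`.

```python
import logging
import os
import tempfile

# Assumption: a throwaway location; in the scanner this would be "benchmark.log".
log_path = os.path.join(tempfile.mkdtemp(), "benchmark.log")

logging.basicConfig(
    filename=log_path,
    level=logging.INFO,
    style="{",  # the format string below uses str.format-style fields
    format="{asctime} {levelname} {message}",
    force=True,  # Python 3.8+: replace handlers already set on the root logger
)

logging.info("benchmark started")
logging.shutdown()  # flush and close the file handler

with open(log_path) as log_file:
    log_text = log_file.read()
print(log_text, end="")
```

Configuring the root logger once this way means every module can keep calling `logging.info(...)` or `logging.getLogger(__name__)` without a dedicated logger object being passed around.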

This revision now requires changes to proceed. Dec 14 2020, 3:03 PM

requested changes
+ algorithms can be specified from run_benchmark.sh
+ if "random" algorithm is specified, benchmark.py will run three experiments using the default seeds (10, 20, 30)

The mapping file should contain:

the backend name
the backend api URL
the backend filepath
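A minimal sketch of reading such a mapping file. The on-disk format is an assumption: one backend per line, three whitespace-separated fields, with `#` starting a comment line as in the reviewer's kb-urls.txt example. The `Backend` type, the `parse_mapping` helper and the file paths in the sample are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Backend(NamedTuple):
    name: str  # backend name
    url: str   # backend API URL
    path: str  # backend file path


def parse_mapping(lines: Iterable[str]) -> Dict[str, Backend]:
    """Parse mapping lines into {name: Backend}, skipping blanks and comments."""
    backends: Dict[str, Backend] = {}
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        backend = Backend(*fields)
        backends[backend.name] = backend
    return backends


# Hypothetical sample content; the commented-out line switches one backend off.
SAMPLE = """\
known0 http://localhost:6000/api/1/ /path/to/known0.db
# known10 http://localhost:6001/api/1/ /path/to/known10.db
swhknown http://localhost:6011/api/1/ /path/to/swhknown.db
"""

backends = parse_mapping(SAMPLE.splitlines())
print(sorted(backends))
```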

Build has FAILED

Patch application report for D4721 (id=16835)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.

Changes applied before test

commit 939e5412bdf119e41564474c46c61fc97d5230d2
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/85/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/85/console

Harbormaster failed remote builds in B17988: Diff 16835! Dec 17 2020, 7:53 AM

remove git missing imports in mypy.ini

Build has FAILED

Patch application report for D4721 (id=16852)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.

Changes applied before test

commit 9f3753a472003f297381fb4248f14116e9c2b8c6
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/86/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/86/console

Harbormaster failed remote builds in B18005: Diff 16852! Dec 17 2020, 2:07 PM

variable name in run_benchmark.sh

Build has FAILED

Patch application report for D4721 (id=16853)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.

Changes applied before test

commit 075d7fc88aa61deb145aeb1964f40ab98eb9a835
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/87/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/87/console

Harbormaster failed remote builds in B18006: Diff 16853! Dec 17 2020, 2:14 PM

zack accepted this revision. Dec 18 2020, 8:38 AM

This revision is now accepted and ready to land. Dec 18 2020, 8:38 AM

wrong algorithm name in example

Build has FAILED

Patch application report for D4721 (id=16887)

Rebasing onto 65f0b8e4c6...

Current branch diff-target is up to date.

Changes applied before test

commit 7bd1939949dcbcf0c52b8647f2b1750f2c9d2300
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/88/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/88/console

Harbormaster failed remote builds in B18039: Diff 16887! Dec 19 2020, 4:43 PM

This revision was landed with ongoing or failed builds. Dec 19 2020, 4:46 PM

Closed by commit rDTSCN7bd1939949dc: scanner experiments (authored by DanSeraf).

This revision was automatically updated to reflect the committed changes.

DanSeraf added a commit: rDTSCN7bd1939949dc: scanner experiments.

Revision Contents
Changeset List

Diff 16888

benchmark.py

run_backend.sh

run_benchmark.sh

swh/scanner/backend.py

swh/scanner/benchmark_algos.py

swh/scanner/cli.py

swh/scanner/model.py
