Details
- Reviewers: zack
- Group Reviewers: Reviewers
- Commits: rDTSCN7bd1939949dc: scanner experiments
Diff Detail
- Repository: rDTSCN Code scanner
- Lint: Automatic diff as part of commit; lint not applicable.
- Unit: Automatic diff as part of commit; unit tests not applicable.
Event Timeline
Build has FAILED
Patch application report for D4721 (id=16722)
Rebasing onto 65f0b8e4c6...
Current branch diff-target is up to date.
Changes applied before test
commit 60ace4ee71099e122ec1557edafb830ce0f0af5a
Author: Daniele Serafini <me@danieleserafini.eu>
Date: Thu Dec 10 23:59:31 2020 +0100
    scanner benchmark
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/84/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/84/console
benchmark.py | ||
---|---|---|
32–34 | As per our call today, here (and elsewhere) is the main thing we'd like to change: instead of having ports hardcoded as a range in the code, let's have a separate file mapping knowledge base (KB) names to URLs. (We said ports in the call, but now that I think of it, URLs are even more flexible.) $ cat kb-urls.txt known0 http://localhost:6000/api/1/ known10 http://localhost:6001/api/1/ known20 http://localhost:6002/api/1/ # ignore this line # ... known100 http://localhost:6010/api/1/ swhknown http://localhost:6011/api/1/ This way we have a single place where we define all the KBs to test and their associated base URLs. Supporting comments, as in the above example, would make it easy to switch knowledge bases on and off for the experiments. | |
71–78 | this is going away too, in favor of a KB-mapping file | |
82–84 | subprocess.check_call is the standard way to avoid having to test for exit code by hand. But, even better, since Python 3.5 you have subprocess.run which supports a check=True kwarg. See: https://docs.python.org/3/library/subprocess.html#subprocess.run | |
requirements.txt | ||
13 ↗ | (On Diff #16722) | please drop this (see my other comment about repo info below) |
run_backend.sh | ||
9 | if we have a mapping file, you can also switch to passing that as an argument to this script: gunicorn will start all (and only) the KB backends listed in the file (you just have to strip the "http://" prefix, and what you get will be a good value for gunicorn -b) | |
run_benchmark.sh | ||
14–17 | this will go too, in favor of the mapping file name as input | |
swh/scanner/benchmark_algos.py | ||
25–28 | if you want to simplify this, there's: https://docs.python.org/3/library/collections.html#collections.Counter c = Counter(); c['api_calls'] += 1; c['queries'] += 1 | |
357–359 | To keep disk usage manageable, we have cloned all repos with --depth 1, which means you will always obtain exactly one revision in the rev list. So this is pointless. Also, I've stored repo info on granet already, like this: $ cat repos-20000/99/13399.info.json { "origin": "https://github.com/mjsoltysiak/Python", "commit": "359a4e613084b053ee6b5ebb696ba0be7c284aa7" } So let's drop the usage of GitPython altogether. Just load the info available from the repo info.json file and add it to the tuples you emit. | |
363–375 | add a trailing "else" that will fail if you get passed an unexpected value (or at least log it). Alternatively, you can switch to a dictionary used as a dispatcher from algorithm names to Callables | |
378–383 | add to this the repo numerical ID, just in case | |
380 | It would be nice to compare, on some repos, the result of len(source_tree) with the number of nodes I've independently counted for each repo. You can obtain it like this on granet: $ zstdcat repos-20000/99/01399.nodes.zst | wc -l 281 In theory, len(source_tree) should return the same value for the given repo. | |
swh/scanner/cli.py | ||
277 | why do you need a dedicated logger? see: https://docs.python.org/3/library/logging.html#logging.basicConfig or am I missing something? |
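The dictionary-based dispatch suggested for lines 363–375 of swh/scanner/benchmark_algos.py could look roughly like the sketch below. The algorithm functions and the names "random" and "priority" are hypothetical placeholders, not the real benchmark code; the point is only that an unexpected name fails loudly instead of being silently ignored.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for the real benchmark algorithms.
def run_random(nodes: List[str]) -> str:
    return f"random:{len(nodes)}"


def run_priority(nodes: List[str]) -> str:
    return f"priority:{len(nodes)}"


# Dispatcher from algorithm names to callables (names are assumptions).
ALGORITHMS: Dict[str, Callable[[List[str]], str]] = {
    "random": run_random,
    "priority": run_priority,
}


def run_algorithm(name: str, nodes: List[str]) -> str:
    """Look up the algorithm by name and fail on an unexpected value."""
    try:
        algo = ALGORITHMS[name]
    except KeyError:
        known = ", ".join(sorted(ALGORITHMS))
        raise ValueError(f"unknown algorithm {name!r}; expected one of: {known}") from None
    return algo(nodes)
```

Adding a new algorithm then means adding one dictionary entry, and the error message lists the accepted names automatically.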
requested changes
+ algorithms can be specified from run_benchmark.sh
+ if "random" algorithm is specified, benchmark.py will run three experiments using the default seeds (10, 20, 30)
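As a rough illustration of the behaviour described in the two points above, the expansion of the requested algorithms into individual experiment runs might be sketched as follows. The function name expand_runs and the non-random algorithm name used in the usage note are hypothetical; only the "random" name and the seeds 10, 20, 30 come from the description.

```python
from typing import Iterator, List, Optional, Tuple

# Default seeds named in the description above.
DEFAULT_SEEDS = (10, 20, 30)


def expand_runs(algorithms: List[str]) -> Iterator[Tuple[str, Optional[int]]]:
    """Yield one (algorithm, seed) pair per experiment to run.

    "random" is expanded into one run per default seed; every other
    algorithm is deterministic here and gets a single run with no seed.
    """
    for algo in algorithms:
        if algo == "random":
            for seed in DEFAULT_SEEDS:
                yield algo, seed
        else:
            yield algo, None
```

For example, list(expand_runs(["priority", "random"])) yields one run for the hypothetical "priority" algorithm followed by three "random" runs, one per seed.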
The mapping file should contain:
- the backend name
- the backend API URL
- the backend filepath
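A minimal sketch of how such a mapping file could be parsed, assuming whitespace-separated fields in the order name, URL, filepath, with lines starting with "#" treated as comments as in the kb-urls.txt example from the review. The Backend type, the function names, and the choice to make the filepath optional are assumptions for illustration. The bind address for gunicorn -b is taken from the host:port part of the URL, which is slightly stricter than just stripping the "http://" prefix.

```python
from typing import List, NamedTuple, Optional
from urllib.parse import urlsplit


class Backend(NamedTuple):
    name: str
    url: str
    path: Optional[str]  # backend filepath; optional in this sketch


def parse_kb_mapping(text: str) -> List[Backend]:
    """Parse 'name url [filepath]' lines, skipping blanks and '#' comments."""
    backends = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) not in (2, 3):
            raise ValueError(
                f"line {lineno}: expected 'name url [filepath]', got {raw!r}"
            )
        path = fields[2] if len(fields) == 3 else None
        backends.append(Backend(fields[0], fields[1], path))
    return backends


def gunicorn_bind(url: str) -> str:
    """Return the host:port part of a backend URL, usable with gunicorn -b."""
    return urlsplit(url).netloc
```

Commenting out a line in the file is then enough to exclude that knowledge base from both the backend startup and the benchmark run.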
Build has FAILED
Patch application report for D4721 (id=16835)
Rebasing onto 65f0b8e4c6...
Current branch diff-target is up to date.
Changes applied before test
commit 939e5412bdf119e41564474c46c61fc97d5230d2
Author: Daniele Serafini <me@danieleserafini.eu>
Date: Thu Dec 10 23:59:31 2020 +0100
    scanner experiments
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/85/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/85/console
Build has FAILED
Patch application report for D4721 (id=16852)
Rebasing onto 65f0b8e4c6...
Current branch diff-target is up to date.
Changes applied before test
commit 9f3753a472003f297381fb4248f14116e9c2b8c6
Author: Daniele Serafini <me@danieleserafini.eu>
Date: Thu Dec 10 23:59:31 2020 +0100
    scanner experiments
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/86/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/86/console
Build has FAILED
Patch application report for D4721 (id=16853)
Rebasing onto 65f0b8e4c6...
Current branch diff-target is up to date.
Changes applied before test
commit 075d7fc88aa61deb145aeb1964f40ab98eb9a835
Author: Daniele Serafini <me@danieleserafini.eu>
Date: Thu Dec 10 23:59:31 2020 +0100
    scanner experiments
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/87/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/87/console
Build has FAILED
Patch application report for D4721 (id=16887)
Rebasing onto 65f0b8e4c6...
Current branch diff-target is up to date.
Changes applied before test
commit 7bd1939949dcbcf0c52b8647f2b1750f2c9d2300
Author: Daniele Serafini <me@danieleserafini.eu>
Date: Thu Dec 10 23:59:31 2020 +0100
    scanner experiments
Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/88/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/88/console