Cleanup `archive.lookup_missing_hashes` and `api_swhid_known`
Changes Planned · Public · Draft

Authored by Ericson2314 on May 5 2022, 6:29 PM.

Details

Reviewers
vlorentz
Group Reviewers
Reviewers
Summary

Each commit is self-contained and has its own description, but the basic idea is to:

  • Work with structured information rather than strings for as long as possible
  • Avoid mixing hashes of different object types together
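The two goals above can be illustrated with a minimal, standard-library-only sketch. It is not the code in this diff: `ObjectType`, `ParsedSwhid`, `parse_swhid`, `group_by_type`, and `lookup_missing` are hypothetical names, and the "archive" is just an in-memory dict. The sketch parses each identifier once, keeps the hash as raw bytes, and keeps one set of hashes per object type, so that a content and a directory sharing the same hash are never confused.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Set


class ObjectType(Enum):
    """Hypothetical stand-in for the object-type tags used in SWHIDs."""

    CONTENT = "cnt"
    DIRECTORY = "dir"
    REVISION = "rev"
    RELEASE = "rel"
    SNAPSHOT = "snp"


@dataclass(frozen=True)
class ParsedSwhid:
    object_type: ObjectType
    object_id: bytes  # raw hash bytes, not a hex string

    def __str__(self) -> str:
        # Convert back to text only at the boundary (e.g. the API response).
        return f"swh:1:{self.object_type.value}:{self.object_id.hex()}"


def parse_swhid(text: str) -> ParsedSwhid:
    """Parse a core SWHID of the form swh:1:<type>:<40 hex digits>."""
    scheme, version, type_tag, hex_id = text.split(":")
    if scheme != "swh" or version != "1" or len(hex_id) != 40:
        raise ValueError(f"not a core SWHID: {text!r}")
    return ParsedSwhid(ObjectType(type_tag), bytes.fromhex(hex_id))


def group_by_type(swhids: Iterable[ParsedSwhid]) -> Dict[ObjectType, Set[bytes]]:
    """Keep one set of hashes per object type instead of one mixed set."""
    grouped: Dict[ObjectType, Set[bytes]] = defaultdict(set)
    for swhid in swhids:
        grouped[swhid.object_type].add(swhid.object_id)
    return dict(grouped)


def lookup_missing(
    grouped: Dict[ObjectType, Set[bytes]],
    archived: Dict[ObjectType, Set[bytes]],
) -> Dict[ObjectType, Set[bytes]]:
    """Return, per object type, the hashes absent from the (fake) archive."""
    return {
        object_type: ids - archived.get(object_type, set())
        for object_type, ids in grouped.items()
    }
```

With this shape, a hash that exists as a content but not as a directory is reported missing only under the directory type, which a single mixed set of hashes could not express.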

Diff Detail

Repository
rDWAPPS Web applications
Branch
lookup_missing_hashes-bytes
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 29159
Build 45590: Phabricator diff pipeline on jenkins · Jenkins console · Jenkins
Build 45589: arc lint + arc unit

Unit Tests: Failed

Time    Test
326 ms    Jenkins > .tox.py3.lib.python3.7.site-packages.swh.web.tests.api.test_apiresponse::Tests / Python tests / test_api_endpoints_have_cors_headers
client = <django.test.client.Client object at 0x7f0774cff668> content = {'blake2s256': 'e91d48e652a7f3f9f856464f7b1f409513964cfa0170e71800c6d0868400bac9', 'data': '{\n "name": "highlightjs-...e": "http://wcoder.github.io/highlightjs-line-numbers.js/"\n}\n', 'encoding': 'us-ascii', 'hljs_language': 'json', ...} directory = '7a707609b5ab1e3fcf121d9077b089b375b0c831'
186 ms    Jenkins > .tox.py3.lib.python3.7.site-packages.swh.web.tests.api.views.test_identifiers::Tests / Python tests / test_api_known_swhid_all_present
api_client = <rest_framework.test.APIClient object at 0x7f077472b7f0> content = {'blake2s256': 'af0742eab3f09fa8d9e8252feca9440d5308fa0af6ac6ecdca91d4f5cc4f8c28', 'data': 'This is only a very brief ...s a naive epsilon of 0.01 radians to ensure\ntermination.\n', 'encoding': 'us-ascii', 'hljs_language': 'markdown', ...} directory = 'a1e9e42365675e533d9b555db1170929cd34364d'
324 ms    Jenkins > .tox.py3.lib.python3.7.site-packages.swh.web.tests.api.views.test_identifiers::Tests / Python tests / test_api_known_swhid_same_hash
api_client = <rest_framework.test.APIClient object at 0x7f077496e940> content = {'blake2s256': 'af0742eab3f09fa8d9e8252feca9440d5308fa0af6ac6ecdca91d4f5cc4f8c28', 'data': 'This is only a very brief ...s a naive epsilon of 0.01 radians to ensure\ntermination.\n', 'encoding': 'us-ascii', 'hljs_language': 'markdown', ...}
184 ms    Jenkins > .tox.py3.lib.python3.7.site-packages.swh.web.tests.api.views.test_identifiers::Tests / Python tests / test_api_known_swhid_some_present
api_client = <rest_framework.test.APIClient object at 0x7f07744c6240> content = {'blake2s256': 'af0742eab3f09fa8d9e8252feca9440d5308fa0af6ac6ecdca91d4f5cc4f8c28', 'data': 'This is only a very brief ...s a naive epsilon of 0.01 radians to ensure\ntermination.\n', 'encoding': 'us-ascii', 'hljs_language': 'markdown', ...} directory = 'a1e9e42365675e533d9b555db1170929cd34364d'
3,418 ms    Jenkins > .tox.py3.lib.python3.7.site-packages.swh.web.tests.api.views.test_origin::Tests / Python tests / test_api_lookup_origin_visits
self = <hypothesis.core.StateForActualGivenExecution object at 0x7f076c677cc0> data = ConjectureData(INTERESTING, 495 bytes, frozen)
View Full Test Results (6 Failed · 969 Passed · 5 Skipped)

Event Timeline

This conflicts with D7748, but only superficially. I am hedging my bets on which will get past CI first by basing them both on master :). Whichever lands first, I will then rebase the other on top.

Build has FAILED

Patch application report for D7749 (id=28019)

Rebasing onto 468dda170e...

Current branch diff-target is up to date.
Changes applied before test
commit a8cee44ed55ac7f4627a58c6a11391d4590f6795
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 12:20:51 2022 -0400

    Make archive.lookup_missing_hashes output bytes
    
    All things equal, I think the bytes representation is better, and in
    this case it works well for existing callers too.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1783/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1783/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 5 2022, 6:50 PM
Harbormaster failed remote builds in B29095: Diff 28019!

Changed `directory` to `directory.id` in the test, which hopefully fixes the failure.

Build has FAILED

Patch application report for D7749 (id=28020)

Rebasing onto 468dda170e...

Current branch diff-target is up to date.
Changes applied before test
commit f3b80574ad3dbf8452c5e799052c9b2a80a6d1f7
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 12:20:51 2022 -0400

    Make archive.lookup_missing_hashes output bytes
    
    All things equal, I think the bytes representation is better, and in
    this case it works well for existing callers too.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1784/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1784/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 5 2022, 7:40 PM
Harbormaster failed remote builds in B29096: Diff 28020!

Build has FAILED

Patch application report for D7749 (id=28077)

Rebasing onto e6a8303eef...

Current branch diff-target is up to date.
Changes applied before test
commit 68f1a4376d482f7efd432b6dc87030755c33acfc
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:48:07 2022 -0400

    Overhaul `lookup_missing_hashes`
    
    Keep hashes separated by type to make bugs less likely.

commit 9825ad425f2960e8449bdd85378125ddb585eb3e
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:59:12 2022 -0400

    Tweak `api_swhid_known` for perf and avoiding strings
    
    By shuffling around the algorithm, we avoid a `hash_to_bytes` and work
    more with the structured data.

commit b3c82465a688e3f6bdea7a1568e3993344c9a229
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 12:20:51 2022 -0400

    Make archive.lookup_missing_hashes output bytes
    
    All things equal, I think the bytes representation is better, and in
    this case it works well for existing callers too.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1796/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1796/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 6 2022, 4:31 PM
Harbormaster failed remote builds in B29149: Diff 28077!
Ericson2314 retitled this revision from "Make archive.lookup_missing_hashes output bytes" to "Cleanup `archive.lookup_missing_hashes` and `api_swhid_known`". May 6 2022, 4:32 PM
Ericson2314 edited the summary of this revision.

Build has FAILED

Patch application report for D7749 (id=28078)

Rebasing onto e6a8303eef...

Current branch diff-target is up to date.
Changes applied before test
commit fec74e69e6fd54e03d5e66ed4056b599824fe5da
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:48:07 2022 -0400

    Overhaul `lookup_missing_hashes`
    
    Keep hashes separated by type to make bugs less likely.

commit 9825ad425f2960e8449bdd85378125ddb585eb3e
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:59:12 2022 -0400

    Tweak `api_swhid_known` for perf and avoiding strings
    
    By shuffling around the algorithm, we avoid a `hash_to_bytes` and work
    more with the structured data.

commit b3c82465a688e3f6bdea7a1568e3993344c9a229
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 12:20:51 2022 -0400

    Make archive.lookup_missing_hashes output bytes
    
    All things equal, I think the bytes representation is better, and in
    this case it works well for existing callers too.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1797/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1797/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 6 2022, 4:39 PM
Harbormaster failed remote builds in B29150: Diff 28078!

See if factoring out a function makes mypy happy

Build has FAILED

Patch application report for D7749 (id=28080)

Rebasing onto e6a8303eef...

Current branch diff-target is up to date.
Changes applied before test
commit 8f0b3d92456954ce01d6e39178f90d0b3200a4b6
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:48:07 2022 -0400

    Overhaul `lookup_missing_hashes`
    
    Keep hashes separated by type to make bugs less likely.

commit 9825ad425f2960e8449bdd85378125ddb585eb3e
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:59:12 2022 -0400

    Tweak `api_swhid_known` for perf and avoiding strings
    
    By shuffling around the algorithm, we avoid a `hash_to_bytes` and work
    more with the structured data.

commit b3c82465a688e3f6bdea7a1568e3993344c9a229
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 12:20:51 2022 -0400

    Make archive.lookup_missing_hashes output bytes
    
    All things equal, I think the bytes representation is better, and in
    this case it works well for existing callers too.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1799/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1799/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 6 2022, 5:01 PM
Harbormaster failed remote builds in B29152: Diff 28080!

  • Fix error (had set comprehension not map!)
  • Add TODO about leveraging D7751 once it lands

Build has FAILED

Patch application report for D7749 (id=28084)

Rebasing onto e6a8303eef...

Current branch diff-target is up to date.
Changes applied before test
commit 96e128796b3c5f93e8d75878b8391136bdf2f00a
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:48:07 2022 -0400

    Overhaul `lookup_missing_hashes`
    
    Keep hashes separated by type to make bugs less likely.

commit 9825ad425f2960e8449bdd85378125ddb585eb3e
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:59:12 2022 -0400

    Tweak `api_swhid_known` for perf and avoiding strings
    
    By shuffling around the algorithm, we avoid a `hash_to_bytes` and work
    more with the structured data.

commit b3c82465a688e3f6bdea7a1568e3993344c9a229
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 12:20:51 2022 -0400

    Make archive.lookup_missing_hashes output bytes
    
    All things equal, I think the bytes representation is better, and in
    this case it works well for existing callers too.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1800/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1800/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 6 2022, 5:36 PM
Harbormaster failed remote builds in B29156: Diff 28084!

Remember to collect iterable into set
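
The note above may refer to a general Python pitfall, sketched here under that assumption (the variable names are illustrative and not taken from the diff): `map()` returns a lazy, one-shot iterator, so it has to be collected into a set before it can be reused or tested for membership more than once.

```python
from typing import Iterator, List, Set

hex_ids: List[str] = ["00" * 20, "ff" * 20]

# map() is lazy: it returns a one-shot iterator, not a collection.
lazy: Iterator[bytes] = map(bytes.fromhex, hex_ids)
first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # the iterator is already exhausted, so this is empty

# Collecting into a set gives a reusable container with fast membership tests.
collected: Set[bytes] = set(map(bytes.fromhex, hex_ids))
is_known = bytes.fromhex("00" * 20) in collected
still_known = bytes.fromhex("00" * 20) in collected  # same answer every time
```

A set comprehension such as `{bytes.fromhex(h) for h in hex_ids}` builds the same set eagerly; the difference only matters when the lazy form is passed around uncollected.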

Build has FAILED

Patch application report for D7749 (id=28087)

Rebasing onto e6a8303eef...

Current branch diff-target is up to date.
Changes applied before test
commit cf84ac6e03d3785264243e8b166a96541d36647c
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:48:07 2022 -0400

    Overhaul `lookup_missing_hashes`
    
    Keep hashes separated by type to make bugs less likely.

commit 9825ad425f2960e8449bdd85378125ddb585eb3e
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:59:12 2022 -0400

    Tweak `api_swhid_known` for perf and avoiding strings
    
    By shuffling around the algorithm, we avoid a `hash_to_bytes` and work
    more with the structured data.

commit b3c82465a688e3f6bdea7a1568e3993344c9a229
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 12:20:51 2022 -0400

    Make archive.lookup_missing_hashes output bytes
    
    All things equal, I think the bytes representation is better, and in
    this case it works well for existing callers too.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1801/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1801/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 6 2022, 7:21 PM
Harbormaster failed remote builds in B29159: Diff 28087!

Build has FAILED

Patch application report for D7749 (id=28088)

Rebasing onto e6a8303eef...

Current branch diff-target is up to date.
Changes applied before test
commit cf84ac6e03d3785264243e8b166a96541d36647c
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:48:07 2022 -0400

    Overhaul `lookup_missing_hashes`
    
    Keep hashes separated by type to make bugs less likely.

commit 9825ad425f2960e8449bdd85378125ddb585eb3e
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 14:59:12 2022 -0400

    Tweak `api_swhid_known` for perf and avoiding strings
    
    By shuffling around the algorithm, we avoid a `hash_to_bytes` and work
    more with the structured data.

commit b3c82465a688e3f6bdea7a1568e3993344c9a229
Author: John Ericson <John.Ericson@Obsidian.Systems>
Date:   Thu May 5 12:20:51 2022 -0400

    Make archive.lookup_missing_hashes output bytes
    
    All things equal, I think the bytes representation is better, and in
    this case it works well for existing callers too.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1802/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1802/console

Harbormaster returned this revision to the author for changes because remote builds failed. May 7 2022, 7:43 PM
Harbormaster failed remote builds in B29160: Diff 28088!