Page MenuHomeSoftware Heritage

identifiers: Expose git_object instead of manifest
ClosedPublic

Authored by vlorentz on Apr 30 2021, 12:25 PM.

Details

Summary

The git_object is what will be actually useful to the vault.

It's also easier to test, because test_identifier.py has the
entire git_object in its test data.

However, it makes the code messier, and it's quite a big change,
so I'm not sure it's worth it.
Instead, I could slightly rewrite the tests, and make the vault
re-add the header itself (but it would duplicate this code...)

Depends on D5650

Diff Detail

Repository
rDMOD Data model
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 21176
Build 32867: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 32866: arc lint + arc unit

Unit TestsFailed

TimeTest
2 msJenkins > .tox.py3-full.lib.python3.7.site-packages.swh.model.tests.test_cli.TestIdentify::test_content_id
self = <swh.model.tests.test_cli.TestIdentify testMethod=test_content_id> def test_content_id(self):
2 msJenkins > .tox.py3-full.lib.python3.7.site-packages.swh.model.tests.test_cli.TestIdentify::test_content_id_from_stdin
self = <swh.model.tests.test_cli.TestIdentify testMethod=test_content_id_from_stdin> def test_content_id_from_stdin(self):
4 msJenkins > .tox.py3-full.lib.python3.7.site-packages.swh.model.tests.test_cli.TestIdentify::test_directory_id
self = <swh.model.tests.test_cli.TestIdentify testMethod=test_directory_id> def test_directory_id(self):
4 msJenkins > .tox.py3-full.lib.python3.7.site-packages.swh.model.tests.test_cli.TestIdentify::test_exclude
self = <swh.model.tests.test_cli.TestIdentify testMethod=test_exclude> def test_exclude(self):
1 msJenkins > .tox.py3-full.lib.python3.7.site-packages.swh.model.tests.test_cli.TestIdentify::test_hide_filename
self = <swh.model.tests.test_cli.TestIdentify testMethod=test_hide_filename> def test_hide_filename(self):
View Full Test Results (37 Failed · 433 Passed · 2 Skipped)

Event Timeline

Build has FAILED

Patch application report for D5652 (id=20186)

Could not rebase; Attempt merge onto f7e9d5c290...

Updating f7e9d5c..eaa5daa
Fast-forward
 swh/model/hashutil.py               |  59 +++----------------
 swh/model/identifiers.py            | 113 +++++++++++++++++++++++-------------
 swh/model/tests/test_identifiers.py |  56 +++++++++++++-----
 3 files changed, 124 insertions(+), 104 deletions(-)
Changes applied before test
commit eaa5daa3d5f225f6a4e9894f8b73798ea3e75a48
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 12:24:00 2021 +0200

    identifiers: Expose git_object instead of manifest
    
    The git_object is what will be actually useful to the vault.
    
    It's also easier to test, because test_identifier.py has the
    entire git_object in its test data.

commit 3a4f484feaccd569a0b9dc37531d7af50bc3871e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 11:26:32 2021 +0200

    identifiers: Expose manifest computation
    
    Before this commit, manifests were only computed internally before
    hashing, so they were not available to outside modules.
    
    This makes testing the module very painful, because identifier
    functions can only be tested by checking the hash; so test failures
    did not show mismatches between the computed manifest and
    the expected one.
    
    Additionally, the 'git bare cooker' of the vault is likely to use
    these as well, as it needs to format git objects in the same format.

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/326/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/326/console

Harbormaster returned this revision to the author for changes because remote builds failed.Apr 30 2021, 12:26 PM
Harbormaster failed remote builds in B21176: Diff 20186!

Build has FAILED

Patch application report for D5652 (id=20191)

Could not rebase; Attempt merge onto f7e9d5c290...

Updating f7e9d5c..8062ed9
Fast-forward
 swh/model/hashutil.py               |  59 +++----------------
 swh/model/identifiers.py            | 113 +++++++++++++++++++++++-------------
 swh/model/tests/test_identifiers.py |  56 +++++++++++++-----
 3 files changed, 124 insertions(+), 104 deletions(-)
Changes applied before test
commit 8062ed9a4bc314ee019765033bf985e30be579c2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 12:24:00 2021 +0200

    identifiers: Expose git_object instead of manifest
    
    The git_object is what will be actually useful to the vault.
    
    It's also easier to test, because test_identifier.py has the
    entire git_object in its test data.

commit 3a4f484feaccd569a0b9dc37531d7af50bc3871e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 11:26:32 2021 +0200

    identifiers: Expose manifest computation
    
    Before this commit, manifests were only computed internally before
    hashing, so they were not available to outside modules.
    
    This makes testing the module very painful, because identifier
    functions can only be tested by checking the hash; so test failures
    did not show mismatches between the computed manifest and
    the expected one.
    
    Additionally, the 'git bare cooker' of the vault is likely to use
    these as well, as it needs to format git objects in the same format.

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/329/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/329/console

Harbormaster returned this revision to the author for changes because remote builds failed.Apr 30 2021, 1:27 PM
Harbormaster failed remote builds in B21180: Diff 20191!

Build has FAILED

Patch application report for D5652 (id=20192)

Could not rebase; Attempt merge onto f7e9d5c290...

Updating f7e9d5c..64829e0
Fast-forward
 swh/model/hashutil.py               |  71 ++++++++--------------
 swh/model/identifiers.py            | 113 +++++++++++++++++++++++-------------
 swh/model/tests/test_identifiers.py |  56 +++++++++++++-----
 3 files changed, 138 insertions(+), 102 deletions(-)
Changes applied before test
commit 64829e0d9ea398bd2bf289bf1c57ea1fddd7a6af
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 12:24:00 2021 +0200

    identifiers: Expose git_object instead of manifest
    
    The git_object is what will be actually useful to the vault.
    
    It's also easier to test, because test_identifier.py has the
    entire git_object in its test data.

commit 3a4f484feaccd569a0b9dc37531d7af50bc3871e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 11:26:32 2021 +0200

    identifiers: Expose manifest computation
    
    Before this commit, manifests were only computed internally before
    hashing, so they were not available to outside modules.
    
    This makes testing the module very painful, because identifier
    functions can only be tested by checking the hash; so test failures
    did not show mismatches between the computed manifest and
    the expected one.
    
    Additionally, the 'git bare cooker' of the vault is likely to use
    these as well, as it needs to format git objects in the same format.

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/330/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/330/console

Harbormaster returned this revision to the author for changes because remote builds failed.Apr 30 2021, 1:29 PM
Harbormaster failed remote builds in B21181: Diff 20192!
vlorentz edited the test plan for this revision. (Show Details)

Build has FAILED

Patch application report for D5652 (id=20204)

Could not rebase; Attempt merge onto df036ef1c3...

Updating df036ef..60b138a
Fast-forward
 swh/model/hashutil.py               |  71 ++++++++--------------
 swh/model/identifiers.py            | 113 +++++++++++++++++++++++-------------
 swh/model/tests/test_identifiers.py |  56 +++++++++++++-----
 3 files changed, 138 insertions(+), 102 deletions(-)
Changes applied before test
commit 60b138a0e843ce22a2efa7ee6ddef0ec12ea4a4f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 12:24:00 2021 +0200

    identifiers: Expose git_object instead of manifest
    
    The git_object is what will be actually useful to the vault.
    
    It's also easier to test, because test_identifier.py has the
    entire git_object in its test data.

commit 8221147ce6e74b3f685baead1906675b3454065c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 11:26:32 2021 +0200

    identifiers: Expose manifest computation
    
    Before this commit, manifests were only computed internally before
    hashing, so they were not available to outside modules.
    
    This makes testing the module very painful, because identifier
    functions can only be tested by checking the hash; so test failures
    did not show mismatches between the computed manifest and
    the expected one.
    
    Additionally, the 'git bare cooker' of the vault is likely to use
    these as well, as it needs to format git objects in the same format.

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/332/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/332/console

rebase + restore hash_git_data

Build is green

Patch application report for D5652 (id=20476)

Could not rebase; Attempt merge onto 31cb72e9f0...

Updating 31cb72e..6ca45e4
Fast-forward
 swh/model/hashutil.py               |  55 ++++++++----------
 swh/model/identifiers.py            | 113 +++++++++++++++++++++++-------------
 swh/model/tests/test_identifiers.py |  56 +++++++++++++-----
 3 files changed, 140 insertions(+), 84 deletions(-)
Changes applied before test
commit 6ca45e467df6a04a18c09e250ec030cc5598df7d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 12:24:00 2021 +0200

    identifiers: Expose git_object instead of manifest
    
    The git_object is what will be actually useful to the vault.
    
    It's also easier to test, because test_identifier.py has the
    entire git_object in its test data.

commit 523ab642bb48491f8e09fb61cc32a9e11420b556
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 11:26:32 2021 +0200

    identifiers: Expose manifest computation
    
    Before this commit, manifests were only computed internally before
    hashing, so they were not available to outside modules.
    
    This makes testing the module very painful, because identifier
    functions can only be tested by checking the hash; so test failures
    did not show mismatches between the computed manifest and
    the expected one.
    
    Additionally, the 'git bare cooker' of the vault is likely to use
    these as well, as it needs to format git objects in the same format.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/335/ for more details.

Overall, considering that the manifests have never been used anywhere, and that the headers don't make them harder to unpack (on the contrary, they make them less ambiguous), I don't mind having them fully compatible with "git objects".

I've made a bunch of stylistic comments but I think the gist of this change is fine.

swh/model/identifiers.py
32

Importing underscore functions looks like poor style. I guess the git_header function could just be public (and be called git_object_header).

159

should probably just be hashlib.new

425–428

Maybe call it format_kv_git_object (or format_git_object_with_headers) by opposition to the directory object which is special.

468

format_git_object_from_parts? from_chunks?

apply comments:

  • rename _git_header to git_object_header
  • replace _git_object_to_identifier_str with direct call to hashlib.new, and remove import of _new_hash
  • rename format_git_object to format_git_object_from_headers
  • rename format_list_git_object to format_git_object_from_parts
swh/model/identifiers.py
425–428

I went for format_git_object_from_headers instead, for consistency with format_git_object_from_parts.

Build is green

Patch application report for D5652 (id=20479)

Could not rebase; Attempt merge onto 31cb72e9f0...

Updating 31cb72e..8c904dc
Fast-forward
 swh/model/hashutil.py               |  55 +++++++++---------
 swh/model/identifiers.py            | 107 ++++++++++++++++++++++--------------
 swh/model/tests/test_identifiers.py |  56 ++++++++++++++-----
 3 files changed, 135 insertions(+), 83 deletions(-)
Changes applied before test
commit 8c904dc6af617d595c1295b80c3976ea1861b92f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 12:24:00 2021 +0200

    identifiers: Expose git_object instead of manifest
    
    The git_object is what will be actually useful to the vault.
    
    It's also easier to test, because test_identifier.py has the
    entire git_object in its test data.

commit 523ab642bb48491f8e09fb61cc32a9e11420b556
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Apr 30 11:26:32 2021 +0200

    identifiers: Expose manifest computation
    
    Before this commit, manifests were only computed internally before
    hashing, so they were not available to outside modules.
    
    This makes testing the module very painful, because identifier
    functions can only be tested by checking the hash; so test failures
    did not show mismatches between the computed manifest and
    the expected one.
    
    Additionally, the 'git bare cooker' of the vault is likely to use
    these as well, as it needs to format git objects in the same format.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/336/ for more details.

olasd added inline comments.
swh/model/identifiers.py
425–428

Thing is, there's headers *and* a message (which is why I suggested with). But hey, I like consistency as much as the next person...

This revision is now accepted and ready to land.May 11 2021, 11:31 AM