Page MenuHomeSoftware Heritage

content_get_metadata: Adapt endpoint to return Dict[bytes, List[Dict]]
ClosedPublic

Authored by ardumont on Mon, Jan 13, 3:26 PM.

Details

Summary

Prior to this, a shallow content was returned with all keys to None except for
the sha1 key.

This now returns a dict with keys the identifier (as bytes) and the value the associated contents as list.

Clients identified using the api endpoint are [1]:

  • journal's test
  • webapp's test
  • indexer storage's rehash module (not in use right now, it was used to add a new column on contents).

They will need a slight adaptation.

[1]

grep -r ".content_get_metadata(" * | grep -v ".tox" | grep ".content_get_metadata("
swh-indexer/swh/indexer/rehash.py:                content_metadata = self.storage.content_get_metadata(
swh-journal/swh/journal/tests/test_replay.py:    contents = list(storage.content_get_metadata(
swh-storage/swh/storage/storage.py:    def content_get_metadata(self, content, db=None, cur=None):
swh-storage/swh/storage/tests/test_storage.py:        assert list(swh_storage.content_get_metadata([cont['sha1']])) == \
swh-storage/swh/storage/tests/test_storage.py:    def test_content_get_metadata(self, swh_storage):
swh-storage/swh/storage/tests/test_storage.py:        actual_md = list(swh_storage.content_get_metadata(
swh-storage/swh/storage/tests/test_storage.py:        gen = swh_storage.content_get_metadata([missing_cont['sha1']])
swh-storage/swh/storage/tests/test_storage.py:    def test_generate_content_get_metadata(self, swh_storage, swh_contents):
swh-storage/swh/storage/tests/test_storage.py:        actual_contents = list(swh_storage.content_get_metadata(get_sha1s))
swh-storage/swh/storage/in_memory.py:    def content_get_metadata(self, content):
swh-web/swh/web/tests/browse/views/test_content.py:    content_data = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/api/views/test_content.py:    expected_data = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/data.py:    contents_metadata = storage.content_get_metadata(contents)
swh-web/swh/web/tests/common/test_service.py:    content_metadata = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/common/test_service.py:    expected_content = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/common/test_service.py:    expected_content = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/conftest.py:    def content_get_metadata(self, cnt_id):
swh-web/swh/web/tests/conftest.py:        metadata = next(self.storage.content_get_metadata([cnt_id_bytes]))
Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Mon, Jan 13, 3:26 PM

That's inconsistent with other get methods, which yield None if something is missing. (I don't think their behavior is great, but consistency is more important imo)

ardumont added inline comments.Mon, Jan 13, 3:32 PM
swh/storage/in_memory.py
374

Is it better to skip it or to yield None here?

ardumont updated this revision to Diff 8958.Mon, Jan 13, 4:16 PM
  • cnt_get_metadata: Change api to return a Dict[id, Optional[content]]
ardumont retitled this revision from storage.content_get_metadata: Adapt to return nothing if inexistent id to content_get_metadata: Adapt endpoint to return Dict[bytes, Union[None, Dict]].Mon, Jan 13, 4:18 PM
ardumont edited the summary of this revision. (Show Details)
vlorentz added inline comments.Mon, Jan 13, 4:24 PM
swh/storage/storage.py
553–554

you can skip the conditionals by using an inner join and initializing result to {sha1: None for sha1 in contents}.

553–554

Even better: you can initialize it to dict.fromkeys(contents)!

ardumont updated this revision to Diff 8959.Mon, Jan 13, 4:32 PM
  • content_get_metadata: Change api to return Dict[id, List[Dict]]
ardumont added inline comments.Mon, Jan 13, 4:36 PM
swh/storage/storage.py
553–554

Sorry, i had to change the type to something more appropriate.
I think your first proposal still stands though, will try to adapt it.

vlorentz added inline comments.Mon, Jan 13, 4:39 PM
swh/storage/storage.py
553–554

yeah, you can use {sha1: [] for sha1 in contents} instead

ardumont updated this revision to Diff 8962.Mon, Jan 13, 4:45 PM
  • content_get_metadata: Change api to return Dict[bytes, List[Dict]]
swh/storage/storage.py
553–554

Yes, it works! thanks.

ardumont edited the summary of this revision. (Show Details)Mon, Jan 13, 4:45 PM
ardumont updated this revision to Diff 8963.Mon, Jan 13, 4:47 PM

Update copyright header

Sorry to not say it earlier, but you're missing a test where there are two contents with the same sha1 (which will show you there's a bug in the in-mem impl)

Sorry to not say it earlier, but you're missing a test where there are two contents with the same sha1 (which will show you there's a bug in the in-mem impl)

don't be
that's great ;)

I'm not sure how to properly test this for postgres though (i saw a similar scenario which raises a hashcollision for that case i think)...

ardumont added inline comments.Mon, Jan 13, 4:58 PM
swh/storage/in_memory.py
368

This can actually be fixed in that diff now as @vlorentz mentioned.
In progress ;)

ardumont retitled this revision from content_get_metadata: Adapt endpoint to return Dict[bytes, Union[None, Dict]] to content_get_metadata: Adapt endpoint to return Dict[bytes, List[Dict]].EditedTue, Jan 14, 10:52 AM

I'm not sure how to properly test this for postgres though (i saw a similar scenario which raises a hashcollision for that case i think)...

Yes, we cannot test this for postgres.
There is a unicity constraint on the sha1.
(So the list should always be of length 1 for new added contents)

The content_add_metadata from the in-memory implementation should behave the same.
That means the fixme we are talking about in the in-memory implementation should go away altogether.

ardumont updated this revision to Diff 8964.Tue, Jan 14, 11:38 AM

Fix in-memory implementation's fixme

vlorentz accepted this revision.Tue, Jan 14, 11:57 AM
This revision is now accepted and ready to land.Tue, Jan 14, 11:57 AM