Page MenuHomeSoftware Heritage

content_get_metadata: Adapt endpoint to return Dict[bytes, List[Dict]]
ClosedPublic

Authored by ardumont on Jan 13 2020, 3:26 PM.

Details

Summary

Prior to this, a shallow content was returned with all keys to None except for
the sha1 key.

This now returns a dict with keys the identifier (as bytes) and the value the associated contents as list.

Clients identified using the api endpoint are [1]:

  • journal's test
  • webapp's test
  • indexer storage's rehash module (not in use right now, it was used to add a new column on contents).

They will need a slight adaptation.

[1]

grep -r ".content_get_metadata(" * | grep -v ".tox" | grep ".content_get_metadata("
swh-indexer/swh/indexer/rehash.py:                content_metadata = self.storage.content_get_metadata(
swh-journal/swh/journal/tests/test_replay.py:    contents = list(storage.content_get_metadata(
swh-storage/swh/storage/storage.py:    def content_get_metadata(self, content, db=None, cur=None):
swh-storage/swh/storage/tests/test_storage.py:        assert list(swh_storage.content_get_metadata([cont['sha1']])) == \
swh-storage/swh/storage/tests/test_storage.py:    def test_content_get_metadata(self, swh_storage):
swh-storage/swh/storage/tests/test_storage.py:        actual_md = list(swh_storage.content_get_metadata(
swh-storage/swh/storage/tests/test_storage.py:        gen = swh_storage.content_get_metadata([missing_cont['sha1']])
swh-storage/swh/storage/tests/test_storage.py:    def test_generate_content_get_metadata(self, swh_storage, swh_contents):
swh-storage/swh/storage/tests/test_storage.py:        actual_contents = list(swh_storage.content_get_metadata(get_sha1s))
swh-storage/swh/storage/in_memory.py:    def content_get_metadata(self, content):
swh-web/swh/web/tests/browse/views/test_content.py:    content_data = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/api/views/test_content.py:    expected_data = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/data.py:    contents_metadata = storage.content_get_metadata(contents)
swh-web/swh/web/tests/common/test_service.py:    content_metadata = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/common/test_service.py:    expected_content = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/common/test_service.py:    expected_content = archive_data.content_get_metadata(content['sha1'])
swh-web/swh/web/tests/conftest.py:    def content_get_metadata(self, cnt_id):
swh-web/swh/web/tests/conftest.py:        metadata = next(self.storage.content_get_metadata([cnt_id_bytes]))
Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10065
Build 14904: tox-on-jenkinsJenkins
Build 14903: arc lint + arc unit

Event Timeline

That's inconsistent with other get methods, which yield None if something is missing. (I don't think their behavior is great, but consistency is more important imo)

swh/storage/in_memory.py
367–373

Is it better to skip it or to yield None here?

  • cnt_get_metadata: Change api to return a Dict[id, Optional[content]]
ardumont retitled this revision from storage.content_get_metadata: Adapt to return nothing if inexistent id to content_get_metadata: Adapt endpoint to return Dict[bytes, Union[None, Dict]].Jan 13 2020, 4:18 PM
ardumont edited the summary of this revision. (Show Details)
swh/storage/storage.py
555–560

you can skip the conditionals by using an inner join and initializing result to {sha1: None for sha1 in contents}.

555–560

Even better: you can initialize it to dict.fromkeys(contents)!

  • content_get_metadata: Change api to return Dict[id, List[Dict]]
swh/storage/storage.py
555–560

Sorry, i had to change the type to something more appropriate.
I think your first proposal still stands though, will try to adapt it.

swh/storage/storage.py
555–560

yeah, you can use {sha1: [] for sha1 in contents} instead

  • content_get_metadata: Change api to return Dict[bytes, List[Dict]]
swh/storage/storage.py
555–560

Yes, it works! thanks.

Sorry to not say it earlier, but you're missing a test where there are two contents with the same sha1 (which will show you there's a bug in the in-mem impl)

Sorry to not say it earlier, but you're missing a test where there are two contents with the same sha1 (which will show you there's a bug in the in-mem impl)

don't be
that's great ;)

I'm not sure how to properly test this for postgres though (i saw a similar scenario which raises a hashcollision for that case i think)...

swh/storage/in_memory.py
366–367

This can actually be fixed in that diff now as @vlorentz mentioned.
In progress ;)

ardumont retitled this revision from content_get_metadata: Adapt endpoint to return Dict[bytes, Union[None, Dict]] to content_get_metadata: Adapt endpoint to return Dict[bytes, List[Dict]].EditedJan 14 2020, 10:52 AM

I'm not sure how to properly test this for postgres though (i saw a similar scenario which raises a hashcollision for that case i think)...

Yes, we cannot test this for postgres.
There is a unicity constraint on the sha1.
(So the list should always be of length 1 for new added contents)

The content_add_metadata from the in-memory implementation should behave the same.
That means the fixme we are talking about in the in-memory implementation should go away altogether.

Fix in-memory implementation's fixme

This revision is now accepted and ready to land.Jan 14 2020, 11:57 AM