Details

Reviewers

Group Reviewers

Commits

rDSTOCf71f5318a1eb: Add support for skipped content in in-memory storage
rDSTOf71f5318a1eb: Add support for skipped content in in-memory storage

Summary

In reference to T1633

Similar to the db call skipped_content_missing, this
function checks the in-memory storage for
skipped content.

Diff Detail

Repository

rDSTO Storage manager

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

twitu created this revision. Jul 6 2019, 9:25 AM

Herald added a reviewer: Reviewers. Jul 6 2019, 9:25 AM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/543/ for more details.

Harbormaster completed remote builds in B6706: Diff 5685. Jul 6 2019, 9:28 AM

This logic is similar to the one used in db.skipped_content_missing. However, the current implementation of in_memory._content_add does not populate _skipped_contents and _skipped_content_indexes.

I think there are two ways to do it:

I can modify _content_add; however, I am not sure how to decide whether content should be skipped. Is it by checking the algorithm used for the hash or the length of the content? If so, what is the limit?
I can add a new function that explicitly only adds skipped contents.

It should be implemented to behave like in the postgresql storage (storage.py and db.py), so you should change _content_add, and use the same algorithm.

I have a concern about storage.py line 120. The function self.content_missing can throw an exception in case of a hash collision. Shouldn't line 120 be in a try/except block to catch that error and ignore that particular content?
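The try/except idea raised above could look roughly like the sketch below. Everything in it is a hypothetical stand-in: `HashCollision`, `content_missing` and `filter_addable` are illustrative names, not the actual storage.py code, and whether silently ignoring a colliding content is the right behaviour is exactly the open question.

```python
class HashCollision(Exception):
    """Hypothetical error: same sha1 as a stored content, different sha256."""


def content_missing(content, known):
    # Stand-in for the real lookup; `known` maps sha1 -> sha256.
    stored = known.get(content['sha1'])
    if stored is None:
        return True
    if stored != content['sha256']:
        raise HashCollision(content['sha1'])
    return False


def filter_addable(contents, known):
    """Return the contents that are missing, ignoring those that collide."""
    addable = []
    for content in contents:
        try:
            if content_missing(content, known):
                addable.append(content)
        except HashCollision:
            # Drop only the offending content instead of failing the batch.
            continue
    return addable
```

A possible downside of this design is that a collision goes unnoticed by the caller, so logging it before `continue` may be worth considering.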

Secondly, I don't fully understand what it means for content to be hidden or absent. When can this happen?

In db.py line 128, the query does not compare blake2s256 despite content_hash_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256']. Does this mean that skipped content will never be hashed with blake2s256?

I did not find any mechanism in db.py that actually stores skipped_content. At db.py line 51, the body is just pass, with no implementation.

For the in-memory content_add, I am adding skipped_content in the same way regular content is added.

Modify in_memory content_add to add skipped_content

Remove dependency

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/556/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/556/console

Harbormaster failed remote builds in B6856: Diff 5816! Jul 13 2019, 9:24 AM

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/557/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/557/console

Harbormaster failed remote builds in B6857: Diff 5817! Jul 13 2019, 9:28 AM

Use all hashes in a content

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/559/ for more details.

Harbormaster completed remote builds in B6889: Diff 5853. Jul 16 2019, 6:55 PM

Add break to prevent multiple yields
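A minimal sketch of why the break matters, assuming the lookup walks one index per hash algorithm. The function signature and the `indexes` layout (algorithm name mapped to a set of known hashes) are assumptions for illustration, not the code in this diff.

```python
HASH_KEYS = ('sha1', 'sha1_git', 'sha256', 'blake2s256')


def skipped_content_missing(contents, indexes):
    """Yield each content at most once if any of its hashes is unknown.

    `indexes` is assumed to map an algorithm name to the set of hashes
    already stored as skipped content for that algorithm.
    """
    for content in contents:
        for algo in HASH_KEYS:
            known = indexes.get(algo, set())
            if algo in content and content[algo] not in known:
                yield content
                # Stop at the first unknown hash; without this break the
                # same content would be yielded once per unknown hash.
                break
```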

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/560/ for more details.

Harbormaster completed remote builds in B6891: Diff 5855. Jul 16 2019, 8:26 PM

Thanks! As I mentioned on IRC, you should un-skip test_skipped_content_add in swh/storage/tests/test_in_memory.py, otherwise all this new code does not get tested at all.

This revision now requires changes to proceed. Jul 18 2019, 11:16 AM

Change index storage mechanism

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/568/ for more details.

Harbormaster completed remote builds in B6949: Diff 5913. Jul 19 2019, 5:24 PM

Rebase and update

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/569/ for more details.

Harbormaster completed remote builds in B6950: Diff 5914. Jul 19 2019, 5:34 PM

Rebase on master

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/570/ for more details.

Harbormaster completed remote builds in B6951: Diff 5915. Jul 19 2019, 5:41 PM

vlorentz added inline comments. Jul 19 2019, 5:41 PM

swh/storage/tests/test_in_memory.py
39–40	You must also remove this function, or the test is still replaced by empty code.
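To illustrate the point with a self-contained sketch: when a subclass keeps an empty override of an inherited test, the runner reports a pass but the shared test body never executes. The class names and the `executed` list below are hypothetical; they only mimic the layout of a shared test mixin reused by test_in_memory.py.

```python
import unittest

executed = []  # records which classes actually ran the shared test body


class CommonTests:
    """Stand-in for the shared storage tests."""

    def test_skipped_content_add(self):
        executed.append(type(self).__name__)


class OverriddenTests(CommonTests, unittest.TestCase):
    def test_skipped_content_add(self):
        # Empty override: this replaces the inherited test with a no-op.
        pass


class InheritedTests(CommonTests, unittest.TestCase):
    # No override: the shared test is inherited and really runs.
    pass


def run_case(case):
    """Run every test of one TestCase class and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Running both classes with `run_case` reports one successful test each, yet only `InheritedTests` ends up in `executed`.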

vlorentz requested changes to this revision. Jul 19 2019, 5:41 PM

This revision now requires changes to proceed. Jul 19 2019, 5:41 PM

Fixed skipped_content counter bug

twitu marked an inline comment as done. Jul 21 2019, 10:19 PM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/571/ for more details.

Harbormaster completed remote builds in B6954: Diff 5918. Jul 21 2019, 10:23 PM

vlorentz requested changes to this revision. Jul 22 2019, 4:30 PM

vlorentz added inline comments.

swh/storage/in_memory.py
112–113	You can remove these three assignments
115–116	Isn't `status` mandatory?
117–118	Hmm... why wasn't this needed before?
119–124	You don't need `content_by_status` as a temporary variable, just fill `content_with_data` and `content_without_data` directly in the loop.
126–131	Nitpick: I would prefer the functions to be named `_content_add_absent` and `_content_add_present`, so their relation with `content_add`/`_content_add` is clearer; and it's not confusing wrt the `with_data` argument.
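The suggestion to fill `content_with_data` and `content_without_data` directly could be sketched as follows. This is an illustration under stated assumptions, not the code under review: `split_by_status` is a hypothetical helper, `status` is treated as mandatory, and mapping `'visible'` to contents with data and `'absent'` to contents without data is an assumption about the status values.

```python
def split_by_status(contents):
    """Split contents into those with data and those without.

    Assumes 'visible' contents carry data and 'absent' ones are skipped;
    any other status is rejected here, which is also an assumption.
    """
    content_with_data = []
    content_without_data = []
    for content in contents:
        status = content['status']  # treated as mandatory in this sketch
        if status == 'visible':
            content_with_data.append(content)
        elif status == 'absent':
            content_without_data.append(content)
        else:
            raise ValueError('unexpected status: %r' % (status,))
    return content_with_data, content_without_data
```

The two lists could then be handed to helpers named as suggested in the nitpick, for example `_content_add_present(content_with_data)` and `_content_add_absent(content_without_data)`.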