Page MenuHomeSoftware Heritage

Get skipped content that are missing data
ClosedPublic

Authored by twitu on Jul 6 2019, 9:25 AM.

Details

Summary

In reference to T1633

Similar to db call skipped_content_missing, the
function checks for content in memory storage
of skipped content

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 6958
Build 9781: tox-on-jenkinsJenkins
Build 9780: arc lint + arc unit

Event Timeline

This logic is similar to one being used in db.skipped_content_missing. However the current implementation in_memory._content_add does not populate _skipped_contents and _skipped_content_indexes.

I think there are two ways to do it:

  • I can modify _content_add however I am not sure how to check decide whether content should be skipped. Is it by checking the algorithm used for hash or the length of content, if so what is the limit?
  • I can add a new function that explicitly only adds skipped contents.

It should be implemented to behave like in the postgresql storage (storage.py and db.py), so you should change _content_add, and use the same algorithm.

I have a concern here, storage.py line 120. The function self.content_missing can throw an exception in case of a hash collision. Shouldn't line 120 be in a try except block to catch that error and ignore that particular content?

Secondly, I don't fully understand what it means for content to be hidden or absent when can this happen?

In db.py line 128, the query does not compare blake2s256 despite content_hash_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256']. Does this mean that skipped content will never be hashed with blake2s256?

I did not find any mechanism in db.py that is actually storing skipped_content. db.py line 51, is passed without implementation.

For in memory content_add, I am adding skipped_content, similar to how regular content is being added.

  • Modify in_memory content_add to add skipped_content
  • Use all hashes in a content
  • Add break to prevent multiple yields

Thanks! As I mentioned on IRC, you should un-skip test_skipped_content_add in swh/storage/tests/test_in_memory.py, otherwise all this new code does not get tested at all.

This revision now requires changes to proceed.Jul 18 2019, 11:16 AM
  • Change index storage mechanism
swh/storage/tests/test_in_memory.py
39–40

You must also remove this function, or the test is still replaced by empty code.

This revision now requires changes to proceed.Jul 19 2019, 5:41 PM

Fixed skipped_content counter bug

vlorentz added inline comments.
swh/storage/in_memory.py
112–113

You can remove these three affectations

115–116

Isn't status mandatory?

117–118

Hmm... why wasn't this needed before?

119–124

You don't need content_by_status as a temporary variable, just fill content_with_data and content_without_data directly in the loop.

126–131

Nitpick: I would prefer the function names to be named _content_add_absent and _content_add_present, so their relation with content_add/_content_add is clearer; and it's not confusing wrt the with_data argument.

This revision now requires changes to proceed.Jul 22 2019, 4:30 PM
twitu added inline comments.
swh/storage/in_memory.py
115–116

I have used the logic in storage.py line 110. If status indeed mandatory, I can simplify the code.

126–131

These names look much better.

twitu marked an inline comment as done.
  • Refactoring
This revision is now accepted and ready to land.Jul 22 2019, 4:57 PM
This revision now requires changes to proceed.Jul 22 2019, 4:58 PM

One last thing: could you squash your commits, and improve the commit message (eg. "add support for skipped content in the in-memory storage")

Squash and change commit message

This revision is now accepted and ready to land.Jul 22 2019, 5:09 PM
This revision was landed with ongoing or failed builds.Jul 22 2019, 5:11 PM
This revision was automatically updated to reflect the committed changes.