Page MenuHomeSoftware Heritage

Get skipped content that are missing data
ClosedPublic

Authored by twitu on Jul 6 2019, 9:25 AM.

Details

Summary

In reference to T1633

Similar to db call skipped_content_missing, the
function checks for content in memory storage
of skipped content

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

twitu created this revision.Jul 6 2019, 9:25 AM
twitu added a comment.Jul 6 2019, 9:30 AM

This logic is similar to one being used in db.skipped_content_missing. However the current implementation in_memory._content_add does not populate _skipped_contents and _skipped_content_indexes.

I think there are two ways to do it:

  • I can modify _content_add however I am not sure how to check decide whether content should be skipped. Is it by checking the algorithm used for hash or the length of content, if so what is the limit?
  • I can add a new function that explicitly only adds skipped contents.
vlorentz added a subscriber: vlorentz.EditedJul 10 2019, 2:51 PM

It should be implemented to behave like in the postgresql storage (storage.py and db.py), so you should change _content_add, and use the same algorithm.

twitu added a comment.EditedJul 13 2019, 6:25 AM

I have a concern here, storage.py line 120. The function self.content_missing can throw an exception in case of a hash collision. Shouldn't line 120 be in a try except block to catch that error and ignore that particular content?

Secondly, I don't fully understand what it means for content to be hidden or absent when can this happen?

twitu added a comment.Jul 13 2019, 8:43 AM

In db.py line 128, the query does not compare blake2s256 despite content_hash_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256']. Does this mean that skipped content will never be hashed with blake2s256?

twitu added a comment.EditedJul 13 2019, 9:15 AM

I did not find any mechanism in db.py that is actually storing skipped_content. db.py line 51, is passed without implementation.

For in memory content_add, I am adding skipped_content, similar to how regular content is being added.

twitu updated this revision to Diff 5816.Jul 13 2019, 9:21 AM
  • Modify in_memory content_add to add skipped_content
twitu updated this revision to Diff 5817.Jul 13 2019, 9:21 AM
  • Remove dependency
twitu updated this revision to Diff 5853.Jul 16 2019, 6:51 PM
  • Use all hashes in a content
twitu updated this revision to Diff 5855.Jul 16 2019, 8:22 PM
  • Add break to prevent multiple yields
vlorentz requested changes to this revision.Jul 18 2019, 11:16 AM

Thanks! As I mentioned on IRC, you should un-skip test_skipped_content_add in swh/storage/tests/test_in_memory.py, otherwise all this new code does not get tested at all.

This revision now requires changes to proceed.Jul 18 2019, 11:16 AM
twitu updated this revision to Diff 5913.Jul 19 2019, 5:20 PM
  • Change index storage mechanism
twitu updated this revision to Diff 5914.Jul 19 2019, 5:30 PM

Rebase and update

twitu updated this revision to Diff 5915.Jul 19 2019, 5:37 PM

Rebase on master

vlorentz added inline comments.Jul 19 2019, 5:41 PM
swh/storage/tests/test_in_memory.py
39–40

You must also remove this function, or the test is still replaced by empty code.

vlorentz requested changes to this revision.Jul 19 2019, 5:41 PM
This revision now requires changes to proceed.Jul 19 2019, 5:41 PM
twitu updated this revision to Diff 5918.Jul 21 2019, 10:18 PM

Fixed skipped_content counter bug

twitu marked an inline comment as done.Jul 21 2019, 10:19 PM
vlorentz requested changes to this revision.Jul 22 2019, 4:30 PM
vlorentz added inline comments.
swh/storage/in_memory.py
112–113

You can remove these three affectations

116–117

Isn't status mandatory?

118–119

Hmm... why wasn't this needed before?

120–125

You don't need content_by_status as a temporary variable, just fill content_with_data and content_without_data directly in the loop.

127–132

Nitpick: I would prefer the function names to be named _content_add_absent and _content_add_present, so their relation with content_add/_content_add is clearer; and it's not confusing wrt the with_data argument.

This revision now requires changes to proceed.Jul 22 2019, 4:30 PM
twitu marked 3 inline comments as done.Jul 22 2019, 4:55 PM
twitu added inline comments.
swh/storage/in_memory.py
116–117

I have used the logic in storage.py line 110. If status indeed mandatory, I can simplify the code.

127–132

These names look much better.

twitu updated this revision to Diff 5922.Jul 22 2019, 4:55 PM
twitu marked an inline comment as done.
  • Refactoring
vlorentz accepted this revision.Jul 22 2019, 4:57 PM
This revision is now accepted and ready to land.Jul 22 2019, 4:57 PM
vlorentz requested changes to this revision.Jul 22 2019, 4:58 PM
This revision now requires changes to proceed.Jul 22 2019, 4:58 PM

One last thing: could you squash your commits, and improve the commit message (eg. "add support for skipped content in the in-memory storage")

twitu updated this revision to Diff 5923.Jul 22 2019, 5:08 PM

Squash and change commit message

vlorentz accepted this revision.Jul 22 2019, 5:09 PM
This revision is now accepted and ready to land.Jul 22 2019, 5:09 PM
This revision was landed with ongoing or failed builds.Jul 22 2019, 5:11 PM
This revision was automatically updated to reflect the committed changes.