Details
- Reviewers
zack - Commits
- rDCIDX3b31e9cf99cc: swh.indexer.producer: Adapt producer to optionally send dict
rDCIDX67be61085529: d/control: Bump dependency to latest swh-model
rDCIDX4623c67ce5d6: swh.indexer.rehash: Respect semantics regarding hashes (re)computation
rDCIDXdc451fcc5415: swh.indexer.tasks: Rename queue name to swh_indexer_rehash
rDCIDX5ee575fa6948: swh.indexer.producer: Abstract away scheduler's internal api
rDCIDXd3a9b21062a1: swh.indexer.rehash: Rename module to rehash + review adaptations
rDCIDX51225e12cab2: swh.indexer.tasks: Open task to recompute checksums
rDCIDX80eafec10b8f: swh.indexer.recompute: Add storage primary_key option for corruption check
rDCIDX49b05e8688d9: swh.indexer.recompute: class to add/update hash checksums in storage
Diff Detail
- Repository
- rDCIDX Metadata indexer
- Branch
- master
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 797 Build 1068: arc lint + arc unit
Event Timeline
Add a storage primary_key option to generically check for corruption
This no longer fetches metadata from contents. The contents passed as
parameter should be self-contained and in adequation with the
'primary_key' configuration option. As of now, that primary key is
'sha1'.
Unknown or corrupted contents are skipped.
Add some other improvments:
- docstring clarification
- Add a batch_size to retrieve contents' blobs
- Add another batch_size for the update contents operations
- Align option names with _ as per usual conventions
Just went through this, and commented here and there. Aside from a few minor things, two major points:
- whether this code should be in charge of detect corruptions (my take: no)
- the semantics of the recompute parameter
swh/indexer/recompute.py | ||
---|---|---|
13 ↗ | (On Diff #579) | The name of the Python module / filename (recompute.py) isn't great, maybe rehash.py would be better? The name of the class OTOH is fine, just adding this comment here as I couldn't highlight directly the filename :) |
14 ↗ | (On Diff #579) | s/blob/content/, as that the official name in our glossary |
16 ↗ | (On Diff #579) | uh? not sure what this means |
23–25 ↗ | (On Diff #579) | I don't think "new" is right here, as they're not necessarily new, if recompute (next option) is true, they won't be. Also, the docstring here should describe the syntax to specify checksums that need an optional checksum length, i.e., ALGO:LENGTH |
27 ↗ | (On Diff #579) | Similarly, calling this "recompute" should be enough, IMO. Or maybe "recompute_checksums". |
30–32 ↗ | (On Diff #579) | Not sure I understand why. Surely we can update primary keys as well as other columns? |
45–46 ↗ | (On Diff #579) | This argument should not have a default value, as there is really no sane default for this. |
60 ↗ | (On Diff #579) | I guess this should be something like indexer/rehash instead, right? |
98–99 ↗ | (On Diff #579) | AFAICT this will not work for hashutil.hashdata(), as it doesn't currently support specifying the length of the checksums being used. |
101–109 ↗ | (On Diff #579) | I'm unconvinced this checks belong to this class. We have a component responsible for integrity/bitrot checking: it's part of the archiver. Having that check in multiple places doesn't sound like a good idea. In my view, this class is more low level, it takes a content blob, compute a set of its checksums, and update the DB. Full stop. |
127–129 ↗ | (On Diff #579) | The semantics noted down in T692 was different than this.
(note that this *never* recomputes checksums that are not given in checksum_algorithms) |
swh/indexer/recompute.py | ||
---|---|---|
13 ↗ | (On Diff #579) |
Clearly yes. |
16 ↗ | (On Diff #579) | It means that if we detect corrupted contents, that is their composite key based on the content's raw data no longer match with the actual content, we won't attempt anything on them and skip them altogether. It's not that brick's concern to deal with corrupted contents, it's the checker's concern. |
30–32 ↗ | (On Diff #579) |
sure
Indeed. Well, it's bad in my mind, since we use that to compute the hash of the directory holding the content.
It was a belt and whistle security... Do you mean it should be done in the lower layer swh-storage? |
45–46 ↗ | (On Diff #579) | Yes, i should have left this in comments (I think i did and then got it back in). |
60 ↗ | (On Diff #579) | yes |
65 ↗ | (On Diff #579) | typo here and it's fixed locally. |
98–99 ↗ | (On Diff #579) |
Yes. |
101–109 ↗ | (On Diff #579) |
I don't see it as a check to fix corrupted data.
The checker (which is not in production as far as i remember).
I agree with the duplication being bad.
Ok. |
127–129 ↗ | (On Diff #579) | I'll adapt accordingly. |
Adapt according to review
- Use boolean as default value
- Fix wrong configuration key
- Adapt according to review
Rebase
- Recompute class to trigger an add/update hash checksums in storage
- Open task to recompute checksums
- Add a storage primary_key option to generically check for corruption
- Use boolean as default value
- Fix wrong configuration key
- refactoring: Adapt according to review
- Respect semantics regarding hashes (re)computation
- swh.indexer.tasks: Rename queue name to swh_indexer_rehash
Adaptation to actually make it work
- Respect semantics regarding hashes (re)computation
- swh.indexer.tasks: Rename queue name to swh_indexer_rehash
This is now production ready (i readapted the initial diff description).
Manual testing modus operandi:
- injected a repository using the git-loader.
- altered the db schema to add a new column alter table content add column blake2b512 bytea null;
- computing the list of sha1s from the schema psql service=swh-dev -c "copy (select sha1 from content) to stdin" | sed -e 's/^\\\\x//g' > sha1s
- Checking the number of contents before batch update:
select count(*) from content; 3991 select count(*) from content where blake2b512 is null; 3991
- feeding those sha1 for updates
Configuration file ~/.config/swh/storage/rehash.yml:
storage: cls: remote args: url: http://localhost:5002/ compute_checksums: - blake2b512 recompute_checksums: false batch_size_retrieve_content: 10 batch_size_update: 100
And trigger the run:
cat sha1s | python3 -m swh.indexer.producer --task-name rehash --dict-with-key sha1
- Check everything has been updated when done:
softwareheritage-dev=# select count(*) from content where blake2b512 is null; count ------- 0 (1 row) softwareheritage-dev=# select count(*) from content where blake2b512 is not null; count ------- 3991 (1 row)
Conclusion: working \o/
Interactive rebase, squash, improve commit messages
Nothing changes in the code.
- swh.indexer.recompute: class to add/update hash checksums in storage
- swh.indexer.tasks: Open task to recompute checksums
- swh.indexer.recompute: Add storage primary_key option for corruption check
- swh.indexer.rehash: Rename module to rehash + review adaptations
- swh.indexer.rehash: Respect semantics regarding hashes (re)computation
- swh.indexer.tasks: Rename queue name to swh_indexer_rehash
- swh.indexer.producer: Abstract away scheduler's internal api
- swh.indexer.producer: Adapt producer to optionally send dict
- d/control: Bump dependency to latest swh-model