Details

Commits

rDCIDX3b31e9cf99cc: swh.indexer.producer: Adapt producer to optionally send dict
rDCIDX67be61085529: d/control: Bump dependency to latest swh-model
rDCIDX4623c67ce5d6: swh.indexer.rehash: Respect semantics regarding hashes (re)computation
rDCIDXdc451fcc5415: swh.indexer.tasks: Rename queue name to swh_indexer_rehash
rDCIDX5ee575fa6948: swh.indexer.producer: Abstract away scheduler's internal api
rDCIDXd3a9b21062a1: swh.indexer.rehash: Rename module to rehash + review adaptations
rDCIDX51225e12cab2: swh.indexer.tasks: Open task to recompute checksums
rDCIDX80eafec10b8f: swh.indexer.recompute: Add storage primary_key option for corruption check
rDCIDX49b05e8688d9: swh.indexer.recompute: class to add/update hash checksums in storage

Summary

Production ready.

Related T692
Depends D192

Diff Detail

Repository

rDCIDX Metadata indexer

Branch

master

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 799
Build 1070: arc lint + arc unit

Event Timeline

ardumont created this revision. Mar 3 2017, 11:48 AM

ardumont added inline comments. Mar 3 2017, 12:44 PM

swh/indexer/recompute.py
45 ↗	(On Diff #577)	batch_size
83 ↗	(On Diff #577)	bug: not 'id' but 'sha1'
91 ↗	(On Diff #577)	At the moment, sha1 is the primary key. This won't always be the case, so use, for example, an option to define the actual primary key and check against that instead.

Add a storage primary_key option to generically check for corruption

This no longer fetches metadata from contents. The contents passed as
parameter should be self-contained and consistent with the
'primary_key' configuration option. As of now, that primary key is
'sha1'.

Unknown or corrupted contents are skipped.

Add some other improvements:

docstring clarification
Add a batch_size to retrieve contents' blobs
Add another batch_size for the update contents operations
Align option names with _ as per usual conventions
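
As a rough illustration of the two batch sizes mentioned above, the sketch below splits an iterable of content identifiers into fixed-size batches. The helper name `batched` and its signature are assumptions made for this example; the actual code in swh/indexer/recompute.py may be organized differently.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `batch_size` items.

    Hypothetical helper: the same function could serve both the
    "retrieve contents" and the "update contents" batch sizes.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With such a helper, retrieval and update can each loop over their own batch size without loading every content at once.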

ardumont marked 3 inline comments as done. Mar 3 2017, 2:06 PM

Amend docstrings

Just went through this, and commented here and there. Aside from a few minor things, two major points:

whether this code should be in charge of detecting corruption (my take: no)
the semantics of the recompute parameter

swh/indexer/recompute.py
13 ↗	(On Diff #579)	The name of the Python module / filename (recompute.py) isn't great; maybe rehash.py would be better? Unless you think it might be used to recompute other stuff, but I can't see many synergies among "stuff that recomputes stuff" in general. The name of the class OTOH is fine, just adding this comment here as I couldn't highlight the filename directly :)
14 ↗	(On Diff #579)	s/blob/content/, as that is the official name in our glossary
16 ↗	(On Diff #579)	uh? not sure what this means
23–25 ↗	(On Diff #579)	I don't think "new" is right here, as they're not necessarily new: if recompute (next option) is true, they won't be. So I propose calling this parameter simply "checksums". In the context of this class it should be specific enough. Also, the docstring here should describe the syntax for specifying checksums that take an optional checksum length, i.e., ALGO:LENGTH
27 ↗	(On Diff #579)	Similarly, calling this "recompute" should be enough, IMO. Or maybe "recompute_checksums".
30–32 ↗	(On Diff #579)	Not sure I understand why. Surely we can update primary keys as well as other columns? Whether that's a good idea or not is a different matter :-), but I don't think inhibiting the recomputation of sha1 should be a hard-coded detail of this class.
45–46 ↗	(On Diff #579)	This argument should not have a default value, as there is really no sane default for this. Just leave it empty, and make the code barf if nothing else is provided when invoking the code.
60 ↗	(On Diff #579)	I guess this should be something like `indexer/rehash` instead, right?
98–99 ↗	(On Diff #579)	AFAICT this will not work for hashutil.hashdata(), as it doesn't currently support specifying the length of the checksums being used. It should be extended to do so.
101–109 ↗	(On Diff #579)	I'm unconvinced these checks belong to this class. We have a component responsible for integrity/bitrot checking: it's part of the archiver. Having that check in multiple places doesn't sound like a good idea. In my view, this class is more low level: it takes a content blob, computes a set of its checksums, and updates the DB. Full stop. In fact, leaving it as low level as that might allow using it to correct wrong primary keys (if they ever happen), whereas with this check it will be impossible to do so.
127–129 ↗	(On Diff #579)	The semantics noted down in T692 were different from this. With the above code, you're saying: if recompute is true, recompute all known checksums + the new ones requested. What I had in mind was:
among the checksums specified in `checksum_algorithms`, look at which ones are already defined in the DB for the requested content
by default (recompute=False), do not recompute the checksums that were already known
when recompute=True, recompute the already known ones as well
(note that this never recomputes checksums that are not given in checksum_algorithms)
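
The semantics described in the last comment can be sketched as a small selection function. This is only an illustration under stated assumptions: the function name `checksums_to_compute` and the shape of `known` (a mapping from algorithm name to the value stored in the DB, or None when missing) are hypothetical and not taken from the diff.

```python
from typing import Iterable, List, Mapping, Optional


def checksums_to_compute(
    requested: Iterable[str],
    known: Mapping[str, Optional[bytes]],
    recompute: bool = False,
) -> List[str]:
    """Select which of the requested checksum algorithms to compute.

    - Only algorithms listed in `requested` are ever selected.
    - With recompute=False, algorithms already known in the DB are skipped.
    - With recompute=True, already known ones are selected as well.
    """
    selected = []
    for algo in requested:
        already_known = known.get(algo) is not None
        if recompute or not already_known:
            selected.append(algo)
    return selected
```

Note that a checksum present in the DB but absent from `requested` (sha1 in a typical run) is never selected, whatever the value of `recompute`.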

This revision now requires changes to proceed. Mar 13 2017, 3:40 PM

ardumont marked 6 inline comments as done. Mar 13 2017, 4:52 PM

ardumont added inline comments.

swh/indexer/recompute.py
13 ↗	(On Diff #579)	"The name of the Python module / filename (recompute.py) isn't great, maybe rehash.py would be better?" Clearly yes.
16 ↗	(On Diff #579)	It means that if we detect corrupted contents, that is, their composite key based on the content's raw data no longer matches the actual content, we won't attempt anything on them and skip them altogether. It's not this brick's concern to deal with corrupted contents; it's the checker's concern.
30–32 ↗	(On Diff #579)	"Not sure I understand why. Surely we can update primary keys as well as other columns?" Sure. "Whether that's a good idea or not is a different matter :-)" Indeed. Well, it's a bad idea in my mind, since we use that key to compute the hash of the directory holding the content. Changing it would break the idempotence of that directory's hash computation. "but I don't think inhibiting to recompute sha1 should be a hard-coded detail of this class." It was a belt-and-braces safety measure... Do you mean it should be done in the lower layer, swh-storage?
45–46 ↗	(On Diff #579)	Yes, I should have left this in comments (I think I did and then put it back in). Anyway, changing it.
60 ↗	(On Diff #579)	yes
65 ↗	(On Diff #579)	typo here and it's fixed locally.
98–99 ↗	(On Diff #579)	"AFAICT this will not work for hashutil.hashdata(), as it doesn't currently support specifying the length of the checksums being used. It should be extended to do so." Yes.
101–109 ↗	(On Diff #579)	"I'm unconvinced this checks belong to this class." I don't see it as a check to fix corrupted data. I see it as a check to avoid messing with the data in the db, as explained in the prior comment. "We have a component responsible for integrity/bitrot checking: it's part of the archiver." The checker (which is not in production, as far as I remember). "Having that check in multiple places doesn't sound like a good idea." I agree that the duplication is bad. "In my view, this class is more low level, it takes a content blob, compute a set of its checksums, and update the DB. Full stop. In fact, leaving it as low level as that might enable to use that to correct wrong primary keys (if they ever happen); whereas with this check it will be impossible to do so." Ok. This means the bricks we use must run in a given order: we must ensure that the checker (or archiver) runs prior to this.
127–129 ↗	(On Diff #579)	I'll adapt accordingly.
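
For readers unfamiliar with the check under discussion, here is a minimal sketch of what "the primary key no longer matches the content" could mean in code. The dict shape (`{'sha1': ..., 'data': ...}`) and the function name `matches_primary_key` are assumptions for illustration only; the review concludes that this kind of check belongs to the checker/archiver rather than to the rehash class.

```python
import hashlib


def matches_primary_key(content: dict, primary_key: str = "sha1") -> bool:
    """Return True if the stored primary key equals the hash of the raw data.

    Hypothetical content shape: {primary_key: <digest bytes>, 'data': <raw bytes>}.
    `primary_key` must name an algorithm known to hashlib (e.g. 'sha1').
    """
    expected = content[primary_key]
    actual = hashlib.new(primary_key, content["data"]).digest()
    return actual == expected
```

A caller following the original diff's behavior would skip any content for which this returns False instead of updating it.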

Adapt according to review

Use boolean as default value
Fix wrong configuration key
Adapt according to review

Respect semantics regarding hashes (re)computation

Harbormaster completed remote builds in B773: Diff 605. Mar 14 2017, 1:52 PM

Respect semantics regarding hashes (re)computation

Harbormaster completed remote builds in B774: Diff 606. Mar 14 2017, 1:55 PM

ardumont edited the summary of this revision. Mar 14 2017, 1:56 PM

ardumont marked 9 inline comments as done. Mar 14 2017, 4:06 PM

ardumont added inline comments.

swh/indexer/recompute.py
98–99 ↗	(On Diff #579)	"AFAICT this will not work for hashutil.hashdata(), as it doesn't currently support specifying the length of the checksums being used. It should be extended to do so." Yes, see D192.
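
To make the ALGO:LENGTH idea concrete, the following sketch parses such a specification and computes digests with Python's standard hashlib only. It does not reproduce the swh hashutil API or the changes made in D192; the function names `new_hasher` and `compute_checksums`, and the choice of expressing LENGTH in bits, are assumptions for this example.

```python
import hashlib
from typing import Dict, Iterable


def new_hasher(spec: str):
    """Build a hashlib object from 'ALGO' or 'ALGO:LENGTH' (LENGTH in bits).

    Only blake2b/blake2s accept an explicit length in this sketch, since
    hashlib lets those two choose their digest size.
    """
    algo, _, length = spec.partition(":")
    if not length:
        return hashlib.new(algo)
    if algo not in ("blake2b", "blake2s"):
        raise ValueError("length is only supported for blake2b/blake2s: %r" % spec)
    bits = int(length)
    if bits % 8:
        raise ValueError("length must be a multiple of 8 bits: %r" % spec)
    constructor = getattr(hashlib, algo)
    return constructor(digest_size=bits // 8)


def compute_checksums(data: bytes, specs: Iterable[str]) -> Dict[str, bytes]:
    """Compute one digest per specification for the given raw content."""
    result = {}
    for spec in specs:
        hasher = new_hasher(spec)
        hasher.update(data)
        result[spec] = hasher.digest()
    return result
```

For instance, `compute_checksums(data, ["sha1", "blake2b:512"])` would return a 20-byte sha1 digest and a 64-byte blake2b digest for the same content.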

ardumont mentioned this in T700: Clean up redundant code swh.core.hashutil/swh.model.hashutil. Mar 17 2017, 10:44 AM

Rebase

Recompute class to trigger an add/update hash checksums in storage
Open task to recompute checksums
Add a storage primary_key option to generically check for corruption
Use boolean as default value
Fix wrong configuration key
refactoring: Adapt according to review
Respect semantics regarding hashes (re)computation
swh.indexer.tasks: Rename queue name to swh_indexer_rehash

ardumont edited the summary of this revision. Mar 17 2017, 11:38 AM

Adaptation to actually make it work

Respect semantics regarding hashes (re)computation
swh.indexer.tasks: Rename queue name to swh_indexer_rehash

Adapt indexer producer to be able to optionally send dict

This is now production ready (I updated the initial diff description accordingly).

Manual testing modus operandi:

Injected a repository using the git-loader.

Altered the db schema to add a new column:

alter table content add column blake2b512 bytea null;

Computed the list of sha1s from the db:

psql service=swh-dev -c "copy (select sha1 from content) to stdout" | sed -e 's/^\\\\x//g' > sha1s

Checking the number of contents before batch update:

select count(*) from content;
3991
select count(*) from content where blake2b512 is null;
3991

Feeding those sha1s for updates.

Configuration file ~/.config/swh/storage/rehash.yml:

storage:
  cls: remote
  args:
    url: http://localhost:5002/
compute_checksums:
  - blake2b512
recompute_checksums: false
batch_size_retrieve_content: 10
batch_size_update: 100

And trigger the run:

cat sha1s | python3 -m swh.indexer.producer --task-name rehash --dict-with-key sha1

Check everything has been updated when done:

softwareheritage-dev=# select count(*) from content where blake2b512 is null;
 count
-------
     0
(1 row)
softwareheritage-dev=# select count(*) from content where blake2b512 is not null;
 count
-------
  3991
(1 row)

Conclusion: working \o/
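
The producer step in the procedure above reads hexadecimal sha1s from standard input and, with --dict-with-key sha1, sends dicts instead of bare identifiers. A rough sketch of that transformation follows; the function name `to_task_arguments` and the conversion from hex to bytes are assumptions for illustration, not the actual swh.indexer.producer implementation.

```python
from typing import Dict, Iterable, Iterator, Optional, Union


def to_task_arguments(
    lines: Iterable[str], dict_with_key: Optional[str] = None
) -> Iterator[Union[bytes, Dict[str, bytes]]]:
    """Turn lines of hex identifiers into task arguments.

    Without `dict_with_key`, yield the raw identifier bytes; with it,
    yield {dict_with_key: identifier}. Blank lines are ignored.
    """
    for line in lines:
        hex_id = line.strip()
        if not hex_id:
            continue
        raw = bytes.fromhex(hex_id)
        yield {dict_with_key: raw} if dict_with_key else raw
```

In a command-line wrapper, `lines` would typically be `sys.stdin`, and the yielded values would then be grouped and handed to the scheduler.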

ardumont edited the summary of this revision. Mar 17 2017, 12:02 PM

Remove print statement

d/control: Bump dependency to latest swh-model

Interactive rebase, squash, improve commit messages

Nothing changes in the code.

swh.indexer.recompute: class to add/update hash checksums in storage
swh.indexer.tasks: Open task to recompute checksums
swh.indexer.recompute: Add storage primary_key option for corruption check
swh.indexer.rehash: Rename module to rehash + review adaptations
swh.indexer.rehash: Respect semantics regarding hashes (re)computation
swh.indexer.tasks: Rename queue name to swh_indexer_rehash
swh.indexer.producer: Abstract away scheduler's internal api
swh.indexer.producer: Adapt producer to optionally send dict
d/control: Bump dependency to latest swh-model

Closed by commit rDCIDX49b05e8688d9: swh.indexer.recompute: class to add/update hash checksums in storage (authored by ardumont). Mar 21 2017, 10:50 AM

This revision was automatically updated to reflect the committed changes.

ardumont mentioned this in rDCIDX67be61085529: d/control: Bump dependency to latest swh-model.

Recompute class to trigger an add/update hash checksums in storage
Closed, Public

Revision Contents
Changeset List

Diff 626

swh/indexer/__init__.py

swh/indexer/producer.py

swh/indexer/rehash.py

swh/indexer/tasks.py