Page MenuHomeSoftware Heritage

Recompute class to trigger an add/update hash checksums in storage
ClosedPublic

Authored by ardumont on Mar 3 2017, 11:48 AM.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 799
Build 1070: arc lint + arc unit

Event Timeline

swh/indexer/recompute.py
45 ↗(On Diff #577)

batch_size

83 ↗(On Diff #577)

bug: not 'id' but 'sha1'

91 ↗(On Diff #577)

At the moment, sha1 is the primary key.
This won't always be the case so use for example, an option to define the actual primary key and check that instead.

Add a storage primary_key option to generically check for corruption

This no longer fetches metadata from contents. The contents passed as
parameter should be self-contained and in adequation with the
'primary_key' configuration option. As of now, that primary key is
'sha1'.

Unknown or corrupted contents are skipped.

Add some other improvments:

  • docstring clarification
  • Add a batch_size to retrieve contents' blobs
  • Add another batch_size for the update contents operations
  • Align option names with _ as per usual conventions
zack requested changes to this revision.Mar 13 2017, 3:40 PM

Just went through this, and commented here and there. Aside from a few minor things, two major points:

  • whether this code should be in charge of detect corruptions (my take: no)
  • the semantics of the recompute parameter
swh/indexer/recompute.py
13 ↗(On Diff #579)

The name of the Python module / filename (recompute.py) isn't great, maybe rehash.py would be better?
Unless you think it might be used to recompute other stuff, but I can't see many synergies among "stuff that recompute stuff" in general.

The name of the class OTOH is fine, just adding this comment here as I couldn't highlight directly the filename :)

14 ↗(On Diff #579)

s/blob/content/, as that the official name in our glossary

16 ↗(On Diff #579)

uh? not sure what this means

23–25 ↗(On Diff #579)

I don't think "new" is right here, as they're not necessarily new, if recompute (next option) is true, they won't be.
So I proposed to call this parameter simply "checksums". In the content of this class it should be specific enough.

Also, the docstring here should describe the syntax to specify checksums that need an optional checksum length, i.e., ALGO:LENGTH

27 ↗(On Diff #579)

Similarly, calling this "recompute" should be enough, IMO. Or maybe "recompute_checksums".

30–32 ↗(On Diff #579)

Not sure I understand why. Surely we can update primary keys as well as other columns?
Whether that's a good idea or not is a different matter :-), but I don't think inhibiting to recompute sha1 should be an hard-coded detail of this class.

45–46 ↗(On Diff #579)

This argument should not have a default value, as there is really no sane default for this.
Just leave it empty, and make the code barf is nothing else is provided when invoking the code.

60 ↗(On Diff #579)

I guess this should be something like indexer/rehash instead, right?

98–99 ↗(On Diff #579)

AFAICT this will not work for hashutil.hashdata(), as it doesn't currently support specifying the length of the checksums being used.
It should be extended to do so.

101–109 ↗(On Diff #579)

I'm unconvinced this checks belong to this class. We have a component responsible for integrity/bitrot checking: it's part of the archiver. Having that check in multiple places doesn't sound like a good idea. In my view, this class is more low level, it takes a content blob, compute a set of its checksums, and update the DB. Full stop.
In fact, leaving it as low level as that might enable to use that to *correct* wrong primary keys (if they ever happen); whereas with this check it will be impossible to do so.

127–129 ↗(On Diff #579)

The semantics noted down in T692 was different than this.

  • With the above code, you're saying: if recompute is true, recompute all known checksums + the new ones requested
  • What I had in mind was:
    • among the checksums specified in checksum_algorithms, look which ones are already defined in the DB for the requested content
    • by default (recompute=False) do not recompute the checksums that were already known
    • when recompute=True, recompute also the already known ones

(note that this *never* recomputes checksums that are not given in checksum_algorithms)

This revision now requires changes to proceed.Mar 13 2017, 3:40 PM
ardumont added inline comments.
swh/indexer/recompute.py
13 ↗(On Diff #579)

The name of the Python module / filename (recompute.py) isn't great, maybe rehash.py would be better?

Clearly yes.

16 ↗(On Diff #579)

It means that if we detect corrupted contents, that is their composite key based on the content's raw data no longer match with the actual content, we won't attempt anything on them and skip them altogether.

It's not that brick's concern to deal with corrupted contents, it's the checker's concern.

30–32 ↗(On Diff #579)

Not sure I understand why. Surely we can update primary keys as well as other columns?

sure

Whether that's a good idea or not is a different matter :-)

Indeed.

Well, it's bad in my mind, since we use that to compute the hash of the directory holding the content.
Changing it would break the idempotence of that directory's hash computation.

but I don't think inhibiting to recompute sha1 should be an hard-coded detail of this class.

It was a belt and whistle security...

Do you mean it should be done in the lower layer swh-storage?

45–46 ↗(On Diff #579)

Yes, i should have left this in comments (I think i did and then got it back in).
Anyway, changing it.

60 ↗(On Diff #579)

yes

65 ↗(On Diff #579)

typo here and it's fixed locally.

98–99 ↗(On Diff #579)

AFAICT this will not work for hashutil.hashdata(), as it doesn't currently support specifying the length of the checksums being used. It should be extended to do so.

Yes.

101–109 ↗(On Diff #579)

I'm unconvinced this checks belong to this class.

I don't see it as a check to fix corrupted data.
I see it as a check to avoid messing with the data in the db as explained in prior comment.

We have a component responsible for integrity/bitrot checking: it's part of the archiver.

The checker (which is not in production as far as i remember).

Having that check in multiple places doesn't sound like a good idea.

I agree with the duplication being bad.

In my view, this class is more low level, it takes a content blob, compute a set of its checksums, and update the DB. Full stop. In fact, leaving it as low level as that might enable to use that to *correct* wrong primary keys (if they ever happen); whereas with this check it will be impossible to do so.

Ok.
This means we have order in the bricks we use.
We must assert that we'll run the checker (or archiver) prior to this.

127–129 ↗(On Diff #579)

I'll adapt accordingly.

ardumont edited edge metadata.
ardumont marked 3 inline comments as done.

Adapt according to review

  • Use boolean as default value
  • Fix wrong configuration key
  • Adapt according to review

Respect semantics regarding hashes (re)computation

Respect semantics regarding hashes (re)computation

ardumont added inline comments.
swh/indexer/recompute.py
98–99 ↗(On Diff #579)

AFAICT this will not work for hashutil.hashdata(), as it doesn't currently support specifying the length of the checksums being used.
It should be extended to do so.

Yes

D192

ardumont marked an inline comment as done.

Rebase

  • Recompute class to trigger an add/update hash checksums in storage
  • Open task to recompute checksums
  • Add a storage primary_key option to generically check for corruption
  • Use boolean as default value
  • Fix wrong configuration key
  • refactoring: Adapt according to review
  • Respect semantics regarding hashes (re)computation
  • swh.indexer.tasks: Rename queue name to swh_indexer_rehash

Adaptation to actually make it work

  • Respect semantics regarding hashes (re)computation
  • swh.indexer.tasks: Rename queue name to swh_indexer_rehash

Adapt indexer producer to be able to optionally send dict

This is now production ready (i readapted the initial diff description).

Manual testing modus operandi:

  • injected a repository using the git-loader.
  • altered the db schema to add a new column alter table content add column blake2b512 bytea null;
  • computing the list of sha1s from the schema psql service=swh-dev -c "copy (select sha1 from content) to stdin" | sed -e 's/^\\\\x//g' > sha1s
  • Checking the number of contents before batch update:
select count(*) from content;
3991
select count(*) from content where blake2b512 is null;
3991
  • feeding those sha1 for updates

Configuration file ~/.config/swh/storage/rehash.yml:

storage:
  cls: remote
  args:
    url: http://localhost:5002/
compute_checksums:
  - blake2b512
recompute_checksums: false
batch_size_retrieve_content: 10
batch_size_update: 100

And trigger the run:

cat sha1s | python3 -m swh.indexer.producer --task-name rehash --dict-with-key sha1
  • Check everything has been updated when done:
softwareheritage-dev=# select count(*) from content where blake2b512 is null;
 count
-------
     0
(1 row)
softwareheritage-dev=# select count(*) from content where blake2b512 is not null;
 count
-------
  3991
(1 row)

Conclusion: working \o/

d/control: Bump dependency to latest swh-model

Interactive rebase, squash, improve commit messages

Nothing changes in the code.

  • swh.indexer.recompute: class to add/update hash checksums in storage
  • swh.indexer.tasks: Open task to recompute checksums
  • swh.indexer.recompute: Add storage primary_key option for corruption check
  • swh.indexer.rehash: Rename module to rehash + review adaptations
  • swh.indexer.rehash: Respect semantics regarding hashes (re)computation
  • swh.indexer.tasks: Rename queue name to swh_indexer_rehash
  • swh.indexer.producer: Abstract away scheduler's internal api
  • swh.indexer.producer: Adapt producer to optionally send dict
  • d/control: Bump dependency to latest swh-model