Page MenuHomeSoftware Heritage

[WIP] Add a (redis-based) validation error reporting facility
AbandonedPublic

Authored by douardda on Oct 26 2021, 5:48 PM.

Details

Summary

This allows to use a ValidatingPorxyStorage that will filter out invalid
objects but report them in a redis server.

Key for an invalid object is:

  • its swhid, if any,
  • a string "<object_type>:<hexhash>" if if has a compute_hash() method,
  • a string like "<object_type>:<key1>=<value1>,<key2>=<value2>[,...]" if obj.unique_identifier() is a dict

Related to T3693

/!\ this is a WIP, no test, no proper requirement, etc. to discuss the idea
It does not even properly filter out invalid objects for now.

Also, with the approach of this diff, the object stored in redis is not the message from kafka, but the result of a double msgpack->dict->BaseModel->dict->msgpack transformation which is not great (with filtering/sanitizing stuff embedded in the pipeline).

Finally, having to implement the computation of the key like that is a bit depressing...
(who said "SWHID everywhere!"?)

Diff Detail

Event Timeline

Build is green

Patch application report for D6554 (id=23818)

Rebasing onto 49a932c989...

Current branch diff-target is up to date.
Changes applied before test
commit ebdcbc140e7b25e145a2430c4b836a64e8dc201b
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Oct 26 16:46:40 2021 +0200

    Add a (redis-based) validation error reporting facility
    
    This allows to use a ValidatingPorxyStorage that will filter out invalid
    objects but report them in a redis server.
    
    Key for an invalid object is:
    - its swhid, if any,
    - a string "<object_type>:<hexhash>" if if has a compute_hash() method,
    - a string like "<object_type>:<key1>=<value1>,<key2>=<value2>[,...]" if
      obj.unique_identifier() is a dict

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1462/ for more details.

Finally, having to implement the computation of the key like that is a bit depressing...
(who said "SWHID everywhere!"?)

What objects are missing swhids?

Two nitpicks:

  • rename on_error to errors, for consistency with bytes.decode()?
  • And make the constructor convert it from a string to an enum so the value checked at runtime (currently, typoing the value makes it ignore errors completely), and other methods benefit from static type-checking.
swh/storage/proxies/validate.py
50

maybe?

70

This will drop content objects. I understand why, but I don't think this is a good idea because they may never be available again.

At the very least, it should be clearly documented this should only be used when mirroring, and should always be addressed ASAP.

swh/storage/proxies/validate.py
70

I mean it will drop their 'data', not the content object iself

used another approach, see D6571