In T2332#42825, @ardumont wrote: Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.
That's the case.
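For reference, a minimal sketch of the kind of length check discussed above. The helper name `check_hash_lengths` and the algorithm table are assumptions made for illustration; this is not the actual swh-storage code.

```python
import hashlib

# Expected digest sizes in bytes. The set of algorithm names is an
# assumption for this sketch; the sizes come from hashlib itself.
EXPECTED_LENGTHS = {
    "sha1": hashlib.sha1().digest_size,
    "sha1_git": hashlib.sha1().digest_size,
    "sha256": hashlib.sha256().digest_size,
    "blake2s256": hashlib.blake2s().digest_size,
}


def check_hash_lengths(hashes):
    """Raise ValueError if any hash is not bytes of the expected length.

    `hashes` maps an algorithm name to its raw digest.
    """
    for algo, value in hashes.items():
        expected = EXPECTED_LENGTHS.get(algo)
        if expected is None:
            raise ValueError(f"unknown hash algorithm: {algo}")
        if not isinstance(value, bytes) or len(value) != expected:
            raise ValueError(
                f"{algo}: expected {expected} bytes, got {value!r}"
            )
```

A storage backend could call such a helper at the top of each `_add` endpoint, so that malformed objects are rejected before any write is attempted.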
Mar 25 2020
Mar 24 2020
Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.
To be more sure of that, I think we should make sure that all hash data in all exception arguments is passed as hex-encoded unicode strings, rather than as bytes objects left for Python to repr(); this would circumvent a lot of places where encoding or decoding the data in transfer can go wrong.
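A small sketch of what that suggestion could look like in practice. The helper name `hex_encode_hashes` is hypothetical and only illustrates converting raw digests to hex strings before they are attached to an exception.

```python
def hex_encode_hashes(hashes):
    """Return a copy of `hashes` with every bytes digest hex-encoded.

    `hashes` maps an algorithm name to a raw digest (bytes).
    """
    return {algo: digest.hex() for algo, digest in hashes.items()}


raw = {"sha1": b"\x00\xff\x10"}
# repr() of bytes mixes escapes and printable characters, which is
# easy to mangle when the value is serialized and parsed back.
print(repr(raw["sha1"]))
# The hex form is plain ASCII text and round-trips with bytes.fromhex().
print(hex_encode_hashes(raw)["sha1"])
```

The hex string can be turned back into the original digest with `bytes.fromhex()`, so no information is lost in transfer.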
it looks like there's a few actual collisions; seems that they're the known-colliding Google PDFs
I'll write my remarks down here for tracking purposes
sampled collisions extracted from sentry and storage [1]
Mar 16 2020
vlorentz updated the task description for T2316: Align row deduplication of all _add endpoints on release_add.
Mar 12 2020
Mar 10 2020
vlorentz accepted D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
thanks!
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/1041/ for more details.
ardumont updated the diff for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Rebase on latest master
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/1040/ for more details.
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build was aborted
Harbormaster failed remote builds in B11008: Diff 9937 for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception!
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build was aborted
Mar 9 2020
ardumont updated the diff for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Improve collision scenario checks
Harbormaster failed remote builds in B11000: Diff 9929 for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception!
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build was aborted
ardumont added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Could you add more assertions to test_content_add_collision and test_content_add_metadata_collision, to check for the new common behavior?
vlorentz requested changes to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Could you add more assertions to test_content_add_collision and test_content_add_metadata_collision, to check for the new common behavior?
ardumont added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
ardumont updated the diff for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Add coverage on extra conversion step
vlorentz added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
olasd added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
ardumont added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/1033/ for more details.
ardumont added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Harbormaster failed remote builds in B10997: Diff 9926 for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception!
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build was aborted
ardumont updated the diff for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Align storages to return the list of colliding hashes
ardumont updated the diff for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
- pgstorage: Return the list of colliding content hashes
- improve regexp extraction
ardumont added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
ardumont added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
olasd added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
vlorentz added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
olasd requested changes to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
As mentioned inline in the pg storage diff, in general we should return /all/ colliding contents that we can find, rather than a single one. So in the end, the exception argument should be a List[Dict[str, bytes]].
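To illustrate the requested shape, here is a hedged sketch of an exception carrying every colliding content as a `List[Dict[str, bytes]]`. The class name `HashCollisionSketch` and its attribute names are assumptions for this example; the real `HashCollision` class in swh-storage may differ.

```python
from typing import Dict, List


class HashCollisionSketch(Exception):
    """Illustrative exception listing all colliding contents.

    All values are passed to Exception.__init__, so they end up in
    `self.args` and the exception survives a pickle round trip.
    """

    def __init__(
        self, algo: str, hash_id: bytes, colliding: List[Dict[str, bytes]]
    ) -> None:
        super().__init__(algo, hash_id, colliding)
        self.algo = algo
        self.hash_id = hash_id
        self.colliding = colliding


try:
    raise HashCollisionSketch(
        "sha1",
        b"\x01" * 20,
        [
            {"sha1": b"\x01" * 20, "sha256": b"\x02" * 32},
            {"sha1": b"\x01" * 20, "sha256": b"\x03" * 32},
        ],
    )
except HashCollisionSketch as exc:
    # The caller can inspect every colliding content, not just one.
    print(exc.algo, len(exc.colliding))
```

Returning the full list lets callers log or skip every affected content in one pass instead of retrying once per collision.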
ardumont added inline comments to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Mar 8 2020
ardumont added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build is green
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/1031/ for more details.
Harbormaster failed remote builds in B10993: Diff 9923 for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception!
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build was aborted
Mar 7 2020
ardumont updated the diff for D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Adapt according to review
ardumont updated the summary of D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
vlorentz accepted D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
lgtm, but I'd like someone else to review it as well
Mar 6 2020
swh-public-ci added a comment to D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/1029/ for more details.
ardumont retitled D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception from storage: Identify and provide the collision hash(es) in HashCollision exception to storage(s): Identify and provide the collision hash(es) in HashCollision exception.
olasd triaged T2304: Cassandra storage: Reduce the size of the "secondary lookup tables" for contents as Normal priority.
Feb 28 2020
caf51a044377cf62f73a02cd6c641d94b4e32c95
Feb 26 2020
ardumont closed T2239: storage: kafka issue: Can't pickle <class 'cimpl.KafkaException'>: import of module 'cimpl' failed as Resolved.
I think so (latest storage version deployed).
Feb 25 2020
olasd added a comment to T2239: storage: kafka issue: Can't pickle <class 'cimpl.KafkaException'>: import of module 'cimpl' failed.
@vlorentz I guess this could be closed?
Feb 18 2020
vlorentz updated the task description for T2290: Implement origin_metadata endpoints in swh/storage/cassandra/.
vlorentz renamed T2291: Implement metadata_provider endpoints in swh/storage/cassandra/ from T2290: Implement metadata_provider endpoints in swh/storage/cassandra/ to Implement metadata_provider endpoints in swh/storage/cassandra/.
vlorentz triaged T2291: Implement metadata_provider endpoints in swh/storage/cassandra/ as Normal priority.
Feb 17 2020
vlorentz added a comment to T2239: storage: kafka issue: Can't pickle <class 'cimpl.KafkaException'>: import of module 'cimpl' failed.
python-cassandra also has some unpicklable errors:
Feb 14 2020
Feb 6 2020
ardumont closed T2185: Make webapp0 use Cassandra as storage backend., a subtask of T1892: Cassandra as a storage backend, as Resolved.
Feb 5 2020
vlorentz added a comment to T2239: storage: kafka issue: Can't pickle <class 'cimpl.KafkaException'>: import of module 'cimpl' failed.
Relatedly, errors raised by tenacity cannot be pickled because they contain a Lock: https://github.com/jd/tenacity/issues/147
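The failure mode can be reproduced without tenacity. The sketch below uses a stand-in exception class (`ErrorWithLock`, hypothetical) that holds a `threading.Lock`, and shows one possible workaround: replacing an unpicklable exception with a plain one before it crosses a process or RPC boundary. The helper `to_picklable` is an assumption for illustration, not what swh-storage does.

```python
import pickle
import threading


class ErrorWithLock(Exception):
    """Stand-in for a library error that keeps a Lock as an attribute."""

    def __init__(self, message):
        super().__init__(message)
        self.lock = threading.Lock()


def to_picklable(exc):
    """Return `exc` if it pickles, else a RuntimeError describing it."""
    try:
        pickle.dumps(exc)
        return exc
    except Exception:
        # Lock objects cannot be pickled, so fall back to a plain
        # exception that keeps the original type name and message.
        return RuntimeError(f"{type(exc).__name__}: {exc}")


safe = to_picklable(ErrorWithLock("retry failed"))
print(type(safe).__name__, safe)
```

The trade-off is that the original exception type is lost on the receiving side; only its name and message survive.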
ardumont changed the status of T2185: Make webapp0 use Cassandra as storage backend., a subtask of T1892: Cassandra as a storage backend, from Open to Work in Progress.
ardumont changed the status of T2185: Make webapp0 use Cassandra as storage backend. from Open to Work in Progress.
storage02.euwest.azure exposes an RPC server using Cassandra as its storage backend.
webapp0 has been updated to use it.
ardumont closed T2183: Switch webapp0 to use swh-search instead of postgresql search., a subtask of T2185: Make webapp0 use Cassandra as storage backend., as Resolved.
Feb 3 2020
thx
and swh-storage debian package built [1] (passing the cassandra tests ;)
vlorentz closed T2186: Merge swh-storage-cassandra in swh-storage master, a subtask of T2185: Make webapp0 use Cassandra as storage backend., as Resolved.
Jan 30 2020
I'm fine with switching to IRIs in the doc, just please expand what it means on first use (with a mention like "they are like URIs but"), as I don't think the acronym is that well-known yet, especially in the US.
Jan 29 2020
ardumont renamed T2211: Go beyond git expressivity from Go beyound git expressivity to Go beyond git expressivity.
done indeed.
vlorentz closed T2184: Replay origins to ElasticSearch's "origin" index, a subtask of T2183: Switch webapp0 to use swh-search instead of postgresql search., as Resolved.
Jan 27 2020
It's deployed btw.
Jan 24 2020
ardumont closed T2167: Deploy swh-search, a subtask of T1910: Redesign origin search using a dedicated component (swh-search), as Resolved.
ardumont closed T2167: Deploy swh-search, a subtask of T2183: Switch webapp0 to use swh-search instead of postgresql search., as Resolved.
ardumont changed the status of T2167: Deploy swh-search, a subtask of T1910: Redesign origin search using a dedicated component (swh-search), from Open to Work in Progress.
ardumont changed the status of T2167: Deploy swh-search, a subtask of T2183: Switch webapp0 to use swh-search instead of postgresql search., from Open to Work in Progress.
Cool, looks like this is all ready within our code base:
ardumont renamed T1910: Redesign origin search using a dedicated component (swh-search) from Redesign origin search using a dedicated component to Redesign origin search using a dedicated component (swh-search).
Jan 23 2020
olasd closed T546: Update debian loader to register origin_visit's state, a subtask of T534: Add completion information to softwareheritage.origin_visit table, as Resolved.
Considering the age of the bug report and how many underlying libraries have been upgraded, we can reopen this when we notice it again.
Is this still "a thing"?
Jan 22 2020
vlorentz changed the status of T2214: Scale-out graph and database storage in production from Open to Work in Progress.
vlorentz added projects to T2211: Go beyond git expressivity: Data Model, Storage manager, Mercurial loader.
vlorentz added a project to T2214: Scale-out graph and database storage in production: Storage manager.