Page MenuHomeSoftware Heritage

Prevent erroneous HashCollisions by using the same ctime for all rows.
ClosedPublic

Authored by vlorentz on Apr 8 2020, 10:32 AM.

Details

Summary

'swh_content_add' tries to avoid this issue with a DISTINCT clause
on the entire row; but it is useless because 'ctime' cells differ by
a few microseconds.
This commit ensures all ctime values are exactly the same, so they
are filtered out.

An alternative would be to change 'swh_content_add' to do:

select distinct on (sha1, sha1_git, sha256, blake2s256, length, status) sha1, sha1_git, sha256, blake2s256, length, status, ctime from tmp_content

instead of:

select distinct sha1, sha1_git, sha256, blake2s256, length, status, ctime from tmp_content

but this is more verbose and there's no good reason to call 'now()' for
every row.

Related to T2332

Diff Detail

Repository
rDSTO Storage manager
Branch
ctime-hashcollision
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11659
Build 17686: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 17685: arc lint + arc unit

Event Timeline

This revision is now accepted and ready to land.Apr 8 2020, 10:36 AM

Build is green

Patch application report for D2977 (id=10572)

Rebasing onto 82b41bac01...

Current branch diff-target is up to date.
Changes applied before test
commit 8e8577e8d49194c79e730bd7978253f98e4e0176
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Apr 8 10:30:28 2020 +0200

    Prevent erroneous HashCollisions by using the same ctime for all rows.
    
    'swh_content_add' tries to avoid this issue with a DISTINCT clause
    on the entire row; but it is useless because 'ctime' cells differ by
    a few microseconds.
    This commit ensures all ctime values are exactly the same, so they
    are filtered out.
    
    An alternative would be to change 'swh_content_add' to do:
select distinct on (sha1, sha1_git, sha256, blake2s256, length, status) sha1, sha1_git, sha256, blake2s256, length, status, ctime from tmp_content
```

instead of:

```
select distinct sha1, sha1_git, sha256, blake2s256, length, status, ctime from tmp_content
```

but this is more verbose and there's no good reason to call 'now()' for
every row.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/61/ for more details.