Page MenuHomeSoftware Heritage

Prevent erroneous HashCollisions by using the same ctime for all rows.
ClosedPublic

Authored by vlorentz on Apr 8 2020, 10:32 AM.

Details

Summary

'swh_content_add' tries to avoid this issue with a DISTINCT clause
on the entire row; but it is useless because 'ctime' cells differ by
a few microseconds.
This commit ensures all ctime values are exactly the same, so they
are filtered out.

An alternative would be to change 'swh_content_add' to do:

select distinct on (sha1, sha1_git, sha256, blake2s256, length, status) sha1, sha1_git, sha256, blake2s256, length, status, ctime from tmp_content

instead of:

select distinct sha1, sha1_git, sha256, blake2s256, length, status, ctime from tmp_content

but this is more verbose and there's no good reason to call 'now()' for
every row.

Related to T2332

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

This revision is now accepted and ready to land.Apr 8 2020, 10:36 AM

Build is green

Patch application report for D2977 (id=10572)

Rebasing onto 82b41bac01...

Current branch diff-target is up to date.
Changes applied before test
commit 8e8577e8d49194c79e730bd7978253f98e4e0176
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Apr 8 10:30:28 2020 +0200

    Prevent erroneous HashCollisions by using the same ctime for all rows.
    
    'swh_content_add' tries to avoid this issue with a DISTINCT clause
    on the entire row; but it is useless because 'ctime' cells differ by
    a few microseconds.
    This commit ensures all ctime values are exactly the same, so they
    are filtered out.
    
    An alternative would be to change 'swh_content_add' to do:
select distinct on (sha1, sha1_git, sha256, blake2s256, length, status) sha1, sha1_git, sha256, blake2s256, length, status, ctime from tmp_content
```

instead of:

```
select distinct sha1, sha1_git, sha256, blake2s256, length, status, ctime from tmp_content
```

but this is more verbose and there's no good reason to call 'now()' for
every row.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/61/ for more details.