Page MenuHomeSoftware Heritage

Do not reference unloaded snapshots in origin_visit_update
ClosedPublic

Authored by olasd on May 29 2020, 3:02 PM.

Details

Summary

When a repository fails to load, and the visit is set to partial status, some
loaders might have been able to generate a snapshot and an identifier, but
without actually loading it to the archive. This generates partial visits
referencing an inexistant snapshot.

Make sure that we only reference a snapshot in origin_visit_update when it's
actually been loaded (and flushed) to storage; do so by turning the
get_snapshot_id method into a loaded_snapshot attribute that gets populated
explicitly after flushing the snapshot to storage.

Depends on D3199.
Depends on D3200.

Test Plan

new tox tests added for the new behavior

Event Timeline

Build is green

Patch application report for D3201 (id=11358)

Could not rebase; Attempt merge onto faa138226d...

Updating faa1382..3921520
Fast-forward
 swh/loader/core/loader.py            |  22 ++++----
 swh/loader/core/tests/test_loader.py | 100 +++++++++++++++++++++++++++++++----
 2 files changed, 99 insertions(+), 23 deletions(-)
Changes applied before test
commit 3921520a9d8b9534aa53ad4c3dc1b3eb9f14dc98
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri May 29 13:07:17 2020 +0200

    Do not reference unloaded snapshots in origin_visit_update
    
    When a repository fails to load, and the visit is set to `partial` status, some
    loaders might have been able to generate a snapshot and an identifier, but
    without actually loading it to the archive. This generates partial visits
    referencing an inexistant snapshot.
    
    Make sure that we only reference a snapshot in origin_visit_update when it's
    actually been loaded (and flushed) to storage; do so by turning the
    `get_snapshot_id` method into a `loaded_snapshot` attribute that gets populated
    explicitly after flushing the snapshot to storage.

commit 6fdb065fc767d656430904dcd3581cedee19f35c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri May 29 14:36:41 2020 +0200

    Refactor DVCS loader tests to actually run the DVCSLoader store_data method
    
    This will allow us to run integration tests on the behavior of that method. This
    also caught a tiny bug in the store_data method.

commit 326bc17070e637ceb1f35ecac37a95188d225341
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri May 29 14:55:00 2020 +0200

    Make the test_loader origin a global variable to be able to reuse it later

commit cf98b1cb9b990966513551556caba34feff0561b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri May 29 14:55:30 2020 +0200

    Drop useless fs mark on test that does not use any test fixtures

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/54/ for more details.

I wonder if there should be a test checking the snapshot isn't referenced if Storage.snapshot_add errors

swh/loader/core/loader.py
108

Shouldn't it be self.loaded_snapshot_id?

Rename variable to loaded_snapshot_id

Build has FAILED

Patch application report for D3201 (id=11362)

Rebasing onto 6fdb065fc7...

Current branch diff-target is up to date.
Changes applied before test
commit f8f04ee2a3f6dd0fec71b279c750aebca8a2a8ce
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri May 29 13:07:17 2020 +0200

    Do not reference unloaded snapshots in origin_visit_update
    
    When a repository fails to load, and the visit is set to `partial` status, some
    loaders might have been able to generate a snapshot and an identifier, but
    without actually loading it to the archive. This generates partial visits
    referencing an inexistant snapshot.
    
    Make sure that we only reference a snapshot in origin_visit_update when it's
    actually been loaded (and flushed) to storage; do so by turning the
    `get_snapshot_id` method into a `loaded_snapshot` attribute that gets populated
    explicitly after flushing the snapshot to storage.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/55/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/55/console

Drop the signature change for DVCSLoader.get_snapshot()

Build is green

Patch application report for D3201 (id=11363)

Rebasing onto 6fdb065fc7...

Current branch diff-target is up to date.
Changes applied before test
commit d6b22b241759ceafaf861cc4614cc04f49f1441b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri May 29 13:07:17 2020 +0200

    Do not reference unloaded snapshots in origin_visit_update
    
    When a repository fails to load, and the visit is set to `partial` status, some
    loaders might have been able to generate a snapshot and an identifier, but
    without actually loading it to the archive. This generates partial visits
    referencing an inexistant snapshot.
    
    Make sure that we only reference a snapshot in origin_visit_update when it's
    actually been loaded (and flushed) to storage; do so by turning the
    `get_snapshot_id` method into a `loaded_snapshot` attribute that gets populated
    explicitly after flushing the snapshot to storage.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/56/ for more details.

Add test of DVCSLoader behavior when snapshot_add fails in last resort

I wonder if there should be a test checking the snapshot isn't referenced if Storage.snapshot_add errors

That feels a bit overkill, but it wasn't too hard to add, so here we go.

Build is green

Patch application report for D3201 (id=11364)

Rebasing onto 6fdb065fc7...

Current branch diff-target is up to date.
Changes applied before test
commit dd2dc476670aa11ff48600741cfe903d0c21ff58
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri May 29 13:07:17 2020 +0200

    Do not reference unloaded snapshots in origin_visit_update
    
    When a repository fails to load, and the visit is set to `partial` status, some
    loaders might have been able to generate a snapshot and an identifier, but
    without actually loading it to the archive. This generates partial visits
    referencing an inexistant snapshot.
    
    Make sure that we only reference a snapshot in origin_visit_update when it's
    actually been loaded (and flushed) to storage; do so by turning the
    `get_snapshot_id` method into a `loaded_snapshot` attribute that gets populated
    explicitly after flushing the snapshot to storage.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/57/ for more details.

This revision is now accepted and ready to land.May 29 2020, 3:29 PM