Page MenuHomeSoftware Heritage

loader.git: Mark visit status as not_found when relevant
ClosedPublic

Authored by ardumont on Feb 8 2021, 5:51 PM.

Details

Summary

When the initial communication with the git server is failing initially (e.g repository
is not found), this marks the visit status as not_found.

When the initial communication is ok but a failure occurs during the fetch step (e.g
pack file too big, ...), the visit status is marked as failed.

Related to T3030

Test Plan

tox
(failing until swh.loader.core > 0.16)

Diff Detail

Repository
rDLDG Git loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D5041 (id=17981)

Could not rebase; Attempt merge onto 220678c753...

Updating 220678c..37e87d9
Fast-forward
 requirements-swh.txt                |  2 +-
 swh/loader/svn/loader.py            | 15 +++++++++++----
 swh/loader/svn/tests/test_loader.py | 16 ++++++++++++++--
 3 files changed, 26 insertions(+), 7 deletions(-)
Changes applied before test
commit 37e87d907e04543903ad9ba1e7862d42da70a829
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Feb 8 17:14:26 2021 +0100

    svn.loader: Mark visit status as not_found when relevant
    
    Related to T3030

commit aa71a198881188d7e42ea82d9896fcb394e6d1d4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Feb 8 17:36:57 2021 +0100

    svn.loader: Mark visit status as failed
    
    Related to T3030

Link to build: https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/102/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/102/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 8 2021, 5:52 PM
Harbormaster failed remote builds in B19102: Diff 17981!

Build is green

Patch application report for D5041 (id=17981)

Could not rebase; Attempt merge onto 220678c753...

Updating 220678c..37e87d9
Fast-forward
 requirements-swh.txt                |  2 +-
 swh/loader/svn/loader.py            | 15 +++++++++++----
 swh/loader/svn/tests/test_loader.py | 16 ++++++++++++++--
 3 files changed, 26 insertions(+), 7 deletions(-)
Changes applied before test
commit 37e87d907e04543903ad9ba1e7862d42da70a829
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Feb 8 17:14:26 2021 +0100

    svn.loader: Mark visit status as not_found when relevant
    
    Related to T3030

commit aa71a198881188d7e42ea82d9896fcb394e6d1d4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Feb 8 17:36:57 2021 +0100

    svn.loader: Mark visit status as failed
    
    Related to T3030

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/104/ for more details.

Build is green

Patch application report for D5041 (id=18054)

Could not rebase; Attempt merge onto 220678c753...

Updating 220678c..3fd2dc5
Fast-forward
 requirements-swh.txt                |  2 +-
 swh/loader/svn/loader.py            | 15 +++++++++++----
 swh/loader/svn/tests/test_loader.py | 16 ++++++++++++++--
 3 files changed, 26 insertions(+), 7 deletions(-)
Changes applied before test
commit 3fd2dc5203ed9cfa4378f1c03ca089528c66bc03
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Feb 8 17:14:26 2021 +0100

    svn.loader: Mark visit status as not_found when relevant
    
    When failing to communicate with the svn server, a visit status ends up in "not_found"
    status.
    
    Related to T3030

commit ddc18d08514725fe7fb59857c8181e2cdaa47e3e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Feb 8 17:36:57 2021 +0100

    svn.loader: Mark visit status as failed
    
    With the latest swh.loader.core 0.17, a partial without without snapshot is failed.
    
    Related to T3030

See https://jenkins.softwareheritage.org/job/DLDSVN/job/tests-on-diff/106/ for more details.

That seems too generic. SubversionException has some attributes that can be used to check what caused the error.

ardumont retitled this revision from svn.loader: Mark visit status as not_found when relevant to loader.git: Mark visit status as not_found when relevant.Feb 11 2021, 2:18 PM
ardumont edited the summary of this revision. (Show Details)

Build is green

Patch application report for D5041 (id=18062)

Could not rebase; Attempt merge onto 824880e7d7...

Updating 824880e..195cf94
Fast-forward
 requirements-swh.txt                   |  2 +-
 swh/loader/git/loader.py               | 10 ++++++---
 swh/loader/git/tests/test_from_disk.py | 41 +++++++++++++++++++++++++++++++++-
 swh/loader/git/tests/test_loader.py    | 40 +++++++++++++++++++++++++++++----
 4 files changed, 84 insertions(+), 9 deletions(-)
Changes applied before test
commit 195cf941a03b606ab182c5a7d26f327d402acc8e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Feb 9 11:53:42 2021 +0100

    loader.git: Mark visit status as not_found when relevant
    
    When failing to communicate with the git server, the visits ends up in status
    "not_found".
    
    Related to T3030

commit 11074801b0bb928c62442d4dd74d514cbbdbd0bf
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Feb 8 19:07:04 2021 +0100

    loader.git: Explicit the failure test cases
    
    With the new loader.core 0.17, failed or partial status changed slightly.
    This adds the necessary tests to explicit those.
    
    Related to T3030

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/78/ for more details.

It's still too generic. This would trigger a NotFound when a packfile is too large.

It's still too generic. This would trigger a NotFound when a packfile is too large.

A priori, no, see [1] which is an IOError.
(this won't be caught by this try except)

>>> isinstance(IOError, Exception)
False

I'll check sentry for what kind of errors we have

This [2] sounds like a good candidate for another early failure caught here as failure though (not a not_found).

This [3] sounds like a good candidate for a not found case as well.

[1] https://sentry.softwareheritage.org/share/issue/12a3987c9a3049c881a0ec536e2fc373/

[2] https://sentry.softwareheritage.org/share/issue/4049ae2a15104df789361e7543dc52ca/

[3] https://sentry.softwareheritage.org/share/issue/42e11d3abdb94632b9eb95e76cf41f4e/

ardumont retitled this revision from loader.git: Mark visit status as not_found when relevant to loader.git: Mark visit status as not_found or failed when relevant.Feb 11 2021, 3:14 PM
ardumont edited the summary of this revision. (Show Details)
ardumont retitled this revision from loader.git: Mark visit status as not_found or failed when relevant to loader.git: Mark visit status as not_found when relevant.
ardumont edited the summary of this revision. (Show Details)

Adapt according to discussion

Build is green

Patch application report for D5041 (id=18068)

Rebasing onto 11074801b0...

Current branch diff-target is up to date.
Changes applied before test
commit 8f65bf3af9bcdc3361f22201bdf8a3987c902637
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Feb 9 11:53:42 2021 +0100

    loader.git: Mark visit status as not_found or failed when relevant
    
    When the initial communication with the git server is failing initially (e.g repository
    is not found), this marks the visit status as not_found.
    
    When the initial communication is ok but a failure occurs during the fetch step (e.g
    pack file too big, ...), the visit status is marked as failed.
    
    Related to T3030

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/79/ for more details.

swh/loader/git/tests/test_loader.py
23

injecting fixture in unittest case does not work so i'm using this contraption ^ to simulate it.

50

(pytest.mark.parametrize does not work with unittest class.)

swh/loader/git/tests/test_loader.py
50

Thanks but I don't immediately see what's the benefit of the subtest.

It's still too generic. This would trigger a NotFound when a packfile is too large.

A priori, no, see [1] which is an IOError.
(this won't be caught by this try except)

>>> isinstance(IOError, Exception)
False

oh, indeed

swh/loader/git/loader.py
243–253

could you include the actual errors, so the code is understandable without access to sentry?

swh/loader/git/tests/test_loader.py
50

same as parametrized tests: it runs and reports them independently

swh/loader/git/loader.py
243–253

i did (in parenthesis just after the except).
I guess that means i can drop the sentry urls.

Adapt according to suggestions

Build is green

Patch application report for D5041 (id=18070)

Rebasing onto 11074801b0...

Current branch diff-target is up to date.
Changes applied before test
commit abe863ac1761575fe0fa814e7dd3a0f1743297e9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Feb 9 11:53:42 2021 +0100

    loader.git: Mark visit status as not_found or failed when relevant
    
    When the initial communication with the git server is failing initially (e.g repository
    is not found), this marks the visit status as not_found.
    
    When the initial communication is ok but a failure occurs during the fetch step (e.g
    pack file too big, ...), the visit status is marked as failed.
    
    Related to T3030

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/80/ for more details.

This revision is now accepted and ready to land.Feb 11 2021, 5:22 PM

Looking a bit more into sentry, we reworked the exception handling so now:

  • the following exceptions makes the visit status being marked as "not_found"
    • GitProtocolError("Repository unavailable"), # e.g DMCA takedown
    • GitProtocolError("Repository not found"),
    • GitProtocolError("unexpected http resp 401"),
    • NotGitRepository(<not-important-here>),
  • other exceptions (including GitProtocolError) makes the visit being marked as failed.

This will close a lot of sentry errors (\m/)

Build is green

Patch application report for D5041 (id=18074)

Rebasing onto 11074801b0...

Current branch diff-target is up to date.
Changes applied before test
commit c2cd5fed02a70be723323b60ead92081e08a622c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Feb 9 11:53:42 2021 +0100

    loader.git: Mark visit status as not_found or failed when relevant
    
    When the initial communication with the git server is failing initially (e.g repository
    is not found), this marks the visit status as not_found.
    
    When the initial communication is ok but a failure occurs during the fetch step (e.g
    pack file too big, ...), the visit status is marked as failed.
    
    Related to T3030

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/81/ for more details.