Page MenuHomeSoftware Heritage

HgLoaderFromDisk supports corrupted revisions
ClosedPublic

Authored by Alphare on Dec 8 2020, 4:41 PM.

Details

Summary

When a repository has corrupted revision, the revision and its descendants
are not loaded.

This commit only deals with missing filelogs and configures the exclusion
system.

Missing filelogs are recoverable errors that should be skipped or
saved as SkippedContent but being missing the SkippedContent cannot
be calculated. This point is left for future commits.

Diff Detail

Repository
rDLDHG Mercurial loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4688 (id=16637)

Rebasing onto 4bf91cff72...

Current branch diff-target is up to date.
Changes applied before test
commit b4eaece0f83d0d21e9dffe4dbbbe1bfd631ff7b5
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Mon Dec 7 14:44:33 2020 +0100

    HgLoaderFromDisk supports corrupted revisions
    
    When a repository as corrupted revision, the revision and its descendants
    are not loaded.
    
    This commit only deal with missing `filelogs` and setup the exclusion
    system.
    
    Missing `filelogs` are recoverable errors that should be skipped or
    saved as `SkippedContent` but being missing the `SkippedContent` cannot
    be calculated. This point is left to futur commits.

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/136/ for more details.

Hi,

Sorry for the lack of response (assuming you didn't get any from a different channel).

I can't really judge the content of the diff as I don't know if we want to allow holes in the revision history like this; you'll have to ping us when everyone is back on vacation (~ Jan 4th)

swh/loader/mercurial/from_disk.py
279–287

replace this line with assert self._repo (mypy understands assertions)

287

self.log.warning("Corrupted revision: %s", e)

480–485

SkippedContent could be used but need actual content to calculate its id.

You might be able to construct a SkippedContent with no id, but I'm not sure

@Alphare will you continue on this work as well?

In any case, this needs a rebase.

I've added technical notes to what needs to change after the rebase so everything
continues working as before.

swh/loader/mercurial/tests/test_from_disk.py
250

after the rebase, swh_config fixture needs to be changed to the fixture swh_storage and ...

257

... swh_storage needs to be passed along to the constructor loader now...

259

... and the visit_date needs to be a datetime.datetime instance.
(there is possibly a global variable VISIT_DATE in the test module for that purpose).

@Alphare will you continue on this work as well?

Yes, I will be taking over the changes left by Antoine.

In any case, this needs a rebase.

Will do.

I've added technical notes to what needs to change after the rebase so everything
continues working as before.

Thanks!

Alphare added a reviewer: acezar.

Rebase and address some comments

Build has FAILED

Patch application report for D4688 (id=18453)

Rebasing onto 5b8da68a92...

First, rewinding head to replay your work on top of it...
Applying: HgLoaderFromDisk supports corrupted revisions
Changes applied before test
commit 166a998025c775384860720e6a7dc23de12ed3e7
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Fri Feb 26 15:16:24 2021 +0100

    HgLoaderFromDisk supports corrupted revisions
    
    Summary:
    When a repository has corrupted revision, the revision and its descendants
    are not loaded.
    
    This commit only deals with missing `filelogs` and configures the exclusion
    system.
    
    Missing `filelogs` are recoverable errors that should be skipped or
    saved as `SkippedContent` but being missing the `SkippedContent` cannot
    be calculated. This point is left for future commits.
    
    Reviewers: #reviewers
    
    Subscribers: ardumont, Alphare, vlorentz
    
    Differential Revision: https://forge.softwareheritage.org/D4688

Link to build: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/177/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/177/console

Build has failed

oops

...
16:02:37  flake8 run-test: commands[0] | /var/lib/jenkins/workspace/DLDHG/tests-on-diff/.tox/flake8/bin/python -m flake8
16:02:38  ./swh/loader/mercurial/from_disk.py:11:1: F401 'typing.Any' imported but unused
16:02:38  ./swh/loader/mercurial/from_disk.py:271:1: W293 blank line contains whitespace
16:02:38  ./swh/loader/mercurial/tests/test_from_disk.py:282:1: W293 blank line contains whitespace
16:02:38  ./swh/loader/mercurial/tests/test_from_disk.py:283:1: E302 expected 2 blank lines, found 1
16:02:38  ERROR: InvocationError for command /var/lib/jenkins/workspace/DLDHG/tests-on-diff/.tox/flake8/bin/python -m flake8 (exited with code 1)
16:02:38  ___________________________________ summary ____________________________________
16:02:38  ERROR:   flake8: commands failed
...

Prior to update the diff, you might want to either use tox directly, which does among other thing trigger that check ^

or regularly check yourself with:

$ flake8 swh/
$ mypy swh/

Note that also pre-commit should actually have caught this as well ¯\_(ツ)_/¯
(provided it's installed and not bypassed ;)

Cheers,

Prior to update the diff, you might want to either use tox directly, which does among other thing trigger that check ^

I should try tox, I'm only running pytest.

Note that also pre-commit should actually have caught this as well ¯\_(ツ)_/¯
(provided it's installed and not bypassed ;)

Cheers,

Yeah, it's the second time I had this happen because I rectified changes during the rebase, not triggering the pre-commit hook. I got sniped into a meeting before being able to fix this. ;)

Yeah, it's the second time I had this happen because I rectified changes during the rebase, not triggering the pre-commit hook. I got sniped into a meeting before being able to fix this. ;)

oh yeah, there is that, pre-commit is not run during rebase.
(I got hit by that a few times myself already ;)

I should try tox, I'm only running pytest.

well pytest is faster as it tests only the code.
But something like, using pytest during dev and then running tox at the end of it all, when you are ready to update the diff
That definitely will help you catch those paper cuts.

In any case, don't worry, it's fine, jenkins is here to let you know :)

Build is green

Patch application report for D4688 (id=18455)

Rebasing onto 5b8da68a92...

First, rewinding head to replay your work on top of it...
Applying: HgLoaderFromDisk supports corrupted revisions
Changes applied before test
commit 8acbd21db6f410b95c6b41a93d80294b63d50eb5
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Fri Feb 26 15:16:24 2021 +0100

    HgLoaderFromDisk supports corrupted revisions
    
    Summary:
    When a repository has corrupted revision, the revision and its descendants
    are not loaded.
    
    This commit only deals with missing `filelogs` and configures the exclusion
    system.
    
    Missing `filelogs` are recoverable errors that should be skipped or
    saved as `SkippedContent` but being missing the `SkippedContent` cannot
    be calculated. This point is left for future commits.
    
    Reviewers: #reviewers
    
    Subscribers: ardumont, Alphare, vlorentz
    
    Differential Revision: https://forge.softwareheritage.org/D4688

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/178/ for more details.

swh/loader/mercurial/from_disk.py
287

still relevant ^

swh/loader/mercurial/tests/test_from_disk.py
297

What's the visit status for such case?

It should be partial i think since we are not able to load everything.

Please add the missing assertions on those visit (cf. [1] as example)

[1] https://forge.softwareheritage.org/source/swh-loader-git/browse/master/swh/loader/git/tests/test_loader.py$46-52

Alphare added inline comments.
swh/loader/mercurial/from_disk.py
480–485

IMO this is good enough for now to easily log errors and figure out the next steps for improving the robustness/efficiency of the loader. What do you think?

swh/loader/mercurial/tests/test_from_disk.py
297

Should the load status and the visit status be different? I'm not sure I understand the semantics right.

swh/loader/mercurial/tests/test_from_disk.py
297

Yes, they are different. They are not the same and do not share most of their values.

The load status is used by the scheduler, it's only eventful or uneventful (i'm not even sure failed is possible). It's the actual (celery) task status.

It's a remnant from our current scheduler which we are currently reworking.

visit status is the most important one now as it has been more adequately refined in the storage (archive).
If we should choose only one to compare values, that'd be the one.

swh/loader/mercurial/from_disk.py
480–485

yes, plus there is already a todo about it ;)

Fix and test visit status, improve error message

Build is green

Patch application report for D4688 (id=18464)

Rebasing onto c74e5791b9...

First, rewinding head to replay your work on top of it...
Applying: HgLoaderFromDisk supports corrupted revisions
Changes applied before test
commit 62362eeb80b73bd5375b96d0b5a560745426eef6
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Fri Feb 26 15:16:24 2021 +0100

    HgLoaderFromDisk supports corrupted revisions
    
    Summary:
    When a repository has corrupted revision, the revision and its descendants
    are not loaded.
    
    This commit only deals with missing `filelogs` and configures the exclusion
    system.
    
    Missing `filelogs` are recoverable errors that should be skipped or
    saved as `SkippedContent` but being missing the `SkippedContent` cannot
    be calculated. This point is left for future commits.
    
    Reviewers: #reviewers
    
    Subscribers: ardumont, Alphare, vlorentz
    
    Differential Revision: https://forge.softwareheritage.org/D4688

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/180/ for more details.

This revision is now accepted and ready to land.Feb 26 2021, 7:10 PM
This revision was landed with ongoing or failed builds.Feb 26 2021, 7:12 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D4688 (id=18466)

Rebasing onto c74e5791b9...

Current branch diff-target is up to date.
Changes applied before test
commit 1b4780030eb2898726631681ea8d6e2f9c600d6a
Author: Antoine Cezar <antoine.cezar@octobus.net>
Date:   Fri Feb 26 15:16:24 2021 +0100

    HgLoaderFromDisk supports corrupted revisions
    
    Summary:
    When a repository has corrupted revision, the revision and its descendants
    are not loaded.
    
    This commit only deals with missing `filelogs` and configures the exclusion
    system.
    
    Missing `filelogs` are recoverable errors that should be skipped or
    saved as `SkippedContent` but being missing the `SkippedContent` cannot
    be calculated. This point is left for future commits.
    
    Reviewers: #reviewers
    
    Subscribers: ardumont, Alphare, vlorentz
    
    Differential Revision: https://forge.softwareheritage.org/D4688

See https://jenkins.softwareheritage.org/job/DLDHG/job/tests-on-diff/181/ for more details.