Page MenuHomeSoftware Heritage

package.loader: Insert consistently data to storage, clear buffer proxy state in case of error when skipping artifact
ClosedPublic

Authored by ardumont on Apr 7 2020, 5:33 PM.

Details

Summary

Prior to this commit, we did not flush consistently data to storage.

Now for each revision, we flush all data (content, skipped-content, directory,
revision...) to the storage (<- new). If any issue arose during the artifact
loading, the data for the revision is cleared (<- new) and the branch is
skipped (<- as before).

Related to T2352
Related to D2966

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D2973 (id=10567)

Rebasing onto 5921567408...

Current branch diff-target is up to date.
Changes applied before test
commit 792541dd1d8d9995099c5a07964269b55096f1c0
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    This needs to flush as soon as a revision is loaded though. Otherwise, we could
    be missing data from a previous snapshot branch (in effect, without the flush
    operations, the tests fail).
    
    This is already covered by the current tests.
    
    Related to T2352

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/11/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/11/console

vlorentz added inline comments.
swh/loader/package/loader.py
332

Shouldn't it check the method exists?

445–446

Are you sure we want to flush on every revision?

swh/loader/package/loader.py
332

I hesitated in adding the following in package/loader:

def flush(self) -> None:  # commit?
    """Commit pending writes to the storage

    """
    if hasattr(self.storage, 'flush'):
        self.storage.flush()

def clear_buffers(self) -> None:  # rollback?
    """Clear pending writes from the storage

    """
    self.storage.clear_buffers()
332

No, for this one, it exists in all storages.

445–446

See the main comment ;)

swh/loader/package/loader.py
332

See D2966 btw ;)

ardumont added inline comments.
swh/loader/package/loader.py
332–333

@olasd, from irc discussion, wondering if this would not be better:

except HashCollision as e:
    self.storage.clear_buffers()
    load_exceptions.append(e)
    sentry_sdk.capture_exception(e)
    logger.exception('Failed loading branch %s for %s',
                                  branch_name, self.url)
    continue
except Exception as e:
    load_exceptions.append(e)
    sentry_sdk.capture_exception(e)
    logger.exception('Failed loading branch %s for %s',
                                  branch_name, self.url)
    continue
445

That's also why i proposed on irc to implement flush in all storages (backend and proxies) so we don't need that if check prior to call flush.

swh/loader/package/loader.py
445
ardumont edited the summary of this revision. (Show Details)

Clear buffers when Hash Collision is detected

  • No longer flush at each revision
  • Add a test around the hash collision case

Build has FAILED

Patch application report for D2973 (id=10602)

Rebasing onto cf496e1844...

First, rewinding head to replay your work on top of it...
Applying: package.loader: Clear proxy buffer state when failing to load revision
Using index info to reconstruct a base tree...
M	swh/loader/package/loader.py
M	swh/loader/package/nixguix/tests/test_nixguix.py
Falling back to patching base and 3-way merge...
Auto-merging swh/loader/package/nixguix/tests/test_nixguix.py
CONFLICT (content): Merge conflict in swh/loader/package/nixguix/tests/test_nixguix.py
Auto-merging swh/loader/package/loader.py
CONFLICT (content): Merge conflict in swh/loader/package/loader.py
Patch failed at 0001 package.loader: Clear proxy buffer state when failing to load revision

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto cf496e1844...

Already up to date.
Changes applied before test
commit 22154aa459c8305eb53d2231e90341531eb7ecce
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/12/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/12/console

Need to rebase on black stuff.

Rebase and blackify
tests are still expected to fail.

Build has FAILED

Patch application report for D2973 (id=10603)

Rebasing onto cf496e1844...

Current branch diff-target is up to date.
Changes applied before test
commit ea25b20d879dd3dbabed42eb5ee267062da94d83
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/13/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/13/console

Rebase and actually blackify the diff

Build has FAILED

Patch application report for D2973 (id=10612)

Rebasing onto 89828fa6c3...

Current branch diff-target is up to date.
Changes applied before test
commit a8f725807ec75ee9468fa0acc0e27837b2edfcdd
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/14/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/14/console

swh/loader/package/loader.py
332

on all objects as here or on 'contents' alone?

Build is green

Patch application report for D2973 (id=10612)

Rebasing onto 89828fa6c3...

Current branch diff-target is up to date.
Changes applied before test
commit a8f725807ec75ee9468fa0acc0e27837b2edfcdd
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/15/ for more details.

Are you sure we want to flush on every revision?

See the main comments

(which now got amended so kinda lost)... The point in the original comment,
"well, we have to".

@vlorentz I removed the flush at every revision in the end.

I'm not totally sure it's wise regarding this clear_buffers policy though.

We actually have no idea what's in the buffer at a given time unless we flush
at each revision (as in trigger the writes to storage). If we don't flush after
one revision (like we do at the moment), the buffer actually hold potentially
many revisions information (contents, directory). So if we clear the buffers on
exception, we actually don't know what we cleared. In my mind, that's not good.

If we flush at each revision, then we only clear (if that happens) the revision
information we failed to load. That's better, well that's the only consistent
way, hence the "we have to" from the start.

But if we flush at each revision, what's the point of the proxy buffer storage
in the first place? To be fair, there might be revisions from package loader
which are heavy in terms of data so that might be a good reason ¯\_(ツ)_/¯.

Thus why i'm doing this clear_buffers explicitely only on hash collisions
now. Note that i also ask whether we only clear_buffers on contents... (the
clear_buffers allows to define what types we want to flush).

In the end, I'm kinda not convinced by this diff... I opened it because that's
what we said we'd do (even if the implementation changed slightly, because of
what i just said).

In the current state of affairs regarding how we deal with hash collision in
storage. I'd rather exclude the colliding hash(es) and resent the contents
without those. Like what we actually do in the journal.

What do you think?

Thanks in advance for your input.

ardumont edited the summary of this revision. (Show Details)

Rebase on latest master

Build is green

Patch application report for D2973 (id=10657)

Rebasing onto 63d9baa397...

Current branch diff-target is up to date.
Changes applied before test
commit fbb25a91c6a4582a841584221c9548246fb6e897
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/18/ for more details.

In the current state of affairs regarding how we deal with hash collision in
storage. I'd rather exclude the colliding hash(es) and resent the contents
without those. Like what we actually do in the journal.

I'm currently implementing this retry proxy storage side (well trying to properly test it more like).
That should trigger less reasoning mayhem here...

Are you sure we want to flush on every revision?

See the main comments

(which now got amended so kinda lost)... The point in the original comment,
"well, we have to".

@vlorentz I removed the flush at every revision in the end.

I'm not totally sure it's wise regarding this clear_buffers policy though.

It's critical that the buffers are flushed before a new revision is added to the snapshot. Flushing ensures that all revisions referenced in the (partial) snapshot have all their dependent objects in storage. As we try hard to send the partial snapshot to storage when the load completes or fails with an exception, we really need to make sure that these revisions and all the underlying contents and directories have made it safely, before we even think of adding a reference to them in a snapshot that has a chance of landing in the archive.

Clearing the buffers while having let the automatic flushing work, means that we're very likely to reference revisions that haven't been properly loaded in snapshots.

We actually have no idea what's in the buffer at a given time unless we flush
at each revision (as in trigger the writes to storage). If we don't flush after
one revision (like we do at the moment), the buffer actually hold potentially
many revisions information (contents, directory). So if we clear the buffers on
exception, we actually don't know what we cleared. In my mind, that's not good.

If we flush at each revision, then we only clear (if that happens) the revision
information we failed to load. That's better, well that's the only consistent
way, hence the "we have to" from the start.

Exactly.

But if we flush at each revision, what's the point of the proxy buffer storage
in the first place? To be fair, there might be revisions from package loader
which are heavy in terms of data so that might be a good reason ¯\_(ツ)_/¯.

The original point of the buffer really was splitting the addition of contents and directories into smaller chunks, to avoid overflowing the memory of the backend with large batches of objects (e.g. all the contents of a firefox source tarball at once, to take a random example); I think you're being misled by the name (and grouper or splitter or chunker might have been a better choice). Grouping the addition of objects to reduce the RPC overhead and make larger batches is only an incidental effect.

Thus why i'm doing this clear_buffers explicitely only on hash collisions
now. Note that i also ask whether we only clear_buffers on contents... (the
clear_buffers allows to define what types we want to flush).

In the end, I'm kinda not convinced by this diff... I opened it because that's
what we said we'd do (even if the implementation changed slightly, because of
what i just said).

In the current state of affairs regarding how we deal with hash collision in
storage. I'd rather exclude the colliding hash(es) and resent the contents
without those. Like what we actually do in the journal.

What do you think?

I think that loaders and replayers are different things, with different integrity tradeoffs, and that we don't want their behavior to be the same.

[ snip HashCollision handling discussion, moved to D3012 ]

If we load a revision from a nixguix package URL, without making sure the underlying contents have been properly loaded, then the incremental nature of the loader will mean that there's a good chance we'll never try to load these contents again. If they disappear upstream, we have a hole in the graph forever.

I don't see any problem in flushing after parsing the objects for every revision. It may be slow (and even then, we've never actually profiled it). But this is the only way to keep the archive consistency, while maximizing the amount of data loaded and reference in the partial snapshot.

Revert to first implementation:

  • flush at every revision so we have consistent data stored per artifact
  • if exception during loading a revision is triggered, clear proxy internal state

Build is green

Patch application report for D2973 (id=10719)

Rebasing onto 0c35beca3c...

Current branch diff-target is up to date.
Changes applied before test
commit c3efa0aab09a303e96035a3940af50f54cda6622
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    This also adds the missing flushing step at each revision.
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/20/ for more details.

Build is green

Patch application report for D2973 (id=10720)

Rebasing onto 0c35beca3c...

Current branch diff-target is up to date.
Changes applied before test
commit 44d2c904de2f862e8b0daa507981690fb4e6c0c8
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    This also adds the missing flushing step at each revision.
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/21/ for more details.

Build is green

Patch application report for D2973 (id=10721)

Rebasing onto 0c35beca3c...

Current branch diff-target is up to date.
Changes applied before test
commit 599b6fbadb9921106d7b95461b0bad865f341f69
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    This also adds the missing flushing step at each revision.
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/22/ for more details.

ardumont retitled this revision from package.loader: Clear proxy buffer state when failing to load revision to package.loader: Insert consistently data to storage, clear buffer proxy state in case of error when skipping artifact.Apr 14 2020, 2:05 PM
ardumont edited the summary of this revision. (Show Details)
This revision is now accepted and ready to land.Apr 14 2020, 2:07 PM