⚙ D2973 package.loader: Insert consistently data to storage, clear buffer proxy state in case of error when skipping artifact

ardumont created this revision.Apr 7 2020, 5:33 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptApr 7 2020, 5:33 PM

Build has FAILED

Patch application report for D2973 (id=10567)

Rebasing onto 5921567408...

Current branch diff-target is up to date.

Changes applied before test

commit 792541dd1d8d9995099c5a07964269b55096f1c0
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    This needs to flush as soon as a revision is loaded though. Otherwise, we could
    be missing data from a previous snapshot branch (in effect, without the flush
    operations, the tests fail).
    
    This is already covered by the current tests.
    
    Related to T2352

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/11/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/11/console

Harbormaster failed remote builds in B11654: Diff 10567!Apr 7 2020, 5:34 PM

vlorentz added a subscriber: vlorentz.Apr 7 2020, 5:35 PM

vlorentz added inline comments.

swh/loader/package/loader.py
332	Shouldn't it check the method exists?
445–446	Are you sure we want to flush on every revision?

ardumont added inline comments.Apr 7 2020, 5:36 PM

swh/loader/package/loader.py

332

I hesitated in adding the following in package/loader:

def flush(self) -> None:  # commit?
    """Commit pending writes to the storage

    """
    if hasattr(self.storage, 'flush'):
        self.storage.flush()

def clear_buffers(self) -> None:  # rollback?
    """Clear pending writes from the storage

    """
    self.storage.clear_buffers()

332

No, for this one, it exists in all storages.

445–446

See the main comment ;)

ardumont added inline comments.Apr 7 2020, 5:38 PM

swh/loader/package/loader.py
332	See D2966 btw ;)

ardumont edited the summary of this revision. (Show Details)Apr 7 2020, 6:22 PM

ardumont added a subscriber: olasd.Apr 8 2020, 11:08 AM

ardumont added inline comments.

swh/loader/package/loader.py

332–333

@olasd, from irc discussion, wondering if this would not be better:

except HashCollision as e:
    self.storage.clear_buffers()
    load_exceptions.append(e)
    sentry_sdk.capture_exception(e)
    logger.exception('Failed loading branch %s for %s',
                                  branch_name, self.url)
    continue
except Exception as e:
    load_exceptions.append(e)
    sentry_sdk.capture_exception(e)
    logger.exception('Failed loading branch %s for %s',
                                  branch_name, self.url)
    continue

445

That's also why i proposed on irc to implement flush in all storages (backend and proxies) so we don't need that if check prior to call flush.

ardumont added inline comments.Apr 8 2020, 4:18 PM

swh/loader/package/loader.py
445	D2980

Clear buffers when Hash Collision is detected

No longer flush at each revision
Add a test around the hash collision case

Build has FAILED

Patch application report for D2973 (id=10602)

Rebasing onto cf496e1844...

First, rewinding head to replay your work on top of it...
Applying: package.loader: Clear proxy buffer state when failing to load revision
Using index info to reconstruct a base tree...
M	swh/loader/package/loader.py
M	swh/loader/package/nixguix/tests/test_nixguix.py
Falling back to patching base and 3-way merge...
Auto-merging swh/loader/package/nixguix/tests/test_nixguix.py
CONFLICT (content): Merge conflict in swh/loader/package/nixguix/tests/test_nixguix.py
Auto-merging swh/loader/package/loader.py
CONFLICT (content): Merge conflict in swh/loader/package/loader.py
Patch failed at 0001 package.loader: Clear proxy buffer state when failing to load revision

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto cf496e1844...

Already up to date.

Changes applied before test

commit 22154aa459c8305eb53d2231e90341531eb7ecce
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/12/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/12/console

Harbormaster failed remote builds in B11706: Diff 10602!Apr 8 2020, 5:42 PM

ardumont edited the summary of this revision. (Show Details)Apr 8 2020, 5:42 PM

Need to rebase on black stuff.

Rebase and blackify
tests are still expected to fail.

Build has FAILED

Patch application report for D2973 (id=10603)

Rebasing onto cf496e1844...

Current branch diff-target is up to date.

Changes applied before test

commit ea25b20d879dd3dbabed42eb5ee267062da94d83
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/13/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/13/console

Harbormaster failed remote builds in B11707: Diff 10603!Apr 8 2020, 5:48 PM

Rebase and actually blackify the diff

Build has FAILED

Patch application report for D2973 (id=10612)

Rebasing onto 89828fa6c3...

Current branch diff-target is up to date.

Changes applied before test

commit a8f725807ec75ee9468fa0acc0e27837b2edfcdd
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/14/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/14/console

Harbormaster failed remote builds in B11734: Diff 10612!Apr 9 2020, 9:24 AM

ardumont edited the summary of this revision. (Show Details)Apr 9 2020, 9:24 AM

ardumont added inline comments.Apr 9 2020, 10:40 AM

swh/loader/package/loader.py
332	on all objects as here or on 'contents' alone?

Build is green

Patch application report for D2973 (id=10612)

Rebasing onto 89828fa6c3...

Current branch diff-target is up to date.

Changes applied before test

commit a8f725807ec75ee9468fa0acc0e27837b2edfcdd
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/15/ for more details.

Harbormaster completed remote builds in B11734: Diff 10612.Apr 9 2020, 1:17 PM

ardumont edited the summary of this revision. (Show Details)Apr 9 2020, 2:01 PM

Are you sure we want to flush on every revision?

See the main comments

(which now got amended so kinda lost)... The point in the original comment,
"well, we have to".

@vlorentz I removed the flush at every revision in the end.

I'm not totally sure it's wise regarding this clear_buffers policy though.

We actually have no idea what's in the buffer at a given time unless we flush
at each revision (as in trigger the writes to storage). If we don't flush after
one revision (like we do at the moment), the buffer actually hold potentially
many revisions information (contents, directory). So if we clear the buffers on
exception, we actually don't know what we cleared. In my mind, that's not good.

If we flush at each revision, then we only clear (if that happens) the revision
information we failed to load. That's better, well that's the only consistent
way, hence the "we have to" from the start.

But if we flush at each revision, what's the point of the proxy buffer storage
in the first place? To be fair, there might be revisions from package loader
which are heavy in terms of data so that might be a good reason ¯\_(ツ)_/¯.

Thus why i'm doing this clear_buffers explicitely only on hash collisions
now. Note that i also ask whether we only clear_buffers on contents... (the
clear_buffers allows to define what types we want to flush).

In the end, I'm kinda not convinced by this diff... I opened it because that's
what we said we'd do (even if the implementation changed slightly, because of
what i just said).

In the current state of affairs regarding how we deal with hash collision in
storage. I'd rather exclude the colliding hash(es) and resent the contents
without those. Like what we actually do in the journal.

What do you think?

Thanks in advance for your input.

Rebase on latest master

Build is green

Patch application report for D2973 (id=10657)

Rebasing onto 63d9baa397...

Current branch diff-target is up to date.

Changes applied before test

commit fbb25a91c6a4582a841584221c9548246fb6e897
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    due to hash collisions
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/18/ for more details.

Harbormaster completed remote builds in B11776: Diff 10657.Apr 10 2020, 9:32 AM

In the current state of affairs regarding how we deal with hash collision in
storage. I'd rather exclude the colliding hash(es) and resent the contents
without those. Like what we actually do in the journal.

I'm currently implementing this retry proxy storage side (well trying to properly test it more like).
That should trigger less reasoning mayhem here...

ardumont mentioned this in D3012: retry: Make content_add endpoints maximize content writes to storage.Apr 10 2020, 3:50 PM

ardumont mentioned this in D3014: package.loader: Flush and clear regularly internal proxy storage states.Apr 11 2020, 12:14 PM

ardumont added a task: T2352: nixguix: Fails to finish as it's stuck in a loop up to memory error.Apr 14 2020, 10:25 AM

In D2973#72387, @ardumont wrote:

Are you sure we want to flush on every revision?

See the main comments

(which now got amended so kinda lost)... The point in the original comment,
"well, we have to".

@vlorentz I removed the flush at every revision in the end.

I'm not totally sure it's wise regarding this clear_buffers policy though.

It's critical that the buffers are flushed before a new revision is added to the snapshot. Flushing ensures that all revisions referenced in the (partial) snapshot have all their dependent objects in storage. As we try hard to send the partial snapshot to storage when the load completes or fails with an exception, we really need to make sure that these revisions and all the underlying contents and directories have made it safely, before we even think of adding a reference to them in a snapshot that has a chance of landing in the archive.

Clearing the buffers while having let the automatic flushing work, means that we're very likely to reference revisions that haven't been properly loaded in snapshots.

We actually have no idea what's in the buffer at a given time unless we flush
at each revision (as in trigger the writes to storage). If we don't flush after
one revision (like we do at the moment), the buffer actually hold potentially
many revisions information (contents, directory). So if we clear the buffers on
exception, we actually don't know what we cleared. In my mind, that's not good.

If we flush at each revision, then we only clear (if that happens) the revision
information we failed to load. That's better, well that's the only consistent
way, hence the "we have to" from the start.

Exactly.

But if we flush at each revision, what's the point of the proxy buffer storage
in the first place? To be fair, there might be revisions from package loader
which are heavy in terms of data so that might be a good reason ¯\_(ツ)_/¯.

The original point of the buffer really was splitting the addition of contents and directories into smaller chunks, to avoid overflowing the memory of the backend with large batches of objects (e.g. all the contents of a firefox source tarball at once, to take a random example); I think you're being misled by the name (and grouper or splitter or chunker might have been a better choice). Grouping the addition of objects to reduce the RPC overhead and make larger batches is only an incidental effect.

Thus why i'm doing this clear_buffers explicitely only on hash collisions
now. Note that i also ask whether we only clear_buffers on contents... (the
clear_buffers allows to define what types we want to flush).

In the end, I'm kinda not convinced by this diff... I opened it because that's
what we said we'd do (even if the implementation changed slightly, because of
what i just said).

In the current state of affairs regarding how we deal with hash collision in
storage. I'd rather exclude the colliding hash(es) and resent the contents
without those. Like what we actually do in the journal.

What do you think?

I think that loaders and replayers are different things, with different integrity tradeoffs, and that we don't want their behavior to be the same.

[ snip HashCollision handling discussion, moved to D3012 ]

If we load a revision from a nixguix package URL, without making sure the underlying contents have been properly loaded, then the incremental nature of the loader will mean that there's a good chance we'll never try to load these contents again. If they disappear upstream, we have a hole in the graph forever.

I don't see any problem in flushing after parsing the objects for every revision. It may be slow (and even then, we've never actually profiled it). But this is the only way to keep the archive consistency, while maximizing the amount of data loaded and reference in the partial snapshot.

Revert to first implementation:

flush at every revision so we have consistent data stored per artifact
if exception during loading a revision is triggered, clear proxy internal state

Build is green

Patch application report for D2973 (id=10719)

Rebasing onto 0c35beca3c...

Current branch diff-target is up to date.

Changes applied before test

commit c3efa0aab09a303e96035a3940af50f54cda6622
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    This also adds the missing flushing step at each revision.
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/20/ for more details.

Harbormaster completed remote builds in B11838: Diff 10719.Apr 14 2020, 1:48 PM

Update

Build is green

Patch application report for D2973 (id=10720)

Rebasing onto 0c35beca3c...

Current branch diff-target is up to date.

Changes applied before test

commit 44d2c904de2f862e8b0daa507981690fb4e6c0c8
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    This also adds the missing flushing step at each revision.
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/21/ for more details.

Harbormaster completed remote builds in B11839: Diff 10720.Apr 14 2020, 1:59 PM

Fix spurious empty line

Build is green

Patch application report for D2973 (id=10721)

Rebasing onto 0c35beca3c...

Current branch diff-target is up to date.

Changes applied before test

commit 599b6fbadb9921106d7b95461b0bad865f341f69
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Tue Apr 7 16:25:16 2020 +0200

    package.loader: Clear proxy buffer state when failing to load revision
    
    This also adds the missing flushing step at each revision.
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/22/ for more details.

Harbormaster completed remote builds in B11840: Diff 10721.Apr 14 2020, 2:02 PM

ardumont retitled this revision from package.loader: Clear proxy buffer state when failing to load revision to package.loader: Insert consistently data to storage, clear buffer proxy state in case of error when skipping artifact.Apr 14 2020, 2:05 PM

ardumont edited the summary of this revision. (Show Details)

olasd accepted this revision.Apr 14 2020, 2:07 PM

This revision is now accepted and ready to land.Apr 14 2020, 2:07 PM

Closed by commit rDLDBASE599b6fbadb99: package.loader: Clear proxy buffer state when failing to load revision (authored by ardumont). · Explain WhyApr 14 2020, 2:09 PM

This revision was automatically updated to reflect the committed changes.

ardumont added a commit: rDLDBASE599b6fbadb99: package.loader: Clear proxy buffer state when failing to load revision.

package.loader: Insert consistently data to storage, clear buffer proxy state in case of error when skipping artifact
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Patch application report for D2973 (id=10567)

Changes applied before test

Patch application report for D2973 (id=10602)

Changes applied before test

Patch application report for D2973 (id=10603)

Changes applied before test

Patch application report for D2973 (id=10612)

Changes applied before test

Patch application report for D2973 (id=10612)

Changes applied before test

Patch application report for D2973 (id=10657)

Changes applied before test

Patch application report for D2973 (id=10719)

Changes applied before test

Patch application report for D2973 (id=10720)

Changes applied before test

Patch application report for D2973 (id=10721)

Changes applied before test

Revision Contents
Changeset List

Diff 10722

swh/loader/package/loader.py

swh/loader/package/nixguix/tests/test_nixguix.py

package.loader: Insert consistently data to storage, clear buffer proxy state in case of error when skipping artifactClosedPublicActions

Details

Diff Detail

Event Timeline

Patch application report for D2973 (id=10567)

Changes applied before test

Patch application report for D2973 (id=10602)

Changes applied before test

Patch application report for D2973 (id=10603)

Changes applied before test

Patch application report for D2973 (id=10612)

Changes applied before test

Patch application report for D2973 (id=10612)

Changes applied before test

Patch application report for D2973 (id=10657)

Changes applied before test

Patch application report for D2973 (id=10719)

Changes applied before test

Patch application report for D2973 (id=10720)

Changes applied before test

Patch application report for D2973 (id=10721)

Changes applied before test

Revision ContentsChangeset List

Diff 10722

swh/loader/package/loader.py

swh/loader/package/nixguix/tests/test_nixguix.py

package.loader: Insert consistently data to storage, clear buffer proxy state in case of error when skipping artifact
ClosedPublic
Actions

Revision Contents
Changeset List