Add a flag to copy objects only if they don't exist in the destination
Closed, Public

Authored by olasd on Mar 3 2020, 1:58 PM.

Details

Summary

This trades bandwidth/processing time for more API queries, which can be a win
if your exclusion file is a bit stale.
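The trade-off can be sketched as follows. This is a minimal illustration, not the code from this diff: `InMemoryObjStorage`, `replay`, and the `check_dst` parameter are hypothetical names, and the in-memory store only stands in for a real object storage so that API calls can be counted.

```python
from typing import Dict, FrozenSet, Iterable


class InMemoryObjStorage:
    """Toy object store that counts calls (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.objects: Dict[bytes, bytes] = {}
        self.calls = {"contains": 0, "get": 0, "add": 0}

    def contains(self, obj_id: bytes) -> bool:
        self.calls["contains"] += 1
        return obj_id in self.objects

    def get(self, obj_id: bytes) -> bytes:
        self.calls["get"] += 1
        return self.objects[obj_id]

    def add(self, obj_id: bytes, content: bytes) -> None:
        self.calls["add"] += 1
        self.objects[obj_id] = content


def replay(
    obj_ids: Iterable[bytes],
    src: InMemoryObjStorage,
    dst: InMemoryObjStorage,
    exclude: FrozenSet[bytes] = frozenset(),
    check_dst: bool = False,
) -> int:
    """Copy objects from src to dst and return how many were copied.

    Assumed behavior: objects listed in `exclude` are always skipped. With
    `check_dst` set, one extra query per remaining object asks the
    destination whether it already holds the object, and the fetch and
    write are skipped if it does.
    """
    copied = 0
    for obj_id in obj_ids:
        if obj_id in exclude:
            continue  # known to be in the destination when the file was built
        if check_dst and dst.contains(obj_id):
            continue  # exclusion file was stale; skip the transfer
        dst.add(obj_id, src.get(obj_id))
        copied += 1
    return copied
```

In this sketch, every object that a stale exclusion file misses costs one `contains` query with the flag enabled, instead of one `get` plus one `add` without it.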

Depends on D2756.

Test Plan

New test added to check the behavior with and without the flag.

Diff Detail

Repository
rDJNL Journal infrastructure
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

What about updating the exclusion file instead?

What about updating the exclusion file instead?

That's not an either/or affair IMO. By definition as long as the replayer is running, the exclusion file is always out of date (and updating it is quite costly).

I also don't know where the code used to generate the exclusion file is...

In D2757#65748, @olasd wrote:

That's not an either/or affair IMO. By definition as long as the replayer is running, the exclusion file is always out of date (and updating it is quite costly).

But it doesn't matter if it doesn't see the same objects too many times

I also don't know where the code used to generate the exclusion file is...

https://forge.softwareheritage.org/source/snippets/browse/master/vlorentz/objstorage-replay-exclusion-file/

but you have to ask @seirl to generate unsorted_inventory.txt.gz for you

In D2757#65748, @olasd wrote:

That's not an either/or affair IMO. By definition as long as the replayer is running, the exclusion file is always out of date (and updating it is quite costly).

But it doesn't matter if it doesn't see the same objects too many times

I really can't tell if it matters or not until I have a knob I can turn on/off to compare. In any case, it certainly doesn't hurt to have the option.

I also don't know where the code used to generate the exclusion file is...

https://forge.softwareheritage.org/source/snippets/browse/master/vlorentz/objstorage-replay-exclusion-file/

but you have to ask @seirl to generate unsorted_inventory.txt.gz for you

So this diff is somewhat needed anyway. We're currently excluding around 5-10% of writes, which is substantial (although, overall, we're currently slower than the fastest we've been in the past).

This revision is now accepted and ready to land. Mar 3 2020, 3:55 PM