[WIP] In case of race condition in content_add, raise SerializationFailure instead of HashCollision.
Needs ReviewPublic
Actions

Authored by vlorentz on Nov 11 2019, 12:08 AM.

Details

Reviewers

None

Group Reviewers

Reviewers

Maniphest Tasks

T2019: race condition during concurrent loading of the same objects from multiple origins

Summary

Partially resolves T2019, by raising the appropriate exception.

This keeps the retry logic on the client side. I can add it on the server side with a decorator to both functions, but it won't be pretty because it needs to create a new cursor, so the test won't work as-is.

Diff Detail

Repository

rDSTO Storage manager

Branch

T2019-SerializationFailure

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 8931
Build 13044: tox-on-jenkins	Jenkins
Build 13043: arc lint + arc unit

Event Timeline

vlorentz created this revision.Nov 11 2019, 12:08 AM

Herald added a reviewer: Reviewers. · View Herald TranscriptNov 11 2019, 12:08 AM

remove irrelevant TODO

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/772/ for more details.

Harbormaster completed remote builds in B8930: Diff 7753.Nov 11 2019, 12:10 AM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/773/ for more details.

Harbormaster completed remote builds in B8931: Diff 7754.Nov 11 2019, 12:12 AM

vlorentz edited the summary of this revision. (Show Details)Nov 11 2019, 12:13 AM

vlorentz retitled this revision from In case of race condition in content_add, raise SerializationFailure instead of HashCollision. to [WIP] In case of race condition in content_add, raise SerializationFailure instead of HashCollision..Nov 11 2019, 9:47 AM

Actually, I don't understand why the new SELECT is needed; the one in content_missing_from_list (called by content_missing, indirectly called by content_add) should have the same effect...

It looks like what matters is to SELECT only based on sha1 instead of all hashes. Weird...

Thanks for looking into this.

Forcing a serialization issue instead of a HashCollision is interesting, however...

I don't like that this tweaks the transaction isolation level within the function rather than when we open the transaction: set transaction isolation level must be the first thing called within the transaction, before any queries happen, so this should be handled within the transaction wrapper.

Having to do a bogus select to generate a fake read/write serialization dependency is also quite awful.

While I _think_ serializable is the isolation level that we would want for all queries, I don't know if we really want to eat that cost, for the benefit of "absolute correctness" on one query that fails 1 out of a million times and that we already retry on the client side anyway.

If we really want to go that route, we should also implement a read-only flag, to avoid read queries gratuitously generating serialization dependencies. In addition, the fact that the content_add transactions also include addition to the object storage doesn't play well with the caveats listed in the postgres docs.

Is there anything I'm missing on the status quo on this issue?

This are all good points. They don't seem to hard to address, I just wonder if they are worth spending time on.

Also I don't understand why the extra SELECT changes anything, and I'm hesitant to push a fix I don't fully understand

vlorentz mentioned this in D2783: storage(s): Identify and provide the collision hash(es) in HashCollision exception.Mar 6 2020, 3:49 PM

In D2248#53026, @olasd wrote:

If we really want to go that route, we should also implement a read-only flag, to avoid read queries gratuitously generating serialization dependencies. In addition, the fact that the content_add transactions also include addition to the object storage doesn't play well with the caveats listed in the postgres docs.

As far as I can tell, this particular point should not be an issue anymore, thanks to D2848 (which moved objstorage writes outside the transaction)

bchauvet mentioned this in T3957: Clean sentry reports.Apr 19 2022, 10:06 AM