Page MenuHomeSoftware Heritage

Make postgresql's origin_add not raise an error in case of conflict
ClosedPublic

Authored by douardda on May 4 2021, 4:11 PM.

Details

Summary

there is no need for an url insertion in the origin table to result in a
unicity error. Conflicting insertion of the same URL in this table may
happen in case of concurrent process (loading or in a replayer session).

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5671 (id=20257)

Rebasing onto 051b771523...

Current branch diff-target is up to date.
Changes applied before test
commit 94456f66618b03de818c00ef22b0c122b8140cd4
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 4 16:06:02 2021 +0200

    Make postgresql's origin_add not raise an error in case of conflict
    
    there is no need for an url insertion in the origin table to result in a
    unicity error. Conflicting insertion of the same URL in this table may
    happen in case of concurrent process (loading or in a replayer session).

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1302/ for more details.

ardumont added inline comments.
swh/storage/postgresql/db.py
978

ok for the main change but that one makes me wonder.
If the returned result is uneeded [1], might as well change the signature to None.

Also, please add type to the impacted method.

[1] that's somehow implicit here as the meaning in the returned value is not the same.

swh/storage/postgresql/db.py
978

The return value is used in postgresql.Storage.origin_add to return a counter of added origins:

[...]
added = 0
for url in to_add:
    if db.origin_add(url, cur):
        added += 1
return {"origin:add": added}
This revision is now accepted and ready to land.May 5 2021, 12:17 PM
ardumont added inline comments.
swh/storage/postgresql/db.py
978

right, i had forgotten that part.

Thanks.

So i gather cur.rowcount somehow returns something falsy [1] when the on conflict
got caught which then do not increment stats in the main caller.

[1] this looks fine:

In [11]: 0 if None else 1
Out[11]: 1

In [12]: 0 if False else 1
Out[12]: 1

In [13]: 0 if 0 else 1
Out[13]: 1
swh/storage/postgresql/db.py
978

I was unsure of the behavior of cur.rowcount with the 'on conflict', so I checked (pg13 in a pifpaf session) and it seems to work as expected (aka only count the number of inserted rows):

In [22]: c.execute("select * from origin")
In [23]: c.fetchall()
Out[23]: [(1, 'http://foo'), (2, 'http://bar'), (8, 'http://baz'), (11, 'http://biz')]
In [31]: execute_values(c, "INSERT INTO origin (url) values %s ON CONFLICT DO NOTHING", (("http://bar",),("http://toto",), ("http://foo",), ("http://tutu",) ))
In [32]: c.rowcount
Out[32]: 2

Build is green

Patch application report for D5671 (id=20288)

Rebasing onto ffb38f71d9...

Current branch diff-target is up to date.
Changes applied before test
commit 77ef651d958285c3d91b10961f5a62a7612f9354
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 4 16:06:02 2021 +0200

    Make postgresql's origin_add not raise an error in case of conflict
    
    there is no need for an url insertion in the origin table to result in a
    unicity error. Conflicting insertion of the same URL in this table may
    happen in case of concurrent process (loading or in a replayer session).

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1306/ for more details.