
swh.journal.checker: Create a simple journal checker producer
ClosedPublic

Authored by ardumont on Mar 22 2017, 5:51 PM.

Diff Detail

Repository
rDJNL Journal infrastructure
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Remove unneeded dependency on swh-model

Fix docstring sentence formulation

olasd requested changes to this revision. Mar 23 2017, 9:25 AM

cursor.execute() on an anonymous (client-side) cursor will fetch all the results in client memory (before you even start getting the results), so that won't work at all.

Two solutions:

  • we can keep a server-side cursor, which makes the client fetch data in blocks
  • we can use COPY, which streams data to a file descriptor

If we want to make the dump efficient we will want one connection and therefore one stream per object type (so we can run in parallel).

As we will need to compare two tables the results need to be sorted. The postgres query planner won't use the index unless we force it to with set enable_seqscan to off.
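Such a sorted comparison needs only a single merge pass over the two streams. A minimal sketch in plain Python (the helper name is hypothetical, not part of this diff):

```python
def missing_ids(source_ids, mirror_ids):
    """Given two *sorted* iterables of ids, yield the ids present in
    source_ids but absent from mirror_ids, in one merge pass."""
    mirror = iter(mirror_ids)
    sentinel = object()
    current = next(mirror, sentinel)
    for sid in source_ids:
        # advance the mirror stream past ids smaller than sid
        while current is not sentinel and current < sid:
            current = next(mirror, sentinel)
        if current is sentinel or current != sid:
            yield sid
```

This is why the database must return sorted results: without the ordering guarantee, the comparison degenerates into holding one side fully in memory.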

The "backend" doesn't need the autoreconnect logic, or the automatic cursor creation logic, or the autocommit logic. It might as well just be a simple function that connects to the database, creates a cursor and returns the data.

Next steps for iteration in my opinion:

  • replace the backend with a simple function which connects to the DB and runs a query
  • use a server-side cursor instead of a client-side cursor (just add a name argument to the db.cursor() call)
  • make sure the database outputs sorted results fast

Once that works reliably we can look at adding some parallelism.
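The steps above can be sketched as one simple fetch function. This is only a sketch: the function names, cursor name, and block size are illustrative, not the final implementation.

```python
def pkey_query(table, pkey_cols):
    """Build the sorted primary-key query for one object type."""
    cols = ', '.join(pkey_cols)
    return 'select %s from %s order by %s' % (cols, table, cols)

def fetch_pkeys(db_conn_str, table, pkey_cols, block_size=10000):
    """Stream the sorted primary keys of `table` using a named
    (server-side) cursor, so rows arrive in blocks instead of being
    accumulated in client memory."""
    import psycopg2  # deferred so pkey_query stays usable standalone
    with psycopg2.connect(db_conn_str) as db:
        # the `name` argument is what makes this a server-side cursor
        cursor = db.cursor(name='checker_%s' % table)
        cursor.itersize = block_size
        cursor.execute(pkey_query(table, pkey_cols))
        yield from cursor
```

With one such function per object type (each on its own connection), the per-type streams can later be run in parallel.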

swh/journal/checker.py
33

this should be the temporary, id-only topic prefix

This revision now requires changes to proceed. Mar 23 2017, 9:25 AM

cursor.execute() on an anonymous (client-side) cursor will fetch all the results in client memory (before you even start getting the results), so that won't work at all.

But of course!

Two solutions:

  • we can keep a server-side cursor, which makes the client fetch data in blocks
  • we can use COPY, which streams data to a file descriptor

If we want to make the dump efficient we will want one connection and therefore one stream per object type (so we can run in parallel).

Ack.

As we will need to compare two tables the results need to be sorted.

Yes, I thought of sorting this morning.

The postgres query planner won't use the index unless we force it to with set enable_seqscan to off.

I had not thought of this being a problem, since we need to read the complete table anyway.
After discussion: reading the index is faster and lighter load-wise, so sure.

The "backend" doesn't need the autoreconnect logic, or the automatic cursor creation logic, or the autocommit logic. It might as well just be a simple function that connects to the database, creates a cursor and returns the data.

OK, clearly; I will simplify that.

Next steps for iteration in my opinion:

  • replace the backend with a simple function which connects to the DB and runs a query
  • use a server-side cursor instead of a client-side cursor (just add a name argument to the db.cursor() call)
  • make sure the database outputs sorted results fast

Right, on it.

Once that works reliably we can look at adding some parallelism.

Ok.

Thanks a bunch.

ardumont edited edge metadata.

swh.journal.checker: Optimize the db identifiers reading

Rework according to latest review.

It remains to find a way to pass the option 'set enable_seqscan=off'.
I have not found the right stanza yet and am still looking for it.

  • swh.journal.checker: Deal with composite primary key object
  • swh.journal.checker: Unify option name to temporary_prefix
  • swh.journal.publisher: adapt data to read as dict of composite pkey
ardumont marked an inline comment as done. Edited Mar 23 2017, 3:41 PM

Related change in storage rDSTO47cb71b1a61c34e9e27c701630ab6374ce8e359d

As the checker writes to the same queue as the listener, both must respect the same contract, that is, send the same data.
It turned out that some objects were sent as dicts (origin_visit, skipped_content) and some directly as the primary key.
The previous commit adapted this so that every object sends a dict of the keys composing the primary key (or acknowledged as such).
Now every object sends those:

  • content is aligned with skipped_content and has {sha1, sha1_git, sha256}
  • origin_visit has {origin, visit}
  • revision, origin, release, and directory have {id}.
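The per-type contract above can be made explicit with a small mapping. This is a sketch assuming the standard storage object types; the table and helper name are illustrative, not the actual diff:

```python
# primary-key columns per object type, as listed above
PKEY_COLS = {
    'content': ('sha1', 'sha1_git', 'sha256'),
    'skipped_content': ('sha1', 'sha1_git', 'sha256'),
    'origin_visit': ('origin', 'visit'),
    'revision': ('id',),
    'origin': ('id',),
    'release': ('id',),
    'directory': ('id',),
}

def row_to_pkey_dict(object_type, row):
    """Convert one fetched row (a tuple) into the dict of primary-key
    columns that the publisher expects on the queue."""
    return dict(zip(PKEY_COLS[object_type], row))
```

Sending the same dict shape from both checker and listener is what keeps the queue contract uniform.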

Rebase

  • swh.journal.checker: Reads and sends back all storage objects to queue
  • swh.journal.checker: Use the simple checker producer implementation
  • swh.journal.checker: Optimize the identifiers reading
  • swh.journal.checker: Deal with composite primary key object
  • swh.journal.checker: Unify option name to temporary_prefix
  • swh.journal.publisher: adapt data to read as dict of composite pkey

It remains to find a way to pass the option 'set enable_seqscan=off'.
I have not found the right stanza yet and am still looking for it.

No need. Turns out that using the named cursor from psycopg2 (which uses postgresql's server-side cursor) is enough.

$ psql service=swh-dev
psql (9.6.2)
Type "help" for help.

softwareheritage-dev=# explain declare cur cursor for select id from origin order by id;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Only Scan using origin_pkey on origin  (cost=0.15..57.30 rows=610 width=8)
(1 row)

This is already using the index.

The so-called backend should really be turned into a function now. Furthermore, it's not a journal backend, so it shouldn't be in swh.journal.backend. It probably won't be used elsewhere, so you might just put the function in the checker?

I don't think there's much more to say, this should be ready to go in my opinion.

In D199#4091, @olasd wrote:

The so-called backend should really be turned into a function now.

Sure

Furthermore, it's not a journal backend, so it shouldn't be in swh.journal.backend.

I did not mean it as a journal backend, indeed.
I meant it as a means to access the storage backend.

It probably won't be used elsewhere, so you might just put the function in the checker?

Fair enough.

I don't think there's much more to say, this should be ready to go in my opinion.

OK. I will adapt and update the diff for the sake of being exhaustive, then push.

Thanks.

  • swh.journal.checker: Move fetch function back in checker module