Page MenuHomeSoftware Heritage

swh.journal.checker: Create a simple journal checker producer
ClosedPublic

Authored by ardumont on Mar 22 2017, 5:51 PM.

Diff Detail

Repository
rDJNL Journal infrastructure
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 829
Build 1111: arc lint + arc unit

Event Timeline

Remove unneeded dependency on swh-model

Fix docstring sentence formulation

olasd requested changes to this revision.Mar 23 2017, 9:25 AM

cursor.execute() on an anonymous (client-side) cursor will fetch all the results in client memory (before you even start getting the results), so that won't work at all.

Two solutions:

  • we can keep a server side cursor, which makes the client fetch data in blocks
  • we can use copy which streams data to a file descriptor

If we want to make the dump efficient we will want one connection and therefore one stream per object type (so we can run in parallel).

As we will need to compare two tables the results need to be sorted. The postgres query planner won't use the index unless we force it to with set enable_seqscan to off.

The "backend" doesn't need the autoreconnect logic, or the automatic cursor creation logic, or the autocommit logic. It might as well just be a simple function that connects to the database, creates a cursor and returns the data.

Next steps for iteration in my opinion:

  • replace the backend with a simple function which connects to the DB and runs a query
  • use a server-side cursor instead of a client-side cursor (just add a name argument to the db.cursor() call)
  • make sure the database outputs sorted results fast

Once that works reliably we can look at adding some parallelism.

swh/journal/checker.py
33

this should be the temporary, id-only topic prefix

This revision now requires changes to proceed.Mar 23 2017, 9:25 AM

cursor.execute() on an anonymous (client-side) cursor will fetch all the results in client memory (before you even start getting the results), so that won't work at all.

But of course!

Two solutions:

  • we can keep a server side cursor, which makes the client fetch data in blocks
  • we can use copy which streams data to a file descriptor

If we want to make the dump efficient we will want one connection and therefore one stream per object type (so we can run in parallel).

Ack.

As we will need to compare two tables the results need to be sorted.

Yes, I thought of sorting this morning.

The postgres query planner won't use the index unless we force it to with set enable_seqscan to off.

I did not thought of this being a problem since we need to read the complete table.
After discussion, reading the index is faster and less heavy load wise so sure.

The "backend" doesn't need the autoreconnect logic, or the automatic cursor creation logic, or the autocommit logic. It might as well just be a simple function that connects to the database, creates a cursor and returns the data.

ok, clearly, I will simplify that.

Next steps for iteration in my opinion:

  • replace the backend with a simple function which connects to the DB and runs a query
  • use a server-side cursor instead of a client-side cursor (just add a name argument to the db.cursor() call)
  • make sure the database outputs sorted results fast

Right, on it.

Once that works reliably we can look at adding some parallelism.

Ok.

Thanks a bunch.

ardumont edited edge metadata.

swh.journal.checker: Optimize the db identifiers reading

Rework according to latest review.

Remains to find the way to pass the option 'set enable_seqscan=off'.
I did not find the right stanza for now and I'm still looking for it.

  • swh.journal.checker: Deal with composite primary key object
  • swh.journal.checker: Unify option name to temporary_prefix
  • swh.journal.publisher: adapt data to read as dict of composite pkey
ardumont marked an inline comment as done.EditedMar 23 2017, 3:41 PM

Related change in storage rDSTO47cb71b1a61c34e9e27c701630ab6374ce8e359d

As the checker does write to the same queue as the listener, those must respect the same contract, that is send the same data.
Turned out that some were dict (origin_visit, skipped_content) and some directly the primary key.
The previous commit adapted that for all objects to send dict of keys compositing the primary key (or acknowledged as such).
Now every object sends those:

  • content is aligned with skipped_content and have {sha1, sha1_git, sha256}
  • origin_visit has {origin, visit}
  • revision, origin, release, dictionary have {id}.

Rebase

  • swh.journal.checker: Reads and sends back all storage objects to queue
  • swh.journal.checker: Use the simple checker producer implementation
  • swh.journal.checker: Optimize the identifiers reading
  • swh.journal.checker: Deal with composite primary key object
  • swh.journal.checker: Unify option name to temporary_prefix
  • swh.journal.publisher: adapt data to read as dict of composite pkey

Remains to find the way to pass the option 'set enable_seqscan=off'.
I did not find the right stanza for now and I'm still looking for it.

No need. Turns out that using the named cursor from psycopg2 (which uses postgresql's server-side cursor) is enough.

$ psql service=swh-dev                                            
master psql (9.6.2)
Type "help" for help.

softwareheritage-dev=# explain declare cur cursor for select id from origin order by id;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Only Scan using origin_pkey on origin  (cost=0.15..57.30 rows=610 width=8)
(1 row)

This is already using the index.

The so-called backend should really be turned into a function now. Furthermore, it's not a journal backend, so it shouldn't be in swh.journal.backend. It probably won't be used elsewhere, so you migh just put the function in the checker?

I don't think there's much more to say, this should be ready to go in my opinion.

In D199#4091, @olasd wrote:

The so-called backend should really be turned into a function now.

Sure

Furthermore, it's not a journal backend, so it shouldn't be in swh.journal.backend.

I did not mean it as a journal backend indeed.
I meant it as a mean to access the storage backend.

It probably won't be used elsewhere, so you migh just put the function in the checker?

Fair enough.

I don't think there's much more to say, this should be ready to go in my opinion.

Ok. Will adapt, update the diff for the sake of being exhaustive.
And push.

Thanks.

  • swh.journal.checker: Move fetch function back in checker module