Reads objects from storage and sends them back to the publisher queue.
Related T529
Differential D199
swh.journal.checker: Create a simple journal checker producer
Authored by ardumont on Mar 22 2017, 5:51 PM.
cursor.execute() on an anonymous (client-side) cursor will fetch all the results in client memory (before you even start getting the results), so that won't work at all. Two solutions:
If we want to make the dump efficient, we will want one connection, and therefore one stream, per object type (so we can run in parallel).

As we will need to compare two tables, the results need to be sorted. The postgres query planner won't use the index unless we force it to with set enable_seqscan to off.

The "backend" doesn't need the autoreconnect logic, or the automatic cursor creation logic, or the autocommit logic. It might as well just be a simple function that connects to the database, creates a cursor and returns the data.

Next steps for iteration in my opinion:
Once that works reliably we can look at adding some parallelism.
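
To illustrate why the comment above asks for sorted results: once both sides arrive in ascending order, they can be compared in a single streaming pass with constant memory. The sketch below is only an illustration under that assumption; the name diff_sorted and the "left_only"/"right_only" labels are hypothetical and do not come from the diff under review.

```python
from typing import Any, Iterable, Iterator, Tuple


def diff_sorted(left: Iterable[Any], right: Iterable[Any]) -> Iterator[Tuple[str, Any]]:
    """Yield (side, id) for every id present on one side only.

    Both inputs must be sorted in ascending order and free of duplicates;
    this is what makes a single pass over each stream sufficient.
    """
    left_it, right_it = iter(left), iter(right)
    done = object()  # sentinel marking an exhausted stream
    a = next(left_it, done)
    b = next(right_it, done)
    while a is not done and b is not done:
        if a == b:
            # Present on both sides: advance both streams.
            a = next(left_it, done)
            b = next(right_it, done)
        elif a < b:
            # a can no longer appear on the right, which is already past it.
            yield ("left_only", a)
            a = next(left_it, done)
        else:
            yield ("right_only", b)
            b = next(right_it, done)
    # Whatever remains on either side has no counterpart.
    while a is not done:
        yield ("left_only", a)
        a = next(left_it, done)
    while b is not done:
        yield ("right_only", b)
        b = next(right_it, done)
```

Because each stream is consumed lazily, the inputs can be database cursors rather than in-memory lists.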
But of course!
Ack.
Yes, I thought of sorting this morning.
I did not think of this as a problem, since we need to read the complete table.
OK, clearly. I will simplify that.
Right, on it.
Ok. Thanks a bunch.

swh.journal.checker: Optimize the db identifiers reading

Rework according to latest review. What remains is to find a way to pass the option 'set enable_seqscan=off'.
Related change in storage: rDSTO47cb71b1a61c34e9e27c701630ab6374ce8e359d

As the checker writes to the same queue as the listener, both must respect the same contract, that is, send the same data.
Rebase
No need. Turns out that using the named cursor from psycopg2 (which uses postgresql's server-side cursor) is enough.

    $ psql service=swh-dev master
    psql (9.6.2)
    Type "help" for help.

    softwareheritage-dev=# explain declare cur cursor for select id from origin order by id;
                                        QUERY PLAN
    ----------------------------------------------------------------------------------
     Index Only Scan using origin_pkey on origin  (cost=0.15..57.30 rows=610 width=8)
    (1 row)

This is already using the index.

The so-called backend should really be turned into a function now. Furthermore, it's not a journal backend, so it shouldn't be in swh.journal.backend. It probably won't be used elsewhere, so you might just put the function in the checker?

I don't think there's much more to say; this should be ready to go in my opinion.

Sure.
I did not mean it as a journal backend indeed.
Fair enough.
Ok. Will adapt, update the diff for the sake of being exhaustive. Thanks.
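
For readers following the thread, the outcome of the review (a plain function living in the checker that streams sorted identifiers through a psycopg2 named cursor) could look roughly like the sketch below. It is not the code from D199: the names read_ids, KNOWN_TABLES and batch_size are hypothetical, and only the origin table and its id column are taken from the discussion.

```python
from typing import Any, Iterator

# Hypothetical allowlist; "origin" is the only table named in the review.
KNOWN_TABLES = frozenset({"origin"})


def read_ids(conn: Any, table: str, batch_size: int = 10000) -> Iterator[Any]:
    """Yield the ids of `table` in ascending order, one batch at a time.

    `conn` is assumed to be a psycopg2 connection left in its default
    (non-autocommit) mode, since a named cursor lives inside a transaction.
    """
    if table not in KNOWN_TABLES:
        # The table name is interpolated into the query, so reject anything unknown.
        raise ValueError("unknown table: %r" % (table,))
    # Giving the cursor a name makes psycopg2 declare a server-side cursor,
    # so rows are streamed instead of being loaded into client memory at once.
    with conn.cursor(name="checker_%s_ids" % table) as cur:
        cur.itersize = batch_size  # rows fetched per network round trip
        cur.execute("select id from %s order by id" % table)
        for (object_id,) in cur:
            yield object_id


# Usage (assumes psycopg2 is installed and the swh-dev service is reachable):
#   import psycopg2
#   conn = psycopg2.connect("service=swh-dev")
#   for object_id in read_ids(conn, "origin"):
#       ...
```

One connection per object type, each running its own read_ids call, would give the one-stream-per-type layout suggested earlier in the review.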