
swh-journal: Create a journal checker comparing object lists between journal and database
Closed, Migrated

Description

A journal checker would queue all the objects missing from the journal onto the journal's new-objects queue.


Event Timeline

Trying to think about the implementation and clarify my understanding.

The checker is a subscriber to the publisher's topics and knows how to
read all the objects (objects being anything from content, origin,
origin_visit, etc...) from them.

Its goal is to compute the difference against the main storage and send
the objects missed by the listener back to the publisher's topics.
(So it's also a producer.)

The main difficulty here is the diff implementation, since the data
volume is substantial.

Implementations

  1. Using the storage db with a temporary table: read all the objects from the publisher's topics, write their identifiers to an associated temporary table (with an index), then run a query to diff the missing objects between the tables (a sketch of this approach follows the list below).
  2. Using the storage db with a permanent table: same implementation, except the tables are not temporary, so we can leverage the incremental nature of the client to be faster later on.
  3. Using files and diff: read all the objects from the publisher's journal and write their identifiers to a temporary file; read all the objects from the db and write their identifiers to another temporary file; run an actual diff on the sorted files and send back the missing objects.
  4. Using RAM: use sets of object identifiers for both the storage and the journal.
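
To make option 1 concrete, here is a rough sketch assuming a PostgreSQL storage reachable through psycopg2; the table and column names (content / sha1_git) and the way the journal-side identifiers are obtained are illustrative assumptions, not the actual swh-storage schema or API.

```python
import psycopg2


def content_missing_from_journal(db_dsn, journal_ids):
    """Return identifiers present in the main storage but absent from the
    journal, using an indexed temporary table to hold the journal-side ids."""
    with psycopg2.connect(db_dsn) as db:
        with db.cursor() as cur:
            # Temporary table dropped at the end of the transaction; the
            # primary key provides the index mentioned above.
            cur.execute(
                'CREATE TEMPORARY TABLE tmp_journal_content '
                '(id bytea PRIMARY KEY) ON COMMIT DROP'
            )
            # Write the identifiers read from the publisher's topics.
            cur.executemany(
                'INSERT INTO tmp_journal_content (id) VALUES (%s) '
                'ON CONFLICT DO NOTHING',
                ((oid,) for oid in journal_ids),
            )
            # Diff: everything the storage knows that the journal never saw.
            cur.execute(
                'SELECT c.sha1_git FROM content c '
                'LEFT JOIN tmp_journal_content t ON t.id = c.sha1_git '
                'WHERE t.id IS NULL'
            )
            return [bytes(row[0]) for row in cur.fetchall()]
```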

Pros/Cons

As I'm not able to tell which is most reasonable, here are the
pros/cons I see for those possible implementations:

Solution 1 (temporary table):
  - pros: supposedly fast (with the right indexes); misses/failures from a previous run are eventually caught up on
  - cons: duplicates the main objects' tables (+ index)

Solution 2 (permanent table):
  - pros: same as 1.; incremental
  - cons: duplicates the main objects' tables (+ index); incremental behaviour was not asked for; possible failures from a prior run won't be detected

Solution 3 (files and diff):
  - pros: no overhead on the db (regarding writes); no data duplication; misses/failures from a previous run are eventually caught up on
  - cons: (none noted)

Solution 4 (RAM):
  - pros: same as 3.
  - cons: I don't think it's possible with the volumetry we have

Apparently, I'm trying too hard to think about it, since 4. should be the one (if I understood correctly what was said on IRC).


That's not what I said; the concept is simple, the implementation is not :)

There's another solution that you didn't consider:

  5. just send all the objects again

The way Kafka works is that it will compact messages with duplicate identifiers (removing old versions and keeping only the most recent one). So we don't _really_ need to do incremental stuff; we can just push all the objects and be done with it.
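
For illustration, here is a minimal sketch of what "pushing an object again" looks like, assuming the journal topics are configured with cleanup.policy=compact and that messages are keyed by the object identifier; the topic name and payloads below are made up for the example.

```python
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})
topic = 'swh.journal.objects.content'  # assumed name, assumed to be compacted

object_id = bytes.fromhex('34973274ccef6ab4dfaaf86599792fa9c3fe4689')

# Producing the same key twice is harmless: on a compacted topic, Kafka
# eventually keeps only the most recent message for each key, so re-sending
# every object simply overwrites what is already there.
producer.produce(topic, key=object_id, value=b'old serialization of the object')
producer.produce(topic, key=object_id, value=b'new serialization of the object')
producer.flush()
```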

On one hand this will cause a lot of churn for consumers that manage to keep up with the live stream (as they'll get all the objects again), but on the other hand it's something we will have to implement anyway when we improve the metadata for objects (e.g. when we do a schema change, when we add identifiers, ...), so we might as well implement it now.

We still need an incremental solution, as the storage listener component is not reliable at all. I think solution 2 is the only reasonable incremental solution, considering that Kafka doesn't let us access the index it uses for log compaction. As the catch-up operation is something that we don't need to do often or fast, we can store that data on the spinning-storage database cluster, and do the comparison between the two databases on the client side.
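
As an illustration of that client-side comparison, here is a sketch assuming both the main storage and the checker's permanent index can stream their identifiers in sorted order; the connection strings, table and column names are placeholders, not the real schemas.

```python
import psycopg2


def stream_ids(dsn, query):
    """Stream identifiers with a server-side (named) cursor, so the whole
    id list never has to fit in RAM."""
    with psycopg2.connect(dsn) as db:
        with db.cursor(name='id_stream') as cur:
            cur.execute(query)
            for (object_id,) in cur:
                yield bytes(object_id)


def missing_ids(storage_ids, journal_index_ids):
    """Yield ids present in the storage stream but absent from the journal
    index stream; both iterators must be sorted in ascending order."""
    journal_index_ids = iter(journal_index_ids)
    current = next(journal_index_ids, None)
    for storage_id in storage_ids:
        while current is not None and current < storage_id:
            current = next(journal_index_ids, None)
        if current != storage_id:
            yield storage_id


# Hypothetical connection strings for the two databases.
storage_dsn = 'service=swh-storage'
checker_dsn = 'service=swh-journal-checker'

missing = missing_ids(
    stream_ids(storage_dsn, 'SELECT sha1_git FROM content ORDER BY sha1_git'),
    stream_ids(checker_dsn, 'SELECT id FROM journal_content ORDER BY id'),
)
```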

To implement solutions 2 and 5, we can break the checker down into two components:

  • a checker producer, which lists the object ids present in the main database, optionally diffs them against the checker index, and pushes the identifiers of missing objects to the temporary topics
  • a checker consumer, which keeps an index of the object ids that exist in the swh journal in permanent storage

Implementing the simple checker producer (which doesn't read any persistent journal index) should be a reasonable first step.
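
A minimal sketch of that simple checker producer, assuming a PostgreSQL main storage and a Kafka journal reachable with confluent-kafka; the topic, table and column names are illustrative, not the actual swh-journal configuration.

```python
import psycopg2
from confluent_kafka import Producer


def simple_checker_producer(db_dsn, brokers,
                            topic='swh.journal.objects.content'):
    """List all object ids from the main database and push them to the
    journal topic, without consulting any persistent journal index."""
    producer = Producer({'bootstrap.servers': brokers})
    with psycopg2.connect(db_dsn) as db:
        with db.cursor(name='checker_ids') as cur:  # server-side cursor
            cur.execute('SELECT sha1_git FROM content')
            for (object_id,) in cur:
                # Keyed by object id, so log compaction deduplicates objects
                # that get re-sent across checker runs.
                producer.produce(topic, key=bytes(object_id),
                                 value=bytes(object_id))
                producer.poll(0)  # serve delivery callbacks as we go
    producer.flush()
```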

the implementation is not :)

I don't find the implementation easy either :D and I thought I was alone at that IRC moment (life crisis and all :).
So thanks for the clarification.

  5. just send all the objects again

Indeed, missed it. Thanks.

To implement solutions 2 and 5, we can break down the checker in two components:...

I like this. Thanks.

Implementing the simple checker producer (which doesn't read any persistent journal index) should be a reasonable first step.

I assume that by "no persistent journal index" you mean not yet using the spinning storage db, thus reading the journal directly (well, optionally).

I have almost something like that. I'll make you a diff if you don't mind.

thus reading the journal directly (well, optionally).

After discussion, there is no need for the option to read the journal in the first simple checker producer implementation (5.).

I have almost something like that. I'll make you a diff if you don't mind.

I'm thus simplifying what I have and improving the docstrings.

ardumont renamed this task from Create a journal checker that compares object lists from the journal and from the database to swh-journal - Create a journal checker comparing object lists between journal and database. Oct 18 2018, 3:53 PM
ardumont renamed this task from swh-journal - Create a journal checker comparing object lists between journal and database to swh-journal: Create a journal checker comparing object lists between journal and database.
olasd added a subscriber: seirl.

Thanks to @seirl using the full journal to do a graph export (and therefore having the time to check whether all objects were there), we've found a bunch of bugs in the journal backfiller / configuration preventing large objects from being added.

T2350

I've ended up implementing an "id by id" backfiller to resolve T2351. However, this happened in the middle of a large refactoring of swh.storage and swh.journal, so the code needs a (painful) rebase, for not much gain.

I'll close this and we can revisit it if we actually need such a component again.