
swh-journal: Create a journal checker comparing object lists between journal and database
Closed, Migrated

Description

A journal checker would queue all the objects missing from the journal onto the journal's new-objects queue.


Event Timeline

Trying to think about the implementation and clarify my understanding.

The checker is a subscriber to the publisher's topics and knows how to
read all the objects (objects being anything from content, origin,
origin_visit, etc...) from them.

Its goal is to compute the difference against the main storage and send
the objects missed by the listener back to the publisher's topics.
(So it's also a producer.)

The main difficulty here is the diff implementation, since the data
volume is substantial.

Implementations

  1. Using the storage db with a temporary table: read all the objects from the publisher's topics, write their identifiers to an associated temporary table (with an index), then run a query to diff the missing objects between the tables (a sketch of this approach follows the list below).
  2. Using the storage db with a permanent table: same implementation, except the tables are not temporary, so we can leverage the incremental nature of the client to be faster later on.
  3. Using files and diff: read all the objects from the publisher's journal and write their identifiers to a temporary file; read all the objects from the db and write their identifiers to another temporary file; run an actual diff on the sorted files and send back the missing objects.
  4. Using RAM: use sets of object identifiers for both the storage and the journal.
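
To make option 1 concrete, here is a rough sketch assuming a PostgreSQL storage reachable through psycopg2; the table and column names (content / sha1_git) and the way the journal-side identifiers are obtained are illustrative assumptions, not the actual swh-storage schema or API.

```python
import psycopg2


def content_missing_from_journal(db_dsn, journal_ids):
    """Return identifiers present in the main storage but absent from the
    journal, using an indexed temporary table to hold the journal-side ids."""
    with psycopg2.connect(db_dsn) as db:
        with db.cursor() as cur:
            # Temporary table dropped at the end of the transaction; the
            # primary key provides the index mentioned above.
            cur.execute(
                'CREATE TEMPORARY TABLE tmp_journal_content '
                '(id bytea PRIMARY KEY) ON COMMIT DROP'
            )
            # Write the identifiers read from the publisher's topics.
            cur.executemany(
                'INSERT INTO tmp_journal_content (id) VALUES (%s) '
                'ON CONFLICT DO NOTHING',
                ((oid,) for oid in journal_ids),
            )
            # Diff: everything the storage knows that the journal never saw.
            cur.execute(
                'SELECT c.sha1_git FROM content c '
                'LEFT JOIN tmp_journal_content t ON t.id = c.sha1_git '
                'WHERE t.id IS NULL'
            )
            return [bytes(row[0]) for row in cur.fetchall()]
```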

Pros/Cons

As I'm not able to tell which is most reasonable, here are the
pros/cons I see for those possible implementations:

Solution 1 (temporary table):
  - pros: supposedly fast (with the right indexes); misses/failures from a previous run are eventually caught up on
  - cons: duplicates the main objects' tables (+ index)

Solution 2 (permanent table):
  - pros: same as 1.; incremental
  - cons: duplicates the main objects' tables (+ index); incremental behaviour was not asked for; possible failures from a prior run won't be detected

Solution 3 (files and diff):
  - pros: no overhead on the db (regarding writes); no data duplication; misses/failures from a previous run are eventually caught up on
  - cons: (none noted)

Solution 4 (RAM):
  - pros: same as 3.
  - cons: I don't think it's possible with the volumetry we have

Apparently, I'm trying too hard to think about it, since 4. should be the one (if I understood correctly what was said on IRC).


That's not what I said; the concept is simple, the implementation is not :)

There's another solution that you didn't consider:

  5. just send all the objects again

The way Kafka works is that it will compact messages with duplicate identifiers (removing old versions and keeping only the most recent one). So we don't _really_ need to do incremental stuff; we can just push all the objects and be done with it.
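
For illustration, here is a minimal sketch of what "pushing an object again" looks like, assuming the journal topics are configured with cleanup.policy=compact and that messages are keyed by the object identifier; the topic name and payloads below are made up for the example.

```python
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})
topic = 'swh.journal.objects.content'  # assumed name, assumed to be compacted

object_id = bytes.fromhex('34973274ccef6ab4dfaaf86599792fa9c3fe4689')

# Producing the same key twice is harmless: on a compacted topic, Kafka
# eventually keeps only the most recent message for each key, so re-sending
# every object simply overwrites what is already there.
producer.produce(topic, key=object_id, value=b'old serialization of the object')
producer.produce(topic, key=object_id, value=b'new serialization of the object')
producer.flush()
```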

On one hand this will cause a lot of churn for consumers that manage to keep up with the live stream (as they'll get all the objects again), but on the other hand it's something we will have to implement anyway when we improve the metadata for objects (e.g. when we do a schema change, when we add identifiers, ...), so we might as well implement it now.

We still need an incremental solution, as the storage listener component is not reliable at all. I think solution 2 is the only reasonable incremental solution, considering that Kafka doesn't let us access the index it uses for log compaction. As the catch-up operation is something that we don't need to do often or fast, we can store that data on the spinning-storage database cluster, and do the comparison between the two databases on the client side.
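
As an illustration of that client-side comparison, here is a sketch assuming both the main storage and the checker's permanent index can stream their identifiers in sorted order; the connection strings, table and column names are placeholders, not the real schemas.

```python
import psycopg2


def stream_ids(dsn, query):
    """Stream identifiers with a server-side (named) cursor, so the whole
    id list never has to fit in RAM."""
    with psycopg2.connect(dsn) as db:
        with db.cursor(name='id_stream') as cur:
            cur.execute(query)
            for (object_id,) in cur:
                yield bytes(object_id)


def missing_ids(storage_ids, journal_index_ids):
    """Yield ids present in the storage stream but absent from the journal
    index stream; both iterators must be sorted in ascending order."""
    journal_index_ids = iter(journal_index_ids)
    current = next(journal_index_ids, None)
    for storage_id in storage_ids:
        while current is not None and current < storage_id:
            current = next(journal_index_ids, None)
        if current != storage_id:
            yield storage_id


# Hypothetical connection strings for the two databases.
storage_dsn = 'service=swh-storage'
checker_dsn = 'service=swh-journal-checker'

missing = missing_ids(
    stream_ids(storage_dsn, 'SELECT sha1_git FROM content ORDER BY sha1_git'),
    stream_ids(checker_dsn, 'SELECT id FROM journal_content ORDER BY id'),
)
```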

To implement solutions 2 and 5, we can break the checker down into two components:

  • a checker producer, which lists the object ids present in the main database, optionally diffs them against the checker index, and pushes the identifiers of missing objects to the temporary topics
  • a checker consumer, which keeps an index of the object ids that exist in the swh journal in permanent storage

Implementing the simple checker producer (which doesn't read any persistent journal index) should be a reasonable first step.
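
A minimal sketch of that simple checker producer, assuming a PostgreSQL main storage and a Kafka journal reachable with confluent-kafka; the topic, table and column names are illustrative, not the actual swh-journal configuration.

```python
import psycopg2
from confluent_kafka import Producer


def simple_checker_producer(db_dsn, brokers,
                            topic='swh.journal.objects.content'):
    """List all object ids from the main database and push them to the
    journal topic, without consulting any persistent journal index."""
    producer = Producer({'bootstrap.servers': brokers})
    with psycopg2.connect(db_dsn) as db:
        with db.cursor(name='checker_ids') as cur:  # server-side cursor
            cur.execute('SELECT sha1_git FROM content')
            for (object_id,) in cur:
                # Keyed by object id, so log compaction deduplicates objects
                # that get re-sent across checker runs.
                producer.produce(topic, key=bytes(object_id),
                                 value=bytes(object_id))
                producer.poll(0)  # serve delivery callbacks as we go
    producer.flush()
```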

the implementation is not :)

I don't find the implementation easy either :D and I thought I was alone at that IRC moment (life crisis and all :).
So thanks for the clarification.

  5. just send all the objects again

Indeed, missed it. Thanks.

To implement solutions 2 and 5, we can break down the checker in two components:...

I like this. Thanks.

Implementing the simple checker producer (which doesn't read any persistent journal index) should be a reasonable first step.

I assume that by "no persistent journal index" you mean not yet using the spinning storage db, thus reading the journal directly (well, optionally).

I have almost something like that. I'll make you a diff if you don't mind.

thus reading the journal directly (well, optionally).

After discussion, there is no need for the option to read the journal in the first simple checker producer implementation (5.).

I have almost something like that. I'll make you a diff if you don't mind.

I'm thus simplifying what I have and improving the docstrings.

ardumont renamed this task from Create a journal checker that compares object lists from the journal and from the database to swh-journal - Create a journal checker comparing object lists between journal and database. Oct 18 2018, 3:53 PM
ardumont renamed this task from swh-journal - Create a journal checker comparing object lists between journal and database to swh-journal: Create a journal checker comparing object lists between journal and database.
olasd added a subscriber: seirl.

Thanks to @seirl using the full journal to do a graph export (and therefore having the time to check whether all objects were there), we've found a bunch of bugs in the journal backfiller / configuration preventing large objects from being added.

T2350

I've ended up implementing an "id by id" backfiller to resolve T2351. However, this happened in the middle of a large refactoring of swh.storage and swh.journal, so the code needs a (painful) rebase, for not much gain.

I'll close this and we can revisit it if we actually need such a component again.