
Use journal clients for webapp and deposit to subscribe to events
Closed, Migrated. Edits Locked.

Description

Now that we've started using the journal for the DAG, we may need more functional
topics.

Currently, the webapp and the deposit need to poll regularly for information to update
some statuses:

  • deposit status
  • save code now status

It'd be interesting to have dedicated topics for those, so that they could refresh
themselves from a journal client. That would remove the need for regular polling.
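To make the idea concrete, here's a minimal sketch of the producer side, assuming a hypothetical dedicated topic (swh.journal.deposit_status), a JSON payload, and the confluent-kafka library; the broker address, topic name and message schema are all assumptions, not existing SWH code:

```python
import json

from confluent_kafka import Producer

# Broker address is a placeholder.
producer = Producer({"bootstrap.servers": "kafka1.internal:9092"})

def notify_deposit_status(deposit_id: int, status: str) -> None:
    """Emit a status-change event so clients can react without polling."""
    producer.produce(
        "swh.journal.deposit_status",  # hypothetical dedicated topic
        key=str(deposit_id).encode(),  # keying by deposit id keeps a deposit's events ordered
        value=json.dumps({"deposit_id": deposit_id, "status": status}).encode(),
    )
    producer.flush()

notify_deposit_status(1234, "done")
```

A consumer subscribed to that topic would then refresh the deposit status as events arrive, instead of polling the database.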

Note: We may be able to use the existing topics (origin_visit_status, for example), but
then the journal clients would receive a lot of information they would need to filter
out. It also poses the problem of eventual backfills, which could put too much pressure
on those dedicated journal clients (save code now today only amounts to around 76k
origin requests...).
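For illustration, a minimal sketch of that filtering approach, assuming the swh.journal client interface (parameter and field names from memory, so they may differ) plus a hypothetical set of tracked origins and webapp-side hook:

```python
from swh.journal.client import get_journal_client

# Hypothetical: the set of origins with a pending save-code-now request.
TRACKED_ORIGINS = {"https://example.org/user/repo"}

def update_save_code_now_status(origin: str, status: str) -> None:
    """Hypothetical webapp-side hook (stub)."""
    ...

client = get_journal_client(
    "kafka",
    brokers=["kafka1.internal:9092"],  # placeholder
    group_id="swh.webapp.save-code-now-notifier",  # hypothetical group id
    object_types=["origin_visit_status"],
)

def process(messages):
    # The client hands over a dict mapping object types to decoded messages.
    for status in messages.get("origin_visit_status", []):
        if status["origin"] not in TRACKED_ORIGINS:
            continue  # the overwhelming majority of messages is dropped here
        update_save_code_now_status(status["origin"], status["status"])

client.process(process)
```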

Thoughts?

Event Timeline

ardumont triaged this task as Normal priority. Apr 23 2021, 4:48 PM
ardumont created this task.

Moving towards event notifications and stream processing instead of polling sounds worthwhile, ideally before the cost of the polling outgrows the cost of managing an event notification mechanism. For the two systems you've mentioned, I think we're really, really far away from that point, but it's still worth thinking through how to do event-driven notifications properly, so we don't have to rush it later.

I do not think that "the journal" (as in, the system that records a stream of all the new objects that are archived in and indexed by Software Heritage) is the right place to put ad-hoc event notifications for internal use.

But Kafka is probably the most decent solution we have for reliable event notification, and the existing scaffolding we have in place, including the journal client library, may be generic enough to be reused for this purpose too.
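For contrast, this is roughly what each ad-hoc client would have to reimplement without that scaffolding: a bare consumer loop with confluent-kafka, with placeholder broker/group names and a hypothetical handle() callback (the real topic prefix is swh.journal.objects):

```python
from confluent_kafka import Consumer

def handle(key: bytes, value: bytes) -> None:
    """Hypothetical callback: deserialization, retries, error handling left out."""
    ...

consumer = Consumer({
    "bootstrap.servers": "kafka1.internal:9092",  # placeholder
    "group.id": "ad-hoc-notifier",  # placeholder
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["swh.journal.objects.origin_visit_status"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        handle(msg.key(), msg.value())
finally:
    consumer.close()
```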

I am a bit concerned by the effort needed to properly hook the events we want to record into a Kafka topic, especially for Save Code Now: in practice, the task execution code doesn't know where the task comes from, so it's not immediately clear to me how we'd let the task runner know that it needs to send events to the dedicated topic. This caveat doesn't apply to the deposit, since there we have a single set of task runners and the client would be interested in all of its events.

Before jumping into implementing dedicated queues, it's probably worth prototyping whether just subscribing an ad-hoc journal client to origin_visit_status (which would do nothing for 99.999% of origins) would be enough. It's easy to add parallelism to that client when/if we backfill the topic, which makes the concern about "backfill load" less acute, IMO: when we had to restart the scheduler journal client from scratch, we could handle a full read of these topics in a couple of hours with enough parallelism.
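A sketch of that parallelism lever, assuming confluent-kafka: consumers sharing a group.id split the topic's partitions between them, so absorbing a backfill is mostly a matter of starting more workers. The broker address, group id and process_message() hook are placeholders:

```python
import threading

from confluent_kafka import Consumer

def process_message(value: bytes) -> None:
    """Hypothetical filter-and-notify hook (stub)."""
    ...

def worker(worker_id: int) -> None:
    # Each thread needs its own Consumer instance; they must not be shared.
    consumer = Consumer({
        "bootstrap.servers": "kafka1.internal:9092",
        "group.id": "save-code-now-notifier",  # same group => partitions are split
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["swh.journal.objects.origin_visit_status"])
    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            process_message(msg.value())
    finally:
        consumer.close()

# Scale the worker count up when/if the topic gets backfilled.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that parallelism within a consumer group is capped by the topic's partition count, so that's the other knob if more throughput is needed.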