Backfilled objects:
- snapshot
- release
I've pulled the list of objects from kafka using @seirl's graph export. I'm now working on computing the diff between postgres and that list of objects.
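For reference, a rough sketch of how such a diff could be computed, assuming the kafka-side ids were dumped one hex id per line; the file name, connection string and table used here are hypothetical:

    import psycopg2

    # Hypothetical dump of kafka-side ids, one hex sha1_git per line.
    with open("snapshot_ids.txt") as f:
        kafka_ids = {line.strip() for line in f if line.strip()}

    # Hypothetical connection string and table; adapt to the real schema.
    with psycopg2.connect("service=swh") as db, db.cursor() as cur:
        cur.execute("SELECT encode(id, 'hex') FROM snapshot")
        pg_ids = {row[0] for row in cur}

    print("in kafka, missing from postgres:", len(kafka_ids - pg_ids))
    print("in postgres, missing from kafka:", len(pg_ids - kafka_ids))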
rDJNL7ff372a02de4 has now been deployed to production
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/408/ for more details.
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/404/ for more details.
Rebase on latest master
<ardumont> val: your D2838 comment, sure
<ardumont> but i'd rather do it in another diff if you don't mind
Could you also change the _fix_contents I just added?
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/402/ for more details.
I think it would be ok to write to the journal after adding to the objstorage.
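A minimal sketch of that ordering, with illustrative method names rather than the exact swh APIs:

    def add_content(content, objstorage, journal_writer):
        # Write to the objstorage first; only announce the object in the
        # journal once it has actually landed, so the journal never
        # references data that is not stored.
        objstorage.add(content["data"], obj_id=content["sha1"])
        journal_writer.write_addition("content", content)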
In T2003#41459, @vlorentz wrote: In T2003#41457, @douardda wrote: One question could be 'what is the definitive source of truth in our stack?'
I assumed we wanted to aim for Kafka to be the source of truth
In T2003#41457, @douardda wrote: One question could be 'what is the definitive source of truth in our stack?'
In T2003#41456, @olasd wrote: Now that I think of it, we can decompose this in stages in the storage pipeline:
- add an input validating proxy high up the stack
- replace the journal writer calls sprinkled in all methods with a journal writing proxy
- add a "don't insert objects" filter low down the stack
so we'd end up with the following pipeline for workers:
- input validation proxy
- object bundling proxy
- object deduplication against read-only proxy
- journal writer proxy
- addition-blocking filter
- underlying read-only storage
and the following pipeline for the "main storage replayer":
- underlying read-write storage
(it's a very short pipeline... a pipedash?)
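For illustration, a hedged sketch of how such a worker-side stack of proxies could be composed; the class and method names are made up for the example (not the actual swh.storage API), and the bundling/deduplication stages are omitted for brevity:

    # Illustrative only: each proxy wraps the next stage and adds one concern.
    class ValidatingProxy:
        def __init__(self, storage):
            self.storage = storage

        def content_add(self, contents):
            for content in contents:
                if "sha1" not in content:  # minimal stand-in for real validation
                    raise ValueError("invalid content: missing sha1")
            return self.storage.content_add(contents)

    class JournalWriterProxy:
        def __init__(self, storage, journal_writer):
            self.storage = storage
            self.journal = journal_writer

        def content_add(self, contents):
            self.journal.write_additions("content", contents)
            return self.storage.content_add(contents)

    class AdditionBlockingFilter:
        """Last worker-side stage: never writes to the underlying storage."""
        def __init__(self, storage):
            self.storage = storage

        def content_add(self, contents):
            return {"content:add": 0}

    # Worker pipeline, outermost proxy first (readonly_storage and
    # journal_writer would come from the configuration):
    # storage = ValidatingProxy(
    #     JournalWriterProxy(
    #         AdditionBlockingFilter(readonly_storage), journal_writer))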
In T2003#41443, @vlorentz wrote: We already discussed this at the time we replaced the journal-publisher with journal-writer. Adding to Kafka after inserting to the DB means that Kafka will be missing some messages, and we would need to run a backfiller on a regular basis to fix it.
Now that I think of it, we can decompose this in stages in the storage pipeline:
In T2003#41443, @vlorentz wrote: @olasd I'm worried that implementing your idea would result in some complex piece of code.
@olasd I'm worried that implementing your idea would result in some complex piece of code. It also adds a new postgresql database and new kafka topics, that will need extra resources and management. And if at some point that queue database becomes too large, the retrier will become slower, causing the queue to grow even more.
In T2003#41428, @olasd wrote: This component would centralize the "has this object already appeared?" logic, as well as the queueing+retry logic, and would replace the current kafka mirror component.
How does that sound?
In T2003#41429, @olasd wrote: Key metrics for the filter component:
- kafka consumer offset
- min(latest_attempt) where in_flight = true (time it takes for a message from submission in the buffer to (re-)processing by the filter; should stay close to the current time)
- count(*) where given_up = false group by topic (number of objects pending a retry, should be small)
- count(*) where in_flight = true group by topic (number of objects buffered for reprocessing, should be small)
- max(latest_attempt) (last processing time by the requeuing process)
- count(*) where given_up = true (checks that the housekeeping process is keeping up)
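Assuming the buffer lives in a postgres table shaped roughly like queued_messages(topic, in_flight, given_up, latest_attempt) (a hypothetical schema; the kafka consumer offset would come from the consumer itself), those metrics could be collected with queries along these lines:

    import psycopg2

    # Hypothetical buffer schema:
    #   queued_messages(topic, in_flight, given_up, latest_attempt)
    METRIC_QUERIES = {
        "oldest_in_flight_attempt":
            "SELECT min(latest_attempt) FROM queued_messages WHERE in_flight",
        "pending_retry_by_topic":
            "SELECT topic, count(*) FROM queued_messages WHERE NOT given_up GROUP BY topic",
        "in_flight_by_topic":
            "SELECT topic, count(*) FROM queued_messages WHERE in_flight GROUP BY topic",
        "last_requeue_run":
            "SELECT max(latest_attempt) FROM queued_messages",
        "given_up_total":
            "SELECT count(*) FROM queued_messages WHERE given_up",
    }

    def collect_metrics(dsn="service=swh-journal-filter"):
        # Run each monitoring query and yield (metric name, rows).
        with psycopg2.connect(dsn) as db, db.cursor() as cur:
            for name, query in METRIC_QUERIES.items():
                cur.execute(query)
                yield name, cur.fetchall()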
Note: I haven't read the other comment below, just reacting to this one as I read it.
As for implementing the queue / retry behavior in the filter component:
So, now that T1914 is stuck, I'm giving this a harder think, and I'm wondering whether we shouldn't have a generic buffering/filtering component in the journal instead:
Probably doesn't anymore since we moved to confluent-kafka
I don't think I've actually seen this specific symptom in prod again, and if so only on workers that were hung up on something else already. We can reopen it if we notice it again.
Clients I run myself are no longer affected; I'm guessing it's thanks to one of the diffs linked from this task. But AFAIK, olasd still sees some consumers with this issue.
Since T1914 is high priority, this one is too.
What is the status of this issue? Do we still face this bug?
The backend exception is:
cimpl.KafkaException: KafkaError{code=MSG_SIZE_TOO_LARGE,val=10,str="Unable to produce message: Broker: Message size too large"}.
I've launched 16 content backfillers in parallel, one per hex digit prefix, which should help with this.
Probably superseded by T2128
I've installed gdb, python3.7-dbg, the debug symbols for librdkafka as well as for libssl1.1 on uffizi.
Looks like attaching to a stuck process with remote_pdb doesn't work the first time, then unsticks it when launched a second time. Oops.
I've bodged https://github.com/ionelmc/python-remote-pdb into the process, which will help understand what's up.
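For reference, a common way of wiring remote_pdb into a running process is through a signal handler; the host and port here are just examples:

    import signal
    from remote_pdb import RemotePdb

    def _remote_debug(sig, frame):
        # Open a pdb reachable over TCP; connect with e.g. `nc 127.0.0.1 4444`
        RemotePdb("127.0.0.1", 4444).set_trace(frame=frame)

    # `kill -USR1 <pid>` on the stuck worker then drops it into the debugger.
    signal.signal(signal.SIGUSR1, _remote_debug)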
So I'm experiencing some of this on uffizi, where I'm running the s3 content copier as a group of 64 coordinated journal clients.
Nope, still not completely fixed
I think both issues have been solved separately.
closed by 1144f7dd1552
The production issue with the "save code now" feature is a different one: the git loaders are all stuck connected to bitbucket.org:443, waiting for it to send them data.
👍 to both your messages
I agree with @douardda that the "failed content" queue + separate processor approach would be the most sensible.
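A minimal sketch of that idea, assuming a hypothetical "failed content" kafka topic and an illustrative copy step; a separate processor would consume the side topic and retry at its own pace:

    import json
    from confluent_kafka import Producer

    failed_producer = Producer({"bootstrap.servers": "broker1:9092"})

    def copy_contents(contents, copy_to_destination):
        for content in contents:
            try:
                copy_to_destination(content)
            except Exception as exc:
                # Push the failure to a side topic instead of blocking the
                # main copier; the retry processor reads from that topic.
                failed_producer.produce(
                    "swh.journal.failed_content",  # hypothetical topic name
                    key=content["sha1"],
                    value=json.dumps(
                        {"sha1": content["sha1"].hex(), "error": str(exc)}
                    ).encode(),
                )
        failed_producer.flush()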
My gut on this task tells me that what we need (what we really really need) is a 3rd solution:
We do have the infra for journal clients now.
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/176/ for more details.
Update the help message to add a deprecation notice (as reasonably proposed)
Maybe modify help messages the way vlorentz did in D1674
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/174/ for more details.
1 month is good enough. Let's stick to this.
Still with 16 processes in parallel, adding more CPUs gives an ETA of ~1 month, which is still pretty bad.
Running the directory backfiller (single instance) against belvedere yields an ETA of 250 days, which is around a 3x speedup over somerset.
Do we now have any insight into the behavior of the backfiller against belvedere?
I'm inclined to prefer option 2, since performance is an issue we cannot afford to underestimate...
Closing because we removed the journal publisher. (My first WONTFIX <3)