Build is green
May 13 2020
May 5 2020
Adapt according to review
In D3122#75850, @vlorentz wrote:
> why was get_journal_client removed from swh.journal.cli?
> All the code you're adding in swh/search/cli.py should be in swh-journal, so it can be used by other CLIs using a journal client.
why was get_journal_client removed from swh.journal.cli?
May 4 2020
Apr 30 2020
Let's consider this done now.
Apr 29 2020
Apr 28 2020
We've bumped the max message size to 100 MB in all producers.
The kafka producer in swh.journal now reads message delivery receipts, and fails if they are negative or if they don't arrive within two minutes.
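The receipt-checking logic might look something like this sketch (a hypothetical class, not the actual swh.journal code): a per-message callback collects delivery errors, and a final check raises if any receipt reported an error or if messages were still undelivered after the timeout.

```python
# Hypothetical sketch of producer delivery-receipt checking; names and
# structure are illustrative, not the actual swh.journal implementation.
class DeliveryTracker:
    """Collect Kafka delivery reports and fail hard on any error."""

    def __init__(self):
        self.errors = []
        self.delivered = 0

    def on_delivery(self, error, message):
        # confluent-kafka style callback: invoked once per produced message
        if error is not None:
            self.errors.append(error)
        else:
            self.delivered += 1

    def check(self, unflushed=0):
        # Fail if any receipt was negative, or if messages never arrived
        # (i.e. were still unflushed after the two-minute timeout).
        if self.errors:
            raise RuntimeError("message delivery failed: %r" % self.errors)
        if unflushed:
            raise RuntimeError("%d messages not delivered in time" % unflushed)
```

With a real `confluent_kafka.Producer`, the callback would be passed as `on_delivery=tracker.on_delivery` to `produce()`, and `unflushed` would be the return value of `producer.flush(timeout=120)` (the number of messages still queued).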
snapshots, releases, revisions and directories have now been completely backfilled, and no objects of these types are (known to be) missing from the kafka cluster on azure.
Apr 24 2020
Apr 23 2020
Apr 22 2020
Apr 17 2020
Backfilled objects:
- snapshot
- release
Apr 15 2020
I've pulled the list of objects from kafka using @seirl's graph export. I'm now looking at computing the diff between postgres and that list of objects.
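A memory-friendly way to compute such a diff, assuming both id lists can be streamed in sorted order (a sketch of the approach, not the actual tooling used):

```python
def missing_from_kafka(pg_ids, kafka_ids):
    """Merge-walk two sorted id iterators; yield ids only present in pg_ids.

    Illustrative sketch: both inputs must be sorted, which lets us diff
    huge exports without loading either list fully in memory.
    """
    kafka_it = iter(kafka_ids)
    current = next(kafka_it, None)
    for pg_id in pg_ids:
        # advance the kafka cursor until it reaches or passes pg_id
        while current is not None and current < pg_id:
            current = next(kafka_it, None)
        if current != pg_id:
            yield pg_id
```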
rDJNL7ff372a02de4 has now been deployed to production
Apr 14 2020
Apr 9 2020
Apr 6 2020
Mar 18 2020
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/408/ for more details.
Mar 17 2020
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/404/ for more details.
Rebase on latest master
<ardumont> val: your D2838 comment, sure
<ardumont> but i'd rather do it in another diff if you don't mind
Could you also change the _fix_contents I just added?
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/402/ for more details.
Jan 29 2020
I think it would be ok to write in the journal after adding to the objstorage.
In T2003#41459, @vlorentz wrote:
> In T2003#41457, @douardda wrote:
>> One question could be 'what is the definitive source of truth in our stack?'
> I assumed we wanted to aim for Kafka to be the source of truth.
In T2003#41457, @douardda wrote:
> One question could be 'what is the definitive source of truth in our stack?'
In T2003#41456, @olasd wrote:
> Now that I think of it, we can decompose this in stages in the storage pipeline:
- add an input validating proxy high up the stack
- replace the journal writer calls sprinkled in all methods with a journal writing proxy
- add a "don't insert objects" filter low down the stack
so we'd end up with the following pipeline for workers:
- input validation proxy
- object bundling proxy
- object deduplication against read-only proxy
- journal writer proxy
- addition-blocking filter
- underlying read-only storage
and the following pipeline for the "main storage replayer":
- underlying read-write storage
(it's a very short pipeline... a pipedash?)
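As a rough illustration of the proxy idea (class and method names here are hypothetical, not swh.storage's actual interfaces), each stage wraps the next and either passes writes through with a side effect or blocks them:

```python
# Illustrative proxy-pipeline sketch; not the real swh.storage classes.
class JournalWriterProxy:
    """Write every addition to the journal, then pass it down the stack."""

    def __init__(self, storage, journal):
        self.storage = storage
        self.journal = journal

    def content_add(self, contents):
        self.journal.write_additions("content", contents)
        return self.storage.content_add(contents)


class AdditionBlockingFilter:
    """The "don't insert objects" stage: swallow writes, delegate reads."""

    def __init__(self, storage):
        self.storage = storage

    def content_add(self, contents):
        return {"content:add": 0}  # insertion blocked

    def __getattr__(self, name):
        # read methods fall through to the underlying (read-only) storage
        return getattr(self.storage, name)
```

With these two stages stacked, a worker writes objects to the journal but never to the underlying storage; the "main storage replayer" then consumes the journal and writes to the read-write storage directly.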
In T2003#41443, @vlorentz wrote:
> We already discussed this at the time we replaced the journal-publisher with journal-writer. Adding to Kafka after inserting to the DB means that Kafka will be missing some messages, and we would need to run a backfiller on a regular basis to fix it.
Jan 28 2020
Now that I think of it, we can decompose this in stages in the storage pipeline:
In T2003#41443, @vlorentz wrote:
> @olasd I'm worried that implementing your idea would result in some complex piece of code.
@olasd I'm worried that implementing your idea would result in some complex piece of code. It also adds a new postgresql database and new kafka topics, that will need extra resources and management. And if at some point that queue database becomes too large, the retrier will become slower, causing the queue to grow even more.
In T2003#41428, @olasd wrote:
> This component would centralize the "has this object already appeared?" logic, as well as the queueing+retry logic, and would replace the current kafka mirror component.
How does that sound?
In T2003#41429, @olasd wrote:
> Key metrics for the filter component:
- kafka consumer offset
- min(latest_attempt) where in_flight = true (time it takes for a message from submission in the buffer to (re-)processing by the filter; should stay close to the current time)
- count(*) where given_up = false group by topic (number of objects pending a retry, should be small)
- count(*) where in_flight = true group by topic (number of objects buffered for reprocessing, should be small)
- max(latest_attempt) (last processing time by the requeuing process)
- count(*) where given_up = true (checks whether the housekeeping process is running)
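Assuming the buffer lives in a table shaped roughly like `journal_buffer(topic, in_flight, given_up, latest_attempt)` (a guess at the schema, not an existing table), the metrics above would translate to queries along these lines:

```python
# Hypothetical SQL for the metrics above; the table name and columns are
# assumptions for illustration, not an existing schema.
METRIC_QUERIES = {
    "oldest_in_flight_attempt": (
        "SELECT min(latest_attempt) FROM journal_buffer WHERE in_flight"
    ),
    "pending_retry_by_topic": (
        "SELECT topic, count(*) FROM journal_buffer"
        " WHERE NOT given_up GROUP BY topic"
    ),
    "in_flight_by_topic": (
        "SELECT topic, count(*) FROM journal_buffer"
        " WHERE in_flight GROUP BY topic"
    ),
    "latest_requeue_run": (
        "SELECT max(latest_attempt) FROM journal_buffer"
    ),
    "given_up_total": (
        "SELECT count(*) FROM journal_buffer WHERE given_up"
    ),
}
```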
Note: I haven't read the other comment below; I'm just reacting to this one as I read it.
Jan 27 2020
As for implementing the queue / retry behavior in the filter component:
So, now that T1914 is stuck, I'm giving this a harder think, and I'm wondering whether we shouldn't have a generic buffering/filtering component in the journal instead:
Probably doesn't anymore since we moved to confluent-kafka
I don't think I've actually seen this specific symptom in prod again, and if so only on workers that were hung up on something else already. We can reopen it if we notice it again.
Jan 24 2020
Clients I run myself are no longer affected, I'm guessing it's thanks to one of the diffs linked from this task. But AFAIK, olasd still sees some consumers with this issue.
Jan 23 2020
Since T1914 is high priority, this one is too.
What is the status of this issue? Do we still face this bug?
Jan 14 2020
Dec 16 2019
Dec 11 2019
The backend exception is:
cimpl.KafkaException: KafkaError{code=MSG_SIZE_TOO_LARGE,val=10,str="Unable to produce message: Broker: Message size too large"}.
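The eventual fix (bumping the cap to 100 MB, per the later comment) would look roughly like this on the producer side; note that the broker has its own `message.max.bytes` (and `replica.fetch.max.bytes`) that must be raised in step, and consumers may need their fetch limits raised too.

```python
# Sketch of a librdkafka/confluent-kafka producer config raising the
# message size cap to 100 MB; the broker address is illustrative.
PRODUCER_CONFIG = {
    "bootstrap.servers": "broker:9092",      # hypothetical broker
    "message.max.bytes": 100 * 1024 * 1024,  # default is ~1 MB
}
```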
Dec 7 2019
I've launched 16 content backfillers in parallel, one for each hex digit prefix, which should help with this.
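"One backfiller per hex digit prefix" amounts to splitting the 40-hex-character sha1 keyspace into 16 ranges, along these lines (a sketch of the partitioning; the actual backfiller invocation takes range bounds and is not shown here):

```python
def hex_prefix_ranges(digits="0123456789abcdef", width=40):
    """Split a hex keyspace into one (start, end) range per first digit.

    Illustrative sketch: end=None means "to the end of the keyspace",
    so each worker scans [start, end) of the sha1 space.
    """
    ranges = []
    for i, d in enumerate(digits):
        start = d + "0" * (width - 1)
        if i + 1 < len(digits):
            end = digits[i + 1] + "0" * (width - 1)
        else:
            end = None
        ranges.append((start, end))
    return ranges
```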
Probably superseded by T2128
Nov 25 2019
I've installed gdb, python3.7-dbg, the debug symbols for librdkafka as well as for libssl1.1 on uffizi.
Nov 16 2019
Looks like attaching to a stuck process with remote_pdb doesn't work the first time, but unsticks it when launched a second time. Oops.
Nov 15 2019
I've bodged https://github.com/ionelmc/python-remote-pdb into the process, which will help understand what's up.
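For reference, one low-impact way to bodge remote_pdb in (this assumes the `remote-pdb` package; the signal-triggered wrapper below is my own sketch, not necessarily what was deployed):

```python
import signal


def install_remote_debugger(host="127.0.0.1", port=4444):
    """Drop into a telnet-reachable debugger when SIGUSR1 is received.

    remote_pdb is only imported inside the handler, so the hook costs
    nothing until actually triggered (e.g. `kill -USR1 <pid>`, then
    `telnet 127.0.0.1 4444`). Host and port here are illustrative.
    """
    def handler(signum, frame):
        from remote_pdb import RemotePdb  # pip install remote-pdb
        RemotePdb(host, port).set_trace(frame=frame)

    signal.signal(signal.SIGUSR1, handler)
```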
So I'm experiencing some of this on uffizi, where I'm running the s3 content copier as a group of 64 coordinated journal clients.
Nov 14 2019
Nope, still not completely fixed
Nov 8 2019
Nov 4 2019
Oct 10 2019
Sep 30 2019
I think both issues have been solved separately.
Sep 18 2019
closed by 1144f7dd1552
The production issue with the save code now feature is another one: the git loaders are all stuck connected to bitbucket.org:443, waiting for it to send them data.
Sep 17 2019
👍 to both your messages
I agree with @douardda that the "failed content" queue + separate processor approach would be the most sensible.
My gut on this task tells me that what we need (what we really really need) is a 3rd solution:
Sep 3 2019
We do have the infra for journal clients now.