Build is green
May 13 2020
May 5 2020
Adapt according to review
In D3122#75850, @vlorentz wrote:
> why was get_journal_client removed from swh.journal.cli?
> All the code you're adding in swh/search/cli.py should be in swh-journal, so it can be used by other CLIs using a journal client.
why was get_journal_client removed from swh.journal.cli?
May 4 2020
Apr 30 2020
Let's consider this done now.
Apr 29 2020
Apr 28 2020
We've bumped the max message size to 100 MB in all producers.
The kafka producer in swh.journal now reads message delivery receipts, and fails if they are negative or if they don't arrive within two minutes.
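The receipt-checking logic might look something like this sketch (a hypothetical class, not the actual swh.journal code): a per-message callback collects delivery errors, and a final check raises if any receipt reported an error or if messages were still undelivered after the timeout.

```python
# Hypothetical sketch of producer delivery-receipt checking; names and
# structure are illustrative, not the actual swh.journal implementation.
class DeliveryTracker:
    """Collect Kafka delivery reports and fail hard on any error."""

    def __init__(self):
        self.errors = []
        self.delivered = 0

    def on_delivery(self, error, message):
        # confluent-kafka style callback: invoked once per produced message
        if error is not None:
            self.errors.append(error)
        else:
            self.delivered += 1

    def check(self, unflushed=0):
        # Fail if any receipt was negative, or if messages never arrived
        # (i.e. were still unflushed after the two-minute timeout).
        if self.errors:
            raise RuntimeError("message delivery failed: %r" % self.errors)
        if unflushed:
            raise RuntimeError("%d messages not delivered in time" % unflushed)
```

With a real `confluent_kafka.Producer`, the callback would be passed as `on_delivery=tracker.on_delivery` to `produce()`, and `unflushed` would be the return value of `producer.flush(timeout=120)` (the number of messages still queued).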
snapshots, releases, revisions and directories have now been completely backfilled, and no objects of these types are (known to be) missing from the kafka cluster on azure.
Apr 24 2020
Apr 23 2020
Apr 22 2020
Apr 17 2020
Backfilled objects:
- snapshot
- release
Apr 15 2020
I've pulled the list of objects from kafka using @seirl's graph export. I'm now looking at computing the diff between postgres and that list of objects.
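A memory-friendly way to compute such a diff, assuming both id lists can be streamed in sorted order (a sketch of the approach, not the actual tooling used):

```python
def missing_from_kafka(pg_ids, kafka_ids):
    """Merge-walk two sorted id iterators; yield ids only present in pg_ids.

    Illustrative sketch: both inputs must be sorted, which lets us diff
    huge exports without loading either list fully in memory.
    """
    kafka_it = iter(kafka_ids)
    current = next(kafka_it, None)
    for pg_id in pg_ids:
        # advance the kafka cursor until it reaches or passes pg_id
        while current is not None and current < pg_id:
            current = next(kafka_it, None)
        if current != pg_id:
            yield pg_id
```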
rDJNL7ff372a02de4 has now been deployed to production
Apr 14 2020
Apr 9 2020
Apr 6 2020
Mar 18 2020
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/408/ for more details.
Mar 17 2020
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/404/ for more details.
Rebase on latest master
<ardumont> val: your D2838 comment, sure
<ardumont> but i'd rather do it in another diff if you don't mind
Could you also change the _fix_contents I just added?
Build is green
See https://jenkins.softwareheritage.org/job/DJNL/job/tox/402/ for more details.
Jan 29 2020
I think it would be ok to write in the journal after adding to the objstorage.
In T2003#41459, @vlorentz wrote:
> In T2003#41457, @douardda wrote:
>> One question could be 'what is the definitive source of truth in our stack?'
> I assumed we wanted to aim for Kafka to be the source of truth.
In T2003#41457, @douardda wrote:
> One question could be 'what is the definitive source of truth in our stack?'
In T2003#41456, @olasd wrote:
> Now that I think of it, we can decompose this in stages in the storage pipeline:
- add an input validating proxy high up the stack
- replace the journal writer calls sprinkled in all methods with a journal writing proxy
- add a "don't insert objects" filter low down the stack
so we'd end up with the following pipeline for workers:
- input validation proxy
- object bundling proxy
- object deduplication against read-only proxy
- journal writer proxy
- addition-blocking filter
- underlying read-only storage
and the following pipeline for the "main storage replayer":
- underlying read-write storage
(it's a very short pipeline... a pipedash?)
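As a rough illustration of the proxy idea (class and method names here are hypothetical, not swh.storage's actual interfaces), each stage wraps the next and either passes writes through with a side effect or blocks them:

```python
# Illustrative proxy-pipeline sketch; not the real swh.storage classes.
class JournalWriterProxy:
    """Write every addition to the journal, then pass it down the stack."""

    def __init__(self, storage, journal):
        self.storage = storage
        self.journal = journal

    def content_add(self, contents):
        self.journal.write_additions("content", contents)
        return self.storage.content_add(contents)


class AdditionBlockingFilter:
    """The "don't insert objects" stage: swallow writes, delegate reads."""

    def __init__(self, storage):
        self.storage = storage

    def content_add(self, contents):
        return {"content:add": 0}  # insertion blocked

    def __getattr__(self, name):
        # read methods fall through to the underlying (read-only) storage
        return getattr(self.storage, name)
```

With these two stages stacked, a worker writes objects to the journal but never to the underlying storage; the "main storage replayer" then consumes the journal and writes to the read-write storage directly.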
In T2003#41443, @vlorentz wrote:
> We already discussed this at the time we replaced the journal-publisher with journal-writer. Adding to Kafka after inserting to the DB means that Kafka will be missing some messages, and we would need to run a backfiller on a regular basis to fix it.
Jan 28 2020
Now that I think of it, we can decompose this in stages in the storage pipeline:
In T2003#41443, @vlorentz wrote:
> @olasd I'm worried that implementing your idea would result in some complex piece of code.
@olasd I'm worried that implementing your idea would result in some complex piece of code. It also adds a new postgresql database and new kafka topics, that will need extra resources and management. And if at some point that queue database becomes too large, the retrier will become slower, causing the queue to grow even more.
In T2003#41428, @olasd wrote:
> This component would centralize the "has this object already appeared?" logic, as well as the queueing+retry logic, and would replace the current kafka mirror component.
How does that sound?
In T2003#41429, @olasd wrote:
> Key metrics for the filter component:
- kafka consumer offset
- min(latest_attempt) where in_flight = true (time it takes for a message from submission in the buffer to (re-)processing by the filter; should stay close to the current time)
- count(*) where given_up = false group by topic (number of objects pending a retry, should be small)
- count(*) where in_flight = true group by topic (number of objects buffered for reprocessing, should be small)
- max(latest_attempt) (last processing time by the requeuing process)
- count(*) where given_up = true (checks whether the housekeeping process is running)
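Assuming the buffer lives in a table shaped roughly like `journal_buffer(topic, in_flight, given_up, latest_attempt)` (a guess at the schema, not an existing table), the metrics above would translate to queries along these lines:

```python
# Hypothetical SQL for the metrics above; the table name and columns are
# assumptions for illustration, not an existing schema.
METRIC_QUERIES = {
    "oldest_in_flight_attempt": (
        "SELECT min(latest_attempt) FROM journal_buffer WHERE in_flight"
    ),
    "pending_retry_by_topic": (
        "SELECT topic, count(*) FROM journal_buffer"
        " WHERE NOT given_up GROUP BY topic"
    ),
    "in_flight_by_topic": (
        "SELECT topic, count(*) FROM journal_buffer"
        " WHERE in_flight GROUP BY topic"
    ),
    "latest_requeue_run": (
        "SELECT max(latest_attempt) FROM journal_buffer"
    ),
    "given_up_total": (
        "SELECT count(*) FROM journal_buffer WHERE given_up"
    ),
}
```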
Note: I haven't read the other comment below; I'm just reacting to this one as I read it.
Jan 27 2020
As for implementing the queue / retry behavior in the filter component:
So, now that T1914 is stuck, I'm giving this a harder think, and I'm wondering whether we shouldn't have a generic buffering/filtering component in the journal instead:
Probably doesn't anymore since we moved to confluent-kafka
I don't think I've actually seen this specific symptom in prod again, and if so only on workers that were hung up on something else already. We can reopen it if we notice it again.
Jan 24 2020
Clients I run myself are no longer affected, I'm guessing it's thanks to one of the diffs linked from this task. But AFAIK, olasd still sees some consumers with this issue.
Jan 23 2020
Since T1914 is high priority, this one is too.
What is the status of this issue? Do we still face this bug?
Jan 14 2020
Dec 16 2019
Dec 11 2019
The backend exception is:
cimpl.KafkaException: KafkaError{code=MSG_SIZE_TOO_LARGE,val=10,str="Unable to produce message: Broker: Message size too large"}.
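The eventual fix (bumping the cap to 100 MB, per the later comment) would look roughly like this on the producer side; note that the broker has its own `message.max.bytes` (and `replica.fetch.max.bytes`) that must be raised in step, and consumers may need their fetch limits raised too.

```python
# Sketch of a librdkafka/confluent-kafka producer config raising the
# message size cap to 100 MB; the broker address is illustrative.
PRODUCER_CONFIG = {
    "bootstrap.servers": "broker:9092",      # hypothetical broker
    "message.max.bytes": 100 * 1024 * 1024,  # default is ~1 MB
}
```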
Dec 7 2019
I've launched 16 content backfillers in parallel, one for each hex digit prefix, which should help with this.
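"One backfiller per hex digit prefix" amounts to splitting the 40-hex-character sha1 keyspace into 16 ranges, along these lines (a sketch of the partitioning; the actual backfiller invocation takes range bounds and is not shown here):

```python
def hex_prefix_ranges(digits="0123456789abcdef", width=40):
    """Split a hex keyspace into one (start, end) range per first digit.

    Illustrative sketch: end=None means "to the end of the keyspace",
    so each worker scans [start, end) of the sha1 space.
    """
    ranges = []
    for i, d in enumerate(digits):
        start = d + "0" * (width - 1)
        if i + 1 < len(digits):
            end = digits[i + 1] + "0" * (width - 1)
        else:
            end = None
        ranges.append((start, end))
    return ranges
```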
Probably superseded by T2128
Nov 25 2019
I've installed gdb, python3.7-dbg, the debug symbols for librdkafka as well as for libssl1.1 on uffizi.
Nov 16 2019
Looks like attaching to a stuck process with remote_pdb doesn't work the first time, but unsticks it when launched a second time. Oops.
Nov 15 2019
I've bodged https://github.com/ionelmc/python-remote-pdb into the process, which will help understand what's up.
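For reference, one low-impact way to bodge remote_pdb in (this assumes the `remote-pdb` package; the signal-triggered wrapper below is my own sketch, not necessarily what was deployed):

```python
import signal


def install_remote_debugger(host="127.0.0.1", port=4444):
    """Drop into a telnet-reachable debugger when SIGUSR1 is received.

    remote_pdb is only imported inside the handler, so the hook costs
    nothing until actually triggered (e.g. `kill -USR1 <pid>`, then
    `telnet 127.0.0.1 4444`). Host and port here are illustrative.
    """
    def handler(signum, frame):
        from remote_pdb import RemotePdb  # pip install remote-pdb
        RemotePdb(host, port).set_trace(frame=frame)

    signal.signal(signal.SIGUSR1, handler)
```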
So I'm experiencing some of this on uffizi, where I'm running the s3 content copier as a group of 64 coordinated journal clients.
Nov 14 2019
Nope, still not completely fixed
Nov 8 2019
Nov 4 2019
Oct 10 2019
Sep 30 2019
I think both issues have been solved separately.
Sep 18 2019
closed by 1144f7dd1552
The production issue with the save code now feature is another one: the git loaders are all stuck connected to bitbucket.org:443, waiting for it to send them data.
Sep 17 2019
👍 to both your messages
I agree with @douardda that the "failed content" queue + separate processor approach would be the most sensible.
My gut on this task tells me that what we need (what we really really need) is a 3rd solution:
Sep 3 2019
We do have the infra for journal clients now.