
Enable the journal-writer for the swh-idx-storage in production
Closed, Migrated. Edits Locked.

Description

The config is similar to the one for swh-storage, so something like this:

journal_writer:
  cls: kafka
  args:
    brokers: "%{alias('swh::deploy::journal::brokers')}"
    prefix: "%{alias('swh::deploy::journal::prefix')}"
    client_id: "swh.indexer.storage.journal_writer.%{::swh_hostname.short}"
    producer_config:
      message.max.bytes: 1000000000

It's unclear what the prefix should be. swh.storage uses swh.journal.objects; we can either use that one too, or a new one, e.g. swh.journal.indexed.

Event Timeline

vlorentz triaged this task as Normal priority. Nov 16 2020, 1:31 PM
vlorentz created this task.

Is this supposed to be persistent (and keep the full history of all messages), or transient (and used for "real-time" clients)? IOW, what are the storage requirements for this?

Where is the list of topics that need to be created?

I think we should definitely use a different prefix than swh.storage, as the ACLs for third parties should be separate.

It's unclear what the prefix should be. swh.storage uses swh.journal.objects; we can either use that one too, or a new one, e.g. swh.journal.indexed.

I think we should definitely use a different prefix than swh.storage, as the ACLs for third parties should be separate.

So, heads up: the topic prefix swh.journal.indexed has been chosen and declared in the current staging diff D4620.

Where is the list of topics that need to be created?

I'd say in swh.indexer.storage.__init__.py:

./__init__.py:        self.journal_writer.write_additions("content_mimetype", mimetypes)
./__init__.py:        self.journal_writer.write_additions("content_language", languages)
./__init__.py:        self.journal_writer.write_additions("content_ctags", ctags)
./__init__.py:        self.journal_writer.write_additions("content_fossology_license", licenses)
./__init__.py:        self.journal_writer.write_additions("content_metadata", metadata)
./__init__.py:        self.journal_writer.write_additions("revision_intrinsic_metadata", metadata)
./__init__.py:        self.journal_writer.write_additions("origin_intrinsic_metadata", metadata)
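With the swh.journal.indexed prefix chosen above, that gives the following topics to create (the same list reused in the creation commands further down):

swh.journal.indexed.content_mimetype
swh.journal.indexed.content_language
swh.journal.indexed.content_ctags
swh.journal.indexed.content_fossology_license
swh.journal.indexed.content_metadata
swh.journal.indexed.revision_intrinsic_metadata
swh.journal.indexed.origin_intrinsic_metadata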
In T2780#53415, @olasd wrote:

Is this supposed to be persistent (and keep the full history of all messages), or transient (and used for "real-time" clients)? IOW, what are the storage requirements for this?

I'd say transient, as we can always recompute it. But this means backfilling the journal every time we add a new client that needs to get all the messages, so I don't know.

Where is the list of topics that need to be created?

Answered by @ardumont

I think we should definitely use a different prefix as swh.storage, as the ACLs for third parties should be separate.

Agreed

I propose meeting in the middle and having the following policies:

  • content topics: transient, bound by volume
  • revision / origin topics: persistent

I expect the content topics to be the most "volatile" and heavy, and the revision / origin topics to be the most useful to keep in the long term for third party clients.

Does that make sense?
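For illustration only: "transient, bound by volume" would map to per-topic retention settings rather than compaction. Assuming the swh.journal.indexed prefix and a made-up size cap, it would look something like:

for topic in content_mimetype content_language content_ctags content_fossology_license content_metadata; do
  # cap each partition at ~100GB and let old segments expire; the value is purely illustrative
  /opt/kafka/bin/kafka-configs.sh --bootstrap-server $SERVER --entity-type topics \
    --entity-name "swh.journal.indexed.$topic" --alter \
    --add-config cleanup.policy=delete,retention.bytes=107374182400
done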

Is there some remaining blocker on this?
(If not i'll attend to it next week)

I just mentioned some in T2912#58067, but it's unclear whether that's actually true or me misremembering things.

ardumont changed the task status from Open to Work in Progress. Feb 9 2021, 5:51 PM
ardumont moved this task from Weekly backlog to in-progress on the System administration board.

I just mentioned some in T2912#58067, but it's unclear whether that's actually true or me misremembering things.

I was not misremembering but T2876 got fixed in between.

Some preparatory work needs to be tested on staging first following T2876.
I'm attending to that.

tl;dr: deployed on staging and it seems OK.

One point of attention on the index side though: it's growing quite large quite fast, and we are only on staging.

(Details below)

Some preparatory work needs to be tested on staging first following T2876.

Namely, checking that it actually works. For this:

Actually update our manifest, D5053.

Stop the swh-search-journal-client@objects so it stops writing to the index.

systemctl stop swh-search-journal-client@objects

Back up the current staging index (as a snapshot, just in case the activation of the new service messes things up):

root@search-esnode0:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @/tmp/backup.json
{
  "took" : 102654,
  "timed_out" : false,
  "total" : 496619,
  "updated" : 0,
  "created" : 496619,
  "deleted" : 0,
  "batches" : 497,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
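For the record, the body of /tmp/backup.json isn't pasted in the task; assuming a straight copy of the origin index into the backup index listed in _cat/indices below, it would look roughly like:

# Assumed /tmp/backup.json content (not shown above): reindex "origin" into the backup index
cat > /tmp/backup.json <<'EOF'
{
  "source": { "index": "origin" },
  "dest": { "index": "origin-backup-20210209-1736" }
}
EOF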

Back up the offsets as well, just in case (P944). Puppet will start swh-search-journal-client@objects back up...
(so if something goes wrong, we'll reset those offsets and restore the snapshot index).
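For reference, a minimal way to capture those offsets while the journal client is stopped; the consumer group name below is an assumption (only the .indexed group appears later in this task):

# Assumed group name swh.search.journal_client.objects; dump current offsets to a file
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --describe \
  --group swh.search.journal_client.objects --all-topics > /tmp/offsets-backup-20210209.txt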

Now, after landing and deploying the diff ^, apply and check that everything runs fine:

Snapshot status detail on the current indices:

ardumont@search0:~% curl http://search-esnode0.internal.staging.swh.network:9200/_cat/indices\?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green open  origin                      xBl67YKsQbWAt7V78UeDLA 80 0 496619 5145 348.7mb 348.7mb
green open  origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg 80 0 496619    0 156.6mb 156.6mb

After deployment, everything is going fine.

*BUT* the index is growing quite large and fast...

health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin                      xBl67YKsQbWAt7V78UeDLA  80   0     622296        54024        1gb            1gb

Note: the consumer group's lag is subsiding (as expected):

root@journal0:~# /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.search.journal_client.indexed --all-topics

GROUP                             TOPIC                                         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
swh.search.journal_client.indexed swh.journal.indexed.origin_intrinsic_metadata 0          10882458        13557845        2675387         rdkafka-7c45245c-814f-47f1-ba67-041e4f426373 /192.168.130.90 rdkafka
...
swh.search.journal_client.indexed swh.journal.indexed.origin_intrinsic_metadata 0          10979458        13558110        2578652         rdkafka-7c45245c-814f-47f1-ba67-041e4f426373 /192.168.130.90 rdkafka
...
swh.search.journal_client.indexed swh.journal.indexed.origin_intrinsic_metadata 0          11274458        13558652        2284194         rdkafka-7c45245c-814f-47f1-ba67-041e4f426373 /192.168.130.90 rdkafka

Note: regarding partitioning (only 1 partition here), we'll need to create the topics beforehand to get a better partition configuration for production.

Grafana's ETA estimate [1] is ~1h.

[1] https://grafana.softwareheritage.org/goto/I0JcyVPGk

swh-search-journal-client@indexed kept up with its topic:

swh.search.journal_client.indexed swh.journal.indexed.origin_intrinsic_metadata 0          13653216        13653216        0               rdkafka-7c45245c-814f-47f1-ba67-041e4f426373 /192.168.130.90 rdkafka

And the index size stabilized at 1GB (up from an initial 156.6MB).

health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin                      xBl67YKsQbWAt7V78UeDLA  80   0     786803        85285        1gb            1gb
green  open   origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0     496619            0    156.6mb        156.6mb

Note that the "docs.count" grew though (from 496619 to 786803) and the reason are
unclear.

The same index is used to store the metadata coming out of the indexer, with the same origin URL as key [1], and we are computing indexed metadata on origins already seen (thus already present in the index, AFAIU). So I would have expected the docs.count to stay roughly (or even exactly?) the same as before.

[1] well, the sha1 of the origin URL computed by swh-search, but still

Note that the "docs.count" grew though (from 496619 to 786803) and the reason are
unclear.

The same index is used to store the metadata out of the indexer with the same origin url
as key [1] and we are computing index metadata on origins already seen (thus already present
in the index afaiui). So I would have expect the docs.count stay roughly (or even
exactly?) the same as before?

Well, red herring apparently ¯\_(ツ)_/¯:

ardumont@search0:~% curl -s http://$ES_NODE/origin/_count\?pretty | jq .count
499164
ardumont@search0:~% curl -s http://$ES_NODE/origin-backup-20210209-1736/_count\?pretty | jq .count
496619

The order of magnitude is actually roughly the same! And it's not exactly the same because we are indexing new origins along the way as well.

(Hence the ~3k delta between the indices.)

I did too much here: I finished the pipeline swh-indexer -> swh-search on staging (which is good nonetheless).

The point of that task was only about making the indexer storage write to its topics though.
So I'm going to do that now.

We'll prepare the topics with the following first and we'll improve later if need be:

staging:

export SERVER=journal0.internal.staging.swh.network:9092
for topic in content_mimetype content_language content_ctags content_fossology_license content_metadata revision_intrinsic_metadata origin_intrinsic_metadata; do
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --create --config cleanup.policy=compact --partitions 64 --replication-factor 1 --topic "swh.journal.indexed.$topic"
done

Run:

prod:

export SERVER=kafka1.internal.softwareheritage.org:9092
for topic in content_mimetype content_language content_ctags content_fossology_license content_metadata revision_intrinsic_metadata origin_intrinsic_metadata; do
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --create --config cleanup.policy=compact --partitions 256 --replication-factor 2 --topic "swh.journal.indexed.$topic"
done

Run:

root@kafka1:~# for topic in content_mimetype content_language content_ctags content_fossology_license content_metadata revision_intrinsic_metadata origin_intrinsic_metadata; do
>   /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --create --config cleanup.policy=compact --partitions 256 --replication-factor 2 --topic "swh.journal.indexed.$topic"
> done
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_mimetype.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_language.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_ctags.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_fossology_license.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.content_metadata.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.revision_intrinsic_metadata.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic swh.journal.indexed.origin_intrinsic_metadata.
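A quick way to double-check the partition count and replication factor on one of the freshly created topics (output not pasted here):

/opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe --topic swh.journal.indexed.origin_intrinsic_metadata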

Only noticing now that we have only one indexer currently running in staging
(so only one topic is currently being written there).

So some more indexers got deployed there to check that the journal is holding up OK (it is [1]).

After lunch, on with the production.

[1] https://grafana.softwareheritage.org/goto/GDoGPNEMk

Deployed.

Indexer-related topic status can be seen in the indexer ingestion status board [1].

[1] https://grafana.softwareheritage.org/goto/QM8VqNPGk