Aug 25 2021

vsellier requested review of D6130: kafka: increase the open file limit.
Aug 25 2021, 10:25 AM
vsellier added a revision to T3501: Too many open files error on kafka: D6130: kafka: increase the open file limit.
Aug 25 2021, 10:25 AM · Journal, System administration
vsellier added a comment to T3501: Too many open files error on kafka.

ok roger that :).
I will increase to 524288 in the diff

Aug 25 2021, 10:21 AM · Journal, System administration
vsellier added a comment to T3501: Too many open files error on kafka.

All the loaders have been restarted on worker01 and workers02; the cluster seems to be OK.

Aug 25 2021, 10:12 AM · Journal, System administration
vsellier added a comment to T3501: Too many open files error on kafka.

The open file limit was manually increased to stabilize the cluster:

# puppet agent --disable T3501
# diff -U3 /tmp/kafka.service kafka.service
--- /tmp/kafka.service	2021-08-25 07:32:28.068928972 +0000
+++ kafka.service	2021-08-25 07:32:31.384955246 +0000
@@ -15,7 +15,7 @@
 Environment='LOG_DIR=/var/log/kafka'
 Type=simple
 ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
-LimitNOFILE=65536
+LimitNOFILE=131072
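
To confirm that the running broker actually picked up the new limit, the effective value can be read from /proc/<pid>/limits. A minimal Python sketch, assuming a Linux host; the helper name and the sample text are illustrative and not part of the incident tooling:

```python
import re
from typing import Optional, Tuple


def parse_nofile_limit(limits_text: str) -> Optional[Tuple[int, int]]:
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Returns None when no numeric line is found ('unlimited' is not handled).
    """
    match = re.search(
        r"^Max open files\s+(\d+)\s+(\d+)", limits_text, re.MULTILINE
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Illustrative sample in the layout used by /proc/<pid>/limits
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max processes             63432                63432                processes\n"
    "Max open files            131072               131072               files\n"
)

if __name__ == "__main__":
    print(parse_nofile_limit(SAMPLE))
```

On a real host, the text would come from the limits file of the Kafka main process (its PID can be obtained with `systemctl show -p MainPID kafka`).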
Aug 25 2021, 9:43 AM · Journal, System administration
vsellier added a comment to T3501: Too many open files error on kafka.
  • Incident created on status.io
  • Loader disabled:
root@pergamon:~# clush -b -w @swh-workers 'puppet agent --disable "Kafka incident T3501"; systemctl stop cron; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@loader_*; do systemctl disable $unit; done; systemctl stop "swh-worker@loader_*"'
Aug 25 2021, 9:15 AM · Journal, System administration
vsellier changed the status of T3501: Too many open files error on kafka from Open to Work in Progress.
Aug 25 2021, 9:04 AM · Journal, System administration

Aug 24 2021

vsellier committed rSENVeb3a616b885b: vagrant: update debian image to debian 10.10 (authored by vsellier).
vagrant: update debian image to debian 10.10
Aug 24 2021, 5:54 PM
vsellier added a comment to T3493: [cassandra] Git loader performance are very bad.

Some live data from a git loader with a batch size of 1000 for each object type (with D6118 applied):

"object type";"input count";"missing_id duration (s)";"_missing_id count";"_add duration (s)"
content;1000;0.4928;999;35.3384
content;1000;0.4095;1000;34.1440
content;1000;0.4374;998;35.6249
content;492;0.2960;488;16.7028
directory;1000;0.3978;999;71.2518
directory;1000;0.4484;1000;39.6845
directory;1000;0.4356;1000;54.0077
directory;1000;0.3833;1000;36.1437
directory;1000;0.4319;1000;30.5690
directory;402;0.1718;402;19.2335
revision;1000;0.8671;1000;10.3417
revision;575;0.4639;575;4.0819
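
The figures above can be aggregated per object type to compare write throughput. A small Python sketch, assuming that the `_add` duration covers the objects reported as missing (an interpretation of the columns, not something stated in the measurements):

```python
import csv
import io
from collections import defaultdict

# Rows copied from the measurements above (header omitted)
MEASUREMENTS = """\
content;1000;0.4928;999;35.3384
content;1000;0.4095;1000;34.1440
content;1000;0.4374;998;35.6249
content;492;0.2960;488;16.7028
directory;1000;0.3978;999;71.2518
directory;1000;0.4484;1000;39.6845
directory;1000;0.4356;1000;54.0077
directory;1000;0.3833;1000;36.1437
directory;1000;0.4319;1000;30.5690
directory;402;0.1718;402;19.2335
revision;1000;0.8671;1000;10.3417
revision;575;0.4639;575;4.0819
"""


def summarize(raw: str) -> dict:
    """Sum the missing-object counts and _add durations per object type."""
    totals = defaultdict(lambda: {"missing": 0, "add_seconds": 0.0})
    for row in csv.reader(io.StringIO(raw), delimiter=";"):
        if not row:
            continue
        object_type, _input, _missing_s, missing, add_s = row
        totals[object_type]["missing"] += int(missing)
        totals[object_type]["add_seconds"] += float(add_s)
    return dict(totals)


if __name__ == "__main__":
    for object_type, t in summarize(MEASUREMENTS).items():
        rate = t["missing"] / t["add_seconds"]
        print(f"{object_type}: {rate:.1f} objects/s written")
```

Under that assumption the write side runs at roughly 29 contents, 22 directories and 109 revisions per second, which is far slower than the sub-second missing_id lookups.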
Aug 24 2021, 3:18 PM · System administration, Storage manager
vsellier accepted D6118: cassandra: Make content_missing query in batches.

The performance is OK now for the read part with a batch size of 1000 for content, directory and revision.

Aug 24 2021, 3:09 PM
vsellier added a revision to T3493: [cassandra] Git loader performance are very bad: D6118: cassandra: Make content_missing query in batches.
Aug 24 2021, 3:06 PM · System administration, Storage manager
vsellier added a task to D6118: cassandra: Make content_missing query in batches: T3493: [cassandra] Git loader performance are very bad.
Aug 24 2021, 3:06 PM
vsellier closed D6127: backfill: add extra where clause to use the right index for extid requests.
Aug 24 2021, 2:57 PM
vsellier committed rDSTO7113198fd65e: backfill: add extra where clause to use the right index for extid requests (authored by vsellier).
backfill: add extra where clause to use the right index for extid requests
Aug 24 2021, 2:57 PM
vsellier changed the status of T3476: One of the system disks of beaubourg is out of order, a subtask of T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem, from Open to Work in Progress.
Aug 24 2021, 2:43 PM · System administration
vsellier changed the status of T3476: One of the system disks of beaubourg is out of order from Open to Work in Progress.

An alert was sent by email on 2021-05-22 at 05:30 AM, so the monitoring did detect the issue ;) :

This message was generated by the smartd daemon running on:
Aug 24 2021, 2:43 PM · System administration
vsellier closed T3499: Move firewall storage to local hypervisor storage as Resolved.
Aug 24 2021, 2:29 PM · System administration
vsellier closed T3499: Move firewall storage to local hypervisor storage, a subtask of T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem, as Resolved.
Aug 24 2021, 2:29 PM · System administration
vsellier added a comment to T3499: Move firewall storage to local hypervisor storage.

on hypervisor3 and branly

  • A new lvm volume was created and mounted on /var/lib/vz (40G on hypervisor3 / 100G on branly)
  • local storage type was activated on proxmox via the ui (Datacenter / storage / local, check enable)
  • pushkin and glytotek disks were moved via the UI to the local storage (<vm> / hardware, click on the disk / move disk button / target storage 'local')
Aug 24 2021, 2:29 PM · System administration
vsellier triaged T3499: Move firewall storage to local hypervisor storage as High priority.
Aug 24 2021, 2:21 PM · System administration
vsellier requested review of D6127: backfill: add extra where clause to use the right index for extid requests.
Aug 24 2021, 2:02 PM
vsellier added a revision to T3485: extid topic is misconfigured in staging and production: D6127: backfill: add extra where clause to use the right index for extid requests.
Aug 24 2021, 1:55 PM · System administration
vsellier renamed T3493: [cassandra] Git loader performance are very bad from Git loader performance are very bad to [cassandra] Git loader performance are very bad.
Aug 24 2021, 12:07 PM · System administration, Storage manager
vsellier accepted D6124: agent_checks: Install check_systemd plugin and command.

LGTM (double checked with @olasd ;) )

Aug 24 2021, 10:58 AM

Aug 23 2021

vsellier accepted D6120: cassandra: Bump next_visit_id when origin_visit_add is called by a replayer.
Aug 23 2021, 2:50 PM
vsellier added a comment to T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting.

It seems the problem is no longer present now (tested with several origins)

root@parasilo-19:~/swh-environment/docker# docker exec -ti docker_swh-loader_1 bash
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
Aug 23 2021, 2:50 PM · Storage manager
vsellier added a comment to T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting.

The origin_visit topic was replayed with your diff during the weekend. Let's now test whether the worker behavior is more deterministic.

Aug 23 2021, 11:42 AM · Storage manager

Aug 20 2021

vsellier committed rSENV1c4068100e99: packer/vagrant: upgrade debian buster to version 10.10 (authored by vsellier).
packer/vagrant: upgrade debian buster to version 10.10
Aug 20 2021, 6:59 PM
vsellier committed rSENV023b0c6ca879: packer: update base image url as 10.9.0 is not anymore the current one (authored by vsellier).
packer: update base image url as 10.9.0 is not anymore the current one
Aug 20 2021, 6:16 PM

Aug 19 2021

vsellier changed the status of T3465: Test multidatacenter replication, a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, from Open to Work in Progress.
Aug 19 2021, 7:19 PM · System administration, Storage manager
vsellier changed the status of T3465: Test multidatacenter replication from Open to Work in Progress.
Aug 19 2021, 7:19 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

The gros cluster at Nancy[1] has a lot of nodes (124), each with a small reservable 960 GB SSD. It is a good candidate for the second cluster, and it will also allow checking the performance with data (and commit logs) on SSDs.
Based on the main cluster, at least 8 nodes are needed to handle the data volume (7.3 TB and growing). Starting with 10 nodes leaves some spare space.
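
The node count follows from a simple capacity division. A worked sketch in Python, assuming decimal units (1 TB = 1000 GB) and counting raw disk space only, with no allowance for compaction overhead or a different replication factor:

```python
import math

DATA_GB = 7300  # current volume on the main cluster (7.3 TB)
DISK_GB = 960   # reservable SSD per node on the gros cluster


def min_nodes(data_gb: float, disk_gb: float) -> int:
    """Smallest number of nodes whose combined disks can hold the data."""
    return math.ceil(data_gb / disk_gb)


def headroom_gb(nodes: int, data_gb: float, disk_gb: float) -> float:
    """Raw space left once the data is spread over the given nodes."""
    return nodes * disk_gb - data_gb


if __name__ == "__main__":
    print(min_nodes(DATA_GB, DISK_GB))        # 7300 / 960 = 7.6, rounded up
    print(headroom_gb(10, DATA_GB, DISK_GB))  # spare space with 10 nodes
```

With these numbers 8 nodes is the minimum, and 10 nodes leave about 2.3 TB of raw space for growth.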

Aug 19 2021, 7:11 PM · System administration, Storage manager
vsellier added a comment to T3493: [cassandra] Git loader performance are very bad.

It seems that more precise information can be logged by activating the full query logs, without a big performance impact: https://cassandra.apache.org/doc/latest/cassandra/new/fqllogging.html

Aug 19 2021, 6:52 PM · System administration, Storage manager
vsellier added a comment to T3491: Origin visit ids restart from 1 even if there is previous visits.

Should be fixed by T3482

Aug 19 2021, 4:34 PM · System administration, Storage manager
vsellier triaged T3493: [cassandra] Git loader performance are very bad as Normal priority.
Aug 19 2021, 4:32 PM · System administration, Storage manager
vsellier triaged T3491: Origin visit ids restart from 1 even if there is previous visits as Normal priority.
Aug 19 2021, 4:20 PM · System administration, Storage manager
vsellier updated the task description for T3487: Installation of the new provenance server.
Aug 19 2021, 12:29 PM · System administration
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

After ~40h, the backfill is ~5% done in staging and less than 1% done in production.
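
A rough projection of the total duration, assuming the backfill keeps progressing at a constant rate (a simplification; the function name is illustrative):

```python
def estimate_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation of the total run time from the progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


if __name__ == "__main__":
    staging = estimate_total_hours(40, 0.05)
    # Production is below 1%, so this value is only a lower bound
    production = estimate_total_hours(40, 0.01)
    print(f"staging: ~{staging / 24:.0f} days")
    print(f"production: more than {production / 24:.0f} days")
```

At this pace staging would need about 800 hours in total, and production more than 4000 hours.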

Aug 19 2021, 10:08 AM · System administration

Aug 18 2021

vsellier committed rPTSc2e0c5766b2c: pristine-tar data for tree-sitter_0.19.0.orig.tar.gz (authored by vsellier).
pristine-tar data for tree-sitter_0.19.0.orig.tar.gz
Aug 18 2021, 4:20 PM
vsellier committed rPTS7702e942c810: initialize the backport build configuration (authored by vsellier).
initialize the backport build configuration
Aug 18 2021, 4:20 PM
vsellier committed rPTS08f03d9d6057: Initial packaging for python3-tree-sitter (authored by vsellier).
Initial packaging for python3-tree-sitter
Aug 18 2021, 4:20 PM
vsellier committed rPTS598b5ec8232c: New upstream version 0.19.0 (authored by vsellier).
New upstream version 0.19.0
Aug 18 2021, 4:19 PM
vsellier committed rCJSWH9a81390eda36: Declare debian package build for tree-sitter (authored by vsellier).
Declare debian package build for tree-sitter
Aug 18 2021, 4:11 PM
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

The backfill was relaunched using the script pasted in P1124

Aug 18 2021, 2:38 PM · System administration
vsellier created P1124 restart a backfill where it has stopped previously.
Aug 18 2021, 2:37 PM
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

The backfill process was interrupted by a restart of kafka on kafka1 (!).

2021-08-18T09:20:05 ERROR    swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)
2021-08-18T09:20:05 INFO     swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)"}
2021-08-18T09:20:05 ERROR    swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)
2021-08-18T09:20:05 INFO     swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)"}
2021-08-18T09:20:07 INFO     swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [128] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [130] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [132] is unknown (partition_cnt 128): ignoring leader (-1) update
...
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [253] is unknown (partition_cnt 128): ignoring leader (-1) update
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 215, in flush
    raise self.delivery_error("Failed deliveries after flush()")
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(Failed deliveries after flush(), [extid 344a2795951fabbf1f898b1a5fc54c4b57293cd5 (Local: Unknown partition)])
2021-08-18T09:20:07 INFO     swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
...
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid 6e1a1317c35b971ef88e052a8b1b78d57bc71a2e (No delivery before flush() timeout), extid a5052a247a0af7926b8e33224ecf7ab12c148eb5 (No delivery before flush() timeout), extid 4f5ed974e8691d340724782b01bc9bb63781176f (No delivery before flush() timeout)])
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 212, in flush
    "flush() exceeded timeout (%ss)" % self.flush_timeout,
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid b3f5a81891b2be4bf487ff1f8418110fd87d1042 (No delivery before flush() timeout), extid 5c165ffa4bb15bde37d0652cee9e19c5f0cda09b (No delivery before flush() timeout)])

The backfill will be restarted from the last positions (need to figure out how to do that without taking too much time)
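
One possible way to resume is to regenerate the commands with each range starting at the last object known to have been written. The Python sketch below only illustrates that idea and is not the script that was eventually used; the `last_done` mapping is hypothetical input, and how those positions are recovered is left open:

```python
from typing import Dict, List, Optional, Tuple

# (start, end) of an original range; end is None for the open-ended last range
Range = Tuple[str, Optional[str]]


def restart_commands(ranges: List[Range], last_done: Dict[str, str]) -> List[str]:
    """Build backfill commands resuming each range at its last known position.

    last_done maps the original start of a range to the last object id known
    to have been written for it (hypothetical input). Ranges without an entry
    are restarted from their original start.
    """
    commands = []
    for start, end in ranges:
        resume = last_done.get(start, start)
        if end is not None and int(resume, 16) >= int(end, 16):
            continue  # this range had already reached its end
        command = f"swh storage backfill extid --start-object {resume}"
        if end is not None:
            command += f" --end-object {end}"
        commands.append(command)
    return commands


if __name__ == "__main__":
    ranges = [("000000", "080000"), ("080001", "100000"), ("f80001", None)]
    positions = {"000000": "05a3f1", "080001": "100000"}  # made-up positions
    for line in restart_commands(ranges, positions):
        print(line)
```

Resuming at the last written object sends that object a second time, which is assumed to be acceptable here.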

Aug 18 2021, 11:53 AM · System administration
vsellier committed rDSNIPf4c8abe97ccc: grid5000/cassandra: add a script to refresh the besteffort node list (authored by vsellier).
grid5000/cassandra: add a script to refresh the besteffort node list
Aug 18 2021, 10:30 AM
vsellier committed rDSNIPdb9574d46037: grid5000/cassadra: declare the best effort nodes only when they are fully… (authored by vsellier).
grid5000/cassadra: declare the best effort nodes only when they are fully…
Aug 18 2021, 10:30 AM
vsellier committed rDSNIP19515afeb074: grid5000/cassadra: count best_effort jobs in waiting/launching state (authored by vsellier).
grid5000/cassadra: count best_effort jobs in waiting/launching state
Aug 18 2021, 10:30 AM
vsellier updated subscribers of T3487: Installation of the new provenance server.

@jayeshv @aeviso @douardda @olasd do you have an idea of what should be installed on the server, and who will operate what is installed on it?

Aug 18 2021, 9:50 AM · System administration
vsellier updated the task description for T3487: Installation of the new provenance server.
Aug 18 2021, 9:46 AM · System administration
vsellier changed the status of T3487: Installation of the new provenance server from Open to Work in Progress.
Aug 18 2021, 9:45 AM · System administration

Aug 17 2021

vsellier added a comment to T3484: Fix the release builds for swh-search.

One very important thing to get right is the Build-Depends line in the source package stanza. setuptools/distribute-based packages have the nasty habit of downloading dependencies from PyPI if they are needed at python setup.py build time. If the package is available from the system (as would be the case when Build-Depends is up-to-date), then distribute will not try to download the package, otherwise it will try to download it. This is a huge no-no, and pybuild internally sets the http_proxy and https_proxy environment variables (to 127.0.0.1:9) to prevent this from happening.
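
The proxy mechanism described in this quote can be reproduced outside of pybuild to check whether a build tries to reach the network. A minimal Python sketch that mimics it; the `http://` scheme in the value and the helper name are assumptions, only the 127.0.0.1:9 address comes from the text above:

```python
import os
import urllib.request

# 127.0.0.1:9 is the discard port: nothing is expected to answer there
BLACKHOLE_PROXY = "http://127.0.0.1:9"


def offline_env(base: dict) -> dict:
    """Return a copy of an environment with both proxy variables blackholed."""
    env = dict(base)
    env["http_proxy"] = BLACKHOLE_PROXY
    env["https_proxy"] = BLACKHOLE_PROXY
    return env


if __name__ == "__main__":
    # Apply to the current process and show what urllib would use
    os.environ.update(offline_env(os.environ))
    print(urllib.request.getproxies())
```

Any tool that honors these variables then fails fast with a proxy error instead of silently downloading a missing dependency.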

Aug 17 2021, 6:13 PM · System administration, Archive search
vsellier added a comment to T3484: Fix the release builds for swh-search.

The PyPI build still works with the last 2 diffs.
There is now a new error during the Debian build:

dh: warning: Compatibility levels before 10 are deprecated (level 9 in use)
   dh_auto_clean -O--buildsystem=pybuild
dh_auto_clean: warning: Compatibility levels before 10 are deprecated (level 9 in use)
I: pybuild base:232: python3.9 setup.py clean 
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101bcd0>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101beb0>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b850>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b730>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b610>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
ERROR: Could not find a version that satisfies the requirement tree-sitter==0.19.0
ERROR: No matching distribution found for tree-sitter==0.19.0
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/setuptools/installer.py", line 75, in fetch_build_egg
    subprocess.check_call(cmd)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.9', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmpdrbws3hq', '--quiet', 'tree-sitter==0.19.0']' returned non-zero exit status 1.
Aug 17 2021, 5:57 PM · System administration, Archive search
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

current status:

Aug 17 2021, 5:41 PM · System administration, Storage manager
vsellier committed rDSNIPa31433b334ad: grid5000/cassandra: replay extid topic (authored by vsellier).
grid5000/cassandra: replay extid topic
Aug 17 2021, 5:27 PM
vsellier committed rDSNIP35813e5a8fcc: grid5000/cassandra: adapt the number of replayers (authored by vsellier).
grid5000/cassandra: adapt the number of replayers
Aug 17 2021, 5:27 PM
vsellier moved T3485: extid topic is misconfigured in staging and production from Backlog to in-progress on the System administration board.
Aug 17 2021, 5:12 PM · System administration
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

Production backfill in progress:

root@getty:~/T3485# ./backfill.sh | tee output.log
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 000000 --end-object 080000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 080001 --end-object 100000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 100001 --end-object 180000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 180001 --end-object 200000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 200001 --end-object 280000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 280001 --end-object 300000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 300001 --end-object 380000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 380001 --end-object 400000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 400001 --end-object 480000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 480001 --end-object 500000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 500001 --end-object 580000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 580001 --end-object 600000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 600001 --end-object 680000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 680001 --end-object 700000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 700001 --end-object 780000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 780001 --end-object 800000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 800001 --end-object 880000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 880001 --end-object 900000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 900001 --end-object 980000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 980001 --end-object a00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object a00001 --end-object a80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object a80001 --end-object b00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object b00001 --end-object b80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object b80001 --end-object c00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object c00001 --end-object c80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object c80001 --end-object d00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object d00001 --end-object d80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object d80001 --end-object e00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object e00001 --end-object e80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object e80001 --end-object f00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object f00001 --end-object f80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object f80001
Aug 17 2021, 4:41 PM · System administration
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

Production

Unfortunately, the replication factor can't be changed directly; the partition assignment must be reconfigured to change it.
This was done before increasing the number of partitions, to limit the number of moves to perform.
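
Changing the replication factor through a reassignment means giving Kafka an explicit replica list for every partition. The Python sketch below builds such a plan in the JSON layout read by kafka-reassign-partitions.sh; the broker ids, partition count and target replication factor are illustrative assumptions, not the production values:

```python
import json
from typing import List


def reassignment_plan(
    topic: str, partitions: int, brokers: List[int], replication_factor: int
) -> dict:
    """Spread replicas round-robin over the brokers for every partition."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the broker count")
    plan = {"version": 1, "partitions": []}
    for partition in range(partitions):
        replicas = [
            brokers[(partition + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
        plan["partitions"].append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return plan


if __name__ == "__main__":
    # Hypothetical cluster: 4 brokers with ids 1..4, target replication factor 2
    plan = reassignment_plan("swh.journal.objects.extid", 64, [1, 2, 3, 4], 2)
    print(json.dumps(plan, indent=2))
```

The resulting file would be passed to kafka-reassign-partitions.sh with `--reassignment-json-file` and `--execute`. Doing this while the partition count is still low keeps the plan, and the amount of data to move, small.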

Aug 17 2021, 4:31 PM · System administration
vsellier edited P1122 generate backfill command for a given range (for sha1).
Aug 17 2021, 3:03 PM
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

The backfill is running in staging (launched with the P1121 and P1122 scripts on storage1.staging on 2021-08-17 at 11:20 UTC):

swhstorage@storage1:~$ ./backfill.sh | tee output.log
swhstorage@storage1:~$ grep Starting output.log
Starting  swh storage backfill  extid --start-object 000000 --end-object 080000 
Starting  swh storage backfill  extid --start-object 080001 --end-object 100000 
Starting  swh storage backfill  extid --start-object 100001 --end-object 180000 
Starting  swh storage backfill  extid --start-object 180001 --end-object 200000 
Starting  swh storage backfill  extid --start-object 200001 --end-object 280000 
Starting  swh storage backfill  extid --start-object 280001 --end-object 300000 
Starting  swh storage backfill  extid --start-object 300001 --end-object 380000 
Starting  swh storage backfill  extid --start-object 380001 --end-object 400000 
Starting  swh storage backfill  extid --start-object 400001 --end-object 480000 
Starting  swh storage backfill  extid --start-object 480001 --end-object 500000 
Starting  swh storage backfill  extid --start-object 500001 --end-object 580000 
Starting  swh storage backfill  extid --start-object 580001 --end-object 600000 
Starting  swh storage backfill  extid --start-object 600001 --end-object 680000 
Starting  swh storage backfill  extid --start-object 680001 --end-object 700000 
Starting  swh storage backfill  extid --start-object 700001 --end-object 780000 
Starting  swh storage backfill  extid --start-object 780001 --end-object 800000 
Starting  swh storage backfill  extid --start-object 800001 --end-object 880000 
Starting  swh storage backfill  extid --start-object 880001 --end-object 900000 
Starting  swh storage backfill  extid --start-object 900001 --end-object 980000 
Starting  swh storage backfill  extid --start-object 980001 --end-object a00000 
Starting  swh storage backfill  extid --start-object a00001 --end-object a80000 
Starting  swh storage backfill  extid --start-object a80001 --end-object b00000 
Starting  swh storage backfill  extid --start-object b00001 --end-object b80000 
Starting  swh storage backfill  extid --start-object b80001 --end-object c00000 
Starting  swh storage backfill  extid --start-object c00001 --end-object c80000 
Starting  swh storage backfill  extid --start-object c80001 --end-object d00000 
Starting  swh storage backfill  extid --start-object d00001 --end-object d80000 
Starting  swh storage backfill  extid --start-object d80001 --end-object e00000 
Starting  swh storage backfill  extid --start-object e00001 --end-object e80000 
Starting  swh storage backfill  extid --start-object e80001 --end-object f00000 
Starting  swh storage backfill  extid --start-object f00001 --end-object f80000 
Starting  swh storage backfill  extid --start-object f80001
Aug 17 2021, 1:33 PM · System administration
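The range listing in the comment above follows a regular pattern: 6-hex-digit sha1 prefixes, a step of 0x80000, each range starting one past the previous end, and an open-ended last range. A minimal sketch that reproduces that pattern is shown here; it is not the content of P1121 or P1122. The function name `gen_backfill_ranges` is hypothetical, and the first range (starting at `000001`) simply applies the same rule, which may differ from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one backfill command per sha1-prefix range.
# The 6-hex-digit width and the 0x80000 step are inferred from the listing,
# not taken from the original pastes.
gen_backfill_ranges() {
  local object_type=$1
  local step=$((0x80000)) max=$((0x1000000)) start end
  for ((start = 0; start < max; start += step)); do
    end=$((start + step))
    if ((end >= max)); then
      # Last range: no --end-object, as in the final line of the listing.
      printf 'swh storage backfill %s --start-object %06x\n' \
        "$object_type" $((start + 1))
    else
      printf 'swh storage backfill %s --start-object %06x --end-object %06x\n' \
        "$object_type" $((start + 1)) "$end"
    fi
  done
}
```

Calling `gen_backfill_ranges extid` prints 32 commands, one per range, which can then be reviewed before being run.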
vsellier edited P1121 bakfill script for sha1 based ranges.
Aug 17 2021, 1:17 PM
vsellier edited P1122 generate backfill command for a given range (for sha1).
Aug 17 2021, 12:52 PM
vsellier edited P1121 bakfill script for sha1 based ranges.
Aug 17 2021, 12:48 PM
vsellier added a comment to P1121 bakfill script for sha1 based ranges.

to use with P1122

Aug 17 2021, 12:34 PM
vsellier created P1122 generate backfill command for a given range (for sha1).
Aug 17 2021, 12:34 PM
vsellier created P1121 bakfill script for sha1 based ranges.
Aug 17 2021, 12:33 PM
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh --zookeeper $ZK  --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 64
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.extid.
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep ReplicationFactor      
Topic: swh.journal.objects.extid	PartitionCount: 64	ReplicationFactor: 1	Configs: cleanup.policy=compact,max.message.bytes=104857600
Aug 17 2021, 11:01 AM · System administration
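The deprecation warning in the session above points to kafka-configs.sh for topic configuration changes. A possible equivalent is sketched here, assuming a Kafka version whose kafka-configs.sh accepts `--bootstrap-server` for topic entities. The `run` wrapper, the `DRY_RUN` switch and the `alter_extid_topic` function name are illustrative assumptions, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Sketch only: SERVER is expected to hold the broker address, as in the
# session above; DRY_RUN is an assumed switch added for illustration.
run() {
  # Print the command instead of executing it when DRY_RUN is set.
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

alter_extid_topic() {
  local topic=swh.journal.objects.extid
  # Topic-level config change through kafka-configs.sh, as the warning suggests.
  run /opt/kafka/bin/kafka-configs.sh --bootstrap-server "$SERVER" \
    --entity-type topics --entity-name "$topic" \
    --alter --add-config cleanup.policy=compact
  # The partition count is still changed with kafka-topics.sh.
  run /opt/kafka/bin/kafka-topics.sh --bootstrap-server "$SERVER" \
    --alter --topic "$topic" --partitions 64
}
```

Running `DRY_RUN=1 alter_extid_topic` prints both commands without contacting the cluster, which allows a review before applying the change.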
vsellier updated the task description for T3485: extid topic is misconfigured in staging and production.
Aug 17 2021, 11:00 AM · System administration
vsellier changed the status of T3485: extid topic is misconfigured in staging and production from Open to Work in Progress.
Aug 17 2021, 10:57 AM · System administration

Aug 16 2021

vsellier committed rDENV4155fc0087bc: cassandra: use the CASSANDRA_SEED env variable for database initialization (authored by vsellier).
cassandra: use the CASSANDRA_SEED env variable for database initialization
Aug 16 2021, 5:01 PM
vsellier closed D6093: storage-cassandra: Remove the default src override.
Aug 16 2021, 4:36 PM
vsellier committed rDENV1b9307ce7751: storage-cassandra: Remove the default src override (authored by vsellier).
storage-cassandra: Remove the default src override
Aug 16 2021, 4:36 PM
vsellier closed D6092: counters: Match the default configuration to the real production url.
Aug 16 2021, 4:35 PM
vsellier committed rDWAPPSddfb988db5ec: counters: Match the default configuration to the real production url (authored by vsellier).
counters: Match the default configuration to the real production url
Aug 16 2021, 4:35 PM
vsellier requested review of D6093: storage-cassandra: Remove the default src override.
Aug 16 2021, 4:27 PM
vsellier added a revision to T3357: Perform some tests of the cassandra storage on Grid5000: D6093: storage-cassandra: Remove the default src override.
Aug 16 2021, 4:27 PM · System administration, Storage manager
vsellier requested review of D6092: counters: Match the default configuration to the real production url.
Aug 16 2021, 3:58 PM
vsellier renamed T3484: Fix the release builds for swh-search from Fix the pypi-upload build for swh-search to Fix the release builds for swh-search.
Aug 16 2021, 2:54 PM · System administration, Archive search
vsellier accepted D6088: Use setup_requires to install tree-sitter.

Thanks

Aug 16 2021, 2:49 PM
vsellier added a comment to D6088: Use setup_requires to install tree-sitter.

We tested with @vlorentz: it's OK if yarn's build target is updated so that it no longer calls the build-so and build-wasm targets, and if the tree-sitter module is kept in the docker image.

Aug 16 2021, 2:31 PM
vsellier committed rDSNIP33178f46ac4b: grid5000/cassandra: adapt number of consummers (authored by vsellier).
grid5000/cassandra: adapt number of consummers
Aug 16 2021, 12:27 PM
vsellier committed rDSNIP41e8ee27b337: grid5000/cassandra: increase message size limit to allow revision replaying (authored by vsellier).
grid5000/cassandra: increase message size limit to allow revision replaying
Aug 16 2021, 12:27 PM
vsellier added a comment to T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

Ceph status went back to OK after these actions:

  • Clean up the crash history
    • to check the status:
ceph crash ls
ceph crash info <id>
    • to clean up:
ceph crash archive-all
ceph config set mon mon_warn_on_insecure_global_id_reclaim false
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
Aug 16 2021, 10:20 AM · System administration

Aug 13 2021

vsellier added a comment to D6088: Use setup_requires to install tree-sitter.

Not sure whether I'm missing something or whether something is missing from this diff (I can't find its parent in my repo), but I applied it and the build still fails when the yarn command is launched. That sounds logical, as the yarn config still launches the tree-sitter command directly.

Aug 13 2021, 5:37 PM
vsellier closed D6085: Install a missing python module for the swh-search build.
Aug 13 2021, 4:30 PM
vsellier committed rCDFJ2fea9ce49664: Install a missing python module for the swh-search build (authored by vsellier).
Install a missing python module for the swh-search build
Aug 13 2021, 4:30 PM
vsellier closed D6086: Document the dependency on the tree-sitter python module.
Aug 13 2021, 4:30 PM
vsellier committed rDSEA84115fa41877: Document the dependency on the tree-sitter python module (authored by vsellier).
Document the dependency on the tree-sitter python module
Aug 13 2021, 4:30 PM
vsellier added inline comments to D6086: Document the dependency on the tree-sitter python module.
Aug 13 2021, 4:24 PM
vsellier updated the diff for D6086: Document the dependency on the tree-sitter python module.

fix version selection

Aug 13 2021, 4:24 PM
vsellier requested review of D6086: Document the dependency on the tree-sitter python module.
Aug 13 2021, 4:22 PM
vsellier added a revision to T3484: Fix the release builds for swh-search: D6086: Document the dependency on the tree-sitter python module.
Aug 13 2021, 4:18 PM · System administration, Archive search
vsellier added a revision to T3484: Fix the release builds for swh-search: D6085: Install a missing python module for the swh-search build.
Aug 13 2021, 4:14 PM · System administration, Archive search
vsellier requested review of D6085: Install a missing python module for the swh-search build.
Aug 13 2021, 4:14 PM
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

Current import status before the run of this week-end:

Aug 13 2021, 3:32 PM · System administration, Storage manager
vsellier committed rDSNIPa98c1f651ecb: grid5000/cassandra: improbe cassandra monitoring (authored by vsellier).
grid5000/cassandra: improbe cassandra monitoring
Aug 13 2021, 12:39 PM
vsellier moved T3484: Fix the release builds for swh-search from Backlog to in-progress on the System administration board.
Aug 13 2021, 10:35 AM · System administration, Archive search
vsellier changed the status of T3484: Fix the release builds for swh-search from Open to Work in Progress.
Aug 13 2021, 10:35 AM · System administration, Archive search