Aug 19 2021

vsellier updated the task description for T3487: Installation of the new provenance server.
Aug 19 2021, 12:29 PM · System administration
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

After ~40h, the backfill is ~5% done in staging and less than 1% done in production.
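
As a rough back-of-the-envelope check on these figures, the sketch below extrapolates the total duration linearly from the elapsed time and the fraction completed. It assumes the throughput stays constant, which may not hold in practice, and the function name is purely illustrative.

```python
def estimate_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linearly extrapolate the total run time from the progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Staging: ~5% done after ~40h.
staging_total = estimate_total_hours(40, 0.05)
print(f"staging: ~{staging_total:.0f}h in total (~{staging_total / 24:.0f} days)")

# Production: less than 1% done after ~40h, so this is only a lower bound.
production_total = estimate_total_hours(40, 0.01)
print(f"production: more than {production_total:.0f}h in total")
```

Under that assumption, staging would need roughly 800 hours in total and production more than 4000 hours.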

Aug 19 2021, 10:08 AM · System administration

Aug 18 2021

vsellier committed rPTSc2e0c5766b2c: pristine-tar data for tree-sitter_0.19.0.orig.tar.gz (authored by vsellier).
pristine-tar data for tree-sitter_0.19.0.orig.tar.gz
Aug 18 2021, 4:20 PM
vsellier committed rPTS7702e942c810: initialize the backport build configuration (authored by vsellier).
initialize the backport build configuration
Aug 18 2021, 4:20 PM
vsellier committed rPTS08f03d9d6057: Initial packaging for python3-tree-sitter (authored by vsellier).
Initial packaging for python3-tree-sitter
Aug 18 2021, 4:20 PM
vsellier committed rPTS598b5ec8232c: New upstream version 0.19.0 (authored by vsellier).
New upstream version 0.19.0
Aug 18 2021, 4:19 PM
vsellier committed rCJSWH9a81390eda36: Declare debian package build for tree-sitter (authored by vsellier).
Declare debian package build for tree-sitter
Aug 18 2021, 4:11 PM
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

The backfill was relaunched using the script pasted in P1124.

Aug 18 2021, 2:38 PM · System administration
vsellier created P1124 restart a backfill where it has stopped previously.
Aug 18 2021, 2:37 PM
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

The backfill process was interrupted by a restart of kafka on kafka1 (!).

2021-08-18T09:20:05 ERROR    swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)
2021-08-18T09:20:05 INFO     swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)"}
2021-08-18T09:20:05 ERROR    swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)
2021-08-18T09:20:05 INFO     swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)"}
2021-08-18T09:20:07 INFO     swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [128] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [130] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [132] is unknown (partition_cnt 128): ignoring leader (-1) update
...
2021-08-18T09:20:07 WARNING  swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [253] is unknown (partition_cnt 128): ignoring leader (-1) update
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 215, in flush
    raise self.delivery_error("Failed deliveries after flush()")
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(Failed deliveries after flush(), [extid 344a2795951fabbf1f898b1a5fc54c4b57293cd5 (Local: Unknown partition)])
2021-08-18T09:20:07 INFO     swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
...
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid 6e1a1317c35b971ef88e052a8b1b78d57bc71a2e (No delivery before flush() timeout), extid a5052a247a0af7926b8e33224ecf7ab12c148eb5 (No delivery before flush() timeout), extid 4f5ed974e8691d340724782b01bc9bb63781176f (No delivery before flush() timeout)])
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 212, in flush
    "flush() exceeded timeout (%ss)" % self.flush_timeout,
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid b3f5a81891b2be4bf487ff1f8418110fd87d1042 (No delivery before flush() timeout), extid 5c165ffa4bb15bde37d0652cee9e19c5f0cda09b (No delivery before flush() timeout)])

The backfill will be restarted from the last positions (need to figure out how to do that without taking too much time).
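
One possible way to resume, sketched here under assumptions: given the original ranges and the last object id each range is known to have processed, compute the ranges to relaunch. How the last positions are obtained is not specified above, so `last_done` is a hypothetical input (for example, extracted by hand from each process's log), and the sketch assumes that re-sending an object that was already written is acceptable.

```python
def restart_ranges(original_ranges, last_done):
    """Compute the (start, end) ranges to relaunch after an interruption.

    original_ranges: list of (start, end) hex prefixes; end may be None.
    last_done: dict mapping a range's original start to the last object id
        that range is known to have processed (hypothetical input).
    """
    resumed = []
    for start, end in original_ranges:
        # Truncate the full id to the prefix length: this rounds down, so the
        # resumed range overlaps slightly instead of skipping objects.
        new_start = last_done.get(start, start)[: len(start)]
        if new_start < start or (end is not None and new_start > end):
            raise ValueError(f"{new_start} is outside the range {start}..{end}")
        resumed.append((new_start, end))
    return resumed


ranges = [("000000", "080000"), ("080001", "100000"), ("f80001", None)]
# The ids below are made up for the example.
positions = {"000000": "0344a2795951fabb", "f80001": "fa5052a247a0af79"}
for start, end in restart_ranges(ranges, positions):
    print(start, end)
```

Ranges without a recorded position are simply restarted from their original start.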

Aug 18 2021, 11:53 AM · System administration
vsellier committed rDSNIPf4c8abe97ccc: grid5000/cassandra: add a script to refresh the besteffort node list (authored by vsellier).
grid5000/cassandra: add a script to refresh the besteffort node list
Aug 18 2021, 10:30 AM
vsellier committed rDSNIPdb9574d46037: grid5000/cassadra: declare the best effort nodes only when they are fully… (authored by vsellier).
grid5000/cassadra: declare the best effort nodes only when they are fully…
Aug 18 2021, 10:30 AM
vsellier committed rDSNIP19515afeb074: grid5000/cassadra: count best_effort jobs in waiting/launching state (authored by vsellier).
grid5000/cassadra: count best_effort jobs in waiting/launching state
Aug 18 2021, 10:30 AM
vsellier updated subscribers of T3487: Installation of the new provenance server.

@jayeshv @aeviso @douardda @olasd do you have an idea of what should be installed on the server and who will operate what runs on it?

Aug 18 2021, 9:50 AM · System administration
vsellier updated the task description for T3487: Installation of the new provenance server.
Aug 18 2021, 9:46 AM · System administration
vsellier changed the status of T3487: Installation of the new provenance server from Open to Work in Progress.
Aug 18 2021, 9:45 AM · System administration

Aug 17 2021

vsellier added a comment to T3484: Fix the release builds for swh-search.

One very important thing to get right is the Build-Depends line in the source package stanza. setuptools/distribute-based packages have the nasty habit of downloading dependencies from PyPI if they are needed at python setup.py build time. If the package is available from the system (as would be the case when Build-Depends is up-to-date), then distribute will not try to download the package, otherwise it will try to download it. This is a huge no-no, and pybuild internally sets the http_proxy and https_proxy environment variables (to 127.0.0.1:9) to prevent this from happening.

Aug 17 2021, 6:13 PM · System administration, Archive search
vsellier added a comment to T3484: Fix the release builds for swh-search.

The PyPI build still works well with the last 2 diffs.
Now there is a new error during the Debian build:

dh: warning: Compatibility levels before 10 are deprecated (level 9 in use)
   dh_auto_clean -O--buildsystem=pybuild
dh_auto_clean: warning: Compatibility levels before 10 are deprecated (level 9 in use)
I: pybuild base:232: python3.9 setup.py clean 
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101bcd0>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101beb0>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b850>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b730>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fef2101b610>: Failed to establish a new connection: [Errno -2] Name or service not known'))': /simple/tree-sitter/
ERROR: Could not find a version that satisfies the requirement tree-sitter==0.19.0
ERROR: No matching distribution found for tree-sitter==0.19.0
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/setuptools/installer.py", line 75, in fetch_build_egg
    subprocess.check_call(cmd)
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.9', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmpdrbws3hq', '--quiet', 'tree-sitter==0.19.0']' returned non-zero exit status 1.
Aug 17 2021, 5:57 PM · System administration, Archive search
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

current status:

Aug 17 2021, 5:41 PM · System administration, Storage manager
vsellier committed rDSNIPa31433b334ad: grid5000/cassandra: replay extid topic (authored by vsellier).
grid5000/cassandra: replay extid topic
Aug 17 2021, 5:27 PM
vsellier committed rDSNIP35813e5a8fcc: grid5000/cassandra: adapt the number of replayers (authored by vsellier).
grid5000/cassandra: adapt the number of replayers
Aug 17 2021, 5:27 PM
vsellier moved T3485: extid topic is misconfigured in staging and production from Backlog to in-progress on the System administration board.
Aug 17 2021, 5:12 PM · System administration
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

Production backfill in progress:

root@getty:~/T3485# ./backfill.sh | tee output.log
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 000000 --end-object 080000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 080001 --end-object 100000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 100001 --end-object 180000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 180001 --end-object 200000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 200001 --end-object 280000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 280001 --end-object 300000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 300001 --end-object 380000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 380001 --end-object 400000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 400001 --end-object 480000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 480001 --end-object 500000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 500001 --end-object 580000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 580001 --end-object 600000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 600001 --end-object 680000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 680001 --end-object 700000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 700001 --end-object 780000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 780001 --end-object 800000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 800001 --end-object 880000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 880001 --end-object 900000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 900001 --end-object 980000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object 980001 --end-object a00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object a00001 --end-object a80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object a80001 --end-object b00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object b00001 --end-object b80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object b80001 --end-object c00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object c00001 --end-object c80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object c80001 --end-object d00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object d00001 --end-object d80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object d80001 --end-object e00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object e00001 --end-object e80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object e80001 --end-object f00000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object f00001 --end-object f80000 
Starting  swh --log-config /etc/softwareheritage/journal/backfill_logger.yml storage backfill  extid --start-object f80001
Aug 17 2021, 4:41 PM · System administration
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

Production

Unfortunately, the replication factor can't be changed directly; the partition assignment must be reconfigured to change it.
This was done before increasing the number of partitions, to limit the number of moves to perform.
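
To illustrate what reconfiguring the partition assignment involves, here is a minimal sketch that generates a reassignment file in the JSON format read by `kafka-reassign-partitions.sh`. The broker ids, partition count and target replication factor below are hypothetical values chosen for the example, and the round-robin placement is only one possible layout, not necessarily the one used here.

```python
import json


def reassignment_plan(topic, partition_count, broker_ids, replication_factor):
    """Build a reassignment plan spreading replicas round-robin over brokers."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds the number of brokers")
    partitions = []
    for partition in range(partition_count):
        # Consecutive brokers (modulo the broker list) host this partition.
        replicas = [
            broker_ids[(partition + offset) % len(broker_ids)]
            for offset in range(replication_factor)
        ]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return {"version": 1, "partitions": partitions}


# Hypothetical values: 4 partitions, brokers 1 to 4, replication factor 2.
plan = reassignment_plan("swh.journal.objects.extid", 4, [1, 2, 3, 4], 2)
print(json.dumps(plan, indent=2))
```

The resulting file would then be passed to `kafka-reassign-partitions.sh` with `--reassignment-json-file` and `--execute`. Since every partition is listed in the plan, doing this while the topic still has few partitions keeps the plan small.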

Aug 17 2021, 4:31 PM · System administration
vsellier edited P1122 generate backfill command for a given range (for sha1).
Aug 17 2021, 3:03 PM
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

The backfill is running in staging (launched with the P1121 and P1122 scripts on storage1.staging on 2021-08-17 at 11:20 UTC):

swhstorage@storage1:~$ ./backfill.sh | tee output.log
swhstorage@storage1:~$ grep Starting output.log
Starting  swh storage backfill  extid --start-object 000000 --end-object 080000 
Starting  swh storage backfill  extid --start-object 080001 --end-object 100000 
Starting  swh storage backfill  extid --start-object 100001 --end-object 180000 
Starting  swh storage backfill  extid --start-object 180001 --end-object 200000 
Starting  swh storage backfill  extid --start-object 200001 --end-object 280000 
Starting  swh storage backfill  extid --start-object 280001 --end-object 300000 
Starting  swh storage backfill  extid --start-object 300001 --end-object 380000 
Starting  swh storage backfill  extid --start-object 380001 --end-object 400000 
Starting  swh storage backfill  extid --start-object 400001 --end-object 480000 
Starting  swh storage backfill  extid --start-object 480001 --end-object 500000 
Starting  swh storage backfill  extid --start-object 500001 --end-object 580000 
Starting  swh storage backfill  extid --start-object 580001 --end-object 600000 
Starting  swh storage backfill  extid --start-object 600001 --end-object 680000 
Starting  swh storage backfill  extid --start-object 680001 --end-object 700000 
Starting  swh storage backfill  extid --start-object 700001 --end-object 780000 
Starting  swh storage backfill  extid --start-object 780001 --end-object 800000 
Starting  swh storage backfill  extid --start-object 800001 --end-object 880000 
Starting  swh storage backfill  extid --start-object 880001 --end-object 900000 
Starting  swh storage backfill  extid --start-object 900001 --end-object 980000 
Starting  swh storage backfill  extid --start-object 980001 --end-object a00000 
Starting  swh storage backfill  extid --start-object a00001 --end-object a80000 
Starting  swh storage backfill  extid --start-object a80001 --end-object b00000 
Starting  swh storage backfill  extid --start-object b00001 --end-object b80000 
Starting  swh storage backfill  extid --start-object b80001 --end-object c00000 
Starting  swh storage backfill  extid --start-object c00001 --end-object c80000 
Starting  swh storage backfill  extid --start-object c80001 --end-object d00000 
Starting  swh storage backfill  extid --start-object d00001 --end-object d80000 
Starting  swh storage backfill  extid --start-object d80001 --end-object e00000 
Starting  swh storage backfill  extid --start-object e00001 --end-object e80000 
Starting  swh storage backfill  extid --start-object e80001 --end-object f00000 
Starting  swh storage backfill  extid --start-object f00001 --end-object f80000 
Starting  swh storage backfill  extid --start-object f80001
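
The range layout in the listing above (32 slices of the 6-hex-digit prefix space, the last one without an end bound) can be reproduced with a short sketch. This is an independent illustration, not the content of P1121 or P1122; the function names are made up, and only the `swh storage backfill` options visible in the output are used.

```python
def backfill_ranges(chunks=32, digits=6):
    """Split the hex prefix space into `chunks` (start, end) pairs.

    The last range has no end bound (None), as in the listing above.
    """
    space = 16 ** digits
    if chunks <= 0 or space % chunks:
        raise ValueError("chunks must evenly divide the prefix space")
    step = space // chunks
    ranges = []
    for i in range(chunks):
        # Every range but the first starts one past the previous end bound.
        start = 0 if i == 0 else i * step + 1
        end = None if i == chunks - 1 else (i + 1) * step
        ranges.append(
            (f"{start:0{digits}x}", None if end is None else f"{end:0{digits}x}")
        )
    return ranges


def backfill_command(start, end, object_type="extid"):
    """Format the backfill command line for one range."""
    cmd = f"swh storage backfill {object_type} --start-object {start}"
    if end is not None:
        cmd += f" --end-object {end}"
    return cmd


for start, end in backfill_ranges():
    print("Starting ", backfill_command(start, end))
```

Each printed command could then be launched as its own background process so that the ranges are processed in parallel.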
Aug 17 2021, 1:33 PM · System administration
vsellier edited P1121 bakfill script for sha1 based ranges.
Aug 17 2021, 1:17 PM
vsellier edited P1122 generate backfill command for a given range (for sha1).
Aug 17 2021, 12:52 PM
vsellier edited P1121 bakfill script for sha1 based ranges.
Aug 17 2021, 12:48 PM
vsellier added a comment to P1121 bakfill script for sha1 based ranges.

to use with P1122

Aug 17 2021, 12:34 PM
vsellier created P1122 generate backfill command for a given range (for sha1).
Aug 17 2021, 12:34 PM
vsellier created P1121 bakfill script for sha1 based ranges.
Aug 17 2021, 12:33 PM
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh --zookeeper $ZK  --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 64
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.extid.
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep ReplicationFactor      
Topic: swh.journal.objects.extid	PartitionCount: 64	ReplicationFactor: 1	Configs: cleanup.policy=compact,max.message.bytes=104857600
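
The warning about keyed topics can be illustrated with a small sketch: when a key is mapped to a partition by hashing modulo the partition count, changing the count changes where many keys land. The CRC32 hash, the key format and the partition counts (32 and 64) are assumptions chosen for the demonstration; real Kafka clients use their own partitioners.

```python
import zlib


def partition_for(key: bytes, partition_count: int) -> int:
    """Map a key to a partition with an illustrative hash-modulo scheme."""
    return zlib.crc32(key) % partition_count


# Made-up keys, only used to count how many change partition.
keys = [f"extid-{i}".encode() for i in range(1000)]
moved = sum(partition_for(k, 32) != partition_for(k, 64) for k in keys)
print(f"{moved} of {len(keys)} keys map to a different partition")
```

Messages already written under the old partition count stay where they are, so the same key can end up spread over two partitions.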
Aug 17 2021, 11:01 AM · System administration
vsellier updated the task description for T3485: extid topic is misconfigured in staging and production.
Aug 17 2021, 11:00 AM · System administration
vsellier changed the status of T3485: extid topic is misconfigured in staging and production from Open to Work in Progress.
Aug 17 2021, 10:57 AM · System administration

Aug 16 2021

vsellier committed rDENV4155fc0087bc: cassandra: use the CASSANDRA_SEED env variable for database initialization (authored by vsellier).
cassandra: use the CASSANDRA_SEED env variable for database initialization
Aug 16 2021, 5:01 PM
vsellier closed D6093: storage-cassandra: Remove the default src override.
Aug 16 2021, 4:36 PM
vsellier committed rDENV1b9307ce7751: storage-cassandra: Remove the default src override (authored by vsellier).
storage-cassandra: Remove the default src override
Aug 16 2021, 4:36 PM
vsellier closed D6092: counters: Match the default configuration to the real production url.
Aug 16 2021, 4:35 PM
vsellier committed rDWAPPSddfb988db5ec: counters: Match the default configuration to the real production url (authored by vsellier).
counters: Match the default configuration to the real production url
Aug 16 2021, 4:35 PM
vsellier requested review of D6093: storage-cassandra: Remove the default src override.
Aug 16 2021, 4:27 PM
vsellier added a revision to T3357: Perform some tests of the cassandra storage on Grid5000: D6093: storage-cassandra: Remove the default src override.
Aug 16 2021, 4:27 PM · System administration, Storage manager
vsellier requested review of D6092: counters: Match the default configuration to the real production url.
Aug 16 2021, 3:58 PM
vsellier renamed T3484: Fix the release builds for swh-search from Fix the pypi-upload build for swh-search to Fix the release builds for swh-search.
Aug 16 2021, 2:54 PM · System administration, Archive search
vsellier accepted D6088: Use setup_requires to install tree-sitter.

Thanks

Aug 16 2021, 2:49 PM
vsellier added a comment to D6088: Use setup_requires to install tree-sitter.

We have tested with @vlorentz: it's ok if yarn's build target is updated to not call the build-so and build-wasm targets, and if the tree-sitter module is kept in the docker image.

Aug 16 2021, 2:31 PM
vsellier committed rDSNIP33178f46ac4b: grid5000/cassandra: adapt number of consummers (authored by vsellier).
grid5000/cassandra: adapt number of consummers
Aug 16 2021, 12:27 PM
vsellier committed rDSNIP41e8ee27b337: grid5000/cassandra: increase message size limit to allow revision replaying (authored by vsellier).
grid5000/cassandra: increase message size limit to allow revision replaying
Aug 16 2021, 12:27 PM
vsellier added a comment to T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

Ceph status went back to OK after these actions:

  • Clean up the crash history
    • to check the status:
ceph crash ls
ceph crash info <id>
    • to clean up:
ceph crash archive-all
ceph config set mon mon_warn_on_insecure_global_id_reclaim false
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
Aug 16 2021, 10:20 AM · System administration

Aug 13 2021

vsellier added a comment to D6088: Use setup_requires to install tree-sitter.

Not sure if I'm missing something or if something is missing from this diff (I can't find its parent in my repo), but I have applied it and the build still fails when the yarn command is launched, which sounds logical as the yarn config still launches the tree-sitter command directly.

Aug 13 2021, 5:37 PM
vsellier closed D6085: Install a missing python module for the swh-search build.
Aug 13 2021, 4:30 PM
vsellier committed rCDFJ2fea9ce49664: Install a missing python module for the swh-search build (authored by vsellier).
Install a missing python module for the swh-search build
Aug 13 2021, 4:30 PM
vsellier closed D6086: Document the dependency on the tree-sitter python module.
Aug 13 2021, 4:30 PM
vsellier committed rDSEA84115fa41877: Document the dependency on the tree-sitter python module (authored by vsellier).
Document the dependency on the tree-sitter python module
Aug 13 2021, 4:30 PM
vsellier added inline comments to D6086: Document the dependency on the tree-sitter python module.
Aug 13 2021, 4:24 PM
vsellier updated the diff for D6086: Document the dependency on the tree-sitter python module.

fix version selection

Aug 13 2021, 4:24 PM
vsellier requested review of D6086: Document the dependency on the tree-sitter python module.
Aug 13 2021, 4:22 PM
vsellier added a revision to T3484: Fix the release builds for swh-search: D6086: Document the dependency on the tree-sitter python module.
Aug 13 2021, 4:18 PM · System administration, Archive search
vsellier added a revision to T3484: Fix the release builds for swh-search: D6085: Install a missing python module for the swh-search build.
Aug 13 2021, 4:14 PM · System administration, Archive search
vsellier requested review of D6085: Install a missing python module for the swh-search build.
Aug 13 2021, 4:14 PM
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

Current import status before the run of this week-end:

Aug 13 2021, 3:32 PM · System administration, Storage manager
vsellier committed rDSNIPa98c1f651ecb: grid5000/cassandra: improbe cassandra monitoring (authored by vsellier).
grid5000/cassandra: improbe cassandra monitoring
Aug 13 2021, 12:39 PM
vsellier moved T3484: Fix the release builds for swh-search from Backlog to in-progress on the System administration board.
Aug 13 2021, 10:35 AM · System administration, Archive search
vsellier changed the status of T3484: Fix the release builds for swh-search from Open to Work in Progress.
Aug 13 2021, 10:35 AM · System administration, Archive search
vsellier added a comment to T3373: Metadata search is failing due to a boolean field in the mapping of the metadata fields.

There are no more errors. The fix will be deployed in production with the deployment of swh-search:v0.11.0 (T3433).

Aug 13 2021, 10:28 AM · System administration, Archive search
vsellier renamed T3043: journalbeat:/filebeat Add an environment field on the logs from journalbeat: Add an environment field on the logs to journalbeat:/filebeat Add an environment field on the logs.
Aug 13 2021, 10:09 AM · System administration

Aug 12 2021

vsellier committed rSPSITEb972034f8ceb: prometheus/pve_exporter: Split the metrics_path and the parameters (authored by vsellier).
prometheus/pve_exporter: Split the metrics_path and the parameters
Aug 12 2021, 4:01 PM
vsellier closed D6082: prometheus: Support http parameters in exporter configuration.
Aug 12 2021, 4:01 PM
vsellier committed rSPSITE9fe99c19b47a: prometheus: Support http parameters in exporter configuration (authored by vsellier).
prometheus: Support http parameters in exporter configuration
Aug 12 2021, 4:01 PM
vsellier updated the diff for D6082: prometheus: Support http parameters in exporter configuration.

Remove unused import of Set

Aug 12 2021, 3:55 PM
vsellier updated the summary of D6082: prometheus: Support http parameters in exporter configuration.
Aug 12 2021, 3:51 PM
vsellier updated the summary of D6082: prometheus: Support http parameters in exporter configuration.
Aug 12 2021, 3:50 PM
vsellier requested review of D6082: prometheus: Support http parameters in exporter configuration.
Aug 12 2021, 3:50 PM
vsellier added a revision to T3462: Add proxmox / ceph monitoring: D6082: prometheus: Support http parameters in exporter configuration.
Aug 12 2021, 3:50 PM · System administration
vsellier committed rSPSITE4b7ac2269de4: pve-exporter: fix the prometheus scrapping url (authored by vsellier).
pve-exporter: fix the prometheus scrapping url
Aug 12 2021, 10:19 AM
vsellier accepted D6078: pve-exporter: Install properly configuration and service.

Thanks, looks good. Some minor formatting suggestions inline.

Aug 12 2021, 9:51 AM

Aug 11 2021

vsellier accepted D6077: pve-exporter: Install prometheus-pve-exporter on hypervisor nodes.

LGTM

Aug 11 2021, 6:17 PM
vsellier committed rPPPE570833188f29: Fix buster build (authored by vsellier).
Fix buster build
Aug 11 2021, 5:22 PM
vsellier committed rPPPE892866b083e4: pristine-tar data for prometheus-pve-exporter_2.1.2.orig.tar.gz (authored by vsellier).
pristine-tar data for prometheus-pve-exporter_2.1.2.orig.tar.gz
Aug 11 2021, 5:16 PM
vsellier committed rPPPE381b497c01f5: Configure buster build (authored by vsellier).
Configure buster build
Aug 11 2021, 5:16 PM
vsellier committed rPPPE211ce4bbbd47: Initial packaging for prometheus-pve-exporter (authored by vsellier).
Initial packaging for prometheus-pve-exporter
Aug 11 2021, 5:16 PM
vsellier committed rPPPEe46b8aefe236: New upstream version 2.1.2 (authored by vsellier).
New upstream version 2.1.2
Aug 11 2021, 5:16 PM
vsellier accepted D6076: Declare debian package build for proxmox-pve-exporter.

LGTM

Aug 11 2021, 5:09 PM
vsellier accepted D6075: Activate data scraping on hypervisors.

LGTM

Aug 11 2021, 12:31 PM
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The complete import has been running almost continuously with 5 cassandra nodes since Monday.

Aug 11 2021, 10:21 AM · System administration, Storage manager

Aug 10 2021

vsellier added a comment to T3462: Add proxmox / ceph monitoring.

A prometheus exporter for proxmox is available at https://github.com/prometheus-pve/prometheus-pve-exporter
An interesting read: https://blog.zwindler.fr/2020/01/06/proxmox-ve-prometheus/

Aug 10 2021, 5:26 PM · System administration
vsellier changed the status of T3462: Add proxmox / ceph monitoring, a subtask of T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem, from Open to Work in Progress.
Aug 10 2021, 5:16 PM · System administration
vsellier changed the status of T3462: Add proxmox / ceph monitoring from Open to Work in Progress.
Aug 10 2021, 5:16 PM · System administration
vsellier closed T3474: Disable swap on workers as Resolved.
Aug 10 2021, 5:15 PM · System administration
vsellier closed T3474: Disable swap on workers, a subtask of T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem, as Resolved.
Aug 10 2021, 5:15 PM · System administration
vsellier renamed T3476: One of the system disks of beaubourg is out of order from One of the system disk of beaubourg is out of order to One of the system disks of beaubourg is out of order.
Aug 10 2021, 4:32 PM · System administration
vsellier accepted D6074: loader_git: Decrease concurrency to tentatively decrease oom kill events.

LGTM

Aug 10 2021, 12:45 PM
vsellier triaged T3476: One of the system disks of beaubourg is out of order as High priority.
Aug 10 2021, 12:28 PM · System administration
vsellier added a comment to T3474: Disable swap on workers.

As expected, there is an increase in the number of OOM kills on the workers [1]:

Aug 10 2021, 12:19 PM · System administration
vsellier added a comment to T3457: Some git repositories are failing to be ingested because of MemoryError.

Another example in production: during the stop phase of a worker, the loader was alone on the server (with 12 GB of RAM) and was OOM-killed:

Aug 10 08:53:24 worker05 python3[871]: [2021-08-10 08:53:24,745: INFO/ForkPoolWorker-1] Load origin 'https://github.com/evands/Specs' with type 'git'
Aug 10 08:54:17 worker05 python3[871]: [62B blob data]
Aug 10 08:54:17 worker05 python3[871]: [586B blob data]
Aug 10 08:54:17 worker05 python3[871]: [473B blob data]
Aug 10 08:54:29 worker05 python3[871]: Total 782419 (delta 6), reused 5 (delta 5), pack-reused 782401                                         
Aug 10 08:54:29 worker05 python3[871]: [2021-08-10 08:54:29,044: INFO/ForkPoolWorker-1] Listed 6 refs for repo https://github.com/evands/Specs
Aug 10 08:59:21 worker05 kernel: [    871]  1004   871   247194   161634  1826816    46260             0 python3                              
Aug 10 09:08:29 worker05 systemd[1]: swh-worker@loader_git.service: Unit process 871 (python3) remains running after unit stopped.            
Aug 10 09:15:29 worker05 kernel: [    871]  1004   871   412057   372785  3145728        0             0 python3                              
Aug 10 09:16:57 worker05 kernel: [    871]  1004   871   823648   784496  6443008        0             0 python3                              
Aug 10 09:24:44 worker05 kernel: CPU: 2 PID: 871 Comm: python3 Not tainted 5.10.0-0.bpo.7-amd64 #1 Debian 5.10.40-1~bpo10+1                   
Aug 10 09:24:44 worker05 kernel: [    871]  1004   871  2800000  2760713 22286336        0             0 python3                              
Aug 10 09:24:44 worker05 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-2,oom_memcg=/system.slice/system-swh\x2dworker.slice,task_memcg=/system.slice/system-swh\x2dworker.slice/swh-worker@loader_git.service,task=python3,pid=871,uid=1004           
Aug 10 09:24:44 worker05 kernel: Memory cgroup out of memory: Killed process 871 (python3) total-vm:11200000kB, anon-rss:11038844kB, file-rss:4008kB, shmem-rss:0kB, UID:1004 pgtables:21764kB oom_score_adj:0
Aug 10 09:24:45 worker05 kernel: oom_reaper: reaped process 871 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Aug 10 2021, 11:32 AM · Git loader
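
The kernel task-dump rows in the worker05 log above (the lines starting with `[    871]`) report `total_vm` and `rss` in pages, while the final "Killed process" line reports kB. A minimal sketch for converting the rows, assuming 4 KiB pages (the amd64 default) and the column order `[pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name` used by 5.x kernels; the helper name `parse_task_row` is hypothetical:

```python
import re

PAGE_KB = 4  # assumption: 4 KiB pages, the default on amd64

# Assumed column order of the kernel OOM task dump:
# [pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
TASK_ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<pgtables_bytes>\d+)\s+"
    r"(?P<swapents>\d+)\s+(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)"
)


def parse_task_row(line):
    """Return the memory figures of one task-dump row in kB, or None."""
    m = TASK_ROW.search(line)
    if m is None:
        return None  # not a task-dump row (e.g. a loader log line)
    return {
        "pid": int(m["pid"]),
        "name": m["name"],
        "total_vm_kb": int(m["total_vm"]) * PAGE_KB,
        "rss_kb": int(m["rss"]) * PAGE_KB,
        "swap_kb": int(m["swapents"]) * PAGE_KB,
        "pgtables_kb": int(m["pgtables_bytes"]) // 1024,
    }


# Last task-dump row before the kill, copied from the log above.
last_row = (
    "Aug 10 09:24:44 worker05 kernel: [    871]  1004   871  2800000  "
    "2760713 22286336        0             0 python3"
)
info = parse_task_row(last_row)
print(info["rss_kb"], info["total_vm_kb"], info["pgtables_kb"])
```

Under these assumptions the last row gives an rss of 11042852 kB, which is the sum of anon-rss (11038844 kB) and file-rss (4008 kB) in the "Killed process" line, and a total_vm of 11200000 kB, matching total-vm in that same line.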
vsellier changed the status of T3474: Disable swap on workers from Open to Work in Progress.
Aug 10 2021, 9:54 AM · System administration

Aug 9 2021

vsellier updated the task description for T3461: Prepare a quote for bare metal servers for the firewalls.
Aug 9 2021, 10:45 AM · System administration

Aug 6 2021

vsellier closed T2912: Next generation archive counters as Resolved.

The cleanup of the old counters is done, so this task can be closed.

Aug 6 2021, 6:32 PM · Roadmap 2021, System administration, Monitoring, Web app
vsellier closed T3417: Cleanup the old counters environment as Resolved.
Aug 6 2021, 6:31 PM · System administration, Monitoring
vsellier closed T3417: Cleanup the old counters environment, a subtask of T2912: Next generation archive counters, as Resolved.
Aug 6 2021, 6:31 PM · Roadmap 2021, System administration, Monitoring, Web app