Oct 12 2021
Oct 11 2021
solved by T3487#71814
Oct 8 2021
Sure, we should have authentication / rate limiting on this.
But if I'm not mistaken, the target is to test the mirroring with ENEA.
If we add authentication, we need to improve objstorage-replayer / objstorage to support it.
rSENV646f62805ef564bceed4d3a4d84d8fb6890f2d19 declares the new certificate for the vagrant tests (wrong task referenced in the commit message)
Oct 6 2021
update commit message
- factorize the exported configuration
- use the right exporter port on met
rebase
I think the issue can be closed.
The pros are:
- it simplifies the cluster management (creation, configuration and, most of all, kubernetes upgrades)
- it centralizes the global view of the cluster and what is running on it
- OSS and transparent policy
The proposed plan looks too naive, as some zookeeper configuration also needs to be updated.
The recommended way is to add a new node to the cluster, migrate the partitions to the new node and shut down the old one (sketched below).
Proceeding this way ensures all the data and configuration get migrated correctly and without downtime, which is not negligible as the ENEA mirror tests are in progress.
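For the record, a minimal sketch of such a partition move with Kafka's stock tooling; the new broker id (4), the topic listed in topics.json and the bootstrap server are assumptions for illustration, not the actual plan:

# list the topics to move (one journal topic here, as an example)
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "swh.journal.objects.origin"}]}
EOF
# generate a reassignment plan targeting the new broker (hypothetical id 4)
/opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 \
    --topics-to-move-json-file topics.json --broker-list "4" --generate
# save the proposed plan as reassignment.json, then apply it and watch it complete
/opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 \
    --reassignment-json-file reassignment.json --execute
/opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 \
    --reassignment-json-file reassignment.json --verify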
The loaders were finally stabilized. The problem was a wrong celery configuration.
Changing the pool from solo to prefork solved it, even though the concurrency is kept at one.
Solo looked appropriate for an environment like the POC but, in hindsight, it was not working as expected.
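The change boils down to the worker invocation below (a minimal sketch; the celery application path is an assumption):

# solo runs the task inline in the main process, which blocks heartbeats and
# remote control for the whole duration of a long-running load:
celery -A swh.scheduler.celery_backend.config worker --pool=solo

# prefork hands the task to a forked child, keeping the main process
# responsive even with a single child:
celery -A swh.scheduler.celery_backend.config worker --pool=prefork --concurrency=1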
Oct 5 2021
Credentials added to the credentials database under the refs:
- operations/kafka/credentials/staging/swh-enea
- operations/kafka/credentials/production/swh-enea
Production credentials created:
+ export zookeeper_servers=kafka1.internal.softwareheritage.org:2181
+ zookeeper_servers=kafka1.internal.softwareheritage.org:2181
+ export bootstrap_servers=kafka1.internal.softwareheritage.org:9092
+ bootstrap_servers=kafka1.internal.softwareheritage.org:9092
+ '[' -z swh-enea -o -z redacted ']'
+ set -eu
+ /opt/kafka/bin/kafka-configs.sh --zookeeper kafka1.internal.softwareheritage.org:2181/kafka/softwareheritage --alter --add-config 'SCRAM-SHA-256=[iterations=8192,password=redacted],SCRAM-SHA-512=[password=redacted]' --entity-type users --entity-name swh-enea
Warning: --zookeeper is deprecated and will be removed in a future version of Kafka. Use --bootstrap-server instead to specify a broker to connect to.
Completed updating config for entity: user-principal 'swh-enea'.
+ /opt/kafka/bin/kafka-acls.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --add --resource-pattern-type PREFIXED --topic swh.journal.objects. --allow-principal User:swh-enea --operation READ
Adding ACLs for resource `ResourcePattern(resourceType=TOPIC, name=swh.journal.objects., patternType=PREFIXED)`: (principal=User:swh-enea, host=*, operation=READ, permissionType=ALLOW)
Credentials created in staging:
ACLs for principal `User:swh-enea`
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=swh.journal.objects., patternType=PREFIXED)`:
  (principal=User:swh-enea, host=*, operation=DESCRIBE, permissionType=ALLOW)
  (principal=User:swh-enea, host=*, operation=READ, permissionType=ALLOW)
export username=swh-enea
export password=XXXXX
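A quick way to check the credentials from the consumer side; a sketch only: the SASL endpoint/port and the exact topic name are assumptions, and the password placeholder is kept as-is:

# client config with the SCRAM credentials above
cat > client.properties <<EOF
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="${username}" password="${password}";
EOF
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka1.internal.softwareheritage.org:9093 \
    --consumer.config client.properties --topic swh.journal.objects.origin --from-beginning --max-messages 1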
Actions to perform for the migration:
- add the role::swh_kafka_broker role to the new server in puppet and deploy
- stop the staging workers
- stop kafka on journal0
- rsync the content from journal0 to the new server (see the sketch after this list)
- update the staging configurations to use the new server as journal
- update the public NAT rule for the staging journal on the firewall (https://192.168.50.1/firewall_nat.php)
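A rough sketch of the stop-and-copy step; the kafka data path and the new host name are assumptions:

# on journal0, once the staging workers are stopped
systemctl stop kafka
# copy the kafka data verbatim, preserving ownership, ACLs and xattrs
rsync -aHAX --delete /srv/kafka/ journal1.internal.staging.swh.network:/srv/kafka/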
LGTM thanks
Oct 4 2021
keda looks promising. P1193 is an example of a configuration working for the docker environment. It's able to scale to 0 when no messages are present on the queue.
When messages are present, loaders are launched progressively until the cpu/memory limit of the host is reached or the maximum number of allowed workers is reached.
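For reference, a minimal sketch of what such a ScaledObject looks like (not P1193 itself; the deployment name, queue name and thresholds are assumptions):

kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: loaders
spec:
  scaleTargetRef:
    name: loaders            # deployment running the celery loaders
  minReplicaCount: 0         # scale to zero when the queue is empty
  maxReplicaCount: 10        # hard cap on the number of workers
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: swh.loader.git.tasks
        mode: QueueLength
        value: "10"          # target messages per replica
        hostFromEnv: RABBITMQ_URL
EOF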
Oct 1 2021
Intermediate status:
- the bench lab is easily deployable on g5k on several workers to distribute the load [1]
- it works well as long as the load stays moderate. When the number of workers is increased, the workers seem to have trouble talking to rabbitmq:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-p9ds5
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-n6pvm
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-mrcjj
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-7bn4s
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-lg2bd
and also an unexplained time drift:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-lxjpl may mean clocks are out of sync. Current drift is
[loaders-77cdd444df-flcv9 loaders] 356 seconds. [orig: 2021-09-30 23:46:55.447181 recv: 2021-09-30 23:40:59.633444]
[loaders-77cdd444df-flcv9 loaders]
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-jd6v9 may mean clocks are out of sync. Current drift is
[loaders-77cdd444df-flcv9 loaders] 355 seconds. [orig: 2021-09-30 23:46:55.447552 recv: 2021-09-30 23:41:00.723983]
[loaders-77cdd444df-flcv9 loaders]
Intermediate status:
- We have successfully run loaders in staging using the helm chart we wrote [1] and a hardcoded number of workers. It also makes things like rolling upgrades possible.
- We have tried the integrated horizontal pod autoscaler [2]; it works pretty well but it's not suited to our worker scenario. It decides whether the number of running pods must be scaled up or down based on the pods' cpu consumption (in our test [3]; other metrics are possible). That is very useful for classical load such as a gunicorn container, but not for long running tasks (see the one-liner after this list).
- Kubernetes also has some functionality to reduce the pressure on a node when some limits are reached, but it looks more like emergency action than proper scaling management. It's configured at the kubelet level and not dynamic at all [4]. We tested it briefly but lost the node to OOM before node eviction started.
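The cpu-based autoscaler from [2] boils down to something like this (deployment name and thresholds are assumptions):

# scale the loaders deployment between 1 and 10 replicas, targeting 75% average cpu
kubectl autoscale deployment loaders --cpu-percent=75 --min=1 --max=10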
Sep 30 2021
cc @seirl
Sep 29 2021
After a hard time, we have solved several issues:
- The rancher initialization problem came from using a version of k3s incompatible with rancher.
We had installed rancher 2.5.9 on a recent version of k3s shipping kubernetes 1.22.2. Following rancher's compatibility matrix [1], switching to an older version of k3s solved the problem and the clusters start correctly after that.
Sep 28 2021
The zfs pool and dataset are configured:
- pool configuration
## nvme drives pool
# zpool create data mirror nvme-eui.36315030525005540025384500000003 nvme-eui.36315030525005800025384500000003 mirror nvme-eui.36315030525005620025384500000003 nvme-eui.36315030525005890025384500000003
I forgot to mention there is a gift from dell on the server: an additional 600GB 10krpm disk
The server is installed. A few tasks remain to be performed manually:
- configure the zfs datasets (I will configure 2 mirror vdevs for ~12TB available; tell me if that's not what's expected)
- build a few missing packages for bullseye (related to monitoring: prometheus-rabbitmq-exporter, prometheus-statsd-exporter, journalbeat)
- configure a rabbitmq admin user (see the sketch after this list)
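For the dataset and admin user bits, something along these lines; the dataset names, properties and the admin user name are assumptions, and the real values belong in puppet and the credentials store:

# datasets on the pool created earlier
zfs create -o compression=lz4 -o atime=off data/postgresql
zfs create -o compression=lz4 data/rabbitmq
# rabbitmq admin user
rabbitmqctl add_user admin "${ADMIN_PASSWORD}"
rabbitmqctl set_user_tags admin administrator
rabbitmqctl set_permissions -p / admin '.*' '.*' '.*'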
Sep 27 2021
Yes, pgbouncer will be used, and it's configured by default for 2000 parallel connections.
I don't know what kind of load the provenance client will generate, but the default 100 connections allowed by postgres will probably be too low and will need to be increased too.
As discussed with @aeviso, we will install the following components on the server (the OS will be debian 11):
- rabbitmq
- postgresql:13
- a default swh-storage database will be managed by puppet
- 1000 parallel connections allowed
- shared_buffers 50GB (see the snippet after this list for both postgres settings)
- docker
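A sketch of the two postgres settings above, assuming they are applied via ALTER SYSTEM rather than through the puppet manifests:

psql -U postgres -c "ALTER SYSTEM SET max_connections = 1000;"
psql -U postgres -c "ALTER SYSTEM SET shared_buffers = '50GB';"
# both settings only take effect after a restart
systemctl restart postgresql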
Sep 24 2021
The puppet code looks ok to me
Sep 23 2021
Interesting documentation on how to manage jobs:
- the different job patterns: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns
- Using controllers to manage the workload: https://kubernetes.io/docs/concepts/architecture/controller/
Sep 22 2021
- Remove useless trailing '/'
- define what VPN / private mean
Sep 21 2021
RabbitMq GUI -> RabbitMQ