The 2 disks were removed from the server and packaged to be sent to Seagate.
Apr 21 2021
Apr 20 2021
Apr 16 2021
Apr 15 2021
Email sent to the DSI to launch the replacement.
In preparation for the disk replacement, the LEDs of the disks must be activated to make their physical location identifiable:
- Ensure all the LEDs are off
root@storage1:~# ls /dev/sd* | grep -e "[a-z]$" | xargs -n1 -t -i{} ledctl normal={}
ledctl normal=/dev/sda
ledctl normal=/dev/sdb
ledctl normal=/dev/sdc
ledctl normal=/dev/sdd
ledctl normal=/dev/sde
ledctl normal=/dev/sdf
ledctl normal=/dev/sdg
ledctl normal=/dev/sdh
ledctl normal=/dev/sdi
ledctl normal=/dev/sdj
ledctl normal=/dev/sdk
ledctl normal=/dev/sdl
ledctl normal=/dev/sdm
ledctl normal=/dev/sdn
- Turn on the locate LED of the 2 disks to replace
root@storage1:~# ledctl locate=/dev/sdb
root@storage1:~# ledctl locate=/dev/sdc
Apr 12 2021
The disks are removed from the zfs pool. The replacement can now be done.
The mirror is removed from the pool:
root@storage1:~# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data  21.8T  2.50T  19.3T        -         -    20%    11%  1.00x  ONLINE  -
A ticket was opened on the Seagate site for the replacement of these 2 disks; the information will be transferred to the DSI for the packaging (as soon as the disks are removed from the pool).
The mirror-1 removal is in progress:
root@storage1:~# zpool remove data mirror-1
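The progress of the evacuation can be followed with zpool status (a generic check, not part of the original log; the exact wording of the `remove:` section depends on the ZFS version):

```
# the "remove:" section of the output reports the evacuation progress of mirror-1
root@storage1:~# zpool status data
```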
There are 2 disks with errors that should now be replaced:
- /dev/sdb (wwn-0x5000c500a23e3868), an old one
- /dev/sdc (wwn-0x5000c500a22f48c9), the disk just removed from the pool
The failing disk was removed from the pool:
root@storage1:~# zpool detach data wwn-0x5000c500a22f48c9
The new failing drive is /dev/sdc
root@storage1:~# ls -al /dev/disk/by-id/ | grep wwn-0x5000c500a22f48c9
lrwxrwxrwx 1 root root  9 Apr 11 03:42 wwn-0x5000c500a22f48c9 -> ../../sdc
lrwxrwxrwx 1 root root 10 Mar 11 17:08 wwn-0x5000c500a22f48c9-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Mar 11 17:08 wwn-0x5000c500a22f48c9-part9 -> ../../sdc9
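If needed, the error state of the drive can be double-checked with smartctl before pulling it (a generic check, not part of the original log):

```
# inspect SMART health and error counters of the suspect drive
root@storage1:~# smartctl -a /dev/sdc
# or address it by its stable wwn-based name
root@storage1:~# smartctl -a /dev/disk/by-id/wwn-0x5000c500a22f48c9
```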
Mar 22 2021
A new vm counters0.internal.staging.swh.network is deployed and hosting redis, swh-counters and its journal-client.
The lag in staging will be recovered in a couple of hours.
Mar 19 2021
Feb 5 2021
I started to throw some ideas into this document: https://hedgedoc.softwareheritage.org/Fi2pq7zkSw6aVAJwk9Xhqw
Jan 29 2021
awesome, thanks.
- Inventory updated to ensure all the components are associated with the staging environment
- Staging page on the intranet updated [1]
- Staging section on the network page [2] on the intranet updated
Jan 27 2021
This is a first attempt at generating a global diagram of the staging environment (P929):
Jan 25 2021
Jan 20 2021
Jan 18 2021
Jan 6 2021
The last check no longer appears in icinga.
Jan 4 2021
Closing this task as all the direct work is done.
The documentation will be addressed in T2920
Dec 22 2020
Everything looks good, let's try to add some documentation before closing the issue
Dec 21 2020
- A new vm objstorage0.internal.staging.swh.network is configured with a read-only object storage service
- It's exposed to the internet via the reverse proxy at https://objstorage.staging.swh.network (it's quite different from the usual objstorage:5003 URL, but it allows exposing the service without any new network configuration)
- DNS entry added on gandi
- Inventory updated
A user was correctly configured and a read test performed:
The network configuration is done. The server is now accessible from the internet at broker0.journal.staging.swh.network:9093
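A quick way to check that the broker is reachable from outside is to list the cluster metadata, for example with kafkacat (a sketch; the TLS/SASL options depend on the actual broker configuration, and the credentials are placeholders):

```
# fetch the broker/topic metadata through the public endpoint (SASL over TLS assumed)
kafkacat -b broker0.journal.staging.swh.network:9093 \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanisms=SCRAM-SHA-512 \
  -X sasl.username=<user> \
  -X sasl.password=<password> \
  -L
```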
Dec 18 2020
The request to expose the journal to the internet was sent to the DSI this afternoon.
Dec 17 2020
After one week, the disk used by kafka was at around 85% usage:
root@journal0:/tmp# df -h /srv/kafka/logdir
Filesystem    Size  Used Avail Use% Mounted on
kafka-volume  481G  409G   73G  85% /srv/kafka/logdir
Compared to production, compression was not activated on the zfs pool:
root@kafka1:~# zfs get all data/kafka | grep compress
data/kafka  compressratio     1.55x  -
data/kafka  compression       lz4    inherited from data
data/kafka  refcompressratio  1.55x  -

root@journal0:/tmp# zfs get all | grep compress
kafka-volume  compressratio     1.00x  -
kafka-volume  compression       off    default
kafka-volume  refcompressratio  1.00x  -
So the compression was activated:
root@journal0:/tmp# zfs set compression=lz4 kafka-volume
root@journal0:/tmp# zfs get all | grep compress
kafka-volume  compressratio     1.00x  -
kafka-volume  compression       lz4    local
kafka-volume  refcompressratio  1.00x  -
As this parameter only applies to newly written data, we have forced a compaction on the biggest topics: `directory`, `revision` and `content`.
% ./kafka-topics.sh --zookeeper $ZK --alter --topic swh.journal.objects.revision --config min.cleanable.dirty.ratio=0.01
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.revision.
vsellier@journal0 /opt/kafka/bin
% ./kafka-topics.sh --zookeeper $ZK --alter --topic swh.journal.objects_privileged.revision --config min.cleanable.dirty.ratio=0.01
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects_privileged.revision.
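As the warning suggests, the same override can also be set with kafka-configs.sh (shown for reference, not run here):

```
# equivalent, non-deprecated way to override the topic config
% ./kafka-configs.sh --zookeeper $ZK --alter \
    --entity-type topics --entity-name swh.journal.objects.revision \
    --add-config min.cleanable.dirty.ratio=0.01
```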
Dec 14 2020
With the "optimized" configuration, the import is quite faster :
root@search-esnode0:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @/tmp/reindex-production.json
{
  "took" : 10215280,
  "timed_out" : false,
  "total" : 91517657,
  "updated" : 0,
  "created" : 91517657,
  "deleted" : 0,
  "batches" : 91518,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
"took" : 10215280, => 2h45
Dec 11 2020
- diff landed and applied on the server
- VIP 128.93.166.40 configured on the firewall
- NAT Port forward of port 9093 from public ip to internal journal0 declared on the firewall
- DNS declaration of broker0.journal.staging.swh.network in gandi
- Ask the DSI to apply the kafka firewall profile to 128.93.166.40
- Configure a user to test the pipeline (a sketch of the user creation is given below)
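Assuming the brokers use SASL/SCRAM authentication, the test user could be created along these lines (user name, mechanism and password are placeholders, not the actual values used):

```
# register a SCRAM credential for the test user on the cluster
% ./kafka-configs.sh --zookeeper $ZK --alter \
    --entity-type users --entity-name test-user \
    --add-config 'SCRAM-SHA-512=[password=<password>]'
```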
And now spurious logs are gone for the deposit.
Deployed (rp0.staging, webapp0.azure, moma).
I agree about the default site, but we have several legitimate requests from the monitoring that are not correctly routed, so the configuration needs to be adapted.
You could just add a 00-default vhost that shows a generic error message (relying on alphabetical order for vhost configs is not even a hack).
The production origin index was correctly copied from the production cluster, but apparently without the configuration to optimize the copy.
We keep this one and try a new optimized copy to check whether the server still crashes with an OOM with the new CPU and memory settings.
Dec 10 2020
FYI: the origin index was recreated with the "official" mapping and a backfill was performed (necessary after the test of the flattened mapping).
The deployment manifests are OK and deployed in staging, so this task can be resolved.
We will work on reactivating search-journal-client for the metadata in another task when T2876 is resolved
The copy of the production index is restarted.
To improve the speed of the copy, the index was tuned to reduce the disk pressure (it's a temporary configuration and should not be used in normal cases, as it's not safe):
cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
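These settings can then be applied to the target index through the settings API (a sketch assuming the index is named origin, as elsewhere in this task; the same endpoint can be used afterwards to restore the defaults):

```
# relax translog durability and refresh interval on the index for the duration of the copy
curl -XPUT -H "Content-Type: application/json" \
  http://${ES_SERVER}/origin/_settings -d @/tmp/config.json
```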
- Partition and memory extended with terraform.
- Extending the disk needed some manual actions on the console:
The production index import failed because the 90% used-disk-space limit was reached at some point, before falling back to around 60G after a compaction.
The progress was at 80M documents out of 91M.
Dec 9 2020
The search rpc backend and the journal client listening on origin and origin_visit topics are deployed.
The inventory is up to date for both hosts [1][2]
Dec 8 2020
A dashboard to monitor the ES cluster behavior has been created on grafana [1]
It will be improved during the swh-search tests
Dec 7 2020
Interesting note about how to size the shards of an index: https://www.elastic.co/guide/en/elasticsearch/reference/7.x//size-your-shards.html
Dec 4 2020
We added a 100GiB volume to search-esnode0 through terraform (D4663).
This way, we could mount /srv/elasticsearch as a zfs volume.
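A minimal sketch of what the pool/dataset creation could look like, assuming the new 100GiB disk is exposed as /dev/vdb (device, pool and dataset names here are assumptions, not the actual commands run):

```
# dedicated single-disk pool on the new volume
root@search-esnode0:~# zpool create elasticsearch-volume /dev/vdb
# dataset mounted where elasticsearch expects its data directory
root@search-esnode0:~# zfs create -o mountpoint=/srv/elasticsearch elasticsearch-volume/elasticsearch
```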
Dec 3 2020
Dec 2 2020
After T2828, it's clearer what must be deployed to have the counters working on staging:
- the counters can be initialized via the /stat/refresh endpoint of the storage API (note: it will create more counters than production, as directory_entry_* and revision_history are not counted in production)
- Add a script/service to execute `swh_update_counter_bucketed` in an infinite loop (a sketch is given after this list)
- Create the buckets in the object_counts_bucketed table
  - per object type: identifier|bucket_start|bucket_end. value and last_update will be updated by the stored procedures.
- configure prometheus sql exporter for db1.staging [1]
- configure profile_exporter on pergamon
- Update the script to ensure the data are filtered by environment (to avoid staging data being included in production counts [2])
- Configure a new cron
- loading an empty file for historical data
- creating a new export_file
- update webapp to be able to configure the counter origin
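A minimal sketch of the update loop mentioned above, assuming the stored procedure is called through psql and takes no arguments (the pg service name, credentials and sleep interval are placeholders):

```
#!/bin/bash
# run one bucketed-count update at a time, forever
# "swh-storage-staging" is a placeholder pg_service.conf entry pointing at db1.staging
while true; do
    psql "service=swh-storage-staging" -c 'SELECT swh_update_counter_bucketed();'
    sleep 1
done
```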