
Sep 22 2021

vlorentz triaged T3594: Faithfully store weird git objects as Normal priority.
Sep 22 2021, 1:31 PM · meta-task, Data Model, Storage manager

Sep 20 2021

vlorentz added a revision to T3135: Improve integrity of ingested content: D6281: converters: Recompute hashes and check they match the originals.
Sep 20 2021, 11:05 AM · Storage manager, Roadmap 2021, meta-task

Sep 17 2021

vlorentz updated the task description for T3586: Figure out what to do with 'misordered' directories in Cassandra.
Sep 17 2021, 11:38 AM · Data Model, Storage manager
vlorentz removed a project from T3586: Figure out what to do with 'misordered' directories in Cassandra: meta-task.
Sep 17 2021, 11:37 AM · Data Model, Storage manager
vlorentz placed T3586: Figure out what to do with 'misordered' directories in Cassandra up for grabs.
Sep 17 2021, 11:37 AM · Data Model, Storage manager
vlorentz triaged T3586: Figure out what to do with 'misordered' directories in Cassandra as Normal priority.
Sep 17 2021, 11:37 AM · Data Model, Storage manager
vlorentz added a subtask for T3585: Fix inconsistencies of the Cassandra backend with postgres: T3582: cassandra: Use 'git ordering' for directory entries.
Sep 17 2021, 11:35 AM · meta-task, Storage manager
vlorentz added a parent task for T3582: cassandra: Use 'git ordering' for directory entries: T3585: Fix inconsistencies of the Cassandra backend with postgres.
Sep 17 2021, 11:35 AM · Storage manager
vlorentz triaged T3585: Fix inconsistencies of the Cassandra backend with postgres as Normal priority.
Sep 17 2021, 11:35 AM · meta-task, Storage manager

Sep 16 2021

vlorentz triaged T3582: cassandra: Use 'git ordering' for directory entries as Normal priority.
Sep 16 2021, 5:59 PM · Storage manager
anlambert added a revision to T3413: Fix the inconsistency between snapshot_get_branches function in postgresql.storage and cassandra.storage: D6283: postgresql: Fix get_snapshot_branches return value for empty search.
Sep 16 2021, 2:15 PM · Storage manager
vsellier closed T3493: [cassandra] Git loader performance are very bad as Resolved.

Changing the status to Resolved as the main issues are solved.
Other tests with more parallel workers will be launched; if other problems are detected, they will be tracked in new dedicated tickets.

Sep 16 2021, 11:31 AM · System administration, Storage manager
vsellier closed T3493: [cassandra] Git loader performance are very bad, a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, as Resolved.
Sep 16 2021, 11:31 AM · System administration, Storage manager

Sep 15 2021

vlorentz updated the task description for T3010: Enable the validating storage proxy in production.
Sep 15 2021, 6:30 PM · Storage manager, System administration
vsellier updated the task description for T3577: Parallel loaders performances .
Sep 15 2021, 5:49 PM · System administration, Storage manager
vsellier changed the status of T3577: Parallel loaders performances from Open to Work in Progress.
Sep 15 2021, 5:49 PM · System administration, Storage manager
vsellier updated the task description for T3357: Perform some tests of the cassandra storage on Grid5000.
Sep 15 2021, 5:42 PM · System administration, Storage manager
vsellier added a comment to T3573: [cassandra] directory and content read benchmarks.

Test of the new D6269 patch:

Sep 15 2021, 5:12 PM · System administration, Storage manager
vsellier added a comment to T3573: [cassandra] directory and content read benchmarks.

2 flame graphs of the previous directory_ls:

  • one-by-one

first run (cache cold):

Sep 15 2021, 2:55 PM · System administration, Storage manager
vsellier added a comment to T3573: [cassandra] directory and content read benchmarks.

These are the results of the different runs:

Sep 15 2021, 2:40 PM · System administration, Storage manager
vsellier added a comment to T3573: [cassandra] directory and content read benchmarks.

The directory_ls performance, and indirectly the get_content performance, was tested with this small script: P1163
A cold restart (all buffers cleared, cassandra restarted) is done between each test (P1164)

Sep 15 2021, 11:20 AM · System administration, Storage manager
vsellier renamed T3573: [cassandra] directory and content read benchmarks from [cassandra] directory and content read benchmarkss to [cassandra] directory and content read benchmarks.
Sep 15 2021, 11:19 AM · System administration, Storage manager
vsellier changed the status of T3573: [cassandra] directory and content read benchmarks from Open to Work in Progress.
Sep 15 2021, 11:11 AM · System administration, Storage manager

Sep 13 2021

vsellier closed T3465: Test multidatacenter replication, a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, as Resolved.
Sep 13 2021, 1:48 PM · System administration, Storage manager
vsellier closed T3465: Test multidatacenter replication as Resolved.
Sep 13 2021, 1:48 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

The new datacenter has been active for a couple of weeks.
It allowed us to test:

  • how to declare a new DC and bootstrap it
  • how the data is replicated between the DCs
  • how to perform inter/intra DC repairs
  • how to add nodes to a DC and bootstrap them
  • how to remove a datacenter
Sep 13 2021, 1:48 PM · System administration, Storage manager
vsellier closed T3464: Prepare a quote for the cassandra servers, a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, as Resolved.
Sep 13 2021, 1:24 PM · System administration, Storage manager
vsellier closed T3464: Prepare a quote for the cassandra servers as Resolved.

The quote is done and validated; we will place the order when the new matinfo deal is available.

Sep 13 2021, 1:24 PM · System administration, Storage manager

Sep 10 2021

vlorentz added a parent task for T3552: Fix corrupted releases, revisions, and directories in the storage: T887: Vault: "snapshot" cooker.
Sep 10 2021, 11:24 AM · Storage manager

Sep 8 2021

vlorentz closed T2590: Finish the indexer -> swh-search pipeline, a subtask of T1117: Origin search is *slow* when you look for very common words, as Resolved.
Sep 8 2021, 3:35 PM · Web app, Storage manager
vlorentz closed T2590: Finish the indexer -> swh-search pipeline, a subtask of T2182: Switch production swh-web to use swh-search instead of postgresql search., as Resolved.
Sep 8 2021, 3:35 PM · System administration, Archive search, Storage manager

Sep 3 2021

vlorentz added a subtask for T3552: Fix corrupted releases, revisions, and directories in the storage: T75: Check integrity of directories, revisions, and releases.
Sep 3 2021, 6:28 PM · Storage manager
vlorentz placed T3552: Fix corrupted releases, revisions, and directories in the storage up for grabs.
Sep 3 2021, 6:27 PM · Storage manager
vlorentz triaged T3552: Fix corrupted releases, revisions, and directories in the storage as Normal priority.
Sep 3 2021, 6:27 PM · Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

With the new concurrent replay of the directory, the disk usage grows rapidly:

Sep 3 2021, 5:15 PM · System administration, Storage manager
vlorentz closed T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting as Resolved.
Sep 3 2021, 1:45 PM · Storage manager
vlorentz added a revision to T3018: Allow querying raw_extrinsic_metadata by hash in swh-storage: D5865: Add endpoints to access REMD by id.
Sep 3 2021, 11:38 AM · Storage manager, Extrinsic metadata
vlorentz closed T3018: Allow querying raw_extrinsic_metadata by hash in swh-storage, a subtask of T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects, as Resolved.
Sep 3 2021, 11:38 AM · Data Model, Storage manager, Extrinsic metadata
vlorentz closed T3018: Allow querying raw_extrinsic_metadata by hash in swh-storage as Resolved.
Sep 3 2021, 11:38 AM · Storage manager, Extrinsic metadata

Aug 31 2021

vsellier closed T3539: snapshot/metadata inversion in origin_visit_status_get_random as Resolved.
Aug 31 2021, 9:19 AM · Storage manager

Aug 30 2021

vsellier closed T3517: [cassandra] decorate the method calls to have statsd metrics , a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, as Resolved.
Aug 30 2021, 6:11 PM · System administration, Storage manager
vsellier closed T3517: [cassandra] decorate the method calls to have statsd metrics as Resolved.
Aug 30 2021, 6:11 PM · System administration, Storage manager
vsellier added a revision to T3517: [cassandra] decorate the method calls to have statsd metrics : D6162: cassandra: generate statsd metrics on method calls.
Aug 30 2021, 5:28 PM · System administration, Storage manager
vsellier added a revision to T3539: snapshot/metadata inversion in origin_visit_status_get_random: D6161: postgresql: Fix a column order mismatch between the query and object builder.
Aug 30 2021, 5:06 PM · Storage manager
vsellier changed the status of T3539: snapshot/metadata inversion in origin_visit_status_get_random from Open to Work in Progress.
Aug 30 2021, 5:01 PM · Storage manager

Aug 27 2021

vsellier added a comment to T3465: Test multidatacenter replication.

New cluster state after all the reservations are up:

vsellier@gros-50:~$  nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.97.3   1.4 TiB     256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6   1.4 TiB     256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5   1.39 TiB    256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4   1.4 TiB     256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2   1.4 TiB     256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1
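
A state like the one above can also be checked by a script rather than by eye. The following is a minimal Python sketch, assuming the `nodetool status` text layout shown in this comment; `parse_nodetool_status` and `all_up_normal` are hypothetical helper names, not part of any Cassandra tooling.

```python
import re
from typing import Dict

# Matches node rows such as:
#   "UN  172.16.97.3   1.4 TiB   256   60.1%   a3ae5fa2-...   rack1"
# The two-letter code is the status (U/D) followed by the state (N/L/J/M).
NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\d{1,3}(?:\.\d{1,3}){3})\s")


def parse_nodetool_status(output: str) -> Dict[str, str]:
    """Map each node address to its two-letter status/state code."""
    nodes = {}
    for line in output.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            code, address = match.groups()
            nodes[address] = code
    return nodes


def all_up_normal(output: str) -> bool:
    """True when at least one node is listed and every node is UN."""
    nodes = parse_nodetool_status(output)
    return bool(nodes) and all(code == "UN" for code in nodes.values())


# Two rows copied from the output above, header lines included.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.97.3   1.4 TiB     256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6   1.4 TiB     256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
"""

print(all_up_normal(SAMPLE))
```

The same check returns False for rows reported as DN or UJ, which makes it usable as a gate before starting the next node.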
Aug 27 2021, 7:35 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.
  • cassandra stopped
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl stop cassandra
  • data cleaned
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} "rm -rf /srv/cassandra/*"
  • Cassandra restarted
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl start cassandra
Aug 27 2021, 6:43 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

Well, on reflection, it will probably be faster to recreate the second DC from scratch now that the configuration is ready.

Aug 27 2021, 6:35 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

5 nodes were added to the cluster:

  • configuration pushed on g5k, disk reserved for 14 days on the new servers, a new reservation was launched with the new nodes
  • the nodes were started one by one, each after the previous one reached the UN status in the nodetool status output
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
DN  172.16.97.3   ?           256     0.0%              a3ae5fa2-c063-4890-87f1-bddfcf293bde  r1
DN  172.16.97.6   ?           256     0.0%              bfe360f1-8fd2-4f4b-a070-8f267eda1e12  r1
DN  172.16.97.5   ?           256     0.0%              478c36f8-5220-4db7-b5c2-f3876c0c264a  r1
DN  172.16.97.4   ?           256     0.0%              b3105348-66b0-4f82-a5bf-31ef28097a41  r1
DN  172.16.97.2   ?           256     0.0%              de866efd-064c-4e27-965c-f5112393dc8f  r1
Aug 27 2021, 6:30 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

10 nodes are not enough; I am adding 5 additional nodes to reduce the volume per node a little.

Aug 27 2021, 5:24 PM · System administration, Storage manager
vsellier changed the status of T3517: [cassandra] decorate the method calls to have statsd metrics from Open to Work in Progress.
Aug 27 2021, 4:48 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The lz4 compression was already enabled by default. Changing the algorithm to zstd on the snapshot table brought no significant gain (initially with lz4: 7 GB, zstd: 12 GB, back to lz4: 9 GB :) )

Aug 27 2021, 12:10 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

interesting:

Depending on the data characteristics of the table, compressing its data can result in:

25-33% reduction in data size
25-35% performance improvement on reads
5-10% performance improvement on writes
Aug 27 2021, 10:15 AM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The replaying is currently stopped as the data disks are now almost full.
I will try to activate the compression on some big tables to see if it can help.
I will probably need to start on small tables to recover some space before being able to compress the biggest tables

Aug 27 2021, 10:02 AM · System administration, Storage manager

Aug 26 2021

vsellier added a comment to T3465: Test multidatacenter replication.

These are the steps performed to initialize the new cluster [1]:

  • add a cassandra-rackdc.properties file on each server with the corresponding DC
gros-50:~$ cat /etc/cassandra/cassandra-rackdc.properties 
dc=datacenter2
rack=rack1
  • change the value of the endpoint_snitch property from SimpleSnitch to GossipingPropertyFileSnitch [2].

The recommended value for production is GossipingPropertyFileSnitch, so it should have been set this way from the beginning

  • configure the disk_optimization_strategy to ssd on the new datacenter
  • update the seed_provider to have one node on each datacenter
  • restart the datacenter1 nodes to apply the new configuration
  • start the datacenter2 nodes one by one, waiting until the status of the node is UN (Up and Normal) before starting another one (they can stay in the UJ (joining) state for a couple of minutes)
  • when done, update the swh keyspace to declare the replication strategy of the second DC
ALTER KEYSPACE swh WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3, 'datacenter2': 3};

The replication of the new changes starts here but the full table contents need to be copied

  • rebuild the cluster content:
vsellier@fnancy:~/cassandra$ seq 0 9 | parallel -t ssh gros-5{} nodetool rebuild -ks swh -- datacenter1

The progress can be monitored with the nodetool netstats command:

gros-50:~$ nodetool netstats                                                                 
Mode: NORMAL                                                                                           
Rebuild e5e64920-0644-11ec-92a6-31a241f39914                                                            
    /172.16.97.4                                                                                                                                      
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57339885570 bytes total (38.76%)
            swh/release-4 1082347/1082347 bytes (100%) received from idx:0/172.16.97.4                                                                           
            swh/content_by_blake2s256-2 3729362955/3729362955 bytes (100%) received from idx:0/172.16.97.4
            swh/release-3 224510803/224510803 bytes (100%) received from idx:0/172.16.97.4                
            swh/content_by_blake2s256-1 240283216/240283216 bytes (100%) received from idx:0/172.16.97.4
            swh/content_by_blake2s256-4 29491504/29491504 bytes (100%) received from idx:0/172.16.97.4
            swh/release-2 6409474/6409474 bytes (100%) received from idx:0/172.16.97.4                
...
Read Repair Statistics:                                                                                     
Attempted: 0                                                                                          
Mismatch (Blocking): 0                                                                                
Mismatch (Background): 0                                                                            
Pool Name                    Active   Pending      Completed   Dropped                                
Large messages                  n/a         0             23         0                                
Small messages                  n/a         3      132753939         0                          
Gossip messages                 n/a         0          43915         0

or to filter only running transfers:

gros-50:~$ nodetool netstats  | grep -v 100%
Mode: NORMAL
Rebuild e5e64920-0644-11ec-92a6-31a241f39914
    /172.16.97.4
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57557961160 bytes total (38.91%)
            swh/directory_entry-7 4819168032/4925484261 bytes (97%) received from idx:0/172.16.97.4
    /172.16.97.2
        Receiving 202 files, 111435975646 bytes total. Already received 139 files (68.81%), 60583670773 bytes total (54.37%)
            swh/directory_entry-12 1631210003/2906113367 bytes (56%) received from idx:0/172.16.97.2
    /172.16.97.6
        Receiving 236 files, 186694443984 bytes total. Already received 142 files (60.17%), 58869656747 bytes total (31.53%)
            swh/snapshot_branch-10 4449235102/7845572885 bytes (56%) received from idx:0/172.16.97.6
    /172.16.97.5
        Receiving 221 files, 143384473640 bytes total. Already received 132 files (59.73%), 58300913015 bytes total (40.66%)
            swh/directory_entry-4 982247023/3492851311 bytes (28%) received from idx:0/172.16.97.5
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0             23         0
Small messages                  n/a         2      135087921         0
Gossip messages                 n/a         0          44176         0
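
The per-peer summary lines above can be added up to get a single overall progress figure for the rebuild. The following is a minimal Python sketch, assuming the `Receiving ... Already received ...` line format shown in this output; `rebuild_progress` is a hypothetical helper name, and the sample reuses the four summary lines from the filtered output.

```python
import re
from typing import Tuple

# Per-peer summary line printed by `nodetool netstats` during a rebuild.
SUMMARY = re.compile(
    r"Receiving (\d+) files, (\d+) bytes total\. "
    r"Already received (\d+) files \([\d.]+%\), (\d+) bytes total"
)


def rebuild_progress(netstats_output: str) -> Tuple[int, int]:
    """Return (bytes_received, bytes_expected) summed over all peers."""
    received = expected = 0
    for match in SUMMARY.finditer(netstats_output):
        expected += int(match.group(2))
        received += int(match.group(4))
    return received, expected


SAMPLE = """\
    /172.16.97.4
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57557961160 bytes total (38.91%)
    /172.16.97.2
        Receiving 202 files, 111435975646 bytes total. Already received 139 files (68.81%), 60583670773 bytes total (54.37%)
    /172.16.97.6
        Receiving 236 files, 186694443984 bytes total. Already received 142 files (60.17%), 58869656747 bytes total (31.53%)
    /172.16.97.5
        Receiving 221 files, 143384473640 bytes total. Already received 132 files (59.73%), 58300913015 bytes total (40.66%)
"""

received, expected = rebuild_progress(SAMPLE)
print(f"{received / expected:.1%} of {expected / 1e9:.0f} GB received")
```

Only this node's incoming streams are counted, so the figure describes one node of the new datacenter, not the whole rebuild.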
Aug 26 2021, 12:41 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

The second cassandra cluster is finally up and synchronizing with the first one. The rebuild should be done by the end of the day or tomorrow.

Aug 26 2021, 12:05 PM · System administration, Storage manager
vlorentz added a comment to T3493: [cassandra] Git loader performance are very bad.

D6139 should address the bottleneck in the flame graph

Aug 26 2021, 11:20 AM · System administration, Storage manager

Aug 24 2021

vsellier added a comment to T3493: [cassandra] Git loader performance are very bad.

Some live data from a git loader with a batch size of 1000 for each object type (with D6118 applied):

"object type";"input count";"missing_id duration (s)";"_missing_id count";"_add duration (s)"
content;1000;0.4928;999;35.3384
content;1000;0.4095;1000;34.1440
content;1000;0.4374;998;35.6249
content;492;0.2960;488;16.7028
directory;1000;0.3978;999;71.2518
directory;1000;0.4484;1000;39.6845
directory;1000;0.4356;1000;54.0077
directory;1000;0.3833;1000;36.1437
directory;1000;0.4319;1000;30.5690
directory;402;0.1718;402;19.2335
revision;1000;0.8671;1000;10.3417
revision;575;0.4639;575;4.0819
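
To compare object types, the rows above can be reduced to a write rate per type. The following is a minimal Python sketch, assuming that the objects reported as missing are the ones written during the `_add` call (the source does not state this explicitly); `add_throughput` is a hypothetical helper name.

```python
import csv
from collections import defaultdict
from io import StringIO
from typing import Dict

# Rows copied from the measurements above. Columns:
# object type; input count; missing_id duration (s); _missing_id count; _add duration (s)
ROWS = """\
content;1000;0.4928;999;35.3384
content;1000;0.4095;1000;34.1440
content;1000;0.4374;998;35.6249
content;492;0.2960;488;16.7028
directory;1000;0.3978;999;71.2518
directory;1000;0.4484;1000;39.6845
directory;1000;0.4356;1000;54.0077
directory;1000;0.3833;1000;36.1437
directory;1000;0.4319;1000;30.5690
directory;402;0.1718;402;19.2335
revision;1000;0.8671;1000;10.3417
revision;575;0.4639;575;4.0819
"""


def add_throughput(rows: str) -> Dict[str, float]:
    """Objects written per second of `_add` time, per object type."""
    added = defaultdict(int)
    seconds = defaultdict(float)
    reader = csv.reader(StringIO(rows), delimiter=";")
    for obj_type, _inputs, _missing_s, missing, add_s in reader:
        added[obj_type] += int(missing)    # assumed to be the objects written
        seconds[obj_type] += float(add_s)
    return {obj_type: added[obj_type] / seconds[obj_type] for obj_type in added}


for obj_type, rate in add_throughput(ROWS).items():
    print(f"{obj_type}: {rate:.1f} objects/s")
```

With these figures, the lookup of missing ids takes well under a second per batch, so the `_add` calls account for nearly all of the time.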
Aug 24 2021, 3:18 PM · System administration, Storage manager
vsellier added a revision to T3493: [cassandra] Git loader performance are very bad: D6118: cassandra: Make content_missing query in batches.
Aug 24 2021, 3:06 PM · System administration, Storage manager
vsellier renamed T3493: [cassandra] Git loader performance are very bad from Git loader performance are very bad to [cassandra] Git loader performance are very bad.
Aug 24 2021, 12:07 PM · System administration, Storage manager

Aug 23 2021

vsellier added a comment to T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting.

It seems the problem is no longer present now (tested with several origins)

root@parasilo-19:~/swh-environment/docker# docker exec -ti docker_swh-loader_1 bash
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
Aug 23 2021, 2:50 PM · Storage manager
vsellier added a comment to T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting.

The origin_visit topic was replayed with your diff during the weekend. Let's now test whether the worker behavior is more deterministic

Aug 23 2021, 11:42 AM · Storage manager
vlorentz added a revision to T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting: D6120: cassandra: Bump next_visit_id when origin_visit_add is called by a replayer.
Aug 23 2021, 10:59 AM · Storage manager

Aug 19 2021

vlorentz added a comment to T3465: Test multidatacenter replication.

Starting with 10 nodes will leave some spare space.

Aug 19 2021, 7:53 PM · System administration, Storage manager
vlorentz added a comment to T3493: [cassandra] Git loader performance are very bad.

Can you try with this patch? P1118

Aug 19 2021, 7:48 PM · System administration, Storage manager
vsellier changed the status of T3465: Test multidatacenter replication, a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, from Open to Work in Progress.
Aug 19 2021, 7:19 PM · System administration, Storage manager
vsellier changed the status of T3465: Test multidatacenter replication from Open to Work in Progress.
Aug 19 2021, 7:19 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

The gros cluster at Nancy[1] has a lot of nodes (124), each with a small reservable SSD of 960 GB. It is a good candidate for creating the second cluster. It will also allow checking the performance with data (and commit logs) on SSDs.
Based on the main cluster, a minimum of 8 nodes is necessary to handle the volume of data (7.3 TB and growing). Starting with 10 nodes will leave some spare space.
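
The node count follows from simple arithmetic on the figures quoted above. The following is a minimal Python sketch, assuming decimal units, an even spread of data across nodes, and that the quoted volume is what the new datacenter has to hold; `min_nodes` and `fill_ratio` are hypothetical helper names.

```python
import math

DATA_TB = 7.3   # data volume on the main cluster, from the comment above
SSD_TB = 0.96   # one reservable SSD per node, decimal units assumed


def min_nodes(data_tb: float, disk_tb: float) -> int:
    """Smallest node count whose combined disks can hold the data."""
    return math.ceil(data_tb / disk_tb)


def fill_ratio(data_tb: float, disk_tb: float, nodes: int) -> float:
    """Fraction of the total disk space used, assuming an even spread."""
    return data_tb / (disk_tb * nodes)


print(min_nodes(DATA_TB, SSD_TB))                # 8
print(f"{fill_ratio(DATA_TB, SSD_TB, 10):.0%}")  # 76%
```

The margin matters because Cassandra compaction temporarily needs extra disk space on top of the stored data.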

Aug 19 2021, 7:11 PM · System administration, Storage manager
vsellier added a comment to T3493: [cassandra] Git loader performance are very bad.

It seems that more precise information can be logged by enabling full query logging, without a big performance impact: https://cassandra.apache.org/doc/latest/cassandra/new/fqllogging.html

Aug 19 2021, 6:52 PM · System administration, Storage manager
vlorentz merged task T3491: Origin visit ids restart from 1 even if there is previous visits into T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting.
Aug 19 2021, 4:50 PM · System administration, Storage manager
vlorentz merged T3491: Origin visit ids restart from 1 even if there is previous visits into T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting.
Aug 19 2021, 4:50 PM · Storage manager
vlorentz added a comment to T3491: Origin visit ids restart from 1 even if there is previous visits.

you mean T3492

Aug 19 2021, 4:49 PM · System administration, Storage manager
vsellier added a comment to T3491: Origin visit ids restart from 1 even if there is previous visits.

Should be fixed by T3482

Aug 19 2021, 4:34 PM · System administration, Storage manager
vsellier triaged T3493: [cassandra] Git loader performance are very bad as Normal priority.
Aug 19 2021, 4:32 PM · System administration, Storage manager
vlorentz triaged T3492: cassandra: origin_visit_add should increase next_visit_id even when upserting as Normal priority.
Aug 19 2021, 4:31 PM · Storage manager
vsellier triaged T3491: Origin visit ids restart from 1 even if there is previous visits as Normal priority.
Aug 19 2021, 4:20 PM · System administration, Storage manager

Aug 17 2021

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

current status:

Aug 17 2021, 5:41 PM · System administration, Storage manager

Aug 16 2021

vsellier added a revision to T3357: Perform some tests of the cassandra storage on Grid5000: D6093: storage-cassandra: Remove the default src override.
Aug 16 2021, 4:27 PM · System administration, Storage manager

Aug 13 2021

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

Current import status before the run of this week-end:

Aug 13 2021, 3:32 PM · System administration, Storage manager

Aug 11 2021

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The complete import has been running almost continuously with 5 cassandra nodes since Monday.

Aug 11 2021, 10:21 AM · System administration, Storage manager

Aug 6 2021

vlorentz added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

Do you think a run with cassandra is necessary to evaluate a potential performance impact?

Aug 6 2021, 6:52 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

It seems D6067 solves the issue with the partition key cartesian product size. @vlorentz Do you think a run with cassandra is necessary to evaluate a potential performance impact?
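
One common way to keep the number of partition keys per query bounded is to split the id list into fixed-size batches and issue one query per batch. The Python sketch below illustrates that general technique only; it is not the content of D6067, and the `batched` helper and its default size of 100 are assumptions (the figure echoes the D6067 title).

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(ids: Sequence[T], size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Hypothetical usage: 250 ids become three queries of 100, 100 and 50 ids.
ids = [f"id-{n}" for n in range(250)]
print([len(batch) for batch in batched(ids)])  # [100, 100, 50]
```

Batching trades one large query for several smaller ones, so a run against the real cluster is still the way to measure the performance impact.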

Aug 6 2021, 5:33 PM · System administration, Storage manager
vlorentz added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

as scylla is coming with its own prometheus node exporter (and is removing the default packages :()

Aug 6 2021, 3:48 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The db server prometheus configuration needs some adaptation as scylla is coming with its own prometheus node exporter (and is removing the default packages :()

root@parasilo-2:/opt# apt install scylla-node-exporter
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  prometheus-node-exporter
The following NEW packages will be installed:
  scylla-node-exporter
0 upgraded, 1 newly installed, 1 to remove and 7 not upgraded.
Need to get 0 B/4,076 kB of archives.
After this operation, 3,243 kB of additional disk space will be used.
Aug 6 2021, 3:17 PM · System administration, Storage manager
vsellier updated subscribers of T3357: Perform some tests of the cassandra storage on Grid5000.

Thanks @vlorentz for D6067, I will test the fix when the cluster is more stable

Aug 6 2021, 3:06 PM · System administration, Storage manager
vlorentz added a revision to T3357: Perform some tests of the cassandra storage on Grid5000: D6067: cassandra: Fix crash when using _missing() functions with more than 100 ids with ScyllaDB..
Aug 6 2021, 3:00 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

There are also a lot of errors in the scylla logs related to read timeouts (with no activity on the database except the monitoring):

Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 5] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
Aug 6 2021, 2:53 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

After some difficulty configuring and correctly starting the scylla servers (different binding, configuration adaptation), the schema was correctly created (I needed to add SWH_USE_SCYLLADB=1 to the initialisation script).
Unlike with cassandra, the nodetool command does not seem to report the data distribution on the cluster correctly, because the system keyspaces do not have the same replication factor as the swh one

vsellier@parasilo-2:~$  nodetool status
Using /etc/scylla/scylla.yaml as the config file
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns    Host ID                               Rack
UN  172.16.97.2  2.36 MB    256          ?       866bbcc4-d496-4ebb-ab3b-12ef4942beaa  rack1
UN  172.16.97.3  3.37 MB    256          ?       21fdd0a9-15cd-473f-814c-c8ac24870aca  rack1
UN  172.16.97.4  3.48 MB    256          ?       1ed61715-01a0-4c15-a4bc-f9972f575437  rack1
Aug 6 2021, 1:09 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

scylladb test

Aug 6 2021, 11:56 AM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

run7 results - cassandra heap from 16g to 32g

Aug 6 2021, 10:49 AM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

run6 results - commitlog on a HDD

Aug 6 2021, 10:28 AM · System administration, Storage manager

Aug 5 2021

vsellier triaged T3465: Test multidatacenter replication as Normal priority.
Aug 5 2021, 12:31 PM · System administration, Storage manager
vsellier triaged T3464: Prepare a quote for the cassandra servers as Normal priority.
Aug 5 2021, 12:20 PM · System administration, Storage manager
vsellier updated the task description for T3357: Perform some tests of the cassandra storage on Grid5000.
Aug 5 2021, 12:18 PM · System administration, Storage manager
vsellier updated the task description for T3357: Perform some tests of the cassandra storage on Grid5000.
Aug 5 2021, 12:17 PM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

Some news about the tests running since the beginning of the week:

  • The data retention of the federated prometheus was left at its default value, so all the data expired after 15 days. A new reference run was performed so that results can be compared with the default scenario
  • The first attempt failed because it was the first time the zfs configuration was adapted, and the change was not correctly deployed by the ansible scripts. It was solved by completely cleaning up the zfs configuration and relaunching the deployment. Unfortunately, this cleanup needs to be launched manually before each test with zfs changes.
  • By using best-effort jobs, it is possible to run tests during the day without exceeding the quota
Aug 5 2021, 12:16 PM · System administration, Storage manager

Aug 2 2021

vlorentz moved T3450: 404 error when visiting a successfully archived repository from code-review/await-feedback/pause to done on the System administration board.
Aug 2 2021, 10:38 AM · Storage manager, System administration
vlorentz closed T3450: 404 error when visiting a successfully archived repository as Resolved.

This should be resolved now (actually, on the 31st at 16).

Aug 2 2021, 10:37 AM · Storage manager, System administration

Jul 30 2021

ardumont moved T3450: 404 error when visiting a successfully archived repository from in-progress to code-review/await-feedback/pause on the System administration board.
Jul 30 2021, 11:22 AM · Storage manager, System administration
ardumont added a comment to T3450: 404 error when visiting a successfully archived repository.

Thanks for the heads up @ both of you.

Jul 30 2021, 11:14 AM · Storage manager, System administration