diff --git a/sysadmin/grid5000/cassandra/Readme.md b/sysadmin/grid5000/cassandra/Readme.md
index 3a8cd38..5d830d9 100644
--- a/sysadmin/grid5000/cassandra/Readme.md
+++ b/sysadmin/grid5000/cassandra/Readme.md
@@ -1,250 +1,250 @@
Grid5000 terraform provisioning
===============================

- [Grid5000 terraform provisioning](#grid5000-terraform-provisioning)
  - [Prerequisite](#prerequisite)
  - [Run](#run)
    - [Local (on vagrant)](#local-on-vagrant)
    - [On Grid5000](#on-grid5000)
      - [Via the custom script](#via-the-custom-script)
        - [Reservation configuration](#reservation-configuration)
        - [Nodes configuration](#nodes-configuration)
        - [Execution](#execution)
      - [(deprecated) With terraform](#deprecated-with-terraform)
  - [Cleanup](#cleanup)
  - [TODO](#todo)
  - [Possible improvements](#possible-improvements)

Prerequisite
------------

Tools
#####

terraform >= 13.0
vagrant >= 2.2.3 [for local tests only]

Credentials
###########

* grid5000 credentials

```
cat <<EOF > ~/.grid5000.yml
uri: https://api.grid5000.fr
username: username
password: password
EOF
```

These credentials will be used to interact with the grid5000 API to create the jobs.

* Private/public key files (id_rsa) in the `~/.ssh` directory

The public key will be installed on the nodes.

Run
---

### Local (on vagrant)

The `Vagrantfile` is configured to provision 3 nodes, install cassandra and configure the cluster using the ansible configuration:

```
vagrant up
vagrant ssh cassandra1
sudo -i
nodetool status
```

If everything is OK, the `nodetool` command returns:

```
root@cassandra1:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.168.180.12  15.78 KiB  256     67.9%             05d61a24-832a-4936-b0a5-39926f800d09  rack1
UN  10.168.180.11  73.28 KiB  256     67.0%             23d855cc-37d6-43a7-886e-9446e7774f8d  rack1
UN  10.168.180.13  15.78 KiB  256     65.0%             c6bc1eff-fa0d-4b67-bc53-fc31c6ced5bb  rack1
```

Cassandra can take some time to start, so you may have to wait for the cluster to stabilize.
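A minimal sketch of how that wait can be automated from the host, polling through `vagrant ssh` until all three nodes report `UN` (Up/Normal); the 10s delay is arbitrary:

```
# Illustrative only: wait until all 3 cassandra nodes report Up/Normal (UN).
until [ "$(vagrant ssh cassandra1 -c 'sudo nodetool status' 2>/dev/null | grep -c '^UN')" -eq 3 ]; do
    echo "cluster not stable yet, retrying in 10s..."
    sleep 10
done
echo "cluster is up"
```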
### On Grid5000

Useful links:

* Hardware information: https://www.grid5000.fr/w/Hardware
* Resources availability: https://www.grid5000.fr/w/Status

#### Via the custom script

##### Reservation configuration

The configuration is defined in the `environment.cfg` file. In this file, the g5k sites, clusters, nodes and their distribution can be configured.

##### Nodes configuration

The node installation is done by ansible. It needs to know the node topology to correctly configure the tools (zfs pools and datasets, cassandra seeds, ...).
The configuration is centralized in the `ansible/hosts.yml` file.

##### Execution

1. Transfer the files to the right site on g5k:

```
rsync -avP --exclude .vagrant --exclude .terraform cassandra access.grid5000.fr:<site>/
```

2. Connect to the right site:

```
ssh access.grid5000.fr
ssh <site>
```

3. Reserve the disks

The disks must be reserved before the node creation or they will not be detected on the nodes:

```
./00-reserve_disks.sh
```

Check the status of the job / the resources to be sure they are correctly reserved:

```
$ oarstat -fj <job_id> | grep state
    state = Running
```

The state must be `Running`.

4. Launch a complete run

```
./01-run.sh
```

DISCLAIMER: currently, it only runs the following steps:

- reserve the nodes
- install the os on all the nodes
- launch ansible on all the nodes

The underlying scripts can be run independently if they need to be restarted:

- `02-reserver-nodes.sh`: reserve the node resources
- `03-deploy-nodes.sh`: install the os (only once per reservation) and launch ansible on all the nodes. To force an os reinstallation, remove the `.os.stamp` file.

5. Cleanup the resources

To release the nodes, find the job id:

```
oarstat -u
```

then delete the job:

```
oardel <job_id>
```

#### (deprecated) With terraform

Terraform can be great for reserving the resources, but it doesn't allow managing the scheduled jobs.

* Initialize the terraform modules (first time only)

```
terraform init
```

* Test the plan

It only checks the status of the declared resources against the grid5000 state. It's a read-only operation; no action will be performed on grid5000.

```
terraform plan
```

* Execute the plan

```
terraform apply
```

This action creates the job, provisions the nodes according to the `main.tf` file content and installs the specified linux distribution on them. This command logs the reserved node names in its output. For example, for a 1 node reservation:

```
grid5000_job.cassandra: Creating...
grid5000_job.cassandra: Still creating... [10s elapsed]
grid5000_job.cassandra: Creation complete after 11s [id=1814813]
grid5000_deployment.my_deployment: Creating...
grid5000_deployment.my_deployment: Still creating... [10s elapsed]
grid5000_deployment.my_deployment: Still creating... [20s elapsed]
grid5000_deployment.my_deployment: Still creating... [30s elapsed]
grid5000_deployment.my_deployment: Still creating... [40s elapsed]
grid5000_deployment.my_deployment: Still creating... [50s elapsed]
grid5000_deployment.my_deployment: Still creating... [1m0s elapsed]
grid5000_deployment.my_deployment: Still creating... [1m10s elapsed]
grid5000_deployment.my_deployment: Still creating... [1m20s elapsed]
grid5000_deployment.my_deployment: Still creating... [1m30s elapsed]
grid5000_deployment.my_deployment: Still creating... [1m40s elapsed]
grid5000_deployment.my_deployment: Still creating... [1m50s elapsed]
grid5000_deployment.my_deployment: Still creating... [2m0s elapsed]
grid5000_deployment.my_deployment: Still creating... [2m10s elapsed]
grid5000_deployment.my_deployment: Creation complete after 2m12s [id=D-0bb76036-1512-429f-be99-620afa328b26]

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

Outputs:

nodes = [
  "chifflet-6.lille.grid5000.fr",
]
```

It's now possible to connect to the nodes:

```
$ ssh -A access.grid5000.fr
$ ssh -A root@chifflet-6.lille.grid5000.fr
Linux chifflet-6.lille.grid5000.fr 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
Debian10-x64-base-2021060212 (Image based on Debian Buster for AMD64/EM64T)
Maintained by support-staff
Doc: https://www.grid5000.fr/w/Getting_Started#Deploying_nodes_with_Kadeploy
root@chifflet-6:~#
```
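The `nodes` output can also be fed to ansible by hand; a possible sketch (assuming `jq` is installed, and root as the deploy user as above):

```
# Illustrative only: turn the terraform `nodes` output into a one-off
# ansible inventory and check connectivity.
terraform output -json nodes | jq -r '.[]' \
    | sed 's/$/ ansible_user=root/' > inventory.ini
ansible -i inventory.ini all -m ping
```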
Cleanup
-------

To destroy the resources before the end of the job:

```
terraform destroy
```

If the job is stopped, simply remove the `terraform.tfstate` file:

```
rm terraform.tfstate
```

## TODO

[X] variablization of the script
[X] Ansible provisioning of the nodes
[X] disk initialization
[X] support different cluster topologies (nodes / disks / ...)
[X] cassandra installation
-[ ] Add a tool to erase the reserved disks (useful to keep zfs from detecting the previous pools, so a run can restart from scratch)
-[ ] swh-storage installation
+[X] swh-storage installation
[ ] journal client for mirroring
[ ] monitoring by prometheus
+[ ] Add a tool to erase the reserved disks (useful to keep zfs from detecting the previous pools, so a run can restart from scratch)

## Possible improvements

[ ] Use several besteffort jobs for the cassandra nodes. They can be interrupted but don't have duration restrictions.
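For reference, a besteffort reservation could look like the sketch below (not part of the scripts; the exact `oarsub` resource/walltime syntax should be checked against the OAR documentation):

```
# Hypothetical besteffort reservation for 3 cassandra nodes (one week).
# Besteffort jobs escape the usual duration limits but may be killed anytime.
oarsub -t besteffort -l nodes=3,walltime=168:00:00 "sleep infinity"
```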
diff --git a/sysadmin/grid5000/cassandra/Vagrantfile b/sysadmin/grid5000/cassandra/Vagrantfile
index ee1bf9e..56dcabd 100644
--- a/sysadmin/grid5000/cassandra/Vagrantfile
+++ b/sysadmin/grid5000/cassandra/Vagrantfile
@@ -1,69 +1,75 @@
# -*- mode: ruby -*-
# vi: set ft=ruby :

vms = {
  "cassandra1" => {
    :ip => "10.168.180.11",
    :memory => 2048,
    :cpus => 2,
    :type => 'cassandra',
  },
  "cassandra2" => {
    :ip => "10.168.180.12",
    :memory => 2048,
    :cpus => 2,
    :type => 'cassandra',
  },
  "cassandra3" => {
    :ip => "10.168.180.13",
    :memory => 2048,
    :cpus => 2,
    :type => 'cassandra',
  },
+  "swh-storage" => {
+    :ip => "10.168.180.14",
+    :memory => 1024,
+    :cpus => 2,
+    :type => 'swh-storage',
+  },
}

# Images/remote configuration
$global_debian10_box = "debian10-20210517-1348"
$global_debian10_box_url = "https://annex.softwareheritage.org/public/isos/libvirt/debian/swh-debian-10.9-amd64-20210517-1348.qcow2"

vms.each { | vm_name, vm_props |
  Vagrant.configure("2") do |global_config|
    unless Vagrant.has_plugin?("libvirt")
      $stderr.puts <<-MSG
vagrant-libvirt plugin is required for this. To install: `$ sudo apt install vagrant-libvirt`
MSG
      exit 1
    end

    global_config.vm.define vm_name do |config|
      config.vm.box = $global_debian10_box
      config.vm.box_url = $global_debian10_box_url
      config.vm.box_check_update = false
      config.vm.hostname = vm_name
      config.vm.network :private_network, ip: vm_props[:ip], netmask: "255.255.0.0"
      config.vm.synced_folder ".", "/vagrant", type: 'nfs', nfs_version: 4

      config.vm.provision :ansible do |ansible|
        ansible.verbose = true
        ansible.become = true
        ansible.playbook = "ansible/playbook.yml"
        ansible.inventory_path = "ansible/hosts.yml"
      end

      config.vm.provider :libvirt do |provider|
        provider.memory = vm_props[:memory]
        provider.cpus = vm_props[:cpus]
        provider.driver = 'kvm'

        if vm_props[:type] == "cassandra"
          provider.storage :file, :size => '1G'
          provider.storage :file, :size => '1G'
          provider.storage :file, :size => '1G'
        end
      end
    end
  end
}
diff --git a/sysadmin/grid5000/cassandra/ansible/hosts.yml b/sysadmin/grid5000/cassandra/ansible/hosts.yml
index f5797af..2004742 100644
--- a/sysadmin/grid5000/cassandra/ansible/hosts.yml
+++ b/sysadmin/grid5000/cassandra/ansible/hosts.yml
@@ -1,76 +1,86 @@
# Global configuration
+swh-storage:
+  hosts:
+    parasilo-[20:28].rennes.grid5000.fr:
+    # local vagrant hosts
+    swh-storage:
+
cassandra:
  hosts:
    dahu-[1:32].grenoble.grid5000.fr:
-    parasilo-[2:4].rennes.grid5000.fr:
+    parasilo-[1:19].rennes.grid5000.fr:
    # local vagrant hosts
    cassandra[1:9]:
  vars:
    ansible_connection: local
    cassandra_config_dir: /etc/cassandra
    cassandra_data_dir_base: /srv/cassandra
    cassandra_data_dir_system: "{{cassandra_data_dir_base}}/system"
    cassandra_data_dir: "{{ cassandra_data_dir_base }}/data"
    cassandra_commitlogs_dir: "{{ cassandra_data_dir_base }}/commitlogs"

# Per cluster specificities
dahu_cluster_hosts:
  hosts:
    dahu-[1:32].grenoble.grid5000.fr:
  vars:
    cassandra_listen_interface: enp24s0f0

parasilo_cluster_hosts:
  hosts:
    parasilo-[1:28].rennes.grid5000.fr:
  vars:
    cassandra_listen_interface: eno1
    zfs_pools:
      commitlogs:
        disks:
          - sdf
        datasets:
          commitlogs: /srv/cassandra/commitlogs
      data:
        disks:
          - sdb
          - sdc
          - sdd
          - sde
        datasets:
          data: /srv/cassandra/data

# Vagrant configuration
vagrant_nodes:
  hosts:
    cassandra1:
      ansible_host: 10.168.180.11
      ansible_user: vagrant
      ansible_ssh_private_key_file: .vagrant/machines/cassandra1/libvirt/private_key
    cassandra2:
      ansible_host: 10.168.180.12
      ansible_user: vagrant
      ansible_ssh_private_key_file: .vagrant/machines/cassandra2/libvirt/private_key
    cassandra3:
      ansible_host: 10.168.180.13
      ansible_user: vagrant
-      ansible_ssh_private_key_file: .vagrant/machines/cassandra2/libvirt/private_key
+      ansible_ssh_private_key_file: .vagrant/machines/cassandra3/libvirt/private_key
+    swh-storage:
+      ansible_host: 10.168.180.14
+      ansible_user: vagrant
+      ansible_ssh_private_key_file: .vagrant/machines/swh-storage/libvirt/private_key
  vars:
    ansible_connection: ssh
    cassandra_listen_interface: eth1
    # passed through --extra-vars on grid5000
    cassandra_seed_ips: 10.168.180.11,10.168.180.12,10.168.180.13
    zfs_pools:
      commitlogs:
        disks:
          - vdb
        datasets:
          commitlogs: /srv/cassandra/commitlogs
      data:
        disks:
          - vdc
          - vdd
        datasets:
          data: /srv/cassandra/data
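The `zfs_pools` mapping is consumed by `zfs.yml`. As a rough shell equivalent of what the vagrant entry above describes (assuming one pool per entry and one dataset per mountpoint; pool and dataset names are illustrative):

```
# Sketch only: what the vagrant zfs_pools definition amounts to.
zpool create commitlogs vdb
zfs create -o mountpoint=/srv/cassandra/commitlogs commitlogs/commitlogs
zpool create data vdc vdd
zfs create -o mountpoint=/srv/cassandra/data data/data
```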
diff --git a/sysadmin/grid5000/cassandra/ansible/playbook.yml b/sysadmin/grid5000/cassandra/ansible/playbook.yml
index 0355342..b8e8a54 100644
--- a/sysadmin/grid5000/cassandra/ansible/playbook.yml
+++ b/sysadmin/grid5000/cassandra/ansible/playbook.yml
@@ -1,11 +1,11 @@
---
- name: Install cassandra
  hosts: cassandra
  tasks:
    - include: zfs.yml
    - include: cassandra.yml

- name: Install SWH Storage
  hosts: swh-storage
-  task:
-    - include: swh-storage
+  tasks:
+    - include: swh-storage.yml
diff --git a/sysadmin/grid5000/cassandra/ansible/swh-storage.yml b/sysadmin/grid5000/cassandra/ansible/swh-storage.yml
new file mode 100644
index 0000000..2ae0901
--- /dev/null
+++ b/sysadmin/grid5000/cassandra/ansible/swh-storage.yml
@@ -0,0 +1,101 @@
+---
+- name: Add Backports repository
+  apt_repository:
+    repo: deb http://deb.debian.org/debian/ buster-backports main contrib non-free
+    filename: backports.sources
+
+- name: swhstorage group
+  group:
+    name: swhstorage
+
+- name: swhstorage user
+  user:
+    name: swhstorage
+    group: swhstorage
+
+- name: Add SWH repository
+  apt_repository:
+    repo: deb [trusted=yes] https://debian.softwareheritage.org/ buster-swh main
+    state: present
+    filename: cassandra.sources
+
+- name: Install packages
+  apt:
+    name:
+      - dstat
+      - python3
+      - python3-gunicorn
+
+- name: Install packages from backports
+  apt:
+    name:
+      - python3-typing-extensions
+      - gunicorn3
+    default_release: buster-backports
+
+- name: Install swh storage packages
+  apt:
+    name:
+      - python3-swh.storage
+
+- name: Create directories
+  file:
+    state: directory
+    path: "{{ item }}"
+    owner: root
+    group: root
+    mode: "0755"
+  with_items:
+    - /etc/gunicorn
+    - /etc/gunicorn/instances
+    - /run/gunicorn
+    - /run/gunicorn/swh-storage
+    - /etc/softwareheritage
+    - /etc/softwareheritage/storage
+
+- name: Create swh-storage directories
+  file:
+    state: directory
+    path: "{{ item }}"
+    owner: swhstorage
+    group: swhstorage
+    mode: "0755"
+  with_items:
+    - /run/gunicorn/swh-storage/
+
+- name: Configure gunicorn - default service
+  template:
+    src: "templates/gunicorn/gunicorn.service"
+    dest: "/etc/systemd/system/gunicorn.service"
+
+- name: Configure gunicorn - log configuration
+  template:
+    src: "templates/gunicorn/logconfig.ini"
+    dest: "/etc/gunicorn/logconfig.ini"
+
+- name: swh-storage gunicorn instance configuration
+  template:
+    src: "templates/gunicorn/gunicorn-instance.cfg"
+    dest: "/etc/gunicorn/instances/swh-storage.cfg"
+
+- name: swh-storage configuration
+  template:
+    src: "templates/swhstorage/storage.yml"
+    dest: "/etc/softwareheritage/storage/storage.yml"
+
+- name: swh-storage service configuration
+  template:
+    src: "templates/gunicorn/gunicorn-instance-service.cfg"
+    dest: "/etc/systemd/system/gunicorn-swh-storage.service" # TODO variabilize
+
+- name: swh-storage service
+  service:
+    name: gunicorn-swh-storage
+    enabled: true
+    state: started
+
+- name: swh-storage init cassandra script
+  template:
+    src: templates/swhstorage/init-cassandra.sh
+    dest: /usr/local/bin/swh-storage-init-cassandra.sh
+    mode: 0755
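Once these tasks have run, a quick smoke test on the node could look like this (sketch only; the 5002 port comes from the gunicorn instance configuration further below, and the exact response of the RPC server is not guaranteed):

```
# Illustrative checks on a provisioned swh-storage node.
systemctl status gunicorn-swh-storage
swh-storage-init-cassandra.sh      # create the 'swh' keyspace (run once, cluster up)
curl -s http://localhost:5002/     # the RPC server should answer here
```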
      # Ex: "<ip1>,<ip2>,<ip3>"
      - seeds: "{{ cassandra_seed_ips }}"

+# needed by swh-storage
+enable_user_defined_functions: true
+
# TODO Test these options' effects
# disk_failure_policy:
# cdc_enabled

#end
diff --git a/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn-instance-service.cfg b/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn-instance-service.cfg
new file mode 100644
index 0000000..63ae9cf
--- /dev/null
+++ b/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn-instance-service.cfg
@@ -0,0 +1,25 @@
+[Unit]
+Description=Gunicorn instance swh-storage
+ConditionPathExists=/etc/gunicorn/instances/swh-storage.cfg
+PartOf=gunicorn.service
+ReloadPropagatedFrom=gunicorn.service
+Before=gunicorn.service
+
+[Service]
+User=swhstorage
+Group=swhstorage
+PIDFile=/run/swh-storage.pid
+RuntimeDirectory=gunicorn/swh-storage
+WorkingDirectory=/run/gunicorn/swh-storage
+Environment=SWH_CONFIG_FILENAME=/etc/softwareheritage/storage/storage.yml
+Environment=SWH_LOG_TARGET=journal
+Environment=SWH_MAIN_PACKAGE=swh.storage
+ExecStart=/usr/bin/gunicorn3 -p /run/gunicorn/swh-storage/pidfile -c /etc/gunicorn/instances/swh-storage.cfg swh.storage.api.server:make_app_from_configfile()
+ExecStop=/bin/kill -TERM $MAINPID
+ExecReload=/bin/kill -HUP $MAINPID
+
+Restart=always
+RestartSec=10
+
+[Install]
+WantedBy=multi-user.target
diff --git a/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn-instance.cfg b/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn-instance.cfg
new file mode 100644
index 0000000..abbcfdf
--- /dev/null
+++ b/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn-instance.cfg
@@ -0,0 +1,51 @@
+# Gunicorn instance configuration.
+
+# import all settings from the base module
+try:
+    from swh.core.api.gunicorn_config import *
+except:
+    import logging
+    logging.exception('Failed to import configuration from swh.core.api.gunicorn_config')
+
+import traceback
+import gunicorn.glogging
+
+class Logger(gunicorn.glogging.Logger):
+    log_only_errors = True
+
+    def access(self, resp, req, environ, request_time):
+        """ See http://httpd.apache.org/docs/2.0/logs.html#combined
+        for format details
+        """
+
+        if not (self.cfg.accesslog or self.cfg.logconfig or self.cfg.syslog):
+            return
+
+        # wrap atoms:
+        # - make sure atoms will be tested case insensitively
+        # - if an atom doesn't exist, replace it by '-'
+        atoms = self.atoms(resp, req, environ, request_time)
+        safe_atoms = self.atoms_wrapper_class(atoms)
+
+        try:
+            if self.log_only_errors and str(atoms['s']) == '200':
+                return
+            self.access_log.info(self.cfg.access_log_format % safe_atoms, extra={'swh_atoms': atoms})
+        except:
+            self.exception('Failed processing access log entry')
+
+logger_class = Logger
+logconfig = '/etc/gunicorn/logconfig.ini'
+
+# custom settings
+# bind = "unix:/run/gunicorn/swh-storage/gunicorn.sock"
+bind = "0.0.0.0:5002"
+workers = 10
+worker_class = "sync"
+timeout = 3600
+graceful_timeout = 3600
+keepalive = 5
+max_requests = 100000
+max_requests_jitter = 1000
+statsd_host = "127.0.0.1:8125"
+statsd_prefix = "swh-storage"
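The instance unit above is tied to the umbrella `gunicorn.service` defined next; (re)deploying either file goes through the usual systemd cycle (sketch):

```
# After (re)deploying the unit and instance files.
systemctl daemon-reload
systemctl enable --now gunicorn-swh-storage
journalctl -u gunicorn-swh-storage -f   # logs land in the journal (SWH_LOG_TARGET=journal)
```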
diff --git a/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn.service b/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn.service
new file mode 100644
index 0000000..512251b
--- /dev/null
+++ b/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/gunicorn.service
@@ -0,0 +1,13 @@
+# File managed by puppet (module swh-gunicorn), changes will be lost
+
+[Unit]
+Description=All gunicorn services
+
+[Service]
+Type=oneshot
+ExecStart=/bin/true
+ExecReload=/bin/true
+RemainAfterExit=on
+
+[Install]
+WantedBy=multi-user.target
diff --git a/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/logconfig.ini b/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/logconfig.ini
new file mode 100644
index 0000000..0f4cdcd
--- /dev/null
+++ b/sysadmin/grid5000/cassandra/ansible/templates/gunicorn/logconfig.ini
@@ -0,0 +1,51 @@
+[loggers]
+keys=root, gunicorn.error, gunicorn.access, azure.storage.common.storageclient, azure.core.pipeline.policies.http_logging_policy
+
+[handlers]
+keys=console, journal
+
+[formatters]
+keys=generic
+
+[logger_root]
+level=INFO
+handlers=journal
+
+[logger_gunicorn.error]
+level=INFO
+propagate=0
+handlers=journal
+qualname=gunicorn.error
+
+[logger_gunicorn.access]
+level=INFO
+propagate=0
+handlers=journal
+qualname=gunicorn.access
+
+[logger_azure.storage.common.storageclient]
+level=WARN
+propagate=0
+handlers=journal
+qualname=azure.storage.common.storageclient
+
+[logger_azure.core.pipeline.policies.http_logging_policy]
+level=WARN
+propagate=0
+handlers=journal
+qualname=azure.core.pipeline.policies.http_logging_policy
+
+[handler_console]
+class=StreamHandler
+formatter=generic
+args=(sys.stdout, )
+
+[handler_journal]
+class=swh.core.logger.JournalHandler
+formatter=generic
+args=()
+
+[formatter_generic]
+format=%(asctime)s [%(process)d] %(name)s:%(levelname)s %(message)s
+datefmt=%Y-%m-%d %H:%M:%S
+class=logging.Formatter
diff --git a/sysadmin/grid5000/cassandra/ansible/templates/swhstorage/init-cassandra.sh b/sysadmin/grid5000/cassandra/ansible/templates/swhstorage/init-cassandra.sh
new file mode 100644
index 0000000..d0d62b3
--- /dev/null
+++ b/sysadmin/grid5000/cassandra/ansible/templates/swhstorage/init-cassandra.sh
@@ -0,0 +1,5 @@
+#!/bin/bash
+
+echo "
+from swh.storage.cassandra import create_keyspace
+create_keyspace({{ cassandra_seed_ips.split(',') }}, 'swh') " | python3
diff --git a/sysadmin/grid5000/cassandra/ansible/templates/swhstorage/storage.yml b/sysadmin/grid5000/cassandra/ansible/templates/swhstorage/storage.yml
new file mode 100644
index 0000000..4206a3a
--- /dev/null
+++ b/sysadmin/grid5000/cassandra/ansible/templates/swhstorage/storage.yml
@@ -0,0 +1,8 @@
+storage:
+  cls: cassandra
+  args:
+    hosts: {{cassandra_seed_ips}}
+    keyspace: swh
+    objstorage:
+      cls: memory
+      args: {}
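With this configuration in place, the service can be exercised from any machine that reaches it, mirroring the `init-cassandra.sh` style above (sketch only: `10.168.180.14:5002` is the vagrant swh-storage host plus the gunicorn bind port, and `check_config` on the remote storage proxy is assumed to be available in the installed swh.storage version):

```
# Illustrative end-to-end check against the running swh-storage RPC server.
echo "
from swh.storage import get_storage
storage = get_storage('remote', url='http://10.168.180.14:5002/')
print(storage.check_config(check_write=False))
" | python3
```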