diff --git a/sysadm/deployment/data-migration.rst b/sysadm/deployment/data-migration.rst deleted file mode 100644 --- a/sysadm/deployment/data-migration.rst +++ /dev/null @@ -1,16 +0,0 @@ -.. _data-migration: - -How to handle data migrations -============================= - -Empty page ----------- - -.. todo:: - This page is a work in progress. - - - - - - diff --git a/sysadm/deployment/index.rst b/sysadm/deployment/index.rst --- a/sysadm/deployment/index.rst +++ b/sysadm/deployment/index.rst @@ -7,4 +7,4 @@ deployment-environments upgrade-swh-service deploy-lister - data-migration + storage-database-migration diff --git a/sysadm/deployment/storage-database-migration.rst b/sysadm/deployment/storage-database-migration.rst new file mode 100644 --- /dev/null +++ b/sysadm/deployment/storage-database-migration.rst @@ -0,0 +1,84 @@ +.. _storage-database-migration: + +How to handle a storage database migration +========================================== + +.. admonition:: Intended audience + :class: important + + sysadm staff members + +If a storage database upgrade is needed, a migration script should already exist in the +*swh-storage* git repository. + +.. _upgrade_version: + +Upgrade version +--------------- + +Check the current database version (first one in desc order): + +.. code:: sql + + select dbversion from dbversion order by version desc limit 1; + +Say, for example, that the result is 159. + +Check the migration script folder in swh-storage:/sql/upgrades/ (and find the next one, +for example `160.sql +`_). +Its number is the db version retrieved above + 1 (so 160 with the +current example). + +Note: you may need to run more than one migration. It depends on the current +packaged version and the next version we want to deploy. Check the git history to +determine that. 
+ +Requisite +--------- + +Ensure the migration script runs first in the staging database +(db0.internal.staging.swh.network is the node holding the swh staging database). Then +you can go ahead and run it in the production database +(belvedere.internal.softwareheritage.org). + +Connect to the db with the user with write permission, then run the +script: + + $ psql -e ... + > \i sql/upgrades/160.sql + +Note: + +- *-e* echoes each query before its result, so you can see what is currently running + +- For long-running scripts, connect to the remote machine first [5] [6] + +Adaptations +----------- + +Hopefully, in production, the script runs as is without adaptation… + +Otherwise, if the data volume for a given table is large, you may want to adapt. See +`160.sql +`_ +and `its adaptation `_ + +For such a case, consider working on ranges of the table id instead, so each query uses the index +and the transaction stays short. A long-standing migration query translates to a long-running +transaction. This can make WAL files accumulate (for the +replication) and lead to disk space starvation, etc… + +Note +---- + +We use grafana to ensure everything is fine (for example, for the replication, we use +the `postgresql database dashboard, bottom page to the right +`_). + +We also use it to keep a reference of what happened for a given deployment. For this, +open a grafana dashboard (for example `worker task processing dashboard +`_) +and add a tag *deployment* (so it's shared across dashboards) with a description of what +the current deployment is about. It's usually a list of the deployed module names +and their versions. diff --git a/sysadm/deployment/upgrade-swh-service.rst b/sysadm/deployment/upgrade-swh-service.rst --- a/sysadm/deployment/upgrade-swh-service.rst +++ b/sysadm/deployment/upgrade-swh-service.rst @@ -1,10 +1,151 @@ .. 
_upgrade-swh-service: -How to upgrade swh service -========================== +Upgrade swh service +=================== -Empty page ----------- +.. admonition:: Intended audience + :class: important + + sysadm staff members + +Workers +------- + +Dedicated workers [1] run our *swh-worker@loader_{git, hg, svn, npm, ...}* services. +When a new version is released, we need to upgrade their package(s). + +[1] Here are the group names (in `clush +`_ terms): + +- *@swh-workers* for the production workers +- *@azure-workers* for the production ones running on azure +- *@staging-loader-workers* for the staging ones + +See :ref:`deploy-new-lister` for a practical example. + +Code and publish +---------------- + +.. _fix-or-evolve-code: + +Code an evolution or fix an issue in the python code within the git repository's master +branch. Open a diff for review, land it when accepted, and start back at :ref:`tag and push +`. + +.. _tag-and-push: + +Tag and push +~~~~~~~~~~~~ + +When ready, `git tag` and `git push` the new tag of the module. + +.. code:: + + $ git tag -a vA.B.C # annotated tag: --follow-tags does not push lightweight tags + $ git push origin --follow-tags + +.. _publish-and-deploy: + +Publish and deploy +~~~~~~~~~~~~~~~~~~ + +Let jenkins publish and deploy the debian package. + +.. _troubleshoot: + +Troubleshoot +~~~~~~~~~~~~ + +If jenkins fails for some reason, fix the module, be it the :ref:`python code +` or the :ref:`debian packaging `. + +.. _troubleshoot-debian-package: + +Debian package troubleshoot +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In that case, upgrade and check out the *debian/unstable-swh* branch, then fix whatever +is not updated or broken due to a change. It's usually a missing new package dependency +to fix in *debian/control*. Add a new entry in *debian/changelog*. Make sure gbp builds +fine. Then tag it. Jenkins will build the package anew. + +.. code:: + + $ gbp buildpackage --git-tag-only --git-sign-tag # tag it + $ git push origin --follow-tags # trigger the build + +Deploy +------ + +.. 
_nominal_case: + +Nominal case +~~~~~~~~~~~~ + +Update the machine dependencies and restart the service. That usually means, +as a sudo user: + +.. code:: + + $ apt-get update + $ apt-get dist-upgrade -y + $ systemctl restart swh-worker@loader_${type} + +Note that this is for one machine you ssh into. + +We usually wrap those commands from the sysadmin machine pergamon [3] with the *clush* +command, something like: + +.. code:: + + $ sudo clush -b -w @swh-workers 'apt-get update; env DEBIAN_FRONTEND=noninteractive \ + apt-get -o Dpkg::Options::="--force-confdef" \ + -o Dpkg::Options::="--force-confold" -y dist-upgrade' + +[3] *clush* is already configured on pergamon to allow multiple ssh connections in parallel +on our managed infrastructure nodes. + +.. _configuration-change-required: + +Configuration change required +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either wait for puppet to actually deploy the changes first and then go back to the +nominal case. + +Or force a puppet run: + +.. code:: + + $ sudo clush -b -w @swh-workers puppet agent -t + +Note: *-t* is not optional + +.. _long-standing-migration: + +Long-standing migration +~~~~~~~~~~~~~~~~~~~~~~~ + +In that case, you may need to stop all services for the migration, which could take some time +(because lots of data is migrated for example). + +You need to momentarily stop puppet (which runs every 30 min to apply manifest changes) +and the cron service (which restarts down services) on the worker nodes. + +Refer to the :ref:`storage database migration ` +for a concrete case of database migration. + +.. code:: + + $ sudo clush -b -w @swh-workers 'systemctl stop cron.service; puppet agent --disable' + +Then: + +- Execute the database migration. +- Go back to the nominal case. +- Re-enable puppet and restart cron on the workers + +.. code:: + + $ sudo clush -b -w @swh-workers 'systemctl start cron.service; puppet agent --enable' -.. todo:: - This page is a work in progress.
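
The long-standing migration sequence added by this patch (stop cron and disable puppet, migrate, upgrade, then restore both) could be sketched as one dry-run script, so the order of the steps can be reviewed before anything is executed. This is only a sketch under assumptions: the helper names `run`, `run_migration` and `migration_plan`, and the `GROUP` and `DRY_RUN` variables, are hypothetical and not part of the documented procedure. The upgrade step is a simplified form of the nominal case, and the migration itself stays a manual psql step.

```shell
#!/bin/sh
# Sketch of the long-standing migration sequence. Dry run by default:
# commands are printed, not executed. Set DRY_RUN=0 to really run them.
GROUP="${GROUP:-@swh-workers}"   # clush group named in the page
DRY_RUN="${DRY_RUN:-1}"

# Print the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical placeholder: the migration is run by hand with psql,
# as described in the storage database migration page.
run_migration() {
    printf '# run the database migration now, then continue\n'
}

migration_plan() {
    # 1. Stop cron (it restarts down services) and disable puppet.
    run sudo clush -b -w "$GROUP" \
        'systemctl stop cron.service; puppet agent --disable' || return 1
    # 2. Execute the database migration.
    run_migration || return 1
    # 3. Nominal case (simplified): upgrade the packages on the workers.
    run sudo clush -b -w "$GROUP" \
        'apt-get update; apt-get dist-upgrade -y' || return 1
    # 4. Restart cron and re-enable puppet.
    run sudo clush -b -w "$GROUP" \
        'systemctl start cron.service; puppet agent --enable' || return 1
}
```

Sourcing the file and calling `migration_plan` prints the four steps in order. Note that the dry-run output drops the quoting around the remote command, so it is meant for reading, not for copy-pasting.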