diff --git a/docs/index.rst b/docs/index.rst index 4ae9a50..021dbac 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,182 +1,183 @@ .. _swh-docs: Software Heritage - Development Documentation ============================================= Getting started --------------- * :ref:`getting-started` → deploy a local copy of the Software Heritage software stack in less than 5 minutes, or * :ref:`developer-setup` → get a working development setup that allows to hack on the Software Heritage software stack Architecture ------------ * :ref:`architecture` → get a glimpse of the Software Heritage software architecture * :ref:`mirror` → learn what a Software Heritage mirror is and how to set up one Data Model and Specifications ----------------------------- * :ref:`persistent-identifiers` Specifications of the SoftWare Heritage persistent IDentifiers (SWHID). * :ref:`data-model` Documentation of the main |swh| archive data model. * :ref:`journal-specs` Documentation of the Kafka journal of the |swh| archive. Components ---------- Here is brief overview of the most relevant software components in the Software Heritage stack. Each component name is linked to the development documentation of the corresponding Python module. :ref:`swh.core ` low-level utilities and helpers used by almost all other modules in the stack :ref:`swh.dataset ` public datasets and periodic data dumps of the archive released by Software Heritage :ref:`swh.deposit ` push-based deposit of software artifacts to the archive swh.docs developer documentation (used to generate this doc you are reading) :ref:`swh.fuse ` Virtual file system to browse the Software Heritage archive, based on `FUSE `_ :ref:`swh.graph ` Fast, compressed, in-memory representation of the archive, with tooling to generate and query it. 
:ref:`swh.indexer ` tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it :ref:`swh.journal ` persistent logger of changes to the archive, with publish-subscribe support :ref:`swh.lister ` collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.) :ref:`swh.loader-core ` low-level loading utilities and helpers used by all other loaders :ref:`swh.loader-git ` loader for `Git `_ repositories :ref:`swh.loader-mercurial ` loader for `Mercurial `_ repositories :ref:`swh.loader-svn ` loader for `Subversion `_ repositories :ref:`swh.model ` implementation of the :ref:`data-model` to archive source code artifacts :ref:`swh.objstorage ` content-addressable object storage :ref:`swh.objstorage.replayer ` Object storage replication tool :ref:`swh.scanner ` source code scanner to analyze code bases and compare them with source code artifacts archived by Software Heritage :ref:`swh.scheduler ` task manager for asynchronous/delayed tasks, used for recurrent (e.g., listing a forge, loading new stuff from a Git repository) and one-off activities (e.g., loading a specific version of a source package) :ref:`swh.search ` search engine for the archive :ref:`swh.storage ` abstraction layer over the archive, allowing to access all stored source code artifacts as well as their metadata :ref:`swh.vault ` implementation of the vault service, allowing to retrieve parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.) :ref:`swh.web ` Web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use :ref:`swh.web.client ` Python client for :ref:`swh.web ` Dependencies ------------ The dependency relationships among the various modules are depicted below. .. _py-deps-swh: .. 
figure:: images/py-deps-swh.svg :width: 1024px :align: center Dependencies among top-level Python modules (click to zoom). Archive ------- * :ref:`Archive ChangeLog `: notable changes to the archive over time Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * `URLs index `_ * :ref:`search` * :ref:`glossary` .. ensure sphinx does not complain about index files not being included .. toctree:: :maxdepth: 2 :caption: Contents: :titlesonly: :hidden: architecture getting-started developer-setup API documentation swh.core swh.dataset swh.deposit swh.fuse swh.graph swh.indexer swh.journal swh.lister swh.loader swh.model swh.objstorage swh.scanner swh.scheduler swh.search swh.storage swh.vault swh.web swh.web.client + issue-debugging-monitoring diff --git a/docs/issue-debugging-monitoring.md b/docs/issue-debugging-monitoring.md new file mode 100644 index 0000000..aa4fad9 --- /dev/null +++ b/docs/issue-debugging-monitoring.md @@ -0,0 +1,143 @@ +# Issue debugging and monitoring guide + +To debug issues happening in production, you need to gather as much information as +possible about the issue. This helps in reproducing or directly fixing it. In addition, +you want to monitor the issue to see how it evolves, or whether it is fixed for good. + +The tools used at SWH to get insights into issues happening in production are Sentry and +Kibana. + +## Sentry overview + +SWH instance URL: + +The service requires a login/password pair to access, but does not require SWH VPN +access. To sign up, click "Request to join" and provide your SWH developer email address +for the admins to create the account. + +Official documentation: + +Sentry is specifically geared towards debugging production issues. In the "Issues" pane, +it presents issues grouped by similarity, with statistics about their occurrence. Issues +can be filtered by: +- project (i.e. the SWH service repository), e.g. "swh-loader-core" or "swh-vault"; +- environment, e.g. "production" or "staging"; +- time range.
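The grouping-by-similarity behavior can be pictured with a toy fingerprint: errors sharing an exception type and raise site collapse into one issue with an occurrence count. This is an illustrative sketch only, not Sentry's actual grouping algorithm:

```python
import traceback
from collections import Counter

def fingerprint(exc: BaseException) -> tuple:
    """Toy issue fingerprint: exception type plus innermost raise site.
    (Illustrative only; Sentry's real grouping is more sophisticated.)"""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1] if frames else None
    site = (last.filename, last.name) if last else (None, None)
    return (type(exc).__name__, *site)

def group(errors):
    """Group a stream of caught exceptions, as an 'Issues' pane would."""
    return Counter(fingerprint(e) for e in errors)

def boom(x):
    return 1 / x  # ZeroDivisionError for 0, TypeError for None

caught = []
for x in (0, 0, 1, None):
    try:
        boom(x)
    except Exception as e:
        caught.append(e)

issues = group(caught)
# Two ZeroDivisionError occurrences collapse into a single issue;
# the TypeError from 1 / None forms a separate issue.
```

The same idea underlies the per-issue occurrence statistics used for filtering and monitoring below.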
+ +When viewing a particular issue, you can access: + +- the execution trace at the point of error, with pretty-printed local variables at each + stack frame, as you would get in a post-mortem debugging session; +- contextual metadata about the running environment, which includes: + - the first and last occurrence as detected by Sentry, + - corresponding component versions, + - installed packages, + - entrypoint parameters, + - runtime environment such as the interpreter version, the hostname, or the logging + configuration; +- the breadcrumbs view, which shows several event log lines produced in the same run + prior to the error. These are not the logs produced by the application, but events + gathered through Sentry integrations. + +## Debugging SWH services with Sentry + +Here we show a specific type of issue that is characteristic of microservice +architectures as implemented at SWH. One difficulty may be finding where an issue +originates, because the execution is split across multiple services. This results in a +chain of linked issues, potentially one for each service involved. + +Errors of type `RemoteException` encapsulate an error occurring in the service called +through an RPC mechanism. If the information encapsulated in this top-level error is not +sufficient, search for complementary traces by filtering the "Issues" view by +the linked service's project name. + +Example: + +Sentry issue: + +The error appears as `` +A request from a vault cooker to the storage service had a network error. + +Thanks to Sentry, we can also see which specific storage was requested: + + `` + +Searching in the storage service issues, we find a corresponding `HttpResponseError`: + + +We skip past the error reporting logic in the trace to get to the operation that was +performed. We see that this error in turn comes from an RPC call to the objstorage service: + + HttpResponseError: "Download stream interrupted."
at `swh/storage/objstorage.py` in `content_get` at line 41 + +This is a transient network error: it should not persist when retrying. So a solution +might be to add a retrying mechanism somewhere in this chain of RPC calls. + +## Issue monitoring with Sentry + +Aggregated error traces, as shown in the "Issues" pane, are the primary source of +information for monitoring. This includes the statistics of occurrence for a given +period of time. + +Sentry also comes with issue management features that notably let you silence or +resolve errors. Silencing means the issue will still be recorded but will not trigger +notifications. Resolving means the issue will be hidden from the default view, and any new occurrence +of it will specifically notify the issue owner that the issue still arises and is in +fact not resolved. Make sure an owner is associated with the issue, typically through +ownership rules set in the project settings. + +For more info on monitoring issues, refer to: + +## Kibana overview + +SWH instance URL: +Access to the SWH VPN is needed, but credentials are not. + +Related wiki page: + +Official documentation: + +Kibana is a visualization UI for searching through indexed logs. You can search through +different sources of logs in the "Discover" pane. The sources configured include +application logs for SWH services and system logs. You can also access dashboards shared +by others on a particular topic, or create your own from a saved search. + +There are two quite similar query languages: Lucene and KQL. Whichever one you +choose, you will have the same querying capabilities. A query matches values for +specific keys, and supports many predicates and combinations of them. See the +documentation for KQL: https://www.elastic.co/guide/en/kibana/current/kuery-query.html + +To get logs for a particular service, you have to know the name of its systemd unit and +the hostname of the production server providing this service.
For a worker, switch the +index pattern to "swh_workers-*"; for any other SWH service, switch it to "systemlogs-*". + +Example: getting swh-vault production logs. + +With the index pattern set to "systemlogs-*", enter the KQL query: + + `systemd_unit:"gunicorn-swh-vault.service" AND hostname:"vangogh"` + +Upon expanding a log entry with the leading arrow icon, you can inspect the entry in a +structured way. You can filter on particular values or fields, using the icons to the +left of the desired field. Fields such as "message", "hostname" or "systemd_unit" are +often the most informative. You can also view the entry in context, with several entries +before and after it chronologically. + +## Issue monitoring with Kibana + +You can use Kibana saved searches and dashboards to follow issues based on associated +logs. This of course requires that logs related to the issue you want to track are +actually produced. + +You can save a search, as opposed to only a query, to easily get back to it or include +it in a dashboard. Just click "Save" in the top toolbar above the search bar. A saved +search includes the query, filters, selected columns, sorting and index pattern. + +Now you may want a customizable view of these logs, along with graphical +presentations. In the "Dashboard" pane, create a new dashboard, click "add" in the top +toolbar and select your saved search. It will appear in a resizable panel. Searching now +restricts the results to the dataset configured for the panels. + +To create more complete visualizations including graphs, refer to:
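Coming back to the transient `HttpResponseError` from the Sentry walkthrough above: the suggested retrying mechanism could live in a small wrapper around the RPC call. A minimal sketch in plain Python with exponential backoff; the names `fetch` and `TransientError` are hypothetical, and a production version would rather use an existing library such as `tenacity`:

```python
import time

class TransientError(Exception):
    """Stand-in for a network-level error such as 'Download stream interrupted'."""

def retry(call, *, attempts=3, base_delay=0.1):
    """Retry `call` on TransientError, sleeping base_delay * 2**i between tries."""
    for i in range(attempts):
        try:
            return call()
        except TransientError:
            if i == attempts - 1:
                raise  # out of retries: propagate to the caller (and Sentry)
            time.sleep(base_delay * 2 ** i)

# Hypothetical flaky RPC call: fails twice, then succeeds.
calls = {"n": 0}
def fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("Download stream interrupted.")
    return b"content bytes"

result = retry(fetch, base_delay=0.01)
# result == b"content bytes" after two transient failures
```

Persistent errors still surface after the last attempt, so Sentry keeps reporting genuine outages instead of masking them.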