docs/issue-debugging-monitoring.md
- This file was added.
# Tutorial: issue debugging and monitoring

To debug issues happening in production, you need to gather as much information as
possible about the issue. This helps with reproducing or directly fixing it. In addition,
you want to monitor the issue to see how it evolves or whether it is fixed for good.

The tools used at SWH to get insights into issues happening in production are Sentry and
Kibana.

## Sentry overview

SWH instance URL: <https://sentry.softwareheritage.org/>

The service requires a login/password pair to access, but does not require SWH VPN
access. To sign up, click "Request to join" and provide your SWH developer email address
for the admins to create the account.

Official documentation: <https://docs.sentry.io/product/>

Sentry is specifically geared towards debugging production issues. In the "Issues" pane,
it presents issues grouped by similarity, with statistics about their occurrence. Issues
can be filtered by:

- project (i.e. SWH service repository), e.g. "swh-loader-core" or "swh-vault";
- environment, e.g. "production" or "staging";
- time range.

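The same filters can be encoded in a link to the "Issues" pane, which is handy for
sharing a filtered view. The following is a minimal sketch, not an official Sentry API:
the query parameter names (`project`, `environment`, `statsPeriod`) are assumptions based
on URLs produced by the Sentry web UI, the helper `issues_url` is hypothetical, and the
organization `swh` and project ID `11` are taken from the example issue URL used later
in this tutorial.

```python
from urllib.parse import urlencode

# Base URL of the SWH Sentry instance, as given in this tutorial.
SENTRY_BASE = "https://sentry.softwareheritage.org"


def issues_url(org, project_id, environment=None, stats_period=None):
    """Build a link to the "Issues" pane with filters pre-applied.

    Hypothetical helper: the parameter names below are assumptions based on
    URLs observed in the Sentry web UI and may differ between versions.
    """
    params = {"project": project_id}
    if environment is not None:
        params["environment"] = environment
    if stats_period is not None:
        # Relative time range, e.g. "14d" for the last 14 days (assumed format).
        params["statsPeriod"] = stats_period
    return f"{SENTRY_BASE}/organizations/{org}/issues/?{urlencode(params)}"


url = issues_url("swh", 11, environment="production", stats_period="14d")
print(url)
```

If the link does not apply the expected filters on your Sentry version, set them once
in the UI and copy the resulting URL instead.
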
When viewing a particular issue, Sentry presents the execution trace at the point of
error, along with a lot of contextual metadata about the running environment. You get
post-mortem debugging information, including the pretty-printed values of local
variables. The metadata includes the first and last occurrence, the corresponding
component versions, the installed packages, and the server environment. The breadcrumbs
view shows several event log lines produced in the same run prior to the error. These are
not the logs produced by the application, but events gathered through Sentry
integrations.

## Debugging SWH services with Sentry

Here we show a specific type of issue that is characteristic of microservice
architectures as implemented at SWH. It can be difficult to find where an issue
originates, because the execution is split between multiple services. This results in a
chain of linked issues, potentially one for each service involved.

Errors of type `RemoteException` encapsulate an error occurring in a service called
through an RPC mechanism. If the information encapsulated in this top-level error is not
sufficient, search for complementary traces by filtering the "Issues" view by the linked
service's project name.

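To decide what to search for in the linked service's project, it helps to pull the inner
error type out of the top-level message. The sketch below assumes the message has the
shape `<RemoteException STATUS InnerType: ARGS>`, which is inferred from a single example
message and is not a documented format; `parse_remote_exception` and `RemoteError` are
hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class RemoteError(NamedTuple):
    status: int       # HTTP status code reported by the remote service
    inner_type: str   # name of the exception raised in the remote service
    inner_args: str   # raw, unparsed arguments of that exception


# Assumed message shape, inferred from one example; adjust if yours differs.
_PATTERN = re.compile(
    r"<RemoteException (?P<status>\d+) (?P<type>\w+): (?P<args>.*)>$"
)


def parse_remote_exception(message: str) -> Optional[RemoteError]:
    """Return the parts of a RemoteException message, or None if it does not match."""
    match = _PATTERN.match(message.strip())
    if match is None:
        return None
    return RemoteError(int(match["status"]), match["type"], match["args"])


parsed = parse_remote_exception(
    "<RemoteException 500 HttpResponseError: ['Download stream interrupted.']>"
)
print(parsed.inner_type)
```

The extracted `inner_type` can then be used as a search term in the "Issues" view of the
linked service's project.
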
Example:

Sentry issue: <https://sentry.softwareheritage.org/organizations/swh/issues/5026/?project=11>

The error appears as `<RemoteException 500 HttpResponseError: ['Download stream interrupted.']>`.
A request from a vault cooker to the storage service had a network error.
Thanks to Sentry, we also see which specific storage was requested:
`<RemoteStorage url=http://storage01.euwest.azure.internal.softwareheritage.org:5002/>`

Upon searching in the storage service issues, we find a corresponding `HttpResponseError`:
<https://sentry.softwareheritage.org/organizations/swh/issues/3857/?project=3>

We skip through the error reporting logic in the trace to get to the operation that was
performed. We see that this error comes in turn from an RPC call to the objstorage service:
HttpResponseError: "Download stream interrupted." at `swh/storage/objstorage.py` in `content_get` at line 41

This is a transient network error: it should not persist when retrying. So a solution
might be to add a retry mechanism somewhere in this chain of RPC calls.

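As an illustration of what such a retry mechanism could look like, here is a minimal
sketch of a retry decorator with exponential backoff, using only the standard library.
It is not the mechanism used in SWH code: `TransientError`, `retry` and
`flaky_content_get` are hypothetical names, and in practice you would retry only on the
specific exception types known to be transient.

```python
import functools
import time


class TransientError(Exception):
    """Hypothetical stand-in for a transient RPC failure."""


def retry(max_attempts=3, base_delay=0.5, retry_on=(TransientError,), sleep=time.sleep):
    """Retry the decorated function on the given exceptions, with exponential backoff."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the error propagate
                    # Wait base_delay, then 2x, 4x, ... before the next attempt.
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


calls = []


@retry(max_attempts=3, base_delay=0.01)
def flaky_content_get():
    """Simulated RPC call that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("Download stream interrupted.")
    return b"content"


result = flaky_content_get()
print(result, len(calls))
```

Where to place the retry (in the vault cooker, the storage service, or the objstorage
client) is a design decision: retrying at several levels of the chain multiplies the
number of attempts, so it is usually better to pick a single one.
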
## Issue monitoring with Sentry

Aggregated error traces as shown in the "Issues" pane are the primary source of
information for monitoring. This includes the statistics of occurrence for a given
period of time.

Sentry also comes with issue management features that notably let you silence or
resolve errors. Silencing means the issue will still be recorded, but no notification
will be sent. Resolving means the issue will be hidden from the default view, and any new
occurrence of it will specifically notify the issue owner that the issue still arises and
is in fact not resolved. Make sure an owner is associated with the issue, typically
through ownership rules set in the project settings.

For more info on monitoring issues, refer to:
<https://docs.sentry.io/product/error-monitoring/>

## Kibana overview

SWH instance URL: <http://kibana0.internal.softwareheritage.org:5601/app/kibana/>

Access to the SWH VPN is needed, but credentials are not.

Related wiki page: <https://intranet.softwareheritage.org/wiki/Kibana>

Official documentation: <https://www.elastic.co/guide/en/kibana/current/index.html>

Kibana is a visualization UI for searching through indexed logs. You can search through
different sources of logs in the "Discover" pane. The sources configured include
application logs for SWH services and system logs. You can also access dashboards shared
by others on a particular topic, or create your own from a saved search.

There are two query languages, which are quite similar: Lucene and KQL. Whichever one you
choose, you will have the same querying capabilities. A query tries to match values for
specific keys, and supports many predicates and combinations of them. See the
documentation for KQL: <https://www.elastic.co/guide/en/kibana/current/kuery-query.html>

To get logs for a particular service, you have to know the name of its systemd unit and
the hostname of the production server providing this service. For a worker, switch the
index pattern to "swh_workers-*"; for another SWH service, switch it to "systemlogs-*".

Example for getting swh-vault production logs:

With the index pattern set to "systemlogs-*", enter the KQL query:

`systemd_unit:"gunicorn-swh-vault.service" AND hostname:"vangogh"`

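If you build such queries often, for instance to paste them into the search bar for
different services, a small helper can assemble them. This is a minimal sketch:
`kql_and` is a hypothetical helper, it only covers exact-match clauses joined with `AND`,
and the escaping of quotes and backslashes reflects an assumption about how quoted KQL
values should be written.

```python
def kql_and(**fields):
    """Join exact-match field:"value" clauses with AND.

    Hypothetical helper: only covers quoted exact matches, not ranges,
    wildcards or nested fields.
    """
    clauses = []
    for key, value in fields.items():
        # Escape backslashes first, then double quotes, inside the quoted value.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{key}:"{escaped}"')
    return " AND ".join(clauses)


query = kql_and(systemd_unit="gunicorn-swh-vault.service", hostname="vangogh")
print(query)
```

The printed string is the same query as in the example above, built from its two
field/value pairs.
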
Upon expanding a log entry with the leading arrow icon, you can inspect the entry in a
structured way. You can filter on particular values or fields, using the icons to the
left of the desired field. Fields such as "message", "hostname" or "systemd_unit" are
often the most informative. You can also view the entry in context, that is, with
several entries before and after it chronologically.

## Issue monitoring with Kibana

You can use Kibana saved searches and dashboards to follow issues based on associated
logs. Of course, this requires that logs related to the issue we want to track are
actually produced.

You can save a search, as opposed to only a query, to easily get back to it or include
it in a dashboard. Just click "Save" in the top toolbar above the search bar. The saved
search includes the query, filters, selected columns, sorting and index pattern.

Now you may want to have a customizable view of these logs, along with graphical
presentations. In the "Dashboard" pane, create a new dashboard. Click "add" in the top
toolbar and select your saved search. It will appear in a resizable panel. From then on,
doing a search will restrict it to the dataset configured for the panels.

To create more complete visualizations including graphs, refer to:
<https://www.elastic.co/guide/en/kibana/current/dashboard.html>