Page MenuHomeSoftware Heritage

icinga end-to-end (loader) check reports are off
Closed, MigratedEdits Locked

Description

Trigger an end-to-end check results in failure [1] even though the actual loading is fine [2].
And the report ui reports some traceback [3] that's nowhere to be found in the actual loader output [2]

06:05 <+swhbot> icinga PROBLEM: service production Check save-code-now panda with type svn end-to-end on pergamon.softwareheritage.org is CRITICAL: SAVECODENOW CRITICAL - Save code now request for origin ('svn', 'https://subversion.renater.fr/anonscm/svn/panda') took 11.53s and failed.

[2]

Feb 02 08:22:07 worker09 python3[753971]: [2022-02-02 08:22:07,282: INFO/ForkPoolWorker-103] Load origin 'https://subversion.renater.fr/anonscm/svn/panda' with type 'svn'
Feb 02 08:22:11 worker09 python3[753971]: [2022-02-02 08:22:11,305: WARNING/ForkPoolWorker-103] Retrying RPC call
Feb 02 08:22:15 worker09 python3[753971]: [2022-02-02 08:22:15,543: WARNING/ForkPoolWorker-103] Retrying RPC call
Feb 02 08:22:21 worker09 python3[753971]: [2022-02-02 08:22:21,054: WARNING/ForkPoolWorker-103] Retrying RPC call
Feb 02 08:22:27 worker09 python3[753971]: [2022-02-02 08:22:27,724: WARNING/ForkPoolWorker-103] file:///tmp/swh.loader.svn.ughfe6v8-753971/swh.loader.svn.prjkpn10-753971/tmp7b0fozij@5 already injected.
Feb 02 08:22:27 worker09 python3[753971]: [2022-02-02 08:22:27,925: INFO/ForkPoolWorker-103] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[8c7d82e3-68fd-41e6-9239-f6f24262e307] succeeded in 20.700052144005895s: {'status': 'uneventful'}

[3]

Traceback (most recent call last):
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request

Reason: Error reading from remote server

Possibly related to this sentry issue: https://sentry.softwareheritage.org/share/issue/aa281f3b760348e6b79708ce3c93a32d/

Event Timeline

ardumont triaged this task as High priority.Feb 2 2022, 9:29 AM
ardumont created this task.

Since the loader is ok, the issue seems to be around the communication between the node triggering the check and the rest. So let's try triggering things from pergamon.

From pergamon, triggering the icinga command with the hgview origin results with the
same kind of error of the report (with logs this time [1]). The loader output is still fine [2].

So it's a monitoring problem not another kind of issue.

[1]

swhworker@pergamon:~$ /usr/bin/swh icinga_plugins --warning 300 --critical 600 check-savecodenow --swh-web-url https://archive.softwareheritage.org origin https://foss.heptapod.net/mercurial/hgview --visit-type hg
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==1.0.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/cli.py", line 87, in check_scn_origin
    sys.exit(SaveCodeNowCheck(ctx.obj, origin, visit_type).main())
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/save_code_now.py", line 64, in main
    ), f"Unexpected response: {response}, {response.text}"
AssertionError: Unexpected response: <Response [502]>, <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

[2]

Feb 02 08:33:41 worker07 python3[4163934]: [2022-02-02 08:33:41,926: INFO/ForkPoolWorker-89] Load origin 'https://foss.heptapod.net/mercurial/hgview' with type 'hg'
Feb 02 08:33:45 worker07 python3[4163934]: [2022-02-02 08:33:45,940: WARNING/ForkPoolWorker-89] Retrying RPC call
Feb 02 08:33:50 worker07 python3[4163934]: [2022-02-02 08:33:50,064: WARNING/ForkPoolWorker-89] Retrying RPC call
Feb 02 08:33:55 worker07 python3[4166669]: requesting all changes
Feb 02 08:33:55 worker07 python3[4166669]: adding changesets
Feb 02 08:33:55 worker07 python3[4166669]: adding manifests
Feb 02 08:33:55 worker07 python3[4166669]: adding file changes
Feb 02 08:33:56 worker07 python3[4166669]: added 1245 changesets with 2634 changes to 312 files
Feb 02 08:33:56 worker07 python3[4166669]: new changesets 7bf307e75523:6cd8ebdbc158
Feb 02 08:33:59 worker07 python3[4163934]: [2022-02-02 08:33:59,866: INFO/ForkPoolWorker-89] Task swh.loader.mercurial.tasks.LoadMercurial[1eb2c6f6-11db-4652-8c60-38539e994778] succeeded in 18.009380619972944s: {'status': 'uneventful'}

Sometimes the save code now submission breaks (POST), sometimes it's the reading part (GET).
It seems to boil down to http timeouts [2]

[1]

root@moma:~# grep python-requests /var/log/varnish/varnishncsa.log | grep 502
128.93.166.14 - - [02/Feb/2022:02:24:31 +0000] "GET http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0"
128.93.166.14 - - [02/Feb/2022:08:22:29 +0000] "GET http://archive.softwareheritage.org/api/1/origin/save/svn/url/https://subversion.renater.fr/anonscm/svn/panda/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0"
128.93.166.14 - - [02/Feb/2022:08:33:55 +0000] "GET http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0"
128.93.166.14 - - [02/Feb/2022:09:35:11 +0000] "GET http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0"
128.93.166.14 - - [02/Feb/2022:09:41:16 +0000] "POST http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0"
128.93.166.14 - - [02/Feb/2022:09:44:36 +0000] "POST http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0"
128.93.166.14 - - [02/Feb/2022:09:48:23 +0000] "POST http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0"

[2]

128.93.166.14 - - [02/Feb/2022:09:35:11 +0000] "GET http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0" 60058264
128.93.166.14 - - [02/Feb/2022:09:41:16 +0000] "POST http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0" 60057275
128.93.166.14 - - [02/Feb/2022:09:44:36 +0000] "POST http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0" 60060106
128.93.166.14 - - [02/Feb/2022:09:48:23 +0000] "POST http://archive.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/ HTTP/1.1" 502 341 "-" "python-requests/2.21.0" 60061227

quoted from P1271 (with a specific varnish format it'd be great to have as default)

Finally reproduced through curl (and subsequent cached queries are fine after that [1]).

$ curl https://webapp1.internal.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

[1] P1272 (it's way too much information for the icinga check btw)

It seems to be related to the origins we are regularly checking.
They have a high number of visits and a high number of task runs.

The congestion seems to occur around the scheduler and the webapp communication.
High number of task runs to convert back and forth...

[1] P1272 (it's way too much information for the icinga check btw)

... for only retrieving the few data point (if not the last one) we are interested in.

It'd be great if the endpoint used to read the origin save request status be paginated.
So we could reduce that congestion.

As another data point, the actual save code now ui is using filtering on the db and
not the same endpoint, hence why it's not impacted by this.

The congestion seems to occur around the scheduler and the webapp communication.
High number of task runs to convert back and forth...

Nope, in the end, it's more the back and forth between swh-web and swh-storage. See
below for details.


The following problem is occuring for origin with a high number of visits. So far the
icinga checks is happening on 3 distinct origins whose number is around 8k each.

The actual webapp code executes an api call to retrieve a list of origin visits [2].
Then for each origin visit, it calls the origin visit status api to retrieve its
associated status [2]. Those endpoints are paginated but we are doing the full retrieval
on of those origin visit statuses [1] and then we sort them. This ends up on a high
number of round-trips requests which exceed the timeout of 60 seconds [5] and then this
issue happens.

We can reproduce the encoutered issue just by browsing the ui [3] on origins with high
number of visits (after a new visit on save code now for example). This problem occurs
both in staging [3] and production [4].

As another data point, it also happens that, from time to time, one of those
origin-visit api calls end up timeouting resulting in yet another error [6] (see the
related sentry link in the description). That's occuring since we have a hard-coded
timeout of 500ms happening storage side and the query for those origins largely exceed
it [6]. That's not something we'll be focusing on immediately.

[1] https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/origin_visits.py$42-82

[2] https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/archive.py$966-1008

[3] https://webapp.staging.swh.network/browse/origin/visits/?origin_url=https://foss.heptapod.net/mercurial/hgview

[4] https://archive.softwareheritage.org/browse/origin/visits/?origin_url=https://foss.heptapod.net/mercurial/hgview

[5] P1273

[6]

15:48:15 softwareheritage@belvedere:5432=> select * FROM origin_visit ov
        INNER JOIN origin o ON o.id = ov.origin
        INNER JOIN origin_visit_status ovs USING (origin, visit)
        WHERE o.url = 'https://foss.heptapod.net/mercurial/hgview'
        AND ovs.snapshot is not null
        ORDER BY ov.date DESC, ov.visit DESC, ovs.date DESC LIMIT 1;
+-----------+-------+------------------------------+------+-----------+--------------------------------------------+-------------------------------+--------+----------+--------------------------------------------+------+
|  origin   | visit |             date             | type |    id     |                    url                     |             date              | status | metadata |                  snapshot                  | type |
+-----------+-------+------------------------------+------+-----------+--------------------------------------------+-------------------------------+--------+----------+--------------------------------------------+------+
| 155049819 | 27873 | 2022-02-02 14:43:06.64555+00 | hg   | 155049819 | https://foss.heptapod.net/mercurial/hgview | 2022-02-02 14:43:20.744576+00 | full   | (null)   | \x94aae2785f73ccaec5983b9f5ce071eed323eccf | hg   |
+-----------+-------+------------------------------+------+-----------+--------------------------------------------+-------------------------------+--------+----------+--------------------------------------------+------+
(1 row)

Time: 8490.886 ms (00:08.491)  # <----- 8490ms > 500ms so timeout through swh-storage

And the query are also optimized afaict:

explain select * FROM origin_visit ov
        INNER JOIN origin o ON o.id = ov.origin
        INNER JOIN origin_visit_status ovs USING (origin, visit)
        WHERE o.url = 'https://foss.heptapod.net/mercurial/hgview'
        AND ovs.snapshot is not null
        ORDER BY ov.date DESC, ov.visit DESC, ovs.date DESC LIMIT 1;
+---------------------------------------------------------------------------------------------------------------------------+
|                                                        QUERY PLAN                                                         |
+---------------------------------------------------------------------------------------------------------------------------+
| Limit  (cost=658.51..658.51 rows=1 width=173)                                                                             |
|   ->  Sort  (cost=658.51..658.53 rows=6 width=173)                                                                        |
|         Sort Key: ov.date DESC, ov.visit DESC, ovs.date DESC                                                              |
|         ->  Nested Loop  (cost=1.28..658.48 rows=6 width=173)                                                             |
|               Join Filter: (o.id = ovs.origin)                                                                            |
|               ->  Nested Loop  (cost=0.58..639.03 rows=7 width=80)                                                        |
|                     ->  Index Scan using origin_url_idx1 on origin o  (cost=0.00..2.02 rows=1 width=52)                   |
|                           Index Cond: (url = 'https://foss.heptapod.net/mercurial/hgview'::text)                          |
|                     ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.58..630.63 rows=638 width=28)      |
|                           Index Cond: (origin = o.id)                                                                     |
|               ->  Index Scan using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.70..2.75 rows=2 width=85) |
|                     Index Cond: ((origin = ov.origin) AND (visit = ov.visit))                                             |
|                     Filter: (snapshot IS NOT NULL)                                                                        |
+---------------------------------------------------------------------------------------------------------------------------+
(13 rows)

Time: 6.089 ms

With explain analyze:

explain analyze select * FROM origin_visit ov
        INNER JOIN origin o ON o.id = ov.origin
        INNER JOIN origin_visit_status ovs USING (origin, visit)
        WHERE o.url = 'https://foss.heptapod.net/mercurial/hgview'
        AND ovs.snapshot is not null
        ORDER BY ov.date DESC, ov.visit DESC, ovs.date DESC LIMIT 1;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                               QUERY PLAN                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit  (cost=658.51..658.52 rows=1 width=173) (actual time=10649.674..10649.682 rows=1 loops=1)                                                                         |
|   ->  Sort  (cost=658.51..658.53 rows=6 width=173) (actual time=10649.670..10649.677 rows=1 loops=1)                                                                    |
|         Sort Key: ov.date DESC, ov.visit DESC, ovs.date DESC                                                                                                            |
|         Sort Method: top-N heapsort  Memory: 25kB                                                                                                                       |
|         ->  Nested Loop  (cost=1.28..658.48 rows=6 width=173) (actual time=4.855..10607.017 rows=27512 loops=1)                                                         |
|               Join Filter: (o.id = ovs.origin)                                                                                                                          |
|               ->  Nested Loop  (cost=0.58..639.03 rows=7 width=80) (actual time=2.507..3529.025 rows=27942 loops=1)                                                     |
|                     ->  Index Scan using origin_url_idx1 on origin o  (cost=0.00..2.02 rows=1 width=52) (actual time=0.047..0.073 rows=1 loops=1)                       |
|                           Index Cond: (url = 'https://foss.heptapod.net/mercurial/hgview'::text)                                                                        |
|                     ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.58..630.63 rows=638 width=28) (actual time=2.456..3518.672 rows=27942 loops=1)   |
|                           Index Cond: (origin = o.id)                                                                                                                   |
|               ->  Index Scan using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.70..2.75 rows=2 width=85) (actual time=0.251..0.252 rows=1 loops=27942) |
|                     Index Cond: ((origin = ov.origin) AND (visit = ov.visit))                                                                                           |
|                     Filter: (snapshot IS NOT NULL)                                                                                                                      |
|                     Rows Removed by Filter: 1                                                                                                                           |
| Planning Time: 1.206 ms                                                                                                                                                 |
| Execution Time: 10649.763 ms                                                                                                                                            |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
(17 rows)

Time: 10666.800 ms (00:10.667)

As a first workaround, D7076 improves the cache use which should help when the cache is
populated (prior to that diff we mostly iterated over all visits even with the populated cache).

Another diff D7078 tries to improve the information retrieval but it's revealing
challenging. We actually may be in need of a new storage endpoint to simplify the
overall status computations in the webapp (P1276).

I'll stop there.

Deployed and the icinga checks are getting back green now.

18:04 <+swhbot> icinga RECOVERY: service production Check save-code-now hgview with type hg end-to-end on pergamon.softwareheritage.org is OK: SAVECODENOW OK - Save code now request for origin ('hg', 'https://foss.heptapod.net/mercurial/hgview') took 10.82s and succeeded.
18:06 <+ardumont> hi ho, it's fixed in the end
18:06 <+ardumont> vsellier: ^
18:06 <+ardumont> \o/
18:06 <+ardumont> "sur le fil"
18:06 <+swhbot> icinga RECOVERY: service staging Check save-code-now hgview with type hg end-to-end on pergamon.softwareheritage.org is OK: SAVECODENOW OK - Save code now request for origin ('hg', 'https://foss.heptapod.net/mercurial/hgview') took 21.40s and succeeded.
18:13 <+swhbot> icinga RECOVERY: service production Check save-code-now panda with type svn end-to-end on pergamon.softwareheritage.org is OK: SAVECODENOW OK - Save code now request for origin ('svn', 'https://subversion.renater.fr/anonscm/svn/panda') took 21.50s and succeeded.
ardumont changed the task status from Open to Work in Progress.Feb 11 2022, 6:11 PM
ardumont moved this task from Backlog to in-progress on the System administration board.