Page MenuHomeSoftware Heritage

Deposit tests end-to-end are failing in icinga
Closed, MigratedEdits Locked

Description

The 'Check deposit end-to-end ' [1] probe is failing since the beginning of january.

It looks like there arer 2 issuers :

  • an error occurs during the archive upload
  • the error parsing is failing.

The response received from the server is :

<?xml version="1.0" encoding="utf-8"?>
<sword:error xmlns="http://www.w3.org/2005/Atom"
             xmlns:sword="http://purl.org/net/sword/terms/">
    <summary>The &#39;external_identifier&#39; tag is deprecated, the Slug header should be used instead.</summary>
    <sword:treatment>processing failed</sword:treatment>
    
</sword:error>

The exception is :

Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.11.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/cli.py", line 104, in check_deposit_single
    sys.exit(DepositCheck(ctx.obj).main())
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/deposit.py", line 96, in main
    result = self.upload_deposit()
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/deposit.py", line 43, in upload_deposit
    slug=slug,
  File "/usr/lib/python3/dist-packages/swh/deposit/client.py", line 663, in deposit_create
    metadata_path=metadata,
  File "/usr/lib/python3/dist-packages/swh/deposit/client.py", line 351, in execute
    error = self.parse_result_error(r.text)
  File "/usr/lib/python3/dist-packages/swh/deposit/client.py", line 305, in parse_result_error
    "summary": data["summary"],
KeyError: 'summary'

(failure in the xml parsing[2])

[1] https://icinga.softwareheritage.org/dashboard#!/monitoring/service/show?host=pergamon.softwareheritage.org&service=Check%20deposit%20end-to-end
[2] https://forge.softwareheritage.org/source/swh-deposit/browse/master/swh/deposit/client.py$305

Event Timeline

vsellier triaged this task as Normal priority.Jan 18 2021, 7:57 PM
vsellier created this task.

The package python3-swh.icingaplugins:v0.4.3 is released and deployed on pergamon

The test by itself is still in warning but now due to the duration of the import, not an error :

swh/icinga-plugins/data/deposit/jesuisgpl.tgz --metadata /usr/share/swh/icinga-plugins/data/deposit/jesuisgpl.tgz.xml
DEPOSIT WARNING - Deposit took 226.67s and succeeded.
| 'load_time' = 174.83s
| 'total_time' = 226.67s
| 'upload_time' = 0.53s
| 'validation_time' = 51.31s

220s is above the default duration threshold of 120s

it seems it's the scheduler running that is taking time to scheduler the deposit task :
08:37:53 -> task is created
08:43:05 -> the runner is scheduling the task
08:43:24 -> the worker acknowledge the task

Jan 20 08:37:53 moma python3[759420]: 2021-01-20 08:37:53 [759420] gunicorn.access:INFO 127.0.0.1 - swh [20/Jan/2021:08:37:53 +0000] "POST /1/swh/ HTTP/1.1" 201 1360 "-" "swh-deposit/0.9.2"
Jan 20 08:43:05 saatchi swh[1323451]: INFO:swh.scheduler.celery_backend.runner:Grabbed 4192 tasks load-git
Jan 20 08:43:05 saatchi swh[1323451]: INFO:swh.scheduler.celery_backend.runner:Grabbed 1 tasks load-deposit
Jan 20 08:43:24 moma python3[759418]: 2021-01-20 08:43:24 [759418] gunicorn.access:INFO 127.0.0.1 - swhworker [20/Jan/2021:08:43:24 +0000] "PUT /1/private/1402/update/ HTTP/1.1" 204 0 "-" "python-requests/2.21.0"
root@pergamon:~# /usr/bin/swh icinga_plugins check-deposit --server https://deposit.softwareheritage.org/1 --username xxxx --password xxxx --collection swh single --archive /usr/share/swh/icinga-plugins/data/deposit/jesuisgpl.tgz --metadata /usr/share/swh/icinga-plugins/data/deposit/jesuisgpl.tgz.xml
DEPOSIT WARNING - Deposit took 339.52s and succeeded.
| 'load_time' = 185.12s
| 'total_time' = 339.52s
| 'upload_time' = 0.40s
| 'validation_time' = 154.00s

It seems the runner was busy scheduler git task until it takes care of the deposit content to schedule.

As it looks it's not an issue but a high pressure on the runner, let's try to increase the threshold to remove the icinga warning

vsellier changed the task status from Open to Work in Progress.Jan 21 2021, 9:21 AM
vsellier moved this task from Backlog to in-progress on the System administration board.
vsellier claimed this task.

With the longer warning threshold, the monitoring is now green.

vsellier added a revision: Restricted Differential Revision.Jan 25 2021, 11:48 AM