
pypi loader: Analyze existing errors
Open, Normal, Public

Description

Now that the first loading is done, take a look at the existing errors.

This is a meta-task; a new subtask may be opened per case.

kibana dashboard: http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/32632370-c0bd-11e8-8222-07f3ec376cd5

Event Timeline

ardumont triaged this task as Normal priority. Oct 5 2018, 6:28 PM
ardumont created this task.

The kibana dashboard will help with that (see P311, because it's noisy).

ardumont renamed this task from Analyze pypi errors to pypi loader: Analyze existing errors. Oct 5 2018, 6:31 PM
ardumont updated the task description.
ardumont added a comment. Edited Oct 16 2018, 2:03 PM

Here is the pypi report about the loading errors.

The following output has been filtered and manually edited to:

  • remove the most prominent occurrence of 404 errors (origins no longer exist)
  • aggregate near-identical output

The 'full' output is at P317 (it's derived from P316 for
orchestration):

{
  "googlecode": {
    "total": 3933,
    "errors": {
      "...Reason: 404": 2034,                                       // [1]
      "e 51, in author\n    name = data\nKeyError: 'author'": 1409, // [2]
      "672//: timed out.\nTrying again in x... seconds...\n": 199,  // aggregated manually
      "   f.write(chunk)\nOSError:  Cannot allocate memory": 31,    // [3]
      "meout\nSSL connection has been closed unexpectedly\n": 22,   // [4]
      "CONNRESET\\')\",)', OSError(\"(104, 'ECONNRESET')\",))": 16, // [4]
      "bb3891900f0f86c3c3bcf136fd8eb9a96b4e9a1f5e782287bb": 58,     // [5] aggregated manually
      "eError: Error when checking size: 166344 != 166353": 1,      // [5]
      "xxxxxx-x.x.x-x.xxx.xxx is not a supported archive.": 18,     // [6] aggregated manually
      "server\nERROR:  pgbouncer cannot connect to server\n": 15,   // aggregated manually
      "te 0xd4 in position 219: invalid continuation byte": 17,     // aggregated manually
      "code byte 0xa3 in position 236: invalid start byte": 13,     // aggregated manually
      ") got an unexpected keyword argument 'back_compat'": 13,
      "r.gz blocked. Illegal path to directory metrique-.": 4,      // aggregated manually
      ">nginx/1.10.3</center>\\r\\n</body>\\r\\n</html>\\r\\n')": 11,
      "\nTypeError: 'NoneType' object is not subscriptable": 9,
      ", commands ignored until end of transaction block\n": 7,
      "0.3</center>\\\\r\\\\n</body>\\\\r\\\\n</html>\\\\r\\\\n\\')',)": 7,
      "PermissionError(13, 'Permission denied')": 6,
      "SQL function swh_revision_add() line 3 at PERFORM\n": 6,
      "nectionResetError(104, 'Connection reset by peer')": 5,
      "del dict representing a person.\nKeyError: 'author'": 3,
      "OSError(timeout('timed out',),)": 3,
      "wError: timestamp out of range for platform time_t": 3,
      "   dst.write(buf)\nOSError:  Cannot allocate memory": 2,
      "elf._length+read)\nOSError:  Cannot allocate memory": 2,
      "ection aborted.', OSError(\"(104, 'ECONNRESET')\",))": 2,
      "ocgtk-26460/j2/1.2.1/uncompress/j2-1.2.1/setup.py'": 1,
      "path to file /Users/temek/Downloads/._cronq-0.17.1": 1,
      "batch/0.1.5/uncompress/svgbatch-0.1.5/LICENSE.txt'": 1,
      "ang1_ku7-0.1.0/臺灣言語工具/資料佮語料匯入整合/教育部臺灣\\udce9\\udc96'": 1,
      "nection:  Temporary failure in name resolution',))": 1,
      "725/pyjack/0.3.2/uncompress/pyjack-0.3.2/PKG-INFO'": 1,
      "96kw-8592/cclib/1.0/uncompress/cclib-1.0/ANNOUNCE'": 1,
      "2 for file 'yaxl-0.0.16/docs/dist/yaxl-0.0.16.zip'": 1,
      "tdir(dir_path)\nIndexError: list index out of range": 1,
      " No such file or directory: '/tmp/swh.loader.pypi'": 1,
      "colorama/0.1.8/uncompress/colorama-0.1.8/PKG-INFO'": 1,
      "'Worker exited prematurely: signal 9 (SIGKILL).',)": 1,
      "TimeoutError(110, 'Connection timed out')": 1,
    }
  }
}

[1] These are origins that were removed from PyPI between the pypi
listing and the scheduling of the pypi loading.

[2] This is the first issue we hit early on, when we discovered that
some projects were missing author information (already fixed, T1206).
As they are the most prominent occurrences, they will be rescheduled
asap.

[3] Possibly caused by running too many services simultaneously on the
same VM. We should cross-check, for example, the loader-git around the
same time; it possibly hit the same errors.

[4] Possibly an outage while the storage server was being updated. It
appears on both lines because it can happen at different points in time
during the loading.

[5] Those happen when the loader's client (swh.loader.pypi.client)
detects an error after downloading a release artifact. This is a sign
that this part should be improved to retry the download several times.

[6] Visibly, the archive support needs improving to deal with some
more formats (rpm, etc.).
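
Regarding [5], here is a minimal sketch of what "try the download multiple times" could look like. It is not the swh.loader.pypi.client code: download_with_retry, ChecksumMismatch and the fetch callable are hypothetical names, and the sketch assumes the artifact fits in memory and that a sha256 digest is available for comparison.

```python
import hashlib
import time


class ChecksumMismatch(ValueError):
    """Raised when a downloaded artifact does not match its expected digest."""


def download_with_retry(fetch, expected_sha256, expected_size=None,
                        attempts=3, delay=1.0):
    """Call fetch() until the returned bytes pass the size and digest checks.

    fetch is a hypothetical zero-argument callable returning the artifact
    bytes; a real client would more likely stream to disk.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            data = fetch()
            if expected_size is not None and len(data) != expected_size:
                raise ValueError("Error when checking size: %s != %s"
                                 % (expected_size, len(data)))
            actual = hashlib.sha256(data).hexdigest()
            if actual != expected_sha256:
                raise ChecksumMismatch("Checksum mismatched: %s != %s"
                                       % (expected_sha256, actual))
            return data
        except (OSError, ValueError) as exc:
            # Network errors and failed checks are both treated as retryable.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay * attempt)  # simple linear backoff
    raise last_error
```

Only the last error is re-raised once all attempts are exhausted, so an origin that keeps failing would still show up in the error report.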

The remaining issues are most probably either:

  • errors on our side (pgbouncer, worker lost error, etc.). A simple rescheduling could be enough.
  • current limitations in the loader that need fixing.

Either way, this will need further analysis and dedicated tasks.

In any case, for now, as I said in [2], we will first reschedule
those 1409 origins in error.

Cheers,

In any case, for now, as I said in [2], we will first reschedule
those 1409 origins in error.

Done.

swhscheduler@saatchi:~ $ cat reschedule.pypi.csv | python3 -m swh.scheduler.cli task schedule -c type -c policy -c args -c kwargs --delimiter ';' -

Note: P319
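
For illustration only, a file such as reschedule.pypi.csv could be generated along these lines. The actual file is the one in P319; this sketch assumes that the args and kwargs columns hold JSON-encoded values, and the task type and project name used in the usage example are hypothetical placeholders.

```python
import csv
import io
import json


def build_reschedule_csv(origins, task_type, policy="oneshot"):
    """Return ';'-delimited rows in the column order type;policy;args;kwargs.

    origins is an iterable of (args, kwargs) pairs. Both are JSON-encoded,
    which is an assumption about what the scheduler CLI expects.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    for args, kwargs in origins:
        writer.writerow([task_type, policy,
                         json.dumps(list(args)), json.dumps(dict(kwargs))])
    return out.getvalue()


# Usage with placeholder values (not the real task type or origins):
print(build_reschedule_csv([(["example-project"], {})],
                           "hypothetical-task-type"), end="")
```

The csv module quotes the JSON fields, so a ';' inside an argument would not be mistaken for the column delimiter.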

ardumont added a comment. Edited Oct 18 2018, 11:27 AM

OK, so I reworked the group_by_exception snippet to produce a more readable output:

cat pypi-origins-error-september-2018.txt | python3 -m group_by_exception --loader-type pypi --no-aggregate | jq .
{
  "total": 3931,
  "errors": {
    "Reason: 404": 2033,
    "KeyError: 'author'": 1412,
    "consumer: Cannot connect to amqp": 199,
    "Checksum mismatched": 58,
    "OSError: [Errno 12] Cannot allocate memory": 35,
    "is not a supported archive": 19,
    "invalid continuation byte": 19,
    "psycopg2.DatabaseError: query_wait_timeout": 19,
    "Unexpected status code for API request": 18,
    "requests.exceptions.ChunkedEncodingError": 16,
    "pgbouncer cannot connect to server": 15,
    "invalid start byte": 13,
    "got an unexpected keyword argument 'back_compat'": 13,
    "TypeError: 'NoneType' object is not subscriptable": 9,
    "psycopg2.InternalError: current transaction is aborted": 7,
    "PermissionError(13, 'Permission denied')": 6,
    "PL/pgSQL function swh_person_add_from_revision": 6,
    "PermissionError: [Errno 13] Permission denied": 5,
    "Illegal path": 5,
    "ConnectionResetError(104, 'Connection reset by peer')": 5,
    "OSError(timeout('timed out'": 3,
    "OverflowError: timestamp out of range for platform time_t": 3,
    "SSL connection has been closed unexpectedly": 3,
    "ConnectionError: ('Connection aborted.', OSError": 2,
    "ValueError: Error when checking size": 1,
    "IsADirectoryError: [Errno 21] Is a directory": 1,
    "ConnectionError: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url:": 1,
    "zipfile.BadZipFile: Bad CRC-32 for file": 1,
    "IndexError: list index out of range": 1,
    "FileNotFoundError: [Errno 2] No such file or directory: '/tmp/swh.loader.pypi'": 1,
    "'Worker exited prematurely: signal 9 (SIGKILL).',)": 1,
    "TimeoutError(110, 'Connection timed out')": 1
  }
}

Note: the configuration file is P320.
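
As a rough illustration of that kind of grouping (this is not the actual group_by_exception snippet, whose configuration lives in P320), a sketch could match each error message against a list of known patterns and count the matches. The pattern list below is a small subset taken from the keys of the output above, and the fallback of keying unmatched messages by their tail is an assumption.

```python
from collections import Counter

# Illustrative subset of patterns, taken from the keys of the output above.
PATTERNS = [
    "Reason: 404",
    "KeyError: 'author'",
    "Checksum mismatched",
    "Cannot allocate memory",
    "is not a supported archive",
]


def group_by_exception(messages, patterns=PATTERNS, fallback_width=50):
    """Count error messages by the first pattern they contain.

    Messages matching no pattern are keyed by their last fallback_width
    characters (an assumed fallback, not necessarily the snippet's).
    """
    counts = Counter()
    for message in messages:
        for pattern in patterns:
            if pattern in message:
                counts[pattern] += 1
                break
        else:
            # No known pattern matched: fall back to the message tail.
            counts[message[-fallback_width:]] += 1
    # most_common() sorts by decreasing count, like the report above.
    return {"total": sum(counts.values()),
            "errors": dict(counts.most_common())}
```

Since the first matching pattern wins, more specific patterns would need to come before more generic ones in the list.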

ardumont updated the task description. Oct 22 2018, 10:24 AM
zack added a comment. May 25 2019, 5:00 PM

how many are left? can we close this as well as T419 now that the PyPI listers/loaders have been in production for a while?

how many are left?

No idea.

can we close this as well as T419 now that the PyPI listers/loaders have been in production for a while?

Yes, we can remove this subtask from T419 and close T419,
while still keeping this one open to investigate those errors further.

That's how it's done for other loaders.
It's not particularly good but that's factually what happens.