Page MenuHomeSoftware Heritage

git origins: latest failure reports
Open, NormalPublic

Description

Symmetrically to other loader reports, here is one on the current redundant failures we have on our git origins.
The period covered is from August to October 2018:

$ cat git-output-august-october-2018.txt | python3 -m group_by_exception --loader-type git --no-aggregate  | jq .
{
  "total": 1167689,
  "errors": {
    "FileNotFoundError(2, 'No such file or directory')": 648242,   // <- worker03 had a missing mount point, had the issue with mercurial loader
    "dulwich.errors.GitProtocolError: unexpected http resp 401 for": 403090,
    "psycopg2.extensions.QueryCanceledError: canceling statement due to statement timeout": 36628,
    "502 Bad Gateway": 34300,
    "gaierror(-3, 'Temporary failure in name resolution')": 17493,
    "consumer: Cannot connect to amqp": 6649,
    "pgbouncer cannot connect to server": 4811,
    "StorageAPIError(ConnectionError(MaxRetryError('None: Max retries exceeded with url: ": 3307,
    "Worker exited prematurely: signal 9": 2749,
    "MemoryError": 2164,
    "ValueError: invalid literal for int() with base 16: b''": 1898,
    "swh.objstorage.exc.ObjStorageAPIError: An unexpected error occurred in the api backend:": 1309,
    "OSError: Pack file too big for repository": 978,
    "TypeError: snapshot_add() got an unexpected keyword argument 'back_compat'": 890,
    "ConnectionResetError: [Errno 104] Connection reset by peer": 597,
    "socket.gaierror: [Errno -3] Temporary failure in name resolution": 563,
    "http.client.IncompleteRead:": 509,
    "dulwich.errors.NotGitRepository": 231,
    "BrokenPipeError: [Errno 32] Broken pipe": 208,
    "dulwich.errors.GitProtocolError: unexpected http resp 503": 168,
    "504 Gateway Time-out": 127,
    "dulwich.errors.ObjectFormatException: invalid literal for int() with base 10:": 115,
    "StorageDBError(OperationalError('could not connect to server: Connection refused": 113,
    "psycopg2.IntegrityError: duplicate key value violates unique constraint ": 102,
    "ConnectionResetError(104, 'Connection reset by peer')": 89,
    "psycopg2.InternalError: current transaction is aborted, commands ignored until end of transaction block": 53,
    "psycopg2.extensions.TransactionRollbackError: deadlock detected": 39,
    "psycopg2.IntegrityError: null value in column ": 38,
    "SSL connection has been closed unexpectedly": 29,
    "socket.gaierror: [Errno -5] No address associated with hostname": 25,
    "psycopg2.DatabaseError: server conn crashed?": 22,
    "ValueError: invalid literal for int() with base 16:": 20,
    "OSError(ConnectionRefusedError(111, 'Connection refused')": 15,
    "ValueError: year is out of range": 14,
    "KeyError": 14,
    "OSError(timeout('timed out',),)": 12,
    "OSError(113, 'No route to host')": 12,
    "dulwich.errors.GitProtocolError: unexpected http resp 500": 11,
    "TimeoutError: [Errno 110] Connection timed out": 8,
    "Worker exited prematurely: signal 15 (SIGTERM).',)": 6,
    "socket.gaierror: [Errno -2] Name or service not known": 5,
    "dulwich.errors.GitProtocolError: unexpected http resp 502 for": 4,
    "dulwich.errors.HangupException: The remote server unexpectedly closed the connection.": 4,
    "TimeoutError(110, 'Connection timed out')": 4,
    "OSError: [Errno 113] No route to host": 3,
    "ValueError: Wrong content type `text/html` for API response": 3,
    "psycopg2.OperationalError: terminating connection due to administrator command": 2,
    "OverflowError: timestamp out of range for platform time_t": 2,
    "amqp.exceptions.NotFound: Basic.publish: (404) NOT_FOUND": 2,
    "CONNECTION_FORCED - broker forced connection closure with reason": 2,
    "OSError(OSError(113, 'No route to host'),)": 2,
    "Timed out waiting for UP message from": 1,
    "psycopg2.OperationalError: index row size": 1,
    "dulwich.errors.ObjectFormatException: Unknown field": 1,
    "dulwich.errors.GitProtocolError: unexpected http resp 429": 1,
    "gaierror(-2, 'Name or service not known')": 1,
    "OSError(5, 'Input/output error')": 1,
    "msgpack.exceptions.UnpackValueError: Unpack failed: error = 0": 1,
    "ValueError: not enough values to unpack": 1
  }
}

As usual, some are issues, some are not (e.g 401)

Associated to the error is the number of occurrences.
That will help to determine the next issue to fix.

This is a meta-task to open and track the associated issue if it is deemed one.

dashboard: http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/22195930-d36e-11e8-913b-077937c6a5ef

Related T1246#23774
Related T1159#23772

Event Timeline

ardumont triaged this task as Normal priority.
olasd claimed this task.EditedOct 19 2018, 4:08 PM
olasd added a subscriber: olasd.

Thanks for the analysis!

These look like bugs in the loader rather than the environment:

  • "StorageAPIError(ConnectionError(MaxRetryError('None: Max retries exceeded with url: ": 3305, (probably already fixed)
  • "psycopg2.extensions.QueryCanceledError: canceling statement due to statement timeout": 2125,
  • "ValueError: invalid literal for int() with base 16: b''": 426, (comes from the HTTP lib)
  • "dulwich.errors.ObjectFormatException: invalid literal for int() with base 10:": 30,
  • "psycopg2.IntegrityError: duplicate key value violates unique constraint ": 17,
  • "ValueError: year is out of range": 7,
  • "psycopg2.IntegrityError: null value in column ": 7,
  • "psycopg2.OperationalError: index row size": 1,
  • "OverflowError: timestamp out of range for platform time_t": 1,
  • "KeyError": 1,
  • "dulwich.errors.ObjectFormatException: Unknown field": 1,

the other ones look like environmental issues

olasd added a comment.Oct 19 2018, 5:53 PM

"dulwich.errors.ObjectFormatException: invalid literal for int() with base 10:" is https://github.com/dulwich/dulwich/pull/664

ardumont updated the task description. (Show Details)Oct 20 2018, 8:38 AM

Heads up, i amended the description.
I did it that way to not destroy the analysis thread (adding a bloated comment in the middle of it seemed wrong to me).

  • "FileNotFoundError(2, 'No such file or directory')": 648242,

For the git loader, that error is raised when trying to save the git pack.
It was an infra problem. worker03 had a missing mount point (/srv/storage/space....).
That one got fixed after rescheduling the mercurial loader origins (and see the errors).

All the errors of type psycopg2.IntegrityError: duplicate key value violates unique constraint "content_pkey" [1] are all about sha1 collisions, mainly from repositories
testing the attack uncovered by SHAttered [2]

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/22195930-d36e-11e8-913b-077937c6a5ef?_g=(refreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-60d%2Cmode%3Aquick%2Cto%3Anow))

[2] https://shattered.io/

Errors of type ValueError: year is out of range [1] are related to commit dates that can not be represented using standard datetime.datetime Python object (minyear = 0, maxyear = 9999).
See for instance:

One solution would be to use time.struct_time [2] as an alternative to represent those weird dates as it does not restrict the years range.
But is it worth it regarding the few encountered cases ?

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/22195930-d36e-11e8-913b-077937c6a5ef?_g=(refreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-60d%2Cmode%3Aquick%2Cto%3Anow))

[2] https://docs.python.org/3.5/library/time.html#time.struct_time

Errors of type dulwich.errors.NotGitRepository [1] are likely related to a bug in dulwich regarding redirected repository urls not correctly handled.
A pull request [2] has been submitted to fix that issue.

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/e3fbbb60-1d9c-11e9-b8ce-cf95f437ce37?_g=h@a0fef55&_a=h@e10a772
[2] https://github.com/dulwich/dulwich/pull/680