Page MenuHomeSoftware Heritage

Google Code SVN import: Examine ingestion logs for errors and list them if any
Closed, ResolvedPublic

Description

When the ingestion is done, retrieve error logs immediately between the start and end date of the ingestion and reference them (e.g paste referenced from that thread or directly as comment here).

The errors should be addressable as per svn repository.

Event Timeline

ingestion starting date: 11/01/2017 13:00

olasd renamed this task from reference errors after ingestion to Google Code: reference errors after ingestion.Feb 10 2017, 2:38 PM
olasd renamed this task from Google Code: reference errors after ingestion to Google Code svn import: reference errors after ingestion.
ardumont renamed this task from Google Code svn import: reference errors after ingestion to Google Code SVN import: Examine ingestion logs for errors and list them if any.Feb 10 2017, 3:17 PM
ardumont updated the task description. (Show Details)
zack lowered the priority of this task from High to Normal.Feb 12 2017, 6:31 PM
ardumont added a comment.EditedOct 4 2017, 12:43 PM

Out of the 140k remaining svn repositories to mount and load, i have 606 errors.

Sum up, 'googlecode' and 'unknown' are the interesting key entry here (unknown is also googlecode in that context):

  "googlecode": {
    "errors": {
      "OSError(12, 'Cannot allocate memory')": 55,
      "OSError(28, 'No space left on device')": 27,
      "StorageAPIError(ConnectionError(ProtocolError('Con": 20,
      "ValueError('Failed to mount the svn dump for proje": 7,
      "SystemError('error return without exception set',)": 5,
      "WorkerLostError('Worker exited prematurely: signal": 3,
      "IntegrityError('duplicate key value violates uniqu": 1
    },
    "total": 118
  },
  "unknown": {
    "errors": {
      "Uneventful visit. Detail: ": 312,                                             
      "Eventful partial visit. Detail: Property 'svn:exte": 110,   
      "No space left on device": 13,                       
      "Eventful partial visit. Detail: too many values to": 10,
      "Eventful partial visit. Detail: error return witho": 6,
      "Eventful partial visit. Detail: ": 3,
      "Eventful partial visit. Detail: Hash tree computat": 2,
      "Can't open file '/tmp/tmp.*swh.loader.svn/": 30,
      "Eventful partial visit. Detail: 'utf-8' codec can'": 1,
      "Eventful partial visit. Detail: need more than 1 v": 1,
    },
    "total": 488
  },
}

Note:
The 'unknown' key reveals a problem in the 2 edge cases scenario in the loader-svn (history altered, partial visit).
Implementation wise, this means that the loader-svn does not log the input arguments of the task (so i cannot reschedule those directly).
This violates the sentence The errors should be addressable as per svn repository..

As a result, i fixed those edge cases in the loader-svn.
Which led me to debian build issues which led me to package back python3-subvertpy for stretch (new platform since 07-08/2017).

Now, since i could not have an exhaustive list of origins to reschedule (through the error logs), I have recomputed 'svn' origins whose last visit resulted in a status other than 'full' (from db softwareheritage).
And i have rescheduled those.

This should go fast and I shall eventually have an exhaustive list of what went wrong where for whom.

First, an important detail, those were disk loading of svn dumps.
So, at first, this mount a gzip dump as an svn repository and then it processes the history log.
So, this can be quite resource consuming (disk, memory).

Detailing the errors's explanation:

  • "OSError(12, 'Cannot allocate memory')": 55,

    -> No more memory available.

    -> This should be dealt at the scheduling (monitoring) and concurrency level (not too much).
  • "OSError(28, 'No space left on device')": 27,

    -> No more space available.

    -> This should be dealt at the scheduling (monitoring) and concurrency level (not too much).
  • "StorageAPIError(ConnectionError(ProtocolError('Con": 20,

    -> problem in connection to the storage

    -> This should be dealt at the scheduling (monitoring) and concurrency level (not too much).
  • "ValueError('Failed to mount the svn dump for proje": 7,

    -> This problem occurs during the mount (svnadmin load) step, this is due to space issue as well.

    -> This should be dealt at the scheduling (monitoring) and concurrency level (not too much).
  • "SystemError('error return without exception set',)": 5,

    -> This error occurs in the guts of the storage serialization discussion between the loader and the storage. When the retrying policy is exhausted, this message seems to be raised without setting properly the error message.

    -> not reproduced.
"{'visit_date': 'Tue, 3 May 2016 17:16:32 +0200'}",
          "message": "[2017-09-19 15:57:21,995: ERROR/MainProcess] Task swh.loader.svn.tasks.MountAndLoadSvnRepositoryTsk[da14ebf6-c81c-4a7c-95e8-6c7280eac62e] raised unexpected: SystemError('error return without exception set',)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/scheduler/task.py", line 35, in run
    raise e from None
  File "/usr/lib/python3/dist-packages/swh/scheduler/task.py", line 32, in run
    result = self.run_task(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/loader/svn/tasks.py", line 51, in run_task
    destination_path=None)
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 576, in load
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 492, in flush
    self.send_contents(contents)
  File "/usr/lib/python3/dist-packages/retrying.py", line 68, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/retrying.py", line 223, in call
    return attempt.get(self._wrap_exception)
  File "/usr/lib/python3/dist-packages/retrying.py", line 261, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 625, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 217, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 189, in send_contents
    self.storage.content_add(content_list)
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 21, in content_add
    return self.post('content/add', {'content': content})
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 58, in post
    data = encode_data(data)
  File "/usr/lib/python3/dist-packages/swh/core/serializers.py", line 18, in encode_data_client
    return msgpack_dumps(data)
  File "/usr/lib/python3/dist-packages/swh/core/serializers.py", line 136, in msgpack_dumps
    return msgpack.packb(data, use_bin_type=True, default=encode_types)
  File "/usr/lib/python3/dist-packages/msgpack/__init__.py", line 47, in packb
    return Packer(**kwargs).pack(o)
SystemError: error return without exception set"
  • "WorkerLostError('Worker exited prematurely: signal": 3, -> This error occurs when the worker is Killed In Action
  • "IntegrityError('duplicate key value violates uniqu": 1 -> This one looks like a problem at the orchestration level. As the orchestration was at the time still manual, it may be possible i send the same origin twice. It was then consumed by 2 different workers in a time so close that they have issued the same query. One successfully finished it, and the other got this error.
{'visit_date': 'Tue, 3 May 2016 17:16:32 +0200'}",
          "message": "[2017-10-02 03:45:06,157: ERROR/MainProcess] Task swh.loader.svn.tasks.MountAndLoadSvnRepositoryTsk[71c8c040-3e50-455a-a4ed-4c4d586f871e] raised unexpected: IntegrityError('duplicate key value violates unique constraint "origin_visit_pkey"\nDETAIL:  Key (origin, visit)=(65589852, 1) already exists.\nCONTEXT:  SQL function "swh_origin_visit_add" statement 1\n',)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 437, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/scheduler/task.py", line 35, in run
    raise e from None
  File "/usr/lib/python3/dist-packages/swh/scheduler/task.py", line 32, in run
    result = self.run_task(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/loader/svn/tasks.py", line 51, in run_task
    destination_path=None)
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 560, in load
    visit_date)
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 133, in origin_visit_add
    return self.post('origin/visit/add', {'origin': origin, 'ts': ts})
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 62, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 91, in _decode_response
    raise pickle.loads(decode_response(response))
psycopg2.IntegrityError: duplicate key value violates unique constraint "origin_visit_pkey"
DETAIL:  Key (origin, visit)=(65589852, 1) already exists.
CONTEXT:  SQL function "swh_origin_visit_add" statement 1
  • "Uneventful visit. Detail: ": 312, -> This is not an error, it means that we already processed the repository.
  • "Eventful partial visit. Detail: Property 'svn:exte": 110, -> This means that the repository was partially processed and stopped when it discovered the svn:external-id property. This is one is the known shortcoming of the svn loader.
  • "No space left on device": 13, -> Same as previously, but this got raised at a later time than usual.

The few remaining smell like problems but without stacktrace, i need to reproduce to dig further.
And of course, i don't have the associated origins.
Still, here is some more details:

  • "Eventful partial visit. Detail: too many values to unpack": 10,

    -> wrong data type to unpack
  • "Eventful partial visit. Detail: error return witho": 6,

    -> exception not being set
  • "Eventful partial visit. Detail: ": 3,

    -> bad logging here since we see nothing.
  • "Eventful partial visit. Detail: Hash tree computat": 2,

    -> hash tree divergence computation check failure.
  • "Eventful partial visit. Detail: 'utf-8' codec can'": 1,
  • "Eventful partial visit. Detail: need more than 1 value to unpack": 1,
  • "Can't open file '/tmp/tmp.*swh.loader.svn/" too many opened files: 30, -> This one smells of ulimit problems though
ardumont updated the task description. (Show Details)Oct 4 2017, 3:59 PM
ardumont added a comment.EditedOct 24 2017, 6:12 PM

The 'googlecode' and 'unknown' key entries are googlecode svn related.
Most are stored in the 'unknown' entry key because, we somehow don't have the task's input args (origin url + dump file), which is unfortunate (for rescheduling those).

snippet/ardumont $  ./kibana_fetch_logs.py --indexes "logstash-2017.10.07,logstash-2017.10.08,logstash-2017.10.09,logstash-2017.10.10,logstash-2017.10.11,logstash-2017.10.12,logstash-2017.10.13,logstash-2017.10.14,logstash-2017.10.15,logstash-2017.10.16,logstash-2017.10.17,logstash-2017.10.18,logstash-2017.10.19,logstash-2017.10.20" > svn-loader-error-logs-from-kibana.txt
snippet/ardumont $ cat svn-loader-error-logs-from-kibana.txt | ./group_by_exception.py --loader-type svn | jq .
  "googlecode": {
    "errors": {
      "ValueError('Failed to mount the svn dump for proje": 4                       # corrupted archives
    },
    "total": 4
  },
  "gitorious": {
    "errors": {},
    "total": 0
  },
  "unknown": {
    "errors": {
      "Uneventful visit. Detail: History of svn file:///t": 229,                           # altered history
      "StorageDBError(OperationalError('ERROR:  pgbouncer": 44,            # db connection issue 
      "Eventful partial visit. Detail: Property 'svn:exte": 38,                            # external id detected 
      "DatabaseError('query_wait_timeout\\nSSL connection ": 16,            # db connection issue
      "StorageDBError(OperationalError('ERROR:  query_wai": 14,             # db connection issue (timeout)
      "Eventful partial visit. Detail: too many values to": 10,                         # fixed in T812 (blank space in symlink)
      "Can't open file '/tmp/tmp.wvse04vr.swh.loader.svn/": 1,                   # T811 (too many opened files)
      "Can't open file '/tmp/tmp.l8j8wtlt.swh.loader.svn/": 1,                      # T811...

      "Can't open file '/tmp/tmp.bh54qppf.swh.loader.svn/": 1,                 # ...
      "Can't open file '/tmp/tmp.03tei_4o.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.2w9wqiaw.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.a6se0ip2.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.gdccaqba.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.4jfjt_5a.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.jf6etf2r.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.09pinr1r.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.4ytrej0y.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.rfxrcby3.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.jksh8u1q.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.l39g7of1.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.x67iiwy0.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.o9rr1bnt.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.99vvrsoy.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.pfhrb79j.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.sr8987te.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.z4p2k5z2.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.68fa8mkv.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.bg7uz_z8.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.ggbly1u3.swh.loader.svn/": 1,
      "Can't open file '/tmp/tmp.ir46500j.swh.loader.svn/": 1
    },
    "total": 375
  }
}
ardumont added a comment.EditedOct 26 2017, 9:58 AM

Heads up on this, i fixed some wrong behavior or bugs:

  • uneventful visit (with no more changes since last time) were rendered as 'partial' visits instead of 'full'.
  • space in symlink path broke the loading (T812)
  • too many opened file issue (T811)

I've packaged the latest loader-svn's code but cannot deploy it for now.
As soon as it's possible (workers are currently in migration), i'll reschedule all last visits whose status is not 'full'.

(Heading for the loader-git-disk and gitorious origins now)

Latest run from last week-end, some new bugs (new tasks are or will be opened for detailed analysis and fix):

"unknown": {
  "total": 184,
  "errors": {
    "History altered. Detail: History of svn file:///tm": 141,
    "Eventful partial visit. Detail: Property 'svn:exte": 33,           
    "Eventful partial visit. Detail: symlink: embedded ": 2,         # I don't know and i don't have the input so cannot reproduce yet...
    "*********\\nc*********************************\\n\\nc": 1,  # filename too long T838
    "File exists: b'/tmp/tmp.2ka37njs.swh.loader/bom-br": 1,  # T840
    "File exists: b'/tmp/tmp.qjfb6tfj.swh.loader/hodoph": 1,    # T840
    "File name too long: b'/tmp/tmp.l_arjv2y.swh.loader": 1,   # T839
    "Eventful partial visit. Detail: 246 != 259": 1,
    "ValueError('Failed to mount the svn dump for proje": 1,    # not an svn dump
    "File exists: b'/tmp/tmp.8eevawds.swh.loader/poorde": 1,  # T840
    "Eventful partial visit. Detail: Hash tree computat": 1           # i believe it's one log too many
  }
},

I close this in favor of T879

ardumont closed this task as Resolved.Dec 14 2017, 3:24 PM