
Inject Google Code Mercurial repositories
Started, Work in Progress, NormalPublic

Description

We have retrieved the hg repositories but have not yet ingested them.
This task covers the actual ingestion using our loader-mercurial (T329).

(Equivalent to the tasks for the git repositories, T673, and the svn ones, T617.)

Note:

  • As in T617, the origin date to use for injection is 'Tue, 3 May 2016 17:16:32 +0200'. We retrieved all googlecode repositories together (git, svn, hg).
  • The index of the currently retrieved Mercurial repositories is at uffizi:/srv/storage/space/mirrors/code.google.com/sources/INDEX-hg

Format: <origin_url> <path-to-hg-archive>

  • URLs are recreated with the https://<project-name>.googlecode.com/hg/ scheme (the resulting URL is what is stored in the index file mentioned above).
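
For illustration, a minimal sketch of reading that index format. This is not the loader's actual code: `IndexEntry`, `parse_index_line` and `googlecode_hg_url` are hypothetical names, and the sketch assumes the two fields are whitespace-separated and that the URL contains no spaces.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One line of the INDEX-hg file (hypothetical helper type)."""
    origin_url: str
    archive_path: str


def parse_index_line(line: str) -> IndexEntry:
    """Split '<origin_url> <path-to-hg-archive>' into its two fields."""
    # Split on the first whitespace run only, so a path containing
    # spaces stays in one piece (assumption: the URL has no spaces).
    origin_url, archive_path = line.strip().split(maxsplit=1)
    return IndexEntry(origin_url, archive_path)


def googlecode_hg_url(project_name: str) -> str:
    """Rebuild the origin URL from a project name, per the scheme above."""
    return f"https://{project_name}.googlecode.com/hg/"
```

The archive path in any example input is made up; only the URL scheme comes from the task description.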

Event Timeline

ardumont created this task. Feb 15 2017, 1:58 PM
ardumont renamed this task from Inject Google Code mercurial repositories to Inject Google Code Mercurial repositories.
ardumont updated the task description. Feb 15 2017, 2:02 PM
ardumont changed the status of subtask T329: hg / mercurial loader from Open to Work in Progress. Dec 20 2017, 11:42 AM
ardumont updated the task description. Dec 21 2017, 10:51 AM
ardumont added a subscriber: fiendish. Edited Feb 9 2018, 5:48 PM

This is now running on our swh-workers, scheduling running on saatchi:

cat /srv/storage/space/mirrors/code.google.com/sources/INDEX-hg | ./schedule_with_queue_length_check.py --queue-name mercurial --threshold 1000 --waiting-time 60 | tee -a scheduling-mercurial-googlecode

Around ~127k repositories to go (127048).
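
For readers without access to the snippets repository, here is a rough sketch of what a queue-length-throttled scheduler does. It is not the real schedule_with_queue_length_check.py: the function name and the `queue_length`, `send` and `sleep` callables are assumptions, injected so the sketch does not depend on any particular broker API.

```python
import time
from typing import Callable, Iterable


def schedule_with_threshold(
    lines: Iterable[str],
    queue_length: Callable[[], int],
    send: Callable[[str], None],
    threshold: int = 1000,
    waiting_time: float = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send one task per non-empty input line, pausing while the queue is full.

    Returns the number of tasks sent.
    """
    scheduled = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the index
        # Back off while the queue holds `threshold` messages or more.
        while queue_length() >= threshold:
            sleep(waiting_time)
        send(line)
        scheduled += 1
    return scheduled
```

Whether the real script checks the queue before every message or in batches is not stated here; the per-message check is just the simplest variant.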

@fiendish ;)

ardumont changed the task status from Open to Work in Progress. Feb 9 2018, 5:49 PM

Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.

Checking the error log messages, here is the breakdown of errors:

...snippets/ardumont $ ./kibana_fetch_logs.py > hg-loader-error-logs-from-kibana-20180209-20180212.txt
...snippets/ardumont $ cat hg-loader-error-logs-from-kibana-20180209-20180212.txt | ./group_by_exception.py --loader-type hg | jq .
{
  "gitorious": {
    "total": 0,
    "errors": {}
  },
  "googlecode": {
    "total": 381,
    "errors": {
      "FileNotFoundError(2, 'No such file or directory')": 359,
      "OSError(12, 'Cannot allocate memory')": 13,
      "IntegrityError('duplicate key value violates uniqu": 3,
      "PatoolError('error extracting /srv/storage/space/m": 2,
      "WorkerLostError('Worker exited prematurely: signal": 2,
      "CommandError(b'bundle', b'-t', b'none-v2', b'-a', ": 1,
      "PatoolError(\"error extracting /srv/storage/space/m": 1
    }
  },
  "unknown": {
    "total": 15,
    "errors": {
      "Loading failure, updating to `partial` status\nTrac": 15
    }
  }
}

Reproducibility:

It is still not enough to explain the gap though - 124899 + 396 = 125295, we are missing some origins.

Some errors speak for themselves (OSError, errors during extraction); others do not.
I'm currently digging into this and will open dedicated tasks where necessary.

So far, from afar:

"FileNotFoundError(2, 'No such file or directory')": 359,

Possibly a bad archive (corrupted or something similar).

IntegrityError('duplicate key value violates uniqu

Mmm, this one smells like T896 (wrong input in loader, results in origin clashes).

olasd added a subscriber: olasd. Feb 12 2018, 5:26 PM

FYI all loaded repositories point to an empty snapshot.

select * from origin_visit inner join origin on origin_visit.origin = origin.id where origin.type = 'hg' and snapshot_id != 16;

⇒ 0 lines returned

FYI all loaded repositories point to an empty snapshot.

Right! That was the next point ;)
Thanks.

FYI all loaded repositories point to an empty snapshot.

Right! That was the next point ;)

It turns out the code is smart about the visit date and, in effect, inhibited the loading of the repositories (since we are loading those origins with a fixed date in the past).
That's not what we want in this case. We want the visit to be in the past.
This commit (51b089d911c96eb5fac7605e8a653b64c6c15516) makes that behavior optional.

So, I'll take this as an opportunity to fix the newly encountered bugs, deploy, and reschedule everything.

ardumont added a comment. Edited Feb 13 2018, 2:15 PM

Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.

OK, I misread the db the first time around: I should have read 126678 origins in the db (I was missing a join on the origin_visit table).
The previously mentioned hole is filled.

127048 - 126678 = 370.
This is the same order of magnitude as the number of errors.

Actually, I now have more errors in the logs than the actual hole in the db...
Some of that can be explained by redundancy in the error messages.

ardumont changed the status of subtask T957: googlecode import: Check for origin clashes and fix if any from Open to Work in Progress. Feb 13 2018, 3:33 PM

Rescheduled!

ardumont changed the status of subtask T976: google import: Clean up wrong revisions from Open to Work in Progress. Feb 23 2018, 10:30 AM
ardumont added a comment. Edited Mar 14 2018, 2:17 PM

As in https://forge.softwareheritage.org/T879#16396, a limit of 2GiB on dump size was used to split the origins into two lists.
The current lists are stored at:

ardumont@uffizi:~% wc -l /srv/storage/space/mirrors/code.google.com/sources/INDEX-hg-dumps-with-size-{inferior,superior}-than-2gib.txt
  126981 /srv/storage/space/mirrors/code.google.com/sources/INDEX-hg-dumps-with-size-inferior-than-2gib.txt
      67 /srv/storage/space/mirrors/code.google.com/sources/INDEX-hg-dumps-with-size-superior-than-2gib.txt
  127048 total
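
A hedged sketch of how such a split could be produced from the index. This is not the command that was actually run: `split_by_size` is a hypothetical helper, the "dump size" is assumed to be the on-disk size of the archive, and a dump of exactly 2GiB is (arbitrarily) put in the "inferior" list.

```python
import os
from typing import Callable, Iterable, List, Tuple

TWO_GIB = 2 * 1024 ** 3  # 2 GiB in bytes


def split_by_size(
    lines: Iterable[str],
    limit: int = TWO_GIB,
    size_of: Callable[[str], int] = os.path.getsize,
) -> Tuple[List[str], List[str]]:
    """Partition index lines into (at most `limit`, larger than `limit`)."""
    smaller: List[str] = []
    larger: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Index format: '<origin_url> <path-to-hg-archive>'
        _url, path = line.split(maxsplit=1)
        if size_of(path) > limit:
            larger.append(line)
        else:
            smaller.append(line)
    return smaller, larger
```

Every input line ends up in exactly one of the two lists, which is why the two line counts above add up to the original 127048.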

And the number of origins is still the same.

Finally, rescheduled using swh-scheduler.
Heading towards T986.

Current status: the queue is empty.

No errors are reported in the kibana dashboard (logs).

And 126920 (out of 126981) have the status full:

softwareheritage=> select count(*) from origin o inner join origin_visit ov on o.id=ov.origin and ov.visit = (select max(visit) from origin_visit where origin=o.id) and o.type='hg' and ov.status = 'full';
 count
--------
 126920
(1 row)

softwareheritage=> select count(*) from origin o inner join origin_visit ov on o.id=ov.origin and ov.visit = (select max(visit) from origin_visit where origin=o.id) and o.type='hg' and ov.status <> 'full';
 count
-------
    56
(1 row)

What remains to investigate:

  • the 56 origins whose status is not full (partial or ongoing...)
  • the 5 missing origins (an unexplained hole of 5 origins)
  • why no errors are reported at all in the logs (or any logs at all, for that matter: with all filters removed, they seem to stop around the 7th of March 2018)
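
The numbers in this list can be cross-checked with a few lines of arithmetic. A small sketch using only the counts quoted in this comment (the variable names are mine):

```python
# Counts taken from the wc -l output and the two SQL queries in this comment.
scheduled = 126981        # origins in the "inferior than 2GiB" list
over_limit = 67           # origins in the "superior than 2GiB" list
full = 126920             # latest visit has status 'full'
not_full = 56             # latest visit has any other status

total_index = scheduled + over_limit   # should match the full INDEX-hg
visited = full + not_full              # origins with at least one visit
missing = scheduled - visited          # scheduled origins with no visit

print(total_index, visited, missing)
```

So the two lists still cover all 127048 index entries, and 5 scheduled origins have no visit at all.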

why no errors are reported at all in the logs (or any logs at all, for that matter: with all filters removed, they seem to stop around the 7th of March 2018)

Well, it's elasticsearch-related: prior to the 7th of March 2018 we used another index (named logstash-*), and now we use swh_workers-*.
Kibana had been properly set up for that change as far as the visualization UI is concerned.
The dashboards are the exception: they rely internally on a saved search, which was still bound to the old index.
After adapting the current dashboard with a new saved search using the right index, I now see the errors \m/.

$ cat ~/.config/swh/kibana/query.yml
indexes:
  - swh_workers-2018.03.*

size: 100
from: 0
_source:
  - message
  - swh_logging_args_args
  - swh_logging_args_exc
  - swh_logging_args_kwargs

query:
  bool:
    must:
    - match:
        systemd_unit:
          query: 'swh-worker@swh_loader_mercurial.service'
          type: phrase
    - term:
        priority: '3'
    # must_not:
    # - match:
    #     message:
    #       query: '[.*] consumer: Cannot connect to amqp.*'
    #       type: phrase
    # - match:
    #     message:
    #       query: '[.*] pidbox command error.*'
    #       type: phrase

sort:
- '@timestamp': 'asc'
$ ./kibana_fetch_logs.py > hg-loader-error-logs-from-kibana-201803.txt
$ cat hg-loader-error-logs-from-kibana-201803.txt| ./group_by_exception.py --loader-type hg | jq . 
{
  "gitorious": {
    "total": 0,
    "errors": {}
  },
  "googlecode": {
    "total": 5,
    "errors": {
      "'Worker exited prematurely: signal 9 (SIGKILL).',)": 5
    }
  },
  "unknown": {
    "total": 51,
    "errors": {
      "sion, preexec_fn)\nOSError:  Cannot allocate memory": 32,
      "impleBlob' object does not support item assignment": 3,
      "v)\nValueError: could not convert string to float: ": 3,
      "nction swh_revision_add() line 5 at SQL statement\n": 1,
      ">nginx/1.10.3</center>\\r\\n</body>\\r\\n</html>\\r\\n')": 1,
      "node.remove_tree_node_for_path(rest)\nKeyError: b''": 1,
      "ion aborted.', BrokenPipeError(32, 'Broken pipe'))": 1,
      "ist O. D. - Google Chrome 2014-02-20 17.07.29.png'": 1,
      "r: b'\\x90\\x90t`\\xf6\\x7fkJ@Z\\x86M-\\xf9BV\\xd3\\xae$D'": 1,
      "ing-gfd-source-archive.zip: File is not a zip file": 1,
      "-slides-source-archive.zip: File is not a zip file": 1,
      "or missing revlog for data/sword/mods.d/kjv.conf')": 1,
      "ezeus-source-archive.zip:  No space left on device": 1,
      "ce-ac-source-archive.zip:  No space left on device": 1,
      "redit-source-archive.zip:  No space left on device": 1,
      "ce: '/tmp/swh.loader.mercurial.u7i3awxx.dump-8805'": 1
    }
  }
}

Note:

  • Changed the group_by_exception.py script to group on the last 50 characters of the exception.
  • group_by_exception.py and kibana_fetch_logs.py are in the snippets repository.
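
To make the grouping rule concrete, here is a minimal sketch of grouping on the last 50 characters. It is not the actual group_by_exception.py (which also buckets by loader type); `group_by_exception_tail` is a hypothetical name and the input is assumed to be an iterable of raw exception strings.

```python
from collections import Counter
from typing import Dict, Iterable


def group_by_exception_tail(messages: Iterable[str], tail: int = 50) -> Dict[str, int]:
    """Count messages by their last `tail` characters, most frequent first."""
    # The end of a traceback usually carries the exception itself, while the
    # beginning varies with paths and arguments, so the tail groups better.
    counts = Counter(msg[-tail:] for msg in messages if msg)
    return dict(counts.most_common())
```

Messages shorter than 50 characters are used whole as their own key, and trailing newlines are kept as-is, which matches the raw-looking keys in the output above.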

Details:

The first pass was completed a while back.

Unfortunately, or fortunately, a new bug was discovered (T1156).

This is now fixed in the git repository.

But we still need to clean up the wrong releases and reinject them.
This is pending.

So this task remains a work in progress for now.