Ingest Google Code Mercurial repositories
Closed, Migrated

Assigned To

Authored By

	ardumont
	Feb 15 2017, 1:58 PM

Description

We have retrieved the hg repositories but have not yet ingested them.
This task is about the actual ingestion using our loader-mercurial (T329).

(Equivalent to the tasks for the git repositories, T673, and the svn ones, T617.)

Note:

As in T617, the origin date to use for injection is 'Tue, 3 May 2016 17:16:32 +0200'. We retrieved all googlecode repositories together (git, svn, hg).

The index of the currently retrieved Mercurial repositories is at uffizi:/srv/storage/space/mirrors/code.google.com/sources/INDEX-hg

Format: <origin_url> <path-to-hg-archive>

Origin URLs are recreated with the https://<project-name>.googlecode.com/hg/ scheme (they are stored as such in the index file mentioned above).
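
For illustration, a minimal Python sketch of reading that index format. It assumes the two fields are separated by whitespace (the task only gives `<origin_url> <path-to-hg-archive>`); the function name `parse_index` and the sample entry are hypothetical, not part of the actual tooling.

```python
from typing import Iterable, Iterator, Tuple


def parse_index(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (origin_url, archive_path) pairs from INDEX-hg style lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: whitespace-separated fields. Split once, so that the
        # URL comes first and everything after it is the archive path.
        origin_url, archive_path = line.split(None, 1)
        yield origin_url, archive_path


# Hypothetical sample entry; the project name and archive path are made up.
sample = [
    "https://example-project.googlecode.com/hg/ /tmp/example-project-source-archive.zip",
]
entries = list(parse_index(sample))
print(entries)
```

A line with a single field raises a ValueError here, which is probably what one wants for a malformed index entry.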

Related Objects

Status	Assigned	Task
		Unknown Object (Maniphest Task)
Migrated	gitlab-migration	T367 ingest Google Code repositories
Migrated	gitlab-migration	T682 Ingest Google Code Mercurial repositories
Migrated	gitlab-migration	T329 hg / mercurial loader
Migrated	gitlab-migration	T906 mercurial loader: Debian package
Migrated	gitlab-migration	T907 mercurial loader: Align mercurial loader with other loaders
Migrated	gitlab-migration	T908 mercurial loader: Define scheduler task(s)
Migrated	gitlab-migration	T909 mercurial loader: Define puppet manifest for actual deployment
Migrated	gitlab-migration	T964 2018-02-16 worker disk full postmortem
Migrated	gitlab-migration	T982 failing worker consumes remaining tasks without processing them
Migrated	gitlab-migration	T985 loader*: Make prepare method resilient to error and origin visit status compliant
Migrated	gitlab-migration	T955 googlecode import: hglib.error.CommandError during loading
Migrated	gitlab-migration	T956 googlecode import: Clean up visit wrongly targetting empty snapshot
Migrated	gitlab-migration	T957 googlecode import: Check for origin clashes and fix if any
Migrated	gitlab-migration	T965 googlecode import: Analyze and fix errors
Migrated	gitlab-migration	T970 mercurial loader: What to do in case of .hgtags?
Migrated	gitlab-migration	T976 google import: Clean up wrong revisions
Migrated	gitlab-migration	T1156 Fix release targets of already loaded mercurial type origins
Migrated	gitlab-migration	T1158 hg loader: Clean up wrong snapshots/releases during hg loading of googlecode
Migrated	gitlab-migration	T1159 hg loader: Schedule oneshot tasks for googlecode origin ingestion
Migrated	gitlab-migration	T1155 Mercurial loader: release target is invalid

Event Timeline

ardumont renamed this task from Inject Google Code mercurial repositories to Inject Google Code Mercurial repositories. Feb 15 2017, 1:58 PM

ardumont created this task.

ardumont added parent tasks: T329: hg / mercurial loader, T673: ingest Google Code Git repositories.

ardumont updated the task description. Feb 15 2017, 2:02 PM

zack removed parent tasks: T673: ingest Google Code Git repositories, T329: hg / mercurial loader. Feb 15 2017, 4:03 PM

zack added a parent task: T367: ingest Google Code repositories. Feb 15 2017, 4:05 PM

zack added a subtask: T329: hg / mercurial loader. Feb 15 2017, 4:15 PM

zack added a project: Archive content. Apr 7 2017, 11:06 AM

ardumont changed the status of subtask T329: hg / mercurial loader from Open to Work in Progress. Dec 20 2017, 11:42 AM

ardumont updated the task description. Dec 21 2017, 10:51 AM

ardumont mentioned this in rDSNIP8bea10f099a3: Update filtering tool to integrate the nature in the built url. Dec 21 2017, 10:53 AM

ardumont mentioned this in rDSNIP4b22400565b7: Prepare googlecode mercurial origins scheduling.

ardumont mentioned this in rSPSITE19b08e79b6ed: data/defaults: Deploy mercurial loader. Feb 9 2018, 5:38 PM

ardumont mentioned this in rSPSITEf581155266cc: data/defaults: Scheduler: Reference the loader-mercurial dependency.

This is now running on our swh-workers, with the scheduling running from saatchi:

cat /srv/storage/space/mirrors/code.google.com/sources/INDEX-hg | ./schedule_with_queue_length_check.py --queue-name mercurial --threshold 1000 --waiting-time 60 | tee -a scheduling-mercurial-googlecode

About 127k repositories to go (127048).
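
The command above feeds index lines to a script that only schedules while the queue is short enough. A rough Python sketch of that throttling idea follows. It is not the actual `schedule_with_queue_length_check.py`: the function name, the `queue_length` and `send_task` callables, and the `>=` comparison are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable


def schedule_with_threshold(
    entries: Iterable[str],
    queue_length: Callable[[], int],
    send_task: Callable[[str], None],
    threshold: int = 1000,
    waiting_time: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send one task per entry, pausing while the queue is at or above threshold."""
    scheduled = 0
    for entry in entries:
        # Assumption: wait as long as the queue holds `threshold` messages or more.
        while queue_length() >= threshold:
            sleep(waiting_time)
        send_task(entry)
        scheduled += 1
    return scheduled


# In-memory stand-ins for the message queue and the workers (hypothetical).
queue: list = []


def fake_sleep(_seconds: float) -> None:
    queue.clear()  # pretend the workers drained the queue while we waited


count = schedule_with_threshold(
    ["a", "b", "c"],
    queue_length=lambda: len(queue),
    send_task=queue.append,
    threshold=2,
    waiting_time=0,
    sleep=fake_sleep,
)
print(count, queue)
```

Passing the queue accessors as callables keeps the sketch independent of any particular broker client.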

@fiendish ;)

ardumont changed the task status from Open to Work in Progress. Feb 9 2018, 5:49 PM

rDSNIP26ea29b2d2abf9c931ba5efcf0f49d4194254e79

yay

ardumont mentioned this in rDSNIP1cb7d4097710: kibana_fetch_logs: Add configuration abilities to fetching tool. Feb 12 2018, 3:27 PM

ardumont mentioned this in rDSNIPf2dbd1495fb4: group_by_exception: Add mercurial loader type. Feb 12 2018, 3:47 PM

Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.

Checking the error log messages, here is the error breakdown:

...snippets/ardumont $ ./kibana_fetch_logs.py > hg-loader-error-logs-from-kibana-20180209-20180212.txt
...snippets/ardumont $ cat hg-loader-error-logs-from-kibana-20180209-20180212.txt | ./group_by_exception.py --loader-type hg | jq .
{
  "gitorious": {
    "total": 0,
    "errors": {}
  },
  "googlecode": {
    "total": 381,
    "errors": {
      "FileNotFoundError(2, 'No such file or directory')": 359,
      "OSError(12, 'Cannot allocate memory')": 13,
      "IntegrityError('duplicate key value violates uniqu": 3,
      "PatoolError('error extracting /srv/storage/space/m": 2,
      "WorkerLostError('Worker exited prematurely: signal": 2,
      "CommandError(b'bundle', b'-t', b'none-v2', b'-a', ": 1,
      "PatoolError(\"error extracting /srv/storage/space/m": 1
    }
  },
  "unknown": {
    "total": 15,
    "errors": {
      "Loading failure, updating to `partial` status\nTrac": 15
    }
  }
}

Reproducibility:

log filtering tool
with the P221 configuration sample
error grouping tool

That is still not enough to explain the gap, though: 124899 + 396 = 125295, which leaves 127048 - 125295 = 1753 origins unaccounted for.

Some errors speak for themselves (OSError, errors during extraction); others do not.
I'm currently digging into this and will open dedicated tasks when deemed necessary.

So far, from afar:

"FileNotFoundError(2, 'No such file or directory')": 359,

Possibly a bad archive (corrupted or something similar).

IntegrityError('duplicate key value violates uniqu

Mmm, this one smells like T896 (wrong input in loader, results in origin clashes).

FYI all loaded repositories point to an empty snapshot.

select * from origin_visit inner join origin on origin_visit.origin = origin.id where origin.type = 'hg' and snapshot_id != 16;

⇒ 0 lines returned

FYI all loaded repositories point to an empty snapshot.

Right! That was the next point ;)
Thanks.

FYI all loaded repositories point to an empty snapshot.

Right! That was the next point ;)

It turns out the code is smart about the visit date and, in effect, inhibited the loading of the repositories (since we are loading those origins with a fixed date in the past).
That's not what we want in this case. We want the visit to be in the past.
This commit (51b089d911c96eb5fac7605e8a653b64c6c15516) makes that behavior optional.

So I'll take this as an opportunity to fix the newly encountered bugs, deploy, and reschedule everything.

ardumont created subtask T955: googlecode import: hglib.error.CommandError during loading. Feb 13 2018, 12:11 PM

ardumont created subtask T956: googlecode import: Clean up visit wrongly targetting empty snapshot. Feb 13 2018, 12:19 PM

ardumont created subtask T957: googlecode import: Check for origin clashes and fix if any. Feb 13 2018, 12:26 PM

ardumont mentioned this in T329: hg / mercurial loader. Feb 13 2018, 12:38 PM

Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.

OK, I misread the db the first time around: I should have read 126678 origins in the db (I was missing a join on the origin_visit table).
The previously mentioned hole is filled.

127048 - 126678 = 370.
This is the same order of magnitude as the number of errors.

Actually, I now have more errors in the logs than the actual hole in the db...
Some can be explained by redundancy in the error messages.

ardumont closed subtask T956: googlecode import: Clean up visit wrongly targetting empty snapshot as Resolved. Feb 13 2018, 2:26 PM

ardumont changed the status of subtask T957: googlecode import: Check for origin clashes and fix if any from Open to Work in Progress. Feb 13 2018, 3:33 PM

ardumont closed subtask T957: googlecode import: Check for origin clashes and fix if any as Resolved. Feb 14 2018, 10:16 AM

ardumont closed subtask T955: googlecode import: hglib.error.CommandError during loading as Wontfix. Feb 14 2018, 10:23 AM

ardumont mentioned this in rSPSITEa5b9070f4d52: data/defaults: Don't try to be smart about visit_date just yet. Feb 14 2018, 10:41 AM

Rescheduled!

ardumont created subtask T965: googlecode import: Analyze and fix errors. Feb 16 2018, 2:13 PM

ardumont created subtask T976: google import: Clean up wrong revisions. Feb 20 2018, 12:57 PM

ardumont closed subtask T965: googlecode import: Analyze and fix errors as Resolved. Feb 20 2018, 4:49 PM

ardumont changed the status of subtask T976: google import: Clean up wrong revisions from Open to Work in Progress. Feb 23 2018, 10:30 AM

ardumont closed subtask T976: google import: Clean up wrong revisions as Resolved. Feb 24 2018, 5:32 PM

As in https://forge.softwareheritage.org/T879#16396, a limit of 2 GiB on dump size was used to separate origins.
The current lists are stored at:

ardumont@uffizi:~% wc -l /srv/storage/space/mirrors/code.google.com/sources/INDEX-hg-dumps-with-size-{inferior,superior}-than-2gib.txt
  126981 /srv/storage/space/mirrors/code.google.com/sources/INDEX-hg-dumps-with-size-inferior-than-2gib.txt
      67 /srv/storage/space/mirrors/code.google.com/sources/INDEX-hg-dumps-with-size-superior-than-2gib.txt
  127048 total

And the number of origins is still the same.
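
As an illustration of the split above, here is a small Python sketch that partitions index entries around a 2 GiB limit. It is a sketch under assumptions, not the script that produced the two files: the function name is hypothetical, the size lookup is a pluggable callable (defaulting to the on-disk file size), and sending an entry of exactly 2 GiB to the "superior" list is a guess.

```python
import os
from typing import Callable, Iterable, List, Tuple

TWO_GIB = 2 * 1024 ** 3  # 2 GiB expressed in bytes

Entry = Tuple[str, str]  # (origin_url, archive_path)


def split_by_size(
    entries: Iterable[Entry],
    limit: int = TWO_GIB,
    get_size: Callable[[str], int] = os.path.getsize,
) -> Tuple[List[Entry], List[Entry]]:
    """Return (entries below the limit, entries at or above the limit)."""
    smaller: List[Entry] = []
    larger: List[Entry] = []
    for origin_url, archive_path in entries:
        if get_size(archive_path) < limit:
            smaller.append((origin_url, archive_path))
        else:
            larger.append((origin_url, archive_path))
    return smaller, larger


# Made-up sizes standing in for real files on disk.
fake_sizes = {"/a": 10, "/b": TWO_GIB + 1}
small, large = split_by_size(
    [("u1", "/a"), ("u2", "/b")], get_size=fake_sizes.__getitem__
)
print(len(small), len(large))
```

Since every entry lands in exactly one list, the two list lengths always add up to the input size, which matches the unchanged total of 127048 reported above.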

Finally, rescheduled using swh-scheduler.
Heading towards T986.

Current status: the queue is empty.

No errors are reported in the Kibana dashboard (logs).

And 126920 (out of 126981) have their status set to full:

softwareheritage=> select count(*) from origin o inner join origin_visit ov on o.id=ov.origin and ov.visit = (select max(visit) from origin_visit where origin=o.id) and o.type='hg' and ov.status = 'full';
 count
--------
 126920
(1 row)

softwareheritage=> select count(*) from origin o inner join origin_visit ov on o.id=ov.origin and ov.visit = (select max(visit) from origin_visit where origin=o.id) and o.type='hg' and ov.status <> 'full';
 count
-------
    56
(1 row)
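
The two queries above keep, for each hg origin, only the visit with the highest visit number, then count by status. A minimal in-memory Python sketch of that selection, with made-up rows and a hypothetical function name (it does not touch the swh database):

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Visit = Tuple[int, int, str]  # (origin_id, visit_number, status)


def latest_visit_status_counts(visits: Iterable[Visit]) -> Counter:
    """Count origins by the status of their latest (highest-numbered) visit."""
    latest: Dict[int, Tuple[int, str]] = {}
    for origin, visit, status in visits:
        # Keep only the visit with the largest visit number per origin.
        if origin not in latest or visit > latest[origin][0]:
            latest[origin] = (visit, status)
    return Counter(status for _, status in latest.values())


# Made-up rows: origin 1 was visited twice, only its second visit counts.
rows = [
    (1, 1, "partial"),
    (1, 2, "full"),
    (2, 1, "ongoing"),
    (3, 1, "full"),
]
counts = latest_visit_status_counts(rows)
print(counts)
```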

Remaining to investigate:

the 56 origins whose status is not full (partial or ongoing...)
the 5 missing origins (126981 - 126920 - 56 = 5, an inconsistent hole)
why no errors are reported at all in the logs (or no logs at all, for that matter: with all filters removed, they seem to stop around 7 March 2018)

why no errors are reported at all in the logs (or no logs at all, for that matter: with all filters removed, they seem to stop around 7 March 2018)

Well, it's Elasticsearch-related: prior to 7 March 2018 we used another index (named logstash-*; we now use swh_workers-*).
Kibana was properly set up for that change as far as the visualization UI is concerned.
The dashboards are the exception: under the hood they use a saved search, which is bound to the old index.
After adapting the current dashboard to use a new saved search on the right index, I now see the errors \m/.

$ cat ~/.config/swh/kibana/query.yml
indexes:
  - swh_workers-2018.03.*

size: 100
from: 0
_source:
  - message
  - swh_logging_args_args
  - swh_logging_args_exc
  - swh_logging_args_kwargs

query:
  bool:
    must:
    - match:
        systemd_unit:
          query: 'swh-worker@swh_loader_mercurial.service'
          type: phrase
    - term:
        priority: '3'
    # must_not:
    # - match:
    #     message:
    #       query: '[.*] consumer: Cannot connect to amqp.*'
    #       type: phrase
    # - match:
    #     message:
    #       query: '[.*] pidbox command error.*'
    #       type: phrase

sort:
- '@timestamp': 'asc'
$ ./kibana_fetch_logs.py > hg-loader-error-logs-from-kibana-201803.txt
$ cat hg-loader-error-logs-from-kibana-201803.txt| ./group_by_exception.py --loader-type hg | jq . 
{
  "gitorious": {
    "total": 0,
    "errors": {}
  },
  "googlecode": {
    "total": 5,
    "errors": {
      "'Worker exited prematurely: signal 9 (SIGKILL).',)": 5
    }
  },
  "unknown": {
    "total": 51,
    "errors": {
      "sion, preexec_fn)\nOSError:  Cannot allocate memory": 32,
      "impleBlob' object does not support item assignment": 3,
      "v)\nValueError: could not convert string to float: ": 3,
      "nction swh_revision_add() line 5 at SQL statement\n": 1,
      ">nginx/1.10.3</center>\\r\\n</body>\\r\\n</html>\\r\\n')": 1,
      "node.remove_tree_node_for_path(rest)\nKeyError: b''": 1,
      "ion aborted.', BrokenPipeError(32, 'Broken pipe'))": 1,
      "ist O. D. - Google Chrome 2014-02-20 17.07.29.png'": 1,
      "r: b'\\x90\\x90t`\\xf6\\x7fkJ@Z\\x86M-\\xf9BV\\xd3\\xae$D'": 1,
      "ing-gfd-source-archive.zip: File is not a zip file": 1,
      "-slides-source-archive.zip: File is not a zip file": 1,
      "or missing revlog for data/sword/mods.d/kjv.conf')": 1,
      "ezeus-source-archive.zip:  No space left on device": 1,
      "ce-ac-source-archive.zip:  No space left on device": 1,
      "redit-source-archive.zip:  No space left on device": 1,
      "ce: '/tmp/swh.loader.mercurial.u7i3awxx.dump-8805'": 1
    }
  }
}

Note:

Changed the group_by_exception.py script to group on the last 50 characters of the exception.
group_by_exception.py and kibana_fetch_logs.py are in the snippets repository.
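
To make the grouping rule concrete, here is a short Python sketch that counts messages by their last 50 characters. It only illustrates the idea described in the note; the function name and sample messages are invented, and the real group_by_exception.py also separates results by origin kind (gitorious, googlecode, unknown), which is left out here.

```python
from collections import Counter
from typing import Iterable


def group_by_exception_tail(messages: Iterable[str], width: int = 50) -> Counter:
    """Count messages keyed by their last `width` characters."""
    # Messages shorter than `width` are used whole as their own key.
    return Counter(message[-width:] for message in messages)


# Invented messages: two share the same ending, one is shorter than 50 chars.
common_tail = "-" * 20 + " OSError: Cannot allocate memory"
messages = [
    "worker-1 " + common_tail,
    "worker-2 " + common_tail,
    "ValueError: could not convert string to float: ",
]
groups = group_by_exception_tail(messages)
print(groups)
```

Keying on the end of the message groups tracebacks that differ only in their leading frames, at the cost of keys that start mid-word, as in the output above.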

Details:

zack edited projects, added Archive coverage; removed Archive content. Jun 19 2018, 3:30 PM

ardumont closed subtask T329: hg / mercurial loader as Resolved. Aug 3 2018, 3:03 PM

ardumont added a subtask: T1156: Fix release targets of already loaded mercurial type origins.

The first pass was completed a while back.

Unfortunately, or fortunately, a new bug was discovered (T1156).

This is now fixed in the git repository.

But we still need to clean up the wrong releases and reinject them.
This is pending.

So this task remains a work-in-progress for now.

ardumont removed ardumont as the assignee of this task. Jul 3 2019, 3:26 PM

zack renamed this task from Inject Google Code Mercurial repositories to Ingest Google Code Mercurial repositories. May 19 2020, 9:56 AM

zack mentioned this in T2793: add notable past events to the archive changelog. Nov 25 2020, 1:58 PM

zack closed this task as Resolved. Dec 10 2020, 10:52 AM

zack claimed this task.

gitlab-migration changed the status of subtask T329: hg / mercurial loader from Resolved to Migrated. Jan 8 2023, 4:18 PM

This task has been migrated to GitLab.

gitlab-migration closed subtask T1156: Fix release targets of already loaded mercurial type origins as Migrated. Jan 8 2023, 4:59 PM

gitlab-migration changed the status of subtask T955: googlecode import: hglib.error.CommandError during loading from Wontfix to Migrated. Jan 8 2023, 9:57 PM

gitlab-migration changed the status of subtask T956: googlecode import: Clean up visit wrongly targetting empty snapshot from Resolved to Migrated.

gitlab-migration changed the status of subtask T957: googlecode import: Check for origin clashes and fix if any from Resolved to Migrated.

gitlab-migration changed the status of subtask T965: googlecode import: Analyze and fix errors from Resolved to Migrated.

gitlab-migration changed the status of subtask T976: google import: Clean up wrong revisions from Resolved to Migrated.
