
Feb 21 2018

ardumont updated subscribers of T329: hg / mercurial loader.

I think 6e12c90b160ad3277a1edea27a05f9adea1bc92f may be a bad idea. Have you tested how much RAM it takes to hold the whole dirs dict in memory on a very large repo like mozilla-unified?

Feb 21 2018, 11:40 AM · Mercurial loader
ardumont added a comment to T964: 2018-02-16 worker disk full postmortem.

If cache files are sticking around, then of course the code should make sure that they go away when done

Feb 21 2018, 11:04 AM · Mercurial loader
fiendish added a comment to T964: 2018-02-16 worker disk full postmortem.

If cache files are sticking around, then of course the code should make sure that they go away when done or aborted. But I think that a few G used during processing of extremely large repos should be acceptable. :/

Feb 21 2018, 5:18 AM · Mercurial loader
fiendish added a comment to T329: hg / mercurial loader.

I think 6e12c90b160ad3277a1edea27a05f9adea1bc92f may be a bad idea. Have you tested how much RAM it takes to hold the whole dirs dict in memory on a very large repo like mozilla-unified?

Feb 21 2018, 5:08 AM · Mercurial loader
fiendish added a comment to T970: mercurial loader: What to do in case of .hgtags?.

I agree with taking tags from both sides and discarding all lines that don't fit the pattern.
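A minimal sketch of that filtering, not the loader's actual code. It assumes each valid .hgtags line is a 40-character hex node id, a space, and a tag name, and that "both sides" means the two halves of a conflicted file, whose marker lines simply fail the pattern. The name `parse_hgtags` is hypothetical.

```python
import re

# Assumed line pattern: 40 hex characters, one space, then a non-empty tag name.
HGTAGS_LINE = re.compile(r"^(?P<node>[0-9a-f]{40}) (?P<name>.+)$")


def parse_hgtags(text):
    """Return ({tag_name: node}, [rejected lines]) for the content of one .hgtags file."""
    tags, rejected = {}, []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no information
        match = HGTAGS_LINE.match(line)
        if match is None:
            rejected.append(line)  # a real loader would log a warning and continue
            continue
        # Later lines override earlier ones for the same tag name.
        tags[match.group("name").strip()] = match.group("node")
    return tags, rejected
```

Returning the rejected lines alongside the tags lets the caller warn about them without aborting the load.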

Feb 21 2018, 4:28 AM · Archive content, Mercurial loader

Feb 20 2018

ardumont added a comment to T329: hg / mercurial loader.

As discussed on IRC a short while ago (just leaving this as a note here), seeing 2 caches is normal and expected, since one is spawned inside the reader and one in the loader. Will have to also pass that argument to the reader instance.

Feb 20 2018, 11:00 PM · Mercurial loader
fiendish added a comment to T329: hg / mercurial loader.

As discussed on IRC a short while ago (just leaving this as a note here), seeing 2 caches is normal and expected, since one is spawned inside the reader and one in the loader. Will have to also pass that argument to the reader instance.

Feb 20 2018, 10:16 PM · Mercurial loader
ardumont closed T965: googlecode import: Analyze and fix errors as Resolved.
Feb 20 2018, 4:49 PM · Archive content, Mercurial loader
ardumont closed T965: googlecode import: Analyze and fix errors, a subtask of T682: Ingest Google Code Mercurial repositories, as Resolved.
Feb 20 2018, 4:49 PM · Archive coverage, Mercurial loader
ardumont added a comment to T965: googlecode import: Analyze and fix errors.

Latest version with fixes deployed.
Actions on T964 and cleanup on T976 still remain before scheduling another run.

Feb 20 2018, 4:49 PM · Archive content, Mercurial loader
ardumont added a comment to T976: google import: Clean up wrong revisions.

postgres@prado:~$ pg_dump --format tar --table revision_history --table revision softwareheritage | gzip -c - > /srv/remote-backups/postgres/T970/revision-revision-history.tar.gz

Feb 20 2018, 4:40 PM · Archive content, Mercurial loader
ardumont updated the task description for T976: google import: Clean up wrong revisions.
Feb 20 2018, 4:25 PM · Archive content, Mercurial loader
ardumont added a comment to T976: google import: Clean up wrong revisions.

Running a full backup of the table for the handful of revisions concerned here is a bit overkill! (better be safe than sorry and all that, but still...)

Feb 20 2018, 4:23 PM · Archive content, Mercurial loader
olasd added a comment to T976: google import: Clean up wrong revisions.

In any case, we need to make a backup dump prior to touching those tables!

Backup running on prado:

postgres@prado:~$ pg_dump --format tar --table revision_history --table revision softwareheritage | gzip -c - > /srv/remote-backups/postgres/T970/revision-revision-history.tar.gz
Feb 20 2018, 4:14 PM · Archive content, Mercurial loader
ardumont added a comment to T976: google import: Clean up wrong revisions.

In any case, we need to make a backup dump prior to touching those tables!

Feb 20 2018, 3:45 PM · Archive content, Mercurial loader
ardumont added a comment to T976: google import: Clean up wrong revisions.
  1. We can 'simply' delete the revision of type 'hg' as no other mercurial revision exists as of today.
Feb 20 2018, 2:43 PM · Archive content, Mercurial loader
ardumont added a comment to T964: 2018-02-16 worker disk full postmortem.

In the meantime, I'll reschedule, filtering out the huge ones (= archive size > 200 MiB).

Feb 20 2018, 2:11 PM · Mercurial loader
ardumont renamed T976: google import: Clean up wrong revisions from google import: Clean up wrong revision to google import: Clean up wrong revisions.
Feb 20 2018, 12:58 PM · Archive content, Mercurial loader
ardumont created T976: google import: Clean up wrong revisions.
Feb 20 2018, 12:57 PM · Archive content, Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

For information, as mentioned on IRC, I see strange behavior regarding the disk cache (P228 for a sample).

Feb 20 2018, 12:09 PM · Mercurial loader
ardumont edited P228 sqlitedict (dependency) spawns 2 disk cache whatever the configuration....
Feb 20 2018, 12:05 PM · Mercurial loader

Feb 19 2018

ardumont edited P228 sqlitedict (dependency) spawns 2 disk cache whatever the configuration....
Feb 19 2018, 6:12 PM · Mercurial loader
ardumont created P228 sqlitedict (dependency) spawns 2 disk cache whatever the configuration....
Feb 19 2018, 6:02 PM · Mercurial loader

Feb 17 2018

ardumont added a comment to T964: 2018-02-16 worker disk full postmortem.

In the meantime, I'll reschedule, filtering out the huge ones (= archive size > 200 MiB).
I'll keep a listing of those not rescheduled.

Feb 17 2018, 1:53 PM · Mercurial loader

Feb 16 2018

ardumont added a comment to T964: 2018-02-16 worker disk full postmortem.

Now, to keep the OOM from happening too late and killing other services as well, I thought of setting a quota on memory usage (systemd permits this).

Feb 16 2018, 7:32 PM · Mercurial loader
ardumont added a comment to T964: 2018-02-16 worker disk full postmortem.

So, I've now made the deposit/svn/mercurial loaders use private tmp mounting.
Next time this happens, we should only need to restart or stop the service.
That will clean up the mess left behind by this situation.

Feb 16 2018, 7:29 PM · Mercurial loader
ardumont closed T970: mercurial loader: What to do in case of .hgtags? as Resolved by committing rDLDHGe0b48c6c6e9a: bundle20_loader: Warn about wrong pattern in tags & continue loading.
Feb 16 2018, 4:00 PM · Archive content, Mercurial loader
ardumont closed T970: mercurial loader: What to do in case of .hgtags?, a subtask of T965: googlecode import: Analyze and fix errors, as Resolved.
Feb 16 2018, 4:00 PM · Archive content, Mercurial loader
ardumont added a comment to T970: mercurial loader: What to do in case of .hgtags?.

I've chosen 3. to comply with the doc's suggestion.
As usual nothing is set in stone.

Feb 16 2018, 3:59 PM · Archive content, Mercurial loader
ardumont updated the task description for T970: mercurial loader: What to do in case of .hgtags?.
Feb 16 2018, 3:58 PM · Archive content, Mercurial loader
ardumont created T970: mercurial loader: What to do in case of .hgtags?.
Feb 16 2018, 3:43 PM · Archive content, Mercurial loader
ardumont added a comment to T965: googlecode import: Analyze and fix errors.

Well, sure, if the .hgtags is corrupted...

Feb 16 2018, 3:26 PM · Archive content, Mercurial loader
ardumont added a comment to T965: googlecode import: Analyze and fix errors.

Last relevant error found in crossing multiple streams... (noooooo).

Feb 16 2018, 3:19 PM · Archive content, Mercurial loader
ardumont added a comment to T965: googlecode import: Analyze and fix errors.

I have a problem in the log fetcher (that's why it's said to be incomplete in the task description).

Feb 16 2018, 3:11 PM · Archive content, Mercurial loader
ardumont added a comment to T965: googlecode import: Analyze and fix errors.

I have a problem in the log fetcher (that's why it's said to be incomplete in the task description).

Feb 16 2018, 2:29 PM · Archive content, Mercurial loader
ardumont added a comment to T965: googlecode import: Analyze and fix errors.
"PatoolError('error extracting /srv/storage/space/m": 19,
Feb 16 2018, 2:26 PM · Archive content, Mercurial loader
ardumont created T965: googlecode import: Analyze and fix errors.
Feb 16 2018, 2:13 PM · Archive content, Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

Still, we don't want the closed branches (2.).

Feb 16 2018, 12:22 PM · Mercurial loader
ardumont added a comment to T964: 2018-02-16 worker disk full postmortem.

cannot process Mercurial data anymore.

Feb 16 2018, 11:44 AM · Mercurial loader
ftigeot updated the task description for T964: 2018-02-16 worker disk full postmortem.
Feb 16 2018, 11:25 AM · Mercurial loader
ftigeot created T964: 2018-02-16 worker disk full postmortem.
Feb 16 2018, 11:24 AM · Mercurial loader

Feb 15 2018

ardumont added a comment to T682: Ingest Google Code Mercurial repositories.

Rescheduled!

Feb 15 2018, 6:16 PM · Archive coverage, Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.
  1. Also, with the reduce_effort configuration flag flipped on, and given enough time between visits, we would get divergent snapshots even though nothing changed in the repository, which is wrong. The reason is that we would not pass over all revisions each time, so we could not parse the named branches property (part of the optional 'extra' field in a changeset/~commit).
Feb 15 2018, 3:55 PM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

Now, there is also another form of branching in mercurial, called bookmarks.
This is in upstream mercurial now, and bookmarks can be pushed to and pulled from a repository, so that sounded like something we want.
They are now supported in the bundle20_loader.

Feb 15 2018, 3:09 PM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

Well, that is only for the named branch case, which is not a convention.
So instead, we are leveraging hglib's heads function to determine the branches' names and the targeted revisions needed.

Feb 15 2018, 3:09 PM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

Deployed and tested on only one repository and already, for something that has some data, we end up with an empty snapshot where I expected a non-empty one (reproduced locally).

Feb 15 2018, 10:31 AM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

For example, for the revision, we currently compute the revision id (à la swh), but the parents' ids are mercurial's...

Feb 15 2018, 10:29 AM · Mercurial loader

Feb 14 2018

ardumont added a comment to T329: hg / mercurial loader.

Now, that's not all!

Feb 14 2018, 6:34 PM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

According to the code, that information is stored in an optional 'extra' field. So it's possible that nothing is referenced...

Feb 14 2018, 6:33 PM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

We'll possibly have new edge cases to fix after that ;)

Feb 14 2018, 11:48 AM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

I won't make it a blocker step to reschedule the googlecode origins though.
I'd like to see what actually happens when the mercurial loader runs ;)

Feb 14 2018, 10:32 AM · Mercurial loader
ardumont closed T955: googlecode import: hglib.error.CommandError during loading as Wontfix.

I'm closing this, as another mirror exists that we have already browsed multiple times.

Feb 14 2018, 10:23 AM · Origin-GoogleCode, Archive content, Mercurial loader
ardumont closed T955: googlecode import: hglib.error.CommandError during loading, a subtask of T682: Ingest Google Code Mercurial repositories, as Wontfix.
Feb 14 2018, 10:23 AM · Archive coverage, Mercurial loader
ardumont closed T957: googlecode import: Check for origin clashes and fix if any as Resolved.

So, back to updating the index and the origins for those.

Feb 14 2018, 10:16 AM · Archive content, Mercurial loader
ardumont closed T957: googlecode import: Check for origin clashes and fix if any, a subtask of T682: Ingest Google Code Mercurial repositories, as Resolved.
Feb 14 2018, 10:16 AM · Archive coverage, Mercurial loader
ardumont added a comment to T957: googlecode import: Check for origin clashes and fix if any.

It remains to update the existing origins (in db) with the right URLs, but not today.

Feb 14 2018, 10:06 AM · Archive content, Mercurial loader
fiendish added a comment to T329: hg / mercurial loader.

Does it make sense to open that in the loader's configuration property?

Feb 14 2018, 4:10 AM · Mercurial loader

Feb 13 2018

ardumont added a comment to T957: googlecode import: Check for origin clashes and fix if any.

Listing of the googlecode dumps recomputed with the right origin urls (in the actual INDEX-hg, old one is referenced as INDEX-hg.deprecated).

Feb 13 2018, 7:30 PM · Archive content, Mercurial loader
ardumont updated the task description for T957: googlecode import: Check for origin clashes and fix if any.
Feb 13 2018, 6:46 PM · Archive content, Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

This makes no sense. Listed repos are tiny.

Feb 13 2018, 6:31 PM · Mercurial loader
fiendish added a comment to T329: hg / mercurial loader.

The bundle loader is tunable to use less RAM and therefore more disk for its live caching (though I need to revisit the counter to make the tuning argument less arbitrary and more representative of real bytes used, because it currently ignores overhead and Python data has a lot of overhead).
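As an illustration of that trade-off, here is a minimal sketch of a cache with a RAM budget that spills to disk. It is not the loader's actual cache: the class name `SpillingCache` and the `max_ram_bytes` parameter are hypothetical, and SQLite is only a stand-in for the on-disk store. The counter uses `sys.getsizeof`, so it includes per-object overhead rather than payload length alone.

```python
import os
import sqlite3
import sys
import tempfile


class SpillingCache:
    """Keep bytes values in RAM up to max_ram_bytes; store the rest in SQLite on disk."""

    def __init__(self, max_ram_bytes):
        self.max_ram_bytes = max_ram_bytes  # hypothetical tuning knob
        self.ram = {}
        self.ram_bytes = 0
        fd, self.path = tempfile.mkstemp(suffix=".sqlite")
        os.close(fd)
        self.db = sqlite3.connect(self.path)
        self.db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")

    def __setitem__(self, key, value):
        # Drop any previous RAM copy so the counter stays accurate.
        old = self.ram.pop(key, None)
        if old is not None:
            self.ram_bytes -= sys.getsizeof(key) + sys.getsizeof(old)
        # getsizeof counts the object header as well as the payload.
        cost = sys.getsizeof(key) + sys.getsizeof(value)
        if self.ram_bytes + cost <= self.max_ram_bytes:
            self.ram[key] = value
            self.ram_bytes += cost
            self.db.execute("DELETE FROM kv WHERE k = ?", (key,))
        else:
            self.db.execute("REPLACE INTO kv VALUES (?, ?)", (key, value))

    def __getitem__(self, key):
        if key in self.ram:
            return self.ram[key]
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def close(self):
        # Remove the cache file whether the load finished or was aborted.
        self.db.close()
        os.remove(self.path)
```

Lowering `max_ram_bytes` pushes more entries to disk, and calling `close()` from a `finally` block keeps cache files from lingering after a failed run.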

Feb 13 2018, 5:13 PM · Mercurial loader
fiendish added a comment to T329: hg / mercurial loader.

For some repositories, the bundle step currently needs quite a lot of RAM

Feb 13 2018, 4:50 PM · Mercurial loader
ardumont renamed T957: googlecode import: Check for origin clashes and fix if any from googlecode import: Check for origin clashes to googlecode import: Check for origin clashes and fix if any.
Feb 13 2018, 3:34 PM · Archive content, Mercurial loader
ardumont changed the status of T957: googlecode import: Check for origin clashes and fix if any from Open to Work in Progress.
Feb 13 2018, 3:33 PM · Archive content, Mercurial loader
ardumont changed the status of T957: googlecode import: Check for origin clashes and fix if any, a subtask of T682: Ingest Google Code Mercurial repositories, from Open to Work in Progress.
Feb 13 2018, 3:33 PM · Archive coverage, Mercurial loader
ardumont added a comment to T957: googlecode import: Check for origin clashes and fix if any.

Yes, we are hitting the same problem.

Feb 13 2018, 3:33 PM · Archive content, Mercurial loader
ardumont added a comment to T955: googlecode import: hglib.error.CommandError during loading.

Basic checks on the archive are fine:

Feb 13 2018, 3:06 PM · Origin-GoogleCode, Archive content, Mercurial loader
ardumont closed T956: googlecode import: Clean up visit wrongly targetting empty snapshot, a subtask of T682: Ingest Google Code Mercurial repositories, as Resolved.
Feb 13 2018, 2:26 PM · Archive coverage, Mercurial loader
ardumont closed T956: googlecode import: Clean up visit wrongly targetting empty snapshot as Resolved.
softwareheritage=> select count(*) from origin_visit inner join origin on origin_visit.origin = origin.id where origin.type = 'hg';
 count
--------
 126678
(1 row)
softwareheritage=> select count(*) from origin o inner join origin_visit ov on o.id=ov.origin where type='hg' and url like '%googlecode%' and ov.snapshot_id = 16;  -- empty snapshot
 count
--------
 126661
(1 row)
Feb 13 2018, 2:26 PM · Archive content, Mercurial loader
ardumont added a comment to T682: Ingest Google Code Mercurial repositories.

Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.

Feb 13 2018, 2:15 PM · Archive coverage, Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

Heads up on this: I'll mention my investigation so far and the archives, for those interested in trying to reproduce.

Feb 13 2018, 12:38 PM · Mercurial loader
ardumont created T957: googlecode import: Check for origin clashes and fix if any.
Feb 13 2018, 12:26 PM · Archive content, Mercurial loader
ardumont renamed T955: googlecode import: hglib.error.CommandError during loading from import googlecode: hglib.error.CommandError during loading to googlecode import: hglib.error.CommandError during loading.
Feb 13 2018, 12:21 PM · Origin-GoogleCode, Archive content, Mercurial loader
ardumont renamed T956: googlecode import: Clean up visit wrongly targetting empty snapshot from googlecode import: Clean up visit targetting wrongly an empty snapshot to googlecode import: Clean up visit wrongly targetting empty snapshot.
Feb 13 2018, 12:19 PM · Archive content, Mercurial loader
ardumont created T956: googlecode import: Clean up visit wrongly targetting empty snapshot.
Feb 13 2018, 12:19 PM · Archive content, Mercurial loader
ardumont created T955: googlecode import: hglib.error.CommandError during loading.
Feb 13 2018, 12:11 PM · Origin-GoogleCode, Archive content, Mercurial loader
ardumont created T954: Add tests to loader-mercurial.
Feb 13 2018, 11:45 AM · Mercurial loader
ardumont added a comment to T682: Ingest Google Code Mercurial repositories.

FYI all loaded repositories point to an empty snapshot.

Feb 13 2018, 10:19 AM · Archive coverage, Mercurial loader

Feb 12 2018

ardumont added a comment to T682: Ingest Google Code Mercurial repositories.

FYI all loaded repositories point to an empty snapshot.

Feb 12 2018, 5:41 PM · Archive coverage, Mercurial loader
olasd added a comment to T682: Ingest Google Code Mercurial repositories.

FYI all loaded repositories point to an empty snapshot.

Feb 12 2018, 5:26 PM · Archive coverage, Mercurial loader
ardumont added a comment to T682: Ingest Google Code Mercurial repositories.

Some errors speak for themselves (OSError, error during extraction), some do not.
I'm currently digging into this and will open dedicated tasks when deemed necessary.

Feb 12 2018, 4:29 PM · Archive coverage, Mercurial loader
ardumont added a comment to T682: Ingest Google Code Mercurial repositories.

Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.

Feb 12 2018, 3:47 PM · Archive coverage, Mercurial loader
ardumont created P222 mercurial import: origins in error.
Feb 12 2018, 3:33 PM · Mercurial loader

Feb 9 2018

fiendish added a comment to T682: Ingest Google Code Mercurial repositories.

yay

Feb 9 2018, 10:53 PM · Archive coverage, Mercurial loader
ardumont added a comment to T682: Ingest Google Code Mercurial repositories.

rDSNIP26ea29b2d2abf9c931ba5efcf0f49d4194254e79

Feb 9 2018, 5:52 PM · Archive coverage, Mercurial loader
ardumont claimed T682: Ingest Google Code Mercurial repositories.
Feb 9 2018, 5:51 PM · Archive coverage, Mercurial loader
ardumont changed the status of T682: Ingest Google Code Mercurial repositories from Open to Work in Progress.
Feb 9 2018, 5:49 PM · Archive coverage, Mercurial loader
ardumont updated subscribers of T682: Ingest Google Code Mercurial repositories.

This is now running on our swh-workers, scheduling running on saatchi:

Feb 9 2018, 5:48 PM · Archive coverage, Mercurial loader
ardumont created P220 ~/.config/swh/loader/hg.yml.
Feb 9 2018, 4:55 PM · Mercurial loader

Feb 8 2018

fiendish added a comment to T329: hg / mercurial loader.

Well, I'm not sure what just happened, but I committed a patch (and apparently also some duplicate history).

Feb 8 2018, 8:21 PM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

I'll do it as part of my patch, but I will need you to look at it.

Feb 8 2018, 10:13 AM · Mercurial loader
fiendish added a comment to T329: hg / mercurial loader.

I'll do it as part of my patch, but I will need you to look at it. You made the original changes for good reasons, so I just want to make sure that the reasons are preserved.

Feb 8 2018, 10:09 AM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

I'm not clear on whether you want me to do it (including the 'incoming' patch) or if you are doing it though ;)

Feb 8 2018, 9:36 AM · Mercurial loader
ardumont added a comment to T329: hg / mercurial loader.

Also commit fbdd798b0e32a4cc0ef50b08ae2217d45f95e7ad is very problematic.
It tries to store full blob data for every blob in the repository in RAM, which is basically impossible for any large repo.
The get_contents method absolutely must discard the blob after computing its hashes, then check to see which blobs are missing using only the hashes, then re-load and send the missing blobs (what the original code was doing).
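The hash-first flow described above can be sketched as follows. This is an illustration under stated assumptions, not the loader's real `get_contents`: `missing_blobs`, `load_blob`, and `filter_missing` are hypothetical names, and a single SHA-1 stands in for whatever hashes the archive actually uses.

```python
import hashlib


def missing_blobs(blob_ids, load_blob, filter_missing):
    """Yield (digest, data) only for blobs the archive does not hold yet.

    blob_ids       -- iterable of opaque blob identifiers (hypothetical)
    load_blob      -- callable: id -> bytes, safe to call twice (hypothetical)
    filter_missing -- callable: list of digests -> iterable of unknown digests
                      (hypothetical stand-in for the storage query)
    """
    # Pass 1: hash each blob and let its data go out of scope right away,
    # so only the digest -> id mapping stays in RAM.
    digest_to_id = {}
    for blob_id in blob_ids:
        digest = hashlib.sha1(load_blob(blob_id)).hexdigest()
        digest_to_id[digest] = blob_id

    # Ask which blobs are unknown, using the hashes only.
    missing = set(filter_missing(list(digest_to_id)))

    # Pass 2: re-load just the missing blobs, one at a time.
    for digest, blob_id in digest_to_id.items():
        if digest in missing:
            yield digest, load_blob(blob_id)
```

Because the function is a generator, at most one blob's data is held at a time; the cost is reading each missing blob twice.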

Feb 8 2018, 9:25 AM · Mercurial loader

Feb 7 2018

fiendish added a comment to T329: hg / mercurial loader.

Also commit fbdd798b0e32a4cc0ef50b08ae2217d45f95e7ad is very problematic.

Feb 7 2018, 10:38 PM · Mercurial loader

Feb 3 2018

ardumont added a comment to T329: hg / mercurial loader.

I propose to treat remote and local repositories the same (for now at least) with hg incoming to write the bundle in bundle20_loader:prepare.

Feb 3 2018, 5:13 PM · Mercurial loader

Feb 2 2018

fiendish added a comment to T329: hg / mercurial loader.

I propose to treat remote and local repositories the same (for now at least) with hg incoming to write the bundle in bundle20_loader:prepare. (This may require building mercurial from available 4.5 source to not hit some giant memory leak)

Feb 2 2018, 10:18 PM · Mercurial loader
fiendish added a comment to T329: hg / mercurial loader.
Feb 2 2018, 6:56 AM · Mercurial loader

Jan 12 2018

ardumont added a comment to T329: hg / mercurial loader.

although the size limit is something really big like 10 MB, right?

Jan 12 2018, 11:37 AM · Mercurial loader
fiendish added a comment to T329: hg / mercurial loader.

For fetching the blob, the only gotcha I see is that we may have contents without data (the big ones are filtered out).

Jan 12 2018, 2:58 AM · Mercurial loader