Feb 21 2018
If cache files are sticking around, then of course the code should make sure that they go away when done or aborted. But I think that a few G used during processing of extremely large repos should be acceptable. :/
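For what it's worth, a minimal sketch of that guarantee, assuming the cache files live under a dedicated temporary directory (the directory prefix and the do_load callable are placeholders, not the loader's actual API):

import shutil
import tempfile

def load_with_disk_cache(origin_url, do_load):
    """Run a load with an on-disk cache directory that is removed whether
    the load finishes normally or is aborted by an exception."""
    cache_dir = tempfile.mkdtemp(prefix='swh.loader.mercurial.cache.')
    try:
        return do_load(origin_url, cache_dir)
    finally:
        # Cleaned up on success, failure and abort alike.
        shutil.rmtree(cache_dir, ignore_errors=True)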
I think 6e12c90b160ad3277a1edea27a05f9adea1bc92f may be a bad idea. Have you tested how much RAM it takes to hold the whole dirs dict in memory on a very large repo like mozilla-unified?
I agree with taking tags from both sides and discarding all lines that don't fit the pattern.
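A minimal sketch of that filtering, assuming the usual .hgtags line format (a 40-hex-digit changeset id, a space, then the tag name); anything on either side that does not match the pattern is dropped:

import re

# A well-formed .hgtags line: 40-hex-digit changeset id, one space, tag name.
HGTAGS_LINE = re.compile(r'^(?P<node>[0-9a-f]{40}) (?P<tag>.+)$')

def merge_hgtags(side_a, side_b):
    """Take tag lines from both sides and keep only those matching the
    expected pattern; later entries override earlier ones, as in .hgtags."""
    tags = {}
    for line in side_a.splitlines() + side_b.splitlines():
        match = HGTAGS_LINE.match(line.strip())
        if match:  # corrupted/junk lines are simply discarded
            tags[match.group('tag')] = match.group('node')
    return tags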
Feb 20 2018
As discussed on IRC a short while ago (just leaving this as a note here), seeing 2 caches is normal and expected, since one is spawned inside the reader and one in the loader. We will also have to pass that argument to the reader instance.
postgres@prado:~$ pg_dump --format tar --table revision_history --table revision softwareheritage | gzip -c - > /srv/remote-backups/postgres/T970/revision-revision-history.tar.gz
Running a full backup of the table for the handful of revisions concerned here is a bit overkill! (better be safe than sorry and all that, but still...)
In T976#18114, @ardumont wrote: In any case, we need to make a backup dump prior to touching those tables!
Backup running on prado:
postgres@prado:~$ pg_dump --format tar --table revision_history --table revision softwareheritage | gzip -c - > /srv/remote-backups/postgres/T970/revision-revision-history.tar.gz
In any case, we need to make a backup dump prior to touching those tables!
- We can 'simply' delete the revisions of type 'hg', as no other Mercurial revisions exist as of today.
For information, as mentioned on IRC, I see strange behavior regarding the cache disk (P228 for a sample).
Feb 19 2018
Feb 17 2018
In the meantime, I'll reschedule them, filtering out the huge ones (= archive size > 200 MiB).
I'll keep a listing of those not rescheduled.
Feb 16 2018
Now, to avoid the OOM happening too late and killing other services as well, I thought of putting a quota on memory usage (systemd permits this).
So now I've made the deposit/svn/mercurial loaders use private tmp mounting.
So next time this happens, we should only need to restart/stop the service.
And that will clean up the mess left behind after this situation arose.
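For illustration, a systemd drop-in along those lines; the unit name is a guess and 4G is an arbitrary example value:

# /etc/systemd/system/swh-worker@loader_mercurial.service.d/override.conf (hypothetical unit name)
[Service]
# Hard memory quota, so the worker is killed before it can starve other services.
# (On older systemd versions the directive is MemoryLimit= instead.)
MemoryMax=4G
# Private /tmp and /var/tmp, removed when the service is stopped or restarted.
PrivateTmp=true

With PrivateTmp=, stopping or restarting the unit is indeed enough to clean up whatever the loader left behind in /tmp.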
I've chosen 3. to comply with the doc's suggestion.
As usual nothing is set in stone.
Well, sure, if the .hgtags is corrupted...
Last relevant error found in crossing multiple streams... (noooooo).
I have a problem in the log fetcher (that's why it's said to be incomplete in the task description).
I have a problem in the log fetcher (that's why it's said to be incomplete in the task description).
"PatoolError('error extracting /srv/storage/space/m": 19,
Still, we don't want the closed branches (2.).
cannot process Mercurial data anymore.
Feb 15 2018
Rescheduled!
- Also, with the reduce_effort configuration flag flipped on, and given enough time between visits, we'd be getting divergent snapshots... even though nothing changed in the repository, which is wrong. The reason is that we would not pass over all revisions each time, so we could not parse the named-branch property (part of the optional 'extra' field in a changeset/~commit).
Now, Mercurial also has another form of branching called bookmarks.
This is in upstream Mercurial now, and bookmarks can be pushed/pulled from a repository, so that sounded like something we want.
It's now supported in the bundle20_loader.
Well, that is only for the named branch case which is not a convention.
So instead, we are leveraging hglib's heads function to determine the branch names and the targeted revisions needed.
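Roughly what that could look like with python-hglib (a sketch only: the repository path is a placeholder, and a real branch can have several heads, which this simplifies away):

import hglib

def branch_heads(repo_path):
    """Map branch name -> head changeset id using hglib's heads()."""
    client = hglib.open(repo_path)
    try:
        heads = {}
        # closed=False: closed branch heads are not wanted here.
        for rev in client.heads(closed=False):
            # rev is a tuple of (rev, node, tags, branch, author, desc, date)
            heads[rev.branch] = rev.node
        return heads
    finally:
        client.close()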
Deployed and tested on only one repository, and already, for something that has data, we end up with an empty snapshot where I expected a non-empty one (reproduced locally).
For example, for revisions, we currently compute the revision id (à la swh) but the parents' ids are Mercurial's...
Feb 14 2018
Now, that's not all!
According to the code, that information is stored in an optional 'extra' field. So it's possible that nothing is referenced...
We'll possibly have new edge cases to fix after that ;)
I won't make it a blocker step to reschedule the googlecode origins though.
I'd like to see what actually happens when the loader mercurial loads ;)
I close this as another mirror exists that we already browsed multiple times.
So, back to updating the index and the origins for those.
It remains to update the existing origins (in the db) with the right urls, but not today.
Does it make sense to expose that as a property in the loader's configuration?
Feb 13 2018
Listing of the googlecode dumps recomputed with the right origin urls (in the current INDEX-hg; the old one is referenced as INDEX-hg.deprecated).
This makes no sense. Listed repos are tiny.
The bundle loader is tunable to use less RAM and therefore more disk for its live caching (though I need to revisit the counter to make the tuning argument less arbitrary and more representative of real bytes used, because it currently ignores overhead, and Python data has a lot of overhead).
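To put a number on the overhead point, a quick interactive check (exact figures vary with the CPython version and platform):

import sys

payload = b'x' * 20                 # 20 bytes of actual data, e.g. a node id
entry = {b'node': payload}          # one tiny cache entry wrapping it

print(len(payload))                 # 20
print(sys.getsizeof(payload))       # ~53: the bytes object header alone costs ~33 bytes
print(sys.getsizeof(entry))         # a couple hundred bytes more for the dict structure
# The real RAM used per entry is several times the payload the counter sees.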
The bundle step, for some repositories, currently needs quite some RAM.
Yes, we are hitting the same problem.
Basic checks on the archive are fine:
softwareheritage=> select count(*) from origin_visit inner join origin on origin_visit.origin = origin.id where origin.type = 'hg';
 count
--------
 126678
(1 row)

softwareheritage=> select count(*) from origin o inner join origin_visit ov on o.id=ov.origin where type='hg' and url like '%googlecode%' and ov.snapshot_id = 16; -- empty snapshot
 count
--------
 126661
(1 row)
Heads up on this: I'll mention my investigation so far and the archives, for those interested in trying to reproduce.
Feb 12 2018
FYI all loaded repositories point to an empty snapshot.
Some errors speak for themselves (OSError, error during extraction), some do not.
I'm currently digging into this and will open dedicated tasks when deemed necessary.
Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.
Feb 9 2018
yay
This is now running on our swh-workers, scheduling running on saatchi:
Feb 8 2018
Well I'm not sure what just happened, but I committed a patch (and apparently also some duplicate history).
I'll do it as part of my patch, but I will need you to look at it. You made the original changes for good reasons, so I just want to make sure that the reasons are preserved.
I'm not clear on whether you want me to do it (including the 'incoming' patch) or if you are doing it though ;)
Also commit fbdd798b0e32a4cc0ef50b08ae2217d45f95e7ad is very problematic.
It tries to store full blob data for every blob in the repository in RAM, which is basically impossible for any large repo.
The get_contents method absolutely must discard the blob after computing its hashes, then check to see which blobs are missing using only the hashes, then re-load and send the missing blobs (what the original code was doing).
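A minimal, generic sketch of that two-pass flow; blob_paths and archive_has are stand-ins for the loader's real blob iteration and "which hashes are already archived" query, not actual APIs:

import hashlib

def get_contents(blob_paths, archive_has):
    """Hash everything first while discarding the data, ask the archive which
    hashes it already knows, then re-read and yield only the missing blobs."""
    # Pass 1: compute hashes only; the blob data is dropped immediately.
    digests = []
    for path in blob_paths:
        with open(path, 'rb') as f:
            data = f.read()
        digests.append((path, hashlib.sha1(data).hexdigest(), len(data)))
        del data  # only the digest and the length stay in RAM

    # One query to learn which blobs are actually needed.
    known = archive_has({sha1 for _, sha1, _ in digests})

    # Pass 2: re-read and send only the missing blobs.
    for path, sha1, length in digests:
        if sha1 not in known:
            with open(path, 'rb') as f:
                yield {'sha1': sha1, 'length': length, 'data': f.read()}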
Feb 2 2018
I propose to treat remote and local repositories the same (for now at least) with hg incoming to write the bundle in bundle20_loader:prepare. (This may require building mercurial from available 4.5 source to not hit some giant memory leak)
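A rough sketch of that with python-hglib, assuming its incoming() call accepts a bundle argument mirroring hg incoming --bundle (the URL and paths below are placeholders):

import hglib

origin_url = 'https://hg.example.org/some-repo'      # placeholder origin
clone_path = '/tmp/swh-loader-mercurial-empty-repo'  # placeholder scratch repo
bundle_path = '/tmp/swh-loader-mercurial.bundle'     # placeholder bundle file

# An empty local repo to compare against: everything in the origin is "incoming".
hglib.init(clone_path)
client = hglib.open(clone_path)
try:
    # Same call whether origin_url is remote or a local path; the incoming
    # changesets are written to bundle_path instead of being pulled.
    client.incoming(path=origin_url, bundle=bundle_path)
finally:
    client.close()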
Jan 12 2018
Although the size limit is something really big, like 10 MB, right?
For fetching the blobs, the only gotcha I see is that we possibly have contents without data (the big ones are filtered out).
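So blob fetching has to tolerate hash-only entries; schematically, with an illustrative threshold (the real limit and field names may differ):

import hashlib

MAX_CONTENT_SIZE = 10 * 1024 * 1024   # the "really big" limit mentioned above; illustrative value

def content_entry(data):
    """Build a content entry; oversized blobs keep their hashes but no data."""
    entry = {'sha1': hashlib.sha1(data).hexdigest(), 'length': len(data)}
    if len(data) <= MAX_CONTENT_SIZE:
        entry['data'] = data
    else:
        entry['reason'] = 'content too large'   # archived as hashes only
    return entry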