
hg / mercurial loader
Closed, Migrated

Description

VCS loader for hg / mercurial repositories

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.
ardumont changed the task status from Open to Work in Progress. Dec 20 2017, 11:42 AM

After adaptations and adding the requisite cloning step prior to loading, here is a local successful load (full local swh stack):

python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> origin_url = 'https://hg.python.org/distutils2'
>>> # local clone
... directory = 'distutils2'
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.mercurial.tasks import LoadMercurialTsk
>>>
>>> t = LoadMercurialTsk()
>>> t.run(origin_url=origin_url, directory=directory, visit_date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadMercurialTsk:Cloning https://hg.python.org/distutils2 to /tmp/swh.loader.mercurial.4kaeum3c.tmp/distutils2
DEBUG:swh.scheduler.task.LoadMercurialTsk:Making bundle at /tmp/swh.loader.mercurial.4kaeum3c.tmp/distutils2/HG20_none_bundle
DEBUG:swh.scheduler.task.LoadMercurialTsk:Creating hg origin for https://hg.python.org/distutils2
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done creating hg origin for https://hg.python.org/distutils2
DEBUG:swh.scheduler.task.LoadMercurialTsk:Creating origin_visit for origin 1 at time 2016-05-03T15:16:32+00:00
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done Creating origin_visit for origin 1 at time 2016-05-03T15:16:32+00:00
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 1000 contents
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 1000 contents
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 1000 contents
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 1000 contents
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 1000 contents
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 1000 contents
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 965 contents
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 965 contents
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 2500 directories
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 2500 directories
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 2500 directories
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 2500 directories
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 1664 directories
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 1664 directories
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 1000 revisions
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 1000 revisions
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 356 revisions
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 356 revisions
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 6 releases
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 6 releases
DEBUG:swh.scheduler.task.LoadMercurialTsk:Sending 1 occurrences
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done sending 1 occurrences
DEBUG:swh.scheduler.task.LoadMercurialTsk:Updating origin_visit for origin 1 with status full
DEBUG:swh.scheduler.task.LoadMercurialTsk:Done updating origin_visit for origin 1 with status full

related rDLDHG0db9310f0b6c220600d9b791caccb78e6554cadc

It is also deployed on worker02.

Here are the current shortcomings I saw:

  • As noted in the code, only the bundle20 format is supported (some googlecode dumps use the bundle10 format though). -> I found a conversion procedure, but it needs a live repository to actually work (e.g. P207)
  • For remote repositories, the current loader (swh.loader.mercurial.bundle20_loader) needs a cloning step prior to loading the bundle
  • It seems that a Mercurial tag does not carry the usual expected information (author, message, etc.). That raises the question of its relevance, since tags are simply refs to commits. For now, as a technical detail, we create releases whose author is the void author (null name, null email, empty fullname); a sketch of that shape follows this list.
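For illustration only, here is a hedged sketch of what such a "void author" release could look like as a plain dict (field names follow the usual swh model conventions; the helper itself and the synthetic flag are assumptions, not the loader's actual code):

# Hypothetical helper: build a release for an hg tag. Since Mercurial tags
# carry no author or message of their own, the author is the "void author".
def release_for_tag(tag_name, target_revision_id):
    return {
        'name': tag_name,                  # tag name, as bytes
        'target': target_revision_id,      # id of the tagged revision
        'target_type': 'revision',
        'message': None,                   # no tag message in Mercurial
        'author': {'name': None, 'email': None, 'fullname': b''},  # void author
        'date': None,
        'synthetic': True,                 # assumption: mark it as loader-made
    }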

It strikes me that there are some other things I did not check but should have.
Maybe @fiendish already did:

  • loader idempotence (as in 'no matter how many times we load the same repository, we end up with exactly the same objects in storage').

    I plainly made that assumption. I strongly think it holds.
  • is the loader iterative? That is, can it start back from previous visit and only continue from that point on?

    I don't think so. Also, recalling my manual tests, I remember that the filtering does take place, as we go back and forth with storage to only send the missing objects. But I think we iterate over the repository from scratch each time.

    In any case, I'm not so sure we can, due to Mercurial's inner workings. @fiendish may have mentioned this in his code comments (as file headers). I don't remember.

As noted in the code, only the bundle20 format is supported

This is true, but...

(some googlecode dumps use the bundle10 format though).

I think this is incorrect. A new tweak to the code has hopefully fixed this. For some reason I think that hglib's bundle creation was automatically generating v2 bundles without specifying to do so when I was originally testing, but now I see that it does not, so I just made the code specify it directly.

But also the googlecode dumps that I saw were full repos and not bundles, so maybe there are actual v1 bundles in there somewhere that I missed?

For remote repositories, the current loader (swh.loader.mercurial.bundle20_loader) needs a cloning step prior to loading the bundle

Currently it works that way, but Mercurial also allows requesting a bundle file directly from the server, so for any live repos we can hopefully use that.

loader idempotence (as in 'no matter how many times we load the same repository, we end up with exactly the same objects in storage').

How would it not be? There is no randomness. Maybe I misunderstand the question?

is the loader iterative? That is, can it start back from previous visit and only continue from that point on?

I think it can be easily, with some very small modifications, and I can think of two possible approaches...

Approach 1)
Start from only a partial bundle.
Mercurial will produce a bundle starting from any revision number you want by using the --base flag to indicate the last revision that we already have (using 0000000000000000000000000000000000000000 works just like -a, 0 will skip the initial commit, 1 will skip the first two commits, etc.).
Then the parser would just need a short fallback method for fetching blob data from the DB for any needed parent file revision that is missing from that new bundle, if it hasn't already been fetched. This can either be done on the fly, necessitating an arbitrary number of network data requests, or the new bundle could be skimmed up front to determine which parent revision IDs are missing, so they can all be fetched at once in one arbitrarily large request.
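For concreteness, here is a minimal sketch of that partial-bundle step, shelling out to the hg CLI with the --base flag described above (the function, paths and variable names are illustrative assumptions):

import subprocess

NULL_NODE = '0' * 40  # --base with the null node yields a full bundle, like -a

def make_partial_bundle(repo_dir, bundle_path, last_known_node=NULL_NODE):
    # Write an uncompressed v2 bundle containing only the changesets that
    # come after last_known_node (the last revision we already have).
    subprocess.run(
        ['hg', 'bundle', '--type', 'none-v2',
         '--base', last_known_node, bundle_path],
        cwd=repo_dir, check=True,
    )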

Approach 2)
Save work on a new complete bundle.
If we grab an entire new bundle and not just a partial one, many operations can be skipped over.
Immediately we can skip all previous commit and manifest logs. We can also skip hashing of previous file revision blobs.
With a pre-scan to see which files have new revisions, we could also skip constructing blobs for any files that have not been changed since the last visit.
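A rough sketch of that pre-scan idea, under the assumption that we can enumerate, per file, the revision nodes present in the new bundle and the nodes already processed during the previous visit (both inputs are hypothetical helpers, not existing loader data structures):

def plan_blob_work(bundle_file_nodes, previously_loaded_nodes):
    # bundle_file_nodes: dict mapping file path -> set of revision nodes
    #   found in the new (complete) bundle.
    # previously_loaded_nodes: dict mapping file path -> set of nodes
    #   already processed during the last visit.
    todo = {}
    for path, nodes in bundle_file_nodes.items():
        new_nodes = nodes - previously_loaded_nodes.get(path, set())
        if new_nodes:
            # Only files with new revisions need blob reconstruction and
            # hashing; everything else becomes a no-op for this visit.
            todo[path] = new_nodes
    return todo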


I think this is incorrect. A new tweak to the code has hopefully fixed this. For some reason I think that hglib's bundle creation was automatically generating v2 bundles without specifying to do so when I was originally testing, but now I see that it does not, so I just made the code specify it directly.

My test at the time of your modification confirmed this, thanks!

Currently it works that way, but Mercurial also allows requesting a bundle file directly from the server, so for any live repos we can hopefully use that.

That would be quite awesome!
My tests at the time on live repositories did not work that way (as in, with the current code).
Also, after a quick look, I did not find a way to request the bundle directly from the server.
As time was an issue (again, at the time), I took the fast route and cloned first.

@zack mentioned that it was not really an issue to clone first (given the number of Mercurial repositories in the wild, plus the hypothesis that the cloning step is fast).

But that could (should?) be investigated further then ;)

How would it not be? There is no randomness. Maybe I misunderstand the question?

You understood the question correctly. My intent was not to imply randomness or anything else.
I simply did not have time to check everything, and I just wanted to report that.

I think it can be easily, with some very small modifications, and I can think of two possible approaches...

That's good news.

I understand approach 1. Also, if we do as in the loader-svn (which stores the svn commit number in the revision metadata), we have a way to resume: if it exists, retrieve the Mercurial commit identifier from the occurrence's revision metadata and request a partial bundle from that commit onward.
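A rough sketch of that resumption idea, assuming we had stored the hg node id in the revision metadata at load time (the metadata key is a hypothetical choice, the storage call usage is an assumption, and retrieving the previous visit's latest revision id via the occurrence is left out):

def find_resume_node(storage, latest_revision_id):
    # Return the hg node to use as --base for the next partial bundle,
    # or None if we have to load from scratch.
    if latest_revision_id is None:
        return None
    revision = next(iter(storage.revision_get([latest_revision_id])), None)
    if revision and revision.get('metadata'):
        return revision['metadata'].get('hg_node')  # hypothetical metadata key
    return None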

For fetching the blob, the only gotcha I see is that possibly we have contents without data (the big ones are filtered out).
But:

  1. Why would we need the blob data? Aren't its hashes enough? (Possibly my limited understanding of Mercurial and the Mercurial loader is what raises this question.)
  2. If we need the blob and we don't have it, could we request it from the Mercurial repository?

Approach 2 is less clear to me, but I probably need to think more about it (and have another look at the current Mercurial loader).

For fetching the blob, the only gotcha I see is that possibly we have contents without data (the big ones are filtered out).

You are right. That may sink approach 1 as a generic solution, although the size limit is something really big like 10mb right? How likely is it that a code file exceeds that size without really being binary resource data? Maybe it would be safe to say that if any file is _ever_ too large, then it becomes always invalid after that point? But anyway, to be completely general this may warrant moving to approach 2.

Why would we need the blob data? Aren't its hashes enough?

No, because every later blob's full data depends on an earlier parent blob's full data: every commit only records blob deltas, so you get previously constructed full data + new commit delta = new commit full data.
You have, e.g.
File: README.txt
commit 1: Hello this is the content of a readme file. It is very long and has many words. Maybe it even has some instructions and then some author credits.
commit 2: 1,14,29,"" <- this deletes "the content of " from the data in this file as of commit 1
etc.
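As a toy illustration of that reconstruction (the (start, end, replacement) triple below is a simplification of Mercurial's actual revlog delta format, used only to make the arithmetic concrete):

def apply_delta(parent_data, delta):
    # Rebuild a file revision's full data from its parent's full data
    # and a single (start, end, replacement) delta.
    start, end, replacement = delta
    return parent_data[:start] + replacement + parent_data[end:]

commit1 = ('Hello this is the content of a readme file. '
           'It is very long and has many words.')
# Delete "the content of " (characters 14..29 of commit 1), as in the example above.
commit2 = apply_delta(commit1, (14, 29, ''))
assert commit2.startswith('Hello this is a readme file.')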

If we need the blob and we don't have it, could we request it from the Mercurial repository?

Hm. This is a very interesting question. I do not think there is a way built into hg for pulling one file from a remote repository. From a local repository yes, but then we're back to just cloning the whole repo. We could of course use wget on HTTP servers, but you must customize the request for the different web servers. Bitbucket has different raw file paths than hg server does. ( https://stackoverflow.com/questions/5053640/get-a-single-file-from-a-remote-mercurial-repository )

Approach 2 is less clear to me

I will try to clarify. Approach 2 would pull a bundle of the entire remote repo and is exactly the same as running the first time, except after the first time you can skip processing some of the new bundle because it was already done the first time. For instance, we would only need to reconstruct blobs that are parents of new blobs. If a file never changes again, then we do not need to look at the blob section for that file. Further we would only need to compute hashes for new blobs.

In a nutshell, approach 1 is starting with partial data and trying to recover needed historical entries, and approach 2 is starting from complete data and replacing no-longer-needed processing steps with no-ops.
I think because of the problem you point out with approach 1, then we might have no choice but to go with approach 2. I will start on it after this weekend.

although the size limit is something really big like 10mb right?

It's actually even bigger: 100 MiB is the default value.
And it's not overridden in the deployment data (the link points to the loader-git as an example, but it's true for the other loaders so far).

How likely is it that a code file exceeds that size without really being binary resource data? Maybe it would be safe to say that if any file is _ever_ too large, then it becomes always invalid after that point?

That sounds reasonable.

Also, there may be a limit somewhere in Mercurial itself.

No, because every later blob's full data depends on an earlier parent blob's full data: every commit only records blob deltas, so you get previously constructed full data + new commit delta = new commit full data.

Right.
(See, that was indeed my limited Mercurial knowledge, and me not re-reading the README. Thanks! ;)

But anyway, to be completely general this may warrant moving to approach 2.

Indeed.

Hm. This is a very interesting question. I do not think there is a way built into hg for pulling one file from a remote repository. From a local repository yes, but then we're back to just cloning the whole repo. We could of course use wget on HTTP servers, but you must customize the request for the different web servers. Bitbucket has different raw file paths than hg server does. ( https://stackoverflow.com/questions/5053640/get-a-single-file-from-a-remote-mercurial-repository )

Well, sounds like an awful workload (approach 2 sounds better now ;).

I will try to clarify. Approach 2 would pull a bundle of the entire remote repo and is exactly the same as running the first time, except after the first time you can skip processing some of the new bundle because it was already done the first time. For instance, we would only need to reconstruct blobs that are parents of new blobs. If a file never changes again, then we do not need to look at the blob section for that file. Further we would only need to compute hashes for new blobs.

Thanks!

In a nutshell, approach 1 is starting with partial data and trying to recover needed historical entries, and approach 2 is starting from complete data and replacing no-longer-needed processing steps with no-ops. I think because of the problem you point out with approach 1, then we might have no choice but to go with approach 2. I will start on it after this weekend.

Right!

So, the adaptations I see as needed, according to our latest discussion, are:

  • request a complete bundle directly from the remote repository (and remove the current cloning step for the remote-repository use case).
  • detect no-op operations to skip

Sounds nice!

Again thanks for the clarifications.


I propose to treat remote and local repositories the same (for now at least) with hg incoming to write the bundle in bundle20_loader:prepare. (This may require building mercurial from available 4.5 source to not hit some giant memory leak)

--- /tmp/MnkEAy_bundle20_loader.py
+++ /home/avi/SWH/swh-environment/swh-loader-mercurial/swh/loader/mercurial/bundle20_loader.py
@@ -80,27 +80,25 @@
         self.bundle_path = None
 
         try:
-            if not directory:  # remote repository
-                self.working_directory = mkdtemp(
-                    suffix='.tmp',
-                    prefix='swh.loader.mercurial.',
-                    dir='/tmp')
-                os.makedirs(self.working_directory, exist_ok=True)
-                self.hgdir = self.working_directory
-
-                self.log.debug('Cloning %s to %s' % (
-                    self.origin_url, self.hgdir))
-                hglib.clone(source=self.origin_url, dest=self.hgdir)
-            else:  # local repository
-                self.working_directory = None
-                self.hgdir = directory
-
-            self.bundle_path = os.path.join(self.hgdir, self.bundle_filename)
-            self.log.debug('Bundling at %s' % self.bundle_path)
-            with hglib.open(self.hgdir) as repo:
-                repo.bundle(bytes(self.bundle_path, 'utf-8'),
-                            all=True,
-                            type=b'none-v2')
+            self.working_directory = mkdtemp(
+                suffix='.tmp',
+                prefix='swh.loader.mercurial.',
+                dir='/tmp')
+            os.makedirs(self.working_directory, exist_ok=True)
+            self.bundle_path = os.path.join(
+                self.working_directory, self.bundle_filename
+            )
+            self.log.debug('Fetching bundle of %s to %s' % (
+                    self.origin_url, self.bundle_path
+                )
+            )
+            with hglib.init(self.working_directory).open() as repo:
+                repo.incoming(
+                    path=bytes(directory or self.origin_url, 'utf-8'),
+                    force=True,
+                    bundle=self.bundle_path,
+                    limit=1
+                )
         except Exception:
             self.cleanup()
             raise

I propose to treat remote and local repositories the same (for now at least) with hg incoming to write the bundle in bundle20_loader:prepare.

Yes, sure! I introduced the different treatment as a workaround to be able to test both cases.
It was not a definitive implementation ;)

Also commit fbdd798b0e32a4cc0ef50b08ae2217d45f95e7ad is very problematic.

It tries to store full blob data for every blob in the repository in RAM, which is basically impossible for any large repo.

The get_contents method absolutely must discard the blob after computing its hashes, then check to see which blobs are missing using only the hashes, then re-load and send the missing blobs (what the original code was doing). It's an unfortunate side effect of having only a single storage.content_missing call for all blobs at once. I assume that everything in that commit except for the call to content_missing can be safely reverted?
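A minimal sketch of the pattern described above (iter_blob_ids, load_blob and hash_blob are hypothetical helpers; only storage.content_missing is named in the discussion, and it is assumed here to take the hash dicts and yield the sha1 of the missing ones):

def get_contents(storage, iter_blob_ids, load_blob, hash_blob):
    # First pass: hash every blob and drop the data right away.
    hashed = []
    for blob_id in iter_blob_ids():
        data = load_blob(blob_id)               # reconstruct full blob data
        hashed.append((blob_id, hash_blob(data)))
        # data goes out of scope here: only the hashes are kept in RAM
    # Single round trip: which of these hashes does the archive not have yet?
    missing = set(storage.content_missing([h for _, h in hashed]))
    # Second pass: re-load and send only the missing blobs.
    for blob_id, content_hashes in hashed:
        if content_hashes['sha1'] in missing:
            yield dict(content_hashes, data=load_blob(blob_id))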

Also commit fbdd798b0e32a4cc0ef50b08ae2217d45f95e7ad is very problematic.
It tries to store full blob data for every blob in the repository in RAM, which is basically impossible for any large repo.
The get_contents method absolutely must discard the blob after computing its hashes, then check to see which blobs are missing using only the hashes, then re-load and send the missing blobs (what the original code was doing).

oh, right!

It's an unfortunate side effect of having only a single storage.content_missing call for all blobs at once.

Indeed.

We could adapt the storage API to open up a more appropriate endpoint if need be (if it's reasonable and shareable).
Not that I clearly see one, though.

I assume that everything in that commit except for the call to content_missing can be safely reverted?

Well, sure, if it's dumb, reverting is the reasonable thing to do ;)

I'm not clear on whether you want me to do it (including the 'incoming' patch) or if you are doing it though ;)

It'd be cool if you could tell me which is which, thanks in advance.

I'll do it as part of my patch, but I will need you to look at it. You made the original changes for good reasons, so I just want to make sure that the reasons are preserved.

I'll do it as part of my patch, but I will need you to look at it.

ok, cool.

You made the original changes for good reasons, so I just want to make sure that the reasons are preserved.

Nice!

Thanks.

Well, I'm not sure what just happened, but I committed a patch (and apparently also some duplicate history).

Heads up on this: I'll mention my investigation so far and the archives, for those interested in trying to reproduce.

"FileNotFoundError(2, 'No such file or directory')": 359,

All are the 'empty repository' case! This is fixed, as already
mentioned in the referenced commits.

"OSError(12, 'Cannot allocate memory')": 13,

The bundle step, for some repositories, currently needs quite some
RAM. And as these are shared resources, in some cases the failure
could have been avoided (given different circumstances).

For the other, unavoidable cases, we already discussed on IRC
possibly updating Mercurial to the current latest version (4.5) and
switching from cloning + bundling to bundling directly from the
remote/local repository (implementation-wise, cf. T329#17473). This
needs some work though (mainly packaging).

Archives:

  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/i/iirosto-qw/iirosto-qw-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/i/iismichaels-blackthorn/iismichaels-blackthorn-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/jsmith-gcaltimeclock/jsmith-gcaltimeclock-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/jsmsmj-/jsmsmj--source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/jsnail-mytracks/jsnail-mytracks-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/jsocketio/jsocketio-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/m/madtank-wifikeyboard/madtank-wifikeyboard-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/m/madtull-quake2/madtull-quake2-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/p/pucngp-yaffey/pucngp-yaffey-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/r/riosramiro-rrios/riosramiro-rrios-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/s/slb/slb-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/s/slc-ie-tcpmon/slc-ie-tcpmon-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/s/slby99-test/slby99-test-source-archive.zip

"IntegrityError('duplicate key value violates uniqu": 3,

Like I said earlier (T682#17658), this smells like origin
clashes. This needs more investigation (T957).

Archives:

  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/a/ai-docs-en-info/ai-docs-en-info-lang-pt-br.source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/l/lp-csic-uab/lp-csic-uab-psearch.source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/l/lp-csic-uab/lp-csic-uab-xmaldi.source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/t/theforceteam-1073/theforceteam-1073-frc12.source-archive.zip

(1 more archive than reported...)

"PatoolError('error extracting /srv/storage/space/m": 2,

Those are corrupt archives (checking them with 7z t reveals at
least one error).

Archives:
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/i/ichracing-gfd/ichracing-gfd-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/jayalemao-html5-slides/jayalemao-html5-slides-source-archive.zip

"WorkerLostError('Worker exited prematurely: signal": 2,

Well, the only thing I see is that those are big archives (> 350 MiB).
And locally, their RAM usage goes through the roof, to the point
where the machine becomes unresponsive.

Archives:

  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/jackalopetears-wifi-propeller-load/jackalopetears-wifi-propeller-load-source-archive.zip
  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/p/pcbranders-testingofpmc/pcbranders-testingofpmc-source-archive.zip

We could 'simply' make multiple passes, which should reduce the
memory usage, since some contents/directories/etc. will be filtered
out after the first pass.

We could also limit the size of the archives we inject (we did for
svn, up to 200 MiB IIRC; need a reference here).

"CommandError(b'bundle', b'-t', b'none-v2', b'-a', ": 1,

This needs further investigation (T955).

Archive:

  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/e/exposong/exposong-source-archive.zip

"PatoolError(\"error extracting /srv/storage/space/m": 1

For that one, the worker's error references a long-file-name issue.
Locally, I could not reproduce it though.

Archive:

  • /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/m/mary-sandbox/mary-sandbox-source-archive.zip

The bundle step, for some repositories, currently needs quite some RAM

This makes no sense. Listed repos are tiny.
e.g. for /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/jsocketio/jsocketio-source-archive.zip aka https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/jsocketio/source-archive.zip :

jsocketio $ /usr/bin/time -v hg bundle -a --type none-v2 bfile
26 changesets found
	Command being timed: "hg bundle -a --type none-v2 bfile"
	User time (seconds): 0.10
	System time (seconds): 0.01
	Percent of CPU this job got: 97%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.12
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 21524
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 4197
	Voluntary context switches: 1
	Involuntary context switches: 2
	Swaps: 0
	File system inputs: 0
	File system outputs: 1232
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The bundle loader is tunable to use less ram and therefore more disk for its live caching (though I need to revisit the counter to make the tuning argument less arbitrary and more representative of real bytes used, because it currently ignores overhead and python data has a lot of overhead).

See: objects.py:SelectiveCache.DEFAULT_SIZE (max_size=foo during SelectiveCache.init) and also bundle20_reader.py:Bundle20Reader.init

If needed, hg loading can be slimmed at will. Very large repos will just hit the disk more often.

This makes no sense. Listed repos are tiny.

Yes; for those repositories, like I said on IRC, I checked at least one instance of each 'bug', but not that one!
I probably misled myself with yet another RAM issue to check! Nice catch.

Tested locally, and they were all injected fine!
So my hypothesis is that the RAM usage on each worker (due to other tasks running at the time) was so high that the task did not make it and was killed.

The bundle loader is tunable to use less ram and therefore more disk for its live caching (though I need to revisit the counter to make the tuning argument less arbitrary and more representative of real bytes used, because it currently ignores overhead and python data has a lot of overhead).

OK.
Does it make sense to expose that as a loader configuration property?

See: objects.py:SelectiveCache.DEFAULT_SIZE (max_size=foo during SelectiveCache.init) and also bundle20_reader.py:Bundle20Reader.init
If needed, hg loading can be slimmed at will. Very large repos will just hit the disk more often.

Interesting! Thanks for the heads up!

Does it make sense to expose that as a loader configuration property?

I think it would make sense, but I worry that my size-estimate code isn't very good (because it tries foremost to stay fast, but without an appropriate multiplier). So it will be hard to change the value reliably, except to move it somewhat smaller than the default and then pray. At least at first.

Also, during directory loading there are two caches active, so at that point the config property will need to be divided by 2 if the number is meant to govern total process usage and not just per-cache usage. Just a gotcha to be aware of (see the small sketch below).
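In concrete terms, that gotcha is just the following split (the 2 GiB budget is an illustrative number, not a recommended setting):

# If the configured value is meant to bound total cache usage per process,
# each cache can only get half of it while directories are being loaded,
# since two caches are live at that point.
total_cache_budget = 2 * 1024 * 1024 * 1024   # e.g. 2 GiB for the whole process
per_cache_max_size = total_cache_budget // 2  # what each cache would receive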

Now that you remind me, I will do some profiling of RAM utilization to see where to improve the estimate without significant slowdown. I can't tomorrow, but hopefully Thursday.

Ack.

I won't make it a blocking step for rescheduling the googlecode origins though.
I'd like to see what actually happens when the Mercurial loader loads ;)

We'll possibly have new edge cases to fix after that ;)

We'll possibly have new edge cases to fix after that ;)

Well, no need to wait that long.

Deployed and tested on only one repository, and already, for something that has data, we end up with an empty snapshot where I expected a non-empty one (reproduced locally).
That's not good.

Digging in further, the issue is in the way we extract the branch information from the commit.
According to the code, that information is stored in an optional 'extra' field. So it's possible that nothing is referenced...
And apparently that's the case for at least one mercurial origin (/srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/0/00-00-bl/00-00-bl-source-archive.zip).

Well, heading towards Mercurial's branch documentation... I feel I missed something.

According to the code, that information is stored in an optional 'extra' field. So it's possible that nothing is referenced...

Well, that only covers the named-branch case, which is not a universal convention.
So instead, we are leveraging hglib's heads function to determine the branch names and the targeted revisions needed.
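A minimal sketch of that idea with python-hglib, assuming heads() returns revision tuples exposing branch and node as its log() does (the mapping to snapshot branches is only illustrative):

import hglib

def repo_branch_heads(repo_path):
    # Map each branch name to the node of its head changeset, using
    # hglib's heads() rather than parsing the optional 'extra' field.
    with hglib.open(repo_path) as client:
        return {
            head.branch: head.node   # e.g. {b'default': b'<40-hex-node>'}
            for head in client.heads()
        }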

Now, that's not all!

The swh identifier computations are currently not correct everywhere.
We are mixing Mercurial's internal identifiers and swh's.
For example, for revisions, we currently compute the revision id (à la swh) but the parents' ids are Mercurial's...

That needs to be fixed to be consistent everywhere.
The goal is to have something browsable inside the archive (which uses swh ids ;).

For example, for revisions, we currently compute the revision id (à la swh) but the parents' ids are Mercurial's...

Fixed in:

Deployed and tested on only one repository, and already, for something that has data, we end up with an empty snapshot where I expected a non-empty one (reproduced locally).

Well, the previously mentioned commits and rDLDHGc2f23f452608011844067179629e8fb09dd20f9c now fix that.
Injecting a repository and walking from origin through snapshot to revision actually joins up now, so that's a win!

Well, that only covers the named-branch case, which is not a universal convention.
So instead, we are leveraging hglib's heads function to determine the branch names and the targeted revisions needed.

As we can see in the listed commits prior to this comment.

I first added an optional implementation which parsed the named branches, then reverted it (I kept it in the log for history reasons).

Why? Because, with the current heads implementation (which is needed for the default branch), we already have almost everything we need (except for bookmarks, cf. the last comment T329#17850/commit).

In more detail, parsing named branches is:

  1. redundant (hglib.heads gives it already)
  2. giving too much information (closed branches, for one - which, if we wanted them, could be obtained from the heads call anyway, by passing closed=True to the call)
  3. Also, with the reduce_effort configuration flag turned on, and given enough time between visits, we would get divergent snapshots... even though nothing changed in the repository, which is wrong. The reason is that we would not pass over all revisions each time, and thus could not parse the named-branch property (part of the optional 'extra' field in a changeset/~commit).

Now, there is also another form of branching in Mercurial, called bookmarks.
Bookmarks are in upstream Mercurial now and can be pushed/pulled from a repository, so that sounded like something we want.
They are now supported in the bundle20_loader.

  3. Also, with the reduce_effort configuration flag turned on, and given enough time between visits, we would get divergent snapshots... even though nothing changed in the repository, which is wrong. The reason is that we would not pass over all revisions each time, and thus could not parse the named-branch property (part of the optional 'extra' field in a changeset/~commit).

15:45:41 fiendish | ardumont: re: reduce_effort blocking branch name gathering, that's true, but the little extra = commit.get('extra') block could also be moved before the continue

Quite, I agree.
Still, we don't want the closed branches (2.).

Still, we don't want the closed branches (2.).

And sure, we can filter them out while parsing. That makes for more computation, so it's less clear-cut.

Still, what inclines me to change my mind is what you said about hglib calls being slow on remote repositories.

The small 'blocker' that would then remain is to find out how to parse the bookmarks from a bundle (they can be remote now)!

For information, as mentioned on IRC, I see strange behavior regarding the disk cache (P228 for a sample).

I see not one but multiple dict caches used (2 disk caches for each call to swh.loader.mercurial.bundle20_loader.load_directories).

I adapted the code to allow forcing the writes to a specific folder (other than /tmp),
because I'd like to sandbox that disk usage before it grows to the point of chaos (T964).

But while one disk cache respects that, the other, unexpected one does not: it is always written to /tmp (P228).
I just don't see where that behavior comes from in the underlying code...

Furthermore, I cannot say whether the contents are the same, but at least the sizes of both disk caches converge before they get trashed (as far as I could tell, at that point the computations are done).

I am under the impression there is something buggy in the sqlitedict dependency (at least with regard to not writing where it's supposed to).
Frankly, I hope I'm wrong.

As discussed in irc a short while ago (just leaving this as note here), seeing 2 caches is normal and expected, since one is spawned inside reader and one in loader. Will have to also pass that argument to the reader instance.

As discussed in irc a short while ago (just leaving this as note here), seeing 2 caches is normal and expected, since one is spawned inside reader and one in loader. Will have to also pass that argument to the reader instance.

Right!
I missed the Bundle20Reader cache; thanks for pointing me in the right direction (from the IRC discussion).
I thus propagated that information (rDLDHG881814e9a432a0099dfe349bc322babe5e9bc34b).

It was also mentioned early on, but I did not realize it and missed the connection.
As I had opened that parameter for only the one cache I saw, I also opened it for the second one, to be consistent, in rDLDHGf308be477d71c570a2aa1f37e31c102454e6b75a.

I think 6e12c90b160ad3277a1edea27a05f9adea1bc92f may be a bad idea. Have you tested how much RAM it takes to hold the whole dirs dict in memory on a very large repo like mozilla-unified?

I think 6e12c90b160ad3277a1edea27a05f9adea1bc92f may be a bad idea. Have you tested how much RAM it takes to hold the whole dirs dict in memory on a very large repo like mozilla-unified?

Not on mozilla-unified, but on the huge archives which made the worker fail anyway back in the day (T329#17722 ;).

Yes, it expands quite a lot in RAM.

The context was to avoid T964 (which already happened outside the Mercurial loader, btw ;). It's just that we agree that it is a problem we need to address now.

So I went with this possibly 'bad' idea ;) but not only for that reason:

  1. I was missing an explanation for one disk cache too many (again due to my misunderstanding at the time).

Still, for each call to load_directories, we expand to 2 disk caches (up to 3 GiB, well, that depends on the applicative RAM boundary we set...) which are cleaned up (if the process is not killed).
So for me it was uncontrollable behavior (reducing this to 1 call sounded reasonable, at the expense of RAM indeed, since it's simpler to bound).

(I might also be biased in thinking that directory_missing will return few missing ids.)

So, T964-related: too much disk usage is no longer acceptable (OOM killed -> disk not cleaned up -> no longer enough disk space on workers -> consuming all remaining tasks without doing anything but critically expanding the logs... ;)

My aim was to:

  1. Control where the cache writes, somehow (it was only partially fixed until yesterday, thanks to your IRC ping and this note).
  2. Use less disk space in this loader (rDLDHGfd79b19a97c9ff7dc327cbd6155b82345af9dcec to compress stuff, and indeed rDLDHG6e12c90b160ad3277a1edea27a05f9adea1bc92f).
  3. Bound the RAM usage (@olasd did it for all loaders, 730c7245f4a89866a97181b2af49808967cbba91).
  4. Somehow bound the disk usage as well. There has been a discussion on IRC (#swh-sysadm). The best approach seemed to be what @olasd proposed:
15:05:01    +olasd | - for each instance of swh-worker@* unit, have a corresponding swh-worker-tmpfs@ mount, export the path to that as the TMPDIR envvar
15:05:21    +olasd | - make sure that when swh-worker@foo restarts, swh-worker-tmpfs@foo restarts
15:06:58    +olasd | - add a swapfile to the systems or reduce / and increase the swap size
15:08:10    +olasd | instead of exporting TMPDIR we can bind mount the tmpfs on the worker's /tmp using systemd's BindPaths
15:08:40    +olasd | the only unknown is the behavior of tmpfs when several of them are trying to use more than the available system memory

But I don't know where we stand in that regard.

  5. Filter out huge dump archives (> 200 MiB) as a first round.

In any case, if we agree that it is a bad idea (I'm not sure yet), this can be reverted.
As usual, nothing is set in stone.

There is always the other possibility that it's not one loader we want but multiple ones, depending on the size of those repositories.
Or, as you hinted, one loader with a dynamic cache parameter depending on the origin.

In any case, we would then need to determine the heuristic(s) to discriminate between those (number of changesets, branches, size, etc.).
(For the dumps, it could be as simple as the archive's size, maybe.)

Cheers,

I worry that RAM is way more constrained than disk space is. It seems like the biggest problem is/was

OOM killed -> disk not cleaned up

As long as the disk stays clean, using storage space should be better than being OOMed. But 6e12c90b160ad3277a1edea27a05f9adea1bc92f could possibly cause more OOMs and doesn't solve the disk cleaning.

My intuition is that it makes all scenarios worse.
With after-OOM cleanup, I think that there will be more OOMs but no real benefit since the disk is being cleaned up after OOMs anyway.
Without after-OOM cleanup, I think that there will be more OOMs and therefore more disk filling than before.

Can we associate the name of the temporary storage directory for a load with that loader's pid, and then make every new loader instance compare existing temp storage dirs during init? If a storage directory exists for a process that does not exist (because the process was killed) then it can be deleted.
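A minimal sketch of that idea, assuming the working directories embed the loader's pid in their name (the naming pattern and helpers here are illustrative assumptions, not the loader's current behavior):

import os
import re
import shutil
import tempfile

# Hypothetical naming scheme: swh.loader.mercurial.<pid>.<random>.tmp
DIR_PATTERN = re.compile(r'^swh\.loader\.mercurial\.(?P<pid>\d+)\..*\.tmp$')

def make_working_directory():
    # Create this loader's temp dir, tagging it with our pid.
    return tempfile.mkdtemp(prefix='swh.loader.mercurial.%d.' % os.getpid(),
                            suffix='.tmp', dir='/tmp')

def cleanup_stale_working_directories(tmp_root='/tmp'):
    # Remove temp dirs left behind by loaders whose process no longer exists
    # (e.g. because it was OOM-killed before its own cleanup could run).
    for name in os.listdir(tmp_root):
        match = DIR_PATTERN.match(name)
        if not match:
            continue
        pid = int(match.group('pid'))
        if not os.path.exists('/proc/%d' % pid):   # the process is gone
            shutil.rmtree(os.path.join(tmp_root, name), ignore_errors=True)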

It seems like the biggest problem is/was

Well, that's not clear...

Come to think of it again, you must be right.
Locally, when the task runs out of disk, it fails and cleans up the mess.

As long as the disk stays clean, using storage space is better than being OOMed.

So, in that regard, yes, indeed.

I forgot to mention that a discussion raised a possible workaround implementation.
A loader, for starters, could check for dangling files from a prior load (of the same nature) and potentially clean up the mess left behind...

Nobody likes it, but hell, that could still be a possible thing to implement to keep that darn disk clean.

Another implementation, much more 'separation of concerns' compliant, would be to register cleanup tasks.
As soon as the disk gets messed up (uncompressing a tarball or whatnot), register a cleanup task...
It remains to understand how to effectively get those tasks consumed... (we don't currently know when the initial job that spawned the cleanup task is effectively done, or do we?... I believe that's an interesting question ;)

But this change likely guarantees more OOMs.

Ok then.
I'll do more thorough checks and revert in that case (not now though).

Cheers,

gitlab-migration changed the status of subtask T908: mercurial loader: Define scheduler task(s) from Resolved to Migrated.
gitlab-migration changed the status of subtask T964: 2018-02-16 worker disk full postmortem from Resolved to Migrated.