stuff related to https://forge.softwareheritage.org/diffusion/DLDG/
Thu, Jun 16
All production loaders have been restarted now.
Fri, Jun 10
Let's do this the other way around: closing this as I'm done.
Please reopen if you need something else.
Thu, Jun 9
@vlorentz I don't have anything left to do, can I close it now?
And the 2nd fork ingestion is done as well:
swhworker@worker17:~$ url=https://github.com/Tomahawkd/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork2.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'
Enumerating objects: 12661350, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 12661350 (delta 140), reused 135 (delta 135), pack-reused 12661159
INFO:swh.loader.git.loader:Listed 15230 refs for repo https://github.com/Tomahawkd/chromium
INFO:swh.loader.git.loader.GitLoader:Fetched 12661351 objects; 2 are new
self.statsd.constant_tags: {'visit_type': 'git', 'incremental_enabled': True, 'has_parent_origins': True, 'has_parent_snapshot': True, 'has_previous_snapshot': False}
self.parent_origins: [Origin(url='https://github.com/chromium/chromium', id=b'\xa9\xf66\xa1/\\\xc3\\\xa4\x18+\r\xe7L\x91\x94\xe9\x00\x96J')]
{'status': 'eventful'} for origin 'https://github.com/Tomahawkd/chromium'
    Command being timed: "swh loader run git https://github.com/Tomahawkd/chromium lister_name=github lister_instance_name=github pack_size_bytes=34359738368"
    User time (seconds): 62323.33
    System time (seconds): 3001.76
    Percent of CPU this job got: 72%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 25:03:29
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 29352136
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 8
    Minor (reclaiming a frame) page faults: 10355329
    Voluntary context switches: 265156
    Involuntary context switches: 265330
    Swaps: 0
    File system inputs: 2048
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
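For readability, the resource figures from that run work out as follows (pure arithmetic on the numbers logged above, nothing re-measured; the "kbytes" reported by /usr/bin/time are treated as KiB):

```python
# Convert the /usr/bin/time figures from the run above.
max_rss_kib = 29_352_136                      # "Maximum resident set size (kbytes)"
print(f"max RSS: {max_rss_kib / 1024**2:.1f} GiB")          # ~28.0 GiB
user_s, sys_s = 62323.33, 3001.76             # user/system CPU time in seconds
wall_s = 25 * 3600 + 3 * 60 + 29              # elapsed 25:03:29
print(f"CPU utilisation: {(user_s + sys_s) / wall_s:.0%}")  # ~72%, matching the report
```

So even with the 32 GiB pack size limit, the run peaked at roughly 28 GiB of resident memory.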
I've restarted the staging workers (loader_git and loader_high_priority) with the new dulwich version
heh, ok, so it's indeed because GitHub sends us way too much data
Status: second fork ingestion done (before the other one, which is still ongoing) [1]
Wed, Jun 8
That's still a pretty big packfile, ~12.6G [1]... I'm pondering whether I should stop it,
install the new python3-dulwich package olasd built, and re-trigger it...
fwiw, jenkins is python3-dulwich aware.
I don't see the point of that for packages that can be backported with no changes, which is what I had done before, so I admit I hadn't even looked.
So the first fork ingestion finished and took less time.
Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.
In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?
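In case it helps, a minimal sketch of that edit (the existing body of prepare() in swh/loader/git/loader.py stays as is; only the two prints at the end are new):

```python
    def prepare(self) -> None:
        ...  # existing preparation logic, unchanged
        # Temporary debugging: show whether the incremental/fork heuristics fired.
        print("self.statsd.constant_tags:", self.statsd.constant_tags)
        print("self.parent_origins:", self.parent_origins)
```

The two extra lines visible in the ingestion log further up come from these prints.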
jsyk, I've edited the file accordingly and re-triggered another fork ingestion:
swhworker@worker17:~$ url=https://github.com/Tomahawkd/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork2.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'
I checked that the swh.loader.git tests are green with the new dulwich version.
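For reference, a minimal sketch of that check, assuming the tests are importable as swh.loader.git.tests (as laid out in the source tree) and independent of any CI setup:

```python
# Re-run the swh.loader.git test suite against the installed dulwich.
import dulwich
import pytest

print("dulwich version:", dulwich.__version__)  # sanity check that the new version is picked up
raise SystemExit(pytest.main(["--pyargs", "swh.loader.git.tests", "-x"]))
```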
@vlorentz I also encountered [1] this morning which might explain the large packfile...
Note that the first repo run took 134:40:21 (after multiple iterations, so maybe more than that actually), so even if the fork ingestion takes ~10h, that'd be much quicker already ¯\_(ツ)_/¯ (it has been ongoing for ~52min now)
Tue, Jun 7
Which one has that many more commits, the initial one?
Yes
If so, I would expect the fork to be loaded way faster, since they should have a shared history at some point in the past.
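To illustrate that expectation with plain dulwich (this is not the loader's actual code path; the local path is hypothetical): when the local object store already contains the shared history, the pack negotiation advertises it as "haves", so the server should only send the objects unique to the fork.

```python
from dulwich.client import get_transport_and_path
from dulwich.repo import Repo

# Hypothetical local repository that already holds chromium/chromium's history.
local = Repo("/srv/mirrors/chromium")
client, path = get_transport_and_path("https://github.com/Tomahawkd/chromium")
# The default graph walker reports everything already in `local` as "haves",
# so the negotiated pack should only contain the fork-specific objects.
result = client.fetch(path, local)
print(f"{len(result.refs)} refs fetched")
```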
I would have expected it not to run out of memory (which was the point of the manual load), and it already failed that test
initial load of a different repository, which has 338k more commits
The loader crashed with memory issues, probably because of too many loads running in parallel.
Currently stopping the worker's other processes to let this one finish (I'll restart them afterwards).
Triggered a run to ingest a fork (extra arguments are needed with the CLI) on a production worker:
Success for the production worker [1]. The staging worker is still working on it.
Thu, Jun 2
Heads up: the ingestion is still ongoing, with memory consumption staying quite stable.
Wed, Jun 1
Status update: both worker1.staging and worker17 are past the pack file limit step where they usually crash \o/:
8g (pack size limit) was not enough either; it broke on both workers ¯\_(ツ)_/¯.
We have no clue what size limit would be enough, so I'm clearly taking shots in the dark.
I've started a 32g experiment in worker1.staging and 64g in worker17.
We will see.
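For the record, these human-readable limits translate into the pack_size_bytes values passed to the loader as follows (simple arithmetic; 34359738368 is the value visible in the commands elsewhere in this thread):

```python
GiB = 2**30
for label, gib in [("8g", 8), ("32g", 32), ("64g", 64)]:
    print(f"{label}: pack_size_bytes={gib * GiB}")
# 8g:  pack_size_bytes=8589934592
# 32g: pack_size_bytes=34359738368  <- the value used in the later runs above
# 64g: pack_size_bytes=68719476736
```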
worker17 is complaining as well, but somewhat differently.
Same version on both, though [2].
Ok, as expected, it does not work as is [1] ;)
Second run, then, with twice the current pack file limit [2].
I've triggered a run on worker1.staging [1] and worker17 as is for now.
We'll look into the pack file size limit after that run fails (if it does).
May 30 2022
May 20 2022
I did some profiling early this week, and found that when incrementally loading a linux fork we already visited:
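As an aside, a minimal sketch of how such a profiling run can be set up with the standard library; run_loader below is a hypothetical placeholder for however the git loader is instantiated and invoked in your environment, not an actual swh API:

```python
import cProfile
import pstats

def run_loader() -> None:
    ...  # construct the git loader for the linux fork under test and call its load()

cProfile.run("run_loader()", "git-loader.prof")
pstats.Stats("git-loader.prof").sort_stats("cumulative").print_stats(25)
```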