Nov 4 2022
swh.loader.git 2.1.0 has now been deployed on all workers.
Sep 27 2022
Sentry issue: SWH-LOADER-GIT-198
Aug 11 2022
Sentry issue: SWH-LOADER-GIT-16G
Aug 6 2022
Polished up and shared the tool built to produce the refined list (priority.list.github).
It is now available at https://github.com/rdicosmo/swh-check-repositories
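The linked tool is the reference; purely to illustrate the kind of check involved (this is a sketch, not the actual implementation), a repository's existence, fork status and star count can be read from the public GitHub REST API. The token placeholder and helper names below are made up for the example; the endpoint and field names are the standard GitHub API ones:

    import requests

    GITHUB_API = "https://api.github.com"
    TOKEN = "ghp_..."  # placeholder personal access token (authenticated calls avoid the low rate limit)

    def check_repository(full_name: str):
        """Return minimal metadata if the repository still exists on GitHub, else None."""
        resp = requests.get(
            f"{GITHUB_API}/repos/{full_name}",
            headers={"Authorization": f"token {TOKEN}"},
        )
        if resp.status_code == 404:  # deleted or made private: no longer "still on GitHub"
            return None
        resp.raise_for_status()
        data = resp.json()
        return {"full_name": data["full_name"], "fork": data["fork"], "stars": data["stargazers_count"]}

    def refine(full_names):
        """Keep repositories that still exist and are not forks, ordered by stars (descending)."""
        kept = [r for r in map(check_repository, full_names) if r and not r["fork"]]
        return sorted(kept, key=lambda r: r["stars"], reverse=True)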
Aug 4 2022
In T4400#89007, @rdicosmo wrote:
@ardumont here is the subset of the repositories on GitHub, ordered by number of stars, that are:
- still on GitHub
- not a fork
In T4400#88794, @ardumont wrote: I took the opportunity to retrieve the large origins [7] out of the sentry issue listing [6] (cf. description) [8],
and scheduled those in the large queues after the ones scheduled out of the scanoss exchange. If they're considered useless at some point, feel free to dismiss them (by purging the queue).
[7] 28310 unique origins -> loader-git.pack-file-too-big-issue-5823.urls.txt.gz (288 KB)
[8] command used to create the listing out of sentry, in a venv (snippets repository) on worker1.staging:
(sentry-U52ipwI-) ardumont@worker1:~/snippets/ardumont/sentry% python -m list-urls-from-issue --project-name swh-loader-git --event-id 5823 | tee loader-git.pack-file-too-big-issue-5823.urls.txt ...
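The actual snippet lives in the snippets repository; below is only a rough sketch of the same idea, assuming Sentry's REST API (/api/0/issues/<issue_id>/events/), a bearer token, and that each event carries the origin URL as a tag. The base URL, token and tag key are placeholders, not the real values used on worker1.staging:

    import requests

    SENTRY_URL = "https://sentry.example.org"  # placeholder, not the real instance URL
    SENTRY_TOKEN = "..."                       # placeholder API token (assumed event:read scope)
    TAG_KEY = "origin_url"                     # hypothetical tag key carrying the origin URL

    def iter_origin_urls(issue_id: str):
        """Yield origin URLs tagged on the events of a Sentry issue (first page only)."""
        resp = requests.get(
            f"{SENTRY_URL}/api/0/issues/{issue_id}/events/",
            headers={"Authorization": f"Bearer {SENTRY_TOKEN}"},
        )
        resp.raise_for_status()
        # NB: a real run has to follow Sentry's cursor pagination (Link header) to
        # cover all ~28k events; only the first page is read here to keep it short.
        for event in resp.json():
            for tag in event.get("tags", []):
                if tag.get("key") == TAG_KEY:
                    yield tag["value"]

    if __name__ == "__main__":
        for url in sorted(set(iter_origin_urls("5823"))):
            print(url)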
Aug 2 2022
Since the normal ingestion is mostly done (1 last normal ingestion ongoing), I've now made workers 17-18 consume 1 more task from the large-repositories queue as well (vs. letting them twiddle their thumbs ;).
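As background (an assumption about the deployment, not something stated in this thread): the loaders run as Celery workers, and one way to make already-running workers consume from an additional queue is Celery's remote-control API. The broker URL, queue name and worker node names below are placeholders:

    from celery import Celery

    # placeholder broker URL; the real one lives in the swh worker configuration
    app = Celery(broker="amqp://broker.example.org//")

    # ask two (hypothetical) worker nodes to also consume from a
    # hypothetical "large repositories" queue
    app.control.add_consumer(
        "loader-git-large",                                   # placeholder queue name
        destination=["celery@worker17", "celery@worker18"],   # placeholder node names
        reply=True,
    )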
Aug 1 2022
At this point in time:
- 1 "normal" origin
- 22 "large" origins
Jul 25 2022
fwiw, large repositories are taking their sweet time, but they're on their way.
Jul 19 2022
It's currently ingesting [1].
Jun 16 2022
All production loaders have been restarted now.
Jun 10 2022
Let's do this the other way around, closing this as I'm done.
Please reopen if you need something else.
Jun 9 2022
@vlorentz I don't have anything left to do, can I close it now?
And the 2nd fork ingestion is done as well:
swhworker@worker17:~$ url=https://github.com/Tomahawkd/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork2.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'
Enumerating objects: 12661350, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 12661350 (delta 140), reused 135 (delta 135), pack-reused 12661159
INFO:swh.loader.git.loader:Listed 15230 refs for repo https://github.com/Tomahawkd/chromium
INFO:swh.loader.git.loader.GitLoader:Fetched 12661351 objects; 2 are new
self.statsd.constant_tags: {'visit_type': 'git', 'incremental_enabled': True, 'has_parent_origins': True, 'has_parent_snapshot': True, 'has_previous_snapshot': False}
self.parent_origins: [Origin(url='https://github.com/chromium/chromium', id=b'\xa9\xf66\xa1/\\\xc3\\\xa4\x18+\r\xe7L\x91\x94\xe9\x00\x96J')]
{'status': 'eventful'} for origin 'https://github.com/Tomahawkd/chromium'
        Command being timed: "swh loader run git https://github.com/Tomahawkd/chromium lister_name=github lister_instance_name=github pack_size_bytes=34359738368"
        User time (seconds): 62323.33
        System time (seconds): 3001.76
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 25:03:29
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 29352136
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 8
        Minor (reclaiming a frame) page faults: 10355329
        Voluntary context switches: 265156
        Involuntary context switches: 265330
        Swaps: 0
        File system inputs: 2048
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
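(For reference, pack_size_bytes=34359738368 is 32 x 1024^3 bytes, i.e. the 32 GiB limit that the "32g" in the output filename refers to.)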
I've restarted the staging workers (loader_git and loader_high_priority) with the new dulwich version
heh, ok, so it's indeed because github sends us way too much
Status: the second fork ingestion is done (it finished before the other one, which is still ongoing) [1]
Jun 8 2022
That's still a pretty big packfile, ~12.6G [1]... I'm pondering whether I should stop it,
install the new python3-dulwich that olasd packaged, and trigger it again...
fwiw, jenkins is python3-dulwich aware.
I don't see the point of that for packages that can be backported with no changes, which is what I had done before, so I admit I hadn't even looked.
So the first fork ingestion finished and took less time.
Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.
In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?
jsyk, I've edited the file accordingly and triggered another fork ingestion:
swhworker@worker17:~$ url=https://github.com/Tomahawkd/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork2.txt INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'
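The edit in question was presumably along these lines (a sketch only; per the request above it goes at the end of the prepare function in swh/loader/git/loader.py, with the surrounding code elided here):

    # sketch of the temporary debug edit: two prints at the end of
    # GitLoader.prepare(); these produce the "self.statsd.constant_tags: ..."
    # and "self.parent_origins: ..." lines visible in the Jun 9 log above
    print("self.statsd.constant_tags:", self.statsd.constant_tags)
    print("self.parent_origins:", self.parent_origins)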
In T4311#86497, @ardumont wrote: fwiw, jenkins is python3-dulwich aware.
I checked that the swh.loader.git tests are green with the new dulwich version.
@vlorentz I also encountered [1] this morning which might explain the large packfile...
So the first fork ingestion finished and took less time.
Note that the first repo run took 134:40:21 (after multiple iterations, so maybe more than that actually), so even if the fork ingestion takes like ~10h, that'd be much quicker already ¯\_(ツ)_/¯ (it's been ongoing for ~52 min now)
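(For scale: 134:40:21 of wall-clock time is about 5.6 days; the second fork ingestion reported in the Jun 9 log above finished in 25:03:29, i.e. roughly 5x faster.)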
Jun 7 2022
Which one has that many more commits, the initial one?
Yes
If so, I would expect the fork to be loaded way faster, since they should have a shared history at some point in the past.
I would have expected it not to run out of memory (which was the point of the manual load), and it already failed that test
Which one has that many more commits, the initial one?
initial load of a different repository, which has 338k more commits
Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.
In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?
Loader crashed with memory issues. Probably too much loading in parallel.
Currently stopping the worker's other processes to let this one finish (I'll restart it).