@vlorentz I also encountered [1] this morning which might explain the large packfile...
Jun 8 2022
So the first fork ingestion finished and took less time.
Note that the first repo run took 134:40:21 (after multiple iterations, so maybe more than that actually), so even if the fork ingestion takes like ~10h, that'd be much quicker already ¯\_(ツ)_/¯ (it's been ongoing for ~52min now)
Jun 7 2022
Which one has that many more commits, the initial one?
Yes
If so, i would expect the fork to be loaded way faster since they should have a shared history at some point in the past.
I would have expected it not to run out of memory (which was the point of the manual load), and it already failed that test
Which one has that many more commits, the initial one?
initial load of a different repository, which has 338k more commits
Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.
In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?
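For reference, a minimal sketch of where those debug lines would go (this is not the actual loader code: only the two quoted attribute names are real, everything else is stubbed out):

```python
import logging

logger = logging.getLogger(__name__)


class LoaderPrepareSketch:
    """Stand-in for GitLoader; only the two debugged attributes are real names."""

    def __init__(self) -> None:
        self.parent_origins = None  # filled in by the real prepare()
        self.statsd = type("Statsd", (), {"constant_tags": {}})()

    def prepare(self) -> None:
        # ... the real prepare() resolves the previous snapshot and any
        # parent origins here ...
        # constant_tags shows whether the visit got tagged as having parent
        # origins; parent_origins shows what the loader actually detected.
        logger.debug("statsd constant_tags: %r", self.statsd.constant_tags)
        logger.debug("parent_origins: %r", self.parent_origins)
```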
Loader crashed with memory issues. Probably too much loading in parallel.
Currently stopping the worker's other processes to let this one finish (I'll restart it).
Triggered a run to ingest a fork (extra arguments needed with the CLI) on the production worker:
Success for production worker [1]. Staging worker is still working on it.
Jun 2 2022
Heads up: ingestion is still ongoing, with memory consumption remaining quite stable.
Jun 1 2022
Status update: both worker1.staging and worker17 are past the packfile size limit step where they usually crash \o/ [1].
8g (pack size limit) was not enough either; it broke on both workers ¯\_(ツ)_/¯.
We have no clue what size limit would be appropriate, so I'm clearly taking shots in the dark.
I've started a 32g experiment in worker1.staging and 64g in worker17.
We will see.
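For context, a sketch of how such an experiment could be wired up, assuming the loader exposes a pack_size_bytes parameter (its name in swh.loader.git around this time); the storage and origin URL below are placeholders, not the real ones used on the workers:

```python
from swh.loader.git.loader import GitLoader
from swh.storage import get_storage

GIB = 1024 ** 3

storage = get_storage("memory")  # placeholder; workers use a remote storage
loader = GitLoader(
    storage=storage,
    url="https://example.org/huge-repo.git",  # placeholder origin
    pack_size_bytes=64 * GIB,  # the 64g experiment on worker17; 32 * GIB on worker1.staging
)
loader.load()  # would fetch the packfile, now allowed up to 64 GiB
```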
worker17 is complaining as well, though somehow differently.
Same version on both, though [2].
Ok, as expected, it does not work as is [1] ;)
Second run then, with twice the current packfile size limit [2].
I've triggered a run on worker1.staging [1] and worker17 as is for now.
We'll see for the pack file size limit after that run fails (if it does).
May 30 2022
May 20 2022
I did some profiling early this week, and found the following when incrementally loading a Linux fork we had already visited:
May 16 2022
In T4219#84994, @vlorentz wrote:
This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".
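A sketch of that idea with hypothetical helper names: union the heads known from both snapshots and only request refs pointing at commits outside that set, so the negotiated packfile contains just the fork's new commits:

```python
from typing import Dict, List, Set


def heads_to_advertise(own_heads: Set[bytes], parent_heads: Set[bytes]) -> Set[bytes]:
    # Anything reachable from either snapshot is already in the archive and
    # does not need to be downloaded again.
    return own_heads | parent_heads


def determine_wants(remote_refs: Dict[bytes, bytes], known: Set[bytes]) -> List[bytes]:
    # Only request remote refs pointing at commits we have never archived.
    return [sha for _ref, sha in remote_refs.items() if sha not in known]


known = heads_to_advertise({b"c1", b"c2"}, {b"c2", b"c3"})
wants = determine_wants({b"refs/heads/main": b"c4", b"refs/heads/old": b"c1"}, known)
assert wants == [b"c4"]  # only the fork's new head gets fetched
```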
May 13 2022
Dashboard to check for improvements [1]
https://grafana.softwareheritage.org/d/FqGC4zu7z/vlorentz-loader-metrics?orgId=1&var-environment=production&var-interval=1h&var-visit_type=git&var-has_parent_origins=True shows we spend a considerable amount of time loading data from git repositories with an existing visit + a parent:
May 10 2022
Currently can't do it on GitLab while logged out: https://gitlab.com/gitlab-org/gitlab/-/issues/361952
May 6 2022
May 3 2022
May 2 2022
Apr 29 2022
I'm closing this. I've submitted T4216 to track the actual packfile limit issue.
Apr 28 2022
Apr 20 2022
Apr 11 2022
Dealt with (at least in terms of only using HTTPS for clones)
- this ensures that objects loaded in the archive are self-consistent
- but this increases the processing needed to load git repositories (i.e. it will slow them down)
- this adds "redundant" data to the extid table (space usage increase)
- but this is a cheaper way to enable global deduplication until the topological order is guaranteed (see the sketch after this list)
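A sketch of such a "redundant" extid row, assuming swh.model's ExtID shape at the time (the "git-sha1" type label is hypothetical); for git, the extid bytes and the target SWHID's object id are the same, which is exactly why the row is redundant:

```python
from swh.model.model import ExtID
from swh.model.swhids import CoreSWHID, ObjectType

sha1_git = bytes.fromhex("95d09f2b10159347eece71399a7e2e907ea3df4f")  # example id

extid = ExtID(
    extid_type="git-sha1",  # hypothetical type label
    extid=sha1_git,
    target=CoreSWHID(object_type=ObjectType.REVISION, object_id=sha1_git),
)
# storage.extid_add([extid])  # placeholder storage call
```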
Mar 17 2022
The worst case scenario is that someone maliciously creates repositories generated on the fly that refer to each other via .gitmodules, so we end up in an infinite loop of loading garbage.
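A sketch of the obvious guard, with hypothetical helpers: track origins already seen in a traversal so mutually-referencing .gitmodules files can't cause an infinite loop:

```python
from typing import Callable, Iterable, Optional, Set


def schedule_submodules(origin_url: str,
                        submodule_urls_of: Callable[[str], Iterable[str]],
                        seen: Optional[Set[str]] = None) -> Set[str]:
    seen = set() if seen is None else seen
    if origin_url in seen:
        return seen  # already visited: break the cycle
    seen.add(origin_url)
    for sub_url in submodule_urls_of(origin_url):  # parsed from .gitmodules
        schedule_submodules(sub_url, submodule_urls_of, seen)
    return seen


# Two repositories referencing each other terminate instead of looping:
graph = {"repo-a": ["repo-b"], "repo-b": ["repo-a"]}
assert schedule_submodules("repo-a", lambda url: graph.get(url, [])) == {"repo-a", "repo-b"}
```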
Mar 16 2022
In T3311#80997, @olasd wrote:
I'm not comfortable always creating high priority tasks in this context either, as I'm not sure what the throttling implications are when we inevitably end up on a repository that references a commit in a submodule that doesn't exist.
I think the approach in D7332 is interesting, but it feels a bit expensive to be doing it for every instance of a .gitmodules file found in any new directory for all git repos that are being loaded, as well as doing it again for the top level of any known branch in the git snapshot being loaded currently.
Mar 14 2022
It's been more or less discussed above, but IMHO it would make sense to:
Mar 10 2022
Mar 7 2022
Somewhat related task: T3311
Feb 21 2022
Feb 18 2022
Jan 24 2022
https://github.com/dulwich/dulwich/pull/927 (actually, this doesn't expose the offset_bytes yet)
This will need a patch to Dulwich to fix properly. I'll use the opportunity to make Dulwich expose offset_bytes so we don't have to re-parse it ourselves
Sentry issue: SWH-LOADER-GIT-TJ
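For context, a sketch of the offset_bytes idea (not Dulwich's eventual API): keep the raw timezone bytes next to the parsed value, so unusual or malformed offsets still round-trip to the same git hash when the commit is rebuilt:

```python
def parse_offset(raw: bytes) -> int:
    """Parse a git timezone offset like b'+0200' into minutes."""
    sign = -1 if raw[:1] == b"-" else 1
    return sign * (int(raw[1:3]) * 60 + int(raw[3:5]))


raw_offset = b"-0000"  # "negative UTC": parses to 0 but is not b"+0000"
assert parse_offset(raw_offset) == 0
# Keeping raw_offset alongside the parsed value avoids re-serializing it as
# b"+0000", which would change the rebuilt commit's hash.
```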
Jan 11 2022
I guess this is then related to T3653 somehow
Oh, and now that we've moved workers to have a large swap space, the issue of downloading the full packfile in RAM before rejecting it should be less disruptive than it's been in the past (when the whole worker would get killed because it ran out of its memory allocation).
For now, I've disabled our hardcoding of the TCP transport for GitHub origins.
The first GitHub milestone was reached today.
All the GitHub loadings have been failing since 00:00 UTC.
Nov 15 2021
Nov 4 2021
In T3627#73323, @zack wrote:
Thanks for the summaries @olasd, both here and on list.
I've followed up on list.
Meanwhile here's what I propose we do (spoiler!):
a) A4: add to the archive Merkle DAG only the filtered snapshot (referencing "intrinsic" branches only, as per A2) and its transitive closure
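A sketch of the A4 filtering, assuming "intrinsic" means refs created by the repository itself (refs/heads/, refs/tags/, HEAD) as opposed to forge-generated refs such as GitHub's refs/pull/* — the exact definition being part of the discussion on list:

```python
from typing import Dict

INTRINSIC_PREFIXES = (b"refs/heads/", b"refs/tags/")


def filter_intrinsic(branches: Dict[bytes, bytes]) -> Dict[bytes, bytes]:
    return {
        name: target
        for name, target in branches.items()
        if name == b"HEAD" or name.startswith(INTRINSIC_PREFIXES)
    }


refs = {
    b"HEAD": b"c1",
    b"refs/heads/main": b"c1",
    b"refs/pull/42/head": b"c2",  # extrinsic, forge-generated: dropped
}
assert b"refs/pull/42/head" not in filter_intrinsic(refs)
```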
Oct 31 2021
Thanks for the summaries @olasd, both here and on list.
I've followed up on list.
Oct 26 2021
Related to T3627