Triggered a run to ingest a fork (extra arguments needed with the cli) on production worker:
Jun 7 2022
Success for production worker [1]. Staging worker is still working on it.
Jun 2 2022
Heads up: ingestion is still ongoing, and memory consumption has been fairly stable.
Jun 1 2022
Status update, both worker1.staging and worker17 are beyond the failing step of pack
file limit where they usually crash \o/ [1].
8g (pack size limit) was not enough either, it broke on both workers ¯\_(ツ)_/¯.
We have no clue what the size limit should be set to, so I'm clearly taking shots in the dark.
I've started a 32g experiment in worker1.staging and 64g in worker17.
We will see.
worker17 is complaining as well, but somehow differently.
Same version on both, though [2].
Ok, expectedly, it does not work as is [1] ;)
Second run, then, with twice the current pack file limit [2].
I've triggered a run on worker1.staging [1] and worker17 as is for now.
We'll see for the pack file size limit after that run fails (if it does).
May 30 2022
May 20 2022
I did some profiling early this week, and found that when incrementally loading a linux fork we already visited:
May 16 2022
In T4219#84994, @vlorentz wrote: This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".
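A minimal sketch of the idea quoted above, assuming a snapshot can be reduced to a plain mapping from branch name to target object id. The `Snapshot` alias and the `known_heads` and `refs_to_fetch` helpers are hypothetical names made up for illustration, not the loader's actual API.

```python
from typing import Dict, Optional, Set

# Hypothetical, simplified snapshot: branch name -> id of the object it points to.
Snapshot = Dict[bytes, bytes]


def known_heads(
    last_origin_snapshot: Optional[Snapshot],
    last_parent_snapshot: Optional[Snapshot],
) -> Set[bytes]:
    """Union of the branch targets of both snapshots (either may be missing)."""
    heads: Set[bytes] = set()
    for snapshot in (last_origin_snapshot, last_parent_snapshot):
        if snapshot:
            heads.update(snapshot.values())
    return heads


def refs_to_fetch(remote_refs: Snapshot, known: Set[bytes]) -> Snapshot:
    """Keep only the remote refs whose target is not already a known head."""
    return {name: target for name, target in remote_refs.items() if target not in known}


# Made-up ids: the fork was visited once, its parent was visited separately.
fork_snapshot = {b"refs/heads/feature": b"b" * 20}
parent_snapshot = {b"refs/heads/master": b"a" * 20}
remote_refs = {
    b"refs/heads/feature": b"b" * 20,   # unchanged since the fork's last visit
    b"refs/heads/master": b"c" * 20,    # new commit, not seen anywhere yet
    b"refs/heads/upstream": b"a" * 20,  # already archived through the parent
}

known = known_heads(fork_snapshot, parent_snapshot)
wanted = refs_to_fetch(remote_refs, known)
```

With only the fork's own snapshot as a base, `refs/heads/upstream` would look new and the parent's history behind it would be requested again. Adding the parent's snapshot to the base is what removes it from `wanted`.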
May 13 2022
Dashboard to check for improvements [1]
https://grafana.softwareheritage.org/d/FqGC4zu7z/vlorentz-loader-metrics?orgId=1&var-environment=production&var-interval=1h&var-visit_type=git&var-has_parent_origins=True shows we spend a considerable amount of time loading data from git repositories with an existing visit + a parent:
May 10 2022
Currently can't do it on GitLab while logged out: https://gitlab.com/gitlab-org/gitlab/-/issues/361952
May 6 2022
May 3 2022
May 2 2022
Apr 29 2022
I'm closing this. I've submitted T4216 to track the actual packfile limit issue.
Apr 28 2022
Apr 20 2022
Apr 11 2022
dealt with (at least in terms of only using https for clones)
- this ensures that objects loaded in the archive are self-consistent
- but this increases the processing needed to load git repositories (i.e. it will slow them down)
- this adds "redundant" data to the extid table (space usage increase)
- but this is a cheaper way to enable global deduplication until the topological order is guaranteed
Mar 17 2022
The worst case scenario is that someone maliciously creates repositories generated on the fly that refer to each other via .gitmodules, so we end up in an infinite loop of loading garbage.
Mar 16 2022
In T3311#80997, @olasd wrote: I'm not comfortable always creating high priority tasks in this context either, as I'm not sure what the throttling implications are when we inevitably end up on a repository that references a commit in a submodule that doesn't exist.
I think the approach in D7332 is interesting, but it feels a bit expensive to be doing it for every instance of a .gitmodules file found in any new directory for all git repos that are being loaded, as well as doing it again for the top level of any known branch in the git snapshot being loaded currently.
Mar 14 2022
It's been more or less discussed above, but IMHO it would make sense to:
Mar 10 2022
Mar 7 2022
Somewhat related task: T3311
Feb 21 2022
Feb 18 2022
Jan 24 2022
https://github.com/dulwich/dulwich/pull/927 (actually, this doesn't expose the offset_bytes yet)
This will need a patch to Dulwich to fix properly. I'll use the opportunity to make Dulwich expose offset_bytes so we don't have to re-parse it ourselves
Sentry issue: SWH-LOADER-GIT-TJ
Jan 11 2022
I guess this is then related to T3653 somehow
Oh, and now that we've moved workers to have a large swap space, the issue of downloading the full packfile in ram before rejecting it should be less disruptive than it's been in the past (where the whole worker would get killed because it ran out of its memory allocation).
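For illustration only, here is a sketch of what rejecting an oversized packfile while it is still being received could look like, as opposed to buffering it entirely first. This is an assumption about one possible approach, not a description of what the loader does; `PackTooLarge`, `bounded_chunks` and `read_pack` are hypothetical names.

```python
from typing import Iterable, Iterator


class PackTooLarge(Exception):
    """Raised as soon as the received data exceeds the configured limit."""


def bounded_chunks(chunks: Iterable[bytes], limit: int) -> Iterator[bytes]:
    """Yield chunks unchanged, but stop with an error once `limit` bytes are exceeded."""
    received = 0
    for chunk in chunks:
        received += len(chunk)
        if received > limit:
            raise PackTooLarge(f"received {received} bytes, limit is {limit}")
        yield chunk


def read_pack(chunks: Iterable[bytes], limit: int) -> bytes:
    """Collect the pack in memory; a real consumer could spool to disk instead."""
    return b"".join(bounded_chunks(chunks, limit))
```

The check runs per chunk, so at most one chunk beyond the limit is ever read before the transfer is abandoned.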
For now, I've disabled our hardcoding of the TCP transport for GitHub origins.
The first github milestone was reached today.
All GitHub loading tasks have been failing since 00:00 UTC.
Nov 15 2021
Nov 4 2021
In T3627#73323, @zack wrote: Thanks for the summaries @olasd, both here and on list.
I've followed up on list. Meanwhile here's what I propose we do (spoiler!):
a) A4: add to the archive Merkle DAG only the filtered snapshot (referencing "intrinsic" branches only, as per A2) and its transitive closure
Oct 31 2021
Thanks for the summaries @olasd, both here and on list.
I've followed up on list.
Oct 26 2021
Related to T3627
Oct 25 2021
Oct 19 2021
Sent a summary of this discussion to the swh-devel list for input:
Oct 18 2021
In T3627#72544, @douardda wrote: B3 I am not convinced a "synthetic" flag on the Snapshot branch makes sense, or at least I find this name confusing, especially considering we already have a synthetic flag on Revision: it's not synthetic in the sense that it's not an object crafted by SWH, it comes from the origin.
I would like us to conclude this discussion soon.
Oct 15 2021
Now, I still don't understand what mapping is to be stored in the extid table. What is
the meaning of (version 0, sha1-git of the commit/tag, revision/release id) above? (I
expect a mapping to be a pair).
Oct 14 2021
In T3635#72206, @douardda wrote: Then I don't really get how this can help if we don't load revisions in topological order.
OK, I think what puzzles me in this description is the fact that the first two bullets of the "git loader adaptations" are actually a single point: at the end of a successful loading, store a mapping in the extid table.
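To make that single point concrete, here is a rough sketch that assumes the mapping is the triple mentioned earlier in the thread (version 0, sha1-git of the commit/tag, revision/release id). `ExtIDRow` and `extid_rows_after_load` are hypothetical names for illustration, not the actual storage model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ExtIDRow:
    """Hypothetical row: (version, sha1-git of the commit/tag, revision/release id)."""

    version: int
    git_sha1: bytes
    target_id: bytes


def extid_rows_after_load(
    loaded: Dict[bytes, bytes], load_succeeded: bool
) -> List[ExtIDRow]:
    """Build the rows to store, but only once the whole load has succeeded."""
    if not load_succeeded:
        return []  # a failed load records nothing
    return [ExtIDRow(0, sha, target) for sha, target in sorted(loaded.items())]


# Made-up ids: one commit and one tag, each mapped to its archived object id.
loaded_objects = {b"\x01" * 20: b"\x0a" * 20, b"\x02" * 20: b"\x0b" * 20}
rows = extid_rows_after_load(loaded_objects, load_succeeded=True)
```

Seen this way, the version is a constant tag on the row, and what remains is the pair (sha1-git, revision/release id).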