Jun 8 2022

ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

@vlorentz I also encountered [1] this morning which might explain the large packfile...

Jun 8 2022, 3:30 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

So the first fork ingestion finished and took less time.

Jun 8 2022, 3:20 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Note that the first repo run took 134:40:21 (after multiple iterations, so possibly more than that), so even if the fork ingestion takes ~10h, that would already be much quicker ¯\_(ツ)_/¯ (it has been running for ~52min now)

Jun 8 2022, 3:10 PM · System administration, Git loader

Jun 7 2022

ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Which one has that many more commits, the initial one?

Yes

If so, i would expect the fork to be loaded way faster since they should have a shared history at some point in the past.

I would have expected it not to run out of memory (which was the point of the manual load), and it already failed that test

Jun 7 2022, 4:28 PM · System administration, Git loader
vlorentz added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Which one has that many more commits, the initial one?

Jun 7 2022, 4:24 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

initial load of a different repository, which has 338k more commits

Jun 7 2022, 4:16 PM · System administration, Git loader
vlorentz added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

initial load of a different repository, which has 338k more commits

Jun 7 2022, 3:36 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.

In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?

Jun 7 2022, 3:34 PM · System administration, Git loader
vlorentz added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.

Jun 7 2022, 3:30 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Loader crashed with memory issues. Probably too many loads running in parallel.
Currently stopping the worker's other processes to let this one finish (I'll restart it).

Jun 7 2022, 3:12 PM · System administration, Git loader
anlambert triaged T4311: Package and deploy dulwich 0.20.43 in production as Normal priority.
Jun 7 2022, 2:45 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Triggered a run to ingest a fork (extra arguments needed with the cli) on production worker:

Jun 7 2022, 10:59 AM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Success for production worker [1]. Staging worker is still working on it.

Jun 7 2022, 9:25 AM · System administration, Git loader

Jun 2 2022

ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Heads up: ingestion is still ongoing, and memory consumption has been quite stable.

Jun 2 2022, 1:40 PM · System administration, Git loader

Jun 1 2022

ardumont moved T4283: Load https://github.com/chromium/chromium with a higher packfile size limit from code-review/await-feedback/pause to deployed/landed/monitoring on the System administration board.
Jun 1 2022, 5:37 PM · System administration, Git loader
ardumont moved T4283: Load https://github.com/chromium/chromium with a higher packfile size limit from in-progress to code-review/await-feedback/pause on the System administration board.
Jun 1 2022, 5:37 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Status update: both worker1.staging and worker17 are past the pack file limit step where they usually crash \o/ [1].

Jun 1 2022, 5:16 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

I've started a 32g experiment in worker1.staging and 64g in worker17.

Jun 1 2022, 4:24 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

8g (pack size limit) was not enough either, it broke on both workers ¯\_(ツ)_/¯.
We have no idea what the size limit should be, so I'm clearly taking shots in the dark.
I've started a 32g experiment in worker1.staging and 64g in worker17.
We will see.

Jun 1 2022, 4:04 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

worker17 is complaining as well but differently somehow.
same version for both though [2].

Jun 1 2022, 2:58 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Ok, expectedly, it does not work as is [1] ;)
Second run, then, with twice the current pack file limit [2].

Jun 1 2022, 2:54 PM · System administration, Git loader
ardumont changed the status of T4283: Load https://github.com/chromium/chromium with a higher packfile size limit from Open to Work in Progress.
Jun 1 2022, 2:35 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

I've triggered a run on worker1.staging [1] and worker17 as is for now.
We'll see for the pack file size limit after that run fails (if it does).

Jun 1 2022, 2:35 PM · System administration, Git loader

May 30 2022

vlorentz triaged T4283: Load https://github.com/chromium/chromium with a higher packfile size limit as Low priority.
May 30 2022, 3:41 PM · System administration, Git loader
vlorentz added a parent task for T3273: Use "fork" relationships to speed-up initial load of large repositories: T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.
May 30 2022, 3:41 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz added a subtask for T4283: Load https://github.com/chromium/chromium with a higher packfile size limit: T3273: Use "fork" relationships to speed-up initial load of large repositories.
May 30 2022, 3:41 PM · System administration, Git loader
vlorentz created T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.
May 30 2022, 3:40 PM · System administration, Git loader

May 20 2022

vlorentz added revisions to T4219: Investigate why GitHub fork detection did not bring a speed-up: D7873: Add an unweighted average for filtered_objects + fix existing metric name, D7876: Log summary of filtered objects in store_data.
May 20 2022, 3:54 PM · Origin-GitHub, Git loader
vlorentz added a revision to T4219: Investigate why GitHub fork detection did not bring a speed-up: D7871: Add metrics in store_data on ratios of objects already stored.
May 20 2022, 1:48 PM · Origin-GitHub, Git loader
vlorentz added a comment to T4219: Investigate why GitHub fork detection did not bring a speed-up.

I did some profiling early this week, and found that when incrementally loading a linux fork we already visited:

May 20 2022, 10:55 AM · Origin-GitHub, Git loader

May 16 2022

vlorentz added a comment to T4219: Investigate why GitHub fork detection did not bring a speed-up.

This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".

May 16 2022, 3:33 PM · Origin-GitHub, Git loader
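
The idea in the comment above can be illustrated with a minimal sketch. It is not the loader's code: `Snapshot`, `known_heads` and `wanted_refs` are hypothetical names, and a snapshot is simplified to a mapping from branch name to commit id. It only shows how taking the union of several base snapshots (the origin's and its parent's) shrinks the set of commits that still need to be requested.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stand-in: a snapshot maps a branch name to the id of the
# commit it points to. The real snapshot model is richer than this.
Snapshot = Dict[str, str]


def known_heads(base_snapshots: Iterable[Snapshot]) -> Set[str]:
    """Collect every commit id referenced by any of the base snapshots."""
    heads: Set[str] = set()
    for snapshot in base_snapshots:
        heads.update(snapshot.values())
    return heads


def wanted_refs(remote_refs: Dict[str, str],
                base_snapshots: Iterable[Snapshot]) -> List[str]:
    """Return the remote commit ids that no base snapshot already references."""
    have = known_heads(base_snapshots)
    return sorted({target for target in remote_refs.values() if target not in have})


# Made-up ids, for illustration only.
origin_last = {"refs/heads/main": "aaa1"}
parent_last = {"refs/heads/main": "bbb2", "refs/heads/stable": "ccc3"}
remote = {
    "refs/heads/main": "ddd4",
    "refs/heads/stable": "ccc3",
    "refs/heads/topic": "bbb2",
}

only_origin = wanted_refs(remote, [origin_last])
origin_and_parent = wanted_refs(remote, [origin_last, parent_last])
```

With only the origin's last snapshot as a base, three heads are still wanted; adding the parent's last snapshot leaves a single one.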

May 13 2022

ardumont added a subtask for T4219: Investigate why GitHub fork detection did not bring a speed-up: T4242: Deployed loader.git v1.8.
May 13 2022, 6:01 PM · Origin-GitHub, Git loader
ardumont added a parent task for T4242: Deployed loader.git v1.8: T4219: Investigate why GitHub fork detection did not bring a speed-up.
May 13 2022, 6:01 PM · System administration, Git loader
ardumont closed T4243: Deploy loader.metadata credentials for high and oneshot loaders as Resolved.
May 13 2022, 5:59 PM · System administration, Metadata Loaders, Git loader
ardumont moved T4243: Deploy loader.metadata credentials for high and oneshot loaders from deployed/landed/monitoring to Component upgrades on the System administration board.
May 13 2022, 5:59 PM · System administration, Metadata Loaders, Git loader
ardumont closed T4242: Deployed loader.git v1.8 as Resolved.
May 13 2022, 5:59 PM · System administration, Git loader
ardumont added a comment to T4242: Deployed loader.git v1.8.

Dashboard to check for improvements [1]

May 13 2022, 5:29 PM · System administration, Git loader
ardumont moved T4243: Deploy loader.metadata credentials for high and oneshot loaders from in-progress to deployed/landed/monitoring on the System administration board.
May 13 2022, 5:23 PM · System administration, Metadata Loaders, Git loader
ardumont moved T4242: Deployed loader.git v1.8 from in-progress to deployed/landed/monitoring on the System administration board.
May 13 2022, 5:23 PM · System administration, Git loader
ardumont renamed T4243: Deploy loader.metadata credentials for high and oneshot loaders from Deploy loader.metadata credentials for high and oneshot loader to Deploy loader.metadata credentials for high and oneshot loaders.
May 13 2022, 5:16 PM · System administration, Metadata Loaders, Git loader
ardumont updated the task description for T4243: Deploy loader.metadata credentials for high and oneshot loaders.
May 13 2022, 5:16 PM · System administration, Metadata Loaders, Git loader
ardumont added a revision to T4243: Deploy loader.metadata credentials for high and oneshot loaders: D7832: Deploy metadata loader credentials for high and oneshot loaders.
May 13 2022, 5:13 PM · System administration, Metadata Loaders, Git loader
ardumont changed the status of T4243: Deploy loader.metadata credentials for high and oneshot loaders from Open to Work in Progress.
May 13 2022, 5:12 PM · System administration, Metadata Loaders, Git loader
ardumont triaged T4243: Deploy loader.metadata credentials for high and oneshot loaders as Normal priority.
May 13 2022, 5:11 PM · System administration, Metadata Loaders, Git loader
ardumont changed the status of T4242: Deployed loader.git v1.8 from Open to Work in Progress.
May 13 2022, 4:55 PM · System administration, Git loader
ardumont updated the task description for T4242: Deployed loader.git v1.8.
May 13 2022, 4:55 PM · System administration, Git loader
ardumont triaged T4242: Deployed loader.git v1.8 as Normal priority.
May 13 2022, 4:53 PM · System administration, Git loader
olasd closed T4225: Deploy a more recent version of prometheus-statsd-exporter on all nodes, a subtask of T4219: Investigate why GitHub fork detection did not bring a speed-up, as Resolved.
May 13 2022, 4:20 PM · Origin-GitHub, Git loader
vlorentz added a revision to T3273: Use "fork" relationships to speed-up initial load of large repositories: D7831: Use all base snapshots in determine_wants().
May 13 2022, 3:23 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz added a revision to T4219: Investigate why GitHub fork detection did not bring a speed-up: D7831: Use all base snapshots in determine_wants().
May 13 2022, 3:23 PM · Origin-GitHub, Git loader
vlorentz updated subscribers of T4219: Investigate why GitHub fork detection did not bring a speed-up.

https://grafana.softwareheritage.org/d/FqGC4zu7z/vlorentz-loader-metrics?orgId=1&var-environment=production&var-interval=1h&var-visit_type=git&var-has_parent_origins=True shows we spend a considerable amount of time loading data from git repositories with an existing visit + a parent:

May 13 2022, 3:21 PM · Origin-GitHub, Git loader

May 10 2022

vlorentz added a comment to T3273: Use "fork" relationships to speed-up initial load of large repositories.

Currently can't do it on GitLab while logged out: https://gitlab.com/gitlab-org/gitlab/-/issues/361952

May 10 2022, 4:13 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader

May 6 2022

olasd changed the status of T4225: Deploy a more recent version of prometheus-statsd-exporter on all nodes, a subtask of T4219: Investigate why GitHub fork detection did not bring a speed-up, from Open to Work in Progress.
May 6 2022, 5:00 PM · Origin-GitHub, Git loader

May 3 2022

vlorentz removed a subtask for T3273: Use "fork" relationships to speed-up initial load of large repositories: T2202: Collect extrinsic metadata.
May 3 2022, 11:16 AM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz added subtasks for T3273: Use "fork" relationships to speed-up initial load of large repositories: T1740: fetch extrinsic origin metadata from GitHub, T2202: Collect extrinsic metadata.
May 3 2022, 11:16 AM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz added a subtask for T3273: Use "fork" relationships to speed-up initial load of large repositories: T4219: Investigate why GitHub fork detection did not bring a speed-up.
May 3 2022, 11:15 AM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz added a parent task for T4219: Investigate why GitHub fork detection did not bring a speed-up: T3273: Use "fork" relationships to speed-up initial load of large repositories.
May 3 2022, 11:15 AM · Origin-GitHub, Git loader

May 2 2022

vlorentz added revisions to T4219: Investigate why GitHub fork detection did not bring a speed-up: D7726: loader.core: Add statsd timing metrics, D7727: loader.core: Add statsd metrics on collected metadata.
May 2 2022, 3:29 PM · Origin-GitHub, Git loader
vlorentz triaged T4219: Investigate why GitHub fork detection did not bring a speed-up as Normal priority.
May 2 2022, 3:29 PM · Origin-GitHub, Git loader

Apr 29 2022

olasd closed T3544: Deal with GitHub removing support for git:// URLs as Resolved.

I'm closing this. I've submitted T4216 to track the actual packfile limit issue.

Apr 29 2022, 4:11 PM · Origin-GitHub, Git loader
olasd triaged T4216: git loader packfile size limit is poorly applied to HTTP(s) repositories as Normal priority.
Apr 29 2022, 4:09 PM · Git loader
zack renamed T3652: Cannot ingest git repositories with (too) large packfiles from Ingest git loader origins with smaller packfiles to Cannot ingest git repositories with (too) large packfiles.
Apr 29 2022, 4:00 PM · Git loader
ardumont closed T4206: prod: Deploy metadata loader v0.0.2, a subtask of T3273: Use "fork" relationships to speed-up initial load of large repositories, as Resolved.
Apr 29 2022, 11:27 AM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader

Apr 28 2022

ardumont changed the status of T4206: prod: Deploy metadata loader v0.0.2, a subtask of T3273: Use "fork" relationships to speed-up initial load of large repositories, from Open to Work in Progress.
Apr 28 2022, 3:43 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz edited projects for T3273: Use "fork" relationships to speed-up initial load of large repositories, added: Origin-GitHub; removed GitHub lister.
Apr 28 2022, 3:27 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz edited projects for T3273: Use "fork" relationships to speed-up initial load of large repositories, added: Origin-GitLab; removed GitLab migration.
Apr 28 2022, 3:27 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz added projects to T3273: Use "fork" relationships to speed-up initial load of large repositories: GitHub lister, GitLab migration.
Apr 28 2022, 3:27 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz added a project to T3273: Use "fork" relationships to speed-up initial load of large repositories: Git loader.
Apr 28 2022, 3:26 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader

Apr 20 2022

vlorentz added a revision to T3880: Support Git commits with no angle brackets in author name: D7603: [WIP] Add support for commits with no author.
Apr 20 2022, 12:56 PM · Git loader

Apr 11 2022

bchauvet lowered the priority of T3544: Deal with GitHub removing support for git:// URLs from High to Normal.
Apr 11 2022, 11:57 AM · Origin-GitHub, Git loader
bchauvet added a comment to T3544: Deal with GitHub removing support for git:// URLs.

dealt with (at least in terms of only using https for clones)

Apr 11 2022, 11:56 AM · Origin-GitHub, Git loader
bchauvet added a comment to T3654: loader git: load revisions in topological order.
  • this ensures that objects loaded in the archive are self-consistent
  • but this increases the processing needed to load git repositories (i.e. it will slow them down)
Apr 11 2022, 11:53 AM · Git loader
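
As an illustration of what "topological order" means for the T3654 comment above, here is a small sketch using Python's standard `graphlib` module (Python 3.9+). The commit ids and the `parents` mapping are made up, and this is not how the git loader is implemented; it only shows an ordering in which every commit comes after all of its parents, so that whatever has been written so far never references a missing object.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Made-up history: each commit id maps to the ids of its parents.
parents: Dict[str, List[str]] = {
    "c1": [],
    "c2": ["c1"],
    "side": ["c1"],
    "c3": ["c2"],
    "merge": ["c3", "side"],
}

# TopologicalSorter treats the mapped values as predecessors, so
# static_order() yields parents before the commits that depend on them.
load_order = list(TopologicalSorter(parents).static_order())

for commit in load_order:
    # A real loader would store the object here; this sketch only prints it.
    print(commit)
```

The extra processing mentioned in the comment corresponds to building and sorting this graph before anything is written.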
bchauvet added a comment to T3635: git loader: enable "partial" global deduplication of revisions via the extid mapping table.
  • this adds "redundant" data to the extid table (space usage increase)
  • but this is a cheaper way to enable global deduplication until the topological order is guaranteed
Apr 11 2022, 11:52 AM · Git loader
bchauvet added a comment to T3655: loader git: enable global deduplication of head branches before fetching them.
  • could be done globally (i.e. query if any branch target is already in the archive), but does not fill historical “holes”
  • T3635 is a safer, but partial version of this
  • T3654 would enable doing this globally
Apr 11 2022, 11:50 AM · Git loader

Mar 17 2022

vlorentz added a comment to T3311: Use .gitmodules to discover origins.

The worst case scenario is that someone maliciously creates repositories generated on the fly that refer to each other via .gitmodules, so we end up in an infinite loop of loading garbage.

Mar 17 2022, 11:39 AM · Archive coverage, Git loader
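
One possible guard against the worst case described above is sketched below. Everything in it is an assumption for illustration: `discover_origins`, `get_submodule_urls`, `max_depth` and `max_origins` are hypothetical names, not part of D7332 or of the loader. A visited set stops repositories that refer to each other in a cycle, while the depth and count limits bound chains of repositories generated on the fly, which a visited set alone cannot stop.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set, Tuple


def discover_origins(
    root: str,
    get_submodule_urls: Callable[[str], Iterable[str]],
    max_depth: int = 3,
    max_origins: int = 100,
) -> List[str]:
    """Breadth-first walk of submodule references with cycle and size guards."""
    seen: Set[str] = {root}
    queue: Deque[Tuple[str, int]] = deque([(root, 0)])
    discovered: List[str] = []
    while queue and len(discovered) < max_origins:
        url, depth = queue.popleft()
        discovered.append(url)
        if depth >= max_depth:
            continue  # do not follow references any deeper
        for sub in get_submodule_urls(url):
            if sub not in seen:  # skip anything already queued or visited
                seen.add(sub)
                queue.append((sub, depth + 1))
    return discovered


# Two repositories pointing at each other: the walk terminates.
cycle = {"https://example.org/a": ["https://example.org/b"],
         "https://example.org/b": ["https://example.org/a"]}
cyclic_result = discover_origins("https://example.org/a", lambda u: cycle.get(u, []))

# A host that invents a new repository for every request: the depth limit stops it.
endless_result = discover_origins("https://example.org/r", lambda u: [u + "/x"])
```

The limits here are arbitrary; choosing real values would depend on the throttling concerns raised elsewhere in the task.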

Mar 16 2022

olasd added a comment to T3311: Use .gitmodules to discover origins.
In T3311#80997, @olasd wrote:

I'm not comfortable always creating high priority tasks in this context either, as I'm not sure what the throttling implications are when we inevitably end up on a repository that references a commit in a submodule that doesn't exist.

Mar 16 2022, 4:19 PM · Archive coverage, Git loader
olasd added a comment to T3311: Use .gitmodules to discover origins.

I think the approach in D7332 is interesting, but it feels a bit expensive to be doing it for every instance of a .gitmodules file found in any new directory for all git repos that are being loaded, as well as doing it again for the top level of any known branch in the git snapshot being loaded currently.

Mar 16 2022, 4:15 PM · Archive coverage, Git loader

Mar 14 2022

douardda added a comment to T3311: Use .gitmodules to discover origins.

It's been more or less discussed above, but IMHO it would make sense to:

Mar 14 2022, 1:29 PM · Archive coverage, Git loader

Mar 10 2022

anlambert added a revision to T3923: Include submodules recursively when saving git repositories: D7332: loader: Add support for submodules discovering.
Mar 10 2022, 3:25 PM · Git loader, Save Code Now
anlambert added a revision to T3311: Use .gitmodules to discover origins: D7332: loader: Add support for submodules discovering.
Mar 10 2022, 3:25 PM · Archive coverage, Git loader

Mar 7 2022

vlorentz added a comment to T3923: Include submodules recursively when saving git repositories.

Somewhat related task: T3311

Mar 7 2022, 11:03 AM · Git loader, Save Code Now

Feb 21 2022

anlambert added a project to T3923: Include submodules recursively when saving git repositories: Git loader.
Feb 21 2022, 1:32 PM · Git loader, Save Code Now

Feb 18 2022

douardda added a parent task for T3544: Deal with GitHub removing support for git:// URLs: T2207: Improve ingestion efficiency.
Feb 18 2022, 11:21 AM · Origin-GitHub, Git loader
douardda added a parent task for T3655: loader git: enable global deduplication of head branches before fetching them: T2207: Improve ingestion efficiency.
Feb 18 2022, 11:20 AM · Git loader

Jan 24 2022

vlorentz added a comment to T3880: Support Git commits with no angle brackets in author name.

https://github.com/dulwich/dulwich/pull/927 (actually, this doesn't expose the offset_bytes yet)

Jan 24 2022, 4:42 PM · Git loader
vlorentz updated the task description for T3880: Support Git commits with no angle brackets in author name.
Jan 24 2022, 4:27 PM · Git loader
vlorentz claimed T3880: Support Git commits with no angle brackets in author name.
Jan 24 2022, 2:57 PM · Git loader
vlorentz added a comment to T3880: Support Git commits with no angle brackets in author name.

This will need a patch to Dulwich to fix properly. I'll use the opportunity to make Dulwich expose offset_bytes so we don't have to re-parse it ourselves

Jan 24 2022, 2:34 PM · Git loader
swh-sentry-integration added a comment to T3880: Support Git commits with no angle brackets in author name.

Sentry issue: SWH-LOADER-GIT-TJ

Jan 24 2022, 2:33 PM · Git loader
vlorentz triaged T3880: Support Git commits with no angle brackets in author name as Normal priority.
Jan 24 2022, 2:32 PM · Git loader

Jan 11 2022

douardda added a comment to T3544: Deal with GitHub removing support for git:// URLs.

I guess this is then related to T3653 somehow

Jan 11 2022, 10:55 AM · Origin-GitHub, Git loader
olasd added a comment to T3544: Deal with GitHub removing support for git:// URLs.

Oh, and now that we've moved workers to have a large swap space, the issue of downloading the full packfile in ram before rejecting it should be less disruptive than it's been in the past (where the whole worker would get killed because it ran out of its memory allocation).

Jan 11 2022, 10:27 AM · Origin-GitHub, Git loader
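
For contrast with "downloading the full packfile in RAM before rejecting it", the sketch below shows the alternative of checking the size while the data streams in. It is an illustration under stated assumptions, not the behaviour of the git loader or of dulwich: `PackTooLarge` and `read_with_limit` are hypothetical names, and the pack is modelled as a plain iterable of byte chunks.

```python
import io
from typing import Iterable


class PackTooLarge(Exception):
    """Raised as soon as the received data exceeds the configured limit."""


def read_with_limit(chunks: Iterable[bytes], limit: int) -> bytes:
    """Accumulate chunks, aborting on the first chunk that crosses the limit."""
    buffer = io.BytesIO()
    received = 0
    for chunk in chunks:
        received += len(chunk)
        if received > limit:
            # Stop here instead of buffering the rest of the stream.
            raise PackTooLarge(f"received {received} bytes, limit is {limit}")
        buffer.write(chunk)
    return buffer.getvalue()


small_pack = read_with_limit([b"PACK", b"data"], limit=16)

try:
    read_with_limit([b"x" * 10, b"y" * 10, b"z" * 10], limit=15)
    rejected_early = False
except PackTooLarge:
    rejected_early = True
```

In the second call the error is raised on the second chunk, so the third chunk is never read or held in memory.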
olasd added a comment to T3544: Deal with GitHub removing support for git:// URLs.

For now, I've disabled our hardcoding of the TCP transport for GitHub origins.

Jan 11 2022, 10:25 AM · Origin-GitHub, Git loader
vsellier added a comment to T3544: Deal with GitHub removing support for git:// URLs.

The first GitHub milestone was reached today.
All GitHub loads have been failing since 00:00 UTC.

Jan 11 2022, 9:08 AM · Origin-GitHub, Git loader

Nov 15 2021

vlorentz updated the task description for T3544: Deal with GitHub removing support for git:// URLs.
Nov 15 2021, 3:14 PM · Origin-GitHub, Git loader

Nov 4 2021

olasd added a comment to T3627: Consider dropping pull request references from the git loader ingestion.
In T3627#73323, @zack wrote:

Thanks for the summaries @olasd, both here and on list.
I've followed up on list.

Meanwhile here's what I propose we do (spoiler!):

a) A4: add to the archive Merkle DAG only the filtered snapshot (referencing "intrinsic" branches only, as per A2) and its transitive closure

Nov 4 2021, 12:11 PM · Git loader

Oct 31 2021

zack added a comment to T3627: Consider dropping pull request references from the git loader ingestion.

Thanks for the summaries @olasd, both here and on list.
I've followed up on list.

Oct 31 2021, 4:11 PM · Git loader

Oct 26 2021

ardumont updated the title for P1208 sampled origins: Patch (drop PR branches) or no patch (current version), failure to ingest with huge packfile from Patch (drop PR branches) or no patch (current version), ingestion fails to sampled origins: Patch (drop PR branches) or no patch (current version), failure to ingest with huge packfile.
Oct 26 2021, 12:12 PM · Git loader
ardumont added a comment to P1208 sampled origins: Patch (drop PR branches) or no patch (current version), failure to ingest with huge packfile.

Related to T3627

Oct 26 2021, 12:11 PM · Git loader
ardumont edited P1208 sampled origins: Patch (drop PR branches) or no patch (current version), failure to ingest with huge packfile.
Oct 26 2021, 12:07 PM · Git loader