Investigate why GitHub fork detection did not bring a speed-up
Closed, Migrated, Edits Locked

Assigned To

Authored By

	vlorentz
	May 2 2022, 3:29 PM

Description

GitHub fork detection started running on 2022-04-28 around 19:00 UTC, but there is no noticeable speedup around that time.

Revisions and Commits

rDLDG Git loader
	D8808	rDLDG3cf7582aa74d Eagerly populate the set of local heads in RepoRepresentation.__init__
	D7876	rDLDG356b5542852b Log summary of filtered objects in store_data
	D7871	rDLDGf45ca1c2c0fa Add metrics in store_data on ratios of objects already stored
	D7873	rDLDG5ced09db7e66 Add an unweighted average for filtered_objects + fix existing metric name
	D7831	rDLDG9b47b24b98c2 Use all base snapshots in determine_wants()
rDLDBASE Generic VCS/Package Loader
	D7727	rDLDBASEc4b1119763ef loader.core: Add statsd metrics on collected metadata
	D7726	rDLDBASE6ca6d5cf9cef loader.core: Add statsd timing metrics

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T4283 Load https://github.com/chromium/chromium with a higher packfile size limit
Migrated	gitlab-migration	T3273 Use "fork" relationships to speed-up initial load of large repositories
Migrated	gitlab-migration	T4219 Investigate why GitHub fork detection did not bring a speed-up
Migrated	gitlab-migration	T4225 Deploy a more recent version of prometheus-statsd-exporter on all nodes
Migrated	gitlab-migration	T4235 [As a temporary solution] deploy the statsd-exporter binary published by prometheus
Migrated	gitlab-migration	T4242 Deployed loader.git v1.8

Event Timeline

vlorentz triaged this task as Normal priority. May 2 2022, 3:29 PM

vlorentz created this task.

vlorentz added revisions: D7726: loader.core: Add statsd timing metrics, D7727: loader.core: Add statsd metrics on collected metadata.

vlorentz added a parent task: T3273: Use "fork" relationships to speed-up initial load of large repositories. May 3 2022, 11:15 AM

vlorentz added a commit: rDLDBASE6ca6d5cf9cef: loader.core: Add statsd timing metrics. May 6 2022, 10:35 AM

vlorentz added a commit: rDLDBASEc4b1119763ef: loader.core: Add statsd metrics on collected metadata.

olasd changed the status of subtask T4225: Deploy a more recent version of prometheus-statsd-exporter on all nodes from Open to Work in Progress. May 6 2022, 5:00 PM

https://grafana.softwareheritage.org/d/FqGC4zu7z/vlorentz-loader-metrics?orgId=1&var-environment=production&var-interval=1h&var-visit_type=git&var-has_parent_origins=True shows we spend a considerable amount of time loading data from git repositories with an existing visit + a parent:

on average 3 to 4 minutes, which is huge compared to git repositories with a visit but no parent (~15s)
and about half the total time spent by loaders, even though they represent only one tenth of the git visits

I decided to take a look at today's visits which take a long time:

softwareheritage=> select url, runtime from (select origin.url, (select max(date) - min(date) from origin_visit_status where origin_visit_status.origin=origin_visit.origin and origin_visit_status.visit=origin_visit.visit) as runtime from origin_visit inner join origin on (origin.id=origin_visit.origin) where date > '2022-05-12' and date < '2022-05-14') as t where runtime > '00:15:00' order by runtime desc limit 10;
                                             url                                             |        runtime        
---------------------------------------------------------------------------------------------+-----------------------
 https://code.launchpad.net/~m.ch/mysql-server/mysql-6.0-sigar-plugin                        | 1 day 07:55:44.139706
 https://github.com/EasyEngine/homebrew-core                                                 | 1 day 02:06:38.710557
 https://github.com/paritytech/polkadot                                                      | 20:41:30.272244
 https://github.com/facebook/relay                                                           | 19:09:03.008676
 https://code.launchpad.net/~kenbreeman-deactivatedaccount/mysql-server/memcache_query_cache | 19:06:19.878981
 https://code.launchpad.net/~starbuggers/sakila-server/mysql-5.1-wl820-antony1               | 18:17:04.017944
 https://code.launchpad.net/~atcurtis/sakila-server/sakila-5.2                               | 17:47:05.104121
 https://gitlab.com/searchwing/development/payloads/pi-linux-kernel.git                      | 15:14:21.298042
 https://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc.git                              | 14:23:48.964101
 https://gitlab.com/BobbyTheBuilder/linux-pine64.git                                         | 13:18:48.230494
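
For readers who want to reproduce this measurement outside of psql, here is a minimal Python sketch of the same idea: a visit's runtime is taken to be the span between its earliest and latest status dates, and only visits longer than 15 minutes are kept. The function names, the tuple layout and the sample rows are made up for illustration; the sketch also skips the date-range filter of the query above.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def visit_runtimes(statuses):
    """Map (origin, visit) to the span between its first and last status date."""
    dates = defaultdict(list)
    for origin, visit, date in statuses:
        dates[(origin, visit)].append(date)
    return {key: max(ds) - min(ds) for key, ds in dates.items()}


def slow_visits(statuses, threshold=timedelta(minutes=15), limit=10):
    """Return the longest visits above the threshold, slowest first."""
    slow = [(key, runtime)
            for key, runtime in visit_runtimes(statuses).items()
            if runtime > threshold]
    slow.sort(key=lambda item: item[1], reverse=True)
    return slow[:limit]


# Hypothetical status rows: (origin url, visit id, status date).
statuses = [
    ("https://example.org/a.git", 1, datetime(2022, 5, 13, 8, 0)),
    ("https://example.org/a.git", 1, datetime(2022, 5, 13, 9, 30)),
    ("https://example.org/b.git", 1, datetime(2022, 5, 13, 8, 0)),
    ("https://example.org/b.git", 1, datetime(2022, 5, 13, 8, 1)),
]

for (origin, visit), runtime in slow_visits(statuses):
    print(origin, visit, runtime)
```

Only the first origin is reported here, since the second visit lasted one minute.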

It seems many of them are in that category, and were recently rebased. For example, https://github.com/EasyEngine/homebrew-core was not updated in 4 years, except for two branches pushed recently, with a single commit each, based on the parent's master branch.

This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".
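
To make the idea concrete, here is a rough sketch of what using several base snapshots could look like. It is not the actual swh.loader.git code: `plan_fetch`, the plain-dict snapshot representation and the commit ids are all hypothetical. The point is only that the set of "haves" advertised to the server is the union of the heads of every base snapshot (the origin's and its parent's), while the "wants" are the remote heads absent from that union.

```python
def plan_fetch(remote_refs, base_snapshots):
    """Compute (wants, haves) for an incremental fetch.

    remote_refs: {ref name: commit id} as advertised by the remote.
    base_snapshots: list of {branch name: commit id} dicts, e.g. the last
    snapshot of the origin and the last snapshot of its parent.
    """
    # Union of all heads we already archived, across every base snapshot.
    haves = set()
    for snapshot in base_snapshots:
        haves.update(snapshot.values())

    # Only ask for remote heads that none of the base snapshots contain.
    wants = {target for target in remote_refs.values() if target not in haves}
    return sorted(wants), sorted(haves)


# Hypothetical fork: stale master, plus two fresh one-commit branches
# created on top of the parent's current master ("c3").
origin_snapshot = {"refs/heads/master": "c1"}
parent_snapshot = {"refs/heads/master": "c3"}
remote_refs = {
    "refs/heads/master": "c1",
    "refs/heads/fix-a": "c4",
    "refs/heads/fix-b": "c5",
}

wants, haves = plan_fetch(remote_refs, [origin_snapshot, parent_snapshot])
print(wants)  # ['c4', 'c5']
print(haves)  # ['c1', 'c3']
```

With only the origin's snapshot as a base, "c3" would be missing from the haves, and the server would have to send everything between "c1" and the new branches, which is the parent's history the fork never had.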

vlorentz added a revision: D7831: Use all base snapshots in determine_wants(). May 13 2022, 3:23 PM

vlorentz added a commit: rDLDG9b47b24b98c2: Use all base snapshots in determine_wants(). May 13 2022, 4:14 PM

olasd closed subtask T4225: Deploy a more recent version of prometheus-statsd-exporter on all nodes as Resolved. May 13 2022, 4:20 PM

ardumont mentioned this in T4242: Deployed loader.git v1.8. May 13 2022, 4:55 PM

ardumont added a subtask: T4242: Deployed loader.git v1.8. May 13 2022, 6:01 PM

In T4219#84994, @vlorentz wrote:

This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".

This was done last Friday, and there is still no speed-up. @olasd suggests "IME GitHub aggressively caches packfiles and it may fling you many more bytes than you've asked". I do not know how to check that, and I am out of ideas for now.

I did some profiling early this week, and found that when incrementally loading a Linux fork we already visited:

2/3 of the time is spent fetching a huge packfile
1/3 is spent checking whether each object is already present in the storage.

I'm going to try implementing a graph traversal in packfiles, so we can eliminate objects which are referenced by objects we know we have, and thereby save requests to the storage. It should save resources on the storage server, but it is unclear to me what the impact on loader throughput will be; we'll see.
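
A minimal sketch of such a traversal, under the assumption that the archive is closed under references (if an object is stored, everything it references is stored too). The function name, the dict-based graph and the object ids are hypothetical and do not reflect the loader's real data structures: starting from objects already known to be present, everything reachable is marked as implied, and only the remaining packfile objects need a lookup in the storage.

```python
from collections import deque


def objects_to_check(references, known_present):
    """Return the packfile objects that still need a storage lookup.

    references: {object id: [ids it references]} for objects in the packfile
    (a commit references its parents and tree, a tree references its entries).
    known_present: ids already confirmed to be in the storage.
    """
    implied = set(known_present)
    queue = deque(known_present)
    # Breadth-first walk: anything referenced by a present object is present.
    while queue:
        current = queue.popleft()
        for child in references.get(current, ()):
            if child not in implied:
                implied.add(child)
                queue.append(child)
    return sorted(obj for obj in references if obj not in implied)


# Hypothetical packfile content.
references = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blob-a", "blob-b"],
    "tree1": ["blob-a"],
    "blob-a": [],
    "blob-b": [],
}

print(objects_to_check(references, {"commit1"}))
# ['blob-b', 'commit2', 'tree2']
```

In this toy packfile, knowing that "commit1" is stored removes three of the six objects from the presence check. How much this helps on real packfiles depends on how many known objects can be identified cheaply up front.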

vlorentz mentioned this in D7871: Add metrics in store_data on ratios of objects already stored. May 20 2022, 1:47 PM

vlorentz added a revision: D7871: Add metrics in store_data on ratios of objects already stored. May 20 2022, 1:48 PM

vlorentz added revisions: D7873: Add an unweighted average for filtered_objects + fix existing metric name, D7876: Log summary of filtered objects in store_data. May 20 2022, 3:54 PM

vlorentz added a commit: rDLDG5ced09db7e66: Add an unweighted average for filtered_objects + fix existing metric name. May 23 2022, 1:54 PM

vlorentz added a commit: rDLDGf45ca1c2c0fa: Add metrics in store_data on ratios of objects already stored.

vlorentz added a commit: rDLDG356b5542852b: Log summary of filtered objects in store_data.

gitlab-migration changed the status of subtask T4225: Deploy a more recent version of prometheus-statsd-exporter on all nodes from Resolved to Migrated. Oct 19 2022, 6:06 PM

gitlab-migration changed the status of subtask T4242: Deployed loader.git v1.8 from Resolved to Migrated.

olasd added a revision: D8808: Eagerly populate the set of local heads in RepoRepresentation.__init__. Nov 3 2022, 5:28 PM

olasd added a commit: rDLDG3cf7582aa74d: Eagerly populate the set of local heads in RepoRepresentation.__init__. Nov 4 2022, 1:30 PM

swh.loader.git 2.1.0 has now been deployed on all workers.

vlorentz closed this task as Resolved. Dec 1 2022, 4:18 PM