GitHub fork detection started running on 2022-04-28 around 19:00 UTC, but there is no noticeable speedup around that time:
Revisions and Commits
|rDLDG Git loader|
|D8808||rDLDG3cf7582aa74d Eagerly populate the set of local heads in RepoRepresentation.__init__|
|D7876||rDLDG356b5542852b Log summary of filtered objects in store_data|
|D7871||rDLDGf45ca1c2c0fa Add metrics in store_data on ratios of objects already stored|
|D7873||rDLDG5ced09db7e66 Add an unweighted average for filtered_objects + fix existing metric name|
|D7831||rDLDG9b47b24b98c2 Use all base snapshots in determine_wants()|
|rDLDBASE Generic VCS/Package Loader|
|D7727||rDLDBASEc4b1119763ef loader.core: Add statsd metrics on collected metadata|
|D7726||rDLDBASE6ca6d5cf9cef loader.core: Add statsd timing metrics|
https://grafana.softwareheritage.org/d/FqGC4zu7z/vlorentz-loader-metrics?orgId=1&var-environment=production&var-interval=1h&var-visit_type=git&var-has_parent_origins=True shows we spend a considerable amount of time loading data from git repositories with an existing visit + a parent:
- on average 3 to 4 minutes, which is huge compared to git repositories with a visit but no parent (~15s)
- and about half the total time spent by loaders, even though they represent only one tenth of the git visits
I decided to take a look at today's visits which take a long time:
softwareheritage=> select url, runtime from (
    select origin.url,
           (select max(date) - min(date)
              from origin_visit_status
             where origin_visit_status.origin=origin_visit.origin
               and origin_visit_status.visit=origin_visit.visit) as runtime
      from origin_visit
     inner join origin on (origin.id=origin_visit.origin)
     where date > '2022-05-12' and date < '2022-05-14'
) as t
where runtime > '00:15:00'
order by runtime desc
limit 10;
                                             url                                             |        runtime
---------------------------------------------------------------------------------------------+-----------------------
 https://code.launchpad.net/~m.ch/mysql-server/mysql-6.0-sigar-plugin                        | 1 day 07:55:44.139706
 https://github.com/EasyEngine/homebrew-core                                                 | 1 day 02:06:38.710557
 https://github.com/paritytech/polkadot                                                      | 20:41:30.272244
 https://github.com/facebook/relay                                                           | 19:09:03.008676
 https://code.launchpad.net/~kenbreeman-deactivatedaccount/mysql-server/memcache_query_cache | 19:06:19.878981
 https://code.launchpad.net/~starbuggers/sakila-server/mysql-5.1-wl820-antony1               | 18:17:04.017944
 https://code.launchpad.net/~atcurtis/sakila-server/sakila-5.2                               | 17:47:05.104121
 https://gitlab.com/searchwing/development/payloads/pi-linux-kernel.git                      | 15:14:21.298042
 https://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc.git                              | 14:23:48.964101
 https://gitlab.com/BobbyTheBuilder/linux-pine64.git                                         | 13:18:48.230494
It seems many of them are in that category and were recently rebased. For example, https://github.com/EasyEngine/homebrew-core had not been updated in 4 years, except for two recently pushed branches, each with a single commit based on the parent's master branch.
This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".
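To make the idea concrete, here is a minimal sketch of how the heads of several base snapshots could be merged before negotiating a fetch. Everything in it is an assumption made for illustration: the `Snapshot` shape (branch name to target id), the `plan_fetch` helper and the sample ids are hypothetical, and this is not the loader's actual code.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical snapshot shape for this sketch: branch name -> target object id.
Snapshot = Dict[bytes, bytes]


def plan_fetch(
    remote_refs: Dict[bytes, bytes],
    base_snapshots: Iterable[Snapshot],
) -> Tuple[List[bytes], List[bytes]]:
    """Return (wants, haves) for an incremental fetch.

    haves: every head found in any base snapshot (the origin's and its parent's).
    wants: remote heads that none of the base snapshots already point to.
    """
    haves: Set[bytes] = set()
    for snapshot in base_snapshots:
        haves.update(snapshot.values())
    wants = {target for target in remote_refs.values() if target not in haves}
    return sorted(wants), sorted(haves)


# Made-up ids mimicking a stale fork with two new single-commit branches.
origin_last = {b"refs/heads/master": b"old-master"}
parent_last = {b"refs/heads/master": b"parent-master"}
remote = {
    b"refs/heads/master": b"old-master",
    b"refs/heads/patch-1": b"new-commit-1",
    b"refs/heads/patch-2": b"new-commit-2",
}
wants, haves = plan_fetch(remote, [origin_last, parent_last])
```

In this toy case only the two new heads are wanted, and the parent's master is advertised as already known, so a server could in principle send just the new commits instead of the history between the fork's old master and the parent's master.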
I did some profiling early this week, and found that when incrementally loading a linux fork we already visited:
- 2/3 of the time is spent fetching a huge packfile
- 1/3 is spent checking whether objects are already present in the storage.
I'm going to try implementing a graph traversal in packfiles, so we can eliminate objects that are referenced by objects we already know we have, and save requests to the storage. This should save resources on the storage server, but it is unclear to me what the impact on loader throughput will be; we'll see.
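A rough sketch of what such a traversal could look like, under two assumptions that are mine and not confirmed anywhere above: the packfile has already been parsed into a mapping from each object id to the ids it references, and the storage never holds an object without also holding everything that object references. The names `find_missing` and `is_in_storage` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def find_missing(
    pack_refs: Dict[str, List[str]],
    roots: Iterable[str],
    is_in_storage: Callable[[str], bool],
) -> Set[str]:
    """Return the pack objects that still need to be written to the storage.

    Walks the pack from its heads. When the storage reports an object as
    present, everything reachable from it is marked present without asking
    the storage again (this relies on the assumed storage invariant).
    """
    present: Set[str] = set()
    missing: Set[str] = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in present or obj in missing or obj not in pack_refs:
            continue
        if is_in_storage(obj):
            # Propagate "present" to all descendants, with no storage request.
            todo = [obj]
            while todo:
                known = todo.pop()
                if known in present:
                    continue
                present.add(known)
                todo.extend(pack_refs.get(known, ()))
        else:
            missing.add(obj)
            stack.extend(pack_refs[obj])
    return missing


# Toy pack: commit c3 is new, c2 and its history are already archived.
pack = {
    "c3": ["c2", "t3"], "c2": ["c1", "t2"], "c1": ["t1"],
    "t3": ["b1", "b2"], "t2": ["b1"], "t1": ["b1"],
    "b1": [], "b2": [],
}
archived = {"c2", "c1", "t2", "t1", "b1"}
queried: List[str] = []


def fake_storage(obj_id: str) -> bool:
    queried.append(obj_id)  # record each simulated storage request
    return obj_id in archived


print(sorted(find_missing(pack, ["c3"], fake_storage)))  # ['b2', 'c3', 't3']
```

In this toy example the storage is queried for fewer objects than the pack contains, because the history below c2 is never checked individually. How much this saves on a real packfile depends on the traversal order and on how much of the pack is already archived.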