Investigate why GitHub fork detection did not bring a speed-up
Closed, Migrated; Edits Locked

Event Timeline

https://grafana.softwareheritage.org/d/FqGC4zu7z/vlorentz-loader-metrics?orgId=1&var-environment=production&var-interval=1h&var-visit_type=git&var-has_parent_origins=True shows we spend a considerable amount of time loading data from git repositories with an existing visit + a parent:

  • on average 3 to 4 minutes, which is huge compared to git repositories with a visit but no parent (~15s)
  • and about half the total time spent by loaders, even though they represent only one tenth of the git visits

I decided to take a look at today's visits which take a long time:

softwareheritage=> select url, runtime
  from (
    select origin.url,
           (select max(date) - min(date)
              from origin_visit_status
             where origin_visit_status.origin = origin_visit.origin
               and origin_visit_status.visit = origin_visit.visit) as runtime
      from origin_visit
     inner join origin on (origin.id = origin_visit.origin)
     where date > '2022-05-12' and date < '2022-05-14'
  ) as t
 where runtime > '00:15:00'
 order by runtime desc
 limit 10;
                                             url                                             |        runtime        
---------------------------------------------------------------------------------------------+-----------------------
 https://code.launchpad.net/~m.ch/mysql-server/mysql-6.0-sigar-plugin                        | 1 day 07:55:44.139706
 https://github.com/EasyEngine/homebrew-core                                                 | 1 day 02:06:38.710557
 https://github.com/paritytech/polkadot                                                      | 20:41:30.272244
 https://github.com/facebook/relay                                                           | 19:09:03.008676
 https://code.launchpad.net/~kenbreeman-deactivatedaccount/mysql-server/memcache_query_cache | 19:06:19.878981
 https://code.launchpad.net/~starbuggers/sakila-server/mysql-5.1-wl820-antony1               | 18:17:04.017944
 https://code.launchpad.net/~atcurtis/sakila-server/sakila-5.2                               | 17:47:05.104121
 https://gitlab.com/searchwing/development/payloads/pi-linux-kernel.git                      | 15:14:21.298042
 https://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc.git                              | 14:23:48.964101
 https://gitlab.com/BobbyTheBuilder/linux-pine64.git                                         | 13:18:48.230494

It seems many of them are in that category, and were recently rebased. For example, https://github.com/EasyEngine/homebrew-core was not updated in 4 years, except for two branches pushed recently, with a single commit each, based on the parent's master branch.

This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".
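
A minimal sketch of that idea, in Python. It is not the actual swh.loader.git code: the `Snapshot` type and the `known_heads` and `wanted_targets` helpers are hypothetical names used only for illustration, and snapshots are simplified to a mapping from branch name to target object id. It shows how the branch targets of the origin's last snapshot and of its parents' last snapshots could be merged into one set of objects we already have, so that only the remaining remote refs need to be fetched.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical simplified snapshot: branch name -> target object id (hex string).
Snapshot = Dict[str, str]


def known_heads(
    origin_snapshot: Optional[Snapshot],
    parent_snapshots: Iterable[Snapshot],
) -> Set[str]:
    """Collect every branch target we already archived, from the origin's
    last snapshot (if any) and from the last snapshot of each parent origin."""
    heads: Set[str] = set()
    if origin_snapshot:
        heads.update(origin_snapshot.values())
    for snapshot in parent_snapshots:
        heads.update(snapshot.values())
    return heads


def wanted_targets(remote_refs: Dict[str, str], heads: Set[str]) -> Set[str]:
    """Remote ref targets that are not already covered by a known head."""
    return {target for target in remote_refs.values() if target not in heads}


if __name__ == "__main__":
    fork = {"refs/heads/master": "a1"}
    parent = {"refs/heads/master": "b2", "refs/heads/dev": "c3"}
    remote = {
        "refs/heads/master": "a1",
        "refs/heads/new-branch": "d4",
        "refs/heads/sync": "b2",
    }
    print(sorted(wanted_targets(remote, known_heads(fork, [parent]))))
```

In this toy case the branch pointing at the parent's master is already covered by the parent's snapshot, so only the genuinely new branch target is left to request.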

This was done last Friday, and there is still no speed-up. @olasd suggests "IME GitHub aggressively caches packfiles and it may fling you many more bytes than you've asked". I do not know how to check that, and I am out of ideas for now.

I did some profiling early this week, and found that when incrementally loading a linux fork we already visited:

  • 2/3 of the time is spent fetching a huge packfile
  • 1/3 is spent checking whether each object is already present in the storage.

I'm going to try implementing a graph traversal in packfiles, so we can skip the storage lookup for objects which are referenced by objects we know we already have. It should save resources on the storage server, but it is unclear to me what the impact on loader throughput will be; we'll see.
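
A rough sketch of what such a traversal could look like, in Python. This is an illustration under stated assumptions, not the code that was deployed: `find_missing_objects`, `references` and `in_storage` are hypothetical names, the packfile is modelled as a plain mapping from object id to the ids it references, and the storage is assumed to be closed under references (if an object is present, everything it points to is present too).

```python
from typing import Callable, Dict, Iterable, List, Set


def find_missing_objects(
    roots: Iterable[str],
    references: Dict[str, List[str]],
    in_storage: Callable[[str], bool],
) -> Set[str]:
    """Walk the packfile's object graph from the fetched heads and return the
    objects that still need to be loaded.

    When an object is found in the storage, the walk stops there: under the
    closure assumption its whole subgraph is already archived, so none of its
    descendants is looked up through that path."""
    missing: Set[str] = set()
    visited: Set[str] = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in visited:
            continue
        visited.add(obj)
        if obj not in references:
            # Not part of this packfile: nothing to load from it.
            continue
        if in_storage(obj):
            # Known object: prune, do not descend into what it references.
            continue
        missing.add(obj)
        stack.extend(references[obj])
    return missing


if __name__ == "__main__":
    # Toy graph: new commit c2 on top of already-archived commit c1.
    refs = {
        "c2": ["c1", "t2"], "c1": ["t1"],
        "t2": ["b1", "b2"], "t1": ["b1"],
        "b1": [], "b2": [],
    }
    stored = {"c1", "t1", "b1"}
    print(sorted(find_missing_objects(["c2"], refs, stored.__contains__)))
```

An object reachable both from a known object and from a missing one is still looked up once, so the result stays correct; the saving comes from objects that are only reachable through known ones.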

swh.loader.git 2.1.0 has now been deployed on all workers.

vlorentz claimed this task.