Use "fork" relationships to speed-up initial load of large repositories
Closed, Migrated, Edits Locked

Description

(I'm writing this task just so that I don't forget the idea, but I don't expect it to be actionable in the short term)

To work incrementally, VCS loaders fetch the last snapshot of the origin from the archive. That snapshot gives them a set of "heads" which they can pass to the origin server, so that the server can work out which revisions it does not need to send.
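As a rough illustration of why the known heads matter, here is a toy model of the exchange. It is only a sketch: the commit graph is a plain dict, and `ancestors` and `revisions_to_send` are hypothetical names, not part of any loader or of the git protocol implementation.

```python
from typing import Dict, Iterable, List, Set

# Toy commit graph: revision id -> list of parent revision ids.
Graph = Dict[str, List[str]]


def ancestors(graph: Graph, heads: Iterable[str]) -> Set[str]:
    """Return the given heads plus everything reachable through parents."""
    seen: Set[str] = set()
    stack = [h for h in heads if h in graph]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        stack.extend(graph.get(rev, []))
    return seen


def revisions_to_send(
    graph: Graph, wanted: Iterable[str], known_heads: Iterable[str]
) -> Set[str]:
    """Revisions the server must send: wanted history minus known history."""
    return ancestors(graph, wanted) - ancestors(graph, known_heads)


# Linear history a <- b <- c <- d (hypothetical revision ids).
graph = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}

incremental = revisions_to_send(graph, ["d"], ["b"])  # loader already has b
first_visit = revisions_to_send(graph, ["d"], [])     # no snapshot at all
```

With a previous snapshot pointing at `b`, only `c` and `d` need to be transferred; with no snapshot, the whole history is sent.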

Unfortunately, when someone forks a large repository (such as https://github.com/chromium/chromium) and we see the fork for the first time, we have no such snapshot. The server therefore has to send all revisions, and we then discard almost all of them because they are already in the archive.

However, if we could detect that new repositories are forks (from extrinsic metadata, from heuristics based on repository names, ...), we could fetch the snapshots of the original repositories and use them as the base to load the fork incrementally.
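A minimal sketch of that idea, assuming we already have some way to list the parent origins of a fork and to look up the heads of an origin's latest snapshot. `base_heads`, the `lookup` callable and the example URLs are all hypothetical and are not the actual loader API.

```python
from typing import Callable, Iterable, Optional, Set

# Hypothetical lookup: origin URL -> heads of its latest snapshot, or None
# if the archive has never visited that origin.
SnapshotLookup = Callable[[str], Optional[Set[str]]]


def base_heads(
    origin_url: str, parent_urls: Iterable[str], lookup: SnapshotLookup
) -> Set[str]:
    """Collect heads from the origin's own snapshot and from its parents'.

    On a first visit the origin has no snapshot, so the heads come only
    from the parent origins (if any were detected).
    """
    heads: Set[str] = set()
    for url in [origin_url, *parent_urls]:
        snapshot = lookup(url)
        if snapshot:
            heads |= snapshot
    return heads


# Hypothetical archive state: only the upstream repository has been visited.
archive = {"https://example.org/upstream": {"r1", "r2"}}

with_parent = base_heads(
    "https://example.org/fork", ["https://example.org/upstream"], archive.get
)
without_parent = base_heads("https://example.org/fork", [], archive.get)
```

When the fork relationship is known, the loader can announce the parent's heads and receive only what the fork added; without it, the set of known heads is empty and the full history is transferred.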

Revisions and Commits

rDLDMD Extrinsic Metadata Loaders
	Closed		D7663 Add method get_parent_origins()
rDLDG Git loader
		D7831	rDLDG9b47b24b98c2 Use all base snapshots in determine_wants()
		D7695	rDLDG4ede7b351783 Replace 'base_url' argument with 'self.parent_origins' attribute
rDLDBASE Generic VCS/Package Loader
		D7691	rDLDBASE07fb382655a0 Store the result of MetadataFetcher.get_parent_origins

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T4283 Load https://github.com/chromium/chromium with a higher packfile size limit
Migrated	gitlab-migration	T3273 Use "fork" relationships to speed-up initial load of large repositories
Migrated	gitlab-migration	T4219 Investigate why GitHub fork detection did not bring a speed-up
Migrated	gitlab-migration	T4225 Deploy a more recent version of prometheus-statsd-exporter on all nodes
Migrated	gitlab-migration	T4235 [As a temporary solution] deploy the statsd-exporter binary published by prometheus
Migrated	gitlab-migration	T4242 Deployed loader.git v1.8
Migrated	gitlab-migration	T1740 fetch extrinsic origin metadata from GitHub
Migrated	gitlab-migration	T1344 Write specs about metadata workflow
Migrated	gitlab-migration	T1738 Define and specify extrinsic origin metadata
Migrated	gitlab-migration	T1739 Define an architecture to fetch extrinsic metadata outside listers and loaders
Migrated	gitlab-migration	T1737 Define and specify metadata providers
Migrated	gitlab-migration	T1747 Review APIs to get metadata from supported origins
Migrated	gitlab-migration	T1748 Review which extrinsic metadata we want to fetch and archive
Migrated	gitlab-migration	T3542 Decide what metadata we want to / can collect from GitHub
Migrated	gitlab-migration	T3681 Review extrinsic metadata specification
Migrated	gitlab-migration	T4186 Allow loaders to fetch extrinsic metadata
Migrated	gitlab-migration	T4187 Pass forge type to loaders
Migrated	gitlab-migration	T4188 Make swh-loader-core run metadata fetchers before loading an origin
Migrated	gitlab-migration	T4193 staging: Deploy metadata loader
Migrated	gitlab-migration	T4194 staging: Deploy swh-scheduler 1.1.0
Migrated	gitlab-migration	T4195 staging: Deploy swh-loader-core 3.1.0
Migrated	gitlab-migration	T4189 Deploy swh.loader.core v3.0 (and other impacted loaders)
Migrated	gitlab-migration	T4206 prod: Deploy metadata loader v0.0.2
Migrated	gitlab-migration	T4204 prod: Deploy swh-scheduler 1.1.1
Migrated	gitlab-migration	T4205 prod: Deploy swh-loader-core 3.2.1