Define an architecture to fetch extrinsic metadata outside listers and loaders
Closed, Migrated

Authored By

	vlorentz
	May 23 2019, 11:55 AM

Description

Listers cannot fetch metadata for every forge. On GitHub and Bitbucket they get no metadata other than the description (and, on GitHub, whether the repository is a fork). On GitLab they also get the number of stars and forks, but not much more.

Existing package loaders currently load some metadata, but we may want to use dedicated loaders for that, eg. with a specific visit type.

Or have a new kind of component outside listers and loaders (eg. if we want to use GitHub's endpoint to list an org's repos like etalab does).

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T4283 Load https://github.com/chromium/chromium with a higher packfile size limit
Migrated	gitlab-migration	T3273 Use "fork" relationships to speed-up initial load of large repositories
Migrated	gitlab-migration	T2201 Indexing / mining
Migrated	gitlab-migration	T2202 Collect extrinsic metadata
Migrated	gitlab-migration	T833 When listing an origin, add origin level metadata to RMD storage
Migrated	gitlab-migration	T2693 fetch extrinsic origin metadata from GitLab instances
Migrated	gitlab-migration	T1102 Handle all GitHub elements
Migrated	gitlab-migration	T1740 fetch extrinsic origin metadata from GitHub
Migrated	gitlab-migration	T1344 Write specs about metadata workflow
Migrated	gitlab-migration	T1738 Define and specify extrinsic origin metadata
Migrated	gitlab-migration	T1739 Define an architecture to fetch extrinsic metadata outside listers and loaders
Migrated	gitlab-migration	T1737 Define and specify metadata providers
Migrated	gitlab-migration	T1747 Review APIs to get metadata from supported origins
Migrated	gitlab-migration	T1748 Review which extrinsic metadata we want to fetch and archive
Migrated	gitlab-migration	T3542 Decide what metadata we want to / can collect from GitHub
Migrated	gitlab-migration	T3681 Review extrinsic metadata specification

Event Timeline

vlorentz triaged this task as Normal priority. May 23 2019, 11:55 AM

vlorentz created this task.

vlorentz added a subtask: T1737: Define and specify metadata providers.

vlorentz added a parent task: T1740: fetch extrinsic origin metadata from GitHub. May 23 2019, 12:06 PM

vlorentz changed the status of subtask T1737: Define and specify metadata providers from Open to Work in Progress. May 24 2019, 10:30 AM

vlorentz closed subtask T1737: Define and specify metadata providers as Resolved. Jul 4 2019, 11:14 AM

olasd mentioned this in T2306: Generic storage for extrinsic, qualified metadata related to any node of the swh archive. Mar 9 2020, 8:48 PM

moranegg moved this task from Backlog to Specifications on the Metadata workflow board. Sep 18 2020, 2:19 PM

vlorentz renamed this task from Define an architecture to fetch extrinsic metadata outside listers to Define an architecture to fetch extrinsic metadata outside listers and loaders. Sep 18 2020, 2:36 PM

vlorentz updated the task description.

zack added a parent task: T2693: fetch extrinsic origin metadata from GitLab instances. Oct 13 2020, 10:15 AM

moranegg edited projects, added Extrinsic metadata; removed Metadata workflow. Feb 12 2021, 4:34 PM

rdicosmo mentioned this in T2202: Collect extrinsic metadata. Mar 15 2021, 9:08 PM

moranegg added a subtask: T3681: Review extrinsic metadata specification. Oct 21 2021, 12:59 PM

The original idea for this was to have separate tasks to fetch metadata, so that loaders did not have forge-specific code to fetch metadata.

However, the idea of loading metadata from loader is more appealing the more I think about it:

1. Metadata are fetched at about the same time as we snapshot code; which would allow showing more consistent states of repositories
2. Active repositories automatically have their metadata fetched more often than inactive ones
3. We don't have one more moving part to monitor and schedule
4. This allows the Git loader to know a new repo is a "forge fork" of another one before it starts loading, so it can do an incremental load

The downsides I see:

Loaders will need API credentials (though we may need to give Git loaders SSH keys because of T3544#69746)
It doesn't provide a way to fetch metadata for inactive repositories; but we can deal with that later (eg. with an option to make loaders load only metadata; or simply to run loaders on repos even if the forge does not report changes)
It "feels wrong" to have forge-specific code in loaders; but we can make the metadata loading pluggable (eg. with setuptools entrypoints) if this ever becomes an issue.

Thoughts?

To me the advantages are strong, especially points 1 and 4.

Having a "complete" snapshot including the current code and current metadata is better than having the metadata at a different visit.

In T1739#82920, @vlorentz wrote:

The original idea for this was to have separate tasks to fetch metadata, so that loaders did not have forge-specific code to fetch metadata.

However, the idea of loading metadata from loader is more appealing the more I think about it:

Metadata are fetched at about the same time as we snapshot code; which would allow showing more consistent states of repositories

Active repositories automatically have their metadata fetched more often than inactive ones

We don't have one more moving part to monitor and schedule

This allows the Git loader to know a new repo is a "forge fork" of another one before it starts loading, so it can do an incremental load

Yes, all these are good points. As long as forges don't provide a way of loading the metadata in bulk, it makes sense to do it at the same time as loading.

I think we will want to ensure that a failing metadata fetch doesn't fail the whole loading operation altogether, to avoid too strongly coupling these components.
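As a rough sketch of that decoupling (all names here are hypothetical, not the actual loader API): the metadata fetch runs first so its result is available before the code load, and any exception it raises is logged and recorded instead of aborting the visit.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def load_origin(
    url: str,
    load_code: Callable[[str], str],
    fetch_metadata: Callable[[str], dict],
) -> dict:
    """Fetch metadata, then load code; a metadata failure never aborts the load.

    `load_code` and `fetch_metadata` are hypothetical callables standing in
    for the VCS loader and the forge-specific metadata fetcher.
    """
    metadata = None
    metadata_error = None
    try:
        metadata = fetch_metadata(url)
    except Exception as exc:  # deliberately broad: fetcher bugs must stay contained
        logger.warning("metadata fetch failed for %s: %s", url, exc)
        metadata_error = repr(exc)

    # The code load runs regardless of what happened above.
    snapshot = load_code(url)
    return {
        "url": url,
        "snapshot": snapshot,
        "metadata": metadata,
        "metadata_error": metadata_error,
    }
```

Recording the error in the result, rather than only logging it, would let a visit be marked as partial for metadata while still succeeding for code.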

The downsides I see:

Loaders will need API credentials (though we may need to give Git loaders SSH keys because of T3544#69746)

I doubt we will be giving git loaders SSH keys any time soon, and I'd rather we spelled out, and maybe contracted out, the improvements to dulwich that we need for better generic HTTPS support (which will automatically be useful for all upstreams, not just GitHub).

Either way, giving a set of forge API credentials to the git loader is just a matter of duplicating an entry in the puppet config (from lister to loader-git), so it's really not a big practical deal.

It doesn't provide a way to fetch metadata for inactive repositories; but we can deal with that later (eg. with an option to make loaders load only metadata; or simply to run loaders on repos even if the forge does not report changes)

I would expect most forges to actually report a change to the origin if the metadata (only) changes, so they'd be scheduled for loading again anyway. In a situation where we're not lagging behind forges, we would also be re-loading origins with no known changes too.

In terms of testing, I think we will want to have the option to do metadata-only/code-only/both loads, so we could consider scheduling some metadata-only loads more often.
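One way the three load modes could be expressed, as a minimal sketch: a flag enum where "both" is simply the union of the other two. `LoadMode` and `plan_steps` are made-up names for illustration, not existing loader options.

```python
import enum
from typing import List


class LoadMode(enum.Flag):
    """Hypothetical per-visit switch: what a loader run should do."""

    CODE = enum.auto()
    METADATA = enum.auto()
    BOTH = CODE | METADATA


def plan_steps(mode: LoadMode) -> List[str]:
    """Return the ordered steps for a visit in the given mode.

    Metadata comes first so that fork information would be available
    before the code load starts.
    """
    steps = []
    if LoadMode.METADATA in mode:
        steps.append("fetch_metadata")
    if LoadMode.CODE in mode:
        steps.append("load_code")
    return steps
```

With this shape, a scheduler could enqueue `LoadMode.METADATA` visits more often than full ones without needing a separate component.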

It "feels wrong" to have forge-specific code in loaders; but we can make the metadata loading pluggable (eg. with setuptools entrypoints) if this ever becomes an issue.

I think we want to design the metadata loading as a pluggable / third party module from the get go, because for forges which support multiple different VCSes, we will want to share the logic between them. This will also make testing the metadata fetching/mangling logic in isolation easier.
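A minimal sketch of what the pluggable part could look like with entry points, using the standard library's `importlib.metadata`. The group name `swh.loader.metadata_fetchers` and the fetcher signature (a callable taking an origin URL and returning a dict) are assumptions made up for this example.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict, Optional

# Hypothetical entry point group; third-party packages would register
# one fetcher per forge type under this name.
ENTRY_POINT_GROUP = "swh.loader.metadata_fetchers"

Fetcher = Callable[[str], dict]


def discover_fetchers(group: str = ENTRY_POINT_GROUP) -> Dict[str, Fetcher]:
    """Map each entry point name (eg. a forge type) to its loaded fetcher."""
    eps = entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        selected = eps.select(group=group)
    else:  # Python 3.8/3.9: entry_points() returns a plain dict
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


def fetch_for(forge: str, url: str, fetchers: Dict[str, Fetcher]) -> Optional[dict]:
    """Call the fetcher registered for this forge, if there is one."""
    fetcher = fetchers.get(forge)
    return fetcher(url) if fetcher is not None else None
```

Because the fetchers are keyed by forge rather than by VCS, the same one could be shared by, say, a Git loader and a Mercurial loader, and each fetcher can be unit-tested by calling it directly.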

In T1739#82939, @olasd wrote:

Yes, all these are good points. As long as forges don't provide a way of loading the metadata in bulk, it makes sense to do it at the same time as loading.

Some do, including GitHub, but I think the benefits of these bulk APIs are minimal because of the way GitHub implements rate-limiting.

The downsides I see:

Loaders will need API credentials (though we may need to give Git loaders SSH keys because of T3544#69746)

[...]

Either way, giving a set of forge API credentials to the git loader is just a matter of duplicating an entry in the puppet config (from lister to loader-git), so it's really not a big practical deal.

Sure; but I meant it from a security point of view: these credentials will need to be available to processes which handle non-trusted data. Not to mention that they may be accidentally leaked to Sentry too.
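For the Sentry part, one possible mitigation is to scrub credential-looking fields from events before they are sent. The sketch below is a plain function with the `(event, hint)` shape that the Sentry Python SDK's `before_send` hook expects; the list of sensitive key fragments is an assumption and would need to match however the credentials are actually named in the loader's config.

```python
# Assumed substrings that mark a key as holding a credential.
SENSITIVE_KEY_PARTS = ("token", "password", "secret", "authorization")
REDACTED = "[redacted]"


def _is_sensitive(key) -> bool:
    """True if the key name looks like it holds a credential."""
    return isinstance(key, str) and any(
        part in key.lower() for part in SENSITIVE_KEY_PARTS
    )


def scrub_credentials(event, hint=None):
    """Return a copy of the event with credential-like values redacted.

    Walks nested dicts and lists; the input event is left unmodified.
    """

    def scrub(value):
        if isinstance(value, dict):
            return {
                key: REDACTED if _is_sensitive(key) else scrub(item)
                for key, item in value.items()
            }
        if isinstance(value, list):
            return [scrub(item) for item in value]
        return value

    return scrub(event)
```

This only addresses accidental leaks through error reports; it does nothing about the credentials being readable by a process that handles untrusted data.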

It doesn't provide a way to fetch metadata for inactive repositories; but we can deal with that later (eg. with an option to make loaders load only metadata; or simply to run loaders on repos even if the forge does not report changes)

I would expect most forges to actually report a change to the origin if the metadata (only) changes, so they'd be scheduled for loading again anyway. In a situation where we're not lagging behind forges, we would also be re-loading origins with no known changes too.

Probably not for all metadata (eg. number of stars). Either way, the semantics of updated_at are not documented by GitHub.

I started working on this design. We'll see if it needs to change later.

vlorentz mentioned this in T3859: investigate using metadata from GHTorrent. Apr 21 2022, 8:39 PM

vlorentz mentioned this in T4252: Schedule recurring fetches of origin metadata. May 17 2022, 3:06 PM

vlorentz closed subtask T1747: Review APIs to get metadata from supported origins as Resolved. Jul 5 2022, 5:28 PM

This task has been migrated to GitLab.

gitlab-migration changed the status of subtask T1747: Review APIs to get metadata from supported origins from Resolved to Migrated. Jan 8 2023, 9:59 PM

gitlab-migration closed subtask T3681: Review extrinsic metadata specification as Migrated. Jan 8 2023, 10:23 PM
