Define an architecture to fetch extrinsic metadata outside listers and loaders
Closed, Resolved, Public

Description

Listers won't work to fetch metadata for all forges. On GitHub and Bitbucket they don't get any metadata other than the description (and, on GitHub, whether the repository is a fork). On GitLab they also get the number of stars and forks, but not much more.

Existing package loaders currently load some metadata, but we may want to use dedicated loaders for that, e.g. with a specific visit type.

Or have a new kind of component outside listers and loaders (e.g. if we want to use GitHub's endpoint to list an org's repos, like etalab does).

Event Timeline

vlorentz renamed this task from "Define an architecture to fetch extrinsic metadata outside listers" to "Define an architecture to fetch extrinsic metadata outside listers and loaders". (Sep 18 2020, 2:36 PM)
vlorentz updated the task description.

The original idea for this was to have separate tasks to fetch metadata, so that loaders did not have forge-specific code to fetch metadata.

However, the idea of loading metadata from the loaders is more appealing the more I think about it:

  1. Metadata are fetched at about the same time as we snapshot code; which would allow showing more consistent states of repositories
  2. Active repositories automatically have their metadata fetched more often than inactive ones
  3. We don't have one more moving part to monitor and schedule
  4. This allows the Git loader to know a new repo is a "forge fork" of another one before it starts loading, so it can do an incremental load

The downsides I see:

  1. Loaders will need API credentials (though we may need to give Git loaders SSH keys because of T3544#69746)
  2. It doesn't provide a way to fetch metadata for inactive repositories; but we can deal with that later (eg. with an option to make loaders load only metadata; or simply to run loaders on repos even if the forge does not report changes)
  3. It "feels wrong" to have forge-specific code in loaders; but we can make the metadata loading pluggable (eg. with setuptools entrypoints) if this ever becomes an issue.

Thoughts?

To me the advantages are strong, especially points 1 and 4.

Having a "complete" snapshot including the current code and current metadata is better than having the metadata at a different visit.

The original idea for this was to have separate tasks to fetch metadata, so that loaders did not have forge-specific code to fetch metadata.

However, the idea of loading metadata from the loaders is more appealing the more I think about it:

  1. Metadata are fetched at about the same time as we snapshot code; which would allow showing more consistent states of repositories
  2. Active repositories automatically have their metadata fetched more often than inactive ones
  3. We don't have one more moving part to monitor and schedule
  4. This allows the Git loader to know a new repo is a "forge fork" of another one before it starts loading, so it can do an incremental load

Yes, all these are good points. As long as forges don't provide a way of loading the metadata in bulk, it makes sense to do it at the same time as loading.

I think we will want to ensure that a failing metadata fetch doesn't fail the whole loading operation altogether, to avoid too strongly coupling these components.
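
A minimal sketch of that decoupling, assuming hypothetical `load_code` and `fetch_metadata` callables (neither is an existing loader API). The metadata fetch is wrapped so that any error is logged and recorded, while the status of the code load is left untouched:

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def load_origin(
    origin_url: str,
    load_code: Callable[[str], str],
    fetch_metadata: Callable[[str], dict],
) -> dict:
    """Fetch metadata and load code, keeping the two outcomes independent.

    Both callables are hypothetical stand-ins: `load_code` returns a status
    string, `fetch_metadata` returns a dict or raises.
    """
    result = {"status": None, "metadata": None, "metadata_error": None}

    # Metadata goes first so the loader could use it (e.g. to detect a fork),
    # but a failure here must never abort the code load.
    try:
        result["metadata"] = fetch_metadata(origin_url)
    except Exception as exc:  # deliberately broad: contain any fetcher bug
        logger.warning("metadata fetch failed for %s: %s", origin_url, exc)
        result["metadata_error"] = str(exc)

    # The code load runs regardless of what happened above.
    result["status"] = load_code(origin_url)
    return result
```

With this shape, a forge API outage only shows up as a `metadata_error` on an otherwise successful visit.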

The downsides I see:

  1. Loaders will need API credentials (though we may need to give Git loaders SSH keys because of T3544#69746)

I doubt we will be giving git loaders SSH keys any time soon, and I'd rather we spelled out, and maybe contracted out, the improvements to dulwich that we need for better generic HTTPS support (which will automatically be useful for all upstreams, not just GitHub).

Either way, giving a set of forge API credentials to the git loader is just a matter of duplicating an entry in the puppet config (from lister to loader-git), so it's really not a big practical deal.

  2. It doesn't provide a way to fetch metadata for inactive repositories; but we can deal with that later (eg. with an option to make loaders load only metadata; or simply to run loaders on repos even if the forge does not report changes)

I would expect most forges to actually report a change to the origin if the metadata (only) changes, so they'd be scheduled for loading again anyway. In a situation where we're not lagging behind forges, we would also be re-loading origins with no known changes.

In terms of testing, I think we will want to have the option to do metadata-only/code-only/both loads, so we could consider scheduling some metadata-only loads more often.
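
One way the three modes could be expressed is as a flag that each loader step checks. This is only a sketch: `LoadMode` and `plan_steps` are made-up names, and the step names are placeholders.

```python
import enum
from typing import List


class LoadMode(enum.Flag):
    """Hypothetical option selecting what a visit should load."""

    CODE = enum.auto()
    METADATA = enum.auto()
    BOTH = CODE | METADATA


def plan_steps(mode: LoadMode) -> List[str]:
    """Return the (placeholder) steps a visit would run for the given mode."""
    steps = []
    if LoadMode.METADATA in mode:
        steps.append("fetch_metadata")
    if LoadMode.CODE in mode:
        steps.append("load_code")
    return steps
```

A scheduler could then queue `LoadMode.METADATA` visits more often than full ones, and tests could exercise each mode separately.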

  3. It "feels wrong" to have forge-specific code in loaders; but we can make the metadata loading pluggable (eg. with setuptools entrypoints) if this ever becomes an issue.

I think we want to design the metadata loading as a pluggable / third party module from the get go, because for forges which support multiple different VCSes, we will want to share the logic between them. This will also make testing the metadata fetching/mangling logic in isolation easier.
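
To illustrate the pluggable shape, here is a rough sketch of a fetcher registry keyed by forge hostname, so the same fetcher can be shared by loaders for different VCSes. Everything in it is an assumption: the registry, the decorator, and the `forge.example.org` fetcher are hypothetical, and in practice the registry might be populated from setuptools entry points rather than by a decorator.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

MetadataFetcher = Callable[[str], dict]

# Hypothetical registry mapping a forge hostname to its metadata fetcher.
_FETCHERS: Dict[str, MetadataFetcher] = {}


def register_fetcher(hostname: str):
    """Decorator registering a fetcher for all origins hosted on `hostname`."""

    def decorator(func: MetadataFetcher) -> MetadataFetcher:
        _FETCHERS[hostname] = func
        return func

    return decorator


def get_fetcher(origin_url: str) -> Optional[MetadataFetcher]:
    """Return the fetcher for this origin's forge, or None if there is none."""
    return _FETCHERS.get(urlparse(origin_url).hostname or "")


@register_fetcher("forge.example.org")
def fetch_example_forge_metadata(origin_url: str) -> dict:
    # Placeholder body: a real fetcher would query the forge's API here.
    return {"origin": origin_url, "forge": "forge.example.org"}
```

Any loader, whatever the VCS, would call `get_fetcher(origin_url)` and skip metadata when it returns `None`; the fetchers themselves can be unit-tested without a loader.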

In T1739#82939, @olasd wrote:

Yes, all these are good points. As long as forges don't provide a way of loading the metadata in bulk, it makes sense to do it at the same time as loading.

Some do, including GitHub, but I think the benefits of these bulk APIs are minimal because of the way GitHub implements rate-limiting.

The downsides I see:

  1. Loaders will need API credentials (though we may need to give Git loaders SSH keys because of T3544#69746)

[...]

Either way, giving a set of forge API credentials to the git loader is just a matter of duplicating an entry in the puppet config (from lister to loader-git), so it's really not a big practical deal.

Sure; but I meant it from a security point of view: these credentials will need to be available to processes which handle non-trusted data. Not to mention that they may be accidentally leaked to Sentry too.
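
For the Sentry part, one possible mitigation is to redact credential-looking fields before an event leaves the process. The sketch below is an assumption-laden illustration: the key list is made up, and the function is only shaped like the `before_send(event, hint)` callback that `sentry_sdk.init` accepts; wiring it in is left out.

```python
# Hypothetical set of key names treated as secrets (compared lowercased).
SENSITIVE_KEYS = {"authorization", "password", "token", "api_token", "credentials"}


def scrub(value):
    """Return a copy of `value` with the values of sensitive keys redacted."""
    if isinstance(value, dict):
        return {
            key: "[redacted]" if str(key).lower() in SENSITIVE_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def before_send(event, hint):
    """Callback-shaped wrapper: scrub the event and let it be sent."""
    return scrub(event)
```

This only catches secrets stored under known key names; a token embedded in a URL or an exception message would still get through.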

  2. It doesn't provide a way to fetch metadata for inactive repositories; but we can deal with that later (eg. with an option to make loaders load only metadata; or simply to run loaders on repos even if the forge does not report changes)

I would expect most forges to actually report a change to the origin if the metadata (only) changes, so they'd be scheduled for loading again anyway. In a situation where we're not lagging behind forges, we would also be re-loading origins with no known changes.

Probably not for all metadata (e.g. the number of stars). Either way, the semantics of updated_at are not documented by GitHub.

I started working on this design. We'll see if it needs to change later.