When listing an origin, add origin level metadata to RMD storage
Closed, Migrated. Edits Locked.

Description

  • create method in lister core
  • add a different provider for each lister
  • add method into each lister

Note: this task is for the abstract method in lister-core; subtasks should be created for each lister.

RMD = Raw Metadata Storage
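
A minimal sketch of what the abstract method in lister core could look like, assuming a Python lister base class. Every name here (`ListerBase`, `get_origin_metadata`, `RawOriginMetadata`, `ExampleForgeLister`) is hypothetical and not taken from the actual lister code, and a plain list stands in for the RMD storage:

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RawOriginMetadata:
    """Hypothetical record: one raw metadata blob attached to an origin."""
    origin_url: str
    provider: str  # which lister/forge produced the metadata
    raw: bytes     # metadata kept exactly as serialized by the lister


class ListerBase(ABC):
    """Hypothetical stand-in for the lister core (not the real API)."""

    provider = "unknown"

    def __init__(self, storage: List[RawOriginMetadata]):
        # Assumption: a plain list plays the role of the RMD storage.
        self.storage = storage

    @abstractmethod
    def get_origin_metadata(self, entry: Dict[str, str]) -> bytes:
        """Return the raw metadata for one listed entry (lister-specific)."""

    def list_origins(self, entries: Iterable[Dict[str, str]]) -> None:
        # While listing, store origin-level metadata next to each origin.
        for entry in entries:
            raw = self.get_origin_metadata(entry)
            self.storage.append(RawOriginMetadata(entry["url"], self.provider, raw))


class ExampleForgeLister(ListerBase):
    """Hypothetical per-lister subclass: one provider, one implementation."""

    provider = "example-forge"

    def get_origin_metadata(self, entry: Dict[str, str]) -> bytes:
        # Here the listing response itself is kept as the raw metadata.
        return json.dumps(entry, sort_keys=True).encode()
```

In this sketch the core owns the "list, then store" loop, and each lister only supplies its provider name and the way it extracts metadata from a listed entry, which matches the split between this task and the per-lister subtasks.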

Event Timeline

moranegg renamed this task from "Add origin_metadata injection to listers" to "When listing an origin, add origin level metadata to storage". Nov 15 2017, 2:10 PM
moranegg updated the task description.
moranegg changed the task status from Open to Work in Progress. Feb 20 2018, 12:01 PM
moranegg raised the priority of this task from Low to Normal.
moranegg changed the task status from Work in Progress to Open. Oct 1 2018, 3:05 PM
moranegg removed moranegg as the assignee of this task.
moranegg mentioned this in Unknown Object (Maniphest Task). Apr 5 2019, 5:04 PM

Some thoughts about this (with contributions from @moranegg and @olasd):

  • the API endpoints used by the github and bitbucket listers do not show extrinsic metadata, so that option is out
  • sending a request for each repository would need ~2 to 3 years for a full pass over github. That's with our current infrastructure, so it's not a hard limit.

Where is the bottleneck for this? API rate limit or what? We already use multiple tokens for listing github, can't we just do the same here and speed up a complete pass (almost) arbitrarily?

@zack Yes, rate limit. And I determined this based on our current listing rate (@olasd said 2 to 3 days to fully list GitHub with 500 repos per request; so an API call for each repo would take a total of 500 times 2 or 3 days).
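
The estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch; it assumes that per-repository requests are subject to the same request rate as the listing requests, and the constant and function names are made up for the example:

```python
# Figures quoted in the comment above.
REPOS_PER_LISTING_REQUEST = 500
FULL_LISTING_DAYS = (2, 3)  # a full listing pass takes 2 to 3 days


def per_repo_pass_years(listing_days, repos_per_request=REPOS_PER_LISTING_REQUEST):
    """Years for a full pass if every repository needs its own request.

    Assumption: requests are sent at the same rate as during listing,
    so the pass takes repos_per_request times longer.
    """
    return listing_days * repos_per_request / 365


low, high = (per_repo_pass_years(days) for days in FULL_LISTING_DAYS)
print(f"{low:.1f} to {high:.1f} years")  # prints "2.7 to 4.1 years"
```

That is 1000 to 1500 days, the same order of magnitude as the "~2 to 3 years" figure given earlier.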

But reading etalab's code gave me an idea: they send an API call per github organization/user (or more, if they have a lot of repos), and get multiple repo metadata at once; we could do that too.
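
To get a feel for the saving, here is a rough sketch that groups already-listed origin URLs by owner and counts how many paginated requests a per-owner approach would need. It does not reproduce etalab's code; the helper names and the page size of 100 repositories per request are assumptions for illustration:

```python
import math
from collections import Counter
from urllib.parse import urlparse

PAGE_SIZE = 100  # assumed number of repositories returned per request


def owner_of(origin_url):
    """Extract the user/organization from a URL like https://github.com/<owner>/<repo>."""
    path = urlparse(origin_url).path.strip("/")
    return path.split("/")[0]


def requests_per_owner(origin_urls, page_size=PAGE_SIZE):
    """Map each owner to the number of paginated requests needed for its repos."""
    counts = Counter(owner_of(url) for url in origin_urls)
    return {owner: math.ceil(n / page_size) for owner, n in counts.items()}


# Usage: 251 origins spread over two owners.
urls = [f"https://github.com/octo/repo{i}" for i in range(250)]
urls.append("https://github.com/solo/only-repo")
plan = requests_per_owner(urls)
print(plan, sum(plan.values()))  # {'octo': 3, 'solo': 1} 4
```

In this toy example 4 requests replace 251 per-repository requests. The real gain depends on how repositories are distributed across owners: an owner with a single repository still costs one request.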

moranegg renamed this task from "When listing an origin, add origin level metadata to storage" to "When listing an origin, add origin level metadata to RMD storage". Sep 18 2020, 2:31 PM
moranegg updated the task description.
moranegg edited projects, added Extrinsic metadata; removed Metadata workflow.
vlorentz claimed this task.

Replaced by loader-based metadata loading (T4188 / T4186).

gitlab-migration changed the status of subtask T1740: fetch extrinsic origin metadata from GitHub from Resolved to Migrated.