Software Heritage

When listing an origin, add origin level metadata to storage
Open, NormalPublic

Description

  • create method in lister core
  • add different providers per lister
  • add method into each lister

Note: this task is for the abstract method in lister-core; subtasks should be created for each lister.
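The abstract-method part of the task could look something like the following sketch. The names (`ListerBase`, `get_origin_metadata`) and the return shape are assumptions for illustration, not the actual swh-lister API:

```python
# Hypothetical sketch: ListerBase and get_origin_metadata are assumed
# names, not swh-lister code.
from abc import ABC, abstractmethod
from typing import Any, Dict


class ListerBase(ABC):
    @abstractmethod
    def get_origin_metadata(self, origin_url: str) -> Dict[str, Any]:
        """Return origin-level extrinsic metadata for one origin.

        Each concrete lister implements this against its provider's API.
        """


class GitHubLister(ListerBase):
    def get_origin_metadata(self, origin_url: str) -> Dict[str, Any]:
        # A real implementation would query the provider; here we only
        # show the shape a concrete lister would return.
        return {"provider": "github", "origin": origin_url}
```

Each subtask would then cover one concrete lister's implementation of the method.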

Event Timeline

moranegg created this task.Nov 6 2017, 12:22 PM
moranegg renamed this task from Add origin_metadata injection to listers to When listing an origin, add origin level metadata to storage.Nov 15 2017, 2:10 PM
moranegg updated the task description.
moranegg triaged this task as Low priority.Nov 28 2017, 4:16 PM
moranegg changed the task status from Open to Work in Progress.Feb 20 2018, 12:01 PM
moranegg raised the priority of this task from Low to Normal.
moranegg removed moranegg as the assignee of this task.Oct 1 2018, 3:05 PM
moranegg changed the task status from Work in Progress to Open.
vlorentz added a project: Restricted Project.Feb 21 2019, 10:11 AM
moranegg mentioned this in Unknown Object (Maniphest Task).Apr 5 2019, 5:04 PM

Some thoughts about this (with contributions from @moranegg and @olasd):

  • the API endpoints used by the github and bitbucket listers do not expose extrinsic metadata, so that option is out
  • sending a request for each repository would need ~2 to 3 years for a full pass over github. That's with our current infrastructure, so it's not a hard limit.
zack added a subscriber: zack.May 20 2019, 3:45 PM
> sending a request for each repository would need ~2 to 3 years for a full pass over github. That's with our current infrastructure, so it's not a hard limit.

Where is the bottleneck for this? API rate limit or what? We already use multiple tokens for listing github; can't we just do the same here and speed up a complete pass (almost) arbitrarily?

@zack Yes, rate limit. And I determined this based on our current listing rate (@olasd said 2 to 3 days to fully list GitHub at 500 repos per request; so one API call per repo would take roughly 500 times that, i.e. 1000 to 1500 days).
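The 500× scaling works out as follows (a quick back-of-the-envelope check of the figures quoted above):

```python
# One listing request returns 500 repos; a full pass takes 2 to 3 days.
# Switching to one API call per repo multiplies request count by 500.
repos_per_request = 500

for full_pass_days in (2, 3):
    per_repo_days = full_pass_days * repos_per_request
    print(f"{full_pass_days} days/pass -> {per_repo_days / 365:.1f} years")
# -> roughly 2.7 to 4.1 years, in line with the "~2 to 3 years" estimate
```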

But reading etalab's code gave me an idea: they send one API call per GitHub organization/user (or more, if it has a lot of repos) and get metadata for multiple repos at once; we could do that too.
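The per-organization batching idea could be sketched like this. `GET /orgs/{org}/repos` is a real GitHub API endpoint (paginated, up to 100 repos per page); the planner function below is purely illustrative and not swh-lister or etalab code, and the repo count would in practice come from an earlier listing pass:

```python
# Sketch: plan one paginated request per organization instead of one
# request per repository. Only URL construction is shown; no network
# calls are made.
from math import ceil
from typing import Iterator


def org_repo_urls(org: str, repo_count: int,
                  per_page: int = 100) -> Iterator[str]:
    """Yield one GitHub API URL per page needed to cover an org's repos."""
    pages = max(1, ceil(repo_count / per_page))
    for page in range(1, pages + 1):
        yield (f"https://api.github.com/orgs/{org}/repos"
               f"?per_page={per_page}&page={page}")


# An org with 250 repos needs 3 requests instead of 250:
urls = list(org_repo_urls("some-org", 250))
print(len(urls))  # 3
```

This is where the savings come from: request count scales with the number of organizations and pages, not with the number of repositories.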