Software Heritage

Prioritize archival from gitlab.com
Open, Unbreak Now! · Public

Description

gitlab.com may start pruning inactive repositories in September 2022: https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/ (thanks to pabs3 on #swh for bringing it up)

According to https://grafana.softwareheritage.org/d/6q7qtNSnk/draft-scheduler-metrics?orgId=1&from=1657007647537&to=1659599647538 , we currently have 4M archived repositories from gitlab.com, but 60k of them are outdated, and we are missing 220k more.

Therefore, I believe we should prioritize archival of gitlab.com so that all repositories are visited within 12 months of their last update (including creation).

Event Timeline

vlorentz triaged this task as Unbreak Now! priority. (Edited: Thu, Aug 4, 10:28 AM)
vlorentz created this task.

I am currently running a query to find how many origins are over one year overdue for a visit:

-- count gitlab.com origins more than a year out of date that
-- have not been visited since their last update
select
  count(*)
from
  listed_origins
inner join
  listers
  on (listed_origins.lister_id=listers.id)
left join
  origin_visit_stats
  on (listed_origins.url=origin_visit_stats.url and listed_origins.visit_type=origin_visit_stats.visit_type)
where
  listers.name='gitlab'
  and listers.instance_name='gitlab.com'
  and (last_visit is null or last_update>last_visit)
  and last_update < now()-'1 year'::interval
;

As usual, I'm uneasy with the (general) idea of manually handling some repositories to absorb one bit of lag. This will only increase lag in another area that we will want to cover next. Rinse, repeat.

Your query filters repos by last visit regardless of that visit's status. You may want to get a more accurate number by using last_successful instead of last_visit: how many of these have we really archived, vs. /attempted/ (and failed) to archive?

If we want to handle this in a sustainable way, we should turn this into a scheduling policy that would prioritize "repos from <lister instance id> not updated since <configurable interval> and not successfully archived since".
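A minimal sketch of what such a policy's selection query could look like, reusing the listed_origins / origin_visit_stats schema from the queries above (the lister instance, interval, and batch size are placeholders, and the ordering heuristic is only illustrative, not an actual scheduler policy):

```sql
-- Hypothetical sketch: pick a batch of high-priority origins for one
-- lister instance, preferring repos never successfully archived, then
-- the most stale ones. Column names follow the queries in this task;
-- the interval and limit would be configurable in a real policy.
select
  listed_origins.url,
  listed_origins.visit_type
from
  listed_origins
inner join
  listers
  on (listed_origins.lister_id=listers.id)
left join
  origin_visit_stats
  on (listed_origins.url=origin_visit_stats.url
      and listed_origins.visit_type=origin_visit_stats.visit_type)
where
  listers.name='gitlab'
  and listers.instance_name='gitlab.com'
  and listed_origins.enabled
  and last_update < now() - '1 year'::interval
  and (last_successful is null or last_successful < last_update)
order by
  last_successful asc nulls first,
  last_update asc
limit 1000;
```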

Updated query running:

-- break the >1-year-old gitlab.com origins down by
-- visitability and visit outcome
select
  count(*) as all_repos,
  count(*) filter (where enabled) as visitable,
  count(*) filter (where enabled and last_successful >= last_update) as uptodate,
  count(*) filter (where enabled and last_successful < last_update) as not_uptodate,
  count(*) filter (where enabled and last_successful is null and last_visit is not null) as visited_unsuccessfully,
  count(*) filter (where enabled and last_visit is null) as never_visited
from
  listed_origins
inner join
  listers
  on (listed_origins.lister_id=listers.id)
left join
  origin_visit_stats
  on (listed_origins.url=origin_visit_stats.url and listed_origins.visit_type=origin_visit_stats.visit_type)
where
  listers.name='gitlab'
  and listers.instance_name='gitlab.com'
  and last_update < now() - '1 year'::interval;

Looks like there are many more repos that should be visitable but aren't:

 all_repos │ visitable │ uptodate │ not_uptodate │ visited_unsuccessfully │ never_visited 
───────────┼───────────┼──────────┼──────────────┼────────────────────────┼───────────────
   2641101 │   2574254 │  2442236 │          161 │                 127126 │          4731

Of the 127k repos that we've tried to visit but didn't succeed, most seem to have disappeared before we've visited them (last visit status = not found). What bothers me is that recent lister runs are claiming to have seen them ("last_seen" in the listed_origins table is in the last few days).
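To confirm that breakdown, a variant of the query above grouping by the recorded outcome could help. This assumes origin_visit_stats exposes the last visit's status as a column, here called last_visit_status, which is an assumption based on the description above:

```sql
-- Hypothetical sketch: break down the unsuccessfully-visited repos
-- by the status of their last visit (e.g. not_found vs. failed).
-- The last_visit_status column name is an assumption.
select
  last_visit_status,
  count(*)
from
  listed_origins
inner join
  listers
  on (listed_origins.lister_id=listers.id)
left join
  origin_visit_stats
  on (listed_origins.url=origin_visit_stats.url
      and listed_origins.visit_type=origin_visit_stats.visit_type)
where
  listers.name='gitlab'
  and listers.instance_name='gitlab.com'
  and enabled
  and last_successful is null
  and last_visit is not null
  and last_update < now() - '1 year'::interval
group by
  last_visit_status
order by
  count(*) desc;
```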

https://twitter.com/gitlab/status/1555325376687226883

We discussed internally what to do with inactive repositories.
We reached a decision to move unused repos to object storage.
Once implemented, they will still be accessible but take a bit longer to access after a long period of inactivity.

So it looks like we don't need to do anything, after all.