ingest bitbucket hg/mercurial repositories
Closed, Resolved (Public)

Description

As per the parent task, but focusing on mercurial repositories (for which we don't have a loader yet).

Event Timeline

ardumont changed the status of subtask T329: hg / mercurial loader from Open to Work in Progress. Dec 20 2017, 11:42 AM
zack raised the priority of this task from Normal to High. Aug 21 2019, 10:11 AM

Given the recent announcement by bitbucket about dropping mercurial support, the priority of this task has just increased.

We do have a mercurial loader, which we have already used; it's time to run it on Bitbucket!

The mercurial loader and the bitbucket lister have been running all summer.

  • The bitbucket lister knows of 252402 mercurial origins.
  • The mercurial loader has visited 251755 of these origins, with the following results:
 latest_status | count  
---------------+--------
               | 251755
 partial       |  18004
 full          | 232641
 ongoing       |   1110

(there's clearly a bug somewhere in the loader as we don't have 1110 parallel workers ;))
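For a sense of proportion, the counts above can be turned into shares of the visited origins. This is only a sketch: it hardcodes the figures from the table rather than querying the archive, and the column name pct_of_visited is an illustrative label, not something from the schema. Note also that 252402 - 251755 = 647 listed origins have no visit at all yet.

```sql
-- Figures copied by hand from the status table above; nothing here touches the archive.
with status_counts(latest_status, n) as (
  values ('partial', 18004),
         ('full',    232641),
         ('ongoing', 1110)
)
select
  latest_status,
  n,
  -- share of all visited origins, as a percentage with two decimals
  round(100.0 * n / sum(n) over (), 2) as pct_of_visited
from status_counts
order by n desc;
```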

I'd be tempted to consider that this task is done, and that a followup should be made to investigate and fix the failing repositories.

SQL queries for reference

Count bitbucket lister origins

(on the swh-lister database)

select origin_type, count(*) from bitbucket_repo group by origin_type;

Count latest visits for mercurial bitbucket origins

(on the softwareheritage database)

with origin_latest_visit as (
  select
    origin.url,
    (select status
     from origin_visit
     where
       origin_visit.origin = origin.id 
       and origin_visit.type = 'hg'
     order by date desc 
     limit 1) as latest_status
  from origin
  where origin.url like 'https://bitbucket.org/%'
) select latest_status, count(*)
  from origin_latest_visit
  where latest_status is not null  -- filter out non-mercurial origins
  group by rollup(latest_status);  -- rollup adds a row with a null latest_status containing the sum of all rows
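
For the follow-up on failing repositories, a variant of the query above could list the origins themselves instead of counting them. This is a sketch under the assumption that the origin and origin_visit tables have exactly the columns used above; the choice of 'partial' and 'ongoing' as the statuses worth investigating is taken from the results table, not from any loader documentation.

```sql
-- Sketch only: same CTE as the counting query, but returning one row per origin.
with origin_latest_visit as (
  select
    origin.url,
    (select status
     from origin_visit
     where
       origin_visit.origin = origin.id
       and origin_visit.type = 'hg'
     order by date desc
     limit 1) as latest_status
  from origin
  where origin.url like 'https://bitbucket.org/%'
)
select url, latest_status
from origin_latest_visit
where latest_status in ('partial', 'ongoing')  -- latest hg visit did not end as 'full'
order by latest_status, url;
```

Run on the softwareheritage database, this would give the list of URLs to reschedule or debug; the 'ongoing' rows are the ones most likely to point at the loader bug mentioned above.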

I've rescheduled the tasks for the repositories that had not been loaded. We'll need to follow up separately on failing tasks.