ingest the OW2 GitLab instance
Closed, Resolved, Public

Description

The instance is at https://gitlab.ow2.org/ and we should add it to our crawler rotation.

Event Timeline

zack triaged this task as Normal priority. Jul 16 2019, 10:54 AM
zack created this task.
$ curl --head https://gitlab.ow2.org/api/v4/projects
HTTP/2 200
server: nginx
date: Tue, 27 Aug 2019 10:21:30 GMT
content-type: application/json
content-length: 19658
vary: Accept-Encoding
cache-control: no-cache
link: <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=2&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next", <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=1&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="first", <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=51&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="last"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-next-page: 2
x-page: 1
x-per-page: 20
x-prev-page:
x-request-id: a8aqxWitUT6
x-runtime: 1.180268
x-total: 1003
x-total-pages: 51
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
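
The pagination headers above are enough to plan a full listing. As a minimal sketch (not the lister's actual code), the following parses a Link header and checks the page count against x-total and x-per-page. The helper names `parse_link_header` and `expected_pages` are hypothetical, and the Link value is abbreviated from the response above (most query parameters dropped).

```python
import math
import re


def parse_link_header(value: str) -> dict:
    """Map each rel name (e.g. "next") to its URL in a Link header value."""
    links = {}
    for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', value):
        links[rel] = url
    return links


def expected_pages(total: int, per_page: int) -> int:
    """Number of pages needed to list `total` items, `per_page` at a time."""
    return math.ceil(total / per_page)


# Abbreviated from the response above; the real URLs carry more parameters.
link = (
    '<https://gitlab.ow2.org/api/v4/projects?page=2&per_page=20>; rel="next", '
    '<https://gitlab.ow2.org/api/v4/projects?page=1&per_page=20>; rel="first", '
    '<https://gitlab.ow2.org/api/v4/projects?page=51&per_page=20>; rel="last"'
)

links = parse_link_header(link)
pages = expected_pages(total=1003, per_page=20)  # x-total, x-per-page

print(links["next"])
print(pages)
```

With the values from the response, the computed page count agrees with the x-total-pages header (51). A client would typically follow the rel="next" URL until it is absent.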
ardumont changed the task status from Open to Work in Progress. Aug 29 2019, 11:01 AM
olasd added a subscriber: olasd. Aug 29 2019, 11:01 AM

Do we have an admin contact there, to make sure that cloning all their repos at once will not kill their infra?

ardumont added a comment. Edited Aug 29 2019, 11:21 AM

admin contact

Martin Hamant

(email sent, but he is out of office for now)

A first round has been done:

softwareheritage-scheduler=> select status, count(*) from task where type='load-git' and policy='oneshot' and priority='high' and arguments#>>'{args,0}' like 'https://gitlab.ow2.org%' group by status;
  status   | count
-----------+-------
 completed |   960
 disabled  |    43
(2 rows)

Note: I did not investigate the 43 disabled tasks.

add it to our crawler rotation

done

$ SCHEDULER_API_URL=http://saatchi.internal.softwareheritage.org:5008/
$ swh scheduler --url $SCHEDULER_API_URL task add list-gitlab-full api_baseurl=https://gitlab.ow2.org/api/v4 instance=ow2
$ swh scheduler --url $SCHEDULER_API_URL task list --task-type list-gitlab-full
...
Task 203527512
  Next run: in 3 months (2019-12-01 09:09:56+00:00)
  Interval: 90 days, 0:00:00
  Type: list-gitlab-full
  Policy: recurring
  Status: next_run_not_scheduled
  Priority:
  Args:
  Keyword args:
    api_baseurl: 'https://gitlab.ow2.org/api/v4'
    instance: 'ow2'
...

I expect the next run to be mostly a no-op (aside from the 43 disabled tasks above).

ardumont closed this task as Resolved. Sep 3 2019, 5:06 PM
ardumont claimed this task.

The "standard" listing (which outputs recurring tasks with no priority) has run:

softwareheritage-scheduler=> select status, count(*) from task where type='load-git' and policy='recurring' and priority is null and arguments#>>'{args,0}' like 'https://gitlab.ow2.org%' group by status;

         status         | count
------------------------+-------
 next_run_not_scheduled |  1003
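
As a quick sanity check, the counts reported in this task can be compared with each other. This is only a sketch: the numbers are copied by hand from the outputs above, and the variable names are made up for illustration. The x-total value was read on Aug 27, before either query ran, so agreement suggests the listing is complete but does not prove it.

```python
# Counts copied from the outputs in this task.
api_total = 1003  # x-total header from the curl response
oneshot = {"completed": 960, "disabled": 43}  # first (high priority) round
recurring = {"next_run_not_scheduled": 1003}  # "standard" listing

oneshot_total = sum(oneshot.values())
recurring_total = sum(recurring.values())

# Both rounds should cover as many origins as the API reported projects.
print(oneshot_total == api_total, recurring_total == api_total)
```

All three figures agree at 1003.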

So closing this.