
ingest the OW2 GitLab instance
Closed, Migrated

Description

The instance is at https://gitlab.ow2.org/ and we should add it to our crawler rotation.

Event Timeline

zack triaged this task as Normal priority. Jul 16 2019, 10:54 AM
zack created this task.
$ curl --head https://gitlab.ow2.org/api/v4/projects
HTTP/2 200
server: nginx
date: Tue, 27 Aug 2019 10:21:30 GMT
content-type: application/json
content-length: 19658
vary: Accept-Encoding
cache-control: no-cache
link: <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=2&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next", <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=1&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="first", <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=51&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="last"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-next-page: 2
x-page: 1
x-per-page: 20
x-prev-page:
x-request-id: a8aqxWitUT6
x-runtime: 1.180268
x-total: 1003
x-total-pages: 51
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
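
The pagination headers above (`link`, `x-total`, `x-per-page`, `x-total-pages`) are what a lister would rely on to walk the project list. As a rough illustration only, and not the actual lister code, here is a minimal Python sketch that parses a `link` header into its `rel` targets and checks the page count implied by the totals. The function names are hypothetical, and the header value is shortened to just the `page` and `per_page` parameters.

```python
import math
import re
from typing import Dict


def parse_link_header(value: str) -> Dict[str, str]:
    """Map each rel name (e.g. "next", "last") to its URL."""
    links = {}
    # Each entry has the form: <URL>; rel="name"
    for match in re.finditer(r'<([^>]*)>\s*;\s*rel="([^"]*)"', value):
        url, rel = match.groups()
        links[rel] = url
    return links


def expected_pages(total: int, per_page: int) -> int:
    """Number of pages needed to list `total` items, `per_page` at a time."""
    return math.ceil(total / per_page)


# Shortened, illustrative header value: the real response carries many
# more query parameters in each URL.
link = (
    '<https://gitlab.ow2.org/api/v4/projects?page=2&per_page=20>; rel="next", '
    '<https://gitlab.ow2.org/api/v4/projects?page=1&per_page=20>; rel="first", '
    '<https://gitlab.ow2.org/api/v4/projects?page=51&per_page=20>; rel="last"'
)

links = parse_link_header(link)
print(links["next"])
# x-total: 1003 and x-per-page: 20 from the response above
print(expected_pages(1003, 20))
```

With `x-total: 1003` and `x-per-page: 20`, the computed page count agrees with the `x-total-pages: 51` reported by the server.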
ardumont changed the task status from Open to Work in Progress. Aug 29 2019, 11:01 AM

Do we have an admin contact there, to make sure that cloning all their repos at once will not kill their infra?

admin contact

Martin Hamant

(Email sent, but the contact is out of office for now.)

A first round has been done:

softwareheritage-scheduler=> select status, count(*) from task where type='load-git' and policy='oneshot' and priority='high' and arguments#>>'{args,0}' like 'https://gitlab.ow2.org%' group by status;
  status   | count
-----------+-------
 completed |   960
 disabled  |    43
(2 rows)

Note: I did not investigate the 43 disabled tasks.
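
For readers unfamiliar with the `arguments#>>'{args,0}'` path expression in the query above: it extracts the first positional argument of each task as text, and the `like 'https://gitlab.ow2.org%'` clause is a prefix match on it. The following Python sketch mimics that filter and the `group by status` on in-memory rows. The sample rows, repository URLs, and the `count_by_status` helper are all hypothetical and only illustrate the logic; this does not query the scheduler database.

```python
from collections import Counter

# Hypothetical rows shaped like the scheduler's task table; the URLs are
# made up for illustration.
tasks = [
    {"type": "load-git", "policy": "oneshot", "priority": "high",
     "status": "completed",
     "arguments": {"args": ["https://gitlab.ow2.org/hypothetical/repo-a"], "kwargs": {}}},
    {"type": "load-git", "policy": "oneshot", "priority": "high",
     "status": "completed",
     "arguments": {"args": ["https://gitlab.ow2.org/hypothetical/repo-b"], "kwargs": {}}},
    {"type": "load-git", "policy": "oneshot", "priority": "high",
     "status": "disabled",
     "arguments": {"args": ["https://gitlab.ow2.org/hypothetical/repo-c"], "kwargs": {}}},
    # Different host: excluded by the prefix match.
    {"type": "load-git", "policy": "oneshot", "priority": "high",
     "status": "completed",
     "arguments": {"args": ["https://example.org/other/repo"], "kwargs": {}}},
    # Recurring task without priority: excluded by the policy/priority filter.
    {"type": "load-git", "policy": "recurring", "priority": None,
     "status": "next_run_not_scheduled",
     "arguments": {"args": ["https://gitlab.ow2.org/hypothetical/repo-a"], "kwargs": {}}},
]


def count_by_status(tasks, url_prefix, policy, priority):
    """Count load-git tasks per status, filtered like the SQL query."""
    counts = Counter()
    for task in tasks:
        args = task["arguments"].get("args") or []
        first_arg = args[0] if args else None  # arguments#>>'{args,0}'
        if (
            task["type"] == "load-git"
            and task["policy"] == policy
            and task["priority"] == priority  # None stands in for SQL NULL
            and first_arg is not None
            and first_arg.startswith(url_prefix)  # like 'prefix%'
        ):
            counts[task["status"]] += 1
    return dict(counts)


print(count_by_status(tasks, "https://gitlab.ow2.org", "oneshot", "high"))
```

In the real result, the two statuses add up to 960 + 43 = 1003, which matches the `x-total: 1003` reported by the GitLab API.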

add it to our crawler rotation

done

$ SCHEDULER_API_URL=http://saatchi.internal.softwareheritage.org:5008/
$ swh scheduler --url $SCHEDULER_API_URL task add list-gitlab-full api_baseurl=https://gitlab.ow2.org/api/v4 instance=ow2
$ swh scheduler --url $SCHEDULER_API_URL task list --task-type list-gitlab-full
...
Task 203527512
  Next run: in 3 months (2019-12-01 09:09:56+00:00)
  Interval: 90 days, 0:00:00
  Type: list-gitlab-full
  Policy: recurring
  Status: next_run_not_scheduled
  Priority:
  Args:
  Keyword args:
    api_baseurl: 'https://gitlab.ow2.org/api/v4'
    instance: 'ow2'
...

I expect the next run to be mostly a no-op (aside from the 43 tasks disabled in the first round).

ardumont claimed this task.

The "standard" listing (which outputs recurring tasks with no priority) has run:

softwareheritage-scheduler=> select status, count(*) from task where type='load-git' and policy='recurring' and priority is null and arguments#>>'{args,0}' like 'https://gitlab.ow2.org%' group by status;

         status         | count
------------------------+-------
 next_run_not_scheduled |  1003

So closing this.