Ingest http://cvsweb.netbsd.org/ forge
Description
| Status | Assigned | Task |
|---|---|---|
| Migrated | gitlab-migration | T2845 Improve Subversion loader and develop CVS loader |
| Migrated | gitlab-migration | T3691 Implement CVS loader |
| Migrated | gitlab-migration | T4625 staging: ingest netbsd.org cvs forge |
| | | Restricted Maniphest Task |
| | | Restricted Maniphest Task |
Event Timeline
The first attempt with a oneshot task failed; the container was killed:
swhscheduler@scheduler0:~$ swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ task add load-cvs -p oneshot ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src
Created 1 tasks
Task 33419700
Next run: today (2022-10-11T19:01:36.865607+00:00)
Interval: 1 day, 0:00:00
Type: load-cvs
Policy: oneshot
Args:
'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src'
Keyword args:
swhscheduler@scheduler0:~$ swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ task add load-cvs -p oneshot url=ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src
Created 1 tasks
Task 33419701
Next run: today (2022-10-11T19:03:01.852004+00:00)
Interval: 1 day, 0:00:00
Type: load-cvs
Policy: oneshot
Args:
Keyword args:
url: 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src'
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 19:02:04,933: ERROR/ForkPoolWorker-1] Task swh.loader.cvs.tasks.LoadCvsRepository[a3bc7947-bbfa-41c6-a262-029c8f54e93c] raised unexpected: TypeError('load_cvs() takes 0 positional arguments but 1 was given')
swh/loader-cvs-7df4454db6-rq2jv[loaders]: Traceback (most recent call last):
swh/loader-cvs-7df4454db6-rq2jv[loaders]: File "/opt/swh/.local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
swh/loader-cvs-7df4454db6-rq2jv[loaders]: R = retval = fun(*args, **kwargs)
swh/loader-cvs-7df4454db6-rq2jv[loaders]: File "/opt/swh/.local/lib/python3.10/site-packages/sentry_sdk/integrations/celery.py", line 204, in _inner
swh/loader-cvs-7df4454db6-rq2jv[loaders]: reraise(*exc_info)
swh/loader-cvs-7df4454db6-rq2jv[loaders]: File "/opt/swh/.local/lib/python3.10/site-packages/sentry_sdk/_compat.py", line 54, in reraise
swh/loader-cvs-7df4454db6-rq2jv[loaders]: raise value
swh/loader-cvs-7df4454db6-rq2jv[loaders]: File "/opt/swh/.local/lib/python3.10/site-packages/sentry_sdk/integrations/celery.py", line 199, in _inner
swh/loader-cvs-7df4454db6-rq2jv[loaders]: return f(*args, **kwargs)
swh/loader-cvs-7df4454db6-rq2jv[loaders]: File "/opt/swh/.local/lib/python3.10/site-packages/swh/scheduler/task.py", line 61, in __call__
swh/loader-cvs-7df4454db6-rq2jv[loaders]: result = super().__call__(*args, **kwargs)
swh/loader-cvs-7df4454db6-rq2jv[loaders]: File "/opt/swh/.local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
swh/loader-cvs-7df4454db6-rq2jv[loaders]: return self.run(*args, **kwargs)
swh/loader-cvs-7df4454db6-rq2jv[loaders]: TypeError: load_cvs() takes 0 positional arguments but 1 was given
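The `TypeError` above is the message Python produces when a function that accepts only keyword arguments is called with a positional one. That is consistent with task 33419700 (URL passed positionally) failing immediately, while task 33419701 (`url=...`) was accepted. A minimal sketch, assuming a hypothetical keyword-only stand-in; the real `load_cvs` signature is not shown in this thread:

```python
# Hypothetical stand-in for the task function: it accepts keyword arguments only.
# The actual signature in swh.loader.cvs.tasks is an assumption here.
def load_cvs(**kwargs):
    return kwargs


URL = "ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src"

# Positional call, as scheduled by the first `task add` command.
try:
    load_cvs(URL)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Keyword call, as scheduled by the second `task add` command (url=...).
accepted = load_cvs(url=URL)
print(accepted)
```

In scheduler terms, this suggests the origin should always be given as `url=<origin>` rather than as a bare argument.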
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 19:03:09,781: INFO/MainProcess] Task swh.loader.cvs.tasks.LoadCvsRepository[50b01c22-89d9-483a-b251-a2e79ab7a20c] received
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 19:03:10,324: INFO/ForkPoolWorker-1] Load origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src' with type 'cvs'
swh/loader-cvs-7df4454db6-rq2jv[loaders]: Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 19:03:33,630: INFO/MainProcess] Task swh.loader.cvs.tasks.LoadCvsRepository[9b06f36c-9440-4693-a8de-6800499db192] received
swh/loader-cvs-7df4454db6-rq2jv[loaders]: Process 'ForkPoolWorker-1' pid:9 exited with 'signal 9 (SIGKILL)'
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 23:22:39,306: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 1.')
swh/loader-cvs-7df4454db6-rq2jv[loaders]: Traceback (most recent call last):
swh/loader-cvs-7df4454db6-rq2jv[loaders]: File "/opt/swh/.local/lib/python3.10/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
swh/loader-cvs-7df4454db6-rq2jv[loaders]: raise WorkerLostError(
swh/loader-cvs-7df4454db6-rq2jv[loaders]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 1.
A container to manually launch the loading was created to test the behavior: P1494
swh@loader-cvs-manual:~$ swh --log-level=DEBUG loader run cvs ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src
DEBUG:swh.loader.cli:ctx: <click.core.Context object at 0x7ff194333b20>
DEBUG:swh.core.config:Loading config file /etc/swh/config.yml
DEBUG:swh.loader.cli:config_file: /etc/swh/config.yml
DEBUG:swh.loader.cli:config:
DEBUG:swh.loader.cli:kw: {}
DEBUG:swh.loader.cli:registry: {'task_modules': ['swh.loader.cvs.tasks'], 'loader': <class 'swh.loader.cvs.loader.CvsLoader'>}
DEBUG:swh.loader.cli:loader class: <class 'swh.loader.cvs.loader.CvsLoader'>
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): storage1.internal.staging.swh.network:5002
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/visit/get_latest HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/add_multi HTTP/1.1" 200 13
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/visit/add HTTP/1.1" 200 127
INFO:swh.loader.cvs.loader.CvsLoader:Load origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src' with type 'cvs'
DEBUG:swh.loader.cvs.loader.CvsLoader:lister_not provided, skipping extrinsic origin metadata
DEBUG:swh.loader.cvs.loader.CvsLoader:prepare; origin_url=ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src scheme=ssh path=/cvsroot/src
Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
...
The loading is in progress.
The loader was killed once it started consuming a lot of memory:
Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
Killed
swh@loader-cvs-manual:~$
I launched the loader on the /src subdirectory only. According to the CVS loader documentation (https://forge.softwareheritage.org/source/swh-loader-cvs/browse/master/docs/),
it should also be launched on the other subdirectories:
[DIR] htdocs/
[DIR] othersrc/
[DIR] pkgsrc/
[DIR] src/
[DIR] xsrc/
I suspect this deserves a lister.
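Short of a real lister, the per-subdirectory origins can be enumerated by hand. A minimal sketch, assuming each top-level directory listed above maps to one origin URL built the same way as the `src` one already used in this thread; the helper name `origin_urls` is hypothetical and not part of any swh package:

```python
# Top-level directories of the NetBSD CVS root, as listed above.
MODULES = ["htdocs", "othersrc", "pkgsrc", "src", "xsrc"]

# Base URL taken from the `src` origin used earlier in this thread.
BASE = "ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot"


def origin_urls(base=BASE, modules=MODULES):
    """Return one origin URL per module (hypothetical helper)."""
    return [f"{base}/{module}" for module in modules]


# Print the arguments for one oneshot task per origin, keyword form only.
for url in origin_urls():
    print(f"task add load-cvs -p oneshot url={url}")
```

Each printed line mirrors the arguments of the `swh scheduler ... task add` command used for `src`; whether every directory loads cleanly is untested here.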
The loading eventually failed:
INFO:swh.loader.cvs.loader.CvsLoader:Load origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src' with type 'cvs'
DEBUG:swh.loader.cvs.loader.CvsLoader:lister_not provided, skipping extrinsic origin metadata
DEBUG:swh.loader.cvs.loader.CvsLoader:prepare; origin_url=ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src scheme=ssh path=/cvsroot/src
Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
ERROR:swh.loader.cvs.loader.CvsLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/core/loader.py", line 391, in load
self.prepare()
File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/loader.py", line 502, in prepare
self.rlog.parse_rlog(main_rlog_file)
File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/rlog.py", line 220, in parse_rlog
raise ValueError("No filename found in rlog header")
ValueError: No filename found in rlog header
DEBUG:urllib3.connectionpool:Resetting dropped connection: storage1.internal.staging.swh.network
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): sentry.softwareheritage.org:443
DEBUG:urllib3.connectionpool:https://sentry.softwareheritage.org:443 "POST /api/21/store/ HTTP/1.1" 200 41
DEBUG:urllib3.connectionpool:https://sentry.softwareheritage.org:443 "POST /api/21/envelope/ HTTP/1.1" 200 2
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/visit_status/add HTTP/1.1" 200 26
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /flush HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /clear/buffer HTTP/1.1" 200 1
DEBUG:swh.loader.cvs.loader.CvsLoader:cleanup
{'status': 'failed'} for origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src'
real 350m49.023s
user 134m4.740s
sys 4m38.862s
Regarding that memory issue, D8682 should help avoid it.
I ran into the same kind of issue with large repositories when I was working on the Subversion loader;
after applying the same kind of patch, memory consumption was much more reasonable
and overall loader performance was much better.
@vsellier, I landed all optimizations for the CVS loader and tagged a new version v0.5.0 so you can retry the NetBSD repository loading on staging.
Also, you should use the following origin URL: rsync://anoncvs.netbsd.org/cvsroot/src. It will make the loading faster, as the whole repository
is dumped to disk first and no network requests are issued afterwards.
