Ingest http://cvsweb.netbsd.org/ forge
Description
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T2845 Improve Subversion loader and develop CVS loader
Migrated | gitlab-migration | T3691 Implement CVS loader
Migrated | gitlab-migration | T4625 staging: ingest netbsd.org cvs forge
Restricted Maniphest Task | |
Restricted Maniphest Task | |
Event Timeline
The first attempt with a oneshot task failed; the container was killed:
swhscheduler@scheduler0:~$ swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ task add load-cvs -p oneshot ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src
Created 1 tasks
Task 33419700
  Next run: today (2022-10-11T19:01:36.865607+00:00)
  Interval: 1 day, 0:00:00
  Type: load-cvs
  Policy: oneshot
  Args:
    'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src'
  Keyword args:
swhscheduler@scheduler0:~$ swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ task add load-cvs -p oneshot url=ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src
Created 1 tasks
Task 33419701
  Next run: today (2022-10-11T19:03:01.852004+00:00)
  Interval: 1 day, 0:00:00
  Type: load-cvs
  Policy: oneshot
  Args:
  Keyword args:
    url: 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src'
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 19:02:04,933: ERROR/ForkPoolWorker-1] Task swh.loader.cvs.tasks.LoadCvsRepository[a3bc7947-bbfa-41c6-a262-029c8f54e93c] raised unexpected: TypeError('load_cvs() takes 0 positional arguments but 1 was given')
swh/loader-cvs-7df4454db6-rq2jv[loaders]: Traceback (most recent call last):
swh/loader-cvs-7df4454db6-rq2jv[loaders]:   File "/opt/swh/.local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
swh/loader-cvs-7df4454db6-rq2jv[loaders]:     R = retval = fun(*args, **kwargs)
swh/loader-cvs-7df4454db6-rq2jv[loaders]:   File "/opt/swh/.local/lib/python3.10/site-packages/sentry_sdk/integrations/celery.py", line 204, in _inner
swh/loader-cvs-7df4454db6-rq2jv[loaders]:     reraise(*exc_info)
swh/loader-cvs-7df4454db6-rq2jv[loaders]:   File "/opt/swh/.local/lib/python3.10/site-packages/sentry_sdk/_compat.py", line 54, in reraise
swh/loader-cvs-7df4454db6-rq2jv[loaders]:     raise value
swh/loader-cvs-7df4454db6-rq2jv[loaders]:   File "/opt/swh/.local/lib/python3.10/site-packages/sentry_sdk/integrations/celery.py", line 199, in _inner
swh/loader-cvs-7df4454db6-rq2jv[loaders]:     return f(*args, **kwargs)
swh/loader-cvs-7df4454db6-rq2jv[loaders]:   File "/opt/swh/.local/lib/python3.10/site-packages/swh/scheduler/task.py", line 61, in __call__
swh/loader-cvs-7df4454db6-rq2jv[loaders]:     result = super().__call__(*args, **kwargs)
swh/loader-cvs-7df4454db6-rq2jv[loaders]:   File "/opt/swh/.local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
swh/loader-cvs-7df4454db6-rq2jv[loaders]:     return self.run(*args, **kwargs)
swh/loader-cvs-7df4454db6-rq2jv[loaders]: TypeError: load_cvs() takes 0 positional arguments but 1 was given
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 19:03:09,781: INFO/MainProcess] Task swh.loader.cvs.tasks.LoadCvsRepository[50b01c22-89d9-483a-b251-a2e79ab7a20c] received
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 19:03:10,324: INFO/ForkPoolWorker-1] Load origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src' with type 'cvs'
swh/loader-cvs-7df4454db6-rq2jv[loaders]: Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 19:03:33,630: INFO/MainProcess] Task swh.loader.cvs.tasks.LoadCvsRepository[9b06f36c-9440-4693-a8de-6800499db192] received
swh/loader-cvs-7df4454db6-rq2jv[loaders]: Process 'ForkPoolWorker-1' pid:9 exited with 'signal 9 (SIGKILL)'
swh/loader-cvs-7df4454db6-rq2jv[loaders]: [2022-10-11 23:22:39,306: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 1.')
swh/loader-cvs-7df4454db6-rq2jv[loaders]: Traceback (most recent call last):
swh/loader-cvs-7df4454db6-rq2jv[loaders]:   File "/opt/swh/.local/lib/python3.10/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
swh/loader-cvs-7df4454db6-rq2jv[loaders]:     raise WorkerLostError(
swh/loader-cvs-7df4454db6-rq2jv[loaders]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 1.
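The first task (33419700) failed before any loading started: the origin was passed as a positional argument, and the traceback shows that load_cvs() accepts none. The second task passed it as url=... and got as far as the actual load. A minimal sketch of that behaviour, assuming the task function is declared with keyword arguments only (the load_cvs below is a stand-in for illustration, not the real swh.loader.cvs.tasks code):

```python
def load_cvs(**kwargs):
    """Stand-in for a task function that only accepts keyword arguments."""
    return kwargs


def try_call(*args, **kwargs):
    """Call load_cvs and return either its result or the TypeError message."""
    try:
        return load_cvs(*args, **kwargs)
    except TypeError as exc:
        return str(exc)


ORIGIN = "ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src"

# Positional form, as in the first scheduler command: rejected by Python.
positional_result = try_call(ORIGIN)
# Keyword form, as in the second scheduler command: accepted.
keyword_result = try_call(url=ORIGIN)

print(positional_result)
print(keyword_result)
```

With the scheduler CLI this corresponds to writing url=<origin> rather than a bare <origin> after the task type.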
A container to manually launch the loading was created to test the behavior: P1494
swh@loader-cvs-manual:~$ swh --log-level=DEBUG loader run cvs ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src
DEBUG:swh.loader.cli:ctx: <click.core.Context object at 0x7ff194333b20>
DEBUG:swh.core.config:Loading config file /etc/swh/config.yml
DEBUG:swh.loader.cli:config_file: /etc/swh/config.yml
DEBUG:swh.loader.cli:config:
DEBUG:swh.loader.cli:kw: {}
DEBUG:swh.loader.cli:registry: {'task_modules': ['swh.loader.cvs.tasks'], 'loader': <class 'swh.loader.cvs.loader.CvsLoader'>}
DEBUG:swh.loader.cli:loader class: <class 'swh.loader.cvs.loader.CvsLoader'>
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): storage1.internal.staging.swh.network:5002
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/visit/get_latest HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/add_multi HTTP/1.1" 200 13
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/visit/add HTTP/1.1" 200 127
INFO:swh.loader.cvs.loader.CvsLoader:Load origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src' with type 'cvs'
DEBUG:swh.loader.cvs.loader.CvsLoader:lister_not provided, skipping extrinsic origin metadata
DEBUG:swh.loader.cvs.loader.CvsLoader:prepare; origin_url=ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src scheme=ssh path=/cvsroot/src
Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
...
The loading is in progress
The loader was killed after it started consuming a lot of memory:
Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
Killed
swh@loader-cvs-manual:~$
I've launched the loader on the /src subdirectory only. According to the CVS loader documentation (https://forge.softwareheritage.org/source/swh-loader-cvs/browse/master/docs/),
it should also be launched on each of the other subdirectories:
[DIR] htdocs/
[DIR] othersrc/
[DIR] pkgsrc/
[DIR] src/
[DIR] xsrc/
I wonder whether this forge deserves a dedicated lister.
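Until a lister exists, the per-subdirectory loads can be enumerated by hand. The sketch below builds one origin URL per directory listed above and the matching oneshot scheduler command, reusing the command form already shown in this task. The helper names (origin_urls, scheduler_commands) are hypothetical, and the module list is hard-coded from the listing rather than discovered.

```python
SCHEDULER_URL = "http://scheduler0.internal.staging.swh.network:5008/"
CVSROOT = "ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot"
# Top-level directories copied from the listing above (not discovered).
MODULES = ["htdocs", "othersrc", "pkgsrc", "src", "xsrc"]


def origin_urls(cvsroot, modules):
    """Return one CVS origin URL per top-level module."""
    return [f"{cvsroot}/{module}" for module in modules]


def scheduler_commands(scheduler_url, origins):
    """Return one oneshot load-cvs command line per origin URL."""
    return [
        f"swh scheduler --url {scheduler_url} task add load-cvs -p oneshot url={origin}"
        for origin in origins
    ]


origins = origin_urls(CVSROOT, MODULES)
commands = scheduler_commands(SCHEDULER_URL, origins)
for command in commands:
    print(command)
```

A real lister would obtain the module list from the forge itself instead of a constant.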
The loading finally failed:
INFO:swh.loader.cvs.loader.CvsLoader:Load origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src' with type 'cvs'
DEBUG:swh.loader.cvs.loader.CvsLoader:lister_not provided, skipping extrinsic origin metadata
DEBUG:swh.loader.cvs.loader.CvsLoader:prepare; origin_url=ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src scheme=ssh path=/cvsroot/src
Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
ERROR:swh.loader.cvs.loader.CvsLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/core/loader.py", line 391, in load
    self.prepare()
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/loader.py", line 502, in prepare
    self.rlog.parse_rlog(main_rlog_file)
  File "/opt/swh/.local/lib/python3.10/site-packages/swh/loader/cvs/rlog.py", line 220, in parse_rlog
    raise ValueError("No filename found in rlog header")
ValueError: No filename found in rlog header
DEBUG:urllib3.connectionpool:Resetting dropped connection: storage1.internal.staging.swh.network
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): sentry.softwareheritage.org:443
DEBUG:urllib3.connectionpool:https://sentry.softwareheritage.org:443 "POST /api/21/store/ HTTP/1.1" 200 41
DEBUG:urllib3.connectionpool:https://sentry.softwareheritage.org:443 "POST /api/21/envelope/ HTTP/1.1" 200 2
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /origin/visit_status/add HTTP/1.1" 200 26
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /flush HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:http://storage1.internal.staging.swh.network:5002 "POST /clear/buffer HTTP/1.1" 200 1
DEBUG:swh.loader.cvs.loader.CvsLoader:cleanup {'status': 'failed'} for origin 'ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src'

real 350m49.023s
user 134m4.740s
sys 4m38.862s
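For readers unfamiliar with the error: each file entry in cvs rlog output normally begins with a header whose "RCS file:" line names the ,v file. The sketch below shows the kind of check that could raise this ValueError when that line is missing, for example on truncated output. It is an assumption about what the message means, not the code in swh/loader/cvs/rlog.py; rlog_filename and the sample path are hypothetical.

```python
def rlog_filename(header_lines):
    """Return the path from the first 'RCS file:' line of an rlog header.

    Raises ValueError if the header contains no such line.
    """
    prefix = "RCS file: "
    for line in header_lines:
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    raise ValueError("No filename found in rlog header")


# Illustrative header for a single file (the path is made up for the example).
complete_header = ["", "RCS file: /cvsroot/src/Makefile,v", "head: 1.1"]
# A header that lost its first lines, e.g. because the output was cut off.
truncated_header = ["head: 1.1", "branch:"]

print(rlog_filename(complete_header))
try:
    rlog_filename(truncated_header)
except ValueError as exc:
    print(f"error: {exc}")
```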
Regarding that memory issue, D8682 should help avoid it.
I encountered the same kind of issue with large repositories when I was working on the Subversion loader;
after applying the same kind of patch, memory consumption was much more reasonable
and overall loader performance was much better.
@vsellier, I landed all the optimizations for the CVS loader and tagged a new version, v0.5.0, so you can retry loading the NetBSD repository on staging.
You should also use the following origin URL: rsync://anoncvs.netbsd.org/cvsroot/src. It will make the loading faster, as the whole repository
is dumped to disk first and no network requests are issued afterwards.
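To retry with the rsync origin, the ssh URL used earlier in this task can be rewritten mechanically. This is a small sketch, assuming the rsync path mirrors the CVSROOT path as in the URL above; to_rsync_origin is a hypothetical helper, not part of swh-loader-cvs.

```python
from urllib.parse import urlsplit


def to_rsync_origin(origin_url):
    """Rewrite an ssh:// CVS origin URL into the equivalent rsync:// URL.

    Drops the user name and the empty port, and lowercases the host name.
    """
    parts = urlsplit(origin_url)
    if parts.scheme != "ssh" or not parts.hostname:
        raise ValueError(f"not an ssh origin URL: {origin_url!r}")
    return f"rsync://{parts.hostname}{parts.path}"


rsync_origin = to_rsync_origin("ssh://anoncvs@anoncvs.NetBSD.org:/cvsroot/src")
print(rsync_origin)
```

The resulting URL can then be passed as url=... to a new oneshot load-cvs task.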