Implement CVS loader
Closed, Migrated, Edits Locked

Assigned To

Authored By

	ardumont
	Oct 26 2021, 11:51 AM

Description

Implement loader
Deploy to staging
Call for public review [1]
Deploy to production

[1] https://sympa.inria.fr/sympa/arc/swh-devel/2022-04/msg00033.html

Revisions and Commits

rCDFJ Dockerfiles for Jenkins
	Closed		D6278 Update container to install the cvs dependency
rDDOC Development documentation
	Closed		D5966 Update documentation entry point with the new loader cvs module
rDLDCVS CVS Loader
	Abandoned		D6684 fix regular expression used for matching RCS keywords
	Closed		D6559 cvsclient: handle additional responses sent by server
	Closed		D5988 initial CVS loader stub
		D6823	rDLDCVS238c9c0335af validate input paths in the CVS loader
		D6813	rDLDCVSa66c6b4937d4 fix Log keyword expansion with trailing whitespace in prefix
		D6791	rDLDCVSdcb895ca2ff1 support custom keywords during rsync:// conversion
		D6781	rDLDCVS965629d6c8b3 fix the top-level directory path of imported CVS modules
		D6762	rDLDCVS9e8f931ef786 update test suite documentation
		D6758	rDLDCVS5298a8f9500e make CVS loader create one snapshot per visit
		D6745	rDLDCVS099959bbfa73 fix expansion of the Log keyword with rsync origins
		D6708	rDLDCVS939dd546b050 fix expansion of multiple RCS keywords on a line via rsync
		D6678	rDLDCVSbc00d6b16979 add a test for conversion of a file which contains a Header keyword
		D6678	rDLDCVS5539ccb67b2a attempt to avoid content differences due to paths in keywords
		D6638	rDLDCVS34f46486f4a4 preserve empty lines in CVS log messages over pserver
		D6623	rDLDCVSf5b974a00951 add CVS commit ID support to rlog.py
		D6593	rDLDCVSd28a4b21c56a handle Attic-only RCS files over CVS pserver
		D6590	rDLDCVSf52f0e452132 add support for RCS keyword expansion over pserver protocol
		D6566	rDLDCVSbeb7fc8a023a test checkout of file lacking trailing \n over pserver protocol
		D6561	rDLDCVS509ac801df74 rlog: fix loading of CVS commits which have a commit ID
		D6560	rDLDCVS0829dc3309d7 rlog: fix parsing of multiple file revisions
		D6558	rDLDCVSd3b3344bc26d cvsclient: handle files which lack a trailing newline

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2845 Improve Subversion loader and develop CVS loader
Migrated	gitlab-migration	T3691 Implement CVS loader
Migrated	gitlab-migration	T3788 staging: Deploy cvs loader v0.1
Migrated	gitlab-migration	T3798 Debian package for swh.loader.cvs
Migrated	gitlab-migration	T3835 staging: Ingest sourceforge cvs origins
Migrated	gitlab-migration	T3789 Adapt sourceforge lister to list cvs origins according to what the cvs loader expects
Migrated	gitlab-migration	T3947 Deploy swh.lister v2.7
Migrated	gitlab-migration	T4625 staging: ingest netbsd.org cvs forge
		Restricted Maniphest Task
		Restricted Maniphest Task
Migrated	gitlab-migration	T4626 staging: ingest openbsd.org cvs forge

Event Timeline

ardumont triaged this task as Normal priority. Oct 26 2021, 11:51 AM

ardumont created this task.

ardumont assigned this task to stsp. Oct 26 2021, 11:55 AM

ardumont added a revision: D5988: initial CVS loader stub.

ardumont added revisions: D6278: Update container to install the cvs dependency, D5966: Update documentation entry point with the new loader cvs module.

ardumont edited projects, added Archive coverage, CVS loader; removed Unknown Object (Project). Oct 26 2021, 11:59 AM

Current CVS loader status update:

I am testing the CVS loader against various repositories on cvs.savannah.gnu.org in order to find remaining problems that need to be addressed.

This includes testing of import via rsync origins like: rsync://cvs.savannah.gnu.org/sources/<PROJECT>/<REPO> and pserver origins like pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/<PROJECT>/<REPO>
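As a rough illustration of the two origin forms above, the following sketch splits an origin URL into its parts using only the Python standard library. The function name `describe_origin` and the `myproject/myrepo` path are made up for this example; this is not taken from the loader's code.

```python
from urllib.parse import urlparse


def describe_origin(url):
    """Split a CVS origin URL into access method, host, credentials and path.

    Hypothetical helper for illustration only.
    """
    parts = urlparse(url)
    if parts.scheme not in ("rsync", "pserver"):
        raise ValueError(f"unsupported CVS origin scheme: {parts.scheme!r}")
    return {
        "method": parts.scheme,
        "host": parts.hostname,
        "user": parts.username,      # None when the URL carries no credentials
        "password": parts.password,  # None when the URL carries no credentials
        "path": parts.path,
    }


# Placeholder project and repository names.
rsync_origin = describe_origin(
    "rsync://cvs.savannah.gnu.org/sources/myproject/myrepo"
)
pserver_origin = describe_origin(
    "pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/myproject/myrepo"
)
```

Both URLs resolve to the same host and path; only the access method and the credentials differ.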

The CVS loader can already import a very simple CVS repository which contains a handful of files which have a single revision each:
http://cvs.savannah.nongnu.org/viewvc/runbaby/runbaby/

Testing with a more complicated repository has revealed several issues in the pserver access method (rsync works fine).
Fixes for these issues have been submitted for review:
https://forge.softwareheritage.org/D6558
https://forge.softwareheritage.org/D6559
https://forge.softwareheritage.org/D6560
https://forge.softwareheritage.org/D6561

Another known remaining problem with the pserver method is that it does not yet expand RCS keywords in checked out files.
The rsync access method already does this, which means we may end up with different content hashes depending on the access method.
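To make the hash mismatch concrete, here is a minimal sketch of what expanding the $Id$ keyword does to file content. The helper `expand_id` and the sample file are hypothetical, and plain SHA-1 stands in for whatever content hashes the archive actually computes.

```python
import hashlib
import re

# Matches both the unexpanded form "$Id$" and an already expanded "$Id: ... $".
ID_KEYWORD = re.compile(rb"\$Id(?::[^$\n]*)?\$")


def expand_id(content, filename, revision, date, author, state="Exp"):
    """Replace every $Id$ keyword with its expanded form (illustrative only)."""
    expansion = f"$Id: {filename},v {revision} {date} {author} {state} $".encode()
    return ID_KEYWORD.sub(lambda match: expansion, content)


# Made-up file content and revision metadata.
raw = b"/* $Id$ */\nint main(void) { return 0; }\n"
expanded = expand_id(raw, "main.c", "1.1", "2021/10/26 11:51:00", "stsp")

raw_hash = hashlib.sha1(raw).hexdigest()
expanded_hash = hashlib.sha1(expanded).hexdigest()
```

If one access method stores `raw` and the other stores `expanded`, the same file revision ends up with two different content hashes.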

ardumont added a revision: D6558: cvsclient: handle files which lack a trailing newline. Oct 27 2021, 12:48 PM

stsp added a commit: rDLDCVSd3b3344bc26d: cvsclient: handle files which lack a trailing newline. Oct 27 2021, 3:45 PM

stsp added a revision: D6560: rlog: fix parsing of multiple file revisions. Oct 27 2021, 3:55 PM

stsp added a revision: D6561: rlog: fix loading of CVS commits which have a commit ID.

stsp added a revision: D6559: cvsclient: handle additional responses sent by server.

stsp added a commit: rDLDCVS0829dc3309d7: rlog: fix parsing of multiple file revisions. Oct 27 2021, 3:59 PM

stsp added a commit: rDLDCVS509ac801df74: rlog: fix loading of CVS commits which have a commit ID.

stsp added a revision: D6566: test checkout of file lacking trailing \n over pserver protocol. Oct 27 2021, 4:20 PM

stsp added a commit: rDLDCVSbeb7fc8a023a: test checkout of file lacking trailing \n over pserver protocol. Oct 28 2021, 7:02 PM

stsp added a revision: D6590: add support for RCS keyword expansion over pserver protocol. Oct 30 2021, 11:49 AM

stsp added a revision: D6593: handle Attic-only RCS files over CVS pserver. Oct 31 2021, 11:32 PM

Status update:

Patches to add RCS keyword expansion are under review.

I am still testing with the GNU dino CVS repository. There is (hopefully only) one more issue which needs to be addressed before this repository will be converted correctly over both pserver and rsync: the pserver access method currently ignores CVS commit IDs. This means it might merge CVS commits which contain the same log message and were made within 5 minutes of each other, even if commit IDs could be used to tell the commits apart. The rsync method already takes commit IDs into account. The GNU dino CVS repository contains some such commits near the end of its history, and this is causing the hashes to diverge between the pserver and rsync access methods.
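The grouping rule described above can be sketched roughly as follows. This is a simplified illustration, not the loader's actual algorithm: the names `FileRevision`, `same_commit` and `group_changesets` are invented here, and only the log message, the 5 minute window and the commit ID are considered.

```python
from dataclasses import dataclass
from typing import Optional

FUZZ_SECONDS = 300  # the 5 minute window mentioned above


@dataclass
class FileRevision:
    path: str
    log: str
    timestamp: int                  # seconds since the epoch
    commitid: Optional[str] = None  # absent in older CVS repositories


def same_commit(a, b):
    """Decide whether two file revisions belong to the same CVS commit."""
    if a.commitid is not None and b.commitid is not None:
        # Commit IDs, when both are known, settle the question.
        return a.commitid == b.commitid
    # Otherwise fall back to the log message and time window heuristic.
    return a.log == b.log and abs(a.timestamp - b.timestamp) <= FUZZ_SECONDS


def group_changesets(revisions):
    """Group per-file revisions into changesets, oldest first."""
    groups = []
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        for group in groups:
            if same_commit(group[-1], rev):
                group.append(rev)
                break
        else:
            groups.append([rev])
    return groups


# Two commits with the same log message, made one minute apart.
with_ids = [
    FileRevision("a.c", "fix build", 1000, "id1"),
    FileRevision("b.c", "fix build", 1060, "id2"),
]
without_ids = [
    FileRevision("a.c", "fix build", 1000),
    FileRevision("b.c", "fix build", 1060),
]
```

With commit IDs the two revisions stay separate changesets; without them the heuristic merges them into one, which is the kind of difference that leads to diverging hashes between access methods.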

There is another problem related to keywords: Some CVS-based projects use custom keywords, instead of the standard $Id$ keyword. This prevents wrong expansion of $Id$ when code is imported from one project to another. Usually the project's name will be used as the custom keyword name, such as $OpenBSD$ or $NetBSD$, instead of $Id$. At present, to expand keywords correctly in this case, we need to use the pserver access method to benefit from server-side keyword expansion. But we will end up with different hashes if rsync is used to import the same origin again. We might be able to auto-detect use of custom keywords if the rsync server allows access to the CVSROOT folder, but this is not always the case. If CVSROOT is hidden from rsync, the only reliable way to detect custom keywords would be a parameter that gets passed into the loader. We could, for example, allow passing the name of a custom keyword as a parameter embedded in the origin URL.

stsp added a revision: D6623: add CVS commit ID support to rlog.py. Nov 9 2021, 1:27 PM

As of D6623 the CVS loader is able to convert GNU dino correctly over both rsync and pserver access.

stsp added a commit: rDLDCVSf52f0e452132: add support for RCS keyword expansion over pserver protocol. Nov 9 2021, 3:43 PM

stsp added a commit: rDLDCVSd28a4b21c56a: handle Attic-only RCS files over CVS pserver. Nov 9 2021, 3:54 PM

stsp added a commit: rDLDCVSf5b974a00951: add CVS commit ID support to rlog.py. Nov 10 2021, 12:09 PM

In T3691#73518, @stsp wrote:

There is another problem related to keywords: Some CVS-based projects use custom keywords, instead of the standard $Id$ keyword. This prevents wrong expansion of $Id$ when code is imported from one project to another. Usually the project's name will be used as the custom keyword name, such as $OpenBSD$ or $NetBSD$, instead of $Id$. At present, to expand keywords correctly in this case, we need to use the pserver access method to benefit from server-side keyword expansion. But we will end up with different hashes if rsync is used to import the same origin again. We might be able to auto-detect use of custom keywords if the rsync server allows access to the CVSROOT folder, but this is not always the case. If CVSROOT is hidden from rsync, the only reliable way to detect custom keywords would be a parameter that gets passed into the loader. We could, for example, allow passing the name of a custom keyword as a parameter embedded in the origin URL.

The above is the only currently known remaining issue.

CVS calls its related option "KeywordExpand". So I guess we could use a corresponding parameter in the origin URL, like this: rsync://cvs.example.com/cvs/myproject?KeywordExpand=MyProject

The above would then expand $MyProject$ keywords in files, as if they were $Id$ keywords.
Note again that this would only matter for rsync where keyword expansion is done locally. With pserver access, the CVS server already expands such keywords on our behalf.

Would anyone object to passing a project-specific keyword as part of the origin URL like this? Or would this break assumptions made elsewhere in the system? For a given origin that uses a custom keyword the conversion would produce different results depending on whether the custom keyword is expanded or not. An origin URL which causes the custom keyword to be expanded would represent a slightly different origin (and result in different commit hashes) compared to an origin URL which ignores the custom keyword.
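A minimal sketch of how such a parameter could be pulled out of the origin URL, assuming the `?KeywordExpand=` form proposed above. The helper names are hypothetical and this is not necessarily how the loader ended up handling it.

```python
import re
from urllib.parse import parse_qs, urlparse, urlunparse


def split_custom_keyword(origin_url):
    """Return (URL without query string, custom keyword name or None)."""
    parts = urlparse(origin_url)
    keywords = parse_qs(parts.query).get("KeywordExpand", [])
    bare_url = urlunparse(parts._replace(query=""))
    return bare_url, (keywords[0] if keywords else None)


def keyword_pattern(name):
    """Build a regex matching $Name$ and $Name: ...$ for a given keyword name."""
    return re.compile(rb"\$" + re.escape(name.encode()) + rb"(?::[^$\n]*)?\$")


bare_url, keyword = split_custom_keyword(
    "rsync://cvs.example.com/cvs/myproject?KeywordExpand=MyProject"
)
pattern = keyword_pattern(keyword)
```

The resulting pattern could then be handed to the same expansion code that handles $Id$, so that $MyProject$ is treated identically during rsync conversions.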

stsp added a revision: D6638: preserve empty lines in CVS log messages over pserver. Nov 12 2021, 11:38 AM

Another problem with keyword expansion found during testing:

Keywords may contain the file path. During conversion over rsync we currently write out the absolute path of the local file we have on disk. In the pserver case the expanded keyword uses the server-side path instead.

For example:

-    $Header: /tmp/swh.loader.cvs.4lwzuu20-108/ccvs/windows-NT/Attic/ndir.c,v 1.1.1.1 1995/08/28 16:14:12 jimb Exp $
+    $Header: /sources/cvs/ccvs/windows-NT/Attic/ndir.c,v 1.1.1.1 1995/08/28 16:14:12 jimb Exp $

This should be fixed such that the rsync access method expands such keywords with the server-side path.
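The intended behavior can be sketched as a path translation applied before the Header keyword is expanded. The helper names below are hypothetical, and the split between the temporary directory and the repository-relative path is an assumption read off the example above.

```python
import posixpath


def server_side_path(local_path, local_root, server_root):
    """Map the path of a local RCS file back to its path on the CVS server."""
    relative = posixpath.relpath(local_path, local_root)
    return posixpath.join(server_root, relative)


def expand_header(rcs_path, revision, date, author, state="Exp"):
    """Build the expanded form of the Header keyword (illustrative only)."""
    return f"$Header: {rcs_path} {revision} {date} {author} {state} $"


# Values taken from the example above; where the roots end is an assumption.
local = "/tmp/swh.loader.cvs.4lwzuu20-108/ccvs/windows-NT/Attic/ndir.c,v"
remote = server_side_path(local, "/tmp/swh.loader.cvs.4lwzuu20-108", "/sources/cvs")
header = expand_header(remote, "1.1.1.1", "1995/08/28 16:14:12", "jimb")
```

With this translation the rsync conversion would produce the same Header line as the pserver one, so the resulting file contents would hash identically.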

stsp added a commit: rDLDCVS34f46486f4a4: preserve empty lines in CVS log messages over pserver. Nov 22 2021, 7:51 PM

The above problem with the Header keyword can be worked around (at least for the GNU savannah site) with the patch in D6678.

D6684 addresses another keyword expansion issue found while testing conversion of CVS's own history.

stsp added a revision: D6678: attempt to avoid content differences due to paths in keywords. Nov 24 2021, 12:30 PM

stsp added a revision: D6684: fix regular expression used for matching RCS keywords.

stsp added a commit: rDLDCVS5539ccb67b2a: attempt to avoid content differences due to paths in keywords. Nov 29 2021, 2:36 PM

stsp added a commit: rDLDCVSbc00d6b16979: add a test for conversion of a file which contains a Header keyword.

stsp added a revision: D6708: fix expansion of multiple RCS keywords on a line via rsync. Nov 29 2021, 7:11 PM

stsp added a commit: rDLDCVS939dd546b050: fix expansion of multiple RCS keywords on a line via rsync. Dec 4 2021, 5:29 PM

stsp added a revision: D6745: fix expansion of the Log keyword with rsync origins. Dec 4 2021, 5:39 PM

stsp added a commit: rDLDCVS099959bbfa73: fix expansion of the Log keyword with rsync origins. Dec 7 2021, 9:53 AM

stsp added a revision: D6758: make CVS loader create one snapshot per visit. Dec 7 2021, 10:59 AM

stsp added a commit: rDLDCVS5298a8f9500e: make CVS loader create one snapshot per visit. Dec 7 2021, 11:26 AM

stsp added a revision: D6762: update test suite documentation. Dec 7 2021, 11:43 AM

stsp added a commit: rDLDCVS9e8f931ef786: update test suite documentation. Dec 7 2021, 11:58 AM

stsp added a revision: D6781: fix the top-level directory path of imported CVS modules. Dec 8 2021, 9:51 AM

stsp added a commit: rDLDCVS965629d6c8b3: fix the top-level directory path of imported CVS modules. Dec 8 2021, 12:14 PM

stsp added a revision: D6791: support custom keywords during rsync:// conversion. Dec 8 2021, 3:53 PM

I have started test conversions of the OpenBSD CVS repository.

Unfortunately, it will be impossible to match the existing conversion of this repository to Git which is published on GitHub,
even though this conversion was created using the cvs2gitdump script which our own CVS loader is based on.

The problem is again related to keyword expansion.

The OpenBSD history contains Header keywords which expand to server-side paths. The published conversion uses a repository at the path /home/cvs/src, and this path ends up in various files via Header keywords. We can cope with this by using an rsync origin which exposes a copy of this CVS repository at rsync://example.com/home/cvs/src. However, CVS repositories published on official mirrors via rsync use a different path (just "/cvs/src"), so we would end up with different hashes when loading history from such an official mirror.

While the above could be worked around, there is another problem: a small difference in keyword expansion which is already present in the very first commit:

-        "$Header: /home/cvs/src/usr.sbin/eeprom/Attic/getdate.y,v 1.1.1.1 1995/10/18 08:47:33 deraadt Exp $";
+        "$Header: /home/cvs/src/usr.sbin/eeprom/getdate.y,v 1.1.1.1 1995/10/18 08:47:33 deraadt Exp $";

Our CVS loader expands this path like a CVS server would do, preserving the 'Attic' path component.
The upstream cvs2gitdump script however strips 'Attic' from the path, which is incompatible with the behavior implemented by CVS.

Given this choice, I would rather keep compatibility with CVS than with cvs2gitdump's behavior.
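The difference between the two behaviors comes down to one path component. The sketch below only illustrates the effect seen in the diff above; `strip_attic` is a hypothetical helper and is not taken from cvs2gitdump or from the loader.

```python
import posixpath


def strip_attic(rcs_path):
    """Drop an 'Attic' directory directly above the RCS file, if present."""
    directory, filename = posixpath.split(rcs_path)
    parent, last = posixpath.split(directory)
    if last == "Attic":
        return posixpath.join(parent, filename)
    return rcs_path


# Path as a CVS server (and our loader) would expand it.
cvs_style = "/home/cvs/src/usr.sbin/eeprom/Attic/getdate.y,v"
# Path as it appears in the published Git conversion.
stripped = strip_attic(cvs_style)
```

Since the path is embedded in file contents via the Header keyword, either choice changes the content hash of every affected file, which is why the two conversions cannot be made to match.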

There is one bug in our Log keyword handling which is also exposed by converting the very first OpenBSD CVS commit.
I will send a fix for this soon, with a corresponding test added.

However, beyond this, further experiments with converting the OpenBSD CVS repository are unlikely to be very useful since we cannot trivially compare our results to a known-good reference conversion. From an operational point of view our CVS loader is already up to the task, albeit quite slow.

stsp added a commit: rDLDCVSdcb895ca2ff1: support custom keywords during rsync:// conversion. Dec 9 2021, 3:33 PM

stsp added a revision: D6813: fix Log keyword expansion with trailing whitespace in prefix. Dec 9 2021, 3:48 PM

stsp added a commit: rDLDCVSa66c6b4937d4: fix Log keyword expansion with trailing whitespace in prefix. Dec 10 2021, 11:34 AM

Unless I have overlooked something, all currently known issues have now been addressed.

Unless I have overlooked something, all currently known issues have now been addressed.

Awesome.

The next steps would be the newly created subtasks:

T3789: adapt the sourceforge lister to what the loader actually expects in terms of cvs origins (more details in the task, it'd be neat if you could have a look?)

-> fixing the lister would allow us to actually try loading more origins (the same way as in production, but on staging), which would eventually surface some more issues (or not, sentry [1] would tell us)

T3788: actually trigger some origins on the staging infra (I can probably attend to it next week)

-> that's a prerequisite to actually deploying (it involves a PyPI upload, a Debian package and some Puppet work)

Could you please also open a diff with the necessary changes required for the docker
stack (swh-environment/docker changes you had to make to actually have the loader run
properly)?

[1] https://sentry.softwareheritage.org (I can invite you to the team there if you
create an account first, that way you will be able to see issues there)

I found one additional problem. See D6823.

ardumont changed the status of subtask T3788: staging: Deploy cvs loader v0.1 from Open to Work in Progress. Dec 17 2021, 3:49 PM

stsp added a commit: rDLDCVS238c9c0335af: validate input paths in the CVS loader. Jan 6 2022, 12:38 PM

ardumont closed subtask T3788: staging: Deploy cvs loader v0.1 as Resolved. Jan 7 2022, 3:50 PM

Could you please also open a diff with the necessary changes required for the docker
stack (swh-environment/docker changes you had to make to actually have the loader run
properly)?

@anlambert did it in D7176