Page MenuHomeSoftware Heritage

Implement CVS loader
Open, NormalPublic

Revisions and Commits

rCDFJ Dockerfiles for Jenkins
Closed
rDDOC Development documentation
Closed
rDLDCVS CVS Loader
Abandoned
Closed
Closed
D6823
D6813
D6791
D6781
D6762
D6758
D6745
D6708
D6678
D6678
D6638
D6623
D6593
D6590
D6566
D6561
D6560
D6558

Event Timeline

ardumont triaged this task as Normal priority.Oct 26 2021, 11:51 AM
ardumont created this task.
ardumont edited projects, added Archive coverage, Loader-CVS; removed Unknown Object (Project).Oct 26 2021, 11:59 AM

Current CVS loader status update:

I am testing the CVS loader against various repositories on cvs.savannah.gnu.org in order to find remaining problems that need to be addressed.

This includes testing of import via rsync origins like: rsync://cvs.savannah.gnu.org/sources/<PROJECT>/<REPO> and pserver origins like pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/<PROJECT>/<REPO>

The CVS loader can already import a very simple CVS repository which contains a handful of files which have a single revision each:
http://cvs.savannah.nongnu.org/viewvc/runbaby/runbaby/

Testing with a more complicated repository has revealed several issues in the pserver access method (rsync works fine).
Fixes for these issues have been submitted for review:
https://forge.softwareheritage.org/D6558
https://forge.softwareheritage.org/D6559
https://forge.softwareheritage.org/D6560
https://forge.softwareheritage.org/D6561

Another known remaining problem with the pserver method is that it does not yet expand RCS keywords in checked out files.
The rsync access method already does this, which means we may end up with different content hashes depending on the access method.

Status update:

Patches to add RCS keyword expansion are under review.

I am still testing with the GNU dino CVS repository. There is (hopefully only) one more issue which needs to be addressed before this repository will be converted correctly over both pserver and rsync: The pserver access method currently ignores CVS commit IDs. This means it might merge CVS commits together which contain the same log message and were made within 5 minutes of each other, even if commit IDs could be used to tell the commits apart. The rsync method already does this. The GNU dino CVS repository contains some such commits near the end of its history and this is causing hash diversions between the pserver and rsync access method.

There is another problem related to keywords: Some CVS-based projects use custom keywords, instead of the standard $Id$ keyword. This prevents wrong expansion of $Id$ when code is imported from one project to another. Usually the project's name will be used as the custom keyword name, such as $OpenBSD$ or $NetBSD$, instead of $Id$. At present, to expand keywords correctly in this case, we need to use the pserver access method to benefit from server-side keyword expansion. But we will end up with different hashes if rsync is used to import the same origin again. We might be able to auto-detect use of custom keywords if the rsync server allows access to the CVSROOT folder, but this is not always the case. If CVSROOT is hidden from rsync, the only reliable way to detect custom keywords would be a parameter that gets passed into the loader. We could, for example, allow passing the name of a custom keyword as a parameter embedded in the origin URL.

As of D6623 the CVS loader is able to convert GNU dino correctly over both rsync and pserver access.

In T3691#73518, @stsp wrote:

There is another problem related to keywords: Some CVS-based projects use custom keywords, instead of the standard $Id$ keyword. This prevents wrong expansion of $Id$ when code is imported from one project to another. Usually the project's name will be used as the custom keyword name, such as $OpenBSD$ or $NetBSD$, instead of $Id$. At present, to expand keywords correctly in this case, we need to use the pserver access method to benefit from server-side keyword expansion. But we will end up with different hashes if rsync is used to import the same origin again. We might be able to auto-detect use of custom keywords if the rsync server allows access to the CVSROOT folder, but this is not always the case. If CVSROOT is hidden from rsync, the only reliable way to detect custom keywords would be a parameter that gets passed into the loader. We could, for example, allow passing the name of a custom keyword as a parameter embedded in the origin URL.

The above is the only currently known remaining issue.

CVS calls its related option "KeywordExpand". So I guess we could use a corresponding parameter in the origin URL, like this: rsync://cvs.example.com/cvs/myproject?KeywordExpand=MyProject

The above would then expand $MyProject$ keywords in files, as if they were $Id$ keywords.
Note again that this would only matter for rsync where keyword expansion is done locally. With pserver access, the CVS server already expands such keywords on our behalf.

Would anyone object to passing a project-specific keyword as part of the origin URL like this? Or would this break assumptions made elsewhere in the system? For a given origin that uses a custom keyword the conversion would produce different results depending on whether the custom keyword is expanded or not. An origin URL which causes the custom keyword to be expanded would represent a slightly different origin (and result in different commit hashes) compared to an origin URL which ignores the custom keyword.

Another problem with keyword expansion found during testing:

Keywords may contain the file path. During conversion over rsync we currently write out the absolute path of the local file we have on disk. In the pserver case the expanded keyword uses the server-side path instead.

For example:

-    $Header: /tmp/swh.loader.cvs.4lwzuu20-108/ccvs/windows-NT/Attic/ndir.c,v 1.1.1.1 1995/08/28 16:14:12 jimb Exp $
+    $Header: /sources/cvs/ccvs/windows-NT/Attic/ndir.c,v 1.1.1.1 1995/08/28 16:14:12 jimb Exp $

This should be fixed such that the rsync access method expands such keywords with the server-side path.

The above problem with the Header keyword can be worked around (at least for the GNU savannah site) with the patch in D6678.

D6684 addresses another keyword expansion issue found while testing conversion of CVS's own history.

I have started test conversions of the OpenBSD CVS repository.

Unfortunately, it will be impossible to match the existing conversion of this repository to Git which is published on Github,
even though this conversion is created using the cvs2gitdump script which our own CVS loader is based on.

The problem is again related to to keyword expansion.

The OpenBSD history contains Header keywords which expand to server-side paths. The published conversion uses a repository at the path /home/cvs/src, and this path ends up in various files via Header keywords. We can cope with this by using an rsync origin which exposes a copy of this CVS repository at rsync://example.com/home/cvs/src. However, CVS repositories published on official mirrors via rsync use a different path (just "/cvs/src"), so we would end up with different hashes when loading history from such an official mirror.

While the above could be worked around, there is another problem: a small difference in keyword expansion which is already present in the very first commit:

-        "$Header: /home/cvs/src/usr.sbin/eeprom/Attic/getdate.y,v 1.1.1.1 1995/10/18 08:47:33 deraadt Exp $";
+        "$Header: /home/cvs/src/usr.sbin/eeprom/getdate.y,v 1.1.1.1 1995/10/18 08:47:33 deraadt Exp $";

Our CVS loader expands this path like a CVS server would do, preserving the 'Attic' path component.
The upstream cvs2gitdump script however strips 'Attic' from the path, which is incompatible with the behavior implemented by CVS.

Given this choice, I would rather keep CVS compatibility than compatibility to cvs2gitdump behavior.

There is one bug in our Log keyword handling which is also exposed by converting the very first OpenBSD CVS commit.
I will send a fix for this soon, with a corresponding test added.

However, beyond this, further experiments with converting the OpenBSD CVS repository are unlikely to be very useful since we cannot trivially compare our results to a known-good reference conversion. From an operational point of view our CVS loader is already up to the task, albeit quite slow.

Unless I have overlooked something, all currently known issues have now been addressed.

Unless I have overlooked something, all currently known issues have now been addressed.

Awesome.


The next steps would be the new created subtasks:

  • T3789: adapt the sourceforge lister to actually what expects the loader in terms of cvs origins (more details in the task, it'd be neat if you could have a look?)

    -> fixing the lister would allow actually trying to load more origins (in the production way but for staging) which would eventually lift some more issues (or not, sentry [1] would tell us)
  • T3788: actually trigger some origins on the staging infra (I can probably attend to it next week)

    -> that's a requisite to actually deploy (pulls pypi upload, debian package and some puppet works)

Could you please also open a diff with the necessary changes required for the docker
stack (swh-environment/docker changes you had to make to actually have the loader run
properly)?

[1] https://sentry.softwareheritage.org (i can invite you to the team there if you
create an account first, that way you will be able to see issues there)

ardumont changed the status of subtask T3788: staging: Deploy cvs loader v0.1 from Open to Work in Progress.Dec 17 2021, 3:49 PM