Revisions and Commits
- Open | None | T2845 Improve Subversion loader and develop CVS loader
- Open | stsp | T3691 Implement CVS loader
- Resolved | ardumont | T3788 staging: Deploy cvs loader v0.1
- Resolved | ardumont | T3798 Debian package for swh.loader.cvs
- Open | None | T3835 staging: Ingest sourceforge cvs origins
- Open | None | T3789 Adapt sourceforge lister to list cvs origins according to what the cvs loader expects
Mentioned Here:
- D6823: validate input paths in the CVS loader
- T3788: staging: Deploy cvs loader v0.1
- T3789: Adapt sourceforge lister to list cvs origins according to what the cvs loader expects
- D6684: fix regular expression used for matching RCS keywords
- D6678: attempt to avoid content differences due to paths in keywords
- D6623: add CVS commit ID support to rlog.py
- D6560: rlog: fix parsing of multiple file revisions
- D6561: rlog: fix loading of CVS commits which have a commit ID
- D6558: cvsclient: handle files which lack a trailing newline
- D6559: cvsclient: handle additional responses sent by server
Current CVS loader status update:
I am testing the CVS loader against various repositories on cvs.savannah.gnu.org in order to find remaining problems that need to be addressed.
This includes testing imports via rsync origins such as rsync://cvs.savannah.gnu.org/sources/<PROJECT>/<REPO>, and pserver origins such as pserver://anonymous:email@example.com/sources/<PROJECT>/<REPO>.
The CVS loader can already import a very simple CVS repository containing a handful of files with a single revision each.
Testing with a more complicated repository has revealed several issues in the pserver access method (rsync works fine).
Fixes for these issues have been submitted for review; see the diffs listed above.
Another known remaining problem with the pserver method is that it does not yet expand RCS keywords in checked out files.
The rsync access method already does this, which means we may end up with different content hashes depending on the access method.
Patches to add RCS keyword expansion are under review.
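For reference, client-side expansion of the standard RCS keywords can be sketched with a single regular expression. This is an illustrative simplification, not the loader's actual implementation; the keyword set follows the RCS documentation, and `$Log$` would need special handling that is omitted here:

```python
import re

# Standard RCS keywords; a real implementation also treats $Log$ specially.
KEYWORDS = ("Id", "Header", "Author", "Date", "Revision", "Source", "State")
KEYWORD_RE = re.compile(r"\$(%s)(:[^$\n]*)?\$" % "|".join(KEYWORDS))


def expand_keywords(text, values):
    """Replace $Keyword$ or $Keyword: old $ with $Keyword: value $,
    leaving unknown keywords untouched."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)
        return "$%s: %s $" % (name, values[name])
    return KEYWORD_RE.sub(replace, text)


src = 'static char rcsid[] = "$Id$";'
print(expand_keywords(
    src, {"Id": "foo.c,v 1.1 1995/10/18 08:47:33 deraadt Exp"}))
```

Note that the regex deliberately rejects `$` and newlines inside the keyword value, so an already-expanded keyword is re-expanded in place rather than matched greedily across the line.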
I am still testing with the GNU dino CVS repository. There is (hopefully only) one more issue which needs to be addressed before this repository converts correctly over both pserver and rsync: the pserver access method currently ignores CVS commit IDs. This means it might merge CVS commits which share the same log message and were made within 5 minutes of each other, even when commit IDs could be used to tell the commits apart. The rsync method already honors commit IDs. The GNU dino CVS repository contains some such commits near the end of its history, and this causes hash divergence between the pserver and rsync access methods.
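The grouping rule can be illustrated as follows: without commit IDs, file revisions are fuzzily merged into one changeset when author and log message match within a time window; with commit IDs, the IDs decide. A hedged sketch (the record layout and names are hypothetical, not the loader's actual data model):

```python
from collections import namedtuple

# Hypothetical, simplified file-revision record.
FileRev = namedtuple("FileRev", "path author log commitid time")

MAX_WINDOW = 5 * 60  # seconds; the conventional fuzzy-grouping window


def same_changeset(a, b):
    """Decide whether two CVS file revisions belong to one changeset."""
    if a.commitid is not None and b.commitid is not None:
        # Commit IDs, where present, distinguish commits unambiguously.
        return a.commitid == b.commitid
    # Fall back to fuzzy matching on author, log message, and time.
    return (
        a.author == b.author
        and a.log == b.log
        and abs(a.time - b.time) <= MAX_WINDOW
    )
```

With this rule, two revisions sharing author and log message within 5 minutes are kept apart as soon as their commit IDs differ, which is exactly what ignoring commit IDs gets wrong.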
There is another problem related to keywords: some CVS-based projects use custom keywords instead of the standard $Id$ keyword. This prevents wrong expansion of $Id$ when code is imported from one project into another. Usually the project's name is used as the custom keyword name, such as $OpenBSD$ or $NetBSD$, instead of $Id$.

At present, to expand keywords correctly in this case, we need to use the pserver access method and benefit from server-side keyword expansion. But we will end up with different hashes if rsync is used to import the same origin again.

We might be able to auto-detect use of custom keywords if the rsync server allows access to the CVSROOT folder, but this is not always the case. If CVSROOT is hidden from rsync, the only reliable way to detect custom keywords would be a parameter passed into the loader. We could, for example, allow passing the name of a custom keyword as a parameter embedded in the origin URL.
The above is the only currently known remaining issue.
CVS calls its related option "KeywordExpand". So I guess we could use a corresponding parameter in the origin URL, like this: rsync://cvs.example.com/cvs/myproject?KeywordExpand=MyProject
The above would then expand $MyProject$ keywords in files, as if they were $Id$ keywords.
Note again that this would only matter for rsync where keyword expansion is done locally. With pserver access, the CVS server already expands such keywords on our behalf.
Would anyone object to passing a project-specific keyword as part of the origin URL like this? Or would this break assumptions made elsewhere in the system? For a given origin that uses a custom keyword the conversion would produce different results depending on whether the custom keyword is expanded or not. An origin URL which causes the custom keyword to be expanded would represent a slightly different origin (and result in different commit hashes) compared to an origin URL which ignores the custom keyword.
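If we went this route, extracting the custom keyword from the origin URL could look like the sketch below. The parameter name mirrors CVS's KeywordExpand option, and the example URL is hypothetical:

```python
from urllib.parse import parse_qs, urlsplit


def custom_keyword_from_url(origin_url):
    """Return the custom keyword named by a KeywordExpand query
    parameter, or None if the origin URL does not carry one."""
    query = parse_qs(urlsplit(origin_url).query)
    values = query.get("KeywordExpand")
    return values[0] if values else None


print(custom_keyword_from_url(
    "rsync://cvs.example.com/cvs/myproject?KeywordExpand=MyProject"))
```

One consequence worth noting: since the query string is part of the origin URL, the same repository loaded with and without the parameter would be recorded as two distinct origins, which matches the observation above that they would produce different commit hashes anyway.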
Another problem with keyword expansion found during testing:
Keywords may contain the file path. During conversion over rsync we currently write out the absolute path of the local file we have on disk. In the pserver case the expanded keyword uses the server-side path instead.
- $Header: /tmp/swh.loader.cvs.4lwzuu20-108/ccvs/windows-NT/Attic/ndir.c,v 18.104.22.168 1995/08/28 16:14:12 jimb Exp $
+ $Header: /sources/cvs/ccvs/windows-NT/Attic/ndir.c,v 22.214.171.124 1995/08/28 16:14:12 jimb Exp $
This should be fixed such that the rsync access method expands such keywords with the server-side path.
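Fixing this amounts to mapping the local checkout path back to the repository path on the server before expanding $Header$. A minimal sketch, assuming we know both the local temporary root and the server-side repository root (the paths below are hypothetical):

```python
import posixpath


def server_side_rcs_path(local_path, local_root, server_root):
    """Rewrite a local RCS file path so that $Header$ expands with
    the server-side repository path instead of the local temporary
    directory we happen to have on disk."""
    rel = posixpath.relpath(local_path, local_root)
    return posixpath.join(server_root, rel)


print(server_side_rcs_path(
    "/tmp/swh.loader.cvs.xyz/ccvs/windows-NT/Attic/ndir.c,v",
    "/tmp/swh.loader.cvs.xyz",
    "/sources/cvs"))
```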
I have started test conversions of the OpenBSD CVS repository.
Unfortunately, it will be impossible to match the existing conversion of this repository to Git which is published on GitHub,
even though that conversion was created with the cvs2gitdump script our own CVS loader is based on.
The problem is again related to keyword expansion.
The OpenBSD history contains Header keywords which expand to server-side paths. The published conversion uses a repository at the path /home/cvs/src, and this path ends up in various files via Header keywords. We can cope with this by using an rsync origin which exposes a copy of this CVS repository at rsync://example.com/home/cvs/src. However, CVS repositories published on official mirrors via rsync use a different path (just "/cvs/src"), so we would end up with different hashes when loading history from such an official mirror.
While the above could be worked around, there is another problem: a small difference in keyword expansion which is already present in the very first commit:
- "$Header: /home/cvs/src/usr.sbin/eeprom/Attic/getdate.y,v 126.96.36.199 1995/10/18 08:47:33 deraadt Exp $"; + "$Header: /home/cvs/src/usr.sbin/eeprom/getdate.y,v 188.8.131.52 1995/10/18 08:47:33 deraadt Exp $";
Our CVS loader expands this path like a CVS server would do, preserving the 'Attic' path component.
The upstream cvs2gitdump script however strips 'Attic' from the path, which is incompatible with the behavior implemented by CVS.
Given this choice, I would rather keep CVS compatibility than compatibility to cvs2gitdump behavior.
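To illustrate the difference: CVS stores the RCS file of a file that is dead on head in an Attic subdirectory and keeps that component when expanding $Header$, while cvs2gitdump strips it. A small sketch of the two behaviors (the helper name is hypothetical):

```python
RCS_PATH = "/home/cvs/src/usr.sbin/eeprom/Attic/getdate.y,v"


def strip_attic(path):
    """cvs2gitdump-style behavior: drop any Attic component from an
    RCS file path; CVS itself would leave the path untouched."""
    return "/".join(p for p in path.split("/") if p != "Attic")


print(RCS_PATH)               # CVS-compatible expansion: Attic preserved
print(strip_attic(RCS_PATH))  # cvs2gitdump behavior: Attic stripped
```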
There is one bug in our Log keyword handling which is also exposed by converting the very first OpenBSD CVS commit.
I will send a fix for this soon, with a corresponding test added.
However, beyond this, further experiments with converting the OpenBSD CVS repository are unlikely to be very useful since we cannot trivially compare our results to a known-good reference conversion. From an operational point of view our CVS loader is already up to the task, albeit quite slow.
Unless I have overlooked something, all currently known issues have now been addressed.
The next steps would be the newly created subtasks:
- T3789: adapt the sourceforge lister to produce the cvs origins the loader expects (more details in the task; it'd be neat if you could have a look)
-> fixing the lister would let us actually try loading more origins (the production way, but on staging), which would eventually surface more issues (or not; Sentry would tell us)
- T3788: actually trigger loading of some origins on the staging infrastructure (I can probably attend to it next week)
-> that's a prerequisite for the actual deployment (it entails a PyPI upload, the Debian package, and some Puppet work)
Could you please also open a diff with the necessary changes required for the docker stack (the swh-environment/docker changes you had to make to actually have the loader run)?
Also, regarding https://sentry.softwareheritage.org: I can invite you to the team there if you create an account first; that way you will be able to see the issues there.