diff --git a/README.md b/README.md
index e8a2091..3f25079 100644
--- a/README.md
+++ b/README.md
@@ -1,29 +1,176 @@
-swh-loader-cvs
-==============
+Software Heritage - CVS loader
+==============================
 
-The Software Heritage CVS Loader is a tool and a library to walk a local CVS repository
-and inject into the SWH dataset all contained files that weren't known before.
+The Software Heritage CVS Loader imports the history of CVS repositories
+into the SWH dataset.
 
 The main entry points are
 
 - :class:`swh.loader.cvs.loader.CvsLoader` for the main cvs loader which ingests content out of
   a local cvs repository
+
+Features
+--------
+The CVS loader can access CVS repositories via rsync or via the CVS
+pserver protocol, with optional support for tunnelling pserver via SSH.
+
+The CVS loader does _not_ require the cvs program to be installed.
+
+Access via rsync requires the rsync program to be installed.
+The CVS loader will then invoke rsync to obtain a temporary local copy of the
+entire CVS repository. It will then walk the local copy of the CVS repository and
+parse the history of each RCS file with a built-in RCS parser.
+This will usually be the fastest method for importing a given CVS repository.
+However, most CVS servers do not offer repository access via rsync, and CVS
+repositories which see active commits may see conversion problems because the
+CVS repository format was not designed for lock-less read access.
+
+Access via the plaintext CVS pserver protocol requires no external dependencies
+to be installed, and is compatible with regular CVS servers. This method will
+use read-locks on the server side and should therefore be safe to use with
+active CVS repositories.
+The CVS loader will use a built-in minimal CVS client written in Python to fetch
+the output of the cvs rlog command executed on the CVS server. This output will
+be processed to obtain repository history information. All versions of all files
+will then be fetched from the server and injected into the SWH archive.
+
+Access via pserver over SSH requires OpenSSH to be installed. Apart from
+using SSH as a transport layer the conversion process is the same as in
+the plaintext pserver case. The SSH client will be instructed to trust SSH host key
+fingerprints upon first use. If a CVS server changes its SSH fingerprint then manual
+intervention may be required in order for future visits to be successful.
+
+Regardless of access protocol, the CVS loader uses heuristics to convert the
+per-file history stored in CVS into changesets. These changesets correspond to
+snapshots in the SWH database model. A given CVS repository should always yield
+a consistent series of changesets across multiple visits.
+
+The following URL protocol schemes are recognized by the loader:
+
+- rsync://
+- pserver://
+- ssh://
+
+After the protocol scheme, the CVS server hostname must be specified,
+with an optional user:password field delimited from the hostname
+with the '@' character:
+
+```
+pserver://anonymous:password@cvs.example.com/
+```
+
+After the hostname, the server-side CVS root path must be specified.
+The path will usually contain a CVSROOT directory on the server, though
+this directory may be hidden from clients:
+
+```
+pserver://anonymous:password@cvs.example.com/var/cvs/
+```
+
+The final component of the URL identifies the name of the CVS module
+which should be ingested into the SWH archive:
+
+```
+pserver://anonymous:password@cvs.example.com/var/cvs/project1
+```
+
+As a concrete example, this URL points to the historical CVS repository
+of the a2ps project. In this case, the cvsroot path is /sources/a2ps and
+the CVS module of the project is called a2ps:
+
+```
+pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/a2ps/a2ps
+```
+
+In order to obtain the history of this repository the CVS loader will
+perform the CVS pserver protocol exchange which is also performed by:
+
+```
+cvs -d :pserver:anonymous@cvs.savannah.gnu.org/sources/a2ps rlog a2ps
+```
+
+Known Limitations
+-----------------
+CVS repositories which see active commits should be converted with care.
+It is possible to end up with a partial conversion of the latest commit
+if repository data is fetched via rsync while a commit is in progress.
+The pserver protocol is the safer option in such cases.
+
+Only the history of the main CVS branch is converted.
+CVS vendor branch imports and merges which modify the main branch are
+modeled as two distinct commits to the main branch.
+Other branches will not be represented in the conversion result at all.
+
+CVS labels are not converted into corresponding SWH tags/releases yet.
+
+The converter does not yet support incremental fetching of CVS history.
+The entire history will be fetched and processed during every visit.
+By design, CVS does not fully support a concept of changesets that span multiple
+files and, as such, importing an evolving CVS history incrementally is not a
+trivial problem. Regardless, some improvements could be made relatively easily,
+as noted below.
+
+CVS repositories copied with rsync could be cached locally, such that
+rsync will only download RCS files which have changed since the last visit.
+At present the local copy of the repository is fetched to a temporary directory
+and is deleted once the conversion process is done.
+
+It might help to store persistent meta-data about blobs imported from CVS.
+If such meta-data could be searched via a given CVS repository name, a path,
+and an RCS revision number then redundant downloads of file versions over
+the pserver protocol could be detected and skipped.
+
+The minimal CVS client does not yet support the optional gzip extension
+offered by the CVS pserver protocol. If this were supported then files
+downloaded from a CVS server could be compressed while in transit.
+
+The built-in minimal CVS client has not been tested against many versions of CVS.
+It should work fine against CVS 1.11 and 1.12 servers. More work may be needed
+to improve compatibility with older versions of CVS.
+
+Acknowledgements
+----------------
+This software contains code derived from *cvs2gitdump* written by YASUOKA Masahiko
+and from the *rcsparse* library written by Simon Schubert.
+
+Licensing information
+---------------------
+Parts of the software written by SWH developers are licensed under GPLv3.
+See the file LICENSE
+
+cvs2gitdump by YASUOKA Masahiko is licensed under ISC.
+See the top of the file swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
+
+rcsparse by Simon Schubert is licensed under AGPLv3.
+See the file swh/loader/cvs/rcsparse/COPYRIGHT
+
+# Running Tests
+
+Because the rcsparse library is implemented in C and accessed via Python bindings,
+the CVS loader must be compiled and installed before tests can be run and the
+*build* directory must be passed as an argument to pytest:
+
+```
+$ ./setup.py build install
+$ pytest ./build
+```
+
 
 # CLI run
 
 With the configuration:
 
 /tmp/loader_cvs.yml:
 ```
 storage:
   cls: remote
   args:
     url: http://localhost:5002/
 ```
 
 Run:
 
 ```
 swh loader --config-file /tmp/loader_cvs.yml \
-    run cvs
+    run cvs
 ```
diff --git a/README.rst b/README.rst
deleted file mode 120000
index cffceba..0000000
--- a/README.rst
+++ /dev/null
@@ -1 +0,0 @@
-docs/README.rst
\ No newline at end of file
diff --git a/setup.py b/setup.py
index 07c53d7..b4927cf 100755
--- a/setup.py
+++ b/setup.py
@@ -1,77 +1,77 @@
 #!/usr/bin/env python3
 # Copyright (C) 2019-2021 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information
 
 from io import open
 from os import path
 
 from setuptools import find_packages, setup, Extension
 
 here = path.abspath(path.dirname(__file__))
 
 # Get the long description from the README file
-with open(path.join(here, "README.rst"), encoding="utf-8") as f:
+with open(path.join(here, "README.md"), encoding="utf-8") as f:
     long_description = f.read()
 
 
 def parse_requirements(*names):
     requirements = []
     for name in names:
         if name:
             reqf = "requirements-%s.txt" % name
         else:
             reqf = "requirements.txt"
 
         if not path.exists(reqf):
             return requirements
 
         with open(reqf) as f:
             for line in f.readlines():
                 line = line.strip()
                 if not line or line.startswith("#"):
                     continue
                 requirements.append(line)
     return requirements
 
 
 setup(
     name="swh.loader.cvs",
     description="Software Heritage CVS Loader",
     long_description=long_description,
-    long_description_content_type="text/x-rst",
+    long_description_content_type="text/markdown",
     python_requires=">=3.7",
     author="Software Heritage developers",
     author_email="swh-devel@inria.fr",
     url="https://forge.softwareheritage.org/diffusion/swh-loader-cvs",
     packages=find_packages(),  # packages's modules
     install_requires=parse_requirements(None, "swh"),
     tests_require=parse_requirements("test"),
     setup_requires=["setuptools-scm"],
     use_scm_version=True,
     extras_require={"testing": parse_requirements("test")},
     include_package_data=True,
     entry_points="""
         [swh.workers]
         loader.cvs=swh.loader.cvs:register
     """,
     classifiers=[
         "Programming Language :: Python :: 3",
         "Intended Audience :: Developers",
         "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
         "Operating System :: OS Independent",
         "Development Status :: 3 - Alpha",
     ],
     project_urls={
         "Bug Reports": "https://forge.softwareheritage.org/maniphest",
         "Funding": "https://www.softwareheritage.org/donate",
         "Source": "https://forge.softwareheritage.org/source/swh-loader-cvs",
         "Documentation": "https://docs.softwareheritage.org/devel/swh-loader-cvs",
     },
     ext_modules = [
         Extension("swh.loader.cvs.rcsparse",
                   sources=["swh/loader/cvs/rcsparse/py-rcsparse.c",
                            "swh/loader/cvs/rcsparse/rcsparse.c"])
     ]
 )
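
Note on the URL layout documented in the README hunk of this patch: the scheme, optional user:password, hostname, cvsroot path and module name can be separated with the standard library alone. The following is a minimal sketch, not the loader's actual code; `split_cvs_url` and `CvsOrigin` are hypothetical names introduced here for illustration, and the rule that the last path component is the module while everything before it is the cvsroot is an assumption drawn from the README examples.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit

# Schemes listed in the README; anything else is rejected.
SUPPORTED_SCHEMES = ("rsync", "pserver", "ssh")


class CvsOrigin(NamedTuple):
    """Hypothetical container for the parts of a CVS origin URL."""
    scheme: str
    user: Optional[str]
    password: Optional[str]
    host: str
    cvsroot: str
    module: str


def split_cvs_url(url: str) -> CvsOrigin:
    """Split a URL of the form scheme://[user[:password]@]host/cvsroot/module."""
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    if not parts.hostname:
        raise ValueError("missing hostname in %r" % url)
    # Assumption: the final path component names the module and the
    # preceding components form the server-side CVS root path.
    cvsroot, _, module = parts.path.rstrip("/").rpartition("/")
    if not cvsroot or not module:
        raise ValueError("expected /cvsroot/module in %r" % url)
    return CvsOrigin(parts.scheme, parts.username, parts.password,
                     parts.hostname, cvsroot, module)


origin = split_cvs_url(
    "pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/a2ps/a2ps"
)
print(origin.cvsroot, origin.module)
```

For the a2ps URL from the README this separates the cvsroot `/sources/a2ps` from the module `a2ps`, which are the two values that appear in the equivalent `cvs -d ... rlog a2ps` invocation.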