diff --git a/README.md b/README.md
deleted file mode 100644
index 9f17f18..0000000
--- a/README.md
+++ /dev/null
@@ -1,181 +0,0 @@
-Software Heritage - CVS loader
-==============================
-
-The Software Heritage CVS Loader imports the history of CVS repositories
-into the SWH dataset.
-
-The main entry points are
-
-- :class:`swh.loader.cvs.loader.CvsLoader` for the main cvs loader which ingests content out of
-  a local cvs repository
-
-
-Features
---------
-The CVS loader can access CVS repositories via rsync or via the CVS
-pserver protocol, with optional support for tunnelling pserver via SSH.
-
-The CVS loader does _not_ require the cvs program to be installed.
-
-Access via rsync requires the rsync program to be installed.
-The CVS loader will then invoke rsync to obtain a temporary local copy of the
-entire CVS repository. It will then walk the local copy the CVS repository and
-parse history of each RCS file with a built-in RCS parser.
-This will usually be the fastest method for importing a given CVS repository.
-However, most CVS servers do not offer repository access via rsync, and CVS
-repositories which see active commits may see conversion problems because the
-CVS repository format was not designed for lock-less read access.
-
-Access via the plaintext CVS pserver protocol requires no external dependencies
-to be installed, and is compatible with regular CVS servers. This method will
-use read-locks on the server side and should therefore be safe to use with
-active CVS repositories.
-The CVS loader will use a built-in minimal CVS client written in Python to fetch
-the output of the cvs rlog command executed on the CVS server. This output will
-be processed to obtain repository history information. All versions of all files
-will then be fetched from the server and injected into the SWH archive.
-
-Access via pserver over SSH requires OpenSSH to be installed. Apart from
-using SSH as a transport layer the conversion process is the same as in
-the plaintext pserver case. The SSH client will be instructed to trust SSH host key
-fingeprints upon first use. If a CVS server changes its SSH fingerprint then manual
-intervention may be required in order for future visits to be successful.
-
-Regardless of access protocol, the CVS loader uses heuristics to convert the
-per-file history stored in CVS into changesets. These changesets correspond to
-snapshots in the SWH database model. A given CVS repository should always yield
-a consistent series of changesets across multiple visits.
-
-The following URL protocol schemes are recognized by the loader:
-
-- rsync://
-- pserver://
-- ssh://
-
-After the protocol scheme, the CVS server hostname must be specified,
-with an optional user:password field delimited from the hostname
-with the '@' character:
-
-```
-pserver://anonymous:password@cvs.example.com/
-```
-
-After the hostname, the server-side CVS root path must be specified.
-The path will usually contain a CVSROOT directory on the server, though
-this directory may be hidden from clients:
-
-```
-pserver://anonymous:password@cvs.example.com/var/cvs/
-```
-
-The final component of the URL identifies the name of the CVS module
-which should be ingested into the SWH archive:
-
-```
-pserver://anonymous:password@cvs.example.com/var/cvs/project1
-```
-
-As a concrete example, this URL points to the historical CVS repository
-of the a2ps project. In this case, the cvsroot path is /sources/a2ps and
-the CVS module of the project is called a2ps:
-
-```
-pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/a2ps/a2ps
-```
-
-In order to obtain the history of this repository the CVS loader will
-perform the CVS pserver protocol exchange which is also performed by:
-
-```
-cvs -d :pserver:anonymous@cvs.savannah.gnu.org/sources/a2ps rlog a2ps
-```
-
-Known Limitations
------------------
-CVS repositories which see active commits should be converted with care.
-It is possible to end up with a partial conversion of the latest commit
-if repository data is fetched via rsync while a commit is in progress.
-The pserver protocol is the safer option in such cases.
-
-Only history of main CVS branch is converted.
-CVS vendor branch imports and merges which modify the main branch are
-modeled as two distinct commits to the main branch.
-Other branches will not be represented in the conversion result at all.
-
-CVS labels are not converted into corresponding SWH tags/releases yet.
-
-The converter does not yet support incremental fetching of CVS history.
-The entire history will be fetched and processed during every visit.
-By design, CVS does not fully support a concept of changesets that span multiple
-files and, as such, importing an evolving CVS history incrementally is a not a
-trivial problem. Regardless, some improvements could be made relatively easily,
-as noted below.
-
-CVS repositories copied with rsync could be cached locally, such that
-rsync will only download RCS files which have changed since the last visit.
-At present the local copy of the repository is fetched to a temporary directory
-and is deleted once the conversion process is done.
-
-It might help to store persistent meta-data about blobs imported from CVS.
-If such meta-data could be searched via a given CVS repository name, a path,
-and an RCS revision number then redundant downloads of file versions over
-the pserver protocol could be detected and skipped.
-
-The minimal CVS client does not yet support the optional gzip extension
-offered by the CVS pserver protocol. If this was supported then files
-downloaded from a CVS server could be compressed while in transit.
-
-The built-in minimal CVS client has not been tested against many versions of CVS.
-It should work fine against CVS 1.11 and 1.12 servers. More work may be needed
-to improve compatibility with older versions of CVS.
-
-Acknowledgements
-----------------
-This software contains code derived from *cvs2gitdump* written by YASUOKA Masahiko,
-and from the *rcsparse* library written by Simon Schubert.
-
-This software contains code derived from ViewVC: https://www.viewvc.org/
-
-Licensing information
----------------------
-Parts of the software written by SWH developers are licensed under GPLv3.
-See the file LICENSE
-
-cvs2gitdump by YASUOKA Masahiko is licensed under ISC.
-See the top of the file swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
-
-rcsparse by Simon Schubert is licensed under AGPLv3.
-See the file swh/loader/cvs/rcsparse/COPYRIGHT
-
-ViewVC is licensed under the 2-clause BSD licence.
-See the file swh/loader/cvs/rlog.py
-
-# Running Tests
-
-Because the rcsparse library is implemented in C and accessed via Python bindings,
-the CVS loader must be compiled and installed before tests can be run and the
-*build* directory must be passed as an argument to pytest:
-
-```
-$ ./setup.py build install
-$ pytest ./build
-```
-
-# CLI run
-
-With the configuration:
-
-/tmp/loader_cvs.yml:
-```
-storage:
-  cls: remote
-  args:
-    url: http://localhost:5002/
-```
-
-Run:
-
-```
-swh loader --config-file /tmp/loader_cvs.yml \
-    run cvs
-```
diff --git a/README.rst b/README.rst
new file mode 120000
index 0000000..cffceba
--- /dev/null
+++ b/README.rst
@@ -0,0 +1 @@
+docs/README.rst
\ No newline at end of file
diff --git a/docs/README.rst b/docs/README.rst
index 1e1c9f8..56771da 100644
--- a/docs/README.rst
+++ b/docs/README.rst
@@ -1,6 +1,194 @@
 Software Heritage - CVS loader
 ==============================
 
-The Software Heritage CVS Loader is a tool and a library to walk
-`_ repositories and inject into the SWH
-dataset all contained files that weren't known before.
+The Software Heritage CVS Loader imports the history of CVS repositories
+into the SWH dataset.
+
+The main entry points are
+
+- :class:`swh.loader.cvs.loader.CvsLoader` for the main cvs loader
+  which ingests content out of a local cvs repository
+
+Features
+--------
+
+The CVS loader can access CVS repositories via rsync or via the CVS
+pserver protocol, with optional support for tunnelling pserver via SSH.
+
+The CVS loader does *not* require the cvs program to be installed.
+
+Access via rsync requires the rsync program to be installed. The CVS
+loader will then invoke rsync to obtain a temporary local copy of the
+entire CVS repository. It will then walk the local copy of the CVS
+repository and parse the history of each RCS file with a built-in RCS
+parser. This will usually be the fastest method for importing a given
+CVS repository. However, most CVS servers do not offer repository access
+via rsync, and CVS repositories which see active commits may see
+conversion problems because the CVS repository format was not designed
+for lock-less read access.
+
+Access via the plaintext CVS pserver protocol requires no external
+dependencies to be installed, and is compatible with regular CVS
+servers. This method will use read-locks on the server side and should
+therefore be safe to use with active CVS repositories. The CVS loader
+will use a built-in minimal CVS client written in Python to fetch the
+output of the cvs rlog command executed on the CVS server. This output
+will be processed to obtain repository history information. All versions
+of all files will then be fetched from the server and injected into the
+SWH archive.
+
+Access via pserver over SSH requires OpenSSH to be installed. Apart from
+using SSH as a transport layer the conversion process is the same as in
+the plaintext pserver case. The SSH client will be instructed to trust
+SSH host key fingerprints upon first use. If a CVS server changes its SSH
+fingerprint then manual intervention may be required in order for future
+visits to be successful.
+
+Regardless of access protocol, the CVS loader uses heuristics to convert
+the per-file history stored in CVS into changesets. These changesets
+correspond to snapshots in the SWH database model. A given CVS
+repository should always yield a consistent series of changesets across
+multiple visits.
+
+The following URL protocol schemes are recognized by the loader:
+
+- rsync://
+- pserver://
+- ssh://
+
+After the protocol scheme, the CVS server hostname must be specified,
+with an optional user:password field delimited from the hostname with
+the ‘@’ character:
+
+::
+
+   pserver://anonymous:password@cvs.example.com/
+
+After the hostname, the server-side CVS root path must be specified. The
+path will usually contain a CVSROOT directory on the server, though this
+directory may be hidden from clients:
+
+::
+
+   pserver://anonymous:password@cvs.example.com/var/cvs/
+
+The final component of the URL identifies the name of the CVS module
+which should be ingested into the SWH archive:
+
+::
+
+   pserver://anonymous:password@cvs.example.com/var/cvs/project1
+
+As a concrete example, this URL points to the historical CVS repository
+of the a2ps project. In this case, the cvsroot path is /sources/a2ps and
+the CVS module of the project is called a2ps:
+
+::
+
+   pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/a2ps/a2ps
+
+In order to obtain the history of this repository the CVS loader will
+perform the CVS pserver protocol exchange which is also performed by:
+
+::
+
+   cvs -d :pserver:anonymous@cvs.savannah.gnu.org/sources/a2ps rlog a2ps
+
+Known Limitations
+-----------------
+
+CVS repositories which see active commits should be converted with care.
+It is possible to end up with a partial conversion of the latest commit
+if repository data is fetched via rsync while a commit is in progress.
+The pserver protocol is the safer option in such cases.
+
+Only the history of the main CVS branch is converted. CVS vendor branch imports
+and merges which modify the main branch are modeled as two distinct
+commits to the main branch. Other branches will not be represented in
+the conversion result at all.
+
+CVS labels are not converted into corresponding SWH tags/releases yet.
+
+The converter does not yet support incremental fetching of CVS history.
+The entire history will be fetched and processed during every visit. By
+design, CVS does not fully support a concept of changesets that span
+multiple files and, as such, importing an evolving CVS history
+incrementally is not a trivial problem. Regardless, some improvements
+could be made relatively easily, as noted below.
+
+CVS repositories copied with rsync could be cached locally, such that
+rsync will only download RCS files which have changed since the last
+visit. At present the local copy of the repository is fetched to a
+temporary directory and is deleted once the conversion process is done.
+
+It might help to store persistent meta-data about blobs imported from
+CVS. If such meta-data could be searched via a given CVS repository
+name, a path, and an RCS revision number then redundant downloads of
+file versions over the pserver protocol could be detected and skipped.
+
+The minimal CVS client does not yet support the optional gzip extension
+offered by the CVS pserver protocol. If this was supported then files
+downloaded from a CVS server could be compressed while in transit.
+
+The built-in minimal CVS client has not been tested against many
+versions of CVS. It should work fine against CVS 1.11 and 1.12 servers.
+More work may be needed to improve compatibility with older versions of
+CVS.
+
+Acknowledgements
+----------------
+
+This software contains code derived from *cvs2gitdump* written by
+YASUOKA Masahiko, and from the *rcsparse* library written by Simon
+Schubert.
+
+This software contains code derived from ViewVC: https://www.viewvc.org/
+
+Licensing information
+---------------------
+
+Parts of the software written by SWH developers are licensed under
+GPLv3. See the file LICENSE
+
+cvs2gitdump by YASUOKA Masahiko is licensed under ISC. See the top of
+the file swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
+
+rcsparse by Simon Schubert is licensed under AGPLv3. See the file
+swh/loader/cvs/rcsparse/COPYRIGHT
+
+ViewVC is licensed under the 2-clause BSD licence. See the file
+swh/loader/cvs/rlog.py
+
+Running Tests
+=============
+
+Because the rcsparse library is implemented in C and accessed via Python
+bindings, the CVS loader must be compiled and installed before tests can
+be run and the *build* directory must be passed as an argument to
+pytest:
+
+::
+
+   $ ./setup.py build install
+   $ pytest ./build
+
+CLI run
+=======
+
+With the configuration:
+
+/tmp/loader_cvs.yml:
+
+::
+
+   storage:
+     cls: remote
+     args:
+       url: http://localhost:5002/
+
+Run:
+
+::
+
+   swh loader --config-file /tmp/loader_cvs.yml \
+       run cvs
diff --git a/setup.py b/setup.py
index c1dfb25..d68478b 100755
--- a/setup.py
+++ b/setup.py
@@ -1,81 +1,81 @@
 #!/usr/bin/env python3
 # Copyright (C) 2019-2021 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information
 
 from io import open
 from os import path
 
 from setuptools import Extension, find_packages, setup
 
 here = path.abspath(path.dirname(__file__))
 
 # Get the long description from the README file
-with open(path.join(here, "README.md"), encoding="utf-8") as f:
+with open(path.join(here, "README.rst"), encoding="utf-8") as f:
     long_description = f.read()
 
 
 def parse_requirements(*names):
     requirements = []
     for name in names:
         if name:
             reqf = "requirements-%s.txt" % name
         else:
             reqf = "requirements.txt"
 
         if not path.exists(reqf):
             return requirements
 
         with open(reqf) as f:
             for line in f.readlines():
                 line = line.strip()
                 if not line or line.startswith("#"):
                     continue
                 requirements.append(line)
     return requirements
 
 
 setup(
     name="swh.loader.cvs",
     description="Software Heritage CVS Loader",
     long_description=long_description,
     long_description_content_type="text/x-rst",
     python_requires=">=3.7",
     author="Software Heritage developers",
     author_email="swh-devel@inria.fr",
     url="https://forge.softwareheritage.org/diffusion/swh-loader-cvs",
     packages=find_packages(),  # packages's modules
     install_requires=parse_requirements(None, "swh"),
     tests_require=parse_requirements("test"),
     setup_requires=["setuptools-scm"],
     use_scm_version=True,
     extras_require={"testing": parse_requirements("test")},
     include_package_data=True,
     entry_points="""
         [swh.workers]
         loader.cvs=swh.loader.cvs:register
     """,
     classifiers=[
         "Programming Language :: Python :: 3",
         "Intended Audience :: Developers",
         "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
         "Operating System :: OS Independent",
         "Development Status :: 3 - Alpha",
     ],
     project_urls={
         "Bug Reports": "https://forge.softwareheritage.org/maniphest",
         "Funding": "https://www.softwareheritage.org/donate",
         "Source": "https://forge.softwareheritage.org/source/swh-loader-cvs",
         "Documentation": "https://docs.softwareheritage.org/devel/swh-loader-cvs",
     },
     ext_modules=[
         Extension(
             "swh.loader.cvs.rcsparse",
             sources=[
                 "swh/loader/cvs/rcsparse/py-rcsparse.c",
                 "swh/loader/cvs/rcsparse/rcsparse.c",
             ],
         )
     ],
 )
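
Review note on the URL layout documented in docs/README.rst above: the README describes an origin URL as a protocol scheme, an optional user:password field, the hostname, the server-side CVS root path, and the module name as the final path component. The sketch below illustrates that layout using only the Python standard library. The names `CvsOrigin` and `split_cvs_url` are hypothetical and are not part of swh.loader.cvs; the loader's own parsing may well differ.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class CvsOrigin(NamedTuple):
    """Hypothetical container for the URL parts; not part of swh.loader.cvs."""

    scheme: str
    user: Optional[str]
    password: Optional[str]
    host: str
    cvsroot: str
    module: str


def split_cvs_url(url: str) -> CvsOrigin:
    """Split a CVS origin URL following the layout described in the README."""
    parsed = urlparse(url)
    # The README lists exactly these three schemes as recognized.
    if parsed.scheme not in ("rsync", "pserver", "ssh"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("missing hostname")
    # The final path component is the module; what precedes it is the CVS root.
    cvsroot, _, module = parsed.path.rstrip("/").rpartition("/")
    if not cvsroot or not module:
        raise ValueError("expected /<cvsroot path>/<module>")
    return CvsOrigin(
        parsed.scheme,
        parsed.username,
        parsed.password,
        parsed.hostname,
        cvsroot,
        module,
    )


origin = split_cvs_url(
    "pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/a2ps/a2ps"
)
print(origin.cvsroot, origin.module)
```

For the a2ps URL this yields the cvsroot path /sources/a2ps and the module a2ps, matching the README's worked example. Note that a URL which stops at the CVS root, such as the intermediate `/var/cvs/` example, cannot be told apart from a root-plus-module URL by this sketch, since the split relies purely on the position of the last path component.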