diff --git a/PKG-INFO b/PKG-INFO index 558401b..405b34e 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,236 +1,240 @@ Metadata-Version: 2.1 Name: swh.loader.cvs -Version: 0.0.2 +Version: 0.1.0 Summary: Software Heritage CVS Loader Home-page: https://forge.softwareheritage.org/diffusion/swh-loader-cvs Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-cvs Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-loader-cvs -Description: Software Heritage - CVS loader - ============================== - - The Software Heritage CVS Loader imports the history of CVS repositories - into the SWH dataset. - - The main entry points are - - - :class:``swh.loader.cvs.loader.CvsLoader`` for the main cvs loader - which ingests content out of a local cvs repository - - Features - -------- - - The CVS loader can access CVS repositories via rsync or via the CVS - pserver protocol, with optional support for tunnelling pserver via SSH. - - The CVS loader does *not* require the cvs program to be installed. - However, the loader's test suite does require cvs to be installed. - - Access via rsync requires the rsync program to be installed. The CVS - loader will then invoke rsync to obtain a temporary local copy of the - entire CVS repository. It will then walk the local copy the CVS - repository and parse history of each RCS file with a built-in RCS - parser. This will usually be the fastest method for importing a given - CVS repository. However, most CVS servers do not offer repository access - via rsync, and CVS repositories which see active commits may see - conversion problems because the CVS repository format was not designed - for lock-less read access. 
- - Access via the plaintext CVS pserver protocol requires no external - dependencies to be installed, and is compatible with regular CVS - servers. This method will use read-locks on the server side and should - therefore be safe to use with active CVS repositories. The CVS loader - will use a built-in minimal CVS client written in Python to fetch the - output of the cvs rlog command executed on the CVS server. This output - will be processed to obtain repository history information. All versions - of all files will then be fetched from the server and injected into the - SWH archive. - - Access via pserver over SSH requires OpenSSH to be installed. Apart from - using SSH as a transport layer the conversion process is the same as in - the plaintext pserver case. The SSH client will be instructed to trust - SSH host key fingeprints upon first use. If a CVS server changes its SSH - fingerprint then manual intervention may be required in order for future - visits to be successful. - - Regardless of access protocol, the CVS loader uses heuristics to convert - the per-file history stored in CVS into changesets. These changesets - correspond to snapshots in the SWH database model. A given CVS - repository should always yield a consistent series of changesets across - multiple visits. - - The following URL protocol schemes are recognized by the loader: - - - rsync:// - - pserver:// - - ssh:// - - After the protocol scheme, the CVS server hostname must be specified, - with an optional user:password field delimited from the hostname with - the ‘@’ character: - - :: - - pserver://anonymous:password@cvs.example.com/ - - After the hostname, the server-side CVS root path must be specified. 
The - path will usually contain a CVSROOT directory on the server, though this - directory may be hidden from clients: - - :: - - pserver://anonymous:password@cvs.example.com/var/cvs/ - - The final component of the URL identifies the name of the CVS module - which should be ingested into the SWH archive: - - :: - - pserver://anonymous:password@cvs.example.com/var/cvs/project1 - - As a concrete example, this URL points to the historical CVS repository - of the a2ps project. In this case, the cvsroot path is /sources/a2ps and - the CVS module of the project is called a2ps: - - :: - - pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/a2ps/a2ps - - In order to obtain the history of this repository the CVS loader will - perform the CVS pserver protocol exchange which is also performed by: - - :: - - cvs -d :pserver:anonymous@cvs.savannah.gnu.org/sources/a2ps rlog a2ps - - Known Limitations - ----------------- - - CVS repositories which see active commits should be converted with care. - It is possible to end up with a partial conversion of the latest commit - if repository data is fetched via rsync while a commit is in progress. - The pserver protocol is the safer option in such cases. - - Only history of main CVS branch is converted. CVS vendor branch imports - and merges which modify the main branch are modeled as two distinct - commits to the main branch. Other branches will not be represented in - the conversion result at all. - - CVS labels are not converted into corresponding SWH tags/releases yet. - - The converter does not yet support incremental fetching of CVS history. - The entire history will be fetched and processed during every visit. By - design, CVS does not fully support a concept of changesets that span - multiple files and, as such, importing an evolving CVS history - incrementally is a not a trivial problem. Regardless, some improvements - could be made relatively easily, as noted below. 
- - CVS repositories copied with rsync could be cached locally, such that - rsync will only download RCS files which have changed since the last - visit. At present the local copy of the repository is fetched to a - temporary directory and is deleted once the conversion process is done. - - It might help to store persistent meta-data about blobs imported from - CVS. If such meta-data could be searched via a given CVS repository - name, a path, and an RCS revision number then redundant downloads of - file versions over the pserver protocol could be detected and skipped. - - The minimal CVS client does not yet support the optional gzip extension - offered by the CVS pserver protocol. If this was supported then files - downloaded from a CVS server could be compressed while in transit. - - The built-in minimal CVS client has not been tested against many - versions of CVS. It should work fine against CVS 1.11 and 1.12 servers. - More work may be needed to improve compatibility with older versions of - CVS. - - Acknowledgements - ---------------- - - This software contains code derived from *cvs2gitdump* written by - YASUOKA Masahiko, and from the *rcsparse* library written by Simon - Schubert. - - This software contains code derived from ViewVC: https://www.viewvc.org/ - - Licensing information - --------------------- - - Parts of the software written by SWH developers are licensed under - GPLv3. See the file LICENSE - - cvs2gitdump by YASUOKA Masahiko is licensed under ISC. See the top of - the file swh/loader/cvs/cvs2gitdump/cvs2gitdump.py - - rcsparse by Simon Schubert is licensed under AGPLv3. See the file - swh/loader/cvs/rcsparse/COPYRIGHT - - ViewVC is licensed under the 2-clause BSD licence. See the file - swh/loader/cvs/rlog.py - - Running Tests - ============= - - The loader's test suite requires cvs to be installed. 
- - Because the rcsparse library is implemented in C and accessed via Python - bindings, the CVS loader must be compiled and installed before tests can - be run and the *build* directory must be passed as an argument to - pytest: - - :: - - $ ./setup.py build install - $ pytest ./build - - The test suite uses internal protocol schemes which cannot be reached - from "Save Code Now". These are: - - - fake:// - - file:// - - The fake:// scheme corresponds to pserver:// and ssh://. The test suite - will spawn a 'cvs server' process locally and the loader will connect - to this server via a pipe and communicate using the pserver protocol. - Real ssh:// access lacks test coverage at present and would require - sshd to become part of the test setup. - - The file:// scheme corresponds to rsync:// and behaves as if the rsync - program had already created a local copy of the repository. Real rsync:// - access lacks test coverage at present and would require an rsyncd server - to become part of the test setup. - - CLI run - ======= - - With the configuration: - - /tmp/loader_cvs.yml: - - :: - - storage: - cls: remote - args: - url: http://localhost:5002/ - - Run: - - :: - - swh loader --config-file /tmp/loader_cvs.yml \ - run cvs - Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU Affero General Public License v3 Classifier: Operating System :: OS Independent Classifier: Development Status :: 3 - Alpha Requires-Python: >=3.7 Description-Content-Type: text/x-rst Provides-Extra: testing +License-File: LICENSE +License-File: AUTHORS + +Software Heritage - CVS loader +============================== + +The Software Heritage CVS Loader imports the history of CVS repositories +into the SWH dataset. 
+ +The main entry points are + +- :class:`swh.loader.cvs.loader.CvsLoader` for the main cvs loader + which ingests content out of a local cvs repository + +Features +-------- + +The CVS loader can access CVS repositories via rsync or via the CVS +pserver protocol, with optional support for tunnelling pserver via SSH. + +The CVS loader does *not* require the cvs program to be installed. +However, the loader's test suite does require cvs to be installed. + +Access via rsync requires the rsync program to be installed. The CVS +loader will then invoke rsync to obtain a temporary local copy of the +entire CVS repository. It will then walk the local copy of the CVS +repository and parse the history of each RCS file with a built-in RCS +parser. This will usually be the fastest method for importing a given +CVS repository. However, most CVS servers do not offer repository access +via rsync, and CVS repositories which see active commits may see +conversion problems because the CVS repository format was not designed +for lock-less read access. + +Access via the plaintext CVS pserver protocol requires no external +dependencies to be installed, and is compatible with regular CVS +servers. This method will use read-locks on the server side and should +therefore be safe to use with active CVS repositories. The CVS loader +will use a built-in minimal CVS client written in Python to fetch the +output of the cvs rlog command executed on the CVS server. This output +will be processed to obtain repository history information. All versions +of all files will then be fetched from the server and injected into the +SWH archive. + +Access via pserver over SSH requires OpenSSH to be installed. Apart from +using SSH as a transport layer the conversion process is the same as in +the plaintext pserver case. The SSH client will be instructed to trust +SSH host key fingerprints upon first use. 
If a CVS server changes its SSH +fingerprint then manual intervention may be required in order for future +visits to be successful. + +Regardless of access protocol, the CVS loader uses heuristics to convert +the per-file history stored in CVS into changesets. These changesets +correspond to snapshots in the SWH database model. A given CVS +repository should always yield a consistent series of changesets across +multiple visits. + +The following URL protocol schemes are recognized by the loader: + +- rsync:// +- pserver:// +- ssh:// + +After the protocol scheme, the CVS server hostname must be specified, +with an optional user:password field delimited from the hostname with +the ‘@’ character: + +:: + + pserver://anonymous:password@cvs.example.com/ + +After the hostname, the server-side CVS root path must be specified. The +path will usually contain a CVSROOT directory on the server, though this +directory may be hidden from clients: + +:: + + pserver://anonymous:password@cvs.example.com/var/cvs/ + +The final component of the URL identifies the name of the CVS module +which should be ingested into the SWH archive: + +:: + + pserver://anonymous:password@cvs.example.com/var/cvs/project1 + +As a concrete example, this URL points to the historical CVS repository +of the a2ps project. In this case, the cvsroot path is /sources/a2ps and +the CVS module of the project is called a2ps: + +:: + + pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/a2ps/a2ps + +In order to obtain the history of this repository the CVS loader will +perform the CVS pserver protocol exchange which is also performed by: + +:: + + cvs -d :pserver:anonymous@cvs.savannah.gnu.org/sources/a2ps rlog a2ps + +Known Limitations +----------------- + +CVS repositories which see active commits should be converted with care. +It is possible to end up with a partial conversion of the latest commit +if repository data is fetched via rsync while a commit is in progress. 
+The pserver protocol is the safer option in such cases. + +Only the history of the main CVS branch is converted. CVS vendor branch imports +and merges which modify the main branch are modeled as two distinct +commits to the main branch. Other branches will not be represented in +the conversion result at all. + +CVS labels are not converted into corresponding SWH tags/releases yet. + +The converter does not yet support incremental fetching of CVS history. +The entire history will be fetched and processed during every visit. By +design, CVS does not fully support a concept of changesets that span +multiple files and, as such, importing an evolving CVS history +incrementally is not a trivial problem. Regardless, some improvements +could be made relatively easily, as noted below. + +CVS repositories copied with rsync could be cached locally, such that +rsync will only download RCS files which have changed since the last +visit. At present the local copy of the repository is fetched to a +temporary directory and is deleted once the conversion process is done. + +It might help to store persistent meta-data about blobs imported from +CVS. If such meta-data could be searched via a given CVS repository +name, a path, and an RCS revision number then redundant downloads of +file versions over the pserver protocol could be detected and skipped. + +The minimal CVS client does not yet support the optional gzip extension +offered by the CVS pserver protocol. If this was supported then files +downloaded from a CVS server could be compressed while in transit. + +The built-in minimal CVS client has not been tested against many +versions of CVS. It should work fine against CVS 1.11 and 1.12 servers. +More work may be needed to improve compatibility with older versions of +CVS. + +Acknowledgements +---------------- + +This software contains code derived from *cvs2gitdump* written by +YASUOKA Masahiko, and from the *rcsparse* library written by Simon +Schubert. 
+ +This software contains code derived from ViewVC: https://www.viewvc.org/ + +Licensing information +--------------------- + +Parts of the software written by SWH developers are licensed under +GPLv3. See the file LICENSE + +cvs2gitdump by YASUOKA Masahiko is licensed under ISC. See the top of +the file swh/loader/cvs/cvs2gitdump/cvs2gitdump.py + +rcsparse by Simon Schubert is licensed under AGPLv3. See the file +swh/loader/cvs/rcsparse/COPYRIGHT + +ViewVC is licensed under the 2-clause BSD licence. See the file +swh/loader/cvs/rlog.py + +Running Tests +============= + +The loader's test suite requires cvs to be installed. + +Because the rcsparse library is implemented in C and accessed via Python +bindings, the CVS loader must be compiled and installed before tests can +be run and the *build* directory must be passed as an argument to +pytest: + +:: + + $ ./setup.py build install + $ pytest ./build + +The test suite uses internal protocol schemes which cannot be reached +from "Save Code Now". These are: + + - fake:// + - file:// + +The fake:// scheme corresponds to pserver:// and ssh://. The test suite +will spawn a 'cvs server' process locally and the loader will connect +to this server via a pipe and communicate using the pserver protocol. +Real ssh:// access lacks test coverage at present and would require +sshd to become part of the test setup. + +The file:// scheme corresponds to rsync:// and behaves as if the rsync +program had already created a local copy of the repository. Real rsync:// +access lacks test coverage at present and would require an rsyncd server +to become part of the test setup. 
+ +CLI run +======= + +With the configuration: + +/tmp/loader_cvs.yml: + +:: + + storage: + cls: remote + args: + url: http://localhost:5002/ + +Run: + +:: + + swh loader --config-file /tmp/loader_cvs.yml \ + run cvs + + diff --git a/debian/changelog b/debian/changelog index 65df49a..63ea0c7 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,18 +1,29 @@ -swh-loader-cvs (0.0.2-2~swh1~bpo10+1) buster-swh; urgency=medium +swh-loader-cvs (0.1.0-1~swh2) unstable-swh; urgency=medium - * Rebuild for buster-swh + * Fix dependency and bump new release - -- Software Heritage autobuilder (on jenkins-debian1) Wed, 15 Dec 2021 14:45:54 +0000 + -- Antoine R. Dumont (@ardumont) Fri, 07 Jan 2022 15:30:29 +0100 + +swh-loader-cvs (0.1.0-1~swh1) unstable-swh; urgency=medium + + * New upstream release 0.1.0 - (tagged by Antoine R. Dumont + (@ardumont) on 2022-01-07 15:15:00 + +0100) + * Upstream changes: - v0.1.0 - Validate input paths in the CVS + loader - swh.loader.cvs.tasks: Fix parameter uses to the ones + needed + + -- Software Heritage autobuilder (on jenkins-debian1) Fri, 07 Jan 2022 14:17:40 +0000 swh-loader-cvs (0.0.2-2~swh1) unstable-swh; urgency=medium * Add missing dh_install override to avoid stomping on the namespace __init__.py -- Nicolas Dandrimont Wed, 15 Dec 2021 15:43:29 +0100 swh-loader-cvs (0.0.2-1~swh1) unstable-swh; urgency=medium * Initial release -- Nicolas Dandrimont Wed, 15 Dec 2021 10:50:25 +0100 diff --git a/debian/control b/debian/control index 9913620..5006ca9 100644 --- a/debian/control +++ b/debian/control @@ -1,29 +1,30 @@ Source: swh-loader-cvs Maintainer: Software Heritage developers Section: python Priority: optional Build-Depends: debhelper-compat (= 13), dh-python (>= 3), python3-all, python3-pytest, + python3-pytest-mock, python3-setuptools, python3-setuptools-scm, python3-swh.core (>= 0.3), python3-swh.loader.core (>= 0.18), python3-swh.model (>= 0.4.0), python3-swh.scheduler (>= 0.0.39), python3-swh.storage (>= 0.11.3), # The above 
dependencies are automatically generated. Add extra dependencies below: cvs, python3-swh.core.db.pytestplugin, python3-all-dev, Rules-Requires-Root: no Standards-Version: 4.6.0 Homepage: https://forge.softwareheritage.org/source/swh-loader-cvs Package: python3-swh.loader.cvs Architecture: any Depends: ${misc:Depends}, ${python3:Depends}, ${shlibs:Depends}, Description: Software Heritage CVS Loader diff --git a/requirements-test.txt b/requirements-test.txt index e079f8a..584dbb5 100644 --- a/requirements-test.txt +++ b/requirements-test.txt @@ -1 +1,2 @@ pytest +swh.scheduler[testing] diff --git a/swh.loader.cvs.egg-info/PKG-INFO b/swh.loader.cvs.egg-info/PKG-INFO new file mode 100644 index 0000000..405b34e --- /dev/null +++ b/swh.loader.cvs.egg-info/PKG-INFO @@ -0,0 +1,240 @@ +Metadata-Version: 2.1 +Name: swh.loader.cvs +Version: 0.1.0 +Summary: Software Heritage CVS Loader +Home-page: https://forge.softwareheritage.org/diffusion/swh-loader-cvs +Author: Software Heritage developers +Author-email: swh-devel@inria.fr +License: UNKNOWN +Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest +Project-URL: Funding, https://www.softwareheritage.org/donate +Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-cvs +Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-loader-cvs +Platform: UNKNOWN +Classifier: Programming Language :: Python :: 3 +Classifier: Intended Audience :: Developers +Classifier: License :: OSI Approved :: GNU Affero General Public License v3 +Classifier: Operating System :: OS Independent +Classifier: Development Status :: 3 - Alpha +Requires-Python: >=3.7 +Description-Content-Type: text/x-rst +Provides-Extra: testing +License-File: LICENSE +License-File: AUTHORS + +Software Heritage - CVS loader +============================== + +The Software Heritage CVS Loader imports the history of CVS repositories +into the SWH dataset. 
+ +The main entry points are + +- :class:`swh.loader.cvs.loader.CvsLoader` for the main cvs loader + which ingests content out of a local cvs repository + +Features +-------- + +The CVS loader can access CVS repositories via rsync or via the CVS +pserver protocol, with optional support for tunnelling pserver via SSH. + +The CVS loader does *not* require the cvs program to be installed. +However, the loader's test suite does require cvs to be installed. + +Access via rsync requires the rsync program to be installed. The CVS +loader will then invoke rsync to obtain a temporary local copy of the +entire CVS repository. It will then walk the local copy of the CVS +repository and parse the history of each RCS file with a built-in RCS +parser. This will usually be the fastest method for importing a given +CVS repository. However, most CVS servers do not offer repository access +via rsync, and CVS repositories which see active commits may see +conversion problems because the CVS repository format was not designed +for lock-less read access. + +Access via the plaintext CVS pserver protocol requires no external +dependencies to be installed, and is compatible with regular CVS +servers. This method will use read-locks on the server side and should +therefore be safe to use with active CVS repositories. The CVS loader +will use a built-in minimal CVS client written in Python to fetch the +output of the cvs rlog command executed on the CVS server. This output +will be processed to obtain repository history information. All versions +of all files will then be fetched from the server and injected into the +SWH archive. + +Access via pserver over SSH requires OpenSSH to be installed. Apart from +using SSH as a transport layer the conversion process is the same as in +the plaintext pserver case. The SSH client will be instructed to trust +SSH host key fingerprints upon first use. 
If a CVS server changes its SSH +fingerprint then manual intervention may be required in order for future +visits to be successful. + +Regardless of access protocol, the CVS loader uses heuristics to convert +the per-file history stored in CVS into changesets. These changesets +correspond to snapshots in the SWH database model. A given CVS +repository should always yield a consistent series of changesets across +multiple visits. + +The following URL protocol schemes are recognized by the loader: + +- rsync:// +- pserver:// +- ssh:// + +After the protocol scheme, the CVS server hostname must be specified, +with an optional user:password field delimited from the hostname with +the ‘@’ character: + +:: + + pserver://anonymous:password@cvs.example.com/ + +After the hostname, the server-side CVS root path must be specified. The +path will usually contain a CVSROOT directory on the server, though this +directory may be hidden from clients: + +:: + + pserver://anonymous:password@cvs.example.com/var/cvs/ + +The final component of the URL identifies the name of the CVS module +which should be ingested into the SWH archive: + +:: + + pserver://anonymous:password@cvs.example.com/var/cvs/project1 + +As a concrete example, this URL points to the historical CVS repository +of the a2ps project. In this case, the cvsroot path is /sources/a2ps and +the CVS module of the project is called a2ps: + +:: + + pserver://anonymous:anonymous@cvs.savannah.gnu.org/sources/a2ps/a2ps + +In order to obtain the history of this repository the CVS loader will +perform the CVS pserver protocol exchange which is also performed by: + +:: + + cvs -d :pserver:anonymous@cvs.savannah.gnu.org/sources/a2ps rlog a2ps + +Known Limitations +----------------- + +CVS repositories which see active commits should be converted with care. +It is possible to end up with a partial conversion of the latest commit +if repository data is fetched via rsync while a commit is in progress. 
+The pserver protocol is the safer option in such cases. + +Only the history of the main CVS branch is converted. CVS vendor branch imports +and merges which modify the main branch are modeled as two distinct +commits to the main branch. Other branches will not be represented in +the conversion result at all. + +CVS labels are not converted into corresponding SWH tags/releases yet. + +The converter does not yet support incremental fetching of CVS history. +The entire history will be fetched and processed during every visit. By +design, CVS does not fully support a concept of changesets that span +multiple files and, as such, importing an evolving CVS history +incrementally is not a trivial problem. Regardless, some improvements +could be made relatively easily, as noted below. + +CVS repositories copied with rsync could be cached locally, such that +rsync will only download RCS files which have changed since the last +visit. At present the local copy of the repository is fetched to a +temporary directory and is deleted once the conversion process is done. + +It might help to store persistent meta-data about blobs imported from +CVS. If such meta-data could be searched via a given CVS repository +name, a path, and an RCS revision number then redundant downloads of +file versions over the pserver protocol could be detected and skipped. + +The minimal CVS client does not yet support the optional gzip extension +offered by the CVS pserver protocol. If this was supported then files +downloaded from a CVS server could be compressed while in transit. + +The built-in minimal CVS client has not been tested against many +versions of CVS. It should work fine against CVS 1.11 and 1.12 servers. +More work may be needed to improve compatibility with older versions of +CVS. + +Acknowledgements +---------------- + +This software contains code derived from *cvs2gitdump* written by +YASUOKA Masahiko, and from the *rcsparse* library written by Simon +Schubert. 
+ +This software contains code derived from ViewVC: https://www.viewvc.org/ + +Licensing information +--------------------- + +Parts of the software written by SWH developers are licensed under +GPLv3. See the file LICENSE + +cvs2gitdump by YASUOKA Masahiko is licensed under ISC. See the top of +the file swh/loader/cvs/cvs2gitdump/cvs2gitdump.py + +rcsparse by Simon Schubert is licensed under AGPLv3. See the file +swh/loader/cvs/rcsparse/COPYRIGHT + +ViewVC is licensed under the 2-clause BSD licence. See the file +swh/loader/cvs/rlog.py + +Running Tests +============= + +The loader's test suite requires cvs to be installed. + +Because the rcsparse library is implemented in C and accessed via Python +bindings, the CVS loader must be compiled and installed before tests can +be run and the *build* directory must be passed as an argument to +pytest: + +:: + + $ ./setup.py build install + $ pytest ./build + +The test suite uses internal protocol schemes which cannot be reached +from "Save Code Now". These are: + + - fake:// + - file:// + +The fake:// scheme corresponds to pserver:// and ssh://. The test suite +will spawn a 'cvs server' process locally and the loader will connect +to this server via a pipe and communicate using the pserver protocol. +Real ssh:// access lacks test coverage at present and would require +sshd to become part of the test setup. + +The file:// scheme corresponds to rsync:// and behaves as if the rsync +program had already created a local copy of the repository. Real rsync:// +access lacks test coverage at present and would require an rsyncd server +to become part of the test setup. 
+ +CLI run +======= + +With the configuration: + +/tmp/loader_cvs.yml: + +:: + + storage: + cls: remote + args: + url: http://localhost:5002/ + +Run: + +:: + + swh loader --config-file /tmp/loader_cvs.yml \ + run cvs + + diff --git a/swh.loader.cvs.egg-info/SOURCES.txt b/swh.loader.cvs.egg-info/SOURCES.txt new file mode 100644 index 0000000..8530062 --- /dev/null +++ b/swh.loader.cvs.egg-info/SOURCES.txt @@ -0,0 +1,77 @@ +.gitignore +.pre-commit-config.yaml +AUTHORS +CODE_OF_CONDUCT.md +CONTRIBUTORS +LICENSE +MANIFEST.in +Makefile +README.rst +conftest.py +mypy.ini +pyproject.toml +pytest.ini +requirements-swh.txt +requirements-test.txt +requirements.txt +setup.cfg +setup.py +tox.ini +docs/.gitignore +docs/Makefile +docs/README.rst +docs/conf.py +docs/index.rst +docs/_static/.placeholder +docs/_templates/.placeholder +swh/__init__.py +swh.loader.cvs.egg-info/PKG-INFO +swh.loader.cvs.egg-info/SOURCES.txt +swh.loader.cvs.egg-info/dependency_links.txt +swh.loader.cvs.egg-info/entry_points.txt +swh.loader.cvs.egg-info/requires.txt +swh.loader.cvs.egg-info/top_level.txt +swh/loader/__init__.py +swh/loader/cvs/__init__.py +swh/loader/cvs/cvsclient.py +swh/loader/cvs/loader.py +swh/loader/cvs/py.typed +swh/loader/cvs/rcsparse.pyi +swh/loader/cvs/rlog.py +swh/loader/cvs/tasks.py +swh/loader/cvs/cvs2gitdump/README.md +swh/loader/cvs/cvs2gitdump/cvs2gitdump.1 +swh/loader/cvs/cvs2gitdump/cvs2gitdump.py +swh/loader/cvs/cvs2gitdump/.github/workflows/python-app.yml +swh/loader/cvs/rcsparse/COPYRIGHT +swh/loader/cvs/rcsparse/Makefile.test +swh/loader/cvs/rcsparse/README +swh/loader/cvs/rcsparse/bar,v +swh/loader/cvs/rcsparse/extconf.rb +swh/loader/cvs/rcsparse/moo,v +swh/loader/cvs/rcsparse/py-rcsparse.c +swh/loader/cvs/rcsparse/queue.h +swh/loader/cvs/rcsparse/rb-rcsparse.c +swh/loader/cvs/rcsparse/rcsparse.c +swh/loader/cvs/rcsparse/rcsparse.h +swh/loader/cvs/rcsparse/setup.py +swh/loader/cvs/rcsparse/test.rb +swh/loader/cvs/rcsparse/tree.h +swh/loader/cvs/tests/test_loader.py 
+swh/loader/cvs/tests/test_tasks.py
+swh/loader/cvs/tests/data/dino-commitid.tgz
+swh/loader/cvs/tests/data/dino-readded-file.tgz
+swh/loader/cvs/tests/data/greek-repository.tgz
+swh/loader/cvs/tests/data/greek-repository2.tgz
+swh/loader/cvs/tests/data/greek-repository3.tgz
+swh/loader/cvs/tests/data/greek-repository4.tgz
+swh/loader/cvs/tests/data/greek-repository5.tgz
+swh/loader/cvs/tests/data/greek-repository6.tgz
+swh/loader/cvs/tests/data/greek-repository7.tgz
+swh/loader/cvs/tests/data/greek-repository8.tgz
+swh/loader/cvs/tests/data/greek-repository9.tgz
+swh/loader/cvs/tests/data/nano.rlog.tgz
+swh/loader/cvs/tests/data/rcsbase-log-kw-test-repo.tgz
+swh/loader/cvs/tests/data/runbaby.tgz
+swh/loader/cvs/tests/data/unsafe_rlog_with_unsafe_relative_path.rlog
+swh/loader/cvs/tests/data/unsafe_rlog_wrong_arborescence.rlog
\ No newline at end of file
diff --git a/swh.loader.cvs.egg-info/dependency_links.txt b/swh.loader.cvs.egg-info/dependency_links.txt
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/swh.loader.cvs.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/swh.loader.cvs.egg-info/entry_points.txt b/swh.loader.cvs.egg-info/entry_points.txt
new file mode 100644
index 0000000..3ee69ef
--- /dev/null
+++ b/swh.loader.cvs.egg-info/entry_points.txt
@@ -0,0 +1,4 @@
+
+    [swh.workers]
+    loader.cvs=swh.loader.cvs:register
+
\ No newline at end of file
diff --git a/swh.loader.cvs.egg-info/requires.txt b/swh.loader.cvs.egg-info/requires.txt
new file mode 100644
index 0000000..1150dca
--- /dev/null
+++ b/swh.loader.cvs.egg-info/requires.txt
@@ -0,0 +1,9 @@
+swh.core[http]>=0.3
+swh.storage>=0.11.3
+swh.model>=0.4.0
+swh.scheduler>=0.0.39
+swh.loader.core>=0.18
+
+[testing]
+pytest
+swh.scheduler[testing]
diff --git a/swh.loader.cvs.egg-info/top_level.txt b/swh.loader.cvs.egg-info/top_level.txt
new file mode 100644
index 0000000..0cb0f8f
--- /dev/null
+++ b/swh.loader.cvs.egg-info/top_level.txt
@@ -0,0 +1 @@
+swh
diff --git a/swh/__init__.py b/swh/__init__.py
index 8d9f151..b36383a 100644
--- a/swh/__init__.py
+++ b/swh/__init__.py
@@ -1,4 +1,3 @@
 from pkgutil import extend_path
-from typing import List
 
-__path__: List[str] = extend_path(__path__, __name__)
+__path__ = extend_path(__path__, __name__)
diff --git a/swh/loader/__init__.py b/swh/loader/__init__.py
index 8d9f151..b36383a 100644
--- a/swh/loader/__init__.py
+++ b/swh/loader/__init__.py
@@ -1,4 +1,3 @@
 from pkgutil import extend_path
-from typing import List
 
-__path__: List[str] = extend_path(__path__, __name__)
+__path__ = extend_path(__path__, __name__)
diff --git a/swh/loader/cvs/loader.py b/swh/loader/cvs/loader.py
index b752247..2a03c3d 100644
--- a/swh/loader/cvs/loader.py
+++ b/swh/loader/cvs/loader.py
@@ -1,615 +1,639 @@
 # Copyright (C) 2015-2021 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU Affero General Public License version 3, or any later version
 # See top-level LICENSE file for more information
 
 """Loader in charge of injecting either new or existing cvs repositories
 to swh-storage.
""" from datetime import datetime import os import os.path import subprocess import tempfile import time from typing import Any, BinaryIO, Dict, Iterator, List, Optional, Sequence, Tuple, cast from urllib3.util import parse_url from swh.loader.core.loader import BaseLoader from swh.loader.core.utils import clean_dangling_folders from swh.loader.cvs.cvs2gitdump.cvs2gitdump import ( CHANGESET_FUZZ_SEC, ChangeSetKey, CvsConv, FileRevision, RcsKeywords, file_path, ) from swh.loader.cvs.cvsclient import CVSClient import swh.loader.cvs.rcsparse as rcsparse from swh.loader.cvs.rlog import RlogConv from swh.loader.exception import NotFound from swh.model import from_disk, hashutil from swh.model.model import ( Content, Directory, Origin, Person, Revision, RevisionType, Sha1Git, SkippedContent, Snapshot, SnapshotBranch, TargetType, TimestampWithTimezone, ) from swh.storage.algos.snapshot import snapshot_get_latest from swh.storage.interface import StorageInterface DEFAULT_BRANCH = b"HEAD" TEMPORARY_DIR_PREFIX_PATTERN = "swh.loader.cvs." +class BadPathException(Exception): + pass + + class CvsLoader(BaseLoader): """Swh cvs loader. The repository is local. The loader deals with update on an already previously loaded repository. 
""" visit_type = "cvs" cvs_module_name: str cvsclient: CVSClient # remote CVS repository access (history is parsed from CVS rlog): rlog_file: BinaryIO swh_revision_gen: Iterator[ Tuple[List[Content], List[SkippedContent], List[Directory], Revision] ] def __init__( self, storage: StorageInterface, url: str, origin_url: Optional[str] = None, visit_date: Optional[datetime] = None, cvsroot_path: Optional[str] = None, temp_directory: str = "/tmp", max_content_size: Optional[int] = None, ): super().__init__( storage=storage, logging_class="swh.loader.cvs.CvsLoader", max_content_size=max_content_size, ) self.cvsroot_url = url # origin url as unique identifier for origin in swh archive self.origin_url = origin_url if origin_url else self.cvsroot_url self.temp_directory = temp_directory # internal state used to store swh objects self._contents: List[Content] = [] self._skipped_contents: List[SkippedContent] = [] self._directories: List[Directory] = [] self._revisions: List[Revision] = [] # internal state, current visit self._last_revision: Optional[Revision] = None self._visit_status = "full" self.visit_date = visit_date self.cvsroot_path = cvsroot_path self.custom_id_keyword = None self.excluded_keywords: List[str] = [] self.snapshot: Optional[Snapshot] = None self.last_snapshot: Optional[Snapshot] = snapshot_get_latest( self.storage, self.origin_url ) def compute_swh_revision( self, k: ChangeSetKey, logmsg: Optional[bytes] ) -> Tuple[Revision, from_disk.Directory]: """Compute swh hash data per CVS changeset. Returns: tuple (rev, swh_directory) - rev: current SWH revision computed from checked out work tree - swh_directory: dictionary of path, swh hash data with type """ # Compute SWH revision from the on-disk state swh_dir = from_disk.Directory.from_disk(path=os.fsencode(self.worktree_path)) parents: Tuple[Sha1Git, ...] 
if self._last_revision: parents = (self._last_revision.id,) else: parents = () revision = self.build_swh_revision(k, logmsg, swh_dir.hash, parents) self.log.info("SWH revision ID: %s", hashutil.hash_to_hex(revision.id)) self._last_revision = revision return (revision, swh_dir) + def file_path_is_safe(self, wtpath): + if "%s..%s" % (os.path.sep, os.path.sep) in wtpath: + # Paths with back-references should not appear + # in CVS protocol messages or CVS rlog output + return False + elif ( + os.path.commonpath([self.tempdir_path, os.path.normpath(wtpath)]) + != self.tempdir_path + ): + # The path must be a child of our temporary directory. + return False + else: + return True + def checkout_file_with_rcsparse( self, k: ChangeSetKey, f: FileRevision, rcsfile: rcsparse.rcsfile ) -> None: assert self.cvsroot_path assert self.server_style_cvsroot path = file_path(self.cvsroot_path, f.path) wtpath = os.path.join(self.tempdir_path, path) + if not self.file_path_is_safe(wtpath): + raise BadPathException("unsafe path found in RCS file: %s" % f.path) self.log.info("rev %s state %s file %s" % (f.rev, f.state, f.path)) if f.state == "dead": # remove this file from work tree try: os.remove(wtpath) except FileNotFoundError: pass else: # create, or update, this file in the work tree if not rcsfile: rcsfile = rcsparse.rcsfile(f.path) rcs = RcsKeywords() # We try our best to generate the same commit hashes over both pserver # and rsync. To avoid differences in file content due to expansion of # RCS keywords which contain absolute file paths (such as "Header"), # attempt to expand such paths in the same way as a regular CVS server # would expand them. # Whether this will avoid content differences depends on pserver and # rsync servers exposing the same server-side path to the CVS repository. # However, this is the best we can do, and only matters if an origin can # be fetched over both pserver and rsync. 
Each will still be treated as # a distinct origin, but will hopefully point at the same SWH snapshot. # In any case, an absolute path based on the origin URL looks nicer than # an absolute path based on a temporary directory used by the CVS loader. server_style_path = f.path.replace( self.cvsroot_path, self.server_style_cvsroot ) if server_style_path[0] != "/": server_style_path = "/" + server_style_path if self.custom_id_keyword is not None: rcs.add_id_keyword(self.custom_id_keyword) contents = rcs.expand_keyword( server_style_path, rcsfile, f.rev, self.excluded_keywords ) os.makedirs(os.path.dirname(wtpath), exist_ok=True) outfile = open(wtpath, mode="wb") outfile.write(contents) outfile.close() def checkout_file_with_cvsclient( self, k: ChangeSetKey, f: FileRevision, cvsclient: CVSClient ): assert self.cvsroot_path path = file_path(self.cvsroot_path, f.path) wtpath = os.path.join(self.tempdir_path, path) + if not self.file_path_is_safe(wtpath): + raise BadPathException("unsafe path found in cvs rlog output: %s" % f.path) self.log.info("rev %s state %s file %s" % (f.rev, f.state, f.path)) if f.state == "dead": # remove this file from work tree try: os.remove(wtpath) except FileNotFoundError: pass else: dirname = os.path.dirname(wtpath) os.makedirs(dirname, exist_ok=True) self.log.debug("checkout to %s\n" % wtpath) fp = cvsclient.checkout(path, f.rev, dirname, expand_keywords=True) os.rename(fp.name, wtpath) try: fp.close() except FileNotFoundError: # Well, we have just renamed the file... pass def process_cvs_changesets( self, cvs_changesets: List[ChangeSetKey], use_rcsparse: bool, ) -> Iterator[ Tuple[List[Content], List[SkippedContent], List[Directory], Revision] ]: """Process CVS revisions. At each CVS revision, check out contents and compute swh hashes. Yields: tuple (contents, skipped-contents, directories, revision) of dict as a dictionary with keys, sha1_git, sha1, etc... 
""" for k in cvs_changesets: tstr = time.strftime("%c", time.gmtime(k.max_time)) self.log.info( "changeset from %s by %s on branch %s", tstr, k.author, k.branch ) logmsg: Optional[bytes] = b"" # Check out all files of this revision and get a log message. # # The log message is obtained from the first file in the changeset. # The message will usually be the same for all affected files, and # the SWH archive will only store one version of the log message. for f in k.revs: rcsfile = None if use_rcsparse: if rcsfile is None: rcsfile = rcsparse.rcsfile(f.path) if not logmsg: logmsg = rcsfile.getlog(k.revs[0].rev) self.checkout_file_with_rcsparse(k, f, rcsfile) else: if not logmsg: logmsg = self.rlog.getlog(self.rlog_file, f.path, k.revs[0].rev) self.checkout_file_with_cvsclient(k, f, self.cvsclient) # TODO: prune empty directories? (revision, swh_dir) = self.compute_swh_revision(k, logmsg) (contents, skipped_contents, directories) = from_disk.iter_directory( swh_dir ) yield contents, skipped_contents, directories, revision def prepare_origin_visit(self) -> None: self.origin = Origin( url=self.origin_url if self.origin_url else self.cvsroot_url ) def pre_cleanup(self) -> None: """Cleanup potential dangling files from prior runs (e.g. OOM killed tasks) """ clean_dangling_folders( self.temp_directory, pattern_check=TEMPORARY_DIR_PREFIX_PATTERN, log=self.log, ) def cleanup(self) -> None: self.log.info("cleanup") def configure_custom_id_keyword(self, cvsconfig): """Parse CVSROOT/config and look for a custom keyword definition. There are two different configuration directives in use for this purpose. The first variant stems from a patch which was never accepted into upstream CVS and uses the tag directive: tag=MyName With this, the "MyName" keyword becomes an alias for the "Id" keyword. This variant is prelevant in CVS versions shipped on BSD. 
The second variant stems from upstream CVS 1.12 and looks like: LocalKeyword=MyName=SomeKeyword KeywordExpand=iMyName We only support "SomeKeyword" if it specifies "Id" or "CVSHeader", for now. The KeywordExpand directive can be used to suppress expansion of keywords by listing keywords after an initial "e" character ("exclude", as opposed to an "include" list which uses an initial "i" character). For example, this disables expansion of the Date and Name keywords: KeywordExpand=eDate,Name """ for line in cvsconfig.readlines(): line = line.strip() try: (config_key, value) = line.split("=", 1) except ValueError: continue config_key = config_key.strip() value = value.strip() if config_key == "tag": self.custom_id_keyword = value elif config_key == "LocalKeyword": try: (custom_kwname, kwname) = value.split("=", 1) except ValueError: continue if kwname.strip() in ("Id", "CVSHeader"): self.custom_id_keyword = custom_kwname.strip() elif config_key == "KeywordExpand" and value.startswith("e"): excluded_keywords = value[1:].split(",") for k in excluded_keywords: self.excluded_keywords.append(k.strip()) def fetch_cvs_repo_with_rsync(self, host: str, path: str) -> None: # URL *must* end with a trailing slash in order to get CVSROOT listed url = "rsync://%s%s/" % (host, os.path.dirname(path)) rsync = subprocess.run(["rsync", url], capture_output=True, encoding="ascii") rsync.check_returncode() have_cvsroot = False have_module = False for line in rsync.stdout.split("\n"): self.log.debug("rsync server: %s", line) if line.endswith(" CVSROOT"): have_cvsroot = True elif line.endswith(" %s" % self.cvs_module_name): have_module = True if have_module and have_cvsroot: break if not have_module: raise NotFound( "CVS module %s not found at %s" % (self.cvs_module_name, url) ) if not have_cvsroot: raise NotFound("No CVSROOT directory found at %s" % url) # Fetch the CVSROOT directory and the desired CVS module. 
        assert self.cvsroot_path
        for d in ("CVSROOT", self.cvs_module_name):
            target_dir = os.path.join(self.cvsroot_path, d)
            os.makedirs(target_dir, exist_ok=True)
            subprocess.run(
                # Append trailing path separators ("/" in the URL and
                # os.path.sep in the local target directory path) to ensure
                # that rsync will place files directly within our target
                # directory.
                ["rsync", "-a", url + d + "/", target_dir + os.path.sep]
            ).check_returncode()

    def prepare(self) -> None:
        self._last_revision = None
        self.tempdir_path = tempfile.mkdtemp(
            suffix="-%s" % os.getpid(),
            prefix=TEMPORARY_DIR_PREFIX_PATTERN,
            dir=self.temp_directory,
        )

        url = parse_url(self.origin_url)
        self.log.debug(
            "prepare; origin_url=%s scheme=%s path=%s",
            self.origin_url,
            url.scheme,
            url.path,
        )
        if not url.path:
            raise NotFound("Invalid CVS origin URL '%s'" % self.origin_url)
        self.cvs_module_name = os.path.basename(url.path)
        self.server_style_cvsroot = os.path.dirname(url.path)
        self.worktree_path = os.path.join(self.tempdir_path, self.cvs_module_name)
        if url.scheme == "file" or url.scheme == "rsync":
            # local CVS repository conversion
            if not self.cvsroot_path:
                self.cvsroot_path = tempfile.mkdtemp(
                    suffix="-%s" % os.getpid(),
                    prefix=TEMPORARY_DIR_PREFIX_PATTERN,
                    dir=self.temp_directory,
                )
            if url.scheme == "file":
                if not os.path.exists(url.path):
                    raise NotFound
            elif url.scheme == "rsync":
                self.fetch_cvs_repo_with_rsync(url.host, url.path)

            have_rcsfile = False
            have_cvsroot = False
            for root, dirs, files in os.walk(self.cvsroot_path):
                if "CVSROOT" in dirs:
                    have_cvsroot = True
                    dirs.remove("CVSROOT")
                    continue
                for f in files:
                    filepath = os.path.join(root, f)
                    if f[-2:] == ",v":
                        rcsfile = rcsparse.rcsfile(filepath)  # noqa: F841
                        self.log.debug(
                            "Looks like we have data to convert; "
                            "found a valid RCS file at %s",
                            filepath,
                        )
                        have_rcsfile = True
                        break
                if have_rcsfile:
                    break

            if not have_rcsfile:
                raise NotFound(
                    "Directory %s does not contain any valid RCS files",
                    self.cvsroot_path,
                )
            if not have_cvsroot:
                self.log.warn(
                    "The CVS repository at '%s' lacks a CVSROOT directory; "
                    "we might be ingesting an incomplete copy of the repository",
                    self.cvsroot_path,
                )

            # The file CVSROOT/config will usually contain ASCII data only.
            # We allow UTF-8 just in case. Other encodings may result in an
            # error and will require manual intervention, for now.
            cvsconfig_path = os.path.join(self.cvsroot_path, "CVSROOT", "config")
            cvsconfig = open(cvsconfig_path, mode="r", encoding="utf-8")
            self.configure_custom_id_keyword(cvsconfig)
            cvsconfig.close()

            # Unfortunately, there is no way to convert CVS history in an
            # iterative fashion because the data is not indexed by any kind
            # of changeset ID. We need to walk the history of each and every
            # RCS file in the repository during every visit, even if no new
            # changes will be added to the SWH archive afterwards.
            # "CVS’s repository is the software equivalent of a telephone book
            # sorted by telephone number."
            # https://corecursive.com/software-that-doesnt-suck-with-jim-blandy/
            #
            # An implicit assumption made here is that self.cvs_changesets will
            # fit into memory in its entirety. If it won't fit then the CVS
            # walker will need to be modified such that it spools the list of
            # changesets to disk instead.
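The changeset-grouping heuristic referenced above (per-file RCS revisions folded into changesets by author, log message, and a fuzzy time window, as in cvs2gitdump's CHANGESET_FUZZ_SEC) can be sketched roughly as follows. This is an illustrative standalone sketch, not code from this patch: `FileRev`, `group_changesets`, and the fuzz value are hypothetical stand-ins for the cvs2gitdump machinery the loader actually imports.

```python
from typing import List, NamedTuple


class FileRev(NamedTuple):
    """Hypothetical per-file revision record (stand-in for cvs2gitdump's)."""
    path: str
    rev: str
    time: int  # commit time, seconds since epoch
    author: str
    log: str


# Illustrative value only; the loader takes the real fuzz from cvs2gitdump.
CHANGESET_FUZZ_SEC = 300


def group_changesets(revs: List[FileRev], fuzz: int = CHANGESET_FUZZ_SEC):
    """Group per-file revisions into changesets: same author and log
    message, commit times within the fuzz window, and no file touched
    twice within one changeset."""
    changesets: List[List[FileRev]] = []
    for rev in sorted(revs, key=lambda r: r.time):
        for cs in changesets:
            if (
                cs[0].author == rev.author
                and cs[0].log == rev.log
                and rev.time - cs[-1].time <= fuzz
                and all(r.path != rev.path for r in cs)
            ):
                cs.append(rev)
                break
        else:
            changesets.append([rev])
    return changesets
```

Two revisions committed seconds apart by the same author with the same log message land in one changeset; a later edit starts a new one.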
cvs = CvsConv(self.cvsroot_path, RcsKeywords(), False, CHANGESET_FUZZ_SEC) self.log.info("Walking CVS module %s", self.cvs_module_name) cvs.walk(self.cvs_module_name) cvs_changesets = sorted(cvs.changesets) self.log.info( "CVS changesets found in %s: %d", self.cvs_module_name, len(cvs_changesets), ) self.swh_revision_gen = self.process_cvs_changesets( cvs_changesets, use_rcsparse=True ) elif url.scheme == "pserver" or url.scheme == "fake" or url.scheme == "ssh": # remote CVS repository conversion if not self.cvsroot_path: self.cvsroot_path = os.path.dirname(url.path) self.cvsclient = CVSClient(url) cvsroot_path = os.path.dirname(url.path) self.log.info( "Fetching CVS rlog from %s:%s/%s", url.host, cvsroot_path, self.cvs_module_name, ) self.rlog = RlogConv(cvsroot_path, CHANGESET_FUZZ_SEC) main_rlog_file = self.cvsclient.fetch_rlog() self.rlog.parse_rlog(main_rlog_file) # Find file deletion events only visible in Attic directories. main_changesets = self.rlog.changesets attic_paths = [] attic_rlog_files = [] assert self.cvsroot_path for k in main_changesets: for changed_file in k.revs: path = file_path(self.cvsroot_path, changed_file.path) if path.startswith(self.cvsroot_path): path = path[ len(os.path.commonpath([self.cvsroot_path, path])) + 1 : ] parent_path = os.path.dirname(path) if parent_path.split("/")[-1] == "Attic": continue attic_path = parent_path + "/Attic" if attic_path in attic_paths: continue attic_paths.append(attic_path) # avoid multiple visits # Try to fetch more rlog data from this Attic directory. attic_rlog_file = self.cvsclient.fetch_rlog( path=attic_path, state="dead", ) if attic_rlog_file: attic_rlog_files.append(attic_rlog_file) if len(attic_rlog_files) == 0: self.rlog_file = main_rlog_file else: # Combine all the rlog pieces we found and re-parse. 
fp = tempfile.TemporaryFile() for attic_rlog_file in attic_rlog_files: for line in attic_rlog_file.readlines(): fp.write(line) attic_rlog_file.close() main_rlog_file.seek(0) for line in main_rlog_file.readlines(): fp.write(line) main_rlog_file.close() fp.seek(0) self.rlog.parse_rlog(cast(BinaryIO, fp)) self.rlog_file = cast(BinaryIO, fp) cvs_changesets = sorted(self.rlog.changesets) self.log.info( "CVS changesets found for %s: %d", self.cvs_module_name, len(cvs_changesets), ) self.swh_revision_gen = self.process_cvs_changesets( cvs_changesets, use_rcsparse=False ) else: raise NotFound("Invalid CVS origin URL '%s'" % self.origin_url) def fetch_data(self) -> bool: """Fetch the next CVS revision.""" try: data = next(self.swh_revision_gen) except StopIteration: assert self._last_revision is not None self.snapshot = self.generate_and_load_snapshot(self._last_revision) self.log.info("SWH snapshot ID: %s", hashutil.hash_to_hex(self.snapshot.id)) self.flush() self.loaded_snapshot_id = self.snapshot.id return False except Exception: self.log.exception("Exception in fetch_data:") + self._visit_status = "failed" return False # Stopping iteration self._contents, self._skipped_contents, self._directories, rev = data self._revisions = [rev] return True def build_swh_revision( self, k: ChangeSetKey, logmsg: Optional[bytes], dir_id: bytes, parents: Sequence[bytes], ) -> Revision: """Given a CVS revision, build a swh revision. Args: k: changeset data logmsg: the changeset's log message dir_id: the tree's hash identifier parents: the revision's parents identifier Returns: The swh revision dictionary. 
""" author = Person.from_fullname(k.author.encode("UTF-8")) date = TimestampWithTimezone.from_dict(k.max_time) return Revision( type=RevisionType.CVS, date=date, committer_date=date, directory=dir_id, message=logmsg, author=author, committer=author, synthetic=True, extra_headers=[], parents=tuple(parents), ) def generate_and_load_snapshot(self, revision: Revision) -> Snapshot: """Create the snapshot either from existing revision. Args: revision (dict): Last revision seen if any (None by default) Returns: Optional[Snapshot] The newly created snapshot """ snap = Snapshot( branches={ DEFAULT_BRANCH: SnapshotBranch( target=revision.id, target_type=TargetType.REVISION ) } ) self.log.debug("snapshot: %s", snap) self.storage.snapshot_add([snap]) return snap def store_data(self) -> None: "Add our current CVS changeset to the archive." self.storage.skipped_content_add(self._skipped_contents) self.storage.content_add(self._contents) self.storage.directory_add(self._directories) self.storage.revision_add(self._revisions) self.flush() self._skipped_contents = [] self._contents = [] self._directories = [] self._revisions = [] def load_status(self) -> Dict[str, Any]: - assert self.snapshot is not None - if self.last_snapshot == self.snapshot: + if self.snapshot is None: + load_status = "failed" + elif self.last_snapshot == self.snapshot: load_status = "uneventful" else: load_status = "eventful" return { "status": load_status, } def visit_status(self) -> str: return self._visit_status diff --git a/swh/loader/cvs/tasks.py b/swh/loader/cvs/tasks.py index 63867dc..a3753ba 100644 --- a/swh/loader/cvs/tasks.py +++ b/swh/loader/cvs/tasks.py @@ -1,54 +1,40 @@ # Copyright (C) 2015-2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU Affero General Public License version 3, or any later version # See top-level LICENSE file for more information from datetime import datetime from typing import Optional from celery import 
shared_task import iso8601 from .loader import CvsLoader def convert_to_datetime(date: Optional[str]) -> Optional[datetime]: if date is None: return None try: assert isinstance(date, str) return iso8601.parse_date(date) except Exception: return None @shared_task(name=__name__ + ".LoadCvsRepository") def load_cvs( - *, - url: str, - origin_url: Optional[str] = None, - destination_path: Optional[str] = None, - swh_revision: Optional[str] = None, - visit_date: Optional[str] = None, + *, url: str, origin_url: Optional[str] = None, visit_date: Optional[str] = None, ): """Import a CVS repository Args: - url: (mandatory) CVS's repository url to ingest data from - origin_url: Optional original url override to use as origin reference in the archive. If not provided, "url" is used as origin. - - destination_path: (optional) root directory to - locally retrieve svn's data - - swh_revision: (optional) extra revision hex to - start from. See swh.loader.svn.CvsLoader.process - docstring - visit_date: Optional date to override the visit date """ loader = CvsLoader.from_configfile( - url=url, - origin_url=origin_url, - destination_path=destination_path, - swh_revision=swh_revision, - visit_date=convert_to_datetime(visit_date), + url=url, origin_url=origin_url, visit_date=convert_to_datetime(visit_date), ) return loader.load() diff --git a/swh/loader/cvs/tests/data/unsafe_rlog_with_unsafe_relative_path.rlog b/swh/loader/cvs/tests/data/unsafe_rlog_with_unsafe_relative_path.rlog new file mode 100644 index 0000000..4650fee --- /dev/null +++ b/swh/loader/cvs/tests/data/unsafe_rlog_with_unsafe_relative_path.rlog @@ -0,0 +1,103 @@ +RCS file: {cvsroot_path}/../greek-tree/alpha,v +head: 1.2 +branch: +locks: strict +access list: +symbolic names: + start: 1.1.1.1 + yoyo: 1.1.1 +keyword substitution: kv +total revisions: 3; selected revisions: 3 +description: +---------------------------- +revision 1.2 +date: 2021-04-20 15:30:37 +0200; author: stsp; state: Exp; lines: +1 -0; commitid: 
100607ED77A971503F5; +edit alpha +---------------------------- +revision 1.1 +date: 2021-04-20 15:29:48 +0200; author: stsp; state: Exp; commitid: 100607ED74996F4C8AF; +branches: 1.1.1; +Initial revision +---------------------------- +revision 1.1.1.1 +date: 2021-04-20 15:29:48 +0200; author: stsp; state: Exp; lines: +0 -0; commitid: 100607ED74996F4C8AF; +initial import +============================================================================= + +RCS file: {cvsroot_path}/greek-tree/Attic/../beta,v +head: 1.2 +branch: +locks: strict +access list: +symbolic names: + start: 1.1.1.1 + yoyo: 1.1.1 +keyword substitution: kv +total revisions: 3; selected revisions: 3 +description: +---------------------------- +revision 1.2 +date: 2021-04-20 15:30:52 +0200; author: stsp; state: dead; lines: +0 -0; commitid: 100607ED78A9726BA11; +remove beta +---------------------------- +revision 1.1 +date: 2021-04-20 15:29:48 +0200; author: stsp; state: Exp; commitid: 100607ED74996F4C8AF; +branches: 1.1.1; +Initial revision +---------------------------- +revision 1.1.1.1 +date: 2021-04-20 15:29:48 +0200; author: stsp; state: Exp; lines: +0 -0; commitid: 100607ED74996F4C8AF; +initial import +============================================================================= + +RCS file: {cvsroot_path}/../../etc/passwd +head: 1.3 +branch: +locks: strict +access list: +symbolic names: + start: 1.1.1.1 + yoyo: 1.1.1 +keyword substitution: kv +total revisions: 4; selected revisions: 4 +description: +---------------------------- +revision 1.3 +date: 2021-04-20 15:32:45 +0200; author: stsp; state: Exp; lines: +1 -1; commitid: 100607ED7F29770C997; +reviving zeta +---------------------------- +revision 1.2 +date: 2021-04-20 15:31:57 +0200; author: stsp; state: dead; lines: +0 -0; commitid: 100607ED7C89753114E; +remove epsilon/zeta +---------------------------- +revision 1.1 +date: 2021-04-20 15:29:48 +0200; author: stsp; state: Exp; commitid: 100607ED74996F4C8AF; +branches: 1.1.1; +Initial revision 
+---------------------------- +revision 1.1.1.1 +date: 2021-04-20 15:29:48 +0200; author: stsp; state: Exp; lines: +0 -0; commitid: 100607ED74996F4C8AF; +initial import +============================================================================= + +RCS file: {cvsroot_path}/greek-tree/gamma/../../../../../../etc/passwd +head: 1.1 +branch: 1.1.1 +locks: strict +access list: +symbolic names: + start: 1.1.1.1 + yoyo: 1.1.1 +keyword substitution: kv +total revisions: 2; selected revisions: 2 +description: +---------------------------- +revision 1.1 +date: 2021-04-20 15:29:48 +0200; author: stsp; state: Exp; commitid: 100607ED74996F4C8AF; +branches: 1.1.1; +Initial revision +---------------------------- +revision 1.1.1.1 +date: 2021-04-20 15:29:48 +0200; author: stsp; state: Exp; lines: +0 -0; commitid: 100607ED74996F4C8AF; +initial import +============================================================================= diff --git a/swh/loader/cvs/tests/data/unsafe_rlog_wrong_arborescence.rlog b/swh/loader/cvs/tests/data/unsafe_rlog_wrong_arborescence.rlog new file mode 100644 index 0000000..665dc31 --- /dev/null +++ b/swh/loader/cvs/tests/data/unsafe_rlog_wrong_arborescence.rlog @@ -0,0 +1,18 @@ +RCS file: /etc/passwd +head: 1.2 +branch: +locks: strict +access list: +symbolic names: +keyword substitution: kv +total revisions: 2; selected revisions: 2 +description: +---------------------------- +revision 1.2 +date: 2021-04-20 15:32:18 +0200; author: stsp; state: Exp; lines: +1 -0; commitid: 100607ED7DF9763EBB7; +edit psi +---------------------------- +revision 1.1 +date: 2021-04-20 15:31:15 +0200; author: stsp; state: Exp; commitid: 100607ED7999735979A; +add epsilon/psi +============================================================================= diff --git a/swh/loader/cvs/tests/test_loader.py b/swh/loader/cvs/tests/test_loader.py index 6d3c9d1..f14183d 100644 --- a/swh/loader/cvs/tests/test_loader.py +++ b/swh/loader/cvs/tests/test_loader.py @@ -1,1083 +1,1146 @@ # 
Copyright (C) 2016-2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU Affero General Public License version 3, or any later version # See top-level LICENSE file for more information import os +import tempfile from typing import Any, Dict -from swh.loader.cvs.loader import CvsLoader +import pytest + +from swh.loader.cvs.loader import BadPathException, CvsLoader from swh.loader.tests import ( assert_last_visit_matches, check_snapshot, get_stats, prepare_repository_from_archive, ) from swh.model.hashutil import hash_to_bytes from swh.model.model import Snapshot, SnapshotBranch, TargetType RUNBABY_SNAPSHOT = Snapshot( id=hash_to_bytes("e64667c400049f560a3856580e0d9e511ffa66c9"), branches={ b"HEAD": SnapshotBranch( target=hash_to_bytes("0f6db8ce49472d7829ddd6141f71c68c0d563f0e"), target_type=TargetType.REVISION, ) }, ) def test_loader_cvs_not_found_no_mock(swh_storage, tmp_path): """Given an unknown repository, the loader visit ends up in status not_found""" unknown_repo_url = "unknown-repository" loader = CvsLoader(swh_storage, unknown_repo_url, cvsroot_path=tmp_path) assert loader.load() == {"status": "uneventful"} assert_last_visit_matches( swh_storage, unknown_repo_url, status="not_found", type="cvs", ) def test_loader_cvs_visit(swh_storage, datadir, tmp_path): """Eventful visit should yield 1 snapshot""" archive_name = "runbaby" archive_path = os.path.join(datadir, f"{archive_name}.tgz") repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path) loader = CvsLoader( swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name) ) assert loader.load() == {"status": "eventful"} assert_last_visit_matches( loader.storage, repo_url, status="full", type="cvs", snapshot=RUNBABY_SNAPSHOT.id, ) stats = get_stats(loader.storage) assert stats == { "content": 5, "directory": 1, "origin": 1, "origin_visit": 1, "release": 0, "revision": 1, "skipped_content": 0, "snapshot": 1, 
    }

    check_snapshot(RUNBABY_SNAPSHOT, loader.storage)


def test_loader_cvs_2_visits_no_change(swh_storage, datadir, tmp_path):
    """Eventful visit followed by uneventful visit should yield the same snapshot"""
    archive_name = "runbaby"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    visit_status1 = assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=RUNBABY_SNAPSHOT.id,
    )

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "uneventful"}

    visit_status2 = assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=RUNBABY_SNAPSHOT.id,
    )

    assert visit_status1.date < visit_status2.date
    assert visit_status1.snapshot == visit_status2.snapshot

    stats = get_stats(loader.storage)
    assert stats["origin_visit"] == 1 + 1  # computed twice the same snapshot
    assert stats["snapshot"] == 1


GREEK_SNAPSHOT = Snapshot(
    id=hash_to_bytes("c76f8b58a6dfbe6fccb9a85b695f914aa5c4a95a"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("e138207ddd5e1965b5ab9a522bfc2e0ecd233b67"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_with_file_additions_and_deletions(swh_storage, datadir, tmp_path):
    """Eventful conversion of history with file additions and deletions"""
    archive_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 8,
        "directory": 13,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 7,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(GREEK_SNAPSHOT, loader.storage)


def test_loader_cvs_pserver_with_file_additions_and_deletions(
    swh_storage, datadir, tmp_path
):
    """Eventful CVS pserver conversion with file additions and deletions"""
    archive_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 8,
        "directory": 13,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 7,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(GREEK_SNAPSHOT, loader.storage)


GREEK_SNAPSHOT2 = Snapshot(
    id=hash_to_bytes("e3d2e8860286000f546c01aa2a3e1630170eb3b6"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("f1ff9a3c7624b1be5e5d51f9ec0abf7dcddbf0b2"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_2_visits_with_change(swh_storage, datadir, tmp_path):
    """Eventful visit followed by eventful visit should yield two snapshots"""
    archive_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    visit_status1 = assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 8,
        "directory": 13,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 7,
        "skipped_content": 0,
        "snapshot": 1,
    }

    archive_name2 = "greek-repository2"
    archive_path2 = os.path.join(datadir, f"{archive_name2}.tgz")
    repo_url = prepare_repository_from_archive(archive_path2, archive_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    visit_status2 = assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT2.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 10,
        "directory": 15,
        "origin": 1,
        "origin_visit": 2,
        "release": 0,
        "revision": 8,
        "skipped_content": 0,
        "snapshot": 2,
    }

    check_snapshot(GREEK_SNAPSHOT2, loader.storage)

    assert visit_status1.date < visit_status2.date
    assert visit_status1.snapshot != visit_status2.snapshot


def test_loader_cvs_visit_pserver(swh_storage, datadir, tmp_path):
    """Eventful visit to CVS pserver should yield 1 snapshot"""
    archive_name = "runbaby"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/runbaby"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=RUNBABY_SNAPSHOT.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 5,
        "directory": 1,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 1,
        "skipped_content": 0,
        "snapshot": 1,
    }
    check_snapshot(RUNBABY_SNAPSHOT, loader.storage)


GREEK_SNAPSHOT3 = Snapshot(
    id=hash_to_bytes("6e9910ed072662cb482d9017cbf5e1973e6dc09f"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("d9f4837dc55a87d83730c6e277c88b67dae80272"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_visit_pserver_no_eol(swh_storage, datadir, tmp_path):
    """Visit to CVS pserver with file that lacks trailing eol"""
    archive_name = "greek-repository3"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT3.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 9,
        "directory": 15,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 8,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(GREEK_SNAPSHOT3, loader.storage)


GREEK_SNAPSHOT4 = Snapshot(
    id=hash_to_bytes("a8593e9233601b31e012d36975f817d2c993d04b"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("51bb99655225c810ee259087fcae505899725360"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_visit_expand_id_keyword(swh_storage, datadir, tmp_path):
    """Visit to CVS repository with file with an RCS Id keyword"""
    archive_name = "greek-repository4"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT4.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 12,
        "directory": 20,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 11,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(GREEK_SNAPSHOT4, loader.storage)


def test_loader_cvs_visit_pserver_expand_id_keyword(swh_storage, datadir, tmp_path):
    """Visit to CVS pserver with file with an RCS Id keyword"""
    archive_name = "greek-repository4"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT4.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 12,
        "directory": 20,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 11,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(GREEK_SNAPSHOT4, loader.storage)


GREEK_SNAPSHOT5 = Snapshot(
    id=hash_to_bytes("6484ec9bfff677731cbb6d2bd5058dabfae952ed"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("514b3bef07d56e393588ceda18cc1dfa2dc4e04a"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_with_file_deleted_and_readded(swh_storage, datadir, tmp_path):
    """Eventful conversion of history with file deletion and re-addition"""
    archive_name = "greek-repository5"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT5.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 9,
        "directory": 14,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 8,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(GREEK_SNAPSHOT5, loader.storage)


def test_loader_cvs_pserver_with_file_deleted_and_readded(
    swh_storage, datadir, tmp_path
):
    """Eventful pserver conversion with file deletion and re-addition"""
    archive_name = "greek-repository5"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT5.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 9,
        "directory": 14,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 8,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(GREEK_SNAPSHOT5, loader.storage)


DINO_SNAPSHOT = Snapshot(
    id=hash_to_bytes("6cf774cec1030ff3e9a301681303adb537855d09"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("b7d3ea1fa878d51323b5200ad2c6ee9d5b656f10"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_readded_file_in_attic(swh_storage, datadir, tmp_path):
    """Conversion of history with RCS files in the Attic"""
    # This repository has some file revisions marked "dead" in the Attic only.
    # This is different to the re-added file tests above, where the RCS file
    # was moved out of the Attic again as soon as the corresponding deleted
    # file was re-added. Failure to detect the "dead" file revisions in the
    # Attic would result in errors in our converted history.
    archive_name = "dino-readded-file"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/src"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=DINO_SNAPSHOT.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 38,
        "directory": 70,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 35,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(DINO_SNAPSHOT, loader.storage)


def test_loader_cvs_pserver_readded_file_in_attic(swh_storage, datadir, tmp_path):
    """Conversion over pserver with RCS files in the Attic"""
    # This repository has some file revisions marked "dead" in the Attic only.
    # This is different to the re-added file tests above, where the RCS file
    # was moved out of the Attic again as soon as the corresponding deleted
    # file was re-added. Failure to detect the "dead" file revisions in the
    # Attic would result in errors in our converted history.
    # This has special implications for the pserver case, because the "dead"
    # revisions will not appear in the output of 'cvs rlog' by default.
    archive_name = "dino-readded-file"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/src"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=DINO_SNAPSHOT.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 38,
        "directory": 70,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 35,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(DINO_SNAPSHOT, loader.storage)


DINO_SNAPSHOT2 = Snapshot(
    id=hash_to_bytes("afdeca6b8ec8f58367b4e014e2210233f1c5bf3d"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("84e428103d42b84713c77afb9420d667062f8676"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_split_commits_by_commitid(swh_storage, datadir, tmp_path):
    """Conversion of RCS history which needs to be split by commit ID"""
    # This repository has some file revisions which use the same log message
    # and can only be told apart by commit IDs. Without commit IDs, these commits
    # would get merged into a single commit in our conversion result.
    archive_name = "dino-commitid"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/dino"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=DINO_SNAPSHOT2.id,
    )

    check_snapshot(DINO_SNAPSHOT2, loader.storage)

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 18,
        "directory": 18,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 18,
        "skipped_content": 0,
        "snapshot": 1,
    }


def test_loader_cvs_pserver_split_commits_by_commitid(swh_storage, datadir, tmp_path):
    """Conversion via pserver which needs to be split by commit ID"""
    # This repository has some file revisions which use the same log message
    # and can only be told apart by commit IDs. Without commit IDs, these commits
    # would get merged into a single commit in our conversion result.
    archive_name = "dino-commitid"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/dino"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=DINO_SNAPSHOT2.id,
    )

    check_snapshot(DINO_SNAPSHOT2, loader.storage)

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 18,
        "directory": 18,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 18,
        "skipped_content": 0,
        "snapshot": 1,
    }


GREEK_SNAPSHOT6 = Snapshot(
    id=hash_to_bytes("859ae7ca5b31fee594c98abecdd41eff17cae079"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("fa48fb4551898cd8d3305cace971b3b95639e83e"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_empty_lines_in_log_message(swh_storage, datadir, tmp_path):
    """Conversion of RCS history with empty lines in a log message"""
    archive_name = "greek-repository6"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT6.id,
    )

    check_snapshot(GREEK_SNAPSHOT6, loader.storage)

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 9,
        "directory": 14,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 8,
        "skipped_content": 0,
        "snapshot": 1,
    }


def test_loader_cvs_pserver_empty_lines_in_log_message(swh_storage, datadir, tmp_path):
    """Conversion via pserver with empty lines in a log message"""
    archive_name = "greek-repository6"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT6.id,
    )

    check_snapshot(GREEK_SNAPSHOT6, loader.storage)

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 9,
        "directory": 14,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 8,
        "skipped_content": 0,
        "snapshot": 1,
    }


def get_head_revision_paths_info(loader: CvsLoader) -> Dict[bytes, Dict[str, Any]]:
    assert loader.snapshot is not None
    root_dir = loader.snapshot.branches[b"HEAD"].target
    revision = loader.storage.revision_get([root_dir])[0]
    assert revision is not None

    paths = {}
    for entry in loader.storage.directory_ls(revision.directory, recursive=True):
        paths[entry["name"]] = entry
    return paths


def test_loader_cvs_with_header_keyword(swh_storage, datadir, tmp_path):
    """Eventful conversion of history with Header keyword in a file"""
    archive_name = "greek-repository7"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    repo_url = f"fake://{repo_url[7:]}"
    loader2 = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader2.load() == {"status": "eventful"}

    # We cannot verify the snapshot ID. It is unpredictable due to use of the $Header$
    # RCS keyword which contains the temporary directory where the repository is stored.

    expected_stats = {
        "content": 9,
        "directory": 14,
        "origin": 2,
        "origin_visit": 2,
        "release": 0,
        "revision": 8,
        "skipped_content": 0,
        "snapshot": 1,
    }
    stats = get_stats(loader.storage)
    assert stats == expected_stats
    stats = get_stats(loader2.storage)
    assert stats == expected_stats

    # Ensure that file 'alpha', which contains a $Header$ keyword,
    # was imported with equal content via file:// and fake:// URLs.
    paths = get_head_revision_paths_info(loader)
    paths2 = get_head_revision_paths_info(loader2)

    alpha = paths[b"alpha"]
    alpha2 = paths2[b"alpha"]
    assert alpha["sha1"] == alpha2["sha1"]


GREEK_SNAPSHOT8 = Snapshot(
    id=hash_to_bytes("5278a1f73ed0f804c68f72614a5f78ca5074ab9c"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("b389258fec8151d719e79da80b5e5355a48ec8bc"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_expand_log_keyword(swh_storage, datadir, tmp_path):
    """Conversion of RCS history with Log keyword in files"""
    archive_name = "greek-repository8"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT8.id,
    )

    check_snapshot(GREEK_SNAPSHOT8, loader.storage)

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 14,
        "directory": 20,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 11,
        "skipped_content": 0,
        "snapshot": 1,
    }


def test_loader_cvs_pserver_expand_log_keyword(swh_storage, datadir, tmp_path):
    """Conversion of RCS history with Log keyword in files"""
    archive_name = "greek-repository8"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT8.id,
    )

    check_snapshot(GREEK_SNAPSHOT8, loader.storage)

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 14,
        "directory": 20,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 11,
        "skipped_content": 0,
        "snapshot": 1,
    }


GREEK_SNAPSHOT9 = Snapshot(
    id=hash_to_bytes("3d08834666df7a589abea07ac409771ebe7e8fe4"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("9971cbb3b540dfe75f3bcce5021cb73d63b47df3"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_visit_expand_custom_keyword(swh_storage, datadir, tmp_path):
    """Visit to CVS repository with file with a custom RCS keyword"""
    archive_name = "greek-repository9"
    extracted_name = "greek-repository"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, extracted_name, tmp_path)
    repo_url += "/greek-tree"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, extracted_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=GREEK_SNAPSHOT9.id,
    )

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 9,
        "directory": 14,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 8,
        "skipped_content": 0,
        "snapshot": 1,
    }

    check_snapshot(GREEK_SNAPSHOT9, loader.storage)
RCSBASE_SNAPSHOT = Snapshot(
    id=hash_to_bytes("2c75041ba8868df04349c1c8f4c29f992967b8aa"),
    branches={
        b"HEAD": SnapshotBranch(
            target=hash_to_bytes("46f076387ff170dc3d4da5e43d953c1fc744c821"),
            target_type=TargetType.REVISION,
        )
    },
)


def test_loader_cvs_expand_log_keyword2(swh_storage, datadir, tmp_path):
    """Another conversion of RCS history with Log keyword in files"""
    archive_name = "rcsbase-log-kw-test-repo"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/src"  # CVS module name

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=RCSBASE_SNAPSHOT.id,
    )

    check_snapshot(RCSBASE_SNAPSHOT, loader.storage)

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 2,
        "directory": 3,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 3,
        "skipped_content": 0,
        "snapshot": 1,
    }


def test_loader_cvs_pserver_expand_log_keyword2(swh_storage, datadir, tmp_path):
    """Another conversion of RCS history with Log keyword in files"""
    archive_name = "rcsbase-log-kw-test-repo"
    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
    repo_url += "/src"  # CVS module name

    # Ask our cvsclient to connect via the 'cvs server' command
    repo_url = f"fake://{repo_url[7:]}"

    loader = CvsLoader(
        swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name)
    )

    assert loader.load() == {"status": "eventful"}

    assert_last_visit_matches(
        loader.storage,
        repo_url,
        status="full",
        type="cvs",
        snapshot=RCSBASE_SNAPSHOT.id,
    )

    check_snapshot(RCSBASE_SNAPSHOT, loader.storage)

    stats = get_stats(loader.storage)
    assert stats == {
        "content": 2,
        "directory": 3,
        "origin": 1,
        "origin_visit": 1,
        "release": 0,
        "revision": 3,
        "skipped_content": 0,
        "snapshot": 1,
    }
+
+
+@pytest.mark.parametrize(
+    "rlog_unsafe_path",
+    [
+        # paths that walk to parent directory:
+        "unsafe_rlog_with_unsafe_relative_path.rlog",
+        # absolute path outside the CVS server's root directory:
+        "unsafe_rlog_wrong_arborescence.rlog",
+    ],
+)
+def test_loader_cvs_weird_paths_in_rlog(
+    swh_storage, datadir, tmp_path, mocker, rlog_unsafe_path
+):
+    """Handle cvs rlog output which contains unsafe paths"""
+    archive_name = "greek-repository"
+    archive_path = os.path.join(datadir, f"{archive_name}.tgz")
+    repo_url = prepare_repository_from_archive(archive_path, archive_name, tmp_path)
+    repo_url += "/greek-tree"  # CVS module name
+
+    # Ask our cvsclient to connect via the 'cvs server' command
+    repo_url = f"fake://{repo_url[7:]}"
+
+    # And let's pretend the server returned this rlog output instead of
+    # what it would actually return.
+    rlog_file = tempfile.NamedTemporaryFile(
+        dir=tmp_path, mode="w+", delete=False, prefix="weird-path-rlog-"
+    )
+    rlog_file_path = rlog_file.name
+
+    rlog_weird_paths = open(os.path.join(datadir, rlog_unsafe_path))
+    for line in rlog_weird_paths.readlines():
+        rlog_file.write(line.replace("{cvsroot_path}", os.path.dirname(repo_url[7:])))
+    rlog_file.close()
+    rlog_file_override = open(rlog_file_path, "rb")  # re-open as bytes instead of str
+    mock_read = mocker.patch("swh.loader.cvs.cvsclient.CVSClient.fetch_rlog")
+    mock_read.return_value = rlog_file_override
+
+    def side_effect(self, path="", state=""):
+        return None
+
+    mock_read.side_effect = side_effect(side_effect)
+
+    try:
+        loader = CvsLoader(
+            swh_storage, repo_url, cvsroot_path=os.path.join(tmp_path, archive_name),
+        )
+    except BadPathException:
+        pass
+
+    assert loader.load() == {"status": "failed"}
+
+    assert_last_visit_matches(
+        swh_storage, repo_url, status="failed", type="cvs",
+    )
+
+    assert mock_read.called
+
+    rlog_file_override.close()
+    os.unlink(rlog_file_path)
diff --git a/swh/loader/cvs/tests/test_tasks.py b/swh/loader/cvs/tests/test_tasks.py
index 332c43e..126fd3c 100644
--- a/swh/loader/cvs/tests/test_tasks.py
+++ b/swh/loader/cvs/tests/test_tasks.py
@@ -1,25 +1,43 @@
# Copyright (C) 2019-2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

from datetime import datetime, timezone

import pytest

from swh.loader.cvs.tasks import convert_to_datetime


@pytest.mark.parametrize(
    "date,expected_result",
    [
        (None, None),
        (
            "2021-11-23 09:41:02.434195+00:00",
            datetime(2021, 11, 23, 9, 41, 2, 434195, tzinfo=timezone.utc),
        ),
        ("23112021", None,),  # failure to parse
    ],
)
def test_convert_to_datetime(date, expected_result):
    assert convert_to_datetime(date) == expected_result
+
+
+def test_cvs_loader(
+    mocker, swh_scheduler_celery_app, swh_scheduler_celery_worker, swh_config
+):
+    mock_loader = mocker.patch("swh.loader.cvs.loader.CvsLoader.load")
+    mock_loader.return_value = {"status": "eventful"}
+
+    res = swh_scheduler_celery_app.send_task(
+        "swh.loader.cvs.tasks.LoadCvsRepository",
+        kwargs=dict(url="some-technical-url", origin_url="origin-url"),
+    )
+    assert res
+    res.wait()
+    assert res.successful()
+
+    assert res.result == {"status": "eventful"}
+    assert mock_loader.called
diff --git a/tox.ini b/tox.ini
index 010ee68..377b686 100644
--- a/tox.ini
+++ b/tox.ini
@@ -1,73 +1,77 @@
[tox]
envlist=black,flake8,mypy,py3

[testenv]
extras =
  testing
deps =
  pytest-cov
+  # the dependency below is needed for now as a workaround for
+  # https://github.com/pypa/pip/issues/6239
+  # TODO: remove when this issue is fixed
+  swh.scheduler[testing]
commands =
  pytest --doctest-modules \
         {envsitepackagesdir}/swh/loader/cvs \
         --cov={envsitepackagesdir}/swh/loader/cvs \
         --cov-branch {posargs}

[testenv:black]
skip_install = true
deps =
  black==19.10b0
commands =
  {envpython} -m black --check swh

[testenv:flake8]
skip_install = true
deps =
  flake8
commands =
  {envpython} -m flake8

[testenv:mypy]
extras = testing
deps =
-  mypy
+  mypy==0.920
commands =
  mypy swh

# build documentation outside swh-environment using the current
# git HEAD of swh-docs, is executed on CI for each diff to prevent
# breaking doc build
[testenv:sphinx]
whitelist_externals = make
usedevelop = true
extras = testing
deps =
  # fetch and install swh-docs in develop mode
  -e git+https://forge.softwareheritage.org/source/swh-docs#egg=swh.docs
setenv =
  SWH_PACKAGE_DOC_TOX_BUILD = 1
  # turn warnings into errors
  SPHINXOPTS = -W
commands =
  make -I ../.tox/sphinx/src/swh-docs/swh/ -C docs

# build documentation only inside swh-environment using local state
# of swh-docs package
[testenv:sphinx-dev]
whitelist_externals = make
usedevelop = true
extras = testing
deps =
  # install swh-docs in develop mode
  -e ../swh-docs
setenv =
  SWH_PACKAGE_DOC_TOX_BUILD = 1
  # turn warnings into errors
  SPHINXOPTS = -W
commands =
  make -I ../.tox/sphinx-dev/src/swh-docs/swh/ -C docs