Page MenuHomeSoftware Heritage

handle Attic-only RCS files over CVS pserver
ClosedPublic

Authored by stsp on Oct 31 2021, 11:31 PM.

Details

Summary

CVS repositories may contain RCS history in file,v as well as
a corresponding Attic/file,v where each file contains separate
events that occurred in history. The Attic version of the file
results from file deletion events.

The rsync access method already uses history found in the Attic.
However, a CVS server will only return RCS files from the Attic
if we request them explicitly. If we do not request them then our
converted history may end up missing deletion events for some files.
Unfortunately, we cannot tell which RCS files have a corresponding
file in the Attic, so we need to search all Attic directories by
running the equivalent of 'cvs rlog' in each directory. This slows
down pserver access considerably (and it was already quite slow
compared to rsync). But we need to pay this price in order to
obtain a valid conversion result.

This patch contains related fixes to cvsroot path handling, which
was broken for the pserver case. Without these fixes we cannot
create the correct paths for Attic directories to search.

Problem found while comparing conversion results of rsync and
pserver access methods for the GNU dino CVS repository at
cvs.savannah.gnu.org/sources/dino
Add two new test cases based on RCS files from this repository.

Without this fix in place history would diverge at this commit:

8891a63 | larsl | Removed the MIDIEvent class | 04 May 2006, 01:11 UTC

Because the files midievent.cpp and midievent.hpp would not get deleted
when converting this commit via the pserver protocol.

Diff Detail

Repository
rDLDCVS CVS Loader
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24955
Build 38997: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 38996: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6593 (id=23955)

Could not rebase; Attempt merge onto beb7fc8a02...

Updating beb7fc8..6535c68
Fast-forward
 swh/loader/cvs/cvsclient.py                     |  54 +++-
 swh/loader/cvs/loader.py                        | 106 ++++++--
 swh/loader/cvs/rlog.py                          |   7 +-
 swh/loader/cvs/tests/data/dino-readded-file.tgz | Bin 0 -> 18140 bytes
 swh/loader/cvs/tests/data/greek-repository4.tgz | Bin 0 -> 12393 bytes
 swh/loader/cvs/tests/data/greek-repository5.tgz | Bin 0 -> 12380 bytes
 swh/loader/cvs/tests/test_loader.py             | 311 +++++++++++++++++++++++-
 7 files changed, 442 insertions(+), 36 deletions(-)
 create mode 100644 swh/loader/cvs/tests/data/dino-readded-file.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository4.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository5.tgz
Changes applied before test
commit 6535c680d860e807255609da27b189fa9b240c0b
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 16:55:51 2021 +0200

    handle Attic-only RCS files over CVS pserver
    
    CVS repositories may contain RCS history in file,v as well as
    a corresponding Attic/file,v where each file contains separate
    events that occurred in history. The Attic version of the file
    results from file deletion events.
    
    The rsync access method already uses history found in the Attic.
    However, a CVS server will only return RCS files from the Attic
    if we request them explicitly. If we do not request them then our
    converted history may end up missing deletion events for some files.
    Unfortunately, we cannot tell which RCS files have a corresponding
    file in the Attic, so we need to search all Attic directories by
    running the equivalent of 'cvs rlog' in each directory. This slows
    down pserver access considerably (and it was already quite slow
    compared to rsync). But we need to pay this price in order to
    obtain a valid conversion result.
    
    This patch contains related fixes to cvsroot path handling, which
    was broken for the pserver case. Without these fixes we cannot
    create the correct paths for Attic directories to search.
    
    Problem found while comparing conversion results of rsync and
    pserver access methods for the GNU dino CVS repository at
    cvs.savannah.gnu.org/sources/dino
    Add two new test cases based on RCS files from this repository.
    
    Without this fix in place history would diverge at this commit:
      8891a63 | larsl | Removed the MIDIEvent class | 04 May 2006, 01:11 UTC
    Because the files midievent.cpp and midievent.hpp would not get deleted
    when converting this commit via the pserver protocol.

commit 0eb3b30c24d60b6e0b72b983d4e90acdde33ba77
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:56:58 2021 +0200

    improve test coverage of file additions and deletions
    
    Make an existing test case run over pserver as well.
    This access method uses a different way of detecting file
    additions and deletions and should be tested separately.
    
    Add new tests to cover the re-addition of a file after it
    was deleted.

commit 32ff92ed3e13f3fe63fff53d79e1d07881d64291
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:55:23 2021 +0200

    display file state in progress logging output

commit acc2601ae97ca729e3987de7cbd3548120e5d5a1
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 11:31:27 2021 +0200

    add support for RCS keyword expansion over pserver protocol
    
    We can simply ask the CVS server to expand keywords for us, instead
    of forcing binary file mode with the -kb option. The CVS repository
    contains per-file keyword expansion defaults the server will use.
    Files checked out by cvsclient.py should now match what a regular
    CVS client would check out by default.
    
    Add test cases which verify that we create the same snapshot ID
    for a repository which uses the Id keyword in a file, regardless
    of whether this repository is accessed via rsync or pserver.

commit 2563ac0917db60491fd3d3218a47d16c97083d13
Author: Stefan Sperling <stsp@stsp.name>
Date:   Thu Oct 28 19:04:05 2021 +0200

    in cvs loader tests, use f-strings to build repository URLs
    
    Suggested by ardumont in D6566

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/46/ for more details.

stsp requested review of this revision.Oct 31 2021, 11:33 PM
vlorentz added inline comments.
swh/loader/cvs/cvsclient.py
292–298

equivalent, but more pythonic

312

double space? how come?

swh/loader/cvs/loader.py
106–111

why move this out of the constructor?

273

I don't think telling mypy to ignore the type error is the right solution (see my other comment)

387–388

can you use longer variable names?

389–390

hard to tell, there are many implicitly typed variables here.

You can use reveal_type(xxx) to make mypy print what type it inferred for xxx (no need to import it, it's only used while type-checking)

swh/loader/cvs/loader.py
389–390

try rebasing on top of D6598, I added a bunch of type annotations that should help mypy

swh/loader/cvs/loader.py
389–390

Thanks, I have tried this by running 'arc patch D6598' and creating a temporary branch based on the arcpatch-D6598 branch. Then I used git cherry-pick to apply my oustanding patches in sequence.

The result is that mypy still complains as follows:

swh/loader/cvs/loader.py:143: error: Argument 1 to "file_path" has incompatible type "Optional[str]"; expected "str"
swh/loader/cvs/loader.py:166: error: Argument 1 to "file_path" has incompatible type "Optional[str]"; expected "str"
swh/loader/cvs/loader.py:222: error: Argument 1 to "getlog" of "RlogConv" has incompatible type "IO[bytes]"; expected "BinaryIO"
swh/loader/cvs/loader.py:391: error: Incompatible types in assignment (expression has type "FileRevision", variable has type "str")
swh/loader/cvs/loader.py:395: error: Argument 1 to "startswith" of "str" has incompatible type "Optional[str]"; expected "Union[str, Tuple[str, ...]]"
swh/loader/cvs/loader.py:440: error: Argument 1 to "parse_rlog" of "RlogConv" has incompatible type "IO[bytes]"; expected "BinaryIO"
Found 6 errors in 1 file (checked 9 source files)

I know how to fix the one on swh/loader/cvs/loader.py:222 (declare rlog_file member as BinaryIO) but the rest elude me... it seems like a tempfile.TemporaryFile() object cannot be used with anything that requires BinaryIO, is it?

swh/loader/cvs/cvsclient.py
312

This is simply what the cvs server sends back.

swh/loader/cvs/loader.py
273

Agreed. Thanks for taking the time to help me figure out what needs to be done to get it work right.

swh/loader/cvs/loader.py
389–390

It can, depending on the value of the parameter (t vs b).

Unfortunately, mypy does not support dependent types, so it can't know that. Use https://docs.python.org/3/library/typing.html#typing.cast

swh/loader/cvs/loader.py
389–390

Fine, I could fix all these warnings with explicit casts.

fix mypy errors with much help from vlorentz

Build is green

Patch application report for D6593 (id=24024)

Could not rebase; Attempt merge onto 7e2d8d8996...

Auto-merging swh/loader/cvs/loader.py
Merge made by the 'recursive' strategy.
 swh/loader/cvs/cvsclient.py                     |  52 +++-
 swh/loader/cvs/loader.py                        |  96 ++++++--
 swh/loader/cvs/rlog.py                          |  15 +-
 swh/loader/cvs/tests/data/dino-readded-file.tgz | Bin 0 -> 18140 bytes
 swh/loader/cvs/tests/data/greek-repository4.tgz | Bin 0 -> 12393 bytes
 swh/loader/cvs/tests/data/greek-repository5.tgz | Bin 0 -> 12380 bytes
 swh/loader/cvs/tests/test_loader.py             | 307 +++++++++++++++++++++++-
 7 files changed, 432 insertions(+), 38 deletions(-)
 create mode 100644 swh/loader/cvs/tests/data/dino-readded-file.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository4.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository5.tgz
Changes applied before test
commit 39df8f42d49a7c807a690ab034123ed21e491592
Merge: 7e2d8d8 d543b0e
Author: Jenkins user <jenkins@localhost>
Date:   Fri Nov 5 12:50:51 2021 +0000

    Merge branch 'diff-target' into HEAD

commit d543b0eb0976c56544c2d3dccc821f316e1572ff
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 16:55:51 2021 +0200

    handle Attic-only RCS files over CVS pserver
    
    CVS repositories may contain RCS history in file,v as well as
    a corresponding Attic/file,v where each file contains separate
    events that occurred in history. The Attic version of the file
    results from file deletion events.
    
    The rsync access method already uses history found in the Attic.
    However, a CVS server will only return RCS files from the Attic
    if we request them explicitly. If we do not request them then our
    converted history may end up missing deletion events for some files.
    Unfortunately, we cannot tell which RCS files have a corresponding
    file in the Attic, so we need to search all Attic directories by
    running the equivalent of 'cvs rlog' in each directory. This slows
    down pserver access considerably (and it was already quite slow
    compared to rsync). But we need to pay this price in order to
    obtain a valid conversion result.
    
    This patch contains related fixes to cvsroot path handling, which
    was broken for the pserver case. Without these fixes we cannot
    create the correct paths for Attic directories to search.
    
    Problem found while comparing conversion results of rsync and
    pserver access methods for the GNU dino CVS repository at
    cvs.savannah.gnu.org/sources/dino
    Add two new test cases based on RCS files from this repository.
    
    Without this fix in place history would diverge at this commit:
      8891a63 | larsl | Removed the MIDIEvent class | 04 May 2006, 01:11 UTC
    Because the files midievent.cpp and midievent.hpp would not get deleted
    when converting this commit via the pserver protocol.

commit 79867ac6d86723449528f7baf7fea7e4cbc5ef63
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:56:58 2021 +0200

    improve test coverage of file additions and deletions
    
    Make an existing test case run over pserver as well.
    This access method uses a different way of detecting file
    additions and deletions and should be tested separately.
    
    Add new tests to cover the re-addition of a file after it
    was deleted.

commit db33b8006c282e2a030996d8de6099f8f4130825
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:55:23 2021 +0200

    display file state in progress logging output

commit 2ae33d6c66f93850d23fdbb2e410b3ae27942242
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 11:31:27 2021 +0200

    add support for RCS keyword expansion over pserver protocol
    
    We can simply ask the CVS server to expand keywords for us, instead
    of forcing binary file mode with the -kb option. The CVS repository
    contains per-file keyword expansion defaults the server will use.
    Files checked out by cvsclient.py should now match what a regular
    CVS client would check out by default.
    
    Add test cases which verify that we create the same snapshot ID
    for a repository which uses the Id keyword in a file, regardless
    of whether this repository is accessed via rsync or pserver.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/53/ for more details.

stsp marked an inline comment as done.Nov 5 2021, 1:54 PM
  • apply style change in fetch_rlog() suggested by vlorentz

Build is green

Patch application report for D6593 (id=24027)

Could not rebase; Attempt merge onto 7e2d8d8996...

Auto-merging swh/loader/cvs/loader.py
Merge made by the 'recursive' strategy.
 swh/loader/cvs/cvsclient.py                     |  49 +++-
 swh/loader/cvs/loader.py                        |  96 ++++++--
 swh/loader/cvs/rlog.py                          |  15 +-
 swh/loader/cvs/tests/data/dino-readded-file.tgz | Bin 0 -> 18140 bytes
 swh/loader/cvs/tests/data/greek-repository4.tgz | Bin 0 -> 12393 bytes
 swh/loader/cvs/tests/data/greek-repository5.tgz | Bin 0 -> 12380 bytes
 swh/loader/cvs/tests/test_loader.py             | 307 +++++++++++++++++++++++-
 7 files changed, 429 insertions(+), 38 deletions(-)
 create mode 100644 swh/loader/cvs/tests/data/dino-readded-file.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository4.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository5.tgz
Changes applied before test
commit e7ffea5ae16ea0064cdf781c77f9e3ae10773f77
Merge: 7e2d8d8 b41d868
Author: Jenkins user <jenkins@localhost>
Date:   Fri Nov 5 12:54:38 2021 +0000

    Merge branch 'diff-target' into HEAD

commit b41d868cd041a5f151079d062ffa9856018ffbf3
Author: Stefan Sperling <stsp@stsp.name>
Date:   Fri Nov 5 13:53:19 2021 +0100

    apply style change in fetch_rlog() suggested by vlorentz

commit d543b0eb0976c56544c2d3dccc821f316e1572ff
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 16:55:51 2021 +0200

    handle Attic-only RCS files over CVS pserver
    
    CVS repositories may contain RCS history in file,v as well as
    a corresponding Attic/file,v where each file contains separate
    events that occurred in history. The Attic version of the file
    results from file deletion events.
    
    The rsync access method already uses history found in the Attic.
    However, a CVS server will only return RCS files from the Attic
    if we request them explicitly. If we do not request them then our
    converted history may end up missing deletion events for some files.
    Unfortunately, we cannot tell which RCS files have a corresponding
    file in the Attic, so we need to search all Attic directories by
    running the equivalent of 'cvs rlog' in each directory. This slows
    down pserver access considerably (and it was already quite slow
    compared to rsync). But we need to pay this price in order to
    obtain a valid conversion result.
    
    This patch contains related fixes to cvsroot path handling, which
    was broken for the pserver case. Without these fixes we cannot
    create the correct paths for Attic directories to search.
    
    Problem found while comparing conversion results of rsync and
    pserver access methods for the GNU dino CVS repository at
    cvs.savannah.gnu.org/sources/dino
    Add two new test cases based on RCS files from this repository.
    
    Without this fix in place history would diverge at this commit:
      8891a63 | larsl | Removed the MIDIEvent class | 04 May 2006, 01:11 UTC
    Because the files midievent.cpp and midievent.hpp would not get deleted
    when converting this commit via the pserver protocol.

commit 79867ac6d86723449528f7baf7fea7e4cbc5ef63
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:56:58 2021 +0200

    improve test coverage of file additions and deletions
    
    Make an existing test case run over pserver as well.
    This access method uses a different way of detecting file
    additions and deletions and should be tested separately.
    
    Add new tests to cover the re-addition of a file after it
    was deleted.

commit db33b8006c282e2a030996d8de6099f8f4130825
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:55:23 2021 +0200

    display file state in progress logging output

commit 2ae33d6c66f93850d23fdbb2e410b3ae27942242
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 11:31:27 2021 +0200

    add support for RCS keyword expansion over pserver protocol
    
    We can simply ask the CVS server to expand keywords for us, instead
    of forcing binary file mode with the -kb option. The CVS repository
    contains per-file keyword expansion defaults the server will use.
    Files checked out by cvsclient.py should now match what a regular
    CVS client would check out by default.
    
    Add test cases which verify that we create the same snapshot ID
    for a repository which uses the Id keyword in a file, regardless
    of whether this repository is accessed via rsync or pserver.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/54/ for more details.

swh/loader/cvs/loader.py
106–111

ping?

swh/loader/cvs/loader.py
106–111

The cvsroot path has different semantics based on the access protocol.

With rsync it is indeed a temporary directory where a full copy of the cvs repository is located within. Support for rsync was implemented first (without giving much thought to the pserver case) so this variable ended up in the constructor.

But in the pserver case we need to initialize this to something else. The cvsroot is just a path on the server in that case, and there is no local copy of the repository. In order to construct paths into the Attic directories correctly, we need to use the server-side cvsroot path, not a path to a local directory somewhere in /tmp. And we should skip the creation of an unused temporary directory in the pserver case.

Does this make sense now?

swh/loader/cvs/loader.py
106–111

I see.

Then, instead of cast(str, self.cvsroot_path), which converts *any* type to a string, thereby bypassing type-checking, you should just use assert self.cvsroot_path, which causes mypy to infer the type is str instead of Optional[str].

assert self.cvsroot_path instead of casting and squash all changes

swh/loader/cvs/loader.py
106–111

Thanks, done.

Build is green

Patch application report for D6593 (id=24061)

Could not rebase; Attempt merge onto 7e2d8d8996...

Updating 7e2d8d8..3cd00f7
Fast-forward
 swh/loader/cvs/cvsclient.py                     |  49 +++-
 swh/loader/cvs/loader.py                        |  92 +++++--
 swh/loader/cvs/rlog.py                          |  15 +-
 swh/loader/cvs/tests/data/dino-readded-file.tgz | Bin 0 -> 18140 bytes
 swh/loader/cvs/tests/data/greek-repository4.tgz | Bin 0 -> 12393 bytes
 swh/loader/cvs/tests/data/greek-repository5.tgz | Bin 0 -> 12380 bytes
 swh/loader/cvs/tests/test_loader.py             | 307 +++++++++++++++++++++++-
 7 files changed, 427 insertions(+), 36 deletions(-)
 create mode 100644 swh/loader/cvs/tests/data/dino-readded-file.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository4.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository5.tgz
Changes applied before test
commit 3cd00f7009c973818f94873a61dafe4aeab5b458
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 16:55:51 2021 +0200

    handle Attic-only RCS files over CVS pserver
    
    CVS repositories may contain RCS history in file,v as well as
    a corresponding Attic/file,v where each file contains separate
    events that occurred in history. The Attic version of the file
    results from file deletion events.
    
    The rsync access method already uses history found in the Attic.
    However, a CVS server will only return RCS files from the Attic
    if we request them explicitly. If we do not request them then our
    converted history may end up missing deletion events for some files.
    Unfortunately, we cannot tell which RCS files have a corresponding
    file in the Attic, so we need to search all Attic directories by
    running the equivalent of 'cvs rlog' in each directory. This slows
    down pserver access considerably (and it was already quite slow
    compared to rsync). But we need to pay this price in order to
    obtain a valid conversion result.
    
    This patch contains related fixes to cvsroot path handling, which
    was broken for the pserver case. Without these fixes we cannot
    create the correct paths for Attic directories to search.
    
    Problem found while comparing conversion results of rsync and
    pserver access methods for the GNU dino CVS repository at
    cvs.savannah.gnu.org/sources/dino
    Add two new test cases based on RCS files from this repository.
    
    Without this fix in place history would diverge at this commit:
      8891a63 | larsl | Removed the MIDIEvent class | 04 May 2006, 01:11 UTC
    Because the files midievent.cpp and midievent.hpp would not get deleted
    when converting this commit via the pserver protocol.

commit e2a329bb243af7b162d4364e15f46ccdb2f406e2
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:56:58 2021 +0200

    improve test coverage of file additions and deletions
    
    Make an existing test case run over pserver as well.
    This access method uses a different way of detecting file
    additions and deletions and should be tested separately.
    
    Add new tests to cover the re-addition of a file after it
    was deleted.

commit 3b9e82cdb3a22cf295c2e8f3b34efa368ddcdb52
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:55:23 2021 +0200

    display file state in progress logging output

commit 7333831c767206c2fcb1daead4428e090f7d9303
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 11:31:27 2021 +0200

    add support for RCS keyword expansion over pserver protocol
    
    We can simply ask the CVS server to expand keywords for us, instead
    of forcing binary file mode with the -kb option. The CVS repository
    contains per-file keyword expansion defaults the server will use.
    Files checked out by cvsclient.py should now match what a regular
    CVS client would check out by default.
    
    Add test cases which verify that we create the same snapshot ID
    for a repository which uses the Id keyword in a file, regardless
    of whether this repository is accessed via rsync or pserver.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/55/ for more details.

This revision is now accepted and ready to land.Nov 9 2021, 2:27 PM

Build is green

Patch application report for D6593 (id=24080)

Rebasing onto d72f15f24d...

Current branch diff-target is up to date.
Changes applied before test
commit d28a4b21c56a782f1c1db4b6f209a0d313a7f7dc
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 16:55:51 2021 +0200

    handle Attic-only RCS files over CVS pserver
    
    CVS repositories may contain RCS history in file,v as well as
    a corresponding Attic/file,v where each file contains separate
    events that occurred in history. The Attic version of the file
    results from file deletion events.
    
    The rsync access method already uses history found in the Attic.
    However, a CVS server will only return RCS files from the Attic
    if we request them explicitly. If we do not request them then our
    converted history may end up missing deletion events for some files.
    Unfortunately, we cannot tell which RCS files have a corresponding
    file in the Attic, so we need to search all Attic directories by
    running the equivalent of 'cvs rlog' in each directory. This slows
    down pserver access considerably (and it was already quite slow
    compared to rsync). But we need to pay this price in order to
    obtain a valid conversion result.
    
    This patch contains related fixes to cvsroot path handling, which
    was broken for the pserver case. Without these fixes we cannot
    create the correct paths for Attic directories to search.
    
    Problem found while comparing conversion results of rsync and
    pserver access methods for the GNU dino CVS repository at
    cvs.savannah.gnu.org/sources/dino
    Add two new test cases based on RCS files from this repository.
    
    Without this fix in place history would diverge at this commit:
      8891a63 | larsl | Removed the MIDIEvent class | 04 May 2006, 01:11 UTC
    Because the files midievent.cpp and midievent.hpp would not get deleted
    when converting this commit via the pserver protocol.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/60/ for more details.