Page MenuHomeSoftware Heritage

add CVS commit ID support to rlog.py
ClosedPublic

Authored by stsp on Nov 9 2021, 1:27 PM.

Details

Summary

Newer CVS clients tag commits with a commit ID which allows us to
correctly convert commits which changed several RCS files at once.
The rsync access method based on cvs2gitdump was already taking
advantage of this. To ensure that conversions over the pserver
protocol yield the same result as conversions over rsync we need
to add commit ID support to rlog.py.

Add two new test cases which convert the same repository over
rsync and pserver respectively, and ensure that they yield the
same result. Without commit ID support conversion over pserver
produces a different result for this particular test repository.

Diff Detail

Repository
rDLDCVS CVS Loader
Branch
rlog-commitid
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24958
Build 39002: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 39001: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6623 (id=24064)

Could not rebase; Attempt merge onto 7e2d8d8996...

Updating 7e2d8d8..8760add
Fast-forward
 swh/loader/cvs/cvsclient.py                     |  49 ++-
 swh/loader/cvs/loader.py                        |  92 ++++--
 swh/loader/cvs/rlog.py                          |  40 ++-
 swh/loader/cvs/tests/data/dino-commitid.tgz     | Bin 0 -> 17083 bytes
 swh/loader/cvs/tests/data/dino-readded-file.tgz | Bin 0 -> 18140 bytes
 swh/loader/cvs/tests/data/greek-repository4.tgz | Bin 0 -> 12393 bytes
 swh/loader/cvs/tests/data/greek-repository5.tgz | Bin 0 -> 12380 bytes
 swh/loader/cvs/tests/test_loader.py             | 391 +++++++++++++++++++++++-
 8 files changed, 534 insertions(+), 38 deletions(-)
 create mode 100644 swh/loader/cvs/tests/data/dino-commitid.tgz
 create mode 100644 swh/loader/cvs/tests/data/dino-readded-file.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository4.tgz
 create mode 100644 swh/loader/cvs/tests/data/greek-repository5.tgz
Changes applied before test
commit 8760add65aafc164298b66b77aba0496b4465d25
Author: Stefan Sperling <stsp@stsp.name>
Date:   Tue Nov 9 11:57:16 2021 +0100

    add CVS commit ID support to rlog.py
    
    Newer CVS clients tag commits with a commit ID which allows us to
    correctly convert commits which changed several RCS files at once.
    The rsync access method based on cvs2gitdump was already taking
    advantage of this. To ensure that conversions over the pserver
    protocol yield the same result as conversions over rsync we need
    to add commit ID support to rlog.py.
    
    Add two new test cases which convert the same repository over
    rsync and pserver respectively, and ensure that they yield the
    same result. Without commit ID support conversion over pserver
    produces a different result for this particular test repository.

commit 3cd00f7009c973818f94873a61dafe4aeab5b458
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 16:55:51 2021 +0200

    handle Attic-only RCS files over CVS pserver
    
    CVS repositories may contain RCS history in file,v as well as
    a corresponding Attic/file,v where each file contains separate
    events that occurred in history. The Attic version of the file
    results from file deletion events.
    
    The rsync access method already uses history found in the Attic.
    However, a CVS server will only return RCS files from the Attic
    if we request them explicitly. If we do not request them then our
    converted history may end up missing deletion events for some files.
    Unfortunately, we cannot tell which RCS files have a corresponding
    file in the Attic, so we need to search all Attic directories by
    running the equivalent of 'cvs rlog' in each directory. This slows
    down pserver access considerably (and it was already quite slow
    compared to rsync). But we need to pay this price in order to
    obtain a valid conversion result.
    
    This patch contains related fixes to cvsroot path handling, which
    was broken for the pserver case. Without these fixes we cannot
    create the correct paths for Attic directories to search.
    
    Problem found while comparing conversion results of rsync and
    pserver access methods for the GNU dino CVS repository at
    cvs.savannah.gnu.org/sources/dino
    Add two new test cases based on RCS files from this repository.
    
    Without this fix in place history would diverge at this commit:
      8891a63 | larsl | Removed the MIDIEvent class | 04 May 2006, 01:11 UTC
    Because the files midievent.cpp and midievent.hpp would not get deleted
    when converting this commit via the pserver protocol.

commit e2a329bb243af7b162d4364e15f46ccdb2f406e2
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:56:58 2021 +0200

    improve test coverage of file additions and deletions
    
    Make an existing test case run over pserver as well.
    This access method uses a different way of detecting file
    additions and deletions and should be tested separately.
    
    Add new tests to cover the re-addition of a file after it
    was deleted.

commit 3b9e82cdb3a22cf295c2e8f3b34efa368ddcdb52
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 13:55:23 2021 +0200

    display file state in progress logging output

commit 7333831c767206c2fcb1daead4428e090f7d9303
Author: Stefan Sperling <stsp@stsp.name>
Date:   Sat Oct 30 11:31:27 2021 +0200

    add support for RCS keyword expansion over pserver protocol
    
    We can simply ask the CVS server to expand keywords for us, instead
    of forcing binary file mode with the -kb option. The CVS repository
    contains per-file keyword expansion defaults the server will use.
    Files checked out by cvsclient.py should now match what a regular
    CVS client would check out by default.
    
    Add test cases which verify that we create the same snapshot ID
    for a repository which uses the Id keyword in a file, regardless
    of whether this repository is accessed via rsync or pserver.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/56/ for more details.

stsp requested review of this revision.Nov 9 2021, 1:28 PM
vlorentz requested changes to this revision.Nov 9 2021, 2:27 PM
vlorentz added a subscriber: vlorentz.

Looks good

A few nitpicks below about the code style

swh/loader/cvs/rlog.py
48

Could you import string and use string.ascii_lowercase etc below?

This avoid lowercased constants in the global namespace

421–427

avoids building a string

swh/loader/cvs/tests/test_loader.py
660–672

see D6607

697–710

ditto

This revision now requires changes to proceed.Nov 9 2021, 2:27 PM

changes suggested by vlorentz

Build is green

Patch application report for D6623 (id=24083)

Rebasing onto d28a4b21c5...

Current branch diff-target is up to date.
Changes applied before test
commit f5b974a0095120805cc3e570f08a9fa5582fc3f8
Author: Stefan Sperling <stsp@stsp.name>
Date:   Tue Nov 9 11:57:16 2021 +0100

    add CVS commit ID support to rlog.py
    
    Newer CVS clients tag commits with a commit ID which allows us to
    correctly convert commits which changed several RCS files at once.
    The rsync access method based on cvs2gitdump was already taking
    advantage of this. To ensure that conversions over the pserver
    protocol yield the same result as conversions over rsync we need
    to add commit ID support to rlog.py.
    
    Add two new test cases which convert the same repository over
    rsync and pserver respectively, and ensure that they yield the
    same result. Without commit ID support conversion over pserver
    produces a different result for this particular test repository.
    
    With feedback about coding style from vlorentz.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/61/ for more details.

This revision is now accepted and ready to land.Nov 10 2021, 11:14 AM
This revision was automatically updated to reflect the committed changes.