Page MenuHomeSoftware Heritage

fix regular expression used for matching RCS keywords
AbandonedPublic

Authored by stsp on Nov 24 2021, 12:29 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T3691: Implement CVS loader
Summary

The function RcsKeywords.expand_keyword() uses a regular expression
to detect keywords in lines of a file. This expression treated the
closing $-sign of the keyword as optional. This fails when a line
contains an invalid keyword declaration which lacks the trailing
$-sign, followed by a valid keyword declaration.

For example, consider this line from a file in the ccvs CVS repository:

#ident	"@(#)cvs/contrib/pcl-cvs:$Name:  $Id$"

Here, the CVS server expands the $Id$ keyword, and it does not expand
the $Name$ keyword (which lacks a trailing $-sign).

Our regular expression would split this line into two match groups:

1: #ident	"@(#)cvs/contrib/pcl-cvs:$Name:  $
2: Id$

Neither match produces a valid keyword and hence expand_keyword()
would not expand keywords on this line.

Tweak our test data to cover the above example, and fix the regular
expression such that the trailing $-sign is required. With this change
in place we match the $Id$ part of the above example and expand keywords
on this line just like the CVS server does.

Diff Detail

Repository
rDLDCVS CVS Loader
Branch
kwexpand-regex
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 25151
Build 39303: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 39302: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6684 (id=24276)

Could not rebase; Attempt merge onto 1f6580c4c8...

Updating 1f6580c..00b218b
Fast-forward
 swh/loader/cvs/cvs2gitdump/cvs2gitdump.py       |   2 +-
 swh/loader/cvs/loader.py                        |  23 ++++++++++++++++++++++-
 swh/loader/cvs/tests/data/greek-repository4.tgz | Bin 12393 -> 12525 bytes
 swh/loader/cvs/tests/test_loader.py             |  20 ++++++++++----------
 4 files changed, 33 insertions(+), 12 deletions(-)
Changes applied before test
commit 00b218b3ef4ac51ec6547ae03dbc339c72db257e
Author: Stefan Sperling <stsp@stsp.name>
Date:   Wed Nov 24 12:13:11 2021 +0100

    fix regular expression used for matching RCS keywords
    
    The function RcsKeywords.expand_keyword() uses a regular expression
    to detect keywords in lines of a file. This expression treated the
    closing $-sign of the keyword as optional. This fails when a line
    contains an invalid keyword declaration which lacks the trailing
    $-sign, followed by a valid keyword declaration.
    
    For example, consider this line from a file in the ccvs CVS repository:
    
      #ident        "@(#)cvs/contrib/pcl-cvs:$Name:  $Id$"
    
    Here, the CVS server expands the $Id$ keyword, and it does not expand
    the $Name$ keyword (which lacks a trailing $-sign).
    
    Our regular expression would split this line into two match groups:
    
      1: #ident     "@(#)cvs/contrib/pcl-cvs:$Name:  $
      2: Id$
    
    Neither match produces a valid keyword and hence expand_keyword()
    would not expand keywords on this line.
    
    Tweak our test data to cover the above example, and fix the regular
    expression such that the trailing $-sign is required. With this change
    in place we match the $Id$ part of the above example and expand keywords
    on this line just like the CVS server does.

commit 5539ccb67b2a3398019eff571f633a227a498fcd
Author: Stefan Sperling <stsp@stsp.name>
Date:   Tue Nov 23 15:16:11 2021 +0100

    attempt to avoid content differences due to paths in keywords
    
    Some RCS keywords, such has "Header", contain absolute file paths
    derived from the on-disk filesystem path of the CVS repository.
    
    When we fetch files over the pserver protocol such keywords are
    expanded by the CVS server. But when using the rsync protocol we
    will first copy the CVS repository to local disk and the path to
    this local copy will correspond to some temporary directory.
    
    Try to avoid file content differences between pserver and rsync
    access methods by deriving a likely server-side path from path
    information found in the rsync:// origin URL.
    This will work as expected as long as the CVS server-side setup
    exposes the same path to the CVS repository over both access
    methods, which is the case for GNU savannah for example.
    
    In general, we should recommend treating pserver and rsync as distinct
    origins and not rely on them to be interchangable and always produce
    the same conversion result. But we can still try our best to avoid
    needless differences in content hashes.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/66/ for more details.

stsp requested review of this revision.Nov 24 2021, 12:31 PM

This patch is wrong. Keywords stored in already-expanded form in the repository are no longer being expanded. We need a different fix for this issue.