Page MenuHomeSoftware Heritage

attempt to avoid content differences due to paths in keywords
ClosedPublic

Authored by stsp on Nov 24 2021, 10:52 AM.

Details

Summary

Some RCS keywords, such has "Header", contain absolute file paths
derived from the on-disk filesystem path of the CVS repository.

When we fetch files over the pserver protocol such keywords are
expanded by the CVS server. But when using the rsync protocol we
will first copy the CVS repository to local disk and the path to
this local copy will correspond to some temporary directory.

Try to avoid file content differences between pserver and rsync
access methods by deriving a likely server-side path from path
information found in the rsync:// origin URL.
This will work as expected as long as the CVS server-side setup
exposes the same path to the CVS repository over both access
methods, which is the case for GNU savannah for example.

In general, we should recommend treating pserver and rsync as distinct
origins and not rely on them to be interchangable and always produce
the same conversion result. But we can still try our best to avoid
needless differences in content hashes.

Diff Detail

Repository
rDLDCVS CVS Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6678 (id=24264)

Rebasing onto 1f6580c4c8...

Current branch diff-target is up to date.
Changes applied before test
commit 5539ccb67b2a3398019eff571f633a227a498fcd
Author: Stefan Sperling <stsp@stsp.name>
Date:   Tue Nov 23 15:16:11 2021 +0100

    attempt to avoid content differences due to paths in keywords
    
    Some RCS keywords, such has "Header", contain absolute file paths
    derived from the on-disk filesystem path of the CVS repository.
    
    When we fetch files over the pserver protocol such keywords are
    expanded by the CVS server. But when using the rsync protocol we
    will first copy the CVS repository to local disk and the path to
    this local copy will correspond to some temporary directory.
    
    Try to avoid file content differences between pserver and rsync
    access methods by deriving a likely server-side path from path
    information found in the rsync:// origin URL.
    This will work as expected as long as the CVS server-side setup
    exposes the same path to the CVS repository over both access
    methods, which is the case for GNU savannah for example.
    
    In general, we should recommend treating pserver and rsync as distinct
    origins and not rely on them to be interchangable and always produce
    the same conversion result. But we can still try our best to avoid
    needless differences in content hashes.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/65/ for more details.

stsp requested review of this revision.Nov 24 2021, 10:54 AM

wow...

Good catch, thanks. Could you add a test for this (so we don't accidentally change it if we ever refactor the loader in the future)?

wow...

Good catch, thanks. Could you add a test for this (so we don't accidentally change it if we ever refactor the loader in the future)?

The tests do not actually use rsync:// URLs, they use file:// URLs which simply pretend that the rsync copy process has already happened.
Otherwise, running the tests would require an rsyncd on localhost with a config that exposes the right set of temporary repository directories etc.
I do not want to impose such a burden on people just to have tests passing with pytest.

And since our tests control the absolute paths to the test repositories they also control the expanded value of keywords like $Header$.
This means the tests would pass even without this fix in place. So I am not sure such a test would really be meaningful.

In D6678#174093, @stsp wrote:

This means the tests would pass even without this fix in place. So I am not sure such a test would really be meaningful.

We do however get something useful from this fix. This fix allows me to proceed with trying to get CVS's own history converted
over both pserver and rsync in the same way. This is not yet working and I am uncovering other issues in this process which
I am adding test coverage for.

In D6678#174093, @stsp wrote:

And since our tests control the absolute paths to the test repositories they also control the expanded value of keywords like $Header$.
This means the tests would pass even without this fix in place. So I am not sure such a test would really be meaningful.

Thinking about this some more, I might be wrong here. I can add a test which expands $Header$ and see what happens.

  • add a test for conversion of a file which contains a Header keyword
In D6678#174111, @stsp wrote:
  • add a test for conversion of a file which contains a Header keyword

This commit adds a test. But as expected it passes regardless of whether the fix is in place. So the test is somewhat pointless.
Any better ideas?

Build is green

Patch application report for D6678 (id=24320)

Rebasing onto 1f6580c4c8...

Current branch diff-target is up to date.
Changes applied before test
commit bc00d6b16979a733f560e1a07db0da8f5bb92fb3
Author: Stefan Sperling <stsp@stsp.name>
Date:   Fri Nov 26 15:32:58 2021 +0100

    add a test for conversion of a file which contains a Header keyword

commit 5539ccb67b2a3398019eff571f633a227a498fcd
Author: Stefan Sperling <stsp@stsp.name>
Date:   Tue Nov 23 15:16:11 2021 +0100

    attempt to avoid content differences due to paths in keywords
    
    Some RCS keywords, such has "Header", contain absolute file paths
    derived from the on-disk filesystem path of the CVS repository.
    
    When we fetch files over the pserver protocol such keywords are
    expanded by the CVS server. But when using the rsync protocol we
    will first copy the CVS repository to local disk and the path to
    this local copy will correspond to some temporary directory.
    
    Try to avoid file content differences between pserver and rsync
    access methods by deriving a likely server-side path from path
    information found in the rsync:// origin URL.
    This will work as expected as long as the CVS server-side setup
    exposes the same path to the CVS repository over both access
    methods, which is the case for GNU savannah for example.
    
    In general, we should recommend treating pserver and rsync as distinct
    origins and not rely on them to be interchangable and always produce
    the same conversion result. But we can still try our best to avoid
    needless differences in content hashes.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/67/ for more details.

Ok. Make sure you document it (in docs/), so we don't forget. It might be useful to add some rsync:// tests, but that's out of scope for this diff

This revision is now accepted and ready to land.Nov 26 2021, 4:33 PM