Page MenuHomeSoftware Heritage

fix expansion of the Log keyword with rsync origins
ClosedPublic

Authored by stsp on Dec 4 2021, 5:39 PM.

Details

Summary

Align our expansion of Log keywords with the behaviour of a real
CVS server. With this, such keywords expand the same way over
the pserver and rsync access methods.

This is the last change required to consistently ingest CVS's own
CVS repository over both pserver and rsync. Otherwise we get commit
hash mis-matches due to differently expanded Log keywords.

Diff Detail

Repository
rDLDCVS CVS Loader
Branch
kwexpand-log
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 25376
Build 39663: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 39662: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D6745 (id=24494)

Rebasing onto f36332c7e3...

Current branch diff-target is up to date.
Changes applied before test
commit 97a215a68a89403b9097e308dd0c696209a0cc4e
Author: Stefan Sperling <stsp@stsp.name>
Date:   Wed Nov 24 12:13:11 2021 +0100

    fix expansion of the Log keyword with rsync origins
    
    Align our expansion of Log keywords with the behaviour of a real
    CVS server. With this, such keywords expand the same way over
    the pserver and rsync access methods.
    
    This is the last change required to consistently ingest CVS's own
    CVS repository over both pserver and rsync. Otherwise we get commit
    hash mis-matches due to differently expanded Log keywords.

Link to build: https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/69/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/69/console

Harbormaster returned this revision to the author for changes because remote builds failed.Dec 4 2021, 5:40 PM
Harbormaster failed remote builds in B25368: Diff 24494!

fix formatting for flake8

Build is green

Patch application report for D6745 (id=24495)

Rebasing onto f36332c7e3...

Current branch diff-target is up to date.
Changes applied before test
commit 1b62b5b58dba58189b064736212cc8dea5db3454
Author: Stefan Sperling <stsp@stsp.name>
Date:   Wed Nov 24 12:13:11 2021 +0100

    fix expansion of the Log keyword with rsync origins
    
    Align our expansion of Log keywords with the behaviour of a real
    CVS server. With this, such keywords expand the same way over
    the pserver and rsync access methods.
    
    This is the last change required to consistently ingest CVS's own
    CVS repository over both pserver and rsync. Otherwise we get commit
    hash mis-matches due to differently expanded Log keywords.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/70/ for more details.

stsp requested review of this revision.Dec 4 2021, 7:06 PM
ardumont added a subscriber: ardumont.

Thanks.

lgtm. But, I mostly understand the rationale and the tests.

Note: There is still the question of the high number of snapshots which should really be 1 (1 snapshot per visit).
I guess you'll look into it when you got the chance to.

swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
578

At some point when you get a chance to it, please roughly describe the expansion
algorithm in the docstring.

So the code below gets a bit more understandable.

669

This new branching part is missing coverage.

This revision is now accepted and ready to land.Dec 6 2021, 10:05 AM
douardda added inline comments.
swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
585

Not sure it makes a big difference, but you could keep ret a list and return b''.join(ret) (no idea how big this can get and how many lines are expected, but in doubt, I would stuck with the "proper" list-based version).

634

could you explain what this is needed?

vlorentz added inline comments.
swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
585

BTW, the motivation for this is that by default, appending to a string in Python requires copying the string, making the whole function quadratic in the number of inputs (n copies of a string of size O(n)).

(Here it's probably fine because CPython has an optimization to reallocate strings when appending when they have a single reference pointing to them, but I wouldn't bet my life on it; it's easy to accidentally add stray references (eg. in closures))

swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
669

This new branching part is missing coverage.

I might have added this without good reason. I will check whether there was a case in CVS's CVS repository which requires this. Otherwise we can just drop the statement.

However, I have already addressed several other comments, and running test conversions will take a while. So I will soon upload what I have to address the other comments, with this new branch still in place.

swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
578

At some point when you get a chance to it, please roughly describe the expansion
algorithm in the docstring.

It is not much of an "algorithm" , it is more like "whatever CVS does".
I will be adding a link to relevant CVS docs. For most details however the CVS implementation behaviour dictates what we need to do.

585

Thanks, this makes sense. I will be switching back to a list in the next version of the patch.

634

could you explain what this is needed?

This is hard to explain without an example. I will add a comment in the next iteration of the patch which should hopefully make things more clear.

addressed review comments; see my responses to various comments for details

Build is green

Patch application report for D6745 (id=24502)

Rebasing onto f36332c7e3...

Current branch diff-target is up to date.
Changes applied before test
commit 3a3d6a1050a78fdacfb253226c7a9f3fadbb49a2
Author: Stefan Sperling <stsp@stsp.name>
Date:   Wed Nov 24 12:13:11 2021 +0100

    fix expansion of the Log keyword with rsync origins
    
    Align our expansion of Log keywords with the behaviour of a real
    CVS server. With this, such keywords expand the same way over
    the pserver and rsync access methods.
    
    This is the last change required to consistently ingest CVS's own
    CVS repository over both pserver and rsync. Otherwise we get commit
    hash mis-matches due to differently expanded Log keywords.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/71/ for more details.

Build is green

Patch application report for D6745 (id=24503)

Rebasing onto f36332c7e3...

Current branch diff-target is up to date.
Changes applied before test
commit 37fbd0932e6aaf33f8c9b7da291c3249ab24fe97
Author: Stefan Sperling <stsp@stsp.name>
Date:   Wed Nov 24 12:13:11 2021 +0100

    fix expansion of the Log keyword with rsync origins
    
    Align our expansion of Log keywords with the behaviour of a real
    CVS server. With this, such keywords expand the same way over
    the pserver and rsync access methods.
    
    This is the last change required to consistently ingest CVS's own
    CVS repository over both pserver and rsync. Otherwise we get commit
    hash mis-matches due to differently expanded Log keywords.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/72/ for more details.

swh/loader/cvs/cvs2gitdump/cvs2gitdump.py
669

Tweaking of the final \n was indeed not necessary in order to convert CVS's own history properly. I will remove these two lines in the next version of the patch.

remove unnecessary tweaking of final \n in log buffer

Build is green

Patch application report for D6745 (id=24532)

Rebasing onto f36332c7e3...

Current branch diff-target is up to date.
Changes applied before test
commit 099959bbfa73438eda5c08ebfa0f1f766808fcc7
Author: Stefan Sperling <stsp@stsp.name>
Date:   Wed Nov 24 12:13:11 2021 +0100

    fix expansion of the Log keyword with rsync origins
    
    Align our expansion of Log keywords with the behaviour of a real
    CVS server. With this, such keywords expand the same way over
    the pserver and rsync access methods.
    
    This is the last change required to consistently ingest CVS's own
    CVS repository over both pserver and rsync. Otherwise we get commit
    hash mis-matches due to differently expanded Log keywords.

See https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/73/ for more details.