Page MenuHomeSoftware Heritage

npm: Include package version id in ExtID manifest
ClosedPublic

Authored by vlorentz on Mar 30 2022, 11:33 AM.

Details

Summary

In at least one occurence, replicate.npmjs.com shows the exact same
tarball for two different versions.

As we only used the shasum to identify a release (formerly revision)
when loading incrementally, this causes two ExtIDs to be created for the
same target, which causes a crash when re-loading incrementally
(formerly, this caused duplicate branches' revision to be replaced by
the other one).

For example, https://archive.softwareheritage.org/swh:1:snp:9fe3ba7515a9b567ab7f66e20d1f4ff7f9f5cf23;origin=https://www.npmjs.com/package/koreanbots
has a branch named "releases/3.0.0-alpha3" which points to a revision
with the right date and directory, but "3.0.0-beta" as commit message.
This may have been caused by both revisions having the same directory.

This seems to have confused the next load of the loader, whose result
can be seen at https://archive.softwareheritage.org/swh:1:snp:e9dca2ee671f8d12a0163fe4cacfc61eabf7d8fe;origin=https://www.npmjs.com/package/koreanbots
as it used the revision in the "releases/3.0.0-alpha3" branch as the
target of the "releases/3.0.0-beta" branch.

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D7469 (id=27083)

Rebasing onto 4409e4b341...

Current branch diff-target is up to date.
Changes applied before test
commit 290018832527c854be90deb4f54e6220e1f19fd8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 30 10:47:25 2022 +0200

    npm: Include package version id in ExtID manifest
    
    In at least one occurence, replicate.npmjs.com shows the exact same
    tarball for two different versions.
    
    As we only used the shasum to identify a release (formerly revision)
    when loading incrementally, this causes two ExtIDs to be created for the
    same target, which causes a crash when re-loading incrementally
    (formerly, this caused duplicate branches' revision to be replaced by
    the other one).
    
    For example, https://archive.softwareheritage.org/swh:1:snp:9fe3ba7515a9b567ab7f66e20d1f4ff7f9f5cf23;origin=https://www.npmjs.com/package/koreanbots
    has a branch named "releases/3.0.0-alpha3" which points to a revision
    with the right date and directory, but "3.0.0-beta" as commit message.
    This may have been caused by both revisions having the same directory.
    
    This seems to have confused the next load of the loader, whose result
    can be seen at https://archive.softwareheritage.org/swh:1:snp:e9dca2ee671f8d12a0163fe4cacfc61eabf7d8fe;origin=https://www.npmjs.com/package/koreanbots
    as it used the revision in the "releases/3.0.0-alpha3" branch as the
    target of the "releases/3.0.0-beta" branch.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/703/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/703/console

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 30 2022, 11:35 AM
Harbormaster failed remote builds in B28000: Diff 27083!

Looks good to me but we got hit again by the race condition between mypy and flake8 on Jeninks so the diff is still in draft state.

Build is green

Patch application report for D7469 (id=27083)

Rebasing onto 4409e4b341...

Current branch diff-target is up to date.
Changes applied before test
commit 290018832527c854be90deb4f54e6220e1f19fd8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 30 10:47:25 2022 +0200

    npm: Include package version id in ExtID manifest
    
    In at least one occurence, replicate.npmjs.com shows the exact same
    tarball for two different versions.
    
    As we only used the shasum to identify a release (formerly revision)
    when loading incrementally, this causes two ExtIDs to be created for the
    same target, which causes a crash when re-loading incrementally
    (formerly, this caused duplicate branches' revision to be replaced by
    the other one).
    
    For example, https://archive.softwareheritage.org/swh:1:snp:9fe3ba7515a9b567ab7f66e20d1f4ff7f9f5cf23;origin=https://www.npmjs.com/package/koreanbots
    has a branch named "releases/3.0.0-alpha3" which points to a revision
    with the right date and directory, but "3.0.0-beta" as commit message.
    This may have been caused by both revisions having the same directory.
    
    This seems to have confused the next load of the loader, whose result
    can be seen at https://archive.softwareheritage.org/swh:1:snp:e9dca2ee671f8d12a0163fe4cacfc61eabf7d8fe;origin=https://www.npmjs.com/package/koreanbots
    as it used the revision in the "releases/3.0.0-alpha3" branch as the
    target of the "releases/3.0.0-beta" branch.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/704/ for more details.

Build is green

Patch application report for D7469 (id=27083)

Rebasing onto 4409e4b341...

Current branch diff-target is up to date.
Changes applied before test
commit 290018832527c854be90deb4f54e6220e1f19fd8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 30 10:47:25 2022 +0200

    npm: Include package version id in ExtID manifest
    
    In at least one occurence, replicate.npmjs.com shows the exact same
    tarball for two different versions.
    
    As we only used the shasum to identify a release (formerly revision)
    when loading incrementally, this causes two ExtIDs to be created for the
    same target, which causes a crash when re-loading incrementally
    (formerly, this caused duplicate branches' revision to be replaced by
    the other one).
    
    For example, https://archive.softwareheritage.org/swh:1:snp:9fe3ba7515a9b567ab7f66e20d1f4ff7f9f5cf23;origin=https://www.npmjs.com/package/koreanbots
    has a branch named "releases/3.0.0-alpha3" which points to a revision
    with the right date and directory, but "3.0.0-beta" as commit message.
    This may have been caused by both revisions having the same directory.
    
    This seems to have confused the next load of the loader, whose result
    can be seen at https://archive.softwareheritage.org/swh:1:snp:e9dca2ee671f8d12a0163fe4cacfc61eabf7d8fe;origin=https://www.npmjs.com/package/koreanbots
    as it used the revision in the "releases/3.0.0-alpha3" branch as the
    target of the "releases/3.0.0-beta" branch.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/705/ for more details.

So, from what I understand, we're currently generating releases combining intrinsic data (the target directory hash) and extrinsic data provided by the package index (release name and message).

Shouldn't we just ensure that the content of the generated release only uses intrinsic metadata parsed from the contents of the tarball?

In D7469#195352, @olasd wrote:

So, from what I understand, we're currently generating releases combining intrinsic data (the target directory hash) and extrinsic data provided by the package index (release name and message).

Correct.

Shouldn't we just ensure that the content of the generated release only uses intrinsic metadata parsed from the contents of the tarball?

Then we would lose information provided by NPM, but not included in the tarball (package name and version).

The right fix is to make all fields used in the release part of the ExtID.

  • npm: Add all fields we use to the ExtID manifest

Build is green

Patch application report for D7469 (id=27212)

Rebasing onto 4409e4b341...

Current branch diff-target is up to date.
Changes applied before test
commit 9c158d615d4955411a1e221a821a33025604e5db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Apr 5 10:35:15 2022 +0200

    npm: Add all fields we use to the ExtID manifest
    
    Just in case they also vary despite using the same id.

commit 290018832527c854be90deb4f54e6220e1f19fd8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 30 10:47:25 2022 +0200

    npm: Include package version id in ExtID manifest
    
    In at least one occurence, replicate.npmjs.com shows the exact same
    tarball for two different versions.
    
    As we only used the shasum to identify a release (formerly revision)
    when loading incrementally, this causes two ExtIDs to be created for the
    same target, which causes a crash when re-loading incrementally
    (formerly, this caused duplicate branches' revision to be replaced by
    the other one).
    
    For example, https://archive.softwareheritage.org/swh:1:snp:9fe3ba7515a9b567ab7f66e20d1f4ff7f9f5cf23;origin=https://www.npmjs.com/package/koreanbots
    has a branch named "releases/3.0.0-alpha3" which points to a revision
    with the right date and directory, but "3.0.0-beta" as commit message.
    This may have been caused by both revisions having the same directory.
    
    This seems to have confused the next load of the loader, whose result
    can be seen at https://archive.softwareheritage.org/swh:1:snp:e9dca2ee671f8d12a0163fe4cacfc61eabf7d8fe;origin=https://www.npmjs.com/package/koreanbots
    as it used the revision in the "releases/3.0.0-alpha3" branch as the
    target of the "releases/3.0.0-beta" branch.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/707/ for more details.

anlambert added inline comments.
swh/loader/package/npm/loader.py
57

looks like you can avoid string concatenation here

152

To avoid mixing up releases, we every field used to build the

This revision is now accepted and ready to land.Apr 5 2022, 3:53 PM
This revision was landed with ongoing or failed builds.Apr 7 2022, 10:17 AM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D7469 (id=27274)

Rebasing onto 62270989a2...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-709-D7469.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/709/ for more details.