
npm: write metadata on revisions instead of snapshots.
Closed, Public

Authored by vlorentz on Oct 5 2020, 2:27 PM.

Details

Summary

Writing metadata on snapshots allowed us to store the raw metadata from the API,
but it causes a lot of duplication: after running for only a couple of months,
the metadata storage is already 700GB in size, mostly because of these
snapshot-level objects (e.g. 150k of them are over 1MB each).

The metadata we wrote on snapshots was made of:

  • a 'versions' dict, whose content is moved to revisions
  • a 'time' dict, with one timestamp per version, which is used as the date of revision objects
  • 'dist-tags', which are currently ignored but should be converted to ALIAS branches in a future commit
  • a '_rev' property, which is internal to NPM, so not useful to archive
  • everything else can be recomputed from the metadata of the latest version.
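
As a rough illustration of the split described above (not the actual loader code), the sketch below takes one raw package document and produces one small record per version. The function name `split_package_metadata` and the example document are hypothetical; only the top-level keys ('versions', 'time', 'dist-tags', '_rev') come from the summary.

```python
from typing import Any, Dict, Optional, Tuple


def split_package_metadata(
    package_metadata: Dict[str, Any],
) -> Dict[str, Tuple[Dict[str, Any], Optional[str]]]:
    """Map each version to (its own metadata, its timestamp).

    Hypothetical helper: it only illustrates how per-snapshot metadata
    can be broken down into per-revision pieces.
    """
    versions = package_metadata.get("versions", {})
    times = package_metadata.get("time", {})
    # 'dist-tags' and '_rev' are deliberately not copied anywhere:
    # the former is ignored for now, the latter is internal to the registry.
    return {
        version: (version_metadata, times.get(version))
        for version, version_metadata in versions.items()
    }


# Made-up document shaped like the metadata described in the summary.
example = {
    "_rev": "2-0123abcd",
    "name": "example-package",
    "dist-tags": {"latest": "1.0.1"},
    "versions": {
        "1.0.0": {"name": "example-package", "version": "1.0.0"},
        "1.0.1": {"name": "example-package", "version": "1.0.1"},
    },
    "time": {
        "1.0.0": "2020-01-01T00:00:00.000Z",
        "1.0.1": "2020-02-01T00:00:00.000Z",
    },
}

per_version = split_package_metadata(example)
for version, (metadata, timestamp) in sorted(per_version.items()):
    print(version, timestamp, metadata)
```

Each per-version record stays the same from one visit to the next, so it only needs to be stored once, whereas the full document would be stored again for every new snapshot.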

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D4142 (id=14605)

Rebasing onto 64f2361c85...

Current branch diff-target is up to date.
Changes applied before test
commit 53f778c6b9e4af9dae389d5f306d32cc7627f9a5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 5 14:26:31 2020 +0200

    npm: write metadata on revisions instead of snapshots.
    
    Writing them on snapshot allowed us to write the raw metadata from the API,
    but it causes a lot of duplication; after running for only a couple of months,
    the metadata storage is already 700GB in size, mostly because of these
    (eg. there are 150k over 1MB each).

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/322/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/322/console

vlorentz added a reviewer: olasd.

Build has FAILED

Patch application report for D4142 (id=14606)

Rebasing onto 64f2361c85...

Current branch diff-target is up to date.
Changes applied before test
commit 58673a23f0095fb4613fdb148d5a6bfec658c897
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 5 14:26:31 2020 +0200

    npm: write metadata on revisions instead of snapshots.
    
    Writing them on snapshot allowed us to write the raw metadata from the API,
    but it causes a lot of duplication; after running for only a couple of months,
    the metadata storage is already 700GB in size, mostly because of these
    (eg. there are 150k over 1MB each).
    
    The metadata we wrote on snapshot was made of:
    
    * a 'versions' dict, whose content is moved to revisions
    * a 'time' dict, with one timestamp per version, which is used as the
      data of revision objects
    * 'dist-tags', which is currently ignored, but should be converted to
      ALIAS branches in a future commit.
    * a '_rev' property, which is internal to NPM, so not useful to archive
    * everything else can be recomputed from the metadata of the latest
      version.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/323/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/323/console

vlorentz edited the summary of this revision.

fix typo in commit message + remove irrelevant FIXME.

Build has FAILED

Patch application report for D4142 (id=14607)

Rebasing onto 64f2361c85...

Current branch diff-target is up to date.
Changes applied before test
commit bfff50ff98681f6a8cd4328142e84bbded2878b3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 5 14:26:31 2020 +0200

    npm: write metadata on revisions instead of snapshots.
    
    Writing them on snapshot allowed us to write the raw metadata from the API,
    but it causes a lot of duplication; after running for only a couple of months,
    the metadata storage is already 700GB in size, mostly because of these
    (eg. there are 150k over 1MB each).
    
    The metadata we wrote on snapshots was made of:
    
    * a 'versions' dict, whose content is moved to revisions
    * a 'time' dict, with one timestamp per version, which is used as the
      data of revision objects
    * 'dist-tags', which is currently ignored, but should be converted to
      ALIAS branches in a future commit.
    * a '_rev' property, which is internal to NPM, so not useful to archive
    * everything else can be recomputed from the metadata of the latest
      version.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/324/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/324/console

Build is green

Patch application report for D4142 (id=14609)

Could not rebase; Attempt merge onto 64f2361c85...

Updating 64f2361..fe59ce8
Fast-forward
 swh/loader/package/debian/loader.py      |  8 +++-
 swh/loader/package/npm/loader.py         | 15 ++++---
 swh/loader/package/npm/tests/test_npm.py | 71 +++++++++++++++++---------------
 3 files changed, 51 insertions(+), 43 deletions(-)
Changes applied before test
commit fe59ce84e53b40797f86f03c3943e83889fffec6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 5 14:26:31 2020 +0200

    npm: write metadata on revisions instead of snapshots.
    
    Writing them on snapshot allowed us to write the raw metadata from the API,
    but it causes a lot of duplication; after running for only a couple of months,
    the metadata storage is already 700GB in size, mostly because of these
    (eg. there are 150k over 1MB each).
    
    The metadata we wrote on snapshots was made of:
    
    * a 'versions' dict, whose content is moved to revisions
    * a 'time' dict, with one timestamp per version, which is used as the
      data of revision objects
    * 'dist-tags', which is currently ignored, but should be converted to
      ALIAS branches in a future commit.
    * a '_rev' property, which is internal to NPM, so not useful to archive
    * everything else can be recomputed from the metadata of the latest
      version.

commit 7d3d8ffa304d9674cf2f3253d3c9df543135ca10
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 5 14:47:24 2020 +0200

    debian: Fix mypy error by asserting ChangeBlock.package is not None.
    
    Under some conditions, mypy can detect it is declared as Optional[str].

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/326/ for more details.

anlambert added a subscriber: anlambert.

Looks good to me.

This revision is now accepted and ready to land. Oct 5 2020, 3:09 PM