Differential D4242

pypi: write metadata on revisions instead of snapshots.
Closed · Public

Authored by vlorentz on Oct 13 2020, 11:04 AM.

Details

Reviewers

ardumont

Group Reviewers

Reviewers

Maniphest Tasks

T2667: Decide what to do with PyPI snapshot metadata

Commits

rDLDBASE0c766379f216: pypi: write metadata on revisions instead of snapshots.

Summary

Writing metadata on snapshots allowed us to store the raw metadata from the API,
but it causes a lot of duplication: after running for only a couple of months,
the metadata storage is already 700GB in size, mostly because of NPM
metadata, but also because of these PyPI snapshot metadata (e.g. many are over 1MB each).

The metadata we wrote on snapshots was made of:

* intrinsic metadata that PyPI extracted from the last upload
* info on each file (sdist or otherwise)

We don't need to archive the former this way, as it is intrinsic.
We keep loading the latter, but only for source files; extrinsic
metadata for binary files is discarded, as it is not useful.
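
The filtering described above could be sketched roughly as follows. This is only an illustration, not the code from this diff: the function name `sdist_metadata_by_file` and the sample data are hypothetical, and the sketch assumes the per-file entries carry a `packagetype` field (as in the `releases` mapping of the PyPI JSON API), with `"sdist"` marking source distributions.

```python
from typing import Any, Dict, Iterator, List, Tuple


def sdist_metadata_by_file(
    releases: Dict[str, List[Dict[str, Any]]]
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (version, file_info) pairs for source distributions only.

    Hypothetical helper: entries for binary files (wheels, eggs, ...) are
    skipped, so no extrinsic metadata would be written for them.
    """
    for version, files in releases.items():
        for file_info in files:
            if file_info.get("packagetype") == "sdist":
                yield version, file_info


# Made-up, trimmed-down sample shaped like the "releases" mapping.
releases = {
    "1.0": [
        {"filename": "example-1.0.tar.gz", "packagetype": "sdist"},
        {"filename": "example-1.0-py3-none-any.whl", "packagetype": "bdist_wheel"},
    ],
    "1.1": [
        {"filename": "example-1.1-py3-none-any.whl", "packagetype": "bdist_wheel"},
    ],
}

# Each kept entry would be attached to the revision loaded from that file,
# rather than to the snapshot of the whole project.
kept = list(sdist_metadata_by_file(releases))
```

With per-file entries like these, the metadata no longer has to be rewritten in full for every new snapshot, which is where the duplication came from.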

Closes T2667

Diff Detail

Repository

rDLDBASE Generic VCS/Package Loader

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. Oct 13 2020, 11:04 AM

Herald added a reviewer: Reviewers. Oct 13 2020, 11:04 AM

Build is green

Patch application report for D4242 (id=14987)

Rebasing onto fe59ce84e5...

Current branch diff-target is up to date.

Changes applied before test

commit 0c766379f216c3c6174d6e749921f9effac235cf
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Oct 13 11:01:58 2020 +0200

    pypi: write metadata on revisions instead of snapshots.
    
    Writing them on snapshot allowed us to write the raw metadata from the API,
    but it causes a lot of duplication; after running for only a couple of months,
    the metadata storage is already 700GB in size, mostly because of NPM
    metadata, but also because of these (eg. many over 1MB each).
    
    The metadata we wrote on snapshots was made of:
    
    * intrinsic metadata that PyPI extracted from the last upload
    * info on each file (sdist or otherwise)
    
    The former we don't need to archive like this (as they are intrinsic),
    and we keep loading the latter but only for source files and discard
    extrinsic metadata for binary files, as they are not useful.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/327/ for more details.

Harbormaster completed remote builds in B16211: Diff 14987. Oct 13 2020, 11:05 AM

ardumont added a subscriber: ardumont. Oct 13 2020, 11:25 AM

ardumont added inline comments.