package loader: Discuss revision metadata normalization
Closed, Resolved (Public)

Description

An audit of the revision's metadata field shows that we use it to convey multiple meanings.
It'd be great if we had a look at how we want to set it.

This is with the understanding that this data should not be there (but it is for now).

current reimplementation for gnu/pypi/npm (T1389)

'metadata': {
  'intrinsic_metadata': intrinsic_metadata,  # raw metadata parsed out of internal file (PKG-INFO, package.json)
  'original_artifact': a_metadata,           # extrinsic artifact metadata, raw data from api/provider
  'hashes_artifact': a_c_metadata,           # extra artifact metadata computed by us
},

in-production pypi

'metadata': {
  'original_artifact': artifact,  # extrinsic metadata about the release artifact to download
  'project': project_info,  # extrinsic metadata about the project
},

in-production npm

'metadata': {
    'package_source': package_source_data,
    'package': package_metadata,
},

mercurial

all intrinsic metadata:

'node': hash_to_hex(header['node']),  # -> this probably should be in extra headers, must check if it is or not (unsure)
'extra_headers': [
    ['time_offset_seconds',
     str(commit['time_offset_seconds']).encode('utf-8')],
] + extra_meta

git

ret['metadata'] = {
    'extra_headers': git_metadata,  # intrinsic metadata
}

svn

all intrinsic metadata

'extra_headers': [
    ['svn_repo_uuid', repo_uuid],
    ['svn_revision', str(rev).encode('utf-8')]
]

tar

These are computed values and extrinsic metadata:

'metadata': {
  'original_artifact': {
    'name': filename,
    'archive_type': archive_type,  # either 'zip' or 'tar'
    **hashes,
  }
}

debian

'metadata': {
  'original_artifact': [{
    'name': name,
    **hashes,
  }],                            # <- built from intrinsic metadata parsed out of *.dsc
  'package_info': package_info,  # <- built from intrinsic metadata parsed out of debian/changelog
}

Event Timeline

ardumont triaged this task as Normal priority. Sep 27 2019, 12:18 PM
ardumont created this task.
ardumont updated the task description. Sep 28 2019, 1:32 PM
moranegg added projects: Restricted Project, Metadata workflow. Sep 30 2019, 11:22 AM
ardumont added a comment (edited). Oct 1 2019, 2:31 PM

For the following, we will focus on package loader metadata.

We will continue using revisions, with the following metadata format:

{
  original_artifact: [{
    filename: <value>,
    checksums: {
      **swh-hashes   # sha1, sha1_git, sha256, blake2
    },
    length: <value>,
  }],
  extrinsic: {
    provider: <value>,
    when: <value>,
    raw: <raw-data-provided>,
  },
  intrinsic: [{
    tool: <value>,  # PKG-INFO, package.json, etc...
    raw: <raw-parsed-intrinsic-metadata>,   # think PKG-INFO, package.json, *.dsc dict output
  }],
}
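
To make the format above concrete, here is a minimal Python sketch of how a package loader might assemble such a dict. It is not the actual loader code: the helper names `compute_checksums` and `build_revision_metadata`, the `blake2s256` key (assuming "blake2" means the 256-bit blake2s variant), the ISO 8601 string used for `when`, and the example URL and values are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone


def compute_checksums(data: bytes) -> dict:
    """Compute the checksums named in the format for the raw artifact bytes."""
    # sha1_git is the git blob hash: sha1 over a "blob <length>\0" header + data
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Assumption: "blake2" refers to blake2s with a 256-bit digest
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
    }


def build_revision_metadata(filename, data, provider, raw_extrinsic,
                            tool, raw_intrinsic, when=None):
    """Assemble the revision metadata dict (hypothetical helper)."""
    if when is None:
        # Assumption: "when" is the retrieval time, as an ISO 8601 string
        when = datetime.now(timezone.utc).isoformat()
    return {
        "original_artifact": [{
            "filename": filename,
            "checksums": compute_checksums(data),
            "length": len(data),
        }],
        "extrinsic": {
            "provider": provider,
            "when": when,
            "raw": raw_extrinsic,
        },
        "intrinsic": [{
            "tool": tool,
            "raw": raw_intrinsic,
        }],
    }


# Usage with made-up values (no real package or endpoint is implied)
metadata = build_revision_metadata(
    filename="example-1.0.tar.gz",
    data=b"hello",
    provider="https://pypi.example.org/pypi/example/json",
    raw_extrinsic={"version": "1.0"},
    tool="PKG-INFO",
    raw_intrinsic={"Name": "example", "Version": "1.0"},
)
```

Keeping `original_artifact` and `intrinsic` as lists mirrors the format above, where a single revision (a debian source package, for instance) may be built from several artifacts or several intrinsic metadata files.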
ardumont renamed this task from Discuss revision metadata normalization to package loader: Discuss revision metadata normalization. Oct 1 2019, 6:45 PM
ardumont closed this task as Resolved.
ardumont claimed this task.