
Specification of extrinsic origin metadata and their storage.
Closed, Public

Authored by vlorentz on Jun 19 2019, 3:21 PM.

Diff Detail

Repository
rDSTO Storage manager
Branch
ext-metadata-spec
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 6592
Build 9195: tox-on-jenkins (Jenkins)
Build 9194: arc lint + arc unit

Event Timeline


Very nice output!

docs/extrinsic-metadata-specification.rst
9

It can also be available as part of a deposit.

25

They can provide metadata about different software artifacts. Here we deal with origins, but the authorities aren't specifically related to origins.
I would change "origin" to "software artifacts".

61

I don't think a loader will be dedicated to fetching metadata, because it is dedicated to fetching code, and if its functionality changes, it should be a different tool.

70

Is there a reason "gatherers" doesn't have the :term: item?

100

I don't like "future uses" and "arbitrary".
I propose deleting "arbitrary" and changing to:

With authority, the metadata field is reserved for information describing and qualifying the authority.
With tools, the metadata field is reserved for configuration metadata and other technical usage.

149

I think wrapping this in [ ] and adding a second origin_metadata entry can clarify.

[{
  'authority': {'name': ..., 'url': ...},
  'tool': {'name': ..., 'version': ...},
  'discovery_date': ...,
  'metadata': b'...'
},
{
  'authority': {'name': ..., 'url': ...},
  'tool': {'name': ..., 'version': ...},
  'discovery_date': ...,
  'metadata': b'...'
}]
151

The term "deposited" is too closely tied to the deposit, while here you seem to be talking about all authorities.

Did you mean, that a list of the latest origin_metadata entries for a given authority is returned instead of all origin_metadata entries?

also, this explanation should come before the example.
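The list shape proposed for line 149 can be sketched as follows. This is only an illustration, not the spec's wording: the helper `make_entry` and every value (names, URLs, version, date, payloads) are made-up placeholders; only the key names come from the comment above.

```python
import datetime

# Hypothetical helper: builds one origin_metadata entry with the keys
# shown in the comment above. All values are placeholders.
def make_entry(authority_url, payload):
    return {
        'authority': {'name': 'example-forge', 'url': authority_url},
        'tool': {'name': 'example-fetcher', 'version': '0.0.1'},
        'discovery_date': datetime.datetime(
            2019, 6, 19, tzinfo=datetime.timezone.utc
        ),
        'metadata': payload,  # raw bytes, like the b'...' placeholder
    }

# The output is a list, so several entries can be returned for one origin.
entries = [
    make_entry('https://forge.example.org/', b'{"license": "MIT"}'),
    make_entry('https://other.example.org/', b'<metadata/>'),
]
```

Writing the format as a list makes it explicit that one origin may carry several entries, possibly from different authorities.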

docs/extrinsic-metadata-specification.rst
25

We don't support any object other than origins yet. If you think we should support other objects, I can amend the last part of the spec accordingly, but I don't think we need it yet.

70

Because there is no "gatherer" entry in the glossary yet.

151

Did you mean, that a list of the latest origin_metadata entries for a given authority is returned instead of all origin_metadata entries?

No, it returns all of them, but paginated

also, this explanation should come before the example.

It's not an example, it's the format of the output
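To make "all of them, but paginated" concrete, here is a minimal sketch of how a caller might walk every page. The thread does not describe the pagination mechanism, so `get_page`, the `results` and `next_page_token` keys, and the page size are all assumptions; an in-memory list stands in for the storage.

```python
from typing import Any, Dict, Iterator, List, Optional

# Hypothetical stand-in for the stored entries of one origin.
ALL_ENTRIES: List[Dict[str, Any]] = [
    {'id': i, 'metadata': b'...'} for i in range(5)
]

def get_page(page_token: Optional[int] = None, limit: int = 2) -> Dict[str, Any]:
    """Return one page of entries plus a token for the next page (assumed API)."""
    start = page_token or 0
    end = start + limit
    next_token = end if end < len(ALL_ENTRIES) else None
    return {'results': ALL_ENTRIES[start:end], 'next_page_token': next_token}

def iter_all_entries(limit: int = 2) -> Iterator[Dict[str, Any]]:
    """Follow page tokens until the last page, yielding every entry."""
    token = None
    while True:
        page = get_page(token, limit)
        yield from page['results']
        token = page['next_page_token']
        if token is None:
            break

all_ids = [entry['id'] for entry in iter_all_entries()]
```

With this shape the caller still sees every entry, but no single response has to hold the full history.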

Another comment: the specs folder might be a better place to hold the file. We might have other specs in swh-docs :-)

docs/extrinsic-metadata-specification.rst
25

OK. Change "a origin" to "an origin".

70

ack.

151

No, it returns all of them, but paginated

Is that really useful? to have it all?

It's not an example, it's the format of the output

This explanation should come before the format output :-)

docs/extrinsic-metadata-specification.rst
151

Is that really useful? to have it all?

Yes, for the same reason we can get the list of snapshots of an origin.

This explanation should come before the format output :-)

* shrug *

zack requested changes to this revision. Jun 23 2019, 3:31 PM

Looks great!

I've only noted down minor things to be changed.

docs/extrinsic-metadata-specification.rst
8–9

I propose "Typical sources for extrinsic metadata are: the hosting place of a repository, which can offer metadata via its web view or API; external registries like collaborative curation initiatives; and out-of-band information provided at source code archival time."

That should address @moranegg's concern and it feels pretty clear to (the biased) me.

11–12

"Since they are not part of the source code, a dedicated mechanism to fetch and store them is needed."

25

s/code hosts/code hosting places/

26

It's not the deposit client that has the metadata, that is just a dumb software component; it's the person doing the deposit who has them.

Hence, I suggest using "deposit submitters" here instead.

35–45
  • the gitlab rows should be about two different instances, e.g. the main one and the inria one
  • i don't understand the swh row
  • we want an example (better: two) of deposit lines here
50

Having an unambiguous name here would be helpful to streamline language. From this text I'm assuming you'd be ok with "metadata fetcher"? (It's OK with me.)

Hence, s/Metadata fetching tools/*Metadata fetchers*/ here.

61

A loader here is consistent with what we discussed f2f though, at least IIRC.

The idea was that you might have a generic "git loader", and sub-class it (or whatever) into a "gitlab loader", a "github loader", etc. While the most generic one will only load source code artifacts, the host-specific instances will also fetch extrinsic metadata.

TL;DR: this seems correct to me.

(and is also consistent with the lister example just below)
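The sub-classing idea described above could look roughly like this. It is a sketch under assumptions: the class names, method names, and return values are invented for illustration and are not taken from any existing loader code.

```python
class GitLoader:
    """Generic loader (hypothetical): only loads source code artifacts."""

    def load(self, origin_url):
        result = {
            'origin': origin_url,
            'artifacts': self.load_code(origin_url),
        }
        metadata = self.fetch_extrinsic_metadata(origin_url)
        if metadata is not None:
            result['extrinsic_metadata'] = metadata
        return result

    def load_code(self, origin_url):
        # Placeholder for the actual repository loading.
        return ['<snapshot placeholder>']

    def fetch_extrinsic_metadata(self, origin_url):
        # The generic loader knows nothing about the hosting place.
        return None


class GitLabLoader(GitLoader):
    """Host-specific loader (hypothetical): also fetches extrinsic metadata."""

    def fetch_extrinsic_metadata(self, origin_url):
        # A real implementation would query the host's API here;
        # this placeholder just returns fixed bytes.
        return b'{"placeholder": true}'


generic = GitLoader().load('https://example.org/repo.git')
specific = GitLabLoader().load('https://gitlab.example.org/repo.git')
```

Only the host-specific subclass overrides the metadata hook, so the generic loader keeps its single responsibility of loading code.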

66–67

Echoing my previous comment, the authority here is the deposit submitter. As we don't have their identity, for the purpose of the authority table we should probably just use "deposit" here.

69–70

gatherer v. fetcher starts becoming clumsy.

How about "metadata crawler" here?

I'm open to other suggestions if that doesn't work…

85

What does the "_by" add here? Wouldn't metadata_authority_get be better/clearer?

92

"_by" → ditto

This revision now requires changes to proceed. Jun 23 2019, 3:31 PM
vlorentz added inline comments.
docs/extrinsic-metadata-specification.rst
35–45
  • i don't understand the swh row

That was a typo

  • we want an example (better: two) of deposit lines here

Do you have example URLs you want to use for the deposit?

69–70

Much better indeed, thanks!

85

Indeed. It made sense when (name, url) was not an intrinsic identifier, but it's no longer true.

vlorentz marked 2 inline comments as done.

apply @moranegg's and @zack's comments.

remove commit that wasn't supposed to be there

zack requested changes to this revision. Jul 1 2019, 5:46 PM
zack added inline comments.
docs/extrinsic-metadata-specification.rst
11

s/provided/available/

(sorry, this issue comes from my suggestion, I know, but it didn't make sense that way :))

14

missing trailing '.'

26

"moral entities" is a false friend from French; just use "entities", I guess?

35–45

Do you have example URLs you want to use for the deposit?

I personally don't. Maybe @moranegg does?

Alternatively, we can just provide a sample deposit URL with '...' where applicable.

115–116

Given they're (correctly) grouped together in the result dictionary, maybe you want to also group them together here: authority_name/authority_url (as a pair), and the same thing for fetcher_name/fetcher_version.

(unless, dunno, this is mapped to an HTTP API somewhere and it's easier to avoid the packing)

Conceptually they are really two things, a fetcher and an authority, so it'd make sense to have 2 args instead of 4.
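The "2 args instead of 4" suggestion could be sketched like this. The function name `add_metadata_sketch` and its other parameters are hypothetical, since the actual signature at lines 115-116 is not shown in this thread; only the name/url and name/version pairings come from the comments above.

```python
from typing import Any, Dict

def add_metadata_sketch(
    origin_url: str,
    authority: Dict[str, str],  # {'name': ..., 'url': ...} as one argument
    fetcher: Dict[str, str],    # {'name': ..., 'version': ...} as one argument
    metadata: bytes,
) -> Dict[str, Any]:
    """Hypothetical: takes the same pairs that appear in the result dictionary."""
    return {
        'origin_url': origin_url,
        'authority': authority,
        'tool': fetcher,
        'metadata': metadata,
    }

stored = add_metadata_sketch(
    'https://example.org/repo.git',
    authority={'name': 'example-forge', 'url': 'https://forge.example.org/'},
    fetcher={'name': 'example-fetcher', 'version': '0.0.1'},
    metadata=b'...',
)
```

Passing each pair as one value keeps the input symmetrical with the output, where authority and tool are already nested dictionaries.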

This revision now requires changes to proceed. Jul 1 2019, 5:46 PM
vlorentz marked 5 inline comments as done.

Apply @zack's comments.

docs/extrinsic-metadata-specification.rst
115–116

Good point. There is no HTTP API; and I'm using byte strings so it's not possible in JSON anyway.
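The byte-string remark is easy to check with Python's standard `json` module, which refuses `bytes` values. The entry contents below are placeholders, and the base64 step is just one common workaround shown for illustration, not something this revision proposes.

```python
import base64
import json

# Placeholder entry whose metadata is raw bytes (not valid UTF-8 here).
entry = {'metadata': b'\xff\xfe raw bytes'}

try:
    json.dumps(entry)
    serializable = True
except TypeError:
    # json.dumps raises TypeError for bytes values.
    serializable = False

# Illustrative workaround only: encode the bytes as base64 text first.
encoded = json.dumps(
    {'metadata': base64.b64encode(entry['metadata']).decode('ascii')}
)
decoded = base64.b64decode(json.loads(encoded)['metadata'])
```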

LGTM. (Still waiting on @moranegg for the deposit URL example.)

docs/extrinsic-metadata-specification.rst
35

Isn't "provider" called "authority" now?

35–45

https://hal.inria.fr/ and https://hal.archives-ouvertes.fr/
I'm not sure if this is what you have in mind?

docs/extrinsic-metadata-specification.rst
35–45

What is the difference between (name=hal, url= https://hal.archives-ouvertes.fr/) and (name=deposit, url= https://hal.archives-ouvertes.fr/)?

  • s/name/type/
  • add a lot more examples
docs/extrinsic-metadata-specification.rst
35–45

Here is the paste for a full table example:
https://forge.softwareheritage.org/P457

moranegg added inline comments.
docs/extrinsic-metadata-specification.rst
35–45

The following idea surfaced during discussion:
keeping only the URL and metametadata to represent an authority, and attaching a format field to each metadata blob

This revision is now accepted and ready to land. Jul 3 2019, 4:46 PM

Add a 'format' key to each metadata entry.
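A rough sketch of what a per-entry 'format' key enables: a consumer can pick a decoder from the entry itself instead of guessing from the authority. The format name `example-json`, the entry values, and `decode_metadata` are assumptions made for illustration; the thread does not list the actual format values.

```python
import json

def decode_metadata(entry):
    """Decode the metadata blob according to its 'format' key (hypothetical)."""
    fmt = entry['format']
    if fmt == 'example-json':
        return json.loads(entry['metadata'].decode('utf-8'))
    # Unknown format: hand back the raw bytes untouched.
    return entry['metadata']

json_entry = {
    'authority': {'url': 'https://forge.example.org/'},
    'format': 'example-json',
    'metadata': b'{"license": "MIT"}',
}
opaque_entry = {
    'authority': {'url': 'https://forge.example.org/'},
    'format': 'example-unknown',
    'metadata': b'\x00\x01',
}

parsed = decode_metadata(json_entry)
raw = decode_metadata(opaque_entry)
```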

This revision now requires review to proceed. Jul 3 2019, 6:00 PM
This revision is now accepted and ready to land. Jul 4 2019, 10:21 AM
This revision was automatically updated to reflect the committed changes.