
Specification of extrinsic origin metadata and their storage.
ClosedPublic

Authored by vlorentz on Jun 19 2019, 3:21 PM.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz updated this revision to Diff 5362.Jun 19 2019, 3:26 PM

retitle git commit

Very nice output !

docs/extrinsic-metadata-specification.rst
10

It can be available also as part of a deposit.

26

They can provide metadata about different software artifacts; here we deal with origins, but the authorities aren't specifically related to origins.
I would change origin to software artifacts.

62

I don't think a loader will be dedicated to fetching metadata, because it is dedicated to fetching code, and if its functionality changes, it should be a different tool.

71

Is there a reason gatherers don't have the :term: item?

101

I don't like "future uses" and "arbitrary".
I propose deleting "arbitrary" and changing to:

With authority, the metadata field is reserved for information describing and qualifying the authority.
With tools, the metadata field is reserved for configuration metadata and other technical usage.

150

I think wrapping the output in [ ] and adding a second origin_metadata entry would clarify this.

[{
  'authority': {'name': ..., 'url': ...},
  'tool': {'name': ..., 'version': ...},
  'discovery_date': ...,
  'metadata': b'...'
},
{
  'authority': {'name': ..., 'url': ...},
  'tool': {'name': ..., 'version': ...},
  'discovery_date': ...,
  'metadata': b'...'
}]
152

The term "deposited" is too closely tied to the deposit, and here it seems that you are talking about all authorities.

Did you mean, that a list of the latest origin_metadata entries for a given authority is returned instead of all origin_metadata entries?

also, this explanation should come before the example.

vlorentz added inline comments.Jun 20 2019, 5:06 PM
docs/extrinsic-metadata-specification.rst
26

We don't support any object other than origins yet. If you think we should support other objects, I can amend the last part of spec accordingly, but I don't think we need it yet.

71

Because there is no "gatherer" entry in the glossary yet.

152

Did you mean, that a list of the latest origin_metadata entries for a given authority is returned instead of all origin_metadata entries?

No, it returns all of them, but paginated
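
A minimal sketch of what "all of them, but paginated" could look like, assuming an offset-based page token. The function names, the `limit` parameter, and the `next_page_token` key are hypothetical and not taken from the spec:

```python
def origin_metadata_get_page(entries, page_token=None, limit=2):
    """Return one page of entries plus a token for the next page (or None).

    Hypothetical helper: `entries` stands in for whatever the storage
    backend holds for a given origin and authority.
    """
    start = 0 if page_token is None else int(page_token)
    page = entries[start:start + limit]
    next_start = start + limit
    # No token once the last entry has been returned.
    next_token = str(next_start) if next_start < len(entries) else None
    return {"results": page, "next_page_token": next_token}


def iter_all(entries, limit=2):
    """Walk every page, so the caller still ends up with all entries."""
    token = None
    while True:
        page = origin_metadata_get_page(entries, token, limit)
        yield from page["results"]
        token = page["next_page_token"]
        if token is None:
            break
```

A caller that wants everything just iterates over `iter_all(...)`; a caller that only wants the first few entries stops after one page.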

also, this explanation should come before the example.

It's not an example, it's the format of the output

Another comment: a specs folder might be a better place to hold the file. We might have other specs in swh-docs :-)

docs/extrinsic-metadata-specification.rst
26

ok. change a origin to an origin.

71

ack.

152

No, it returns all of them, but paginated

Is that really useful? to have it all?

It's not an example, it's the format of the output

This explanation should come before the format output :-)

vlorentz added inline comments.Jun 20 2019, 5:30 PM
docs/extrinsic-metadata-specification.rst
152

Is that really useful? to have it all?

Yes, for the same reason we can get the list of snapshots of an origin.

This explanation should come before the format output :-)

* shrug *

zack requested changes to this revision.Jun 23 2019, 3:31 PM

looks great !

I've only noted down minor things to be changed.

docs/extrinsic-metadata-specification.rst
9–10

I propose "Typical sources for extrinsic metadata are: the hosting place of a repository, which can offer metadata via its web view or API; external registries like collaborative curation initiatives; and out-of-band information provided at source code archival time."

That should address @moranegg's concern and it feels pretty clear to (the biased) me.

12–13

"Since they are not part of the source code, a dedicated mechanism to fetch and store them is needed."

26

s/code hosts/code hosting places/

27

It's not the deposit client that has the metadata, that is just a dumb software component; it's the person doing the deposit who has them.

Hence, I suggest using "deposit submitters" here instead.

36–46
  • the gitlab rows should be about two different instances, e.g. the main one and the inria one
  • i don't understand the swh row
  • we want an example (better: two) of deposit lines here
51

Having an unambiguous name here would be helpful to streamline language. From this text I'm assuming you'd be ok with "metadata fetcher"? (It's OK with me.)

Hence, s/Metadata fetching tools/*Metadata fetchers*/ here.

62

A loader here is consistent with what we discussed f2f though, at least IIRC.

The idea was that you might have a generic "git loader", and sub-class it (or whatever) into a "gitlab loader", a "github loader", etc. While the most generic one will only load source code artifacts, the host-specific instances will also fetch extrinsic metadata.

TL;DR: this seems correct to me.

(and is also consistent with the lister example just below)
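
A minimal sketch of the sub-classing idea described above, assuming hypothetical class and method names (`GitLoader`, `GitLabLoader`, `fetch_extrinsic_metadata`) and a placeholder URL; none of these come from the actual loaders:

```python
import datetime


class GitLoader:
    """Hypothetical generic loader: only loads source code artifacts."""

    def __init__(self, origin_url):
        self.origin_url = origin_url

    def load_code(self):
        # Placeholder for fetching the repository contents.
        return {"origin": self.origin_url}

    def fetch_extrinsic_metadata(self):
        # The generic loader knows nothing about the hosting place.
        return []

    def load(self):
        return {
            "code": self.load_code(),
            "origin_metadata": self.fetch_extrinsic_metadata(),
        }


class GitLabLoader(GitLoader):
    """Hypothetical host-specific loader: also fetches extrinsic metadata."""

    def fetch_extrinsic_metadata(self):
        # A real implementation would query the hosting place's API here.
        return [{
            "authority": {"name": "gitlab", "url": "https://gitlab.example.org/"},
            "tool": {"name": "gitlab-loader", "version": "0.0.1"},
            "discovery_date": datetime.datetime.now(tz=datetime.timezone.utc),
            "metadata": b"...",
        }]
```

The generic loader keeps a single responsibility, and only the host-specific subclasses override the metadata hook.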

67–68

Echoing my previous comment, the authority here is the deposit submitter. As we don't have their identity, for the purpose of the authority table we should probably just use "deposit" here.

70–71

gatherer v. fetcher starts becoming clumsy.

How about "metadata crawler" here?

I'm open to other suggestions if that doesn't work…

86

What does the "_by" add here? Wouldn't metadata_authority_get be better/clearer?

93

"_by" → ditto

This revision now requires changes to proceed.Jun 23 2019, 3:31 PM
vlorentz marked 16 inline comments as done.Jul 1 2019, 12:16 PM
vlorentz added inline comments.
docs/extrinsic-metadata-specification.rst
36–46
  • i don't understand the swh row

That was a typo

  • we want an example (better: two) of deposit lines here

Do you have example URLs for the deposit you want to use for the deposit?

70–71

Much better indeed, thanks!

86

Indeed. It made sense when (name, url) was not an intrinsic identifier, but it's no longer true.

vlorentz updated this revision to Diff 5583.Jul 1 2019, 12:17 PM
vlorentz marked 2 inline comments as done.

apply @moranegg's and @zack's comments.

vlorentz updated this revision to Diff 5586.Jul 1 2019, 12:18 PM

remove commit that wasn't supposed to be there

vlorentz marked an inline comment as done.Jul 1 2019, 12:18 PM
zack requested changes to this revision.Jul 1 2019, 5:46 PM
zack added inline comments.
docs/extrinsic-metadata-specification.rst
12

s/provided/available/

(sorry, this issue comes from my suggestion, I know, but it didn't make sense that way :))

15

missing trailing '.'

27

"moral entities" is a false friend from French; just use "entities", I guess?

36–46

Do you have example URLs for the deposit you want to use for the deposit?

I personally don't. Maybe @moranegg does?

Alternatively, we can just provide a sample deposit URL with '...' where applicable.

116–117

Given they're (correctly) grouped together in the result dictionary, maybe you want to also group them together here: authority_name/authority_url (as a pair), and same thing for fetcher_name/fetcher_version

(unless, dunno, this is mapped to an HTTP API somewhere and it's easier to avoid the packing)

Conceptually they are really two things, a fetcher and an authority, so it'd make sense to have 2 args instead of 4
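
A minimal sketch of the "2 args instead of 4" suggestion, assuming hypothetical function names and an in-memory list standing in for the storage backend; the real endpoint names and argument order in the spec may differ:

```python
# Hypothetical in-memory store, standing in for the real storage backend.
_ENTRIES = []


def origin_metadata_add(origin_url, discovery_date, authority, fetcher, metadata):
    """Take authority and fetcher as two grouped arguments.

    authority: {'name': ..., 'url': ...}
    fetcher:   {'name': ..., 'version': ...}
    metadata:  byte string
    """
    _ENTRIES.append({
        "origin_url": origin_url,
        "authority": dict(authority),
        "fetcher": dict(fetcher),
        "discovery_date": discovery_date,
        "metadata": metadata,
    })


def origin_metadata_get(origin_url, authority):
    """Return every stored entry for this origin and authority pair."""
    return [
        entry for entry in _ENTRIES
        if entry["origin_url"] == origin_url and entry["authority"] == authority
    ]
```

With this shape, the arguments mirror the `authority` and `fetcher` sub-dictionaries of the result, so callers can pass an entry's fields back unchanged.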

This revision now requires changes to proceed.Jul 1 2019, 5:46 PM
vlorentz updated this revision to Diff 5606.Jul 1 2019, 5:57 PM
vlorentz marked 5 inline comments as done.

Apply @zack's comments.

docs/extrinsic-metadata-specification.rst
116–117

Good point. There is no HTTP API; and I'm using byte strings so it's not possible in JSON anyway.

zack accepted this revision.Jul 1 2019, 6:03 PM

LGTM. (Still waiting on @moranegg for the deposit URL example.)

moranegg added inline comments.Jul 2 2019, 1:28 PM
docs/extrinsic-metadata-specification.rst
36

Isn't "provider" called "authority" now?

36–46

https://hal.inria.fr/ and https://hal.archives-ouvertes.fr/
I'm not sure if this is what you have in mind?

vlorentz added inline comments.Jul 2 2019, 1:32 PM
docs/extrinsic-metadata-specification.rst
36–46

What is the difference between (name=hal, url= https://hal.archives-ouvertes.fr/) and (name=deposit, url= https://hal.archives-ouvertes.fr/)?

vlorentz updated this revision to Diff 5618.Jul 2 2019, 2:51 PM
  • s/name/type/
  • add a lot more examples
vlorentz updated this revision to Diff 5619.Jul 2 2019, 3:02 PM
  • more registry examples
moranegg added inline comments.Jul 3 2019, 2:44 PM
docs/extrinsic-metadata-specification.rst
36–46

Here is the paste for a full table example:
https://forge.softwareheritage.org/P457

moranegg accepted this revision.Jul 3 2019, 4:46 PM
moranegg added inline comments.
docs/extrinsic-metadata-specification.rst
36–46

The following idea surfaced during discussion:
keeping only the URL and metametadata to represent an authority, and attach a format field to each metadata blob

This revision is now accepted and ready to land.Jul 3 2019, 4:46 PM
vlorentz updated this revision to Diff 5652.Jul 3 2019, 4:53 PM

Add a 'format' key to each metadata entry.
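
A minimal sketch of how a consumer might use a per-entry 'format' key to decide how to decode the metadata byte string. The format name `"json"` and the `decode_metadata` helper are assumptions for illustration, not values defined by the spec:

```python
import json


def decode_metadata(entry):
    """Decode an entry's raw metadata bytes according to its 'format' key."""
    fmt = entry["format"]
    if fmt == "json":  # hypothetical format name
        return json.loads(entry["metadata"].decode("utf-8"))
    # Unknown formats are reported rather than guessed at.
    raise ValueError(f"unsupported metadata format: {fmt!r}")
```

Keeping the blob opaque and tagging it with a format lets the storage layer stay agnostic about what each authority sends.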

moranegg accepted this revision.Jul 3 2019, 5:00 PM
zack added 1 blocking reviewer(s): zack.Jul 3 2019, 6:00 PM
This revision now requires review to proceed.Jul 3 2019, 6:00 PM
zack accepted this revision.Jul 4 2019, 10:21 AM
This revision is now accepted and ready to land.Jul 4 2019, 10:21 AM
This revision was automatically updated to reflect the committed changes.