
Review metadata deposit specs for metadata-only deposit
Closed, Migrated

Event Timeline

moranegg triaged this task as Normal priority. (Mar 12 2020, 3:57 PM)
moranegg created this task.
vlorentz changed the task status from Open to Work in Progress. (Aug 28 2020, 1:20 PM)
vlorentz moved this task from Backlog to Work in progress on the Roadmap 2020 board.

@vlorentz :

serialization format: @type is missing

zack renamed this task from "Review metadata deposit specs of a metadata only deposit" to "Review metadata deposit specs for metadata-only deposit". (Sep 1 2020, 6:51 PM)

After this morning's meeting with @vlorentz and @ardumont:
We will keep the metadata-only deposit specs, using a separate swh namespace for which we still need to write the schema (I am not sure we have one yet).

This way, the XML carrying the metadata has a section that names the artifact being described:

Reference a snapshot, revision or release:

With ${type} in {snp (snapshot), rev (revision), rel (release) }:
<swh:deposit>
  <swh:reference>
    <swh:object id="swh:1:${type}:aaaaaaaaaaaaaa..."/>
  </swh:reference>
</swh:deposit>

We need to add directory (dir) and content (cnt) to the list of types.
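A minimal sketch of how the extended list of types could be checked on the receiving side. It assumes that the referenced identifier is a core SWHID (no qualifiers) and that the two added types use the tags dir and cnt. The names SWHID_TYPES, CORE_SWHID_RE and is_core_swhid are hypothetical and not part of any existing codebase.

```python
import re

# Type tags: the three from the proposal above plus the two requested
# additions (dir for directory, cnt for content).
SWHID_TYPES = ("snp", "rev", "rel", "dir", "cnt")

# Core SWHID: scheme, version 1, type tag, then 40 lowercase hex digits.
CORE_SWHID_RE = re.compile(r"swh:1:(%s):[0-9a-f]{40}" % "|".join(SWHID_TYPES))


def is_core_swhid(value: str) -> bool:
    """Return True if value is a well-formed core SWHID (no qualifiers)."""
    return CORE_SWHID_RE.fullmatch(value) is not None
```

With this check, an identifier carrying qualifiers such as lines=9-15 is rejected, so accepting context would need a separate decision.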

Depositing metadata on an origin should be implemented as well, although it is not suited to institutional repositories (e.g. HAL).
Reference an origin:

<swh:deposit>
  <swh:reference>
    <swh:origin url="https://github.com/user/repo"/>
  </swh:reference>
</swh:deposit>
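A rough sketch of how a server could read either kind of reference from such a document, using only the Python standard library. The namespace URI is a placeholder assumption (the schema for the swh namespace is still to be written), and read_reference is a hypothetical helper name. The object attribute is named id here, as in the snippets above.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URI; the real one depends on the
# schema that still has to be written for the swh namespace.
SWH_NS = "https://example.org/swh-deposit"


def read_reference(xml_text: str) -> tuple[str, str]:
    """Return ("object", swhid) or ("origin", url) from a deposit document."""
    root = ET.fromstring(xml_text)
    # ".//" matches whether swh:deposit is the root or nested in an entry.
    reference = root.find(f".//{{{SWH_NS}}}reference")
    if reference is None:
        raise ValueError("missing <swh:reference>")
    obj = reference.find(f"{{{SWH_NS}}}object")
    if obj is not None:
        return ("object", obj.attrib["id"])
    origin = reference.find(f"{{{SWH_NS}}}origin")
    if origin is not None:
        return ("origin", origin.attrib["url"])
    raise ValueError("reference names neither an object nor an origin")
```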

This spec fits the POST of a new deposit in SWORD, as described in the SWORD v2 documentation (section 6.3.3, Creating a Resource with an Atom Entry).

@vlorentz can you please review the naming, and whether each tag should carry its value in an attribute (e.g. id, url) or not?

I think we would want to "mention" SWHIDs there, by replacing <swh:object id=" with either <swh:swhid id=" or <swh:object swhid=" (weak preference for the latter)

Additionally, should the SWHID be a core SWHID, or do we allow context? In the latter case, what do we do if there's a line context?

I don't recall what the conclusion was about the proposal of <swh:swhid>$actual_swid</swh:swhid>, which I found simpler and clearer.
(I have no idea whether that proposal is relevant.)

A question that could also help answer this: do we intend to add other attributes to <swh:object>?

> I don't recall what the conclusion was about the proposal of <swh:swhid>$actual_swid</swh:swhid> which i found simpler and clearer.

We didn't conclude anything, I said I'd think about it ;)

Since it's a simple text value, it should be an attribute, IMO. No point in allowing content in that tag

I see three or four options:

Option A1: SWHID value in the id attribute

<swhid id='swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  lines=9-15'/>

Option A2: SWHID value in the swhid attribute

<object swhid='swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  lines=9-15'/>

Option B: SWHID value as element text

<swhid>
  swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  lines=9-15
</swhid>

Option C: SWHID value split into child elements

<swhid>
   <core>swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b</core>
   <origin_ctxt>https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git</origin_ctxt>
   <visit_ctxt>swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9</visit_ctxt>
   <anchor_ctxt>swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0</anchor_ctxt>
   <path_ctxt>/Examples/SimpleFarm/simplefarm.ml</path_ctxt>
   <fragment_qualifier>9-15</fragment_qualifier>
</swhid>

I don't have a preference, but I do think we don't want clients to have to dismember the SWHID as in Option C.
So if we accept that the burden of understanding the context is on our side, we should go with A or B.
@vlorentz is right that the element holds only text and is not a complex element (one that contains other elements).
@ardumont is right that using only an element looks clearer, but we should do that only if there is a reason to include more elements in the identified object.

So the questions are:

  1. do we think we will need this outside of the scenario we saw yesterday (metadata-only deposit)?
  2. do we think that, in the long term, option C might have a "raison d'être"?
  3. which schema would make the evolution to JSON-LD easier?
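To illustrate what keeping the burden of understanding the context on our side could look like with options A or B, here is a minimal sketch that splits a qualified SWHID string into the parts Option C would spell out. The function name split_swhid is hypothetical. The sketch assumes that a literal ';' never appears inside a qualifier value (it would have to be percent-encoded), and it tolerates the whitespace left by the multi-line layouts shown above.

```python
def split_swhid(raw: str) -> dict[str, str]:
    """Split a qualified SWHID into its core identifier and qualifiers.

    The result maps "core" to the core SWHID and each qualifier name
    (origin, visit, anchor, path, lines) to its value.
    """
    # Drop the whitespace and empty chunks that multi-line layouts produce.
    parts = [part.strip() for part in raw.split(";") if part.strip()]
    if not parts:
        raise ValueError("empty SWHID")
    result = {"core": parts[0]}
    for part in parts[1:]:
        # Split on the first '=' only, so URLs containing '=' survive.
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed qualifier: {part!r}")
        result[key] = value
    return result
```

Clients would then send the SWHID as a single string, and only the server would need to look inside it.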

We can use option A1, which allows extending to option C in the future if the need arises (but I doubt it will).

Actually, I prefer A2, to make the distinction between origins (identified by a URL, <swh:origin url=...>) and objects (identified by a SWHID, <swh:object swhid=...>).


yes, described this way, A2 is more appealing ;)
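As an illustration of the A2 distinction, a client could build the reference element along these lines. This is only a sketch: the namespace URI is a placeholder assumption, build_reference is a hypothetical name, and telling a SWHID from a URL by the swh:1: prefix is a simplification.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URI, to be replaced once the schema
# for the swh namespace exists.
SWH_NS = "https://example.org/swh-deposit"
ET.register_namespace("swh", SWH_NS)


def build_reference(target: str) -> ET.Element:
    """Build a <swh:deposit> that references an object or an origin."""
    deposit = ET.Element(f"{{{SWH_NS}}}deposit")
    reference = ET.SubElement(deposit, f"{{{SWH_NS}}}reference")
    if target.startswith("swh:1:"):
        # Objects are identified by a SWHID (option A2).
        ET.SubElement(reference, f"{{{SWH_NS}}}object", {"swhid": target})
    else:
        # Anything else is treated as an origin URL.
        ET.SubElement(reference, f"{{{SWH_NS}}}origin", {"url": target})
    return deposit
```

Calling ET.tostring(build_reference(target), encoding="unicode") then serializes the element with the swh prefix.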