
deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information
Closed, Migrated; Edits Locked

Description

To cover several use cases, the deposit of a tarball/zipfile must return as main swh-id the id of the root directory, and as ancillary information the id of the synthetic commit created at the moment of the deposit.

Event Timeline

moranegg triaged this task as Normal priority. Jul 18 2018, 5:57 PM
moranegg created this task.

Can you (and/or @rdicosmo ) elaborate on the rationale for this?

It makes the deposit inconsistent with, e.g., the tarball loader.

Also, if HAL only points back to a directory instead of a revision, there will be no way of showing the metadata on archive.s.o, which I thought was a big part of what HAL wants to see.

It is essential for reproducibility that the swh-id offered to researchers
to reference a deposited piece of software depend only on the software
deposited itself: if three papers use the same software tree, they must
show the same swh-id, no matter whether this software tree has been
deposited once, twice, or three times.

In the case of .zip/.tar files this is the swh-id of the root directory,
not the swh-id of the synthetic commit.

Of course, we keep the synthetic commit with the metadata in our own SWH
archive, exactly as before, and nothing changes in the deposit process; the
only change is the swh-id researchers will be offered as reference, and
shown on the HAL page.

Let's keep in mind that HAL points back to SWH with the proper "origin"
additional information in the identifiers, that is enough for properly
displaying the content in context.

It is essential for reproducibility that the swh-id offered to researchers
to reference a deposited piece of software depend only on the software
deposited itself: if three papers use the same software tree, they must
show the same swh-id, no matter whether this software tree has been
deposited once, twice, or three times.

In the case of .zip/.tar files this is the swh-id of the root directory,
not the swh-id of the synthetic commit.

I don't understand how reproducibility is impacted. What matters for that use case is that you receive a source tree byte-identical to the one you initially submitted, and that is guaranteed by returning commit IDs, even if they differ for the same directory deposit. What you actually lose is the ability to compare at face value, e.g., across different papers, that the deposit is the same. (But note that you can do that via the API: it's enough to ask for the directory ID behind two different commit IDs.) So I'm still not clear on which problem you're actually trying to solve here. But read below about why this actually worries me.
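As a rough illustration of the API lookup mentioned above, the sketch below resolves a revision identifier to its directory and compares two revisions. It assumes the public Web API endpoint `/api/1/revision/<sha1_git>/` on archive.softwareheritage.org, whose JSON answer carries a `directory` field; the helper names (`revision_url`, `fetch_revision`, `directory_swhid`, `same_source_tree`) are made up for this example.

```python
import json
import urllib.request

# Assumption: root of the public Software Heritage Web API.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def revision_hash(swhid: str) -> str:
    """Return the hex hash of a 'swh:1:rev:<hash>' identifier, ignoring any context."""
    core = swhid.split(";", 1)[0]  # drop ';origin=...' style qualifiers
    scheme, version, kind, object_hash = core.split(":")
    if (scheme, version, kind) != ("swh", "1", "rev"):
        raise ValueError(f"not a revision identifier: {swhid}")
    return object_hash


def revision_url(swhid: str) -> str:
    """Build the API URL describing the given revision."""
    return f"{API_ROOT}/revision/{revision_hash(swhid)}/"


def fetch_revision(swhid: str) -> dict:
    """Fetch the revision description from the archive (network access required)."""
    with urllib.request.urlopen(revision_url(swhid)) as response:
        return json.load(response)


def directory_swhid(revision: dict) -> str:
    """Identifier of the root directory a revision description points to."""
    return "swh:1:dir:" + revision["directory"]


def same_source_tree(rev_a: dict, rev_b: dict) -> bool:
    """True when two revision descriptions point to the same root directory."""
    return rev_a["directory"] == rev_b["directory"]
```

With this, checking whether two papers cite the same deposit amounts to `same_source_tree(fetch_revision(id_a), fetch_revision(id_b))`, at the price of a round trip to the archive.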

Of course, we keep the synthetic commit with the metadata in our own SWH
archive, exactly as before, and nothing changes in the deposit process; the
only change is the swh-id researchers will be offered as reference, and
shown on the HAL page.

Let's keep in mind that HAL points back to SWH with the proper "origin"
additional information in the identifiers, that is enough for properly
displaying the content in context.

Deposit is a generic process/protocol, of which HAL is just one user. We are gonna have more in the future, which might have different needs.

The conceptual and practical problem I see in returning only the directory ID is the following one.

When you deposit something, you're actually depositing two different pieces of information:

  1. a source code tree (in a tarball), and
  2. associated metadata.

The directory object we create in our DAG corresponds to (1), the revision object we create corresponds to (2) and (1). I can see how there are use cases that don't care about (2), but I can see also use cases that do care about it. And if we believe that persistent identifiers are important (and I think we do), we should enable deposit users to access identifiers for both (1) and (2).

By returning only a directory ID, we make the process of retrieving the ID that also covers (2) needlessly fragile, because provenance information is not, strictly speaking, part of the DAG; it is accessory information. By returning a revision ID, in contrast, we give users both an ID that covers (2) and, trivially via the API, an ID that only covers (1).

So, if HAL really wants to point to a directory (I think it should be their call, not ours), I think we should do one of the following:

  • either return a revision ID and tell them to use the API to retrieve the associated directory ID
  • or return both a revision ID and a directory ID (but I'm not entirely sure SWORD supports that), and tell them to use the directory ID

TL;DR: by ingesting a revision and not returning its ID, we will have a protocol that — at the protocol level — loses information, and that is a bad idea.

So, if HAL really wants to point to a directory (I think it should be
their call, not ours)

For reproducibility, it's my call, so let me try to explain, for the record, in this thread :-)

When a .tar/.zip is deposited in the SWH Archive, two things happen:

  1. the .tar/.zip is unpacked, and the content ingested, creating (or finding an existing) directory object which has an associated swh-id D; this swh-id D is intrinsic to the content ingested, and can be recomputed independently from us by any other person having a copy of the same .tar/.zip or directory: this is the important part for reproducibility
  2. we create a synthetic commit to store in the Merkle tree a node containing the metadata associated to the deposit, leading to a revision object that was not present in the original source, and that has an associated swh-id SR; this synthetic commit contains valuable traceability information, that we surely want to keep, but SR is not the intrinsic persistent identifier we want to expose, because:
  • nobody (not even HAL, or any other depositor, including the ones concerned by the compliance use cases) can recompute this swh-id SR independently from us, because it depends not only on the metadata added, but also on the particular mangling of this metadata done during the ingestion, which may well change over time; providing only SR as a swh-id for such a deposit makes it impossible for somebody who has a copy of the same code and an article mentioning the swh-id SR to check that the code is the same without accessing SWH: that would make us a middle man, and for our long-term strategy we do not want middle men, not even us
  • recent experiences with debugging the HAL deposit have made clear that we will be confronted with the same source code being deposited several times, not just because several authors will deposit the same code, but also because the same author will deposit the same code again and again just to fix this or that bit of metadata, generating SR1, SR2, SR3... for the same D

Hence, we want the swh-id D of the actual source code deposited, not the swh-id SR of the synthetic commit, to be the persistent identifier we return as the canonical way of designating the source code deposited.
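To make the "can be recomputed independently from us" point concrete, here is a minimal sketch of how anybody holding the unpacked tree could compute D locally. It assumes the directory identifier is computed the same way as a Git tree hash (SHA-1 over a `tree` object whose entries are sorted Git-style); the function names are illustrative, and corner cases such as special files or unusual permission bits are handled only crudely.

```python
import hashlib
import os
import stat


def git_object_id(kind: str, payload: bytes) -> bytes:
    """SHA-1 of a Git-style object: '<kind> <length>\\0' header followed by the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def directory_id(path: str) -> bytes:
    """Recursively compute the Git-style tree hash of a local directory."""
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        st = os.lstat(full)
        if stat.S_ISLNK(st.st_mode):
            # A symlink is stored as a blob holding the link target.
            mode = b"120000"
            oid = git_object_id("blob", os.fsencode(os.readlink(full)))
        elif stat.S_ISDIR(st.st_mode):
            mode = b"40000"
            oid = directory_id(full)
        elif stat.S_ISREG(st.st_mode):
            with open(full, "rb") as handle:
                data = handle.read()
            mode = b"100755" if st.st_mode & stat.S_IXUSR else b"100644"
            oid = git_object_id("blob", data)
        else:
            continue  # sockets, devices, etc. are skipped in this sketch
        raw_name = os.fsencode(name)
        # Git sorts directories as if their name ended with '/'.
        sort_key = raw_name + (b"/" if mode == b"40000" else b"")
        entries.append((sort_key, mode, raw_name, oid))
    entries.sort(key=lambda entry: entry[0])
    payload = b"".join(
        mode + b" " + raw_name + b"\0" + oid for _, mode, raw_name, oid in entries
    )
    return git_object_id("tree", payload)


def swhid_for_directory(path: str) -> str:
    """Format the tree hash as a 'swh:1:dir:' identifier."""
    return "swh:1:dir:" + directory_id(path).hex()
```

Two people unpacking the same .tar/.zip (with the same file modes) obtain the same string from `swhid_for_directory`, without contacting the archive; nothing comparable is possible for SR as long as the metadata mangling is undocumented.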

  • or return *both* a revision ID and a directory ID (but I'm not entirely sure SWORD supports that), and tell them to use the directory ID

That's a perfectly fine solution and is related to T1098:

  • the identifier D of the directory extracted from the .tar/.zip is the value of the swh-id property
  • the identifier SR of the synthetic commit can be returned as the value of an anchor attribute, along the lines of what we are discussing in T1098

Just adding another piece to the rationale puzzle: the software citation.

I completely agree that the rev-id is fabricated and can't be reproduced.
The rev-id is affected by adding or changing a metadata property, even by changing only the property's name:

  • for example, using the property contributor instead of committer in the metadata entry will completely change the rev-id

Therefore when using the rev-id in a citation, to retrieve or check the content of the software you must pass through SWH.
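The effect can be illustrated with a toy model. The function below is NOT the actual computation of the synthetic revision identifier (that mangling is not described in this thread); it only assumes that the identifier is some hash covering both the directory id and the metadata, which is enough to show why renaming a property yields a different rev-id for an unchanged directory.

```python
import hashlib
import json


def toy_synthetic_id(directory_id: str, metadata: dict) -> str:
    """Toy stand-in for a revision id: a hash over the tree id plus the metadata.

    Hypothetical helper for illustration only, not the real SWH algorithm.
    """
    payload = json.dumps(
        {"directory": directory_id, "metadata": metadata}, sort_keys=True
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Same source tree D, same person, only the property name differs.
D = "42a13fc721c8716ff695d0d62fc851d641f3a12b"
sr1 = toy_synthetic_id(D, {"contributor": "Jane Doe"})
sr2 = toy_synthetic_id(D, {"committer": "Jane Doe"})
print(sr1 != sr2)  # the two toy rev-ids differ although D is identical
```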

@zack: When you deposit something, you're actually depositing two different pieces of information:

  1. a source code tree (in a tarball), and
  2. associated metadata.

For a citation (not only a software citation), an important question is what target the identifier points to:

  1. the object (article/dataset/software)
  2. a landing page with the metadata where one of the properties is a link to the object

Here is the citation format we created with HAL and that is proposed on HAL:

Sascha Hunold, Raphaël Bleuse, Grégory Mounié. moldableILP. 2017, swh:1:rev:a27a59f6b14c9fb13a6f998d8316628dafc1f60c. hal-01727745

With the rev-id, the citation uses two metadata landing page identifiers!


@zack: Also, if HAL only points back to a directory instead of a revision, there will be no way of showing the metadata on archive.s.o, which I thought was a big part of what HAL wants to see.

It's true that in the specs and during the implementation process, the metadata were a big part of what was expected to be found on SWH.
However, faced with reality, the place we (SWH) have on HAL's metadata page is an easy access point to the browsable content, and neither users nor moderators check their metadata on SWH,
because they already have it on HAL, and I'm not even sure they know where to find it.

With that, I do think it is important to have the metadata accessible, and keep in mind that with the contextual URL which is used by HAL, the metadata is easily found!

@zack: So, if HAL really wants to point to a directory (I think it should be their call, not ours), I think we should do one of the following:

  • either return a revision ID and tell them to use the API to retrieve the associated directory ID
  • or return both a revision ID and a directory ID (but I'm not entirely sure SWORD supports that), and tell them to use the directory ID

HAL doesn't want something specific.

We should review what kind of entry point to the archive we would like to give the client with the swh-id and if it is compatible with our identifier agenda...
[uniqueness, non ambiguity, persistence, integrity, no middle man, abstraction (opacity), gratis (free of charge)]

Another option is to send back the contextual dir-id (today only the rev-id without context is sent, and HAL adds the context on the link to SWH):

swh:1:dir:42a13fc721c8716ff695d0d62fc851d641f3a12b;origin=https://hal.archives-ouvertes.fr/hal-01727745
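For a client receiving such a contextual identifier, separating the intrinsic part from the context is a small parsing step. The sketch below assumes the `core;key=value` layout shown in the example above; `ContextualId` and `split_context` are hypothetical names, and an origin URL that itself contains `;` would need extra care.

```python
from typing import Dict, NamedTuple


class ContextualId(NamedTuple):
    core: str                   # intrinsic part, e.g. 'swh:1:dir:<hash>'
    qualifiers: Dict[str, str]  # context, e.g. {'origin': '<url>'}


def split_context(identifier: str) -> ContextualId:
    """Split 'swh:1:<type>:<hash>;key=value;...' into core id and context."""
    core, *parts = identifier.split(";")
    qualifiers = {}
    for part in parts:
        # Partition on the first '=' only, so values may contain '='.
        key, _, value = part.partition("=")
        qualifiers[key] = value
    return ContextualId(core, qualifiers)


example = split_context(
    "swh:1:dir:42a13fc721c8716ff695d0d62fc851d641f3a12b;"
    "origin=https://hal.archives-ouvertes.fr/hal-01727745"
)
print(example.core)
print(example.qualifiers["origin"])
```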

For the SWORD protocol, we can add another ID or even a URL to the deposit_status entry point, but if we do so, we should be clear about our intentions.

In T1152#21326, @zack wrote:

TL;DR: by ingesting a revision and not returning its ID, we will have a protocol that — at the protocol level — loses information, and that is a bad idea.

Ok, I think I see the source of confusion.

When depositing a .tar/.zip, we are not ingesting a revision, but a directory, so we must return the id of that directory.

We (SWH) also create a synthetic revision to hold extra information when ingesting a directory, but that's not the object that is being deposited (hence "synthetic" :-))

  • nobody (not even HAL, or any other depositor, including the ones concerned by the compliance use cases) can recompute this swh-id SR independently from us, because it depends not only on the metadata added, but also on the particular mangling of this metadata done during the ingestion, which may well change over time; providing only SR as a swh-id for such a deposit makes it impossible for somebody who has a copy of the same code and an article mentioning the swh-id SR to check that the code is the same without accessing SWH: that would make us a middle man, and for our long-term strategy we do not want middle men, not even us

Right. But I consider this a bug. Just as we have documented the way IDs are computed (and now also provide tools that allow computing them independently, which will soon have implementations independent from our own), we should document the specific metadata mangling that we do and how to reproduce it independently from us. That is the right fix for this problem.

  • or return *both* a revision ID and a directory ID (but I'm not entirely sure SWORD supports that), and tell them to use the directory ID

That's a perfectly fine solution and is related to T1098:

Great, let's do this then.

With that, I do think it is important to have the metadata accessible, and keep in mind that with the contextual URL which is used by HAL, the metadata is easily found!

This is precisely why I'm saying that the risk here is designing something that should be generic based on HAL's needs only. The metadata is easily found only because HAL uses a different origin for each "paper". We cannot be sure that that is going to be the case for other deposit use cases (e.g., industrial deposit of CCSC tarballs). If that assumption breaks down, then we will effectively be losing information that the submitter cannot retrieve.

For the SWORD protocol, we can add another ID or even a URL to the deposit_status entry point, but if we do so, we should be clear about our intentions.

As mentioned in the previous message, I'm fine with this solution.
(FWIW, if the returned ID for the dir is an ID and not an URL, the revision one should also be an ID, for consistency.)

rdicosmo renamed this task from "deposit: change swh-id for a deposit to the directory id instead of the revision id" to "deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information". Jul 19 2018, 2:23 PM
rdicosmo updated the task description.
In T1152#21351, @zack wrote:
  • or return *both* a revision ID and a directory ID (but I'm not entirely sure SWORD supports that), and tell them to use the directory ID

That's a perfectly fine solution and is related to T1098:

Great, let's do this then.

Fantastic, we have converged!
Retitled/updated accordingly, and now ready for @ardumont :-)

Just to be clear about this:

  1. Is the swh-id property returned to the client the directory ID, with or without context?
  2. What should we call the new property added for the revision ID?
  3. And will it be a revision ID with or without context?
moranegg mentioned this in Unknown Object (Maniphest Task). Jul 19 2018, 2:50 PM

We want to return in the SWORD response all the information (including context), but structured, so the receiver does not need to do parsing:

  1. swh-id is without context
     1bis. add a property swh-id-context containing the context
  2. I propose to call the new property swh-anchor-id, containing the id of the synthetic commit, without context
     2bis. add a property swh-anchor-id-context containing the context of the synthetic commit
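The four properties proposed above could be assembled as in the following sketch. The property names come from this thread; everything else is an assumption, in particular that the `-context` properties carry the full contextual identifier (core id plus `;origin=...`) rather than only the context part, and the helper name `deposit_status_ids` is hypothetical.

```python
def deposit_status_ids(directory_hash: str, revision_hash: str, origin_url: str) -> dict:
    """Build the structured identifiers returned to the depositing client.

    Assumption: '*-context' values are the full contextual identifiers.
    """
    swh_id = f"swh:1:dir:{directory_hash}"         # main id: the deposited tree (D)
    anchor_id = f"swh:1:rev:{revision_hash}"       # ancillary id: synthetic commit (SR)
    return {
        "swh-id": swh_id,
        "swh-id-context": f"{swh_id};origin={origin_url}",
        "swh-anchor-id": anchor_id,
        "swh-anchor-id-context": f"{anchor_id};origin={origin_url}",
    }


status = deposit_status_ids(
    "42a13fc721c8716ff695d0d62fc851d641f3a12b",
    "a27a59f6b14c9fb13a6f998d8316628dafc1f60c",
    "https://hal.archives-ouvertes.fr/hal-01727745",
)
for name, value in status.items():
    print(f"{name}: {value}")
```

Keeping the four values separate means the receiver can pick the identifier it wants to display without parsing any of them.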
