
Document the `slug` usage on the swh-deposit client for origin creation
Open, Normal, Public

Description

In T2377 we identified that a deposit made through the deposit client with the `slug` parameter resulted in a generated uuid in its origin.
This behavior should be documented both in the client documentation and in the code.

There are two scenarios with the deposit:

  • a deposit with a publicly available location that can be identified as the origin
  • a deposit without a publicly available location

Event Timeline

moranegg triaged this task as Normal priority. Tue, May 5, 10:52 AM
moranegg created this task.

When we set up an explicit deposit collection with a client, we configure on the SWH side a root URL for the deposits, named provider_url. Whenever a new deposit comes in, we record as the reference URL for that deposit the concatenation of this root URL with the value passed to the swh deposit client via the --slug option.

The --slug option isn't mandatory; deposits made without --slug get a generated uuid in place of the slug value.

The concatenated reference URL becomes the origin element in the SWH archive, that is, the location of the resource on the web (where the software artifacts are found when using the pull mechanism).
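To make the rule concrete, here is a minimal sketch of how such a reference URL could be assembled. The function name `build_origin_url` and the trailing-slash handling are assumptions made for illustration; this is not the actual swh-deposit code.

```python
import uuid
from typing import Optional


def build_origin_url(provider_url: str, slug: Optional[str] = None) -> str:
    """Hypothetical helper: join the provider root URL with the slug.

    When no slug is given, a random uuid stands in for it.
    """
    suffix = slug if slug else str(uuid.uuid4())
    # Assumption: provider_url may or may not end with "/"; keep exactly one.
    return provider_url.rstrip("/") + "/" + suffix


# With a slug, the reference URL is fully determined by the client.
print(build_origin_url("https://doi.org/10.5201/", "ipol.2018.236"))

# Without a slug, the suffix is a freshly generated uuid.
# "https://example.org/deposits/" is a placeholder provider_url.
print(build_origin_url("https://example.org/deposits/"))
```

The point of the sketch is that the slug is the only part under the client's control: the prefix is fixed on the SWH side, so omitting the slug necessarily yields a synthetic origin.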

With collections like HAL or IPOL, having a correct URL makes sense, since the artifacts are on the web with a persistent identifier (a HAL-ID or a DOI). In other cases, like Intel, the deposit is not publicly available on the internet or the URL is not persistent; there, creating a synthetic URL and origin with a uuid is the best solution.

An example with IPOL:
an article whose software artifact is available has the reference URL https://doi.org/10.5201/ipol.2018.236, which we split in two:

  • the prefix https://doi.org/10.5201/ that corresponds to IPOL (we keep this on our side as the provider_url)
  • the suffix ipol.2018.236 that we expect to see passed as a parameter to swh deposit with the --slug option

The concatenation of these two elements is the value that we show as "origin url" when you browse the code in the archive.
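For a depositor who starts from the full reference URL, the slug to pass can be derived by stripping the provider prefix. The sketch below assumes a hypothetical helper named `slug_from_reference_url`; it only illustrates the split described above and is not part of the swh-deposit client.

```python
def slug_from_reference_url(reference_url: str, provider_url: str) -> str:
    """Hypothetical helper: return the part of reference_url after provider_url."""
    if not reference_url.startswith(provider_url):
        raise ValueError(
            f"{reference_url!r} does not start with provider_url {provider_url!r}"
        )
    return reference_url[len(provider_url):]


provider_url = "https://doi.org/10.5201/"
reference_url = "https://doi.org/10.5201/ipol.2018.236"

slug = slug_from_reference_url(reference_url, provider_url)
print(slug)  # ipol.2018.236

# Concatenating the two parts gives back the origin URL shown in the archive.
assert provider_url + slug == reference_url
```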

In the Intel example, the software artifacts are not available on the Intel website,
so we have a synthetic origin with a fixed prefix and a generated suffix: