
Simplify/update the SWHIDs returned from a SWORD deposit
Closed, Migrated. Edits Locked.

Description

For a SWORD deposit, we must now return SWHIDs carrying the maximal information allowed by the new SWHID specification.

According to the current Deposit documentation, the information returned is the following:

{
 'deposit_id': '11',
 'deposit_status': 'done',
 'deposit_swh_id': 'swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9',
 'deposit_swh_id_context': 'swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9;origin=https://forge.softwareheritage.org/source/jesuisgpl/',
 'deposit_swh_anchor_id': 'swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb',
 'deposit_swh_anchor_id_context': 'swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb;origin=https://forge.softwareheritage.org/source/jesuisgpl/',
 'deposit_status_detail': 'The deposit has been successfully \
                           loaded into the Software Heritage archive'
}

This can be simplified (and enhanced) using the new SWHID qualifiers anchor, path and visit, and by returning for the swh_id_context the full contextual information (as is now done in the new permalinks box), as follows:

{
 'deposit_id': '11',
 'deposit_status': 'done',
 'deposit_swh_id': 'swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9',
 'deposit_swh_id_context': 'swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9;origin=https://forge.softwareheritage.org/source/jesuisgpl/;visit=swh:1:snp:68c0d26104d47e278dd6be07ed61fafb561d0d20;anchor=swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb;path=/',
 'deposit_status_detail': 'The deposit has been successfully \
                           loaded into the Software Heritage archive'
}

The deposit_swh_anchor_id and deposit_swh_anchor_id_context fields are no longer needed.
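
For illustration, here is a minimal sketch (plain Python, not tied to any particular swh.model API; the helper name is made up for the example) of how such a fully qualified SWHID string can be assembled from its parts:

def build_qualified_swhid(core, origin=None, visit=None, anchor=None, path=None):
    """Append the contextual qualifiers (origin, visit, anchor, path) to a core SWHID."""
    qualifiers = [("origin", origin), ("visit", visit), ("anchor", anchor), ("path", path)]
    parts = [core] + [f"{name}={value}" for name, value in qualifiers if value is not None]
    return ";".join(parts)

build_qualified_swhid(
    "swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9",
    origin="https://forge.softwareheritage.org/source/jesuisgpl/",
    visit="swh:1:snp:68c0d26104d47e278dd6be07ed61fafb561d0d20",
    anchor="swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb",
    path="/",
)
# -> the deposit_swh_id_context value shown in the example above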

Related: https://gitlab.ccsd.cnrs.fr/ccsd/hal/-/issues/264

Event Timeline

rdicosmo created this task.

Once this is ready, check with HAL that everything works in software deposits (see https://gitlab.ccsd.cnrs.fr/ccsd/hal/-/issues/264)

ardumont renamed this task from Simplify/uptade the SWHIDs returned from a SWORD deposit to Simplify/update the SWHIDs returned from a SWORD deposit.May 12 2020, 10:22 AM

I have a couple of questions:

  • Do we migrate the old deposit values to the new ones? (sounds reasonable to do so)
  • Do we keep the seemingly no-longer-relevant fields deposit_swh_anchor_id and deposit_swh_anchor_id_context, or can we drop them?
  • Do we need to keep deposit_swh_anchor_id at all? Can't we just always set the deposit_swh_id field with all the relevant information now?

Once this is ready, check with HAL that everything works in software deposits
(see https://gitlab.ccsd.cnrs.fr/ccsd/hal/-/issues/264)

Yes, I recall HAL uses the deposit_swh_id_context field.
With some luck, they just use it directly and everything will be fine.

Note: the GitLab instance is not public (it needs authentication, which I don't
have). I don't know if you expected that.

  • Do we migrate the old deposit values to the new ones? (sounds reasonable to do so)

Absolutely!

  • Do we keep the seemingly no-longer-relevant fields deposit_swh_anchor_id and deposit_swh_anchor_id_context, or can we drop them?

We definitely want to drop them, after checking that HAL and Intel are not using them (and/or telling them to upgrade their workflow).

Once this is ready, check with HAL that everything works in software deposits
(see https://gitlab.ccsd.cnrs.fr/ccsd/hal/-/issues/264)

Note: the GitLab instance is not public (it needs authentication, which I don't
have). I don't know if you expected that.

It's their choice... I'll ask them to open you an account right away :-)

Also, I edited because one more question came to me:

Do we need to keep deposit_swh_anchor_id at all? Can't we just always set the deposit_swh_id field with all the relevant information now (migration question aside ;)?

Thanks for your feedback ;)

Do we need to keep deposit_swh_anchor_id at all? Can't we just always set the deposit_swh_id field with all the relevant information now (migration question aside ;)?

Let's keep it for now.

HAL uses the deposit_swh_id, and the concatenation with the origin_url is done on the HAL side.
I think it's safer to have HAL change this and use the deposit_swh_id_context.

I'm not sure which identifier Intel is using.
@vlorentz has commented on IRC:

I don't know about deposits themselves, but in the main archive, I see six visits with metadata from them, each of an independent origin, but they are about the same software (vtune)
dates: 2019-05-14, 2019-05-14, 2019-05-17, 2019-05-18, 2019-08-16, 2019-12-16

@zack do you know about Intel?

I'm not sure which identifier Intel is using.

They are using the deposit client.

I don't know about deposits themselves, but in the main archive, I see six
visits with metadata from them, each of an independent origin, but they are
about the same software (vtune) dates: 2019-05-14, 2019-05-14, 2019-05-17,
2019-05-18, 2019-08-16, 2019-12-16

Most probably without providing the --slug. So it ends up being generated,
hence that many different deposit origins for the same software.

Do we migrate the old deposit values to the new ones? (sounds reasonable to do so)

Absolutely!

And now I realize it's not as simple as I initially thought. The current
deposit swh_id data does not contain snapshot information (which is needed in
that new format). So that means the migration step is at least a script which
does the necessary snapshot resolution.

Fortunately, I started this task as a multi-step one (D3141 explicitly
mentions it does not deal with the migration).
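
As a rough sketch only (the actual migration script works against the deposit database and swh-storage; the endpoint, parameter and helper name below are illustrative assumptions and should be checked against the Web API documentation), that snapshot resolution could look like querying the latest visit of the deposit origin:

import requests

API = "https://archive.softwareheritage.org/api/1"

def resolve_snapshot_swhid(origin_url):
    """Return the snapshot SWHID of the latest visit of origin_url, if any."""
    resp = requests.get(
        f"{API}/origin/{origin_url}/visit/latest/",
        params={"require_snapshot": "true"},
    )
    resp.raise_for_status()
    snapshot_id = resp.json().get("snapshot")
    return f"swh:1:snp:{snapshot_id}" if snapshot_id else None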

Indeed! Getting the swh:rev and swh:snp for the swh:dir of the deposit
should not be that complicated navigating the Merkle tree upwards, though,
as we expect little deduplication there, but we'll need to see...

Indeed! Getting the swh:rev and swh:snp for the swh:dir of the deposit
should not be that complicated navigating the Merkle tree upwards, though, as
we expect little deduplication there, but we'll need to see...

I don't foresee much complication.

It's just that I usually envision a migration as something closed in on itself
(the deposit here). That's not the case, as we need the main archive to resolve
the snapshot. So the migration won't be something as simple as an SQL migration
script.

I don't foresee much complication.

I was wrong.

Aside from technical issues (the Django migration is painful to me right now,
see the WIP which is still WIP...).

I see we have the first deposits with only swh_id set to the revision SWHIDs
(so no dir, and no origin_url). I recall we decided not to migrate those at
the time. Shall we migrate them nonetheless to align all deposit SWHIDs?

The origin can be inferred from the deposit data (provider_url + external_id).
The directory can be resolved from the revision using the archive (the current
WIP is taking care of those; it still needs testing though).
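
For illustration, a sketch of those two resolutions: it assumes the origin URL is simply provider_url and external_id joined with a "/", and uses the public Web API revision endpoint instead of swh-storage (which is what the actual WIP uses); the helper names are made up:

import requests

API = "https://archive.softwareheritage.org/api/1"

def infer_origin_url(provider_url, external_id):
    # Assumption for this sketch: the deposit origin is provider_url + "/" + external_id.
    return f"{provider_url.rstrip('/')}/{external_id}"

def resolve_directory_swhid(revision_id):
    """Return the SWHID of the directory targeted by the given revision."""
    resp = requests.get(f"{API}/revision/{revision_id}/")
    resp.raise_for_status()
    return f"swh:1:dir:{resp.json()['directory']}"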

The main concern would be HAL (these are mostly HAL deposits plus some of our
own deposit checks). Even if HAL does not update those SWHIDs on their side,
they remain resolvable by us. So I think it's fine if we fully align our
deposit data.

@moranegg @rdicosmo What's your take on this?

I would say go... HAL identifiers may be updated (not a big deal for them to update) or may not be, but as you say they will remain resolvable, so better to have a uniform status for all deposits.

I agree.
From the HAL point of view, this is not a problem because they don't use the deposit DB to request the identifier; they link directly to the SWH revision (at that time we decided to point from HAL to revisions), and the revision identifier is still correct.

Aside from technical issues (the Django migration is painful to me right now,
see the WIP which is still WIP...).

I was wrong.

For a change. Anyway, done.

Hell is in the details heh (but also in D3153 ;)

Positive me sees that I finally had a closer look at the Django migration thing ;)

ardumont changed the task status from Open to Work in Progress.May 15 2020, 12:41 PM

Needs to package, deploy and migrate all the thing (/me -> food)

Needs to package, deploy and migrate all the thing (/me -> food)

done [1]

And triggered an icinga check to make sure everything is fine.

And then checking the status:

$ swh deposit status --username swh --password "$(swhpass ls operations/deposit.softwareheritage.org/http-auth/swh | head -1)" --deposit-id 610 --format json| jq .
{
  "deposit_id": "610",
  "deposit_status": "done",
  "deposit_status_detail": "The deposit has been successfully loaded into the Software Heritage archive",
  "deposit_swh_id": "swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea",
  "deposit_swh_id_context": "swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-15T12:26:27.825057;visit=swh:1:snp:a74c75c9e7c3345a9d7318bd2c9cc634431f4a85;anchor=swh:1:rev:041cb1662a876403c798b464d159baa5425f8bbb;path=/",
  "deposit_swh_anchor_id": "swh:1:rev:041cb1662a876403c798b464d159baa5425f8bbb",
  "deposit_swh_anchor_id_context": "swh:1:rev:041cb1662a876403c798b464d159baa5425f8bbb;origin=https://www.softwareheritage.org/check-deposit-2020-05-15T12:26:27.825057",
  "deposit_external_id": "check-deposit-2020-05-15T12:26:27.825057"
}

To ease the check, here is the deposit link built out of the swh_id_context [2]

Everything is fine!

[1] As expected and described in D3153, some 24 "old" deposits are left aside,
see P674. It should be possible to reschedule them to bring that number down.
I would expect some to fail though, due to our checks being different...

[2] https://archive.softwareheritage.org/swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-15T12:26:27.825057;visit=swh:1:snp:a74c75c9e7c3345a9d7318bd2c9cc634431f4a85;anchor=swh:1:rev:041cb1662a876403c798b464d159baa5425f8bbb;path=/

Cheers,

[1] As expected and described in D3153, some 24 "old" deposits are left aside,
see P674. It should be possible to reschedule them to bring that number down.
I would expect some to fail though, due to our checks being different...

Those were rescheduled for check (and ingestion if ok). All were ok! I also updated
their check_task_id (scheduler detail).

No more "done" deposits without SWHIDs [1]

I don't have anything more to do on that task.
I guess only communication remains (I updated the ccsd-cnrs GitLab issue).

[1]

softwareheritage-deposit=> select count(*) from deposit where status='done' and swh_id_context is null;
 count
-------
     0
(1 row)

I'm waiting for a test from HAL to resolve this task.
Who do we contact on the Intel side? (@zack?)

This comment was removed by zack.

Here is the planned email:
subject: [Software Heritage] We are updating our SWHID

We are happy to share that we have improved the specificity of the SWHID, which is now able to pinpoint a complete tree from top to bottom.

This means we are updating the deposit as well, changing the deposit_swh_id_context to the new schema.

With this change we are also completely removing the deposit_swh_anchor_id and the deposit_swh_anchor_id_context.

I hope this change is not an inconvenience for your workflow with the Software Heritage archive.

All the best,
MG

I don't have anything more to do on that task.

I created T2412 for the cleanup part.