
Migrate deposit SWHIDs (data) to the new specification
ClosedPublic

Authored by ardumont on May 13 2020, 6:39 PM.

Details

Summary

Migrate deposit SWHIDs (data) to the new specification

Migrate both "recent" and "old" format deposits [1] to the new specification.

That means the deposit swh_id* fields will be set to:

  • swh_id: directory SWHID (no context)
  • swh_id_context: directory SWHID (with context, origin, visit, anchor path)

Optionally, the following 2 fields will be kept (for now) and, where they were
not set ("old" deposits), realigned to:

  • swh_anchor_id: revision SWHID (no context)
  • swh_anchor_id_context: revision SWHID (context with only origin)
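As a rough illustration of the target layout described above, the four fields could be assembled as follows. This is only a sketch, not the code of the migration: the helper names (swhid, target_fields), the qualifier order and the default path value are assumptions made for the example.

```python
from typing import Dict


def swhid(object_type: str, object_id: str) -> str:
    """Build a core SWHID such as swh:1:dir:<hex id>."""
    return f"swh:1:{object_type}:{object_id}"


def target_fields(
    directory_id: str,
    revision_id: str,
    snapshot_id: str,
    origin_url: str,
    path: str = "/",  # assumption: the deposit content sits at the root
) -> Dict[str, str]:
    """Return the four swh_id* fields in the layout targeted by the migration."""
    dir_swhid = swhid("dir", directory_id)
    rev_swhid = swhid("rev", revision_id)
    qualifiers = [
        f"origin={origin_url}",
        f"visit={swhid('snp', snapshot_id)}",
        f"anchor={rev_swhid}",
        f"path={path}",
    ]
    return {
        # directory SWHID, no context
        "swh_id": dir_swhid,
        # directory SWHID with origin, visit, anchor and path
        "swh_id_context": ";".join([dir_swhid] + qualifiers),
        # revision SWHID, no context
        "swh_anchor_id": rev_swhid,
        # revision SWHID with only the origin as context
        "swh_anchor_id_context": f"{rev_swhid};origin={origin_url}",
    }
```

Keeping the revision in both swh_anchor_id and the anchor qualifier means the old revision-based identifier stays recoverable from the new directory-based one.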

It's expected that some very "old" deposits won't be migrated, as we cannot
resolve those values. They will be rescheduled once that becomes possible
(after deploying [2]).

[1] "recent" format means all swh_id fields are set:

  • swh_id: directory SWHID (no context)
  • swh_id_context: directory SWHID (context with only origin)
  • swh_anchor_id: revision SWHID (no context)
  • swh_anchor_id_context: revision SWHID (context with only origin)

"old" format:

  • swh_id: revision SWHID (no context)
  • swh_id_context: not set
  • swh_anchor_id: not set
  • swh_anchor_id_context: not set
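The two formats in [1] can be told apart with a small check like the one below. It is a hedged sketch: the function name deposit_format and the "unknown" fallback are assumptions for illustration, and a deposit is represented as a plain mapping rather than the actual Django model.

```python
from typing import Mapping, Optional

FIELDS = ("swh_id", "swh_id_context", "swh_anchor_id", "swh_anchor_id_context")


def deposit_format(deposit: Mapping[str, Optional[str]]) -> str:
    """Classify a deposit as "recent", "old" or "unknown" from its swh_id* fields."""
    # Treat missing keys and empty strings as "not set".
    values = {name: deposit.get(name) or None for name in FIELDS}
    swh_id = values["swh_id"]
    others_set = [values[name] is not None for name in FIELDS[1:]]
    if swh_id is None:
        return "unknown"
    if swh_id.startswith("swh:1:dir:") and all(others_set):
        return "recent"
    if swh_id.startswith("swh:1:rev:") and not any(others_set):
        return "old"
    return "unknown"
```

Note that a deposit already in the new layout also has a directory swh_id and all fields set, so this check alone does not distinguish "recent" from "already migrated".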

[2] Related to D3141

Related to T2398

Test Plan

Restore a dump of the production db into the staging db.
Then run the migration scripts:

$ SWH_CONFIG_FILENAME=/etc/softwareheritage/deposit/server.yml django-admin migrate --settings=swh.deposit.settings.production --verbosity 3

"Recent" deposits

From

 id  | status |                       swh_id                       |                                                           swh_id_context
-----+--------+----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------
 608 | done   | swh:1:dir:c3d06da1a556900e295b64aea1cc5a413b374ae9 | swh:1:dir:c3d06da1a556900e295b64aea1cc5a413b374ae9;origin=https://hal.archives-ouvertes.fr/hal-02560320
 607 | done   | swh:1:dir:c3d06da1a556900e295b64aea1cc5a413b374ae9 | swh:1:dir:c3d06da1a556900e295b64aea1cc5a413b374ae9;origin=https://hal.archives-ouvertes.fr/hal-02560320
 606 | done   | swh:1:dir:d85591aeefea2c1c58142e34683fd1923b19c895 | swh:1:dir:d85591aeefea2c1c58142e34683fd1923b19c895;origin=https://doi.org/10.5201/ipol.2018.236
 605 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-14T08:28:05.683282
 603 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-09T14:09:50.098364
 602 | done   | swh:1:dir:a10423592dd061a00f7d34e4a3c102ba00c3d2ab | swh:1:dir:a10423592dd061a00f7d34e4a3c102ba00c3d2ab;origin=https://doi.org/10.5201/ipol.2018.236
 601 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-07T16:05:49.106202
 600 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-07T14:09:14.062873
 599 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-07T12:52:53.361776
 598 | done   | swh:1:dir:43b7a45a89c836b1baad8849215a51e65a67f80e | swh:1:dir:43b7a45a89c836b1baad8849215a51e65a67f80e;origin=https://hal.archives-ouvertes.fr/hal-02546057
 597 | done   | swh:1:dir:a10423592dd061a00f7d34e4a3c102ba00c3d2ab | swh:1:dir:a10423592dd061a00f7d34e4a3c102ba00c3d2ab;origin=https://doi.org/10.5201/ipol.2018.236
...

to

 id  | status |                       swh_id                       |                                                                                                                        swh_id_context                                    $
-----+--------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
 608 | done   | swh:1:dir:c3d06da1a556900e295b64aea1cc5a413b374ae9 | swh:1:dir:c3d06da1a556900e295b64aea1cc5a413b374ae9;origin=https://hal.archives-ouvertes.fr/hal-02560320;visit=swh:1:snp:e5e82d064a9c3df7464223042e0c55d72ccff7f0;anchor=s$
 607 | done   | swh:1:dir:c3d06da1a556900e295b64aea1cc5a413b374ae9 | swh:1:dir:c3d06da1a556900e295b64aea1cc5a413b374ae9;origin=https://hal.archives-ouvertes.fr/hal-02560320;visit=swh:1:snp:3e95ef6e04c381a34cc2f314576bc5644f2c797f;anchor=s$
 606 | done   | swh:1:dir:d85591aeefea2c1c58142e34683fd1923b19c895 | swh:1:dir:d85591aeefea2c1c58142e34683fd1923b19c895;origin=https://doi.org/10.5201/ipol.2018.236;visit=swh:1:snp:07c80b96ab64e714fb69ed725f6b18caf87763ba;anchor=swh:1:rev$
 605 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-14T08:28:05.683282;visit=swh:1:snp:4577ab1375d35bab6e316$
 603 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-09T14:09:50.098364;visit=swh:1:snp:7e09ab0433291e2c5ea14$
 602 | done   | swh:1:dir:a10423592dd061a00f7d34e4a3c102ba00c3d2ab | swh:1:dir:a10423592dd061a00f7d34e4a3c102ba00c3d2ab;origin=https://doi.org/10.5201/ipol.2018.236;visit=swh:1:snp:994f6ca7c49b1012768c4a5a6470f17f28d0e294;anchor=swh:1:rev$
 601 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-07T16:05:49.106202;visit=swh:1:snp:7c6ad0d82051bce0d5ebd$
 600 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-07T14:09:14.062873;visit=swh:1:snp:8f2341e340bd883300885$
 599 | done   | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea | swh:1:dir:ef04a768181417fbc5eef4243e2507915f24deea;origin=https://www.softwareheritage.org/check-deposit-2020-05-07T12:52:53.361776;visit=swh:1:snp:ce3d7eb9b08b839171c01$
 598 | done   | swh:1:dir:43b7a45a89c836b1baad8849215a51e65a67f80e | swh:1:dir:43b7a45a89c836b1baad8849215a51e65a67f80e;origin=https://hal.archives-ouvertes.fr/hal-02546057;visit=swh:1:snp:526c43a6e4459f2c72c67031adf931ed6d3bdca7;anchor=s$
 597 | done   | swh:1:dir:a10423592dd061a00f7d34e4a3c102ba00c3d2ab | swh:1:dir:a10423592dd061a00f7d34e4a3c102ba00c3d2ab;origin=https://doi.org/10.5201/ipol.2018.236;visit=swh:1:snp:f7decde6a26a4fa5f0886d71c010ceae827bae92;anchor=swh:1:rev$
 ...
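One way to spot-check rows like the ones above is to split swh_id_context into its core SWHID and its qualifiers, then verify that origin and visit are both present. This is a sketch under assumptions: split_swhid and is_migrated are hypothetical helpers, and the naive split on ";" assumes the origin URL itself contains no ";".

```python
from typing import Dict, Tuple


def split_swhid(value: str) -> Tuple[str, Dict[str, str]]:
    """Split "core;key=value;..." into the core SWHID and a qualifier dict."""
    core, *raw_qualifiers = value.split(";")
    qualifiers: Dict[str, str] = {}
    for raw in raw_qualifiers:
        # partition keeps any "=" that appears inside the value (e.g. in URLs)
        key, _, val = raw.partition("=")
        qualifiers[key] = val
    return core, qualifiers


def is_migrated(swh_id: str, swh_id_context: str) -> bool:
    """True when the context is a directory SWHID carrying origin and visit."""
    core, qualifiers = split_swhid(swh_id_context)
    return (
        core == swh_id
        and core.startswith("swh:1:dir:")
        and {"origin", "visit"} <= qualifiers.keys()
    )
```

Applied to the "From" rows, which only carry an origin, is_migrated returns False; a row whose context also carries a visit qualifier returns True.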

"Old" deposits:

From

 id  | status |                       swh_id                       | swh_id_context
-----+--------+----------------------------------------------------+----------------
 156 | done   | swh:1:rev:698771f9ca7ce7605fdcabf27b5851f322ea692c |
 155 | done   | swh:1:rev:6c9bdcaac6b1b22726752d5d46d04865313d78aa |
 154 | done   | swh:1:rev:8127063816bd4f75e00c2986c0a95fd95d78d876 |
 153 | done   | swh:1:rev:2176d2be0d7e13e89a90447d7d0853af5cbab973 |
 152 | done   | swh:1:rev:e2655c5b28552465a7be15c06f31aa066f64535a |
 151 | done   | swh:1:rev:504a90c58872a8a594886fcf75fc5bfebe151e68 |
 150 | done   | swh:1:rev:c648730299c2a4f4df3c1fe6e527ef3681f9527e |
 149 | done   | swh:1:rev:bb8d72c6646316967ac08a7bc4acc95c50c14d79 |
 147 | done   | swh:1:rev:c8fca417ee9eefe25683042192da67470147be07 |
 146 | done   | swh:1:rev:cccf789c12617208fe188ad3dbc2746d4c884ab7 |

to

 id  | status |                       swh_id                       |                                                                                                          swh_id_context                                                  $
-----+--------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
 156 | done   | swh:1:dir:2c01e745c6d89e0eeb9a6ec9590f7ef0750b7002 | swh:1:dir:2c01e745c6d89e0eeb9a6ec9590f7ef0750b7002;origin=https://hal.archives-ouvertes.fr/hal-01831369;visit=swh:1:snp:42f0897956e700a23f5b8aafce43360b8699c0f1;anchor=s$
 155 | done   | swh:1:rev:6c9bdcaac6b1b22726752d5d46d04865313d78aa |
 154 | done   | swh:1:dir:3cb45c908fdad87542c5090e9464fc7f504e1509 | swh:1:dir:3cb45c908fdad87542c5090e9464fc7f504e1509;origin=https://hal.archives-ouvertes.fr/hal-01836266;visit=swh:1:snp:1fbb294bc458809e043bba9073f9d7a8b0b40fc9;anchor=s$
 153 | done   | swh:1:dir:95486800004625900d8365ee968683c7608a3b9d | swh:1:dir:95486800004625900d8365ee968683c7608a3b9d;origin=https://hal.archives-ouvertes.fr/hal-01837101;visit=swh:1:snp:2c2c2e4dcd61753b61739a45669ffbb89104d17a;anchor=s$
 152 | done   | swh:1:dir:f23a9f9d65671aaad715012a1781cb5de6451a3e | swh:1:dir:f23a9f9d65671aaad715012a1781cb5de6451a3e;origin=https://hal.archives-ouvertes.fr/hal-01831364;visit=swh:1:snp:f34ffc4d2fb57ba19a8586b88091fe99714a970a;anchor=s$
 151 | done   | swh:1:dir:f5cba66f896192d98641cf2d801de11dfca9f2a7 | swh:1:dir:f5cba66f896192d98641cf2d801de11dfca9f2a7;origin=https://hal.archives-ouvertes.fr/hal-01836189;visit=swh:1:snp:0e0f73db37ae7d26bf4b29d5599da2bfced30d63;anchor=s$
 150 | done   | swh:1:dir:accc6076ec6104d2125567e4a0c7685fb91f71e7 | swh:1:dir:accc6076ec6104d2125567e4a0c7685fb91f71e7;origin=https://hal.archives-ouvertes.fr/hal-01836169;visit=swh:1:snp:e3640bbfa187762803f29012b02693dd48e0ac88;anchor=s$
 149 | done   | swh:1:rev:bb8d72c6646316967ac08a7bc4acc95c50c14d79 |
 147 | done   | swh:1:dir:f23a9f9d65671aaad715012a1781cb5de6451a3e | swh:1:dir:f23a9f9d65671aaad715012a1781cb5de6451a3e;origin=https://hal.archives-ouvertes.fr/hal-01831364;visit=swh:1:snp:2cce797c46e9d06eb424e2f806a8d7d1fab6bf38;anchor=s$
 146 | done   | swh:1:dir:8a9521f0228d4f79a20d8d20f28523d557f9d2f8 | swh:1:dir:8a9521f0228d4f79a20d8d20f28523d557f9d2f8;origin=https://hal.archives-ouvertes.fr/hal-01831369;visit=swh:1:snp:a0f733bb6f16d6fe65c95194ad76c471fe739e75;anchor=s$

As expected, some deposits are not migrated (see description).
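The "old" deposit step can be pictured as follows: resolve the directory and snapshot from the stored revision, and leave the row untouched when either cannot be found. This is a minimal sketch, not the actual migration code: migrate_old_deposit and the two lookup mappings are hypothetical stand-ins for archive queries, and the trailing path=/ qualifier is an assumption.

```python
from typing import Dict, Mapping, Optional

Deposit = Dict[str, Optional[str]]


def migrate_old_deposit(
    deposit: Mapping[str, Optional[str]],
    origin_url: str,
    rev_to_dir: Mapping[str, str],       # hypothetical: revision id -> directory id
    rev_to_snapshot: Mapping[str, str],  # hypothetical: revision id -> snapshot id
) -> Deposit:
    """Return the migrated deposit, or an unchanged copy if it cannot be resolved."""
    rev_swhid = deposit.get("swh_id")
    if not rev_swhid or not rev_swhid.startswith("swh:1:rev:"):
        raise ValueError("not an old-format deposit")
    rev_id = rev_swhid.rsplit(":", 1)[1]
    directory_id = rev_to_dir.get(rev_id)
    snapshot_id = rev_to_snapshot.get(rev_id)
    if directory_id is None or snapshot_id is None:
        # Cannot resolve: keep as-is, to be rescheduled later.
        return dict(deposit)
    dir_swhid = f"swh:1:dir:{directory_id}"
    context = (
        f"{dir_swhid};origin={origin_url}"
        f";visit=swh:1:snp:{snapshot_id};anchor={rev_swhid};path=/"
    )
    return {
        **deposit,
        "swh_id": dir_swhid,
        "swh_id_context": context,
        "swh_anchor_id": rev_swhid,
        "swh_anchor_id_context": f"{rev_swhid};origin={origin_url}",
    }
```

With the values shown for deposit 156, the revision swh_id becomes a directory swh_id, while a deposit such as 155, for which no snapshot is found, comes back unchanged.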

Leftover to reschedule

swh-deposit=> select id, status, swh_id, swh_id_context from deposit where status='done' and swh_id_context is null order by id desc;
 id  | status |                       swh_id                       | swh_id_context
-----+--------+----------------------------------------------------+----------------
 155 | done   | swh:1:rev:6c9bdcaac6b1b22726752d5d46d04865313d78aa |
 149 | done   | swh:1:rev:bb8d72c6646316967ac08a7bc4acc95c50c14d79 |
 127 | done   | swh:1:rev:d76cf5c02ce421f157d3fa624ad134a2efd18193 |
 126 | done   | swh:1:rev:84567c10d3c2383a878a9d8ab6773c1665e08419 |
 125 | done   | swh:1:rev:35ff14e6e4514adae3f950825a4b8b9b9f22767f |
 124 | done   | swh:1:rev:279a8ea930ddd6ef54f10f2f0784ea14a2205215 |
 123 | done   | swh:1:rev:e2a3373925db0f9f4307699e913b9fea9516cf6b |
 116 | done   | swh:1:rev:e2cdf2d3ce49f933ac6d23054183f92eacc4faef |
 114 | done   | swh:1:rev:a5e8b3d276e3a05989d00628e6e611ec7c51252a |
 112 | done   | swh:1:rev:b167902daf3a8a163d947adb62ad4269df471597 |
 110 | done   | swh:1:rev:b260ac6c02987fdf66e7dd1d2e647134cc3bed72 |
 108 | done   | swh:1:rev:d3f9947006289c67be6fd2a5081e466d61a80996 |
  93 | done   | swh:1:rev:734786ca12ca626b3a82a9d2a6fb5f6b968e7bd6 |
  92 | done   | swh:1:rev:4eb1d36683af77b946cdcb5875798d03bd6b775a |
  86 | done   | swh:1:rev:a0b9fc8f8a8bd7e1d29a18b9ac1a7d6e402d31cd |
  85 | done   | swh:1:rev:c29acbad74bb6cc01f9b7d61dd4f01ac747d771d |
  84 | done   | swh:1:rev:afb67a44c5de98891f4f21d04c449cc200b7e739 |
  83 | done   | swh:1:rev:bc3a12c0a288d74eafeb564ba03d8466f5fdb0f2 |
  82 | done   | swh:1:rev:31578998456025e4ebdb396b08dda0a63777b80e |
  81 | done   | swh:1:rev:85a127f023c84b2326c72fa669f0e3ad73a4fb68 |
  80 | done   | swh:1:rev:2a97f21995bab29548d7b41ec75fdd5639dbd325 |
  79 | done   | swh:1:rev:03987f056eaf4596cd20d7b2ee01c9b84ceddfa8 |
  78 | done   | swh:1:rev:7b844a98f54466cb189d27dbc1eede17f39e1c52 |
  77 | done   | swh:1:rev:4cf243a0645d5cd10c689eafd22ab38d685ad2d4 |
(24 rows)
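The leftover query can be replayed against a toy table to see what it selects. The sketch below uses an in-memory sqlite3 database purely for illustration (the output above comes from a psql session); the table definition, the helper names and the shortened context value for deposit 156 are assumptions.

```python
import sqlite3

LEFTOVER_QUERY = """
    select id, status, swh_id, swh_id_context
    from deposit
    where status = 'done' and swh_id_context is null
    order by id desc
"""


def demo_connection() -> sqlite3.Connection:
    """Create a tiny in-memory deposit table with one migrated and two leftover rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table deposit ("
        "id integer primary key, status text, swh_id text, swh_id_context text)"
    )
    conn.executemany(
        "insert into deposit values (?, ?, ?, ?)",
        [
            # migrated row (context shortened for the example)
            (156, "done", "swh:1:dir:2c01e745c6d89e0eeb9a6ec9590f7ef0750b7002",
             "swh:1:dir:2c01e745c6d89e0eeb9a6ec9590f7ef0750b7002"
             ";origin=https://hal.archives-ouvertes.fr/hal-01831369"),
            # leftover rows: still a revision SWHID, no context
            (155, "done", "swh:1:rev:6c9bdcaac6b1b22726752d5d46d04865313d78aa", None),
            (149, "done", "swh:1:rev:bb8d72c6646316967ac08a7bc4acc95c50c14d79", None),
        ],
    )
    return conn


def leftover_deposits(conn: sqlite3.Connection) -> list:
    """Return the done deposits that still lack a swh_id_context."""
    return conn.execute(LEFTOVER_QUERY).fetchall()
```

Calling leftover_deposits(demo_connection()) returns only the rows for deposits 155 and 149, in descending id order.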

Diff Detail

Repository
rDDEP Push deposit
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.May 13 2020, 6:39 PM

Build has FAILED

Patch application report for D3153 (id=11194)

Could not rebase; Attempt merge onto 85e1ff3eea...

Updating 85e1ff3e..3d4a38f4
Fast-forward
 docs/endpoints/status.rst                          |   2 +-
 docs/getting-started.rst                           |   2 +-
 requirements-swh-server.txt                        |   2 +-
 swh/deposit/api/private/deposit_update_status.py   |  63 ++++++---
 .../migrations/0018_deposit_migrate_swhids.py      | 115 ++++++++++++++++
 .../api/test_deposit_private_update_status.py      | 146 ++++++++++++++-------
 6 files changed, 260 insertions(+), 70 deletions(-)
 create mode 100644 swh/deposit/migrations/0018_deposit_migrate_swhids.py
Changes applied before test
commit 3d4a38f431602fd20098f27a5550a4124e17d149
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed May 13 18:37:40 2020 +0200

    wip: Migrate deposit SWHIDs (data) to the new specification
    
    Related to T2398

commit 51ce63737eb3a03b5c5cce2413b6dff108e06f07
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue May 12 15:33:00 2020 +0200

    Update deposit swhid to respect the latest specification update
    
    This also ensures the loader sends all the required information now (it did
    already but it was a bit laxed on checks for that part).
    
    Related to T2398

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/23/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/23/console

ardumont edited the summary of this revision. (Show Details)May 13 2020, 6:40 PM
ardumont added a project: SWORD deposit.
ardumont updated this revision to Diff 11195.May 13 2020, 6:58 PM

Add the second part of the migration (initial deposits in the old format)

ardumont planned changes to this revision.May 13 2020, 6:59 PM
ardumont added a subscriber: moranegg.

Build has FAILED

Patch application report for D3153 (id=11195)

Could not rebase; Attempt merge onto 85e1ff3eea...

Updating 85e1ff3e..f1ddd79f
Fast-forward
 docs/endpoints/status.rst                          |   2 +-
 docs/getting-started.rst                           |   2 +-
 requirements-swh-server.txt                        |   2 +-
 swh/deposit/api/private/deposit_update_status.py   |  63 +++++---
 .../migrations/0018_deposit_migrate_swhids.py      | 160 +++++++++++++++++++++
 .../api/test_deposit_private_update_status.py      | 146 ++++++++++++-------
 6 files changed, 305 insertions(+), 70 deletions(-)
 create mode 100644 swh/deposit/migrations/0018_deposit_migrate_swhids.py
Changes applied before test
commit f1ddd79f64d4e67897f35c73591cc8e4f83fab64
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed May 13 18:37:40 2020 +0200

    wip: Migrate deposit SWHIDs (data) to the new specification
    
    Related to T2398

commit 51ce63737eb3a03b5c5cce2413b6dff108e06f07
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue May 12 15:33:00 2020 +0200

    Update deposit swhid to respect the latest specification update
    
    This also ensures the loader sends all the required information now (it did
    already but it was a bit laxed on checks for that part).
    
    Related to T2398

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/24/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/24/console

ardumont edited the summary of this revision. (Show Details)May 13 2020, 10:07 PM
ardumont updated this revision to Diff 11197.May 14 2020, 12:41 PM

Rebase on latest master

Build has FAILED

Patch application report for D3153 (id=11197)

Could not rebase; Attempt merge onto aaf05610dc...

Updating aaf05610..c8d6e57d
Fast-forward
 docs/endpoints/status.rst                          |   2 +-
 docs/getting-started.rst                           |   2 +-
 requirements-swh-server.txt                        |   2 +-
 swh/deposit/api/private/deposit_update_status.py   |  63 +++++--
 swh/deposit/migrations/0018_migrate_swhids.py      | 188 +++++++++++++++++++++
 .../api/test_deposit_private_update_status.py      | 146 ++++++++++------
 6 files changed, 333 insertions(+), 70 deletions(-)
 create mode 100644 swh/deposit/migrations/0018_migrate_swhids.py
Changes applied before test
commit c8d6e57de45d312d92d4b1680e87020dee364e3a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed May 13 18:37:40 2020 +0200

    wip: Migrate deposit SWHIDs (data) to the new specification
    
    Related to T2398

commit 6604038ec7b11faf58e5600a7087ce9b575a6578
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue May 12 15:33:00 2020 +0200

    Update deposit swhid to respect the latest specification update
    
    This also ensures the loader sends all the required information now (it did
    already but it was a bit laxed on checks for that part).
    
    Related to T2398

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/26/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/26/console

ardumont edited the summary of this revision. (Show Details)May 14 2020, 2:34 PM
ardumont updated this revision to Diff 11199.May 14 2020, 2:38 PM
  • Deal with edge case on old "swh" client deposits
  • Fix so the migration actually runs
  • Rebase on latest master

TODO:

  • Actually test it
ardumont planned changes to this revision.May 14 2020, 2:39 PM

Build is green

Patch application report for D3153 (id=11199)

Could not rebase; Attempt merge onto 9113b6107e...

Updating 9113b610..9a216703
Fast-forward
 docs/endpoints/status.rst                          |   2 +-
 docs/getting-started.rst                           |   2 +-
 requirements-swh-server.txt                        |   2 +-
 swh/deposit/api/private/deposit_update_status.py   |  63 +++++--
 swh/deposit/migrations/0018_migrate_swhids.py      | 209 +++++++++++++++++++++
 .../api/test_deposit_private_update_status.py      | 146 +++++++++-----
 6 files changed, 354 insertions(+), 70 deletions(-)
 create mode 100644 swh/deposit/migrations/0018_migrate_swhids.py
Changes applied before test
commit 9a2167031b36c8deba16009c890c7b50a259d6f0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed May 13 18:37:40 2020 +0200

    wip: Migrate deposit SWHIDs (data) to the new specification
    
    Related to T2398

commit fb97bf9832bed13516d62848934cd0eea6a29f81
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue May 12 15:33:00 2020 +0200

    Update deposit swhid to respect the latest specification update
    
    This also ensures the loader sends all the required information now (it did
    already but it was a bit laxed on checks for that part).
    
    Related to T2398

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/28/ for more details.

ardumont updated this revision to Diff 11201.May 14 2020, 3:23 PM

Add more checks and logs

Build is green

Patch application report for D3153 (id=11201)

Could not rebase; Attempt merge onto 9113b6107e...

Updating 9113b610..fee1637d
Fast-forward
 docs/endpoints/status.rst                          |   2 +-
 docs/getting-started.rst                           |   2 +-
 requirements-swh-server.txt                        |   2 +-
 swh/deposit/api/private/deposit_update_status.py   |  63 +++--
 swh/deposit/migrations/0018_migrate_swhids.py      | 315 +++++++++++++++++++++
 .../api/test_deposit_private_update_status.py      | 146 ++++++----
 6 files changed, 460 insertions(+), 70 deletions(-)
 create mode 100644 swh/deposit/migrations/0018_migrate_swhids.py
Changes applied before test
commit fee1637d8daf72f438ec59dea68a3f1c4e0cf660
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed May 13 18:37:40 2020 +0200

    wip: Migrate deposit SWHIDs (data) to the new specification
    
    Related to T2398

commit fb97bf9832bed13516d62848934cd0eea6a29f81
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue May 12 15:33:00 2020 +0200

    Update deposit swhid to respect the latest specification update
    
    This also ensures the loader sends all the required information now (it did
    already but it was a bit laxed on checks for that part).
    
    Related to T2398

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/29/ for more details.

ardumont updated this revision to Diff 11202.May 14 2020, 5:29 PM

Fix following testing on staging deposit

Build is green

Patch application report for D3153 (id=11202)

Could not rebase; Attempt merge onto 9113b6107e...

Updating 9113b610..2d1775b8
Fast-forward
 docs/endpoints/status.rst                          |   2 +-
 docs/getting-started.rst                           |   2 +-
 requirements-swh-server.txt                        |   2 +-
 swh/deposit/api/private/deposit_update_status.py   |  63 ++--
 swh/deposit/migrations/0018_migrate_swhids.py      | 328 +++++++++++++++++++++
 .../api/test_deposit_private_update_status.py      | 146 +++++----
 6 files changed, 473 insertions(+), 70 deletions(-)
 create mode 100644 swh/deposit/migrations/0018_migrate_swhids.py
Changes applied before test
commit 2d1775b8b8f073cf12307a87e371ac87383955e6
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed May 13 18:37:40 2020 +0200

    wip: Migrate deposit SWHIDs (data) to the new specification
    
    Related to T2398

commit fb97bf9832bed13516d62848934cd0eea6a29f81
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue May 12 15:33:00 2020 +0200

    Update deposit swhid to respect the latest specification update
    
    This also ensures the loader sends all the required information now (it did
    already but it was a bit laxed on checks for that part).
    
    Related to T2398

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/30/ for more details.

ardumont updated this revision to Diff 11203.May 14 2020, 6:34 PM

Other fixes on the second part of the migration

Tested with a production dump of the deposit db restored into the staging deposit db.

Build is green

Patch application report for D3153 (id=11203)

Could not rebase; Attempt merge onto 9113b6107e...

Updating 9113b610..fc871ad0
Fast-forward
 docs/endpoints/status.rst                          |   2 +-
 docs/getting-started.rst                           |   2 +-
 requirements-swh-server.txt                        |   2 +-
 swh/deposit/api/private/deposit_update_status.py   |  63 ++--
 swh/deposit/migrations/0018_migrate_swhids.py      | 332 +++++++++++++++++++++
 .../api/test_deposit_private_update_status.py      | 146 +++++----
 6 files changed, 477 insertions(+), 70 deletions(-)
 create mode 100644 swh/deposit/migrations/0018_migrate_swhids.py
Changes applied before test
commit fc871ad0b358dd8a92a67430bf0b25db59d048d3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed May 13 18:37:40 2020 +0200

    wip: Migrate deposit SWHIDs (data) to the new specification
    
    Related to T2398

commit fb97bf9832bed13516d62848934cd0eea6a29f81
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue May 12 15:33:00 2020 +0200

    Update deposit swhid to respect the latest specification update
    
    This also ensures the loader sends all the required information now (it did
    already but it was a bit laxed on checks for that part).
    
    Related to T2398

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/31/ for more details.

ardumont edited the summary of this revision. (Show Details)May 14 2020, 6:38 PM
ardumont edited the test plan for this revision. (Show Details)

Dump out of production db restored in staging db.
Run the migration scripts.
It's not entirely complete (some ~30 "initial/testing" deposits are currently
failing)

Details.

Early deposits were not always consistent regarding origins.
Here they are:

|------------+-----------------------------------------------+-----+-----+------------------------------------------------------------|
| deposit id | origin                                        | swh | hal | real-origins                                               |
|------------+-----------------------------------------------+-----+-----+------------------------------------------------------------|
|         76 | https://hal.archives-ouvertes.fr/hal-01588782 | X   | X   | https://inria.halpreprod.archives-ouvertes.fr/hal-01588782 |
|         87 | https://hal.archives-ouvertes.fr/hal-01588927 | X   | X   | https://inria.halpreprod.archives-ouvertes.fr/hal-01588927 |
|         89 | https://hal.archives-ouvertes.fr/hal-01588935 | X   | ok  | https://hal-preprod.archives-ouvertes.fr/hal-01588935      |
|         88 | https://hal.archives-ouvertes.fr/hal-01588928 | X   | ok  | https://inria.halpreprod.archives-ouvertes.fr/hal-01588928 |
|         90 | https://hal.archives-ouvertes.fr/hal-01588942 | X   | ok  | https://inria.halpreprod.archives-ouvertes.fr/hal-01588942 |
|        143 | https://hal.archives-ouvertes.fr/hal-01592430 | X   | X   | https://hal-preprod.archives-ouvertes.fr/hal-01592430      |
|         75 | https://hal.archives-ouvertes.fr/hal-01588781 | X   | X   | https://inria.halpreprod.archives-ouvertes.fr/hal-01588781 |
|------------+-----------------------------------------------+-----+-----+------------------------------------------------------------|

The migration will be adapted to deal with those ^.
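One possible way to handle these inconsistent origins is to try the recorded origin first and then fall back to the preprod hosts seen in the "real-origins" column. The sketch below is an assumption about how that could look, not the adaptation actually made in the migration: candidate_origins, resolve_origin and the origin_exists callback are hypothetical.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlsplit, urlunsplit

# Hosts taken from the "real-origins" column of the table above.
PREPROD_HOSTS = (
    "inria.halpreprod.archives-ouvertes.fr",
    "hal-preprod.archives-ouvertes.fr",
)


def candidate_origins(origin_url: str) -> Iterator[str]:
    """Yield the recorded origin, then the same path on each preprod host."""
    yield origin_url
    parts = urlsplit(origin_url)
    for host in PREPROD_HOSTS:
        if host != parts.netloc:
            yield urlunsplit(parts._replace(netloc=host))


def resolve_origin(
    origin_url: str, origin_exists: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate the archive knows about, or None."""
    for candidate in candidate_origins(origin_url):
        if origin_exists(candidate):
            return candidate
    return None
```

Here origin_exists stands in for whatever lookup tells whether the archive has visited a given origin; for deposit 89 it would accept only the hal-preprod URL.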

For the next ones, we did not find any snapshot targeting those revisions.
It's possible those deposits were ingested prior to actually having snapshots
in the model.

|------------+------------------------------------------|
| deposit id | revision                                 |
|------------+------------------------------------------|
|         93 | 734786ca12ca626b3a82a9d2a6fb5f6b968e7bd6 |
|         86 | a0b9fc8f8a8bd7e1d29a18b9ac1a7d6e402d31cd |
|         77 | 4cf243a0645d5cd10c689eafd22ab38d685ad2d4 |
|         82 | 31578998456025e4ebdb396b08dda0a63777b80e |
|         83 | bc3a12c0a288d74eafeb564ba03d8466f5fdb0f2 |
|         92 | 4eb1d36683af77b946cdcb5875798d03bd6b775a |
|        114 | a5e8b3d276e3a05989d00628e6e611ec7c51252a |
|        116 | e2cdf2d3ce49f933ac6d23054183f92eacc4faef |
|        123 | e2a3373925db0f9f4307699e913b9fea9516cf6b |
|        125 | 35ff14e6e4514adae3f950825a4b8b9b9f22767f |
|         84 | afb67a44c5de98891f4f21d04c449cc200b7e739 |
|         85 | c29acbad74bb6cc01f9b7d61dd4f01ac747d771d |
|         78 | 7b844a98f54466cb189d27dbc1eede17f39e1c52 |
|        108 | d3f9947006289c67be6fd2a5081e466d61a80996 |
|        124 | 279a8ea930ddd6ef54f10f2f0784ea14a2205215 |
|         79 | 03987f056eaf4596cd20d7b2ee01c9b84ceddfa8 |
|         80 | 2a97f21995bab29548d7b41ec75fdd5639dbd325 |
|         81 | 85a127f023c84b2326c72fa669f0e3ad73a4fb68 |
|        110 | b260ac6c02987fdf66e7dd1d2e647134cc3bed72 |
|        126 | 84567c10d3c2383a878a9d8ab6773c1665e08419 |
|        127 | d76cf5c02ce421f157d3fa624ad134a2efd18193 |
|        155 | 6c9bdcaac6b1b22726752d5d46d04865313d78aa |
|        149 | bb8d72c6646316967ac08a7bc4acc95c50c14d79 |
|------------+------------------------------------------|

A further analysis needs to be done for those.

A further analysis needs to be done for those

After discussing with @moranegg: those are indeed old deposits. Once the initial
migration is done and D3141 is landed/deployed, we can reschedule those deposits
for ingestion so they get decent SWHIDs (with contextual information). The
ancient SWHIDs will no longer be referenced in the deposit but they will still
be resolvable by the archive.

And as a further step, we can also send a heads-up email to HAL about
migrating those values: "Oh by the way, you can keep those old values or update
them to ...".

moranegg accepted this revision.May 15 2020, 11:49 AM

Looks reasonable.
As I said, I can't test ;-) But I'll accept that.
Regarding sending HAL an email, I'm in a call with them in two weeks.
This is a migration that won't affect the links they have and I'm not sure they want to migrate their data.
I suggest asking them on the call, and if they want to migrate, email some instructions.

swh/deposit/migrations/0018_migrate_swhids.py
286

So first you give all deposits all the ids?
Because the swh_anchor_id should be deleted...

This revision is now accepted and ready to land.May 15 2020, 11:49 AM
ardumont added inline comments.May 15 2020, 11:56 AM
swh/deposit/migrations/0018_migrate_swhids.py
286

Yes, I keep those (swh_anchor_id, swh_anchor_id_context) at first.

I don't know yet whether existing deposit clients use them. In the
meantime, realigning everything consistently does not pose any problem.

We can always do the migration dropping the unneeded fields later (that's
the same reasoning I applied in D3141 as well).

ardumont updated this revision to Diff 11211.May 15 2020, 12:05 PM
  • Deal with missing origin cases
  • Rework commit message

Ready for production

ardumont retitled this revision from wip: Migrate deposit SWHIDs (data) to the new specification to Migrate deposit SWHIDs (data) to the new specification.May 15 2020, 12:06 PM
ardumont edited the summary of this revision. (Show Details)
ardumont edited the test plan for this revision. (Show Details)

Build is green

Patch application report for D3153 (id=11211)

Could not rebase; Attempt merge onto 9113b6107e...

Updating 9113b610..a631dabb
Fast-forward
 docs/endpoints/status.rst                          |   2 +-
 docs/getting-started.rst                           |   2 +-
 requirements-swh-server.txt                        |   2 +-
 swh/deposit/api/private/deposit_update_status.py   |  63 +++-
 swh/deposit/migrations/0018_migrate_swhids.py      | 363 +++++++++++++++++++++
 .../api/test_deposit_private_update_status.py      | 146 ++++++---
 6 files changed, 508 insertions(+), 70 deletions(-)
 create mode 100644 swh/deposit/migrations/0018_migrate_swhids.py
Changes applied before test
commit a631dabb6fdd97030ccfcce0ec600e77f75a924e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed May 13 18:37:40 2020 +0200

    Migrate deposit SWHIDs (data) to the new specification
    
    Migrate both "recent" and "old" format deposits [1] to the new specification.
    
    That means the deposit swh_id* fields will be set to:
    - swh_id: directory SWHID (no context)
    - swh_id_context: directory SWHID (with context, origin, visit, anchor path)
    
    Optionally, those 2 fields will be kept (for now) and realigned where it was not
    set ("old" deposits) to:
    - swh_anchor_id: revision SWHID (no context)
    - swh_anchor_id_context: revision SWHID (context with only origin)
    
    It's expected some very "old" deposits won't be migrated as we cannot resolve
    those values. They will be rescheduled when it will be possible to do
    so (deploy [2]).
    
    [1] "recent" format means all swh_id fields are set:
    - swh_id: directory SWHID (no context)
    - swh_id_context: directory SWHID (context with only origin)
    - swh_anchor_id: revision SWHID (no context)
    - swh_anchor_id_context: revision SWHID (context with only origin)
    
    "old" format:
    - swh_id: revision SWHID (no context)
    - swh_id_context: not set
    - swh_anchor_id: not set
    - swh_anchor_id_context: not set
    
    [2] Related to D3141
    
    Related to T2398

commit fb97bf9832bed13516d62848934cd0eea6a29f81
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue May 12 15:33:00 2020 +0200

    Update deposit swhid to respect the latest specification update
    
    This also ensures the loader sends all the required information now (it did
    already but it was a bit laxed on checks for that part).
    
    Related to T2398

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/32/ for more details.

ardumont edited the test plan for this revision. (Show Details)May 15 2020, 12:21 PM
ardumont edited the test plan for this revision. (Show Details)

As I said, I can't test ;-) But I'll accept that.

Indeed.
Demonstration in the "Test Plan" of the diff (box following the diff description fields).