Modify deposit workflow to check duplicated POST requests
Open, Normal, Public

Description

After the session with Bruno this week, we saw that multiple requests for the same deposit, sent while earlier ones are still waiting for the workers, create a corner case: each request is treated as a different deposit and each is loaded into the archive separately. For example, this deposit (https://archive.softwareheritage.org/browse/origin/https://hal.archives-ouvertes.fr/hal-01862659/visits/) has 9 visits that are not related through the parent history.


Procedure:

  1. if external id exists
    1. if md5 identical
      1. calculate metadata hash
      2. if metadata hash identical
        1. return 400 //we have already received this deposit
    2. mark deposit with last identical external-id as parent-id
      1. if the parent has status 'rejected', iterate until the last non-rejected parent
  2. return 201 with new deposit-id
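
The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the deposit server's actual code: the `Deposit` record, its field names, the in-memory list standing in for the database, and the `"deposited"` initial status are all hypothetical. It also assumes that "identical" means identical to the most recent deposit with the same external id, and it walks back through deposits sharing that external id rather than following parent pointers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Deposit:
    # Hypothetical record; field names are illustrative, not the real schema.
    id: int
    external_id: str
    archive_md5: str
    metadata_hash: str
    status: str
    parent_id: Optional[int] = None


def handle_post(
    deposits: List[Deposit],
    external_id: str,
    archive_md5: str,
    metadata_hash: str,
) -> Tuple[int, Optional[Deposit]]:
    """Return (HTTP status code, new deposit or None)."""
    # Step 1: previous deposits with the same external id, oldest first.
    previous = [d for d in deposits if d.external_id == external_id]
    parent_id = None
    if previous:
        last = previous[-1]
        # Steps 1.1: same archive md5 and same metadata hash means we have
        # already received this deposit, so reject it.
        if last.archive_md5 == archive_md5 and last.metadata_hash == metadata_hash:
            return 400, None
        # Step 1.2: pick the most recent non-rejected deposit as parent.
        for candidate in reversed(previous):
            if candidate.status != "rejected":
                parent_id = candidate.id
                break
    # Step 2: accept the request as a new deposit.
    new = Deposit(
        id=len(deposits) + 1,
        external_id=external_id,
        archive_md5=archive_md5,
        metadata_hash=metadata_hash,
        status="deposited",  # assumed initial status
        parent_id=parent_id,
    )
    deposits.append(new)
    return 201, new
```

With this sketch, posting the same archive and metadata twice for one external id yields 201 and then 400, while a changed archive yields 201 with the earlier deposit as parent.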

Comment: when the parent is not in status 'done', the deposit can't be loaded

Event Timeline

moranegg created this task. Aug 29 2018, 4:08 PM
moranegg triaged this task as Normal priority.

Right.

To emphasize, there is no issue on the archive's side since everything is deduplicated. Everything is fine there.

The issue is on the deposit side. At the moment, it's possible to trigger multiple checks/loads (which all end up loading the same content, so a gazillion visits for nothing), because a client can send the same deposit multiple times, time and again.

So this is about:

  • enforcing checks on the deposit side to reject requests that are considered already done (and explaining why in the body of the 400 response) ~> implementation: hash computation on the request (archive and metadata, for example)
  • improving the current parent policy between deposit ids (as of today, the parent is only set if we have a deposit whose swh-id is filled in)
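
As a rough illustration of the hash computation mentioned in the first point, the sketch below fingerprints a request from its archive bytes and its metadata. It makes several assumptions: the metadata is available as a parsed, JSON-serializable dict (the real deposit metadata may be XML and need its own canonicalization), SHA-256 is an arbitrary choice for the metadata hash, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Tuple


def archive_md5(archive_bytes: bytes) -> str:
    """MD5 hex digest of the uploaded archive."""
    return hashlib.md5(archive_bytes).hexdigest()


def metadata_hash(metadata: dict) -> str:
    """Hash of the metadata that does not depend on key order.

    Assumes the metadata has been parsed into a JSON-serializable dict.
    """
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def request_fingerprint(archive_bytes: bytes, metadata: dict) -> Tuple[str, str]:
    """Pair of hashes; two requests with equal pairs are treated as duplicates."""
    return archive_md5(archive_bytes), metadata_hash(metadata)
```

Two requests carrying the same archive and the same metadata fields, even in a different key order, produce the same fingerprint, so the second one could be answered with a 400 whose body explains that the deposit was already received.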

Cheers,

A possible problem to check with this workflow:

hal-prod and hal-preprod use the same identifiers for different objects