batch API to check for the presence of content in the archive
Closed, Migrated

Description

The only way we currently offer to check for the presence of a content in the archive is the /content endpoint.

For use cases such as serving as a backend for source code scanners, it would be nice to have a batch version of the same endpoint: it would be more efficient and consume fewer rate-limit "credits", with limited impact on our side. If needed, it would be OK for such an endpoint to return less information about each individual content than the current /content endpoint does (the bare minimum we need is a list of booleans stating whether each content is in the archive or not).

For the batch part, we can accept a POST request body with the list of content identifiers to check, as opposed to the current single id we accept via GET. That might justify overloading the current /content method (dispatching on GET vs. POST), depending on how similar the return values will be in the two cases.
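To make the shape of the idea concrete, here is a minimal sketch. It assumes a JSON array of identifiers as the POST body and a JSON array of booleans, in the same order, as the response. Neither the wire format nor the helper names (build_batch_body, pair_with_presence) are decided anywhere in this task; they are purely illustrative.

```python
import json


def build_batch_body(content_ids):
    """Serialize content identifiers as a JSON array (assumed wire format)."""
    return json.dumps(list(content_ids)).encode("utf-8")


def pair_with_presence(content_ids, presence):
    """Pair each identifier with the boolean at the same position.

    Assumes the server answers in request order, which a bare list of
    booleans would require in order to be interpretable at all.
    """
    content_ids = list(content_ids)
    presence = list(presence)
    if len(content_ids) != len(presence):
        raise ValueError("response length does not match request length")
    return dict(zip(content_ids, presence))
```

A bare list of booleans only makes sense if order is preserved. A response keyed by identifier would drop that requirement, at the cost of a larger payload.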

Event Timeline

zack triaged this task as Normal priority. Jun 7 2019, 10:44 AM
zack created this task.

Can we have a feature that returns the actual File Type, Language Type, and License content, not just its URL?

I'm not sure yet at this stage, but, tentatively: I don't think so.

That smells like conflating a lot of different things into a single API endpoint. It would also be a trade-off: the more information we add to the response, the smaller the maximum batch size will need to be.

Stay tuned here for actual API proposals/discussions.

For batch boolean file existence tests, there's the (undocumented, but used by the search box on the main softwareheritage.org website) https://archive.softwareheritage.org/api/1/content/known/search/ API endpoint, which allows you to post arbitrary SHA1s to be checked.

Example usage:

curl -XPOST -d file1=495fe31da0d856520fbffa39757d870aa138f235 -d file2=dc9d6333f297aa4d7e5d6a622adc2182846b8b1f https://archive.softwareheritage.org/api/1/content/known/search/ | jq .

returns

{
  "search_res": [
    {
      "found": true,
      "filename": "file2",
      "sha1": "dc9d6333f297aa4d7e5d6a622adc2182846b8b1f"
    },
    {
      "found": true,
      "filename": "file1",
      "sha1": "495fe31da0d856520fbffa39757d870aa138f235"
    }
  ],
  "search_stats": {
    "nbfiles": 2,
    "pct": 100
  }
}
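
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. The URL, form fields, and response shape are taken from the example above. The function names are made up for illustration, and since the endpoint is undocumented its behaviour may change.

```python
import json
import urllib.parse
import urllib.request

# Endpoint quoted in the comment above; undocumented, so it may change.
KNOWN_SEARCH_URL = "https://archive.softwareheritage.org/api/1/content/known/search/"


def build_known_search_request(sha1_by_name):
    """Build (but do not send) the form-encoded POST that the curl call makes."""
    body = urllib.parse.urlencode(sha1_by_name).encode("ascii")
    return urllib.request.Request(KNOWN_SEARCH_URL, data=body, method="POST")


def found_by_sha1(payload):
    """Map each SHA1 in a response shaped like the one above to its 'found' flag."""
    return {entry["sha1"]: entry["found"] for entry in payload["search_res"]}


def check_known(sha1_by_name, timeout=30):
    """Send the request and return {sha1: found}. Needs network access."""
    request = build_known_search_request(sha1_by_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return found_by_sha1(json.load(response))
```

Calling check_known({"file1": "495fe31da0d856520fbffa39757d870aa138f235"}) would send one hash. The dictionary keys play the role of the file1/file2 form field names in the curl call.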
zack closed this task as Resolved. Sep 17 2020, 3:00 PM
zack claimed this task.

This has been addressed, in a more general way that works for any SWHID, in D2582 by @DanSeraf.