batch API to check for the presence of content in the archive
Closed, Migrated

Assigned To

Authored By

	zack
	Jun 7 2019, 10:44 AM

Description

The only way we currently offer to check for the presence of a content in the archive is the /content endpoint.

For use cases such as serving as a backend for source code scanners, it would be nice to have a batch version of the same endpoint; it would be more efficient and consume fewer rate limit "credits", with limited impact on our side. If needed, it would be OK for such an endpoint to return less information about each individual content than the current /content endpoint does (the bare minimum we need is a list of booleans stating whether each content is in the archive or not).

For the batch part, we can accept a POST request whose body carries the list of content identifiers to check, as opposed to the single id we currently accept via GET. That might justify overloading the current /content method (dispatching on GET vs. POST), depending on how similar the return values will be in the two cases.
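The GET vs. POST dispatch described above could be sketched roughly as follows. This is only an illustration of the idea, not a proposal for the actual API: `make_handler`, `handle_content`, and the `is_archived` lookup are hypothetical names, and the response shapes are assumptions.

```python
from typing import Callable, Dict, List, Optional, Sequence, Union


def make_handler(is_archived: Callable[[str], bool]):
    """Build a /content-like handler around a lookup function.

    `is_archived` is a hypothetical stand-in for the archive's storage query.
    """

    def handle_content(
        method: str,
        content_id: Optional[str] = None,
        body: Sequence[str] = (),
    ) -> Union[Dict[str, object], List[bool]]:
        if method == "GET":
            # Single lookup: room for richer per-content information.
            return {"id": content_id, "found": is_archived(content_id)}
        if method == "POST":
            # Batch lookup: bare minimum, one boolean per requested id,
            # in the same order as the request body.
            return [is_archived(cid) for cid in body]
        raise ValueError(f"unsupported method: {method}")

    return handle_content
```

Keeping the batch answer in request order lets the client zip it back against the ids it sent, which is what makes a plain list of booleans sufficient.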

Related Objects

Mentioned In: T1804: Software Heritage api to accept batch request from FOSSology
Mentioned Here: D2582: Web API endpoint /known/

Event Timeline

zack triaged this task as Normal priority. (Jun 7 2019, 10:44 AM)

zack created this task.

zack mentioned this in T1804: Software Heritage api to accept batch request from FOSSology. (Jun 14 2019, 12:06 PM)

Can we have a feature that returns the actual file type, language type, and license, rather than their URLs?

In T1789#33370, @sandipbhuyan wrote:

Can we have a feature that returns the actual file type, language type, and license, rather than their URLs?

I'm not sure yet at this stage, but, tentatively: I don't think so.

That smells like conflating a lot of different things into a single API endpoint. It would also be a trade-off: the more information we add to the response, the smaller the maximum batch size will need to be.

Stay tuned here for actual API proposals/discussions.

For batch boolean file existence tests, there's the (undocumented, but used by the search box on the main softwareheritage.org website) https://archive.softwareheritage.org/api/1/content/known/search/ API endpoint, which allows you to post arbitrary SHA1s to be checked.

Example usage:

curl -XPOST -d file1=495fe31da0d856520fbffa39757d870aa138f235 -d file2=dc9d6333f297aa4d7e5d6a622adc2182846b8b1f https://archive.softwareheritage.org/api/1/content/known/search/ | jq .

This returns:

{
  "search_res": [
    {
      "found": true,
      "filename": "file2",
      "sha1": "dc9d6333f297aa4d7e5d6a622adc2182846b8b1f"
    },
    {
      "found": true,
      "filename": "file1",
      "sha1": "495fe31da0d856520fbffa39757d870aa138f235"
    }
  ],
  "search_stats": {
    "nbfiles": 2,
    "pct": 100
  }
}
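For scripted use, the same request can be issued from Python with only the standard library. The sketch below mirrors the curl call above and relies solely on the response fields shown in the example (`search_res`, `filename`, `found`); the helper names are made up for illustration, and since the endpoint is undocumented its behaviour may change.

```python
import json
from typing import Dict
from urllib import parse, request

SEARCH_URL = "https://archive.softwareheritage.org/api/1/content/known/search/"


def build_form_body(files: Dict[str, str]) -> bytes:
    """Encode {label: sha1} pairs as a form body, like curl's repeated -d."""
    return parse.urlencode(files).encode("ascii")


def found_by_name(response: dict) -> Dict[str, bool]:
    """Map each submitted label to its 'found' flag."""
    return {entry["filename"]: entry["found"] for entry in response["search_res"]}


def check_known(files: Dict[str, str], url: str = SEARCH_URL,
                timeout: float = 30.0) -> Dict[str, bool]:
    """POST the checksums and return {label: found}. Performs a network call."""
    req = request.Request(url, data=build_form_body(files), method="POST")
    with request.urlopen(req, timeout=timeout) as resp:
        return found_by_name(json.load(resp))
```

Note that the results in the example response are not in submission order, so matching on `filename` (or `sha1`) is safer than relying on position.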

This has been addressed, in a more general way that works for any SWHID, in D2582 by @DanSeraf.

zack added a subscriber: DanSeraf. (Sep 17 2020, 3:00 PM)
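To query by SWHID rather than by SHA1, a client first needs the identifier of each file. For contents, a SWHID has the form `swh:1:cnt:<hex>`, where the hex part is the same hash Git assigns to a blob. The sketch below computes it locally; the shape of the batch request body (a JSON list of SWHIDs) is an assumption not confirmed in this task, and the helper names are illustrative only.

```python
import hashlib
import json
from typing import Iterable


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file's content (Git blob hash of the bytes)."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def known_request_body(blobs: Iterable[bytes]) -> str:
    """Serialize SWHIDs as a JSON list (assumed request format)."""
    return json.dumps([content_swhid(blob) for blob in blobs])
```

Hashing on the client side means only identifiers, not file contents, need to be sent to the archive.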

This task has been migrated to GitLab.