Add a public API endpoint to retrieve a set of files with a given name
Closed, Migrated; Edits Locked

Description

This would be very useful for building a dataset of files that follow a given format, for example to test parsers, and it doesn't require much work on our side.

Event Timeline

The object storage doesn't have content names, so it cannot address this feature as stated.

What the object storage could do (if it doesn't already) is provide an efficient streaming API to retrieve a large number of objects.

Alternatively, this task should be lifted up the architecture. But in that case it should probably be a filter block composable with the above one, i.e., an endpoint that lists (in an efficient/streaming way, again) all contents matching a given criterion. The list can then be used as input to the aforementioned method of the object storage.
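To make the composition concrete, here is a minimal sketch of the two blocks: a lazy filter that lists content ids by file name, feeding a lazy fetcher that streams the matching objects. Everything in it is an assumption made for illustration: `NAME_INDEX`, `OBJECTS`, `list_contents` and `stream_objects` are hypothetical in-memory stand-ins, not existing swh-storage or object storage APIs.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for an index mapping file names to content ids.
NAME_INDEX: List[Tuple[bytes, str]] = [
    (b"README.md", "id-1"),
    (b"setup.py", "id-2"),
    (b"README.md", "id-3"),
]

# Hypothetical stand-in for the object storage (content id -> raw bytes).
OBJECTS: Dict[str, bytes] = {
    "id-1": b"# one",
    "id-2": b"import setuptools",
    "id-3": b"# three",
}


def list_contents(
    index: Iterable[Tuple[bytes, str]],
    predicate: Callable[[bytes], bool],
) -> Iterator[str]:
    """Filter block: lazily yield ids of contents whose name matches."""
    for name, content_id in index:
        if predicate(name):
            yield content_id


def stream_objects(
    objects: Dict[str, bytes],
    content_ids: Iterable[str],
) -> Iterator[Tuple[str, bytes]]:
    """Retrieval block: lazily yield (id, data) for each requested id."""
    for content_id in content_ids:
        yield content_id, objects[content_id]


# Compose the two blocks; nothing is read until the result is consumed.
matching_ids = list_contents(NAME_INDEX, lambda name: name == b"README.md")
result = list(stream_objects(OBJECTS, matching_ids))
```

Because both blocks are generators, the filter can be swapped for any other criterion without touching the retrieval side, and neither side has to hold the full result set in memory.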

And if we get there, "list objects matching a given criterion" is something that could be made generic, covering more than just contents (as in: blobs), but we can also get there step by step.

cc: @seirl

The object storage doesn't have content names, so it cannot address this feature as stated.

Sorry, wrong tag; I meant to use Storage manager, not Object storage.

Hi guys. Any pointers on where to start?

@KShivendu The linked script is a start. As it is, it requires direct access to the DB, so you need to create abstractions for it in swh-storage and swh-web.