
Document takedown request processing workflow
Open, Normal, Public


As our project grows in size and visibility, we'll have to respond to more takedown requests.

Expunging data from Software Heritage is an involved process: even within SWH itself, all archive data is replicated across multiple systems (PostgreSQL, Kafka and Elasticsearch, to name a few), each with different behaviors. Adding mirrors to the mix makes the process even more arduous.

Finally, as a project of public interest, we have a duty to be transparent about which operations were performed in response to any given takedown request.

This task is two-fold: track which systems replicate which data, and work out how to clear data from each of them in response to a takedown. Once the brainstorming is over, the result can be used as a basis for workflow documentation.
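As a starting point for the brainstorming, here is a minimal sketch of a per-request log that tracks which systems have been handled. The system names are the ones mentioned above. The `TakedownLog` class, its fields and the request identifier are hypothetical, used only for illustration, and do not describe any existing SWH tooling.

```python
from dataclasses import dataclass, field

# Systems named in this task; a real inventory would be longer.
SYSTEMS = ("PostgreSQL", "Kafka", "Elasticsearch", "mirrors")


@dataclass
class TakedownLog:
    """Hypothetical record of what was done for one takedown request."""

    request_id: str
    systems: tuple = SYSTEMS
    done: dict = field(default_factory=dict)

    def record(self, system: str, note: str) -> None:
        """Note the operation performed on one system."""
        if system not in self.systems:
            raise ValueError(f"unknown system: {system}")
        self.done[system] = note

    def pending(self) -> list:
        """Systems that still hold the data, in inventory order."""
        return [s for s in self.systems if s not in self.done]


log = TakedownLog("example-request")
log.record("Elasticsearch", "origin flagged in the search index")
print(log.pending())
```

Keeping the performed operations next to the list of pending systems would give a single record to publish for transparency, whatever form the final workflow takes.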

Event Timeline

olasd triaged this task as Normal priority. Apr 12 2021, 4:33 PM
olasd created this task.

Knobs to adjust the visibility of origins in the archive and in the web API

  • origins index (used for origin search results)
    • blocklisted field
  • origin table (joined on all web API requests anchored on an origin URL; also used for search when Elasticsearch is disabled)
    • (hack) update the url field to make the data harder to find
    • TODO: add an actual blocklist to disable display of an origin, probably in swh.web?
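To illustrate the first knob, here is a minimal sketch of filtering search results on a `blocklisted` field. It assumes search hits arrive as plain dictionaries with `url` and `blocklisted` keys. The `visible_origins` function and the example URLs are hypothetical and are not part of swh.web or swh.search.

```python
def visible_origins(hits):
    """Drop search hits flagged as blocklisted.

    A missing "blocklisted" key is treated as "not blocklisted".
    """
    return [hit for hit in hits if not hit.get("blocklisted", False)]


hits = [
    {"url": "https://example.org/repo-a", "blocklisted": False},
    {"url": "https://example.org/repo-b", "blocklisted": True},
    {"url": "https://example.org/repo-c"},
]
for hit in visible_origins(hits):
    print(hit["url"])
```

Filtering at display time only hides the origin. The data itself stays in every system that replicates it.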

Do we also intend to have a takedown topic on Kafka?

Also: what about the exports we provide via git-annex?

Do we also intend to have a takedown topic on Kafka?

Response is