Document takedown request processing workflow
As our project grows in size and visibility, we'll have to respond to more takedown requests.

Expunging data from Software Heritage is an involved process: even within SWH itself, all archive data is replicated across multiple systems (PostgreSQL, Kafka, and Elasticsearch, to name a few), each with different behaviors. Adding mirrors to the mix makes the process even more arduous.

Finally, as a project of public interest, we have a duty to be transparent about which operations were carried out in response to any given takedown request.

This task is two-fold: track which systems replicate which data, and work out how to clear data from each of them in response to a takedown. Once the brainstorming is over, the result can be used as the basis for workflow documentation.
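
To make the brainstorming easier to review, the inventory of systems could be kept as structured data rather than prose. The following is only a minimal sketch: the `ReplicaSystem` class, its fields, and the `holds` values are hypothetical placeholders, not a description of how these systems actually behave.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaSystem:
    """One system holding a copy of archive data (hypothetical model)."""

    name: str
    # Kinds of archive data replicated in this system.
    holds: set = field(default_factory=set)
    # How to clear data from this system; "TODO" until documented.
    takedown_action: str = "TODO"


# System names come from the task description; the `holds` values are
# illustrative assumptions and must be checked against the real deployment.
INVENTORY = [
    ReplicaSystem("PostgreSQL", {"origin"}),
    ReplicaSystem("Kafka", {"origin"}),
    ReplicaSystem("Elasticsearch", {"origin"}),
]


def systems_holding(kind):
    """Return the sorted names of systems that replicate the given kind of data."""
    return sorted(s.name for s in INVENTORY if kind in s.holds)


def undocumented():
    """Return the sorted names of systems whose takedown action is still a TODO."""
    return sorted(s.name for s in INVENTORY if s.takedown_action == "TODO")
```

Calling `undocumented()` then gives a running list of systems for which the clearing procedure still has to be written down.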

Knobs to adjust the visibility of origins in the archive and in the web API

  • origins index (used for origin search results)
    • blocklisted field
  • origin table (joined on all web API requests anchored on an origin URL; also used for search when Elasticsearch is disabled)
    • (hack) update the url field to make the data harder to find
    • TODO: add an actual blocklist to disable display of an origin, probably in swh.web?
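
The two visibility knobs above (a blocklisted flag on index entries, plus a separate blocklist of origin URLs) could be combined in a single filter. This is a rough sketch under stated assumptions: `OriginHit`, `visible_hits`, and the boolean shape of the `blocklisted` field are hypothetical and do not reflect existing swh.web or swh.search code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OriginHit:
    """A search result entry; `blocklisted` is assumed to be a boolean flag."""

    url: str
    blocklisted: bool = False


def visible_hits(hits, extra_blocklist=frozenset()):
    """Drop hits flagged in the index or whose URL is in a separate blocklist."""
    return [
        hit
        for hit in hits
        if not hit.blocklisted and hit.url not in extra_blocklist
    ]
```

For example, given hits for `https://example.org/a` (not flagged) and `https://example.org/b` (flagged), `visible_hits` returns only the first. Filtering at display time hides an origin without deleting anything, so it is not a substitute for expunging the underlying data.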

Do we also intend to have a takedown topic on Kafka?

Also: what about the exports we provide on git-annex?

