
Implement support for takedown notices (infra, admin tools, workflow)
Closed, Migrated. Edits Locked.

Description

Takedown notices are coming, and we need full support for dereferencing certain contents, as is done at https://github.com/github/dmca.

This involves several subtasks:

  • low-level support for blacklisting specified contents (not only URLs, but also SWHIDs), with support for regexps (see the sketch after this list)
  • admin interface to add/remove entries from the blacklist
  • a journal of these operations (what was added to or removed from the blacklist, when, and why)
  • a public webpage that maintains the list of accepted takedown notices
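
As a minimal sketch of how the first three bullets could fit together (all names here are hypothetical, not an existing swh API): a blocklist entry matches origin URLs or SWHIDs through a regexp, and every add/remove operation is recorded in a journal together with its reason and timestamp.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class BlocklistEntry:
    pattern: str      # regexp matched against an origin URL or a SWHID
    notice_ref: str   # reference of the takedown notice (e.g. a ticket id)
    reason: str

    def matches(self, identifier: str) -> bool:
        # identifier can be an origin URL or a SWHID such as
        # "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
        return re.search(self.pattern, identifier) is not None


@dataclass
class Blocklist:
    entries: List[BlocklistEntry] = field(default_factory=list)
    journal: List[Dict[str, str]] = field(default_factory=list)  # audit trail

    def _log(self, action: str, entry: BlocklistEntry, why: str) -> None:
        self.journal.append({
            "action": action,
            "pattern": entry.pattern,
            "notice_ref": entry.notice_ref,
            "why": why,
            "when": datetime.now(timezone.utc).isoformat(),
        })

    def add(self, entry: BlocklistEntry, why: str) -> None:
        self.entries.append(entry)
        self._log("add", entry, why)

    def remove(self, entry: BlocklistEntry, why: str) -> None:
        self.entries.remove(entry)
        self._log("remove", entry, why)

    def is_blocked(self, identifier: str) -> bool:
        return any(e.matches(identifier) for e in self.entries)
```

The admin interface and the public webpage would then be thin layers over such a structure: the former calling add/remove, the latter rendering the list of accepted notices.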

Event Timeline

rdicosmo merged a task: Unknown Object (Maniphest Task). Mar 11 2021, 8:13 PM
rdicosmo added a subscriber: douardda.
olasd removed olasd as the assignee of this task. Apr 12 2021, 4:15 PM

Are we planning to add a way to notify the mirrors of the takedown notices?
I'm just wondering whether it could be interesting to subscribe the staging environment to it, to ensure the content is also removed from staging (and also flagged to avoid any further ingestion).

Are we planning to add a way to notify the mirrors of the takedown notices?

Yeah, we'll have to do that.

What we (@rdicosmo and I) have been thinking of so far is providing mirrors with a feed of the following information:

  • reference of the takedown request
  • SWHID of the affected object
  • reason for takedown (maybe; it could be derived from the takedown request reference if we structure it properly; useful for automated processing, I guess)
  • decision taken by Software Heritage (hide / remove once / blocklist forever)

We'd expect mirror operators to follow the feed, and to make their own decisions about which actions to enact on their own infrastructure.
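
For illustration only, a single entry in such a feed might look roughly like this (the field names, the Decision enum and the notice reference are made up, not a proposed schema; the SWHID is just the usual documentation example):

```python
from enum import Enum


class Decision(Enum):
    # the three outcomes mentioned above
    HIDE = "hide"
    REMOVE_ONCE = "remove-once"
    BLOCKLIST_FOREVER = "blocklist-forever"


# One hypothetical feed entry, mirroring the four fields listed above.
feed_entry = {
    "notice_ref": "takedown-2021-0001",  # reference of the takedown request
    "swhid": "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2",
    "reason": "dmca",                    # optional, may be derivable from the notice
    "decision": Decision.BLOCKLIST_FOREVER.value,
}
```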

I'm just wondering whether it could be interesting to subscribe the staging environment to it, to ensure the content is also removed from staging

Once this scaffolding exists, it would certainly make sense to have it used to push the decisions from prod to staging.

(and also flagged to avoid any further ingestion).

For now my working assumption is that we'll remove objects *once* but we won't make the decision sticky. But I can see how having a sticky ingestion blocklist could be useful in some cases.
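
As a purely illustrative follow-up to the blocklist sketch in the task description (names are still hypothetical), a sticky policy would roughly amount to a check like this at ingestion time:

```python
from typing import Callable


def maybe_ingest(origin_url: str, blocklist, loader: Callable[[str], None]) -> bool:
    # blocklist is the hypothetical Blocklist object sketched above.
    # With a non-sticky, remove-once policy this check would not exist,
    # so previously removed content could be re-ingested later.
    if blocklist.is_blocked(origin_url):
        return False
    loader(origin_url)
    return True
```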

rdicosmo raised the priority of this task from Normal to High. Apr 13 2021, 2:53 PM

So what about exports of the archive available on git-annex?

In the most serious cases, we will be obliged to remove the offending content from these exports too.

One can imagine at least two ways to go:

  1. open up the export, track down the offending content, remove it or zero it out, then repack and replace the original export
  2. rebuild the export after removing the content from the archive

For 2., it would be handy to have timestamps on all objects (a feature mentioned in another thread), so one could rebuild an export with the same content (minus the removed objects) as the original export.
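
To make option 2 a bit more concrete, here is a rough sketch that assumes, hypothetically, that every archived object carries a first-seen timestamp and that an export can be reduced to a stream of SWHIDs (neither assumption matches the actual export formats):

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator, Set, Tuple

# Timestamp of the export being replaced (hypothetical value).
ORIGINAL_EXPORT_DATE = datetime(2021, 4, 1, tzinfo=timezone.utc)


def rebuild_export(
    archive_objects: Iterable[Tuple[str, datetime]],  # (SWHID, first-seen timestamp)
    blocked_swhids: Set[str],
) -> Iterator[str]:
    """Yield the SWHIDs of a fresh export equivalent to the original one:
    same timestamp cutoff, minus the content removed after a takedown."""
    for swhid, first_seen in archive_objects:
        if first_seen > ORIGINAL_EXPORT_DATE:
            continue  # ingested after the original export: not part of it
        if swhid in blocked_swhids:
            continue  # removed from the archive following a takedown
        yield swhid
```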

Any thoughts on this? Any other ways to handle this issue (short of simply removing the exports)?

So what about exports of the archive available on git-annex?

Those exports do not contain blobs, so if the takedowns to be handled only concern file contents, they should not be impacted.
They might be impacted in the case of takedowns related to metadata, e.g., commit messages.

In that case we can go with what Roberto suggests (in short: "hot-fixing" the exports), but that will take a significant amount of processing; for instance, graph compression will need to be redone from scratch. An alternative option, assuming that takedowns impacting metadata will be rare enough, would be to just pull the entire graph exports. Once we have regular graph exports (which could happen as often as monthly), the impact of doing so would be fairly limited.