When loading repositories with a deeply nested structure and lots of redundancy (e.g. repositories where someone repeatedly tagged the root of the repository instead of the trunk, generating nested tags/tagX/tags/tagY/tags/tagZ/tags/tagT/... structures), the SVN loader doesn't bound the size of the local checkout in any way.
This is a denial-of-service (DoS) vector that we should avoid.
I see two possibilities to handle these repositories:
- short-term DoS mitigation: limit the size, file count, and depth of the local checkout, and fail the load operation if any of these counts exceeds a threshold.
- long-term enablement for loading these repositories: replace the current "dumb" checkout method with a virtual, content-addressed filesystem.
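The short-term mitigation could be a simple guard run against the checkout directory. A minimal sketch, where the threshold names, defaults, and the `check_checkout_limits` helper are all hypothetical (not actual swh.loader.svn options):

```python
import os


class CheckoutSizeExceeded(Exception):
    """Raised when the local checkout exceeds a configured threshold."""


def check_checkout_limits(root: str,
                          max_bytes: int = 1 << 30,
                          max_files: int = 100_000,
                          max_depth: int = 50) -> None:
    """Walk the checkout under `root` and fail fast if it grows too large."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth > max_depth:
            raise CheckoutSizeExceeded(f"directory depth {depth} > {max_depth}")
        for name in filenames:
            file_count += 1
            if file_count > max_files:
                raise CheckoutSizeExceeded(f"file count {file_count} > {max_files}")
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
            if total_bytes > max_bytes:
                raise CheckoutSizeExceeded(f"checkout size {total_bytes} > {max_bytes}")
```

In practice the counters would be updated incrementally as the replayer writes files, rather than re-walking the tree, but the failure conditions are the same.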
Assuming these repositories are highly redundant, they shouldn't contain many distinct contents. Instead of checking everything out on disk, we should be able to maintain a merkle DAG in memory to represent the contents of the checkout. In concrete terms, this means replacing our current on-disk swh.loader.svn.replay.DirEditor with an in-memory version (probably using a data structure similar to swh.model.from_disk.Directory).
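To illustrate why an in-memory merkle DAG keeps such repositories cheap: identical file contents and identical subtrees hash to the same key, so each duplicated tags/ subtree is stored once no matter how deeply it is nested. A toy sketch (the `MerkleStore` class is hypothetical; swh.model.from_disk.Directory is far richer):

```python
import hashlib


class MerkleStore:
    """Content-addressed in-memory store: objects are keyed by their hash,
    so identical contents and identical directory trees are stored once."""

    def __init__(self):
        self.objects = {}  # hash -> bytes (content) or dict (directory)

    def add_content(self, data: bytes) -> str:
        h = hashlib.sha1(b"content:" + data).hexdigest()
        self.objects.setdefault(h, data)
        return h

    def add_directory(self, entries: dict) -> str:
        # entries maps entry name -> child hash; sort for a canonical encoding
        manifest = b"dir:" + b"\0".join(
            name.encode() + b"=" + child.encode()
            for name, child in sorted(entries.items())
        )
        h = hashlib.sha1(manifest).hexdigest()
        self.objects.setdefault(h, dict(entries))
        return h
```

With this structure, re-tagging the same tree under tags/tagX/tags/tagY/... just adds one small directory object per nesting level, instead of duplicating the whole tree on disk.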
Example repository exhibiting the behavior: https://svn.code.sf.net/p/sbproj/svn