The recent paper [[ https://arxiv.org/pdf/2106.15611.pdf | The penumbra of open source ]] used an interesting approach to find Git repositories outside of the popular code hosting platforms. They leveraged [[ https://www.shodan.io/ | Shodan ]] scans and HTML snippets of code hosting web interfaces (e.g., `<meta content="GitLab" property="og:site_name">`) to identify self-hosted public Git repositories in the wild.
Their results are quite promising, and they also show that the repositories found are valuable: they include several repositories, research ones among them, that are found on neither GitHub nor Software Heritage.
We should explore the feasibility of replicating the same approach in production to increase our "long tail" of archived projects.
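
As a starting point for the feasibility study, the HTML fingerprinting step could be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the `FINGERPRINTS` table and the `detect_forge` function are hypothetical names, only the GitLab marker comes from the description above (assuming the standard OpenGraph `og:site_name` property), and the step that fetches candidate hosts from Shodan is left out entirely.

```python
from html.parser import HTMLParser

# Hypothetical fingerprint table: (meta attribute, attribute value, content) -> forge label.
# Only the GitLab entry comes from the task description; others would have to be collected.
FINGERPRINTS = {
    ("property", "og:site_name", "GitLab"): "gitlab",
}


class _MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag in a page."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also end up here via handle_startendtag.
        if tag == "meta":
            self.metas.append(dict(attrs))


def detect_forge(html):
    """Return the forge label whose fingerprint matches the page, or None."""
    parser = _MetaCollector()
    parser.feed(html)
    for meta in parser.metas:
        for (attr, value, content), forge in FINGERPRINTS.items():
            if meta.get(attr) == value and meta.get("content") == content:
                return forge
    return None
```

In production, `detect_forge` would presumably be applied to the landing pages of candidate hosts, and a match would only be a hint that a forge lister is worth trying against that instance.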