A recent paper about "the penumbra of open source" uses an interesting approach to find Git repositories outside of the popular code hosting platforms. They leverage Shodan port scans and HTML fingerprints of code hosting web interfaces (e.g., <meta content="GitLab" property="og:site_name">) to identify self-hosted public Git repositories in the wild.
Their results are quite promising, and they also show that the repositories found are valuable: they include several repositories (research ones, but not only) that are found on neither GitHub nor Software Heritage.
We should explore the feasibility of replicating the same approach in production to increase our "long tail" of archived projects.
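
As a starting point for the feasibility study, the HTML fingerprinting step mentioned above could look roughly like the sketch below. It is not the paper's implementation: the names detect_forge, SiteNameParser and KNOWN_FORGES are hypothetical, the only fingerprint included is the GitLab og:site_name tag quoted above, and fetching candidate pages (from Shodan results or otherwise) is left out.

```python
from html.parser import HTMLParser

# Hypothetical fingerprint table: og:site_name value (lowercased) -> forge type.
# Only the GitLab entry comes from the example above; other forges would need
# their own, separately verified fingerprints.
KNOWN_FORGES = {"gitlab": "gitlab"}


class SiteNameParser(HTMLParser):
    """Collect the content of the first <meta property="og:site_name"> tag."""

    def __init__(self):
        super().__init__()
        self.site_name = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.site_name is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:site_name":
            self.site_name = attributes.get("content")


def detect_forge(html):
    """Return the forge type fingerprinted in an HTML page, or None."""
    parser = SiteNameParser()
    parser.feed(html)
    if parser.site_name is None:
        return None
    return KNOWN_FORGES.get(parser.site_name.strip().lower())


if __name__ == "__main__":
    page = '<html><head><meta content="GitLab" property="og:site_name"></head></html>'
    print(detect_forge(page))
```

A match on this tag only suggests that a host runs a given forge; instances can be rebranded or private, so a production version would still need to confirm that public repositories are actually listed before scheduling them for archival.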