
leverage Shodan scans to find and ingest the "penumbra" of FOSS
Open, Low, Public


A recent paper about "the penumbra of open source" uses an interesting approach to find Git repositories outside of the popular code hosting platforms. The authors leverage Shodan port scans and HTML fingerprints of code hosting web interfaces (e.g., <meta content="GitLab" property="og:site_name">) to identify self-hosted public Git repositories in the wild.
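
To make the fingerprinting idea concrete, here is a minimal sketch of checking a fetched page for the GitLab meta tag quoted above. It is not the paper's implementation: the names `KNOWN_FORGES` and `detect_forge` are hypothetical, only the GitLab fingerprint comes from this task, and how candidate hosts are obtained from Shodan is left out entirely.

```python
from html.parser import HTMLParser
from typing import Optional

# Hypothetical fingerprint table: og:site_name value -> forge type.
# Only the GitLab entry is taken from the task description; other
# forges would need their own verified fingerprints.
KNOWN_FORGES = {"GitLab": "gitlab"}


class _SiteNameParser(HTMLParser):
    """Records the content of the first <meta property="og:site_name"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.site_name: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.site_name is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:site_name":
            self.site_name = attr_map.get("content")


def detect_forge(html: str) -> Optional[str]:
    """Return the forge type matching the page's fingerprint, or None."""
    parser = _SiteNameParser()
    parser.feed(html)
    return KNOWN_FORGES.get(parser.site_name)
```

In a real pipeline the HTML would come from the landing pages of hosts surfaced by the port scans, and a match would only mark the host as a candidate forge to be listed, not as a confirmed set of public repositories.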

Their results are quite promising, and they also show that the repositories found are valuable: they include several repositories (research ones, but not only) that are not found on either GitHub or Software Heritage.

We should explore the feasibility of replicating the same approach in production to increase our "long tail" of archived projects.

Event Timeline

zack triaged this task as Low priority. Aug 10 2021, 12:19 PM
zack created this task.
zack updated the task description.