ClearlyDefined is a project whose goal is to collaboratively and semi-automatically curate information about Free/Open Source Software (FOSS) projects, including licensing and vulnerability information. As one of its main outputs, ClearlyDefined maintains an open data knowledge base that cross-references FOSS source code artifacts found in version control systems, package repositories, etc. with curated information about their licenses and vulnerabilities. The same source code artifacts are archived by Software Heritage for long-term preservation purposes. The goal of this task is to integrate ClearlyDefined and Software Heritage, for mutual benefit. Software Heritage will benefit from mirroring ClearlyDefined data, which can then be queried both while navigating the archive and at scale; ClearlyDefined will benefit from learning about the existence of FOSS projects that have not been analyzed for "clarity" yet.
There is currently a code repository here: https://forge.softwareheritage.org/source/swh-clearlydefined
The code in that repository maintains a mirror of ClearlyDefined's database and inserts data from this mirror into our database. It still needs some work and refactoring before it is production-ready.
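
To make the cross-referencing idea concrete, here is a minimal sketch of how a ClearlyDefined entry might be matched to an object in the archive. It is not taken from swh-clearlydefined. It assumes that ClearlyDefined identifies components by five-part coordinates (type/provider/namespace/name/revision) and that, for components of type "git", the revision is a full commit hash that can be turned into a Software Heritage revision identifier (SWHID). The names Coordinates, parse_coordinates and revision_swhid are hypothetical and exist only for this illustration.

```python
import re
from typing import NamedTuple, Optional


class Coordinates(NamedTuple):
    """Hypothetical container for a ClearlyDefined coordinate path."""
    type: str
    provider: str
    namespace: str
    name: str
    revision: str


# A full, lowercase SHA-1 commit hash (40 hexadecimal characters).
_SHA1_RE = re.compile(r"[0-9a-f]{40}")


def parse_coordinates(path: str) -> Coordinates:
    """Split 'type/provider/namespace/name/revision' into its five parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 path components, got {len(parts)}: {path!r}")
    return Coordinates(*parts)


def revision_swhid(coords: Coordinates) -> Optional[str]:
    """Return a revision SWHID for git coordinates, or None if not applicable.

    Assumption: only 'git' coordinates carry a commit hash as their revision;
    package coordinates (npm, pypi, ...) carry a version string instead and
    would need a different lookup, which this sketch does not attempt.
    """
    if coords.type != "git" or not _SHA1_RE.fullmatch(coords.revision):
        return None
    return f"swh:1:rev:{coords.revision}"
```

For a path such as git/github/some-owner/some-repo/ followed by a 40-character commit hash, revision_swhid returns the matching swh:1:rev: identifier; for a package coordinate such as npm/npmjs/-/lodash/4.17.21 it returns None, signalling that some other mapping would be needed.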