
Review which extrinsic metadata we want to fetch and archive
Started, Work in Progress, Normal, Public


Event Timeline

vlorentz triaged this task as Normal priority.
vlorentz added a comment. Edited Fri, May 24, 11:25 AM
| codemeta | schema.org | type | example | Description | etalab_name | etalab type | etalab_example | comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| name | name | string | RepoName | name of the repository | nom | chaîne de caractères | nom-repertoire | |
| author | author | Person | etalab | the author | organisation_nom | chaîne de caractères | etalab | |
| contributor | contributor | Person | etalab | secondary authors | | chaîne de | | |
| | | | | URL used to host it | plateforme | chaîne de caractères | GitHub | not to be confused with CodeMeta’s runtimePlatform |
| codeRepository | codeRepository | | | URL to the repository | repertoire_url | chaîne de caractères (format uri) | | |
| description | description | string | This repository is useful | repository description | description | chaîne de caractères | Ce répertoire est utile | |
| | | boolean | false | whether the repository is a fork | est_fork | booléen | false | |
| | isBasedOn | | | URL repo this repo is forked from (if any) | | | | |
| dateCreated | dateCreated | date | 2018-12-01T20:00:55Z | creation date | date_creation | date et heure | 2018-12-01T20:00:55Z | |
| dateModified | dateModified | date | 2018-12-01T20:00:55Z | update date | derniere_mise_a_jour | date et heure | 2018-12-01T20:00:55Z | |
| url | url | URL | https://etalab.gouv.fr | homepage URL | page_accueil | chaîne de caractères | | |
| | | int | 42 | number of people who added the repository to their favorites | nombre_stars | nombre entier | 42 | |
| | | int | 13 | number of times the repository was forked | nombre_forks | nombre entier | 13 | related property: @reverse → isBasedOn |
| license | license | | | URL of the repository, as detected or specified by the platform | licence | chaîne de caractères | MIT | |
| | | int | 0 | number of open issues | nombre_issues_ouvertes | nombre entier | 0 | |
| issueTracker | | | | URL to the bug tracker | | | | |
| programmingLanguage | programmingLanguage | string | Python | main language(s) as detected or specified by the platform | langage | chaîne de caractères | Python | |
| keywords | keywords | string | useful,france,opendata | | topics | chaîne de caractères | utile,france,opendata | |
| contIntegration | | | | URL to the continuous integration service | | | | |
| readme | | URL | | link to the README file | | | | |
| developmentStatus | | string | active | e.g. Active, inactive, suspended | | | | |
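
To make the correspondence between the etalab and CodeMeta columns concrete, here is a minimal sketch of a renaming step. It covers only the rows where both a CodeMeta name and an etalab name are listed. `ETALAB_TO_CODEMETA` and `etalab_to_codemeta` are hypothetical names made up for this illustration, not existing Software Heritage code, and the sample record reuses the example values from the table.

```python
# Field pairs taken from the table rows that list both a CodeMeta
# name and an etalab name. Rows without a CodeMeta name (plateforme,
# est_fork, nombre_stars, ...) are deliberately left out.
ETALAB_TO_CODEMETA = {
    "nom": "name",
    "organisation_nom": "author",
    "repertoire_url": "codeRepository",
    "description": "description",
    "date_creation": "dateCreated",
    "derniere_mise_a_jour": "dateModified",
    "page_accueil": "url",
    "licence": "license",
    "langage": "programmingLanguage",
    "topics": "keywords",
}


def etalab_to_codemeta(record: dict) -> dict:
    """Hypothetical helper: rename known etalab fields, set the rest aside."""
    translated = {}
    untranslated = {}
    for key, value in record.items():
        if key in ETALAB_TO_CODEMETA:
            translated[ETALAB_TO_CODEMETA[key]] = value
        else:
            # No CodeMeta equivalent in the table: keep it, do not drop it.
            untranslated[key] = value
    return {"codemeta": translated, "untranslated": untranslated}


example = {
    "nom": "nom-repertoire",
    "organisation_nom": "etalab",
    "plateforme": "GitHub",
    "est_fork": False,
    "nombre_stars": 42,
    "langage": "Python",
}
result = etalab_to_codemeta(example)
```

Fields with no CodeMeta counterpart are kept in a separate bucket rather than discarded, since the table leaves open how they should eventually be represented.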

@moranegg, @AntoineAugusti any ideas of stuff to add to this list, before I start reviewing what can actually be fetched from each forge?

vlorentz changed the task status from Open to Work in Progress. Fri, May 24, 11:36 AM

In addition, it would be nice to know the number of contributors and get a sense of how active the project is. A proxy for this could be the latest commit date on the main branch.

I did a quick review of the different forges a while ago, and GitHub seemed to expose the most metadata at the organisation level, which makes retrieval much easier.

> In addition, it would be nice to know the number of contributors

Good idea! I added it to the list.

> get a sense of how active the project is. A proxy for this could be the latest commit date on the main branch.

There's developmentStatus, but I highly doubt many projects define it (I've seen it as a badge on some GitHub repos, but it's rare). Using "dateModified" does seem like a good idea.
Actually, we can derive "dateModified" from data already in the SWH archive: on each visit of a repo we take a snapshot of the repo and hash it, so it's just a matter of listing the visits and finding the last change to this hash. But it's rather coarse-grained, since we take a snapshot of each repo every one or two years.
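
The visit-based approach can be sketched as follows. This is only an illustration under assumptions: `Visit` and `last_change` are hypothetical names, and the real archive's visit records and API are not shown here. The idea is to walk back from the most recent visit for as long as the snapshot hash stays the same. The earliest visit in that run is the first time the current state was observed, which gives an upper bound on the actual modification date.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class Visit(NamedTuple):
    """Hypothetical visit record: when we visited, and the snapshot hash."""
    date: datetime
    snapshot_id: str


def last_change(visits: List[Visit]) -> Optional[datetime]:
    """Return the date of the earliest visit that saw the current snapshot."""
    ordered = sorted(visits, key=lambda v: v.date)
    if not ordered:
        return None
    latest = ordered[-1]
    changed_at = latest.date
    # Walk backwards while the snapshot hash is unchanged.
    for visit in reversed(ordered):
        if visit.snapshot_id != latest.snapshot_id:
            break
        changed_at = visit.date
    return changed_at


visits = [
    Visit(datetime(2016, 1, 1), "aaa"),
    Visit(datetime(2017, 6, 1), "bbb"),
    Visit(datetime(2018, 12, 1), "bbb"),
]
modified = last_change(visits)
```

With these made-up visits, the hash last changed between the 2016 and 2017 visits, so the function reports the 2017 visit date. The true change happened at some unknown point in that interval, which is the coarseness mentioned above.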

I think that we should fetch all the metadata we find, in its raw form (keep it in XML if it is XML, etc.).
We can then apply translation techniques with CodeMeta, while identifying the relevant metadata we want to keep in a translated format,
so there is no need to discriminate and choose what to fetch.
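
A minimal sketch of this "store raw, translate later" idea, assuming a JSON payload for simplicity. `RawMetadata` and `translate` are hypothetical names for illustration only, and the mapping passed in is an assumption of the example, not an agreed crosswalk.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RawMetadata:
    """Hypothetical record: the payload is archived exactly as fetched."""
    origin: str
    fmt: str        # e.g. "json" or "xml"
    payload: bytes


def translate(raw: RawMetadata, mapping: dict) -> dict:
    """Translate the fields we know how to map; the raw payload is untouched."""
    if raw.fmt != "json":
        # Other formats would need their own parser; nothing is lost,
        # because the raw payload is still stored.
        return {}
    data = json.loads(raw.payload)
    return {mapping[key]: value for key, value in data.items() if key in mapping}


raw = RawMetadata(
    origin="https://example.org/repo",
    fmt="json",
    payload=b'{"nom": "nom-repertoire", "nombre_stars": 42}',
)
translated = translate(raw, {"nom": "name"})
```

Because translation is a separate step over an immutable record, fields that have no mapping today (such as nombre_stars here) can still be translated later without fetching anything again.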

Here are a few rarer metadata fields that are useful in certain use cases:
datePublished: we use it for HAL
referencedPublication: used for software citation
releaseNotes: will be used in the deposit use case for creating releases