
Review which extrinsic metadata we want to fetch and archive
Closed, Migrated

Description

Event Timeline

vlorentz triaged this task as Normal priority. May 24 2019, 10:36 AM
vlorentz created this task.
| codemeta | schema.org | type | example | Description | etalab_name | etalab type | etalab_example | comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| name | name | string | RepoName | name of the repository | nom | chaîne de caractères | nom-repertoire | |
| author | author | Person | etalab | the author | organisation_nom | chaîne de caractères | etalab | |
| contributor | contributor | Person | etalab | secondary authors | | chaîne de | | |
| | | URL | https://github.com/ | platform/forge used to host it | plateforme | chaîne de caractères | GitHub | not to be confused with CodeMeta’s runtimePlatform |
| codeRepository | codeRepository | URL | https://github.com/etalab/nom-repertoire | URL to the repository | repertoire_url | chaîne de caractères (format uri) | https://github.com/etalab/nom-repertoire | |
| description | description | string | This repository is useful | repository description | description | chaîne de caractères | Ce répertoire est utile | |
| | | boolean | false | whether the repository is a fork | est_fork | booléen | false | |
| | isBasedOn | URL | https://github.com/etalab/base-repo | the repo this repo is forked from (if any) | | | | |
| dateCreated | dateCreated | date | 2018-12-01T20:00:55Z | creation date | date_creation | date et heure | 2018-12-01T20:00:55Z | |
| dateModified | dateModified | date | 2018-12-01T20:00:55Z | update date | derniere_mise_a_jour | date et heure | 2018-12-01T20:00:55Z | |
| url | url | URL | https://etalab.gouv.fr | homepage URL | page_accueil | chaîne de caractères | https://etalab.gouv.fr | |
| | | int | 42 | number of people who added the repository to their favorites | nombre_stars | nombre entier | 42 | |
| | | int | 13 | number of times the repository was forked | nombre_forks | nombre entier | 13 | related schema.org property: @reverse → isBasedOn |
| license | license | URL | https://spdx.org/licenses/MIT | license of the repository, as detected or specified by the platform | licence | chaîne de caractères | MIT | |
| | | int | 0 | number of open issues | nombre_issues_ouvertes | nombre entier | 0 | |
| issueTracker | | URL | https://github.com/etalab/repo/issues | link to the bug tracker | | | | |
| programmingLanguage | programmingLanguage | string | Python | main language(s) as detected or specified by the platform | langage | chaîne de caractères | Python | |
| keywords | keywords | string | useful,france,opendata | | topics | chaîne de caractères | utile,france,opendata | |
| contIntegration | | URL | https://travis.org/etalab/repo/ | link to the continuous integration service | | | | |
| readme | | URL | | link to the README file | | | | |
| developmentStatus | | string | active | e.g. Active, inactive, suspended | | | | |

(etalab types, in English: chaîne de caractères = string, booléen = boolean, nombre entier = integer, date et heure = date and time.)
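
To make the pairing above concrete, here is a minimal sketch of how an etalab-style record could be turned into a CodeMeta document. The function name `etalab_to_codemeta`, the choice to wrap the author as a `Person` (following the "type" column), and the assumption that the etalab `licence` value is an SPDX identifier are all illustrative, not something decided in this task. Fields with no CodeMeta or schema.org counterpart in the list (stars, forks, open issues, etc.) are simply dropped here.

```python
import json

# etalab field -> CodeMeta/schema.org property, as paired in the list above.
ETALAB_TO_CODEMETA = {
    "nom": "name",
    "repertoire_url": "codeRepository",
    "description": "description",
    "date_creation": "dateCreated",
    "derniere_mise_a_jour": "dateModified",
    "page_accueil": "url",
    "langage": "programmingLanguage",
    "topics": "keywords",
}

# Assumption: the etalab "licence" value is an SPDX identifier such as "MIT".
SPDX_BASE = "https://spdx.org/licenses/"


def etalab_to_codemeta(record: dict) -> dict:
    """Hypothetical helper: translate one etalab record into a CodeMeta dict."""
    doc = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
    }
    # Copy the fields that map one-to-one.
    for etalab_key, codemeta_key in ETALAB_TO_CODEMETA.items():
        if record.get(etalab_key) is not None:
            doc[codemeta_key] = record[etalab_key]
    # The list types "author" as a Person, so wrap the name accordingly.
    if record.get("organisation_nom"):
        doc["author"] = {"@type": "Person", "name": record["organisation_nom"]}
    # Turn the short licence name into the URL form shown in the list.
    if record.get("licence"):
        doc["license"] = SPDX_BASE + record["licence"]
    return doc


# Example record built from the example values in the list.
example = {
    "nom": "nom-repertoire",
    "organisation_nom": "etalab",
    "repertoire_url": "https://github.com/etalab/nom-repertoire",
    "licence": "MIT",
    "langage": "Python",
    "nombre_stars": 42,  # no CodeMeta equivalent in the list: ignored
}

if __name__ == "__main__":
    print(json.dumps(etalab_to_codemeta(example), indent=2, ensure_ascii=False))
```

Keeping the one-to-one pairs in a plain dictionary makes it easy to extend once the review of what each forge exposes is done.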

@moranegg, @AntoineAugusti any ideas of stuff to add to this list, before I start reviewing what can actually be fetched from each forge?

vlorentz changed the task status from Open to Work in Progress. May 24 2019, 11:36 AM

In addition, it would be nice to know the number of contributors and get a sense of how active the project is. The latest commit date on the main branch can serve as a proxy for that.

I did a quick review of the different forges a while ago, and GitHub seemed to expose the most metadata at the organisation level, which makes retrieval much easier.

> In addition, it would be nice to know the number of contributors

Good idea! I added it to the list.

> get a sense of how active the project is. The latest commit date on the main branch can serve as a proxy for that.

There's developmentStatus, but I highly doubt many projects define it (I've seen it as a badge on some GitHub repos, but it's rare). Using "dateModified" seems like a good idea indeed.
Actually, we can derive "dateModified" from data already in the SWH archive: on each visit of a repo we take a snapshot of the repo and hash it, so it's just a matter of listing the visits and finding the last change to this hash. It's rather coarse-grained, though, since we take a snapshot of each repo only every one or two years.
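
A rough sketch of that "find the last change to the hash" step, assuming the visits are already available as (date, snapshot hash) pairs. The `Visit` type and `last_change_date` function are hypothetical names for illustration, not part of any SWH API. The result is the date of the first visit that observed the current snapshot, so it is only an upper bound on the real modification date.

```python
from datetime import datetime
from typing import NamedTuple, Optional, Sequence


class Visit(NamedTuple):
    """Hypothetical record of one visit: when it happened and the snapshot hash."""
    date: datetime
    snapshot_id: str


def last_change_date(visits: Sequence[Visit]) -> Optional[datetime]:
    """Return the date of the earliest visit that saw the current snapshot hash."""
    ordered = sorted(visits, key=lambda v: v.date)
    if not ordered:
        return None
    latest_hash = ordered[-1].snapshot_id
    change_date = ordered[-1].date
    # Walk backwards while the hash is unchanged; stop at the first difference.
    for visit in reversed(ordered):
        if visit.snapshot_id != latest_hash:
            break
        change_date = visit.date
    return change_date


visits = [
    Visit(datetime(2016, 3, 1), "aaa"),
    Visit(datetime(2017, 6, 1), "bbb"),  # the repo changed between 2016 and here
    Visit(datetime(2018, 12, 1), "bbb"),  # nothing new since the 2017 visit
]
```

With the sample visits above, `last_change_date(visits)` gives the 2017 visit date: the repo changed at some point between the 2016 and 2017 visits, which shows how coarse the estimate is.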

I think that we should fetch all metadata found in its raw form (keep it in XML if it is XML, etc.), then apply translation techniques with CodeMeta while identifying the relevant metadata we want to keep in a translated format, so there is no need to discriminate and choose what to fetch.
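
One way to picture this "store raw, translate later" approach is the sketch below. Everything in it is an assumption made for illustration: the `RawMetadata` class, the `translate` function, and the two toy translators, which only pick out `name` and `description` from made-up JSON and XML layouts. The point is that the payload is stored exactly as fetched and the translation step can be rerun or extended at any time.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class RawMetadata:
    """Hypothetical container: metadata kept byte-for-byte as it was fetched."""
    format: str     # e.g. "json" or "xml"
    payload: bytes  # never modified after fetching


def _from_json(payload: bytes) -> dict:
    data = json.loads(payload)
    return {"name": data.get("name"), "description": data.get("description")}


def _from_xml(payload: bytes) -> dict:
    root = ET.fromstring(payload)
    # findtext looks for direct children of the root element.
    return {"name": root.findtext("name"), "description": root.findtext("description")}


# One translator per raw format; new formats can be added without refetching.
TRANSLATORS: Dict[str, Callable[[bytes], dict]] = {
    "json": _from_json,
    "xml": _from_xml,
}


def translate(raw: RawMetadata) -> dict:
    """Produce CodeMeta-style keys from a raw record, leaving the record untouched."""
    return TRANSLATORS[raw.format](raw.payload)


raw_xml = RawMetadata("xml", b"<repo><name>RepoName</name><description>This repository is useful</description></repo>")
raw_json = RawMetadata("json", b'{"name": "RepoName", "stars": 42}')
```

Because the raw payload is kept, a field that is ignored today (such as `stars` in the JSON sample) can still be translated later if a use case appears.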

Here are a few less common metadata fields that are useful in certain use cases:
- datePublished: we use it for HAL
- referencedPublication: used for software citation
- releaseNotes: will be used in the deposit use case for creating releases

Here is the list of metadata we worked on for HAL specifications:
https://forge.softwareheritage.org/P183

Can you link to the list of metadata fields that resolves this?

It's in my first comment on this task.