Collect metadata about software from ScanR
Closed, Migrated; Edits Locked


ScanR is a platform that indexes outputs from French research. It has recently started to create links between research articles and software archived in Software Heritage.

We plan to collect the metadata generated by ScanR about software origins archived in Software Heritage, as software origin metadata.

This can be done in two ways:

  • [preferred] as a deposit from ScanR into SWH (like the metadata-only deposit planned for HAL, see T1021)
  • [fragile] as a pull operation, with SWH extracting information from the ScanR API

Event Timeline

moranegg triaged this task as Normal priority. (Mar 20 2020, 3:28 PM)
moranegg created this task.
rdicosmo renamed this task from "ScanR -SWH Collaboration" to "Collect metadata about software from ScanR". (Mar 20 2020, 6:56 PM)
rdicosmo updated the task description.

Here is a sample request provided by the ScanR developers to retrieve all the metadata entries about software in ScanR:

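import requests

# The endpoint URL is missing from the original comment; set it before running.
SCANR_SEARCH_URL = "..."
params = {"query":"","sourceFields":["authors","oaEvidence","id","title","summary","domains","affiliations","links","productionType","publicationDate","isOa"],"pageSize":10000,"filters":{"links.type":{"type":"MultiValueSearchFilter","op":"any","values":["software_heritage"]}}}
# requests.post is assumed here, since the payload is passed via json=.
r = requests.post(SCANR_SEARCH_URL, json=params)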
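
The payload in the sample request could be built by a small helper so that the link type and page size are easy to change. This is only a sketch: the name `build_scanr_query` and its defaults are illustrative and not part of any ScanR client, while the field names and filter structure are copied from the sample request.

```python
import json

# Field list copied from the sample request provided by the ScanR developers.
SOURCE_FIELDS = [
    "authors", "oaEvidence", "id", "title", "summary", "domains",
    "affiliations", "links", "productionType", "publicationDate", "isOa",
]


def build_scanr_query(link_type="software_heritage", page_size=10000):
    """Build the search payload (hypothetical helper mirroring the sample request)."""
    return {
        "query": "",
        "sourceFields": list(SOURCE_FIELDS),
        "pageSize": page_size,
        "filters": {
            "links.type": {
                "type": "MultiValueSearchFilter",
                "op": "any",
                "values": [link_type],
            }
        },
    }


if __name__ == "__main__":
    # Print the payload that would be sent as the JSON body of the request.
    print(json.dumps(build_scanr_query(), indent=2))
```

With the defaults, the helper returns the same payload as the sample request.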

And this is an excerpt from the result, which is returned in JSON format:

"id": "doi10.1002/bimj.201500098",
"links": [{'type': 'software_heritage', 'url': ''}],
"title": {'default': 'Bayesian model selection in logistic regression for the detection of adverse drug reactions'},
"summary": {'default': 'Spontaneous adverse event reports have a high potential for detecting adverse drug reactions. However, due to their dimension, the analysis of such databases requires statistical methods. In this context, disproportionality measures can be used. Their main idea is to project the data onto contingency tables in order to measure the strength of associations between drugs and adverse events. However, due to the data projection, these methods are sensitive to the problem of coprescriptions and masking effects. Recently, logistic regressions have been used with a Lasso type penalty to perform the detection of associations between drugs and adverse events. On different examples, this approach limits the drawbacks of the disproportionality methods, but the choice of the penalty value is open to criticism while it strongly influences the results. In this paper, we propose to use a logistic regression whose sparsity is viewed as a model selection challenge. Since the model space is huge, a Metropolis-Hastings algorithm carries out the model selection by maximizing the BIC criterion. Thus, we avoid the calibration of penalty or threshold. During our application on the French pharmacovigilance database, the proposed method is compared to well-established approaches on a reference dataset, and obtains better rates of positive and negative controls. However, many signals (i.e., specific drug-event associations) are not detected by the proposed method. So, we conclude that this method should be used in parallel to existing measures in pharmacovigilance. Code implementing the proposed method is available at the following url:'}
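
To turn such results into software origin metadata, each entry's `software_heritage` links would need to be paired with the publication they come from. Here is a minimal sketch, assuming each result entry is a dict shaped like the excerpt above. The helper name `swh_links` and the URLs in the sample entry are hypothetical.

```python
def swh_links(entry):
    """Return (publication id, title, url) tuples for each Software Heritage link.

    Assumes `entry` is shaped like the excerpt above; missing keys are tolerated.
    """
    title = (entry.get("title") or {}).get("default", "")
    return [
        (entry.get("id"), title, link.get("url", ""))
        for link in entry.get("links") or []
        if link.get("type") == "software_heritage"
    ]


if __name__ == "__main__":
    # Hypothetical entry: the URLs are placeholders, not values returned by ScanR.
    sample = {
        "id": "doi10.1002/bimj.201500098",
        "links": [
            {"type": "software_heritage", "url": "https://example.org/origin"},
            {"type": "other", "url": "https://example.org/other"},
        ],
        "title": {"default": "Bayesian model selection in logistic regression"},
    }
    for row in swh_links(sample):
        print(row)
```

Each returned tuple keeps the publication identifier next to the software URL, which is the association a metadata deposit or a pull operation would have to record.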