
Implement cran loader with package manager mechanism
Closed, Migrated, Edits Locked

Description

The extrinsic metadata are provided by the lister as the loading task parameters.

The intrinsic metadata lie in the DESCRIPTION file at the root of the tarball's tree [1].

DESCRIPTION uses a simple file format called DCF, the Debian control format.

See [2] for the necessary parsing tools.

[1] https://r-pkgs.org/description.html

[2] python3-debian

Event Timeline

ardumont triaged this task as Normal priority. Oct 1 2019, 1:29 PM
ardumont created this task.
ardumont updated the task description.
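import logging
from typing import Any, Dict

from debian.deb822 import Deb822  # provided by python3-debian

logger = logging.getLogger(__name__)


def parse_debian_control(filepath: str) -> Dict[str, Any]:
    """Parse the Debian control file at filepath into a flat dict"""
    metadata: Dict[str, Any] = {}
    logger.debug('Debian control file %s', filepath)
    # Use a context manager so the file handle is always closed
    with open(filepath) as f:
        for paragraph in Deb822.iter_paragraphs(f):
            logger.debug('paragraph: %s', paragraph)
            metadata.update(**paragraph)

    logger.debug('metadata parsed: %s', metadata)
    return metadata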

This seems to do the trick.
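
To make the format concrete, here is a minimal stdlib-only sketch of how a DCF file such as DESCRIPTION is laid out: `Field: value` lines, with continuation lines starting with whitespace. The function name `parse_dcf_text` and the sample content are hypothetical and only for illustration; it handles a single paragraph and none of the edge cases, so the Deb822 parser remains the safer choice for real use.

```python
from typing import Dict, Optional


def parse_dcf_text(text: str) -> Dict[str, str]:
    """Parse single-paragraph DCF text into a dict (illustrative sketch only)."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate paragraphs; ignored in this sketch
        if line[0] in " \t" and current is not None:
            # Continuation line: append to the previous field's value
            fields[current] += "\n" + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields


# Hypothetical DESCRIPTION content, not taken from a real package
sample = """Package: example
Version: 0.1.0
Date: 2019-10-01
Author: Jane Doe [aut, cre],
    John Smith [ctb]
Maintainer: Jane Doe <jane@example.org>
"""

metadata = parse_dcf_text(sample)
print(metadata["Package"], metadata["Date"])
```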

To get a look at the possible fields (for parsing the date and author), here is a sample of the artifacts listed by the cran lister [1].

There are several fields for the date (Date, Published, etc.) and for the author (Author, Maintainer), and their values vary in format.

[1] https://forge.softwareheritage.org/F3629258

Here is some sample analysis of the cran dataset, covering the publication "date" and "author" fields.

'Date' and 'Published' fields:

$ python ./analysis.py --with-date-repartition --dataset ./list-all-packages.R.json.gz
2019-10-03 19:46:04,095 24852 filepath: ./list-all-packages.R.json.gz
2019-10-03 19:46:04,304 24852 len(data): 15008
{'date_and_published': 9565, 'published': 5443}

Some extra work is needed to parse those:

$ python ./analysis.py --with-pattern-date-repartition --dataset ./list-all-packages.R.json.gz
2019-10-03 19:50:31,381 25448 filepath: ./list-all-packages.R.json.gz
2019-10-03 19:50:31,592 25448 len(data): 15008
2019-10-03 19:50:32,669 25448 Summary for 'Date' field
            {None: 5443,
             '%Y-%d-%m': 3854,
             '%Y-%m-%d': 9456,
             '%Y-%m-%d %H:%M:%S': 14,
             '%Y/%m/%d': 16,
             '%d %B %Y': 2,
             '%d %b %Y': 1,
             '%d.%m.%Y': 2,
             '%d.%m.%y': 1,
             '%d/%m/%Y': 7,
             'invalid': 49,
             'valid': 13353}
2019-10-03 19:50:32,669 25448 Unknown date format for 'Date' field
['Tue Dec 27 15:06:08 PST 2011',
 'Fabruary 21, 2012',
 '8-14-2013',
 '2019-05-28"',
 '2011-01',
 '04-12-2014',
 '2017-03-01 today',
 '2016-11-0110.1093/icesjms/fsw182',
 '2019-07-010',
 '2015-02.23',
 '2018-08-24, 10:40:10',
 '2013-October-16',
 'Aug 23, 2013',
 'Apr 12, 2013',
 '27-11-2014',
 '19-02-2013',
 '20013-12-30',
 '2019-09-26,',
 '9/25/2014',
 'Fri Jun 27 17:23:53 2014',
 '2016-08-017',
 '2019-02-07l',
 '2014-07',
 '28-04-2014',
 '2014-05',
 '2018-05-010',
 '04-14-2014',
 '2019-09-27 KST',
 '2019-05-08 14:17:31 UTC',
 '$Date$',
 'Wed May 21 13:50:39 CEST 2014',
 '2018-04-10 00:01:04 KST',
 '2019-09-27 KST',
 '2019-06-22 $Date$',
 '2014-11',
 '2019-08-25 10:45',
 '$Date: 2013-01-18 12:49:03 -0600 (Fri, 18 Jan 2013) $',
 '2015-7-013',
 'March 9, 2015',
 '2018-05-023',
 'Aug. 18, 2012',
 '2014-Dec-17',
 'March 01, 2013',
 '2017-04-08.',
 "Check NEWS file for changes: news(package='simSummary')",
 '2014-Apr-22',
 'Mon Jan 12 19:54:04 2015',
 'May 22, 2014',
 '2014-08-12 09:55:10 EDT']
2019-10-03 19:50:34,320 25448 Summary for 'Published' field
            {'%Y-%d-%m': 5880,
             '%Y-%m-%d': 15008,
             'valid': 20888}
2019-10-03 19:50:34,320 25448 Unknown date format for 'Published' field
[]
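
Given the spread of formats above, one possible approach is to try a list of known strptime patterns in order and return None for anything that matches none of them. This is only a sketch: `DATE_FORMATS` and `parse_date` are hypothetical names, not the loader's actual implementation. Note that the counts above overlap (every 'Published' value matches '%Y-%m-%d', and 5880 of them also match '%Y-%d-%m'), so the pattern order decides how ambiguous values are read; the sketch assumes ISO order first and keeps '%Y-%d-%m' as a last resort.

```python
from datetime import datetime
from typing import Optional

# Patterns taken from the 'Date' summary above; their order is an assumption.
DATE_FORMATS = [
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d",
    "%d %B %Y",
    "%d %b %Y",
    "%d.%m.%Y",
    "%d.%m.%y",
    "%d/%m/%Y",
    "%Y-%d-%m",  # only reached when the ISO reading is impossible
]


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Return the first successful parse of value, or None if nothing matches."""
    if not value:
        return None
    value = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next known pattern
    return None
```

Values such as 'Fabruary 21, 2012' or '$Date$' from the unknown list come back as None, which leaves the caller free to fall back to the 'Published' field.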

About 'Author' and 'Maintainer' fields:

$ python ./analysis.py --with-author-repartition --dataset ./list-all-packages.R.json.gz
2019-10-03 19:43:29,309 24451 filepath: ./list-all-packages.R.json.gz
2019-10-03 19:43:29,511 24451 len(data): 15008
{'maintainer_and_author': 15008}
$ python ./analysis.py --with-pattern-author-repartition --dataset ./list-all-packages.R.json.gz
2019-10-03 21:28:37,223 3731 filepath: ./list-all-packages.R.json.gz
2019-10-03 21:28:37,432 3731 len(data): 15008
2019-10-03 21:28:37,493 3731 Summary for 'Maintainer' field
{'ORPHANED': 62,
 "[\\'ØA-Za-z ].*": 14964,
 '[a-zA-Z].*\\n<[a-zA-Z0-9.@].*>': 19,
 '[Ø\\\'"a-zA-Z].*<[a-zA-Z0-9.@].*>.*': 14927,
 'valid': 29972}
2019-10-03 21:28:37,493 3731 Unknown format for 'Maintainer' field
[]
2019-10-03 21:28:37,553 3731 Summary for 'Author' field
{"[\\'ØA-Za-z ].*": 14979,
 '[Ø\\\'"a-zA-Z].*<[a-zA-Z0-9.@].*>.*': 3024,
 '\\n': 13,
 'invalid': 2,
 'valid': 18016}
2019-10-03 21:28:37,553 3731 Unknown format for 'Author' field
['"Shingo Yamamoto (gloops, Inc.)" [aut, cre],\n'
 '        RStudio, Inc. [cph],\n'
 '        Michael Bostock [ctb, cph] (D3.js library),\n'
 '        jQuery Foundation [cph] (jQuery library and jQuery UI library),\n'
 '        jQuery contributors [ctb, cph] (jQuery library; authors listed in '
 'inst/htmlwidgets/lib/jquery/jquery-AUTHORS.txt),\n'
 '        jQuery UI contributors [ctb, cph] (jQuery UI library; authors listed '
 'in inst/htmlwidgets/lib/jquery-ui/AUTHORS.txt)',
 '"DecisionPatterns [aut, cre]"']
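
According to the summary above, most 'Maintainer' values look like "Name <email>", sometimes with a newline before the bracket, and 'ORPHANED' appears as a special value. A minimal sketch of splitting such a value into a name and an email could look like this; `parse_maintainer` and the regular expression are hypothetical, and the free-form 'Author' field (roles in brackets, multiple people) would need more work than this.

```python
import re
from typing import Dict, Optional

# "Full Name <email>"; whitespace (including a newline) may precede the bracket
MAINTAINER_RE = re.compile(r"^\s*(?P<name>[^<]*?)\s*<(?P<email>[^>]+)>")


def parse_maintainer(value: str) -> Dict[str, Optional[str]]:
    """Split a Maintainer value into name and email (illustrative sketch only)."""
    value = value.strip()
    if value == "ORPHANED":
        return {"name": None, "email": None}
    match = MAINTAINER_RE.match(value)
    if match:
        # Drop surrounding quotes that some packages put around the name
        name = match.group("name").strip("'\" ") or None
        return {"name": name, "email": match.group("email").strip()}
    # No email found: keep the raw value as the name
    return {"name": value or None, "email": None}
```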

Removing the package loader implementation from the main task.
It does not block closing the main task.

ardumont claimed this task.

Deployed.