Add minimal GitHub metadata mapping
ClosedPublic

Authored by vlorentz on Jun 29 2022, 6:03 PM.

Details

Summary

This introduces the scaffolding for extrinsic metadata mappings.

Depends on D8048.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
extrinsic
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 30138
Build 47103: Phabricator diff pipeline on jenkins
Build 47102: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D8053 (id=29052)

Could not rebase; Attempt merge onto 1be4e184d4...

Updating 1be4e18..8fa06ba
Fast-forward
 swh/indexer/metadata.py                            |   41 +-
 swh/indexer/metadata_detector.py                   |    8 +-
 swh/indexer/metadata_dictionary/__init__.py        |   12 +-
 swh/indexer/metadata_dictionary/base.py            |   48 +-
 swh/indexer/metadata_dictionary/github.py          |   41 +
 swh/indexer/metadata_dictionary/npm.py             |    4 +-
 swh/indexer/metadata_dictionary/ruby.py            |    7 +-
 swh/indexer/tests/metadata_dictionary/__init__.py  |    0
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  220 ++++
 .../tests/metadata_dictionary/test_codemeta.py     |  175 +++
 .../tests/metadata_dictionary/test_github.py       |  113 ++
 .../tests/metadata_dictionary/test_maven.py        |  365 ++++++
 swh/indexer/tests/metadata_dictionary/test_npm.py  |  322 +++++
 .../tests/metadata_dictionary/test_python.py       |  114 ++
 swh/indexer/tests/metadata_dictionary/test_ruby.py |  134 ++
 swh/indexer/tests/test_metadata.py                 | 1277 --------------------
 swh/indexer/tests/zz_celery/README                 |    2 +
 swh/indexer/tests/zz_celery/__init__.py            |    0
 swh/indexer/tests/{ => zz_celery}/test_tasks.py    |    0
 19 files changed, 1564 insertions(+), 1319 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/__init__.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_cff.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_codemeta.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_maven.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_npm.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_python.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_ruby.py
 create mode 100644 swh/indexer/tests/zz_celery/README
 create mode 100644 swh/indexer/tests/zz_celery/__init__.py
 rename swh/indexer/tests/{ => zz_celery}/test_tasks.py (100%)
Changes applied before test
commit 8fa06ba290c342c3196b4d58309d1b6c485881b1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 17:53:03 2022 +0200

    Add minimal GitHub metadata mapping
    
    This introduces the scaffholding for extrinsic metadata mappings

commit 244bf36f55fb919f9b9da8503db309a6d816fd30
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 13:43:14 2022 +0200

    Move mapping-specific tests to a new directory
    
    We have many of those now; and keeping them all their tests in the same file
    is messy
    
    This causes these tests to run after Celery tests, which breaks them;
    so this commit also renames Celery tests to make them run last.

commit e002b2ee66b305c98a153cc2b57088c179a3fc68
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:19:00 2022 +0200

    Remove given/when/then comments

commit 65edef32831949de7b8e14846ecd4fa43bc619ee
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:08:49 2022 +0200

    Remove SingleFileMapping from JsonMapping's base classes
    
    Extrinsic metadata indexers will not use a 'file' as input,
    but will typically use RawExtrinsicMetadata containing formats
    in JSON.

commit f7a4bf4e04b3ac4c2fa89cf9b8a5c22e5f0c4d12
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 11:01:35 2022 +0200

    Add typing to detect_metadata() and related functions

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/291/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/291/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 29 2022, 6:09 PM
Harbormaster failed remote builds in B30138: Diff 29052!

Build is green

Patch application report for D8053 (id=29062)

Could not rebase; Attempt merge onto 1be4e184d4...

Updating 1be4e18..07074b9
Fast-forward
 swh/indexer/metadata.py                            |   41 +-
 swh/indexer/metadata_detector.py                   |    8 +-
 swh/indexer/metadata_dictionary/__init__.py        |   12 +-
 swh/indexer/metadata_dictionary/base.py            |   48 +-
 swh/indexer/metadata_dictionary/github.py          |   41 +
 swh/indexer/metadata_dictionary/npm.py             |    4 +-
 swh/indexer/metadata_dictionary/ruby.py            |    7 +-
 swh/indexer/tests/metadata_dictionary/__init__.py  |    0
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  220 ++++
 .../tests/metadata_dictionary/test_codemeta.py     |  175 +++
 .../tests/metadata_dictionary/test_github.py       |  113 ++
 .../tests/metadata_dictionary/test_maven.py        |  365 ++++++
 swh/indexer/tests/metadata_dictionary/test_npm.py  |  322 +++++
 .../tests/metadata_dictionary/test_python.py       |  114 ++
 swh/indexer/tests/metadata_dictionary/test_ruby.py |  134 ++
 swh/indexer/tests/test_cli.py                      |    1 +
 swh/indexer/tests/test_metadata.py                 | 1277 --------------------
 swh/indexer/tests/zz_celery/README                 |    2 +
 swh/indexer/tests/zz_celery/__init__.py            |    0
 swh/indexer/tests/{ => zz_celery}/test_tasks.py    |    0
 20 files changed, 1565 insertions(+), 1319 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/__init__.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_cff.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_codemeta.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_maven.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_npm.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_python.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_ruby.py
 create mode 100644 swh/indexer/tests/zz_celery/README
 create mode 100644 swh/indexer/tests/zz_celery/__init__.py
 rename swh/indexer/tests/{ => zz_celery}/test_tasks.py (100%)
Changes applied before test
commit 07074b9eec29880698469a623133a94a7122b731
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 17:53:03 2022 +0200

    Add minimal GitHub metadata mapping
    
    This introduces the scaffholding for extrinsic metadata mappings

commit 244bf36f55fb919f9b9da8503db309a6d816fd30
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 13:43:14 2022 +0200

    Move mapping-specific tests to a new directory
    
    We have many of those now; and keeping them all their tests in the same file
    is messy
    
    This causes these tests to run after Celery tests, which breaks them;
    so this commit also renames Celery tests to make them run last.

commit e002b2ee66b305c98a153cc2b57088c179a3fc68
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:19:00 2022 +0200

    Remove given/when/then comments

commit 65edef32831949de7b8e14846ecd4fa43bc619ee
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:08:49 2022 +0200

    Remove SingleFileMapping from JsonMapping's base classes
    
    Extrinsic metadata indexers will not use a 'file' as input,
    but will typically use RawExtrinsicMetadata containing formats
    in JSON.

commit f7a4bf4e04b3ac4c2fa89cf9b8a5c22e5f0c4d12
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 11:01:35 2022 +0200

    Add typing to detect_metadata() and related functions

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/295/ for more details.

douardda added a subscriber: douardda.

I'm not against this, but I have my doubts about the overall "architecture".

swh/indexer/metadata_dictionary/base.py
50–51

So if I get this right, with this new method, BaseMapping is being extended to support both intrinsic and extrinsic metadata, right? Why isn't there a similar "intrinsic_metadata_formats()" method then? I'm not sure I like (or understand?) this "architecture" very much.

66–67

I find it a bit weird that the SingleFileMapping class is only for intrinsic metadata, yet its name does not make that clear.

This revision is now accepted and ready to land. Jul 4 2022, 11:59 AM

swh/indexer/metadata_dictionary/base.py
50–51

Extrinsic metadata is present directly in the journal in RawExtrinsicMetadata objects, each with one of the formats described in this exhaustive list: https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#extrinsic-metadata-formats

Intrinsic metadata is present in directories and is detected by detect_metadata_files, which needs to be smarter because the set of possible file names is unbounded (for example, *.gemspec).
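
To make the distinction concrete, here is a minimal sketch (not the actual swh-indexer code) of why the two cases call for different lookups: a closed set of formats can be resolved with a plain dictionary, while file detection needs pattern matching. All names below, the format string, and the use of str file names are assumptions made for illustration.

```python
import fnmatch
from typing import Dict, List, Optional, Tuple

# Hypothetical table: the set of extrinsic formats is closed, so an exact
# lookup from format name to mapping name is enough.
# The format string is an illustrative value, not taken from this diff.
EXTRINSIC_FORMAT_TO_MAPPING: Dict[str, str] = {
    "application/vnd.github.v3+json": "github",
}

# Hypothetical table: intrinsic metadata lives in files whose names are
# not known in advance, so detection relies on glob-style patterns.
INTRINSIC_FILENAME_PATTERNS: List[Tuple[str, str]] = [
    ("package.json", "npm"),
    ("*.gemspec", "gemspec"),
]


def mapping_for_extrinsic_format(fmt: str) -> Optional[str]:
    """Return the mapping name for an extrinsic format, or None if unknown."""
    return EXTRINSIC_FORMAT_TO_MAPPING.get(fmt)


def mappings_for_directory(filenames: List[str]) -> Dict[str, List[str]]:
    """Group the file names of a directory by the mapping whose pattern they match."""
    found: Dict[str, List[str]] = {}
    for name in filenames:
        for pattern, mapping in INTRINSIC_FILENAME_PATTERNS:
            if fnmatch.fnmatchcase(name, pattern):
                found.setdefault(mapping, []).append(name)
    return found
```

In this sketch, adding a new extrinsic format is a one-line table entry, whereas supporting a new intrinsic file type may require a new pattern or smarter detection logic.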

swh/indexer/metadata_dictionary/base.py
66–67

Yeah, it's a little messy. I think I'll separate the two class hierarchies (one for extrinsic and one for intrinsic) and make DictMapping and its descendants a mixin, but in a future diff.
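
One possible shape for that split, as a rough sketch only: the class names, the key_table attribute, and the "example-format-json" string are hypothetical, and extrinsic_metadata_formats() is assumed from the method name discussed above. The dict-translation logic sits in a mixin, and each concrete mapping picks exactly one of the two base hierarchies.

```python
import json
from typing import Any, Dict, Optional, Tuple


class DictMappingMixin:
    """Hypothetical mixin: renames known keys and drops unknown ones."""

    key_table: Dict[str, str] = {}

    def translate_dict(self, content: Dict[str, Any]) -> Dict[str, Any]:
        return {
            self.key_table[key]: value
            for key, value in content.items()
            if key in self.key_table
        }


class BaseIntrinsicMapping:
    """Hypothetical base for mappings fed from files found in directories."""

    def translate(self, raw_content: bytes) -> Optional[Dict[str, Any]]:
        raise NotImplementedError


class BaseExtrinsicMapping:
    """Hypothetical base for mappings fed from RawExtrinsicMetadata objects."""

    @classmethod
    def extrinsic_metadata_formats(cls) -> Tuple[str, ...]:
        raise NotImplementedError

    def translate(self, raw_content: bytes) -> Optional[Dict[str, Any]]:
        raise NotImplementedError


class ExampleExtrinsicJsonMapping(DictMappingMixin, BaseExtrinsicMapping):
    """Example extrinsic mapping that reuses the mixin for JSON input."""

    key_table = {"description": "description", "license": "license"}

    @classmethod
    def extrinsic_metadata_formats(cls) -> Tuple[str, ...]:
        return ("example-format-json",)  # illustrative format name

    def translate(self, raw_content: bytes) -> Optional[Dict[str, Any]]:
        return self.translate_dict(json.loads(raw_content))
```

With this layout, a class is intrinsic or extrinsic by inheritance rather than by which optional methods it happens to implement, which is the naming concern raised in the review.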

swh/indexer/metadata_dictionary/base.py
50–51

I guess what I will wait for is a nice user friendly documentation of all this in the end :-)

Build is green

Patch application report for D8053 (id=29122)

Rebasing onto 3074268b1b...

Current branch diff-target is up to date.
Changes applied before test
commit 9085cae01009f19a00a9c3b1e56eeb138e4f2775
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 17:53:03 2022 +0200

    Add minimal GitHub metadata mapping
    
    This introduces the scaffholding for extrinsic metadata mappings

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/314/ for more details.

This revision was automatically updated to reflect the committed changes.