
Add minimal GitHub metadata mapping
Closed, Public

Authored by vlorentz on Jun 29 2022, 6:03 PM.

Details

Summary

This introduces the scaffolding for extrinsic metadata mappings.

Depends on D8048.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D8053 (id=29052)

Could not rebase; Attempt merge onto 1be4e184d4...

Updating 1be4e18..8fa06ba
Fast-forward
 swh/indexer/metadata.py                            |   41 +-
 swh/indexer/metadata_detector.py                   |    8 +-
 swh/indexer/metadata_dictionary/__init__.py        |   12 +-
 swh/indexer/metadata_dictionary/base.py            |   48 +-
 swh/indexer/metadata_dictionary/github.py          |   41 +
 swh/indexer/metadata_dictionary/npm.py             |    4 +-
 swh/indexer/metadata_dictionary/ruby.py            |    7 +-
 swh/indexer/tests/metadata_dictionary/__init__.py  |    0
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  220 ++++
 .../tests/metadata_dictionary/test_codemeta.py     |  175 +++
 .../tests/metadata_dictionary/test_github.py       |  113 ++
 .../tests/metadata_dictionary/test_maven.py        |  365 ++++++
 swh/indexer/tests/metadata_dictionary/test_npm.py  |  322 +++++
 .../tests/metadata_dictionary/test_python.py       |  114 ++
 swh/indexer/tests/metadata_dictionary/test_ruby.py |  134 ++
 swh/indexer/tests/test_metadata.py                 | 1277 --------------------
 swh/indexer/tests/zz_celery/README                 |    2 +
 swh/indexer/tests/zz_celery/__init__.py            |    0
 swh/indexer/tests/{ => zz_celery}/test_tasks.py    |    0
 19 files changed, 1564 insertions(+), 1319 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/__init__.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_cff.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_codemeta.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_maven.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_npm.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_python.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_ruby.py
 create mode 100644 swh/indexer/tests/zz_celery/README
 create mode 100644 swh/indexer/tests/zz_celery/__init__.py
 rename swh/indexer/tests/{ => zz_celery}/test_tasks.py (100%)
Changes applied before test
commit 8fa06ba290c342c3196b4d58309d1b6c485881b1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 17:53:03 2022 +0200

    Add minimal GitHub metadata mapping
    
    This introduces the scaffholding for extrinsic metadata mappings

commit 244bf36f55fb919f9b9da8503db309a6d816fd30
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 13:43:14 2022 +0200

    Move mapping-specific tests to a new directory
    
    We have many of those now; and keeping them all their tests in the same file
    is messy
    
    This causes these tests to run after Celery tests, which breaks them;
    so this commit also renames Celery tests to make them run last.

commit e002b2ee66b305c98a153cc2b57088c179a3fc68
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:19:00 2022 +0200

    Remove given/when/then comments

commit 65edef32831949de7b8e14846ecd4fa43bc619ee
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:08:49 2022 +0200

    Remove SingleFileMapping from JsonMapping's base classes
    
    Extrinsic metadata indexers will not use a 'file' as input,
    but will typically use RawExtrinsicMetadata containing formats
    in JSON.

commit f7a4bf4e04b3ac4c2fa89cf9b8a5c22e5f0c4d12
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 11:01:35 2022 +0200

    Add typing to detect_metadata() and related functions

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/291/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/291/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 29 2022, 6:09 PM
Harbormaster failed remote builds in B30138: Diff 29052!

Build is green

Patch application report for D8053 (id=29062)

Could not rebase; Attempt merge onto 1be4e184d4...

Updating 1be4e18..07074b9
Fast-forward
 swh/indexer/metadata.py                            |   41 +-
 swh/indexer/metadata_detector.py                   |    8 +-
 swh/indexer/metadata_dictionary/__init__.py        |   12 +-
 swh/indexer/metadata_dictionary/base.py            |   48 +-
 swh/indexer/metadata_dictionary/github.py          |   41 +
 swh/indexer/metadata_dictionary/npm.py             |    4 +-
 swh/indexer/metadata_dictionary/ruby.py            |    7 +-
 swh/indexer/tests/metadata_dictionary/__init__.py  |    0
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  220 ++++
 .../tests/metadata_dictionary/test_codemeta.py     |  175 +++
 .../tests/metadata_dictionary/test_github.py       |  113 ++
 .../tests/metadata_dictionary/test_maven.py        |  365 ++++++
 swh/indexer/tests/metadata_dictionary/test_npm.py  |  322 +++++
 .../tests/metadata_dictionary/test_python.py       |  114 ++
 swh/indexer/tests/metadata_dictionary/test_ruby.py |  134 ++
 swh/indexer/tests/test_cli.py                      |    1 +
 swh/indexer/tests/test_metadata.py                 | 1277 --------------------
 swh/indexer/tests/zz_celery/README                 |    2 +
 swh/indexer/tests/zz_celery/__init__.py            |    0
 swh/indexer/tests/{ => zz_celery}/test_tasks.py    |    0
 20 files changed, 1565 insertions(+), 1319 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/__init__.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_cff.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_codemeta.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_maven.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_npm.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_python.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_ruby.py
 create mode 100644 swh/indexer/tests/zz_celery/README
 create mode 100644 swh/indexer/tests/zz_celery/__init__.py
 rename swh/indexer/tests/{ => zz_celery}/test_tasks.py (100%)
Changes applied before test
commit 07074b9eec29880698469a623133a94a7122b731
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 17:53:03 2022 +0200

    Add minimal GitHub metadata mapping
    
    This introduces the scaffholding for extrinsic metadata mappings

commit 244bf36f55fb919f9b9da8503db309a6d816fd30
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 13:43:14 2022 +0200

    Move mapping-specific tests to a new directory
    
    We have many of those now; and keeping them all their tests in the same file
    is messy
    
    This causes these tests to run after Celery tests, which breaks them;
    so this commit also renames Celery tests to make them run last.

commit e002b2ee66b305c98a153cc2b57088c179a3fc68
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:19:00 2022 +0200

    Remove given/when/then comments

commit 65edef32831949de7b8e14846ecd4fa43bc619ee
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:08:49 2022 +0200

    Remove SingleFileMapping from JsonMapping's base classes
    
    Extrinsic metadata indexers will not use a 'file' as input,
    but will typically use RawExtrinsicMetadata containing formats
    in JSON.

commit f7a4bf4e04b3ac4c2fa89cf9b8a5c22e5f0c4d12
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 11:01:35 2022 +0200

    Add typing to detect_metadata() and related functions

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/295/ for more details.

douardda added a subscriber: douardda.

I'm not against this, but I have my doubts on the overall "architecture".

swh/indexer/metadata_dictionary/base.py
50–51

So if I get this right, with this new method, BaseMapping is being extended to support both intrinsic and extrinsic metadata, right? Why isn't there a similar `intrinsic_metadata_formats()` method then? I'm not sure I like (or understand?) this "architecture" a lot.

66–67

I find it a bit weird that the SingleFileMapping class is only for intrinsic MD without making this a clear statement by naming the class accordingly.

This revision is now accepted and ready to land. Jul 4 2022, 11:59 AM
swh/indexer/metadata_dictionary/base.py
50–51

Extrinsic metadata is present directly in the journal, in RawExtrinsicMetadata objects, each with one of the formats described in this exhaustive list: https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#extrinsic-metadata-formats

Intrinsic metadata is present in directories and is detected by detect_metadata_files, which needs to be smarter because the set of possible file names is unbounded (for example, *.gemspec).
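
To illustrate the difference, here is a minimal sketch, not the code in this diff. All class and function names are hypothetical stand-ins, including `extrinsic_metadata_formats()`, whose name is only inferred from the review question above. The format string is assumed for the example. Extrinsic mappings can be selected by an exact lookup against a closed list of formats, while intrinsic detection has to pattern-match file names.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Sequence, Tuple, Type


class ExtrinsicMapping:
    """Hypothetical base: selected by the format of a metadata object."""

    @classmethod
    def extrinsic_metadata_formats(cls) -> Tuple[str, ...]:
        raise NotImplementedError


class GitHubLikeMapping(ExtrinsicMapping):
    @classmethod
    def extrinsic_metadata_formats(cls) -> Tuple[str, ...]:
        # Assumed format name, used here for illustration only.
        return ("application/vnd.github.v3+json",)


def select_extrinsic_mappings(
    fmt: str, mappings: Sequence[Type[ExtrinsicMapping]]
) -> List[Type[ExtrinsicMapping]]:
    """Exact lookup: the set of formats is bounded and known in advance."""
    return [m for m in mappings if fmt in m.extrinsic_metadata_formats()]


# Intrinsic side: file names are unbounded, so patterns are needed.
# The mapping names and patterns below are illustrative.
INTRINSIC_PATTERNS: Dict[str, List[str]] = {
    "npm": ["package.json"],
    "gemspec": ["*.gemspec"],
}


def detect_intrinsic_files(filenames: Sequence[str]) -> Dict[str, List[str]]:
    """Return, per mapping name, the file names matching its patterns."""
    result: Dict[str, List[str]] = {}
    for mapping_name, patterns in INTRINSIC_PATTERNS.items():
        matches = [
            name
            for name in filenames
            if any(fnmatchcase(name, pattern) for pattern in patterns)
        ]
        if matches:
            result[mapping_name] = matches
    return result
```

With these definitions, `select_extrinsic_mappings` is a plain membership test, whereas `detect_intrinsic_files` must scan every file name in a directory against each pattern.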

swh/indexer/metadata_dictionary/base.py
66–67

Yeah, it's a little messy. I think I'll separate the two class hierarchies (one for extrinsic and one for intrinsic) and make DictMapping and its descendants a mixin; but in a future diff.
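
A rough sketch of what that split could look like, assuming nothing about the eventual implementation. Every class name, method name, and key table below is hypothetical. The dictionary-translation logic lives in a mixin that does not care where its input comes from, and two separate base classes describe the intrinsic and extrinsic entry points.

```python
import json
from typing import Any, Dict, Tuple


class DictMappingMixin:
    """Translates a dict through a key table; unaware of the input source."""

    table: Dict[str, str] = {}

    def translate_dict(self, content: Dict[str, Any]) -> Dict[str, Any]:
        # Keep only the keys the table knows about, renamed to target terms.
        return {self.table[k]: v for k, v in content.items() if k in self.table}


class IntrinsicMapping:
    """Hypothetical base for mappings fed from a file in a directory."""

    def translate_file(self, raw_content: bytes) -> Dict[str, Any]:
        raise NotImplementedError


class ExtrinsicMapping:
    """Hypothetical base for mappings fed from a raw metadata object."""

    formats: Tuple[str, ...] = ()

    def translate_raw(self, raw_content: bytes) -> Dict[str, Any]:
        raise NotImplementedError


class NpmLikeMapping(DictMappingMixin, IntrinsicMapping):
    table = {"name": "schema:name", "description": "schema:description"}

    def translate_file(self, raw_content: bytes) -> Dict[str, Any]:
        return self.translate_dict(json.loads(raw_content))


class GitHubLikeMapping(DictMappingMixin, ExtrinsicMapping):
    formats = ("application/vnd.github.v3+json",)  # assumed format name
    table = {"full_name": "schema:name", "description": "schema:description"}

    def translate_raw(self, raw_content: bytes) -> Dict[str, Any]:
        return self.translate_dict(json.loads(raw_content))
```

In this layout, a class name such as `IntrinsicMapping` states up front which kind of input it handles, which addresses the naming concern raised about SingleFileMapping.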

I guess what I will wait for is nice, user-friendly documentation of all this in the end :-)

Build is green

Patch application report for D8053 (id=29122)

Rebasing onto 3074268b1b...

Current branch diff-target is up to date.
Changes applied before test
commit 9085cae01009f19a00a9c3b1e56eeb138e4f2775
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 17:53:03 2022 +0200

    Add minimal GitHub metadata mapping
    
    This introduces the scaffholding for extrinsic metadata mappings

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/314/ for more details.

This revision was automatically updated to reflect the committed changes.