Add extrinsic metadata indexer
Closed, Public

Authored by vlorentz on Jun 30 2022, 3:30 PM.

Details

Summary

Adds an indexer that calls the GitHub mapping on RawExtrinsicMetadata
objects coming from GitHub.

Depends on D8055 and D8058
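
As a rough illustration of the idea in the summary, the sketch below shows one way an indexer could pick a mapping based on where a raw metadata object came from, and then translate the JSON payload. Everything here is hypothetical and is not the code in this diff: `RawMetadata` is a simplified stand-in for RawExtrinsicMetadata, and `github_mapping`, `MAPPINGS_BY_AUTHORITY`, `index_extrinsic_metadata` and the handful of translated fields are assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class RawMetadata:
    """Hypothetical, simplified stand-in for a RawExtrinsicMetadata object."""

    origin_url: str     # the origin the metadata describes
    authority_url: str  # who provided the metadata (assumed field name)
    payload: bytes      # raw JSON document, as fetched from the forge


def github_mapping(payload: bytes) -> Dict[str, Any]:
    """Toy mapping: translate a few GitHub API fields to CodeMeta-like terms.

    The selection of fields is illustrative only.
    """
    data = json.loads(payload)
    result: Dict[str, Any] = {}
    if data.get("name"):
        result["name"] = data["name"]
    if data.get("description"):
        result["description"] = data["description"]
    if data.get("html_url"):
        result["codeRepository"] = data["html_url"]
    return result


# Hypothetical registry: one mapping per metadata authority.
MAPPINGS_BY_AUTHORITY: Dict[str, Callable[[bytes], Dict[str, Any]]] = {
    "https://github.com": github_mapping,
}


def index_extrinsic_metadata(item: RawMetadata) -> Optional[Dict[str, Any]]:
    """Return translated metadata, or None when no mapping handles the item."""
    mapping = MAPPINGS_BY_AUTHORITY.get(item.authority_url.rstrip("/"))
    if mapping is None:
        return None
    return mapping(item.payload)
```

In this sketch, objects from an authority without a registered mapping are skipped rather than treated as errors, so further forges could be supported by adding entries to the registry.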

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8060 (id=29080)

Could not rebase; Attempt merge onto 1be4e184d4...

Updating 1be4e18..3e56f35
Fast-forward
 swh/indexer/codemeta.py                            |    2 +
 swh/indexer/indexer.py                             |    1 +
 swh/indexer/metadata.py                            |  171 ++-
 swh/indexer/metadata_detector.py                   |    8 +-
 swh/indexer/metadata_dictionary/__init__.py        |   12 +-
 swh/indexer/metadata_dictionary/base.py            |   81 +-
 swh/indexer/metadata_dictionary/github.py          |   78 ++
 swh/indexer/metadata_dictionary/npm.py             |    4 +-
 swh/indexer/metadata_dictionary/ruby.py            |   12 +-
 swh/indexer/sql/30-schema.sql                      |   16 +
 swh/indexer/sql/50-func.sql                        |  121 +-
 swh/indexer/sql/60-indexes.sql                     |   10 +
 swh/indexer/storage/__init__.py                    |   47 +
 swh/indexer/storage/db.py                          |   29 +
 swh/indexer/storage/in_memory.py                   |   13 +
 swh/indexer/storage/interface.py                   |   29 +
 swh/indexer/storage/model.py                       |   12 +
 swh/indexer/tests/metadata_dictionary/__init__.py  |    0
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  220 +++
 .../tests/metadata_dictionary/test_codemeta.py     |  175 +++
 .../tests/metadata_dictionary/test_github.py       |  126 ++
 .../tests/metadata_dictionary/test_maven.py        |  365 +++++
 swh/indexer/tests/metadata_dictionary/test_npm.py  |  322 +++++
 .../tests/metadata_dictionary/test_python.py       |  114 ++
 swh/indexer/tests/metadata_dictionary/test_ruby.py |  134 ++
 swh/indexer/tests/storage/test_storage.py          |  248 ++++
 swh/indexer/tests/test_cli.py                      |    1 +
 swh/indexer/tests/test_metadata.py                 | 1406 ++------------------
 swh/indexer/tests/zz_celery/README                 |    2 +
 swh/indexer/tests/zz_celery/__init__.py            |    0
 swh/indexer/tests/{ => zz_celery}/test_tasks.py    |    0
 31 files changed, 2391 insertions(+), 1368 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/__init__.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_cff.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_codemeta.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_github.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_maven.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_npm.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_python.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_ruby.py
 create mode 100644 swh/indexer/tests/zz_celery/README
 create mode 100644 swh/indexer/tests/zz_celery/__init__.py
 rename swh/indexer/tests/{ => zz_celery}/test_tasks.py (100%)
Changes applied before test
commit 3e56f35f060465381648b2cb270cf5454cb04a13
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 30 15:30:17 2022 +0200

    Add extrinsic metadata indexer
    
    Which calls the GitHub mapping, for RawExtrinsicMetadata objects coming
    from github

commit 49221e13a976009cd9d2306404987b541239567c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 30 15:27:45 2022 +0200

    Add support for origin_extrinsic_metadata to the storage

commit 1b0eb35fb8e4cc41d5718bc947084a4a827cbde6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 30 14:52:08 2022 +0200

    reorder SQL functions

commit 2da2bc3ce7b73b775dbba22d3763cea1526a544b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 30 10:54:03 2022 +0200

    github mapping: Add support for more terms from the Codemeta crosswalk

commit a20610b2503ec503e9c0d3b26e809d779c3dbc8e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 18:08:00 2022 +0200

    github mapping: Add support for terms outside the codemeta context

commit 07074b9eec29880698469a623133a94a7122b731
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 17:53:03 2022 +0200

    Add minimal GitHub metadata mapping
    
    This introduces the scaffolding for extrinsic metadata mappings

commit 244bf36f55fb919f9b9da8503db309a6d816fd30
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 13:43:14 2022 +0200

    Move mapping-specific tests to a new directory
    
    We have many of those now, and keeping all their tests in the same file
    is messy
    
    This causes these tests to run after Celery tests, which breaks them;
    so this commit also renames Celery tests to make them run last.

commit e002b2ee66b305c98a153cc2b57088c179a3fc68
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:19:00 2022 +0200

    Remove given/when/then comments

commit 65edef32831949de7b8e14846ecd4fa43bc619ee
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 12:08:49 2022 +0200

    Remove SingleFileMapping from JsonMapping's base classes
    
    Extrinsic metadata indexers will not use a 'file' as input,
    but will typically use RawExtrinsicMetadata containing formats
    in JSON.

commit f7a4bf4e04b3ac4c2fa89cf9b8a5c22e5f0c4d12
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 11:01:35 2022 +0200

    Add typing to detect_metadata() and related functions

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/299/ for more details.

douardda added a subscriber: douardda.
douardda added inline comments.
swh/indexer/metadata.py
91

break?

This revision is now accepted and ready to land.
Jul 4 2022, 12:26 PM

Build is green

Patch application report for D8060 (id=29126)

Could not rebase; Attempt merge onto 3074268b1b...

Updating 3074268..61b2234
Fast-forward
 swh/indexer/codemeta.py                            |   2 +
 swh/indexer/indexer.py                             |   1 +
 swh/indexer/metadata.py                            | 130 +++++++++--
 swh/indexer/metadata_dictionary/__init__.py        |  12 +-
 swh/indexer/metadata_dictionary/base.py            |  55 ++++-
 swh/indexer/metadata_dictionary/github.py          |  78 +++++++
 swh/indexer/metadata_dictionary/ruby.py            |   7 +-
 swh/indexer/sql/30-schema.sql                      |  16 ++
 swh/indexer/sql/50-func.sql                        | 121 ++++++++--
 swh/indexer/sql/60-indexes.sql                     |  10 +
 swh/indexer/sql/upgrades/135.sql                   | 106 +++++++++
 swh/indexer/storage/__init__.py                    |  49 +++-
 swh/indexer/storage/db.py                          |  29 +++
 swh/indexer/storage/in_memory.py                   |  13 ++
 swh/indexer/storage/interface.py                   |  29 +++
 swh/indexer/storage/model.py                       |  12 +
 .../tests/metadata_dictionary/test_github.py       | 126 +++++++++++
 swh/indexer/tests/storage/test_storage.py          | 248 +++++++++++++++++++++
 swh/indexer/tests/test_cli.py                      |   1 +
 swh/indexer/tests/test_metadata.py                 | 133 ++++++++++-
 20 files changed, 1118 insertions(+), 60 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/github.py
 create mode 100644 swh/indexer/sql/upgrades/135.sql
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_github.py
Changes applied before test
commit 61b22345e70bf98c3711a293e7582c5727b8f805
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 30 15:30:17 2022 +0200

    Add extrinsic metadata indexer
    
    Which calls the GitHub mapping, for RawExtrinsicMetadata objects coming
    from github

commit f26b4c8b1ca771fa73cad78416001a464706445f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 30 15:27:45 2022 +0200

    Add support for origin_extrinsic_metadata to the storage

commit db02285bee9b4a6d017ef040753573647b04f930
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 30 14:52:08 2022 +0200

    reorder SQL functions

commit 151a3b8a2b698c999a0efb4f2ee7f5076d8a3076
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 30 10:54:03 2022 +0200

    github mapping: Add support for more terms from the Codemeta crosswalk

commit 8948c83972512326bd11eebaf0354b92747a8718
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 18:08:00 2022 +0200

    github mapping: Add support for terms outside the codemeta context

commit 9085cae01009f19a00a9c3b1e56eeb138e4f2775
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 29 17:53:03 2022 +0200

    Add minimal GitHub metadata mapping
    
    This introduces the scaffolding for extrinsic metadata mappings

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/318/ for more details.

This revision was automatically updated to reflect the committed changes.