Refactor metadata mappings using rdflib.Graph instead of JSON-LD internally
Closed · Public

Authored by vlorentz on Aug 22 2022, 2:29 PM.

Details

Summary

Motivation:

  1. It makes it easier to visualize what is actually happening when modifying the graph, by working explicitly on triples instead of on a JSON-LD document (a tree serialization of the graph).
  2. It removes the need for the hacky merge_values() function (and possibly merge_documents() in a future commit).
  3. It also catches malformed data exactly where it is added to the document (the call to rdflib.Graph.add()) instead of at the end of the mapping, when running compaction/expansion.

Downsides:

  1. Tests are clunkier, because they relied on a deterministic order for unordered lists, which rdflib does not guarantee.
  2. The code is longer.
  3. It adds an extra dependency (which we will need at some point anyway, if we want to import from RDF datasets).
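
One possible way for tests to cope with the ordering issue is to compare documents after sorting every list. The helper below is a hypothetical sketch (the name `canonicalize` and the sample documents are assumptions, not what this diff does), and it is only appropriate when no property relies on list order.

```python
import json
from typing import Any


def canonicalize(value: Any) -> Any:
    """Recursively sort every list so that comparison ignores element order.

    Hypothetical test helper: it treats all JSON arrays as unordered,
    which would be wrong for properties that use an ordered "@list".
    """
    if isinstance(value, dict):
        return {key: canonicalize(item) for key, item in value.items()}
    if isinstance(value, list):
        items = [canonicalize(item) for item in value]
        # Sort on a stable JSON rendering so dicts and strings are comparable.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


# Made-up documents that differ only in the order of an unordered property.
expected = {"name": "demo", "keywords": ["rdf", "metadata"]}
actual = {"name": "demo", "keywords": ["metadata", "rdf"]}

plain_equal = expected == actual
canonical_equal = canonicalize(expected) == canonicalize(actual)
```

A plain `==` fails on these two documents, while the canonicalized comparison passes.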

Sorry for the big diff. The bulk of the changes is in base.py, codemeta.py, and utils.py; everything else adapts existing code without changing the underlying semantics.

Depends on D8275, D8278, and D8281

Diff Detail

Repository: rDCIDX Metadata indexer
Branch: rdflib
Lint: No Linters Available
Unit: No Unit Test Coverage
Build Status: Buildable 30953
  Build 48413: Phabricator diff pipeline on jenkins
  Build 48412: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D8279 (id=29893)

Could not rebase; Attempt merge onto 7d7f29fb6a...

Updating 7d7f29f..2328c48
Fast-forward
 mypy.ini                                           |   3 +
 requirements.txt                                   |   1 +
 swh/indexer/codemeta.py                            |  64 +++-------
 swh/indexer/metadata_dictionary/base.py            | 133 ++++++++++++++++----
 swh/indexer/metadata_dictionary/cff.py             |  77 +++++++-----
 swh/indexer/metadata_dictionary/composer.py        |  36 +++---
 swh/indexer/metadata_dictionary/dart.py            |  43 ++++---
 swh/indexer/metadata_dictionary/github.py          | 135 +++++++++------------
 swh/indexer/metadata_dictionary/maven.py           |  99 ++++++++-------
 swh/indexer/metadata_dictionary/npm.py             | 133 +++++++++++++-------
 swh/indexer/metadata_dictionary/nuget.py           |  69 +++++------
 swh/indexer/metadata_dictionary/python.py          |  44 ++++---
 swh/indexer/metadata_dictionary/ruby.py            |  44 ++++---
 swh/indexer/metadata_dictionary/utils.py           |  72 +++++++++++
 swh/indexer/namespaces.py                          |  12 ++
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  11 +-
 .../tests/metadata_dictionary/test_composer.py     |   7 +-
 swh/indexer/tests/metadata_dictionary/test_dart.py |  17 +--
 .../tests/metadata_dictionary/test_github.py       |   2 +-
 .../tests/metadata_dictionary/test_maven.py        |  32 ++---
 swh/indexer/tests/metadata_dictionary/test_npm.py  |   7 +-
 .../tests/metadata_dictionary/test_nuget.py        |  53 ++++----
 .../tests/metadata_dictionary/test_python.py       |   9 +-
 swh/indexer/tests/metadata_dictionary/test_ruby.py |   2 +
 swh/indexer/tests/test_codemeta.py                 |  30 +----
 25 files changed, 650 insertions(+), 485 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/utils.py
 create mode 100644 swh/indexer/namespaces.py
Changes applied before test
commit 2328c48b7999a721dbfe50e6e21ddc1160bccbe0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 14:20:20 2022 +0200

    Refactor metadata mappings using rdflib.Graph instead of JSON-LD internally
    
    Motivation:
    
    1. It makes it easier to visualize what is actually happening when modifying
       the graph, by working explicitly on triples instead of a JSON-LD (a tree
       serialization of the graph).
    
    2. Remove the need for the hacky `merge_values()` function (and possibly
       `merge_documents()` in a future commit)
    
    3. It also catches malformed data exactly where it is added in the document
       (the call to rdflib.Graph.add()) instead of at the end of the mapping
       when running compaction/expansion.
    
    Downsides:
    
    1. Tests are clunkier, because they relied on deterministic order of
       unordered lists; but rdflib does not guarantee it
    
    2. Code is longer
    
    3. Extra dependency (which we will need at some point if we want to
       import from RDF datasets, anyway)

commit d5207e9521d982d2c170399b4855de3b6d9e8005
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 13:20:14 2022 +0200

    Add base XmlMapping to deduplicate between MavenMapping and NugetMapping

commit c4cd68f6ae0ea92be6e312a37fcf0fe597e7616f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 13:18:39 2022 +0200

    nuget: Remove test-specific code from the main class

commit b09e2bcfc73d72a670dbedfe5f8334d0036ce195
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 13:14:05 2022 +0200

    nuget: Inherit directly from BaseIntrinsicMapping

commit c8a4571c8763c84c064e832f598e949407e0e429
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 19 12:18:30 2022 +0200

    Replace 'normalize' parameter of _translate_dict() with a hook method
    
    This parameter was only used to execute extra code before
    `normalize_translation` is called. This caused some duplication, and
    will not work when switching to a non-JSON-LD internal representation.
    
    Removing it also makes the code of mappings more consistent, by removing
    specific field handling from their implementation of the `translate`
    method itself.

commit 466108c1667c88be7ff272e565ffe076e16064d8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 19 12:13:33 2022 +0200

    python: Simplify translation logic of author metadata

commit 92b53419f6f9d699451609cb23a946978ecb6b07
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 19 11:37:06 2022 +0200

    metadata_dictionary: Simplify code using rdflib-style namespace classes

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/440/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/440/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Aug 22 2022, 2:32 PM)
Harbormaster failed remote builds in B30946: Diff 29893!

make mypy happy with type-annotated rdflib versions

Build was aborted

Patch application report for D8279 (id=29894)

Could not rebase; Attempt merge onto 7d7f29fb6a...

Updating 7d7f29f..8605980
Fast-forward
 mypy.ini                                           |   3 +
 requirements.txt                                   |   1 +
 swh/indexer/codemeta.py                            |  64 +++-------
 swh/indexer/metadata_dictionary/base.py            | 133 ++++++++++++++++----
 swh/indexer/metadata_dictionary/cff.py             |  79 +++++++-----
 swh/indexer/metadata_dictionary/composer.py        |  37 +++---
 swh/indexer/metadata_dictionary/dart.py            |  43 ++++---
 swh/indexer/metadata_dictionary/github.py          | 135 +++++++++------------
 swh/indexer/metadata_dictionary/maven.py           |  99 ++++++++-------
 swh/indexer/metadata_dictionary/npm.py             | 133 +++++++++++++-------
 swh/indexer/metadata_dictionary/nuget.py           |  69 +++++------
 swh/indexer/metadata_dictionary/python.py          |  44 ++++---
 swh/indexer/metadata_dictionary/ruby.py            |  44 ++++---
 swh/indexer/metadata_dictionary/utils.py           |  72 +++++++++++
 swh/indexer/namespaces.py                          |  12 ++
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  11 +-
 .../tests/metadata_dictionary/test_composer.py     |   7 +-
 swh/indexer/tests/metadata_dictionary/test_dart.py |  17 +--
 .../tests/metadata_dictionary/test_github.py       |   2 +-
 .../tests/metadata_dictionary/test_maven.py        |  32 ++---
 swh/indexer/tests/metadata_dictionary/test_npm.py  |   7 +-
 .../tests/metadata_dictionary/test_nuget.py        |  53 ++++----
 .../tests/metadata_dictionary/test_python.py       |   9 +-
 swh/indexer/tests/metadata_dictionary/test_ruby.py |   2 +
 swh/indexer/tests/test_codemeta.py                 |  30 +----
 25 files changed, 653 insertions(+), 485 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/utils.py
 create mode 100644 swh/indexer/namespaces.py
Changes applied before test
commit 8605980328cd6a9f74dc364fa432468a19ae23c4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 14:20:20 2022 +0200

    Refactor metadata mappings using rdflib.Graph instead of JSON-LD internally
    
    Motivation:
    
    1. It makes it easier to visualize what is actually happening when modifying
       the graph, by working explicitly on triples instead of a JSON-LD (a tree
       serialization of the graph).
    
    2. Remove the need for the hacky `merge_values()` function (and possibly
       `merge_documents()` in a future commit)
    
    3. It also catches malformed data exactly where it is added in the document
       (the call to rdflib.Graph.add()) instead of at the end of the mapping
       when running compaction/expansion.
    
    Downsides:
    
    1. Tests are clunkier, because they relied on deterministic order of
       unordered lists; but rdflib does not guarantee it
    
    2. Code is longer
    
    3. Extra dependency (which we will need at some point if we want to
       import from RDF datasets, anyway)

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/442/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/442/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Aug 22 2022, 3:11 PM)
Harbormaster failed remote builds in B30947: Diff 29894!

Build has FAILED

Patch application report for D8279 (id=29894)

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/443/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/443/console

vlorentz edited the summary of this revision.

Build has FAILED

Patch application report for D8279 (id=29898)

Could not rebase; Attempt merge onto 7d7f29fb6a...

Updating 7d7f29f..20d5326
Fast-forward
 mypy.ini                                           |   3 +
 requirements.txt                                   |   1 +
 swh/indexer/codemeta.py                            |  64 +++-------
 swh/indexer/metadata_dictionary/base.py            | 133 ++++++++++++++++----
 swh/indexer/metadata_dictionary/cff.py             |  79 +++++++-----
 swh/indexer/metadata_dictionary/composer.py        |  37 +++---
 swh/indexer/metadata_dictionary/dart.py            |  43 ++++---
 swh/indexer/metadata_dictionary/github.py          | 135 +++++++++------------
 swh/indexer/metadata_dictionary/maven.py           |  99 ++++++++-------
 swh/indexer/metadata_dictionary/npm.py             | 133 +++++++++++++-------
 swh/indexer/metadata_dictionary/nuget.py           |  69 +++++------
 swh/indexer/metadata_dictionary/python.py          |  44 ++++---
 swh/indexer/metadata_dictionary/ruby.py            |  44 ++++---
 swh/indexer/metadata_dictionary/utils.py           |  72 +++++++++++
 swh/indexer/namespaces.py                          |  12 ++
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  11 +-
 .../tests/metadata_dictionary/test_composer.py     |   7 +-
 swh/indexer/tests/metadata_dictionary/test_dart.py |  17 +--
 .../tests/metadata_dictionary/test_github.py       |   2 +-
 .../tests/metadata_dictionary/test_maven.py        |  32 ++---
 swh/indexer/tests/metadata_dictionary/test_npm.py  |   7 +-
 .../tests/metadata_dictionary/test_nuget.py        |  53 ++++----
 .../tests/metadata_dictionary/test_python.py       |   9 +-
 swh/indexer/tests/metadata_dictionary/test_ruby.py |   2 +
 swh/indexer/tests/test_codemeta.py                 |  32 +----
 swh/indexer/tests/test_origin_metadata.py          |   6 +-
 swh/indexer/tests/utils.py                         |  13 --
 27 files changed, 657 insertions(+), 502 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/utils.py
 create mode 100644 swh/indexer/namespaces.py
Changes applied before test
commit 20d5326b82b5224df88535b20ae43778198fd2af
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 14:20:20 2022 +0200

    Refactor metadata mappings using rdflib.Graph instead of JSON-LD internally
    
    Motivation:
    
    1. It makes it easier to visualize what is actually happening when modifying
       the graph, by working explicitly on triples instead of a JSON-LD (a tree
       serialization of the graph).
    
    2. Remove the need for the hacky `merge_values()` function (and possibly
       `merge_documents()` in a future commit)
    
    3. It also catches malformed data exactly where it is added in the document
       (the call to rdflib.Graph.add()) instead of at the end of the mapping
       when running compaction/expansion.
    
    Downsides:
    
    1. Tests are clunkier, because they relied on deterministic order of
       unordered lists; but rdflib does not guarantee it
    
    2. Code is longer
    
    3. Extra dependency (which we will need at some point if we want to
       import from RDF datasets, anyway)

commit b9f206bfe0a8e2592708e7f1728643654564f32a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 15:47:49 2022 +0200

    Remove 'keywords' from test files
    
    Their order is nondeterministic, it just happens to work with
    the way we use PyLD.

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/445/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/445/console

Build has FAILED

Patch application report for D8279 (id=29900)

Could not rebase; Attempt merge onto 7d7f29fb6a...

Updating 7d7f29f..d9732bd
Fast-forward
 docs/metadata-workflow.rst                         |   2 +-
 mypy.ini                                           |   3 +
 requirements.txt                                   |   1 +
 swh/indexer/codemeta.py                            |  64 +++-------
 swh/indexer/metadata_dictionary/base.py            | 133 ++++++++++++++++----
 swh/indexer/metadata_dictionary/cff.py             |  79 +++++++-----
 swh/indexer/metadata_dictionary/composer.py        |  37 +++---
 swh/indexer/metadata_dictionary/dart.py            |  43 ++++---
 swh/indexer/metadata_dictionary/github.py          | 135 +++++++++------------
 swh/indexer/metadata_dictionary/maven.py           |  99 ++++++++-------
 swh/indexer/metadata_dictionary/npm.py             | 133 +++++++++++++-------
 swh/indexer/metadata_dictionary/nuget.py           |  69 +++++------
 swh/indexer/metadata_dictionary/python.py          |  44 ++++---
 swh/indexer/metadata_dictionary/ruby.py            |  44 ++++---
 swh/indexer/metadata_dictionary/utils.py           |  72 +++++++++++
 swh/indexer/namespaces.py                          |  12 ++
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  11 +-
 .../tests/metadata_dictionary/test_composer.py     |   7 +-
 swh/indexer/tests/metadata_dictionary/test_dart.py |  17 +--
 .../tests/metadata_dictionary/test_github.py       |   2 +-
 .../tests/metadata_dictionary/test_maven.py        |  32 ++---
 swh/indexer/tests/metadata_dictionary/test_npm.py  |   7 +-
 .../tests/metadata_dictionary/test_nuget.py        |  53 ++++----
 .../tests/metadata_dictionary/test_python.py       |   9 +-
 swh/indexer/tests/metadata_dictionary/test_ruby.py |   2 +
 swh/indexer/tests/test_codemeta.py                 |  32 +----
 swh/indexer/tests/test_origin_metadata.py          |   6 +-
 swh/indexer/tests/utils.py                         |  13 --
 28 files changed, 658 insertions(+), 503 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/utils.py
 create mode 100644 swh/indexer/namespaces.py
Changes applied before test
commit d9732bd541099164a1ac0bba4e176270a631c172
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 14:20:20 2022 +0200

    Refactor metadata mappings using rdflib.Graph instead of JSON-LD internally
    
    Motivation:
    
    1. It makes it easier to visualize what is actually happening when modifying
       the graph, by working explicitly on triples instead of a JSON-LD (a tree
       serialization of the graph).
    
    2. Remove the need for the hacky `merge_values()` function (and possibly
       `merge_documents()` in a future commit)
    
    3. It also catches malformed data exactly where it is added in the document
       (the call to rdflib.Graph.add()) instead of at the end of the mapping
       when running compaction/expansion.
    
    Downsides:
    
    1. Tests are clunkier, because they relied on deterministic order of
       unordered lists; but rdflib does not guarantee it
    
    2. Code is longer
    
    3. Extra dependency (which we will need at some point if we want to
       import from RDF datasets, anyway)

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/446/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/446/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Aug 22 2022, 4:03 PM)
Harbormaster failed remote builds in B30953: Diff 29900!

Build is green

Patch application report for D8279 (id=29904)

Could not rebase; Attempt merge onto 7d7f29fb6a...

Updating 7d7f29f..8850944
Fast-forward
 docs/metadata-workflow.rst                         |   2 +-
 mypy.ini                                           |   3 +
 requirements.txt                                   |   1 +
 swh/indexer/codemeta.py                            |  66 +++-------
 swh/indexer/metadata_dictionary/base.py            | 133 ++++++++++++++++----
 swh/indexer/metadata_dictionary/cff.py             |  79 +++++++-----
 swh/indexer/metadata_dictionary/composer.py        |  37 +++---
 swh/indexer/metadata_dictionary/dart.py            |  43 ++++---
 swh/indexer/metadata_dictionary/github.py          | 135 +++++++++------------
 swh/indexer/metadata_dictionary/maven.py           |  99 ++++++++-------
 swh/indexer/metadata_dictionary/npm.py             | 133 +++++++++++++-------
 swh/indexer/metadata_dictionary/nuget.py           |  69 +++++------
 swh/indexer/metadata_dictionary/python.py          |  44 ++++---
 swh/indexer/metadata_dictionary/ruby.py            |  44 ++++---
 swh/indexer/metadata_dictionary/utils.py           |  72 +++++++++++
 swh/indexer/namespaces.py                          |  12 ++
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  11 +-
 .../tests/metadata_dictionary/test_composer.py     |   7 +-
 swh/indexer/tests/metadata_dictionary/test_dart.py |  17 +--
 .../tests/metadata_dictionary/test_github.py       |   2 +-
 .../tests/metadata_dictionary/test_maven.py        |  32 ++---
 swh/indexer/tests/metadata_dictionary/test_npm.py  |   7 +-
 .../tests/metadata_dictionary/test_nuget.py        |  53 ++++----
 .../tests/metadata_dictionary/test_python.py       |   9 +-
 swh/indexer/tests/metadata_dictionary/test_ruby.py |   2 +
 swh/indexer/tests/test_codemeta.py                 |  32 +----
 swh/indexer/tests/test_origin_metadata.py          |   6 +-
 swh/indexer/tests/utils.py                         |  13 --
 28 files changed, 659 insertions(+), 504 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/utils.py
 create mode 100644 swh/indexer/namespaces.py
Changes applied before test
commit 885094479a97fd37ec624f781c4c2f87fc87a1f2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 14:20:20 2022 +0200

    Refactor metadata mappings using rdflib.Graph instead of JSON-LD internally
    
    Motivation:
    
    1. It makes it easier to visualize what is actually happening when modifying
       the graph, by working explicitly on triples instead of a JSON-LD (a tree
       serialization of the graph).
    
    2. Remove the need for the hacky `merge_values()` function (and possibly
       `merge_documents()` in a future commit)
    
    3. It also catches malformed data exactly where it is added in the document
       (the call to rdflib.Graph.add()) instead of at the end of the mapping
       when running compaction/expansion.
    
    Downsides:
    
    1. Tests are clunkier, because they relied on deterministic order of
       unordered lists; but rdflib does not guarantee it
    
    2. Code is longer
    
    3. Extra dependency (which we will need at some point if we want to
       import from RDF datasets, anyway)

commit b9f206bfe0a8e2592708e7f1728643654564f32a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 15:47:49 2022 +0200

    Remove 'keywords' from test files
    
    Their order is nondeterministic, it just happens to work with
    the way we use PyLD.

commit d5207e9521d982d2c170399b4855de3b6d9e8005
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 13:20:14 2022 +0200

    Add base XmlMapping to deduplicate between MavenMapping and NugetMapping

commit c4cd68f6ae0ea92be6e312a37fcf0fe597e7616f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 13:18:39 2022 +0200

    nuget: Remove test-specific code from the main class

commit b09e2bcfc73d72a670dbedfe5f8334d0036ce195
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 13:14:05 2022 +0200

    nuget: Inherit directly from BaseIntrinsicMapping

commit c8a4571c8763c84c064e832f598e949407e0e429
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 19 12:18:30 2022 +0200

    Replace 'normalize' parameter of _translate_dict() with a hook method
    
    This parameter was only used to execute extra code before
    `normalize_translation` is called. This caused some duplication, and
    will not work when switching to a non-JSON-LD internal representation.
    
    Removing it also makes the code of mappings more consistent, by removing
    specific field handling from their implementation of the `translate`
    method itself.
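
The change described in this commit follows the usual hook-method pattern, which might look roughly like the sketch below. Only `_translate_dict` and `normalize_translation` are names taken from the commit message; the hook name `extra_translation`, the class names, and the method bodies are assumptions made for illustration, not the indexer's implementation.

```python
class BaseMapping:
    """Hypothetical base class illustrating a hook method."""

    def _translate_dict(self, content):
        # Generic translation step: here, just drop empty fields.
        translated = {k: v for k, v in content.items() if v is not None}
        # Hook: runs before normalization, replacing a 'normalize' flag
        # that callers previously had to switch off to run their own code.
        self.extra_translation(translated, content)
        return self.normalize_translation(translated)

    def extra_translation(self, translated, content):
        """Hook for mapping-specific fields; does nothing by default."""

    def normalize_translation(self, translated):
        # Placeholder normalization: return the fields sorted by key.
        return dict(sorted(translated.items()))


class ExampleMapping(BaseMapping):
    """Hypothetical mapping that needs one field handled specially."""

    def extra_translation(self, translated, content):
        # Field handling that would otherwise live in translate() itself.
        if "licence" in translated:
            translated["license"] = translated.pop("licence")


result = ExampleMapping()._translate_dict(
    {"name": "x", "licence": "MIT", "url": None}
)
```

Subclasses only override the hook, so the base class keeps control over when normalization runs.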

commit 466108c1667c88be7ff272e565ffe076e16064d8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 19 12:13:33 2022 +0200

    python: Simplify translation logic of author metadata

commit 92b53419f6f9d699451609cb23a946978ecb6b07
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 19 11:37:06 2022 +0200

    metadata_dictionary: Simplify code using rdflib-style namespace classes

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/448/ for more details.

Build is green

Patch application report for D8279 (id=29923)

Could not rebase; Attempt merge onto d5207e9521...

Updating d5207e9..f72d095
Fast-forward
 docs/metadata-workflow.rst                         |   2 +-
 mypy.ini                                           |   3 +
 requirements.txt                                   |   1 +
 swh/indexer/codemeta.py                            |  51 ++------
 swh/indexer/metadata_dictionary/base.py            |  85 ++++++++++----
 swh/indexer/metadata_dictionary/cff.py             |  78 +++++++------
 swh/indexer/metadata_dictionary/composer.py        |  36 +++---
 swh/indexer/metadata_dictionary/dart.py            |  36 +++---
 swh/indexer/metadata_dictionary/github.py          | 119 +++++++++----------
 swh/indexer/metadata_dictionary/maven.py           |  72 +++++++-----
 swh/indexer/metadata_dictionary/npm.py             | 130 ++++++++++++++-------
 swh/indexer/metadata_dictionary/nuget.py           |  46 ++++----
 swh/indexer/metadata_dictionary/python.py          |  40 ++++---
 swh/indexer/metadata_dictionary/ruby.py            |  41 +++----
 swh/indexer/metadata_dictionary/utils.py           |  72 ++++++++++++
 swh/indexer/namespaces.py                          |  20 +---
 swh/indexer/tests/metadata_dictionary/test_cff.py  |  11 +-
 .../tests/metadata_dictionary/test_composer.py     |   7 +-
 swh/indexer/tests/metadata_dictionary/test_dart.py |  17 +--
 .../tests/metadata_dictionary/test_github.py       |   2 +-
 .../tests/metadata_dictionary/test_maven.py        |  32 ++---
 swh/indexer/tests/metadata_dictionary/test_npm.py  |   7 +-
 .../tests/metadata_dictionary/test_nuget.py        |  41 ++++---
 .../tests/metadata_dictionary/test_python.py       |   9 +-
 swh/indexer/tests/metadata_dictionary/test_ruby.py |   2 +
 swh/indexer/tests/test_codemeta.py                 |  32 +----
 swh/indexer/tests/test_origin_metadata.py          |   6 +-
 swh/indexer/tests/utils.py                         |  13 ---
 28 files changed, 563 insertions(+), 448 deletions(-)
 create mode 100644 swh/indexer/metadata_dictionary/utils.py
Changes applied before test
commit f72d095f425224f16be1bc564f5cc4ed709fb47a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 14:20:20 2022 +0200

    Refactor metadata mappings using rdflib.Graph instead of JSON-LD internally
    
commit 97f5fdcdcc3ac76d2b4680dbcd4f2b5d4c557293
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Aug 22 15:47:49 2022 +0200

    Remove 'keywords' from test files
    
    Their order is nondeterministic; it only happens to work with
    the way we use PyLD.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/456/ for more details.

This revision was not accepted when it landed; it landed in state Needs Review. (Aug 23 2022, 11:28 AM)
This revision was automatically updated to reflect the committed changes.