
github and gitea: Use html_url as @id and clone_url as codeRepository
Closed · Public

Authored by vlorentz on Sep 13 2022, 5:08 PM.

Details

Summary

These are semantically closer: 'html_url' is the main page of the repository,
so it is the best way to identify it; and 'clone_url' is the URL that should
be given to 'git clone', as documented by https://schema.org/codeRepository

Additionally, the '@id' property was missing so far, but a future commit will
need it to identify fork relationships (node ids are required to
represent relationships between documents, as we cannot use blank
nodes for that).
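
As a rough illustration of the mapping described above, here is a minimal sketch. It is not the indexer's actual code: the function name `repo_to_jsonld`, the expanded schema.org IRIs, and the sample URLs are assumptions made for this example. Only the 'html_url' and 'clone_url' field names come from the forge API responses discussed in this revision.

```python
from typing import Any, Dict

# Assumption: expanded schema.org IRIs are used here for readability;
# the real mapping may rely on a JSON-LD context instead.
SCHEMA = "http://schema.org/"


def repo_to_jsonld(api_response: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: build a JSON-LD node from a GitHub/Gitea
    repository API response."""
    doc: Dict[str, Any] = {"@type": SCHEMA + "SoftwareSourceCode"}

    # 'html_url' is the repository's main page, so it names the node.
    # A named node (unlike a blank node) can be referenced from other
    # documents, e.g. to express a fork relationship later on.
    if "html_url" in api_response:
        doc["@id"] = api_response["html_url"]

    # 'clone_url' is what one would pass to 'git clone', which matches
    # the meaning of schema.org's codeRepository.
    if "clone_url" in api_response:
        doc[SCHEMA + "codeRepository"] = {"@id": api_response["clone_url"]}

    return doc


# Sample input with made-up values, shaped like a forge API response.
sample = {
    "html_url": "https://example.org/someuser/somerepo",
    "clone_url": "https://example.org/someuser/somerepo.git",
}
document = repo_to_jsonld(sample)
```

With this sketch, `document["@id"]` is the HTML page URL, while the clone URL appears only under codeRepository, so the two roles stay distinct.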

Depends on D8460.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 31540
Build 49336: Phabricator diff pipeline on jenkins (Jenkins console · Jenkins)
Build 49335: arc lint + arc unit

Unit Tests: Failed

Time  Test
43,862 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_cli::test_cli_journal_client_index__origin_extrinsic_metadata[*]
cli_runner = <click.testing.CliRunner object at 0x7ff5802c34e0> swh_config = '/tmp/pytest-of-jenkins/pytest-0/test_cli_journal_client_index_3/indexer.yml' kafka_prefix = 'gqhpmbptjm', kafka_server = '127.0.0.1:33711'
43,937 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_cli::test_cli_journal_client_index__origin_extrinsic_metadata[extrinsic_metadata]
cli_runner = <click.testing.CliRunner object at 0x7ff582ab7f60> swh_config = '/tmp/pytest-of-jenkins/pytest-0/test_cli_journal_client_index_2/indexer.yml' kafka_prefix = 'hgzomzajbx', kafka_server = '127.0.0.1:33711'
3 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_metadata.TestMetadata::test_extrinsic_metadata_indexer_duplicate_origin
self = <swh.indexer.tests.test_metadata.TestMetadata object at 0x7ff582d7df60> mocker = <pytest_mock.plugin.MockerFixture object at 0x7ff553939898>
3 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_metadata.TestMetadata::test_extrinsic_metadata_indexer_github
self = <swh.indexer.tests.test_metadata.TestMetadata object at 0x7ff582d7d0b8> mocker = <pytest_mock.plugin.MockerFixture object at 0x7ff553916cf8>
3 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.indexer.tests.test_metadata.TestMetadata::test_extrinsic_metadata_indexer_thirdparty_authority
self = <swh.indexer.tests.test_metadata.TestMetadata object at 0x7ff582d7d940> mocker = <pytest_mock.plugin.MockerFixture object at 0x7ff5539079e8>
View Full Test Results (5 Failed · 356 Passed · 11 Skipped)

Event Timeline

Build has FAILED

Patch application report for D8468 (id=30509)

Could not rebase; Attempt merge onto e25a2f4e4a...

Merge made by the 'recursive' strategy.
 swh/indexer/data/Gitea.csv                         |  76 +++++++++++
 swh/indexer/metadata_dictionary/__init__.py        |  15 ++-
 swh/indexer/metadata_dictionary/base.py            | 108 ++++++++++------
 swh/indexer/metadata_dictionary/cff.py             |   5 +-
 swh/indexer/metadata_dictionary/gitea.py           | 124 ++++++++++++++++++
 swh/indexer/metadata_dictionary/github.py          |  17 ++-
 .../tests/metadata_dictionary/test_gitea.py        | 143 +++++++++++++++++++++
 .../tests/metadata_dictionary/test_github.py       |  10 +-
 swh/indexer/tests/test_cli.py                      |   1 +
 9 files changed, 451 insertions(+), 48 deletions(-)
 create mode 100644 swh/indexer/data/Gitea.csv
 create mode 100644 swh/indexer/metadata_dictionary/gitea.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_gitea.py
Changes applied before test
commit 2df4dc14566284c7339f70e284742dddb7363a26
Merge: e25a2f4 d2e42fa
Author: Jenkins user <jenkins@localhost>
Date:   Tue Sep 13 15:08:20 2022 +0000

    Merge branch 'diff-target' into HEAD

commit d2e42fae761ca540b8708145563fef712a7c329d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 17:06:08 2022 +0200

    github and gitea: Use html_url as @id and clone_url as codeRepository
    
    They are closer semantics as 'html_url' is the main page of the repository,
    so it is the best to identify it; and 'clone_url' is the URL that should
    be given to 'git clone', as documented by https://schema.org/codeRepository
    
    Additionally, that property was missing so far; but a future commit will
    need to use it to identify fork relationships (node ids are required to
    representation relationships between documents as we cannot use blank
    nodes for that)

commit 9f6b75cad02745311f3d29a564b3db2d5b756af7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 13:30:54 2022 +0200

    Add Gitea metadata mapping

commit 3a3a348bd86e714ab016a93617bc197010ee145d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 12:34:22 2022 +0200

    GitHub: use correct JSON-LD types for URLs and dates

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/493/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/493/console

Harbormaster returned this revision to the author for changes because remote builds failed. Sep 13 2022, 5:09 PM
Harbormaster failed remote builds in B31516: Diff 30509!

Build has FAILED

Patch application report for D8468 (id=30539)

Could not rebase; Attempt merge onto e25a2f4e4a...

Merge made by the 'recursive' strategy.
 swh/indexer/data/Gitea.csv                         |  76 +++++++++++
 swh/indexer/metadata_dictionary/__init__.py        |  15 ++-
 swh/indexer/metadata_dictionary/base.py            | 108 ++++++++++------
 swh/indexer/metadata_dictionary/cff.py             |   5 +-
 swh/indexer/metadata_dictionary/gitea.py           | 124 ++++++++++++++++++
 swh/indexer/metadata_dictionary/github.py          |  17 ++-
 .../tests/metadata_dictionary/test_gitea.py        | 143 +++++++++++++++++++++
 .../tests/metadata_dictionary/test_github.py       |  10 +-
 swh/indexer/tests/test_cli.py                      |   1 +
 9 files changed, 451 insertions(+), 48 deletions(-)
 create mode 100644 swh/indexer/data/Gitea.csv
 create mode 100644 swh/indexer/metadata_dictionary/gitea.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_gitea.py
Changes applied before test
commit 395d0aae0a41c91cd40d472618849c3a6249a8bc
Merge: e25a2f4 8055d0d
Author: Jenkins user <jenkins@localhost>
Date:   Thu Sep 15 06:41:19 2022 +0000

    Merge branch 'diff-target' into HEAD

commit 8055d0d6390364cdd6fcb73eaedf7203d7c10185
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 17:06:08 2022 +0200

    github and gitea: Use html_url as @id and clone_url as codeRepository
    
    They are closer semantics as 'html_url' is the main page of the repository,
    so it is the best to identify it; and 'clone_url' is the URL that should
    be given to 'git clone', as documented by https://schema.org/codeRepository
    
    Additionally, that property was missing so far; but a future commit will
    need to use it to identify fork relationships (node ids are required to
    representation relationships between documents as we cannot use blank
    nodes for that)

commit 9f6b75cad02745311f3d29a564b3db2d5b756af7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 13:30:54 2022 +0200

    Add Gitea metadata mapping

commit 3a3a348bd86e714ab016a93617bc197010ee145d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 12:34:22 2022 +0200

    GitHub: use correct JSON-LD types for URLs and dates

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/494/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/494/console

Harbormaster returned this revision to the author for changes because remote builds failed. Sep 15 2022, 8:49 AM
Harbormaster failed remote builds in B31540: Diff 30539!

Build was aborted

Patch application report for D8468 (id=30558)

Could not rebase; Attempt merge onto e25a2f4e4a...

Merge made by the 'recursive' strategy.
 swh/indexer/data/Gitea.csv                         |  76 +++++++++++
 swh/indexer/metadata_dictionary/__init__.py        |  15 ++-
 swh/indexer/metadata_dictionary/base.py            | 108 ++++++++++------
 swh/indexer/metadata_dictionary/cff.py             |   5 +-
 swh/indexer/metadata_dictionary/gitea.py           | 124 ++++++++++++++++++
 swh/indexer/metadata_dictionary/github.py          |  19 ++-
 .../tests/metadata_dictionary/test_gitea.py        | 143 +++++++++++++++++++++
 .../tests/metadata_dictionary/test_github.py       |  10 +-
 swh/indexer/tests/test_cli.py                      |   1 +
 swh/indexer/tests/test_metadata.py                 |   3 +-
 10 files changed, 455 insertions(+), 49 deletions(-)
 create mode 100644 swh/indexer/data/Gitea.csv
 create mode 100644 swh/indexer/metadata_dictionary/gitea.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_gitea.py
Changes applied before test
commit a4c38aebe66ccba13f348682a599ae6a29deb705
Merge: e25a2f4 c518541
Author: Jenkins user <jenkins@localhost>
Date:   Thu Sep 15 12:02:55 2022 +0000

    Merge branch 'diff-target' into HEAD

commit c518541b21bfbf1dd6415a369777a57ef3430c7b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 17:06:08 2022 +0200

    github and gitea: Use html_url as @id and clone_url as codeRepository
    
    They are closer semantics as 'html_url' is the main page of the repository,
    so it is the best to identify it; and 'clone_url' is the URL that should
    be given to 'git clone', as documented by https://schema.org/codeRepository
    
    Additionally, that property was missing so far; but a future commit will
    need to use it to identify fork relationships (node ids are required to
    representation relationships between documents as we cannot use blank
    nodes for that)

commit 9f6b75cad02745311f3d29a564b3db2d5b756af7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 13:30:54 2022 +0200

    Add Gitea metadata mapping

commit 3a3a348bd86e714ab016a93617bc197010ee145d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 12:34:22 2022 +0200

    GitHub: use correct JSON-LD types for URLs and dates

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/496/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/496/console

Harbormaster returned this revision to the author for changes because remote builds failed. Sep 15 2022, 2:14 PM
Harbormaster failed remote builds in B31557: Diff 30558!

Build has FAILED

Patch application report for D8468 (id=30558)

Could not rebase; Attempt merge onto e25a2f4e4a...

Merge made by the 'recursive' strategy.
 swh/indexer/data/Gitea.csv                         |  76 +++++++++++
 swh/indexer/metadata_dictionary/__init__.py        |  15 ++-
 swh/indexer/metadata_dictionary/base.py            | 108 ++++++++++------
 swh/indexer/metadata_dictionary/cff.py             |   5 +-
 swh/indexer/metadata_dictionary/gitea.py           | 124 ++++++++++++++++++
 swh/indexer/metadata_dictionary/github.py          |  19 ++-
 .../tests/metadata_dictionary/test_gitea.py        | 143 +++++++++++++++++++++
 .../tests/metadata_dictionary/test_github.py       |  10 +-
 swh/indexer/tests/test_cli.py                      |   1 +
 swh/indexer/tests/test_metadata.py                 |   3 +-
 10 files changed, 455 insertions(+), 49 deletions(-)
 create mode 100644 swh/indexer/data/Gitea.csv
 create mode 100644 swh/indexer/metadata_dictionary/gitea.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_gitea.py
Changes applied before test
commit e8cce65695857490924764b1b04d050969614b57
Merge: e25a2f4 c518541
Author: Jenkins user <jenkins@localhost>
Date:   Fri Sep 16 12:51:51 2022 +0000

    Merge branch 'diff-target' into HEAD

commit c518541b21bfbf1dd6415a369777a57ef3430c7b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 17:06:08 2022 +0200

    github and gitea: Use html_url as @id and clone_url as codeRepository
    
    They are closer semantics as 'html_url' is the main page of the repository,
    so it is the best to identify it; and 'clone_url' is the URL that should
    be given to 'git clone', as documented by https://schema.org/codeRepository
    
    Additionally, that property was missing so far; but a future commit will
    need to use it to identify fork relationships (node ids are required to
    representation relationships between documents as we cannot use blank
    nodes for that)

commit 9f6b75cad02745311f3d29a564b3db2d5b756af7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 13:30:54 2022 +0200

    Add Gitea metadata mapping

commit 3a3a348bd86e714ab016a93617bc197010ee145d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 12:34:22 2022 +0200

    GitHub: use correct JSON-LD types for URLs and dates

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/498/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/498/console

Build has FAILED

Patch application report for D8468 (id=30558)

Could not rebase; Attempt merge onto e25a2f4e4a...

Merge made by the 'recursive' strategy.
 swh/indexer/data/Gitea.csv                         |  76 +++++++++++
 swh/indexer/metadata_dictionary/__init__.py        |  15 ++-
 swh/indexer/metadata_dictionary/base.py            | 108 ++++++++++------
 swh/indexer/metadata_dictionary/cff.py             |   5 +-
 swh/indexer/metadata_dictionary/gitea.py           | 124 ++++++++++++++++++
 swh/indexer/metadata_dictionary/github.py          |  19 ++-
 .../tests/metadata_dictionary/test_gitea.py        | 143 +++++++++++++++++++++
 .../tests/metadata_dictionary/test_github.py       |  10 +-
 swh/indexer/tests/test_cli.py                      |   1 +
 swh/indexer/tests/test_metadata.py                 |   3 +-
 10 files changed, 455 insertions(+), 49 deletions(-)
 create mode 100644 swh/indexer/data/Gitea.csv
 create mode 100644 swh/indexer/metadata_dictionary/gitea.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_gitea.py
Changes applied before test
commit 1c3e4c305bf3714ccf9457d9174a20affcfbe638
Merge: e25a2f4 c518541
Author: Jenkins user <jenkins@localhost>
Date:   Fri Sep 16 12:52:17 2022 +0000

    Merge branch 'diff-target' into HEAD

commit c518541b21bfbf1dd6415a369777a57ef3430c7b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 17:06:08 2022 +0200

    github and gitea: Use html_url as @id and clone_url as codeRepository
    
    They are closer semantics as 'html_url' is the main page of the repository,
    so it is the best to identify it; and 'clone_url' is the URL that should
    be given to 'git clone', as documented by https://schema.org/codeRepository
    
    Additionally, that property was missing so far; but a future commit will
    need to use it to identify fork relationships (node ids are required to
    representation relationships between documents as we cannot use blank
    nodes for that)

commit 9f6b75cad02745311f3d29a564b3db2d5b756af7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 13:30:54 2022 +0200

    Add Gitea metadata mapping

commit 3a3a348bd86e714ab016a93617bc197010ee145d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 12:34:22 2022 +0200

    GitHub: use correct JSON-LD types for URLs and dates

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/499/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/499/console

Build is green

Patch application report for D8468 (id=30598)

Could not rebase; Attempt merge onto e25a2f4e4a...

Merge made by the 'recursive' strategy.
 swh/indexer/data/Gitea.csv                         |  76 +++++++++++
 swh/indexer/metadata_dictionary/__init__.py        |  15 ++-
 swh/indexer/metadata_dictionary/base.py            | 108 ++++++++++------
 swh/indexer/metadata_dictionary/cff.py             |   5 +-
 swh/indexer/metadata_dictionary/gitea.py           | 124 ++++++++++++++++++
 swh/indexer/metadata_dictionary/github.py          |  19 ++-
 .../tests/metadata_dictionary/test_gitea.py        | 143 +++++++++++++++++++++
 .../tests/metadata_dictionary/test_github.py       |  10 +-
 swh/indexer/tests/test_cli.py                      |   2 +
 swh/indexer/tests/test_metadata.py                 |   3 +-
 10 files changed, 456 insertions(+), 49 deletions(-)
 create mode 100644 swh/indexer/data/Gitea.csv
 create mode 100644 swh/indexer/metadata_dictionary/gitea.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_gitea.py
Changes applied before test
commit 26653bae18c047acfb4e7219d705d53680bb5652
Merge: e25a2f4 9d7a6a4
Author: Jenkins user <jenkins@localhost>
Date:   Sun Sep 18 12:18:01 2022 +0000

    Merge branch 'diff-target' into HEAD

commit 9d7a6a47e157d443849dc749765ecb010ba856c2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 17:06:08 2022 +0200

    github and gitea: Use html_url as @id and clone_url as codeRepository
    
    They are closer semantics as 'html_url' is the main page of the repository,
    so it is the best to identify it; and 'clone_url' is the URL that should
    be given to 'git clone', as documented by https://schema.org/codeRepository
    
    Additionally, that property was missing so far; but a future commit will
    need to use it to identify fork relationships (node ids are required to
    representation relationships between documents as we cannot use blank
    nodes for that)

commit 9f6b75cad02745311f3d29a564b3db2d5b756af7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 13:30:54 2022 +0200

    Add Gitea metadata mapping

commit 3a3a348bd86e714ab016a93617bc197010ee145d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 12:34:22 2022 +0200

    GitHub: use correct JSON-LD types for URLs and dates

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/500/ for more details.

olasd added a subscriber: olasd.

This seems reasonable in terms of "ontology design".

I wonder if the inconsistency between origin URLs (which are sometimes the HTML URL, sometimes the clone URL, depending on the lister) and object ids in metadata will end up biting us eventually. Considering that the extrinsic metadata is always attached to an origin URL, it should be fine, but it should probably be documented for future users of the metadata.

This revision is now accepted and ready to land. Sep 27 2022, 4:59 PM

Build was aborted

Patch application report for D8468 (id=30858)

Could not rebase; Attempt merge onto e25a2f4e4a...

Updating e25a2f4..ac0e263
Fast-forward
 swh/indexer/data/Gitea.csv                         |  76 +++++++++++
 swh/indexer/metadata_dictionary/__init__.py        |  15 ++-
 swh/indexer/metadata_dictionary/base.py            | 108 ++++++++++------
 swh/indexer/metadata_dictionary/cff.py             |   5 +-
 swh/indexer/metadata_dictionary/gitea.py           | 124 ++++++++++++++++++
 swh/indexer/metadata_dictionary/github.py          |  19 ++-
 .../tests/metadata_dictionary/test_gitea.py        | 143 +++++++++++++++++++++
 .../tests/metadata_dictionary/test_github.py       |  10 +-
 swh/indexer/tests/test_cli.py                      |   2 +
 swh/indexer/tests/test_metadata.py                 |   3 +-
 10 files changed, 456 insertions(+), 49 deletions(-)
 create mode 100644 swh/indexer/data/Gitea.csv
 create mode 100644 swh/indexer/metadata_dictionary/gitea.py
 create mode 100644 swh/indexer/tests/metadata_dictionary/test_gitea.py
Changes applied before test
commit ac0e263bbfc17ee2905b97bbbbbb4929419170cd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 17:06:08 2022 +0200

    github and gitea: Use html_url as @id and clone_url as codeRepository
    
    They are closer semantics as 'html_url' is the main page of the repository,
    so it is the best to identify it; and 'clone_url' is the URL that should
    be given to 'git clone', as documented by https://schema.org/codeRepository
    
    Additionally, that property was missing so far; but a future commit will
    need to use it to identify fork relationships (node ids are required to
    representation relationships between documents as we cannot use blank
    nodes for that)

commit cb435e59ca91ac7b71cff18e5e6b3885e5be9ac1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 13:30:54 2022 +0200

    Add Gitea metadata mapping

commit 20becf4a90fa6b626e972bba3d57db46604cb7b2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Sep 13 12:34:22 2022 +0200

    GitHub: use correct JSON-LD types for URLs and dates

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/507/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/507/console

This revision was landed with ongoing or failed builds. Sep 27 2022, 5:37 PM
This revision was automatically updated to reflect the committed changes.