Page MenuHomeSoftware Heritage

deposit.parser: Add parse_swh_reference function
ClosedPublic

Authored by ardumont on Nov 6 2020, 5:40 PM.

Details

Summary

Not used yet, simply opened and check coverage is good.

This will ease the detection of a swhid reference or an origin within the
<swh:deposit /> xml block.

This is a first step towards adding more support to metadata-only deposit.

Related to T2537

Test Plan

tox

Diff Detail

Repository
rDDEP Push deposit
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 16979
Build 26200: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 26199: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4442 (id=15718)

Rebasing onto a67a8cc770...

Current branch diff-target is up to date.
Changes applied before test
commit 7bf6671163c335f91b91bfd319715b84a9482683
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 6 17:39:08 2020 +0100

    deposit.utils: Add utils function to parse swh:deposit xml entry
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/290/ for more details.

Deal with swhid with blanks

Build is green

Patch application report for D4442 (id=15719)

Rebasing onto a67a8cc770...

Current branch diff-target is up to date.
Changes applied before test
commit 220654be7dd8ff54f0421c9bd45d1e1e77f48b5b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 6 17:39:08 2020 +0100

    deposit.utils: Add utils function to parse swh:deposit xml entry
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/291/ for more details.

swh/deposit/utils.py
104 ↗(On Diff #15719)

@vlorentz wondering if i need to check with unquoting that value...

I'll rename that function and move it to the parsers module.

  • Move function to swh.deposit.parsers module
  • Rework commit message
ardumont retitled this revision from deposit.utils: Add utils function to parse swh:deposit xml entry to deposit.parser: Add parse_swh_reference function.Nov 9 2020, 9:22 AM
ardumont edited the summary of this revision. (Show Details)

Build is green

Patch application report for D4442 (id=15722)

Rebasing onto a67a8cc770...

Current branch diff-target is up to date.
Changes applied before test
commit aebd577b0f4642aebc0c9b3b28700386c9bcb44c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 6 17:39:08 2020 +0100

    deposit.parser: Add parse_swh_reference function
    
    This will ease the detection of a swhid reference or an origin within the
    <swh:deposit /> xml block.
    
    This is a first step towards adding more support to metadata-only deposit.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/292/ for more details.

why do we need/want clean_swhid? if there is garbage in the input we should just reject it.

Agree with @vlorentz, we should not clean a given SWHID, it's either valid or not

why do we need/want clean_swhid? if there is garbage in the input we should just reject it.

ok, even simpler.

why do we need/want clean_swhid? if there is garbage in the input we should just reject it.

ok, even simpler.

hmm, no not simpler...
I just have more questions ;)

Agree with @vlorentz, we should not clean a given SWHID, it's either valid or not

Yes, but what's a valid swhid?
Something that passes the call to swh.model.identifiers.parse_swhid?

For example:

SWHID(namespace='swh', scheme_version=1, object_type='directory', object_id='31b5c8cc985d190b5a7ef4878128ebfdc2358f49', metadata=ImmutableDict({'                        origin': 'https://hal.archives-ouvertes.fr/hal-01243573', '                        visit': 'swh:1:snp:4fc1e36fca86b2070204bedd51106014a614f321', '                        anchor': 'swh:1:rev:9c5de20cfb54682370a398fcc733e829903c8cba', '                        path': '/moranegg-AffectationRO-df7f68b/'}))

Is that a valid swhid?

(look at the keys in the Immutable Dict, they contain lots of blanks)

  • Adapt according to review (drop clean_swhid)
  • Update docstring to mention the function can raise if the swhid reference is invalid.

Build is green

Patch application report for D4442 (id=15746)

Rebasing onto 7148a257b2...

First, rewinding head to replay your work on top of it...
Applying: deposit.parser: Add parse_swh_reference function
Changes applied before test
commit 0ed646d6d0577125fc168892dae39caa7c4da82c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 6 17:39:08 2020 +0100

    deposit.parser: Add parse_swh_reference function
    
    This will ease the detection of a swhid reference or an origin within the
    <swh:deposit /> xml block.
    
    This is a first step towards adding more support to metadata-only deposit.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/303/ for more details.

Yes, but what's a valid swhid?
Something that passes the call to swh.model.identifiers.parse_swhid?

It's something that follows this grammar: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#syntax

Which hopefully is equivalent to passing that call.

SWHID(namespace='swh', scheme_version=1, object_type='directory', object_id='31b5c8cc985d190b5a7ef4878128ebfdc2358f49', metadata=ImmutableDict({'                        origin': 'https://hal.archives-ouvertes.fr/hal-01243573', '                        visit': 'swh:1:snp:4fc1e36fca86b2070204bedd51106014a614f321', '                        anchor': 'swh:1:rev:9c5de20cfb54682370a398fcc733e829903c8cba', '                        path': '/moranegg-AffectationRO-df7f68b/'}))

Is that a valid swhid?

(look at the keys in the Immutable Dict, they contain lots of blanks)

Forgive the intense semantic nitpicking, but SWHIDs are plain strings, so anything you build with SWHID(...) isn't an SWHID, it's an instance of the SWHID class.

So I'll rephrases your question as:

Does the following code print an SWHID?"

>>> swhid = SWHID(namespace='swh', scheme_version=1, object_type='directory', object_id='31b5c8cc985d190b5a7ef4878128ebfdc2358f49', metadata=ImmutableDict({'                        origin': 'https://hal.archives-ouvertes.fr/hal-01243573', '                        visit': 'swh:1:snp:4fc1e36fca86b2070204bedd51106014a614f321', '                        anchor': 'swh:1:rev:9c5de20cfb54682370a398fcc733e829903c8cba', '                        path': '/moranegg-AffectationRO-df7f68b/'}))
>>> str(swhid)
'swh:1:dir:31b5c8cc985d190b5a7ef4878128ebfdc2358f49;                        origin=https://hal.archives-ouvertes.fr/hal-01243573;                        visit=swh:1:snp:4fc1e36fca86b2070204bedd51106014a614f321;                        anchor=swh:1:rev:9c5de20cfb54682370a398fcc733e829903c8cba;                        path=/moranegg-AffectationRO-df7f68b/'

and the answer currently is no, because, by definition definition, SWHIDs cannot contain spaces.

Yes, but what's a valid swhid?
Something that passes the call to swh.model.identifiers.parse_swhid?

It's something that follows this grammar: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#syntax

Which hopefully is equivalent to passing that call.

SWHID(namespace='swh', scheme_version=1, object_type='directory', object_id='31b5c8cc985d190b5a7ef4878128ebfdc2358f49', metadata=ImmutableDict({'                        origin': 'https://hal.archives-ouvertes.fr/hal-01243573', '                        visit': 'swh:1:snp:4fc1e36fca86b2070204bedd51106014a614f321', '                        anchor': 'swh:1:rev:9c5de20cfb54682370a398fcc733e829903c8cba', '                        path': '/moranegg-AffectationRO-df7f68b/'}))

Is that a valid swhid?

(look at the keys in the Immutable Dict, they contain lots of blanks)

Forgive the intense semantic nitpicking, but SWHIDs are plain strings, so anything you build with SWHID(...) isn't an SWHID, it's an instance of the SWHID class.

So I'll rephrases your question as:

Does the following code print an SWHID?"

>>> swhid = SWHID(namespace='swh', scheme_version=1, object_type='directory', object_id='31b5c8cc985d190b5a7ef4878128ebfdc2358f49', metadata=ImmutableDict({'                        origin': 'https://hal.archives-ouvertes.fr/hal-01243573', '                        visit': 'swh:1:snp:4fc1e36fca86b2070204bedd51106014a614f321', '                        anchor': 'swh:1:rev:9c5de20cfb54682370a398fcc733e829903c8cba', '                        path': '/moranegg-AffectationRO-df7f68b/'}))
>>> str(swhid)
'swh:1:dir:31b5c8cc985d190b5a7ef4878128ebfdc2358f49;                        origin=https://hal.archives-ouvertes.fr/hal-01243573;                        visit=swh:1:snp:4fc1e36fca86b2070204bedd51106014a614f321;                        anchor=swh:1:rev:9c5de20cfb54682370a398fcc733e829903c8cba;                        path=/moranegg-AffectationRO-df7f68b/'

and the answer currently is no, because, by definition definition, SWHIDs cannot contain spaces.

Cool, thanks for the clarification.

But that's not a logic that should be done deposit side (or clients of the parse_swhid function for that matters).
What i'm taking from your reasoning then is that right now, we should make the parse_swhid more strict so it raises when such occurences arose.
So that parse_swhid clients does not think they have correct swhids on their side when they don't.

Update documentation and code to test on valid swhid

The swhid invalidity checks need to happen in the
swh.model.identifiers.parse_swhid command.

Build has FAILED

Patch application report for D4442 (id=15767)

Rebasing onto 7148a257b2...

Current branch diff-target is up to date.
Changes applied before test
commit 1ac47c1d5c622da1a7b118e36d01514f1ec26c0b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 6 17:39:08 2020 +0100

    deposit.parser: Add parse_swh_reference function
    
    This will ease the detection of a swhid reference or an origin within the
    <swh:deposit /> xml block.
    
    This is a first step towards adding more support to metadata-only deposit.
    
    Related to T2537

Link to build: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/308/
See console output for more information: https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/308/console

Fix flake8 warning about line too long ¯\_(ツ)_/¯

Build is green

Patch application report for D4442 (id=15769)

Rebasing onto 7148a257b2...

Current branch diff-target is up to date.
Changes applied before test
commit 018500c195610398bc78c68720446c295a6b7359
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Nov 6 17:39:08 2020 +0100

    deposit.parser: Add parse_swh_reference function
    
    This will ease the detection of a swhid reference or an origin within the
    <swh:deposit /> xml block.
    
    This is a first step towards adding more support to metadata-only deposit.
    
    Related to T2537

See https://jenkins.softwareheritage.org/job/DDEP/job/tests-on-diff/309/ for more details.

What i'm taking from your reasoning then is that right now, we should make the parse_swhid more strict so it raises when such occurences arose.
So that parse_swhid clients does not think they have correct swhids on their side when they don't.

Yes. And also because not being strict requires more code.

This revision is now accepted and ready to land.Nov 10 2020, 1:00 PM