Page MenuHomeSoftware Heritage

Metadata Indexer for Pub (pubspec.yaml)
ClosedPublic

Authored by VickyMerzOwn on Jul 5 2022, 11:05 AM.

Details

Summary

Fixes T4376

Test Plan

Write a test case for the basic pub mapping

	   Test edge cases with examples (maybe from sourcegraph)

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 30383
Build 47496: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 47495: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Build is green

Patch application report for D8079 (id=29228)

Rebasing onto c2742b5b75...

First, rewinding head to replay your work on top of it...
Applying: metadata_dictionary: Add mappings for pubspec.yaml
Changes applied before test
commit bb56919c89c6953421789ba72bda6b4f61610adf
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/350/ for more details.

vlorentz requested changes to this revision.Jul 7 2022, 9:45 PM

Why is so much code commented out?

swh/indexer/metadata_dictionary/pub.py
41–53 ↗(On Diff #29228)

Move this to a new base.YamlMapping to deduplicate the code with CffMapping

swh/indexer/tests/test_cli.py
104

keep this entry instead of adding \n to the other one

This revision now requires changes to proceed.Jul 7 2022, 9:45 PM

Adds base class for mapping Yaml files

Made the requested changes. The 2 methods that are in PubMapping are also deduplicated from CffMapping. However it felt intuitive to leave them that way (and not include them in YamlMapping).

Am I wrong? I'll change it if so.

Build is green

Patch application report for D8079 (id=29230)

Rebasing onto c2742b5b75...

First, rewinding head to replay your work on top of it...
Applying: metadata_dictionary: Add mappings for pubspec.yaml
Using index info to reconstruct a base tree...
M	swh/indexer/metadata_dictionary/cff.py
Falling back to patching base and 3-way merge...
Auto-merging swh/indexer/metadata_dictionary/cff.py
CONFLICT (content): Merge conflict in swh/indexer/metadata_dictionary/cff.py
Patch failed at 0001 metadata_dictionary: Add mappings for pubspec.yaml

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto c2742b5b75...

Already up to date.
Changes applied before test
commit 94f4a5f6cb5aa2e95f7abdf9c51712ee31cbbae8
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/351/ for more details.

vlorentz requested changes to this revision.Jul 8 2022, 8:36 AM

Thanks. Some more comments now that we have the basics covered:

Hmm, I wonder what we could translate the pubspec documentation key to. codemeta:readme and codemeta:buildInstructions both feel slightly off, and schema:documentation seems to be only for web APIs for some reason...

Also, could you rename the mapping to call this "pubspec"? "pub" alone is rather ambiguous. And probably rename the file to "dart.py", as file names tend to describe communities rather than formats (hypothetically, we would have all three mappings for PKG-INFO/setup.cfg/pyproject.toml in python.py for example)

swh/indexer/data/pub.csv
4 ↗(On Diff #29230)

could translate this from the platforms field

21 ↗(On Diff #29230)

didn't you say there is also an authors field?

34 ↗(On Diff #29230)

I this field documented? I don't see it on https://dart.dev/tools/pub/pubspec

swh/indexer/metadata_dictionary/cff.py
32–33

please rebase before submitting diffs; this code was simplified; so you can simplify the new base YamlMapping similarly

swh/indexer/metadata_dictionary/pub.py
1 ↗(On Diff #29230)

only 2022 as far as I can tell

This revision now requires changes to proceed.Jul 8 2022, 8:36 AM
VickyMerzOwn added inline comments.
swh/indexer/data/pub.csv
21 ↗(On Diff #29230)

Yeah, but both author and authors can be mapped to author.
I forgot to add it. Will look into it now.

34 ↗(On Diff #29230)

You're right, my bad.
There has been a brief discussion on this:
https://github.com/dart-lang/pub-dev/issues/5535
How come other libraries have a license field then?

swh/indexer/metadata_dictionary/pub.py
1 ↗(On Diff #29230)

Sorry, I didn't understand what you mean.
Should I remove "The Software Heritage developers"?

VickyMerzOwn marked 3 inline comments as done.

Change mapping name
Rebase

swh/indexer/data/pub.csv
34 ↗(On Diff #29230)

probably because some project boilerplate generator writes it and/or people copy this from each other without checking the documentation

swh/indexer/metadata_dictionary/pub.py
1 ↗(On Diff #29230)

I meant replacing "2017-2022" with only "2022" (and in test_pub.py too)

VickyMerzOwn marked an inline comment as done.

Exclude license field from mapping

Build is green

Patch application report for D8079 (id=29232)

Rebasing onto c2742b5b75...

Current branch diff-target is up to date.
Changes applied before test
commit 68170dda98fda949d2988743838140680afe6789
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/352/ for more details.

swh/indexer/data/pub.csv
34 ↗(On Diff #29230)

anyway, it's worth adding given the popularity. Please check people only use it with SPDX id though; and document it in dart.py (a one-line comment is probably enough)

Build has FAILED

Patch application report for D8079 (id=29233)

Rebasing onto c2742b5b75...

Current branch diff-target is up to date.
Changes applied before test
commit efaf1d810d0a3d9aff5ee1e534ee64af9146bd93
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/353/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/353/console

VickyMerzOwn added inline comments.
swh/indexer/data/pub.csv
34 ↗(On Diff #29230)

Would you like that I import the table from here: https://github.com/spdx/license-list-data/blob/master/licenses.md
Then whenever a license field is present, it verifies if the value is in the table and only then adds it?

Build is green

Patch application report for D8079 (id=29236)

Rebasing onto c2742b5b75...

Current branch diff-target is up to date.
Changes applied before test
commit 6ac7d51386b6f64b2b4972e10d4441525fcbb2b6
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/354/ for more details.

swh/indexer/data/pub.csv
34 ↗(On Diff #29230)

No, this table changes too frequently.

I meant you should check that people use SPDX ids in practice

swh/indexer/data/pub.csv
34 ↗(On Diff #29230)

Just got another idea. Even if there isn't a license field, the first line of the license file gives the type. Can I use that or is it diverging a lot from using mappings to translate metadata?

swh/indexer/data/pub.csv
34 ↗(On Diff #29230)

it's actually more complicated than this, but possible with tools like fossology. It will need to be a separate mapping, though.

Normalize author/authors field(s)

Build is green

Patch application report for D8079 (id=29265)

Rebasing onto c2742b5b75...

Current branch diff-target is up to date.
Changes applied before test
commit 381cf1d53ab595fac7483b5621829397f91696ce
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/355/ for more details.

Could you add a test with both author and authors?

swh/indexer/metadata_dictionary/dart.py
53–59

str.rsplit takes a maxsplit argument, so you can simplify your code like above.

However, neither variant supports authors hiding their email address, and you should either add support for that, or make sure this never actually happens. (To implement it without making the code too convoluted, you can use this regexp: (?P<name>.*?)( <(?P<email>.*)>)?)

swh/indexer/tests/metadata_dictionary/test_pubspec.py
1 ↗(On Diff #29265)

you should rename this file to match the other one

79 ↗(On Diff #29265)

avoid real domain names in tests; use those defined in RFC6761 instead, such as "example.org"

89–90 ↗(On Diff #29265)

same here

This revision now requires changes to proceed.Jul 11 2022, 10:02 AM

Changes test filename to test_dart.py
Checks for presence of email while normalizing author

Build is green

Patch application report for D8079 (id=29298)

Rebasing onto 2dd2be9c1b...

First, rewinding head to replay your work on top of it...
Applying: metadata_dictionary: Add mappings for pubspec.yaml
Using index info to reconstruct a base tree...
M	swh/indexer/metadata_dictionary/__init__.py
M	swh/indexer/metadata_dictionary/base.py
M	swh/indexer/metadata_dictionary/cff.py
Falling back to patching base and 3-way merge...
Auto-merging swh/indexer/metadata_dictionary/cff.py
CONFLICT (content): Merge conflict in swh/indexer/metadata_dictionary/cff.py
Auto-merging swh/indexer/metadata_dictionary/base.py
Auto-merging swh/indexer/metadata_dictionary/__init__.py
CONFLICT (content): Merge conflict in swh/indexer/metadata_dictionary/__init__.py
Patch failed at 0001 metadata_dictionary: Add mappings for pubspec.yaml

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto 2dd2be9c1b...

Already up to date.
Changes applied before test
commit 8e800e2f2ea7568b43a8ab2cfac58a329cb47c2f
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/357/ for more details.

swh/indexer/metadata_dictionary/dart.py
53–59

Checking for "@" in the string is like a check if there is an email address in the string.

don't forget this:

Could you add a test with both author and authors?

swh/indexer/metadata_dictionary/dart.py
53–59

It is not guaranteed. These files are untrusted input, they need to be fully validated to prevent crashes.

This revision now requires changes to proceed.Jul 12 2022, 2:31 PM

don't forget this:

Could you add a test with both author and authors?

I think either of the fields will be present but not both.

Not sure if this query is accurate but it gives 0 results: https://sourcegraph.com/search?q=context:global+author+authors+file:pubspec.yaml&patternType=literal

https://sourcegraph.com/search?q=context:global+author+version+file:pubspec.yaml&patternType=literal also returns no result, because it searches for the literal string author version (or author authors in your example), see https://docs.sourcegraph.com/code_search/reference/queries#literal-search-default

So this is not conclusive

  • Verifies author value with regex
  • Adds test for files containing both 'author' and 'authors' fields

Build is green

Patch application report for D8079 (id=29326)

Rebasing onto 2dd2be9c1b...

First, rewinding head to replay your work on top of it...
Applying: metadata_dictionary: Add mappings for pubspec.yaml
Using index info to reconstruct a base tree...
M	swh/indexer/metadata_dictionary/__init__.py
M	swh/indexer/metadata_dictionary/base.py
M	swh/indexer/metadata_dictionary/cff.py
Falling back to patching base and 3-way merge...
Auto-merging swh/indexer/metadata_dictionary/cff.py
CONFLICT (content): Merge conflict in swh/indexer/metadata_dictionary/cff.py
Auto-merging swh/indexer/metadata_dictionary/base.py
Auto-merging swh/indexer/metadata_dictionary/__init__.py
CONFLICT (content): Merge conflict in swh/indexer/metadata_dictionary/__init__.py
Patch failed at 0001 metadata_dictionary: Add mappings for pubspec.yaml

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto 2dd2be9c1b...

Already up to date.
Changes applied before test
commit 3f72e93f35d9f8de093e1d9af9a46e38b42a52b9
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/358/ for more details.

swh/indexer/metadata_dictionary/dart.py
56–57

use the regexp's capture groups, instead of parsing twice

Build is green

Patch application report for D8079 (id=29333)

Rebasing onto 2dd2be9c1b...

First, rewinding head to replay your work on top of it...
Applying: metadata_dictionary: Add mappings for pubspec.yaml
Using index info to reconstruct a base tree...
M	swh/indexer/metadata_dictionary/__init__.py
M	swh/indexer/metadata_dictionary/base.py
M	swh/indexer/metadata_dictionary/cff.py
Falling back to patching base and 3-way merge...
Auto-merging swh/indexer/metadata_dictionary/cff.py
CONFLICT (content): Merge conflict in swh/indexer/metadata_dictionary/cff.py
Auto-merging swh/indexer/metadata_dictionary/base.py
Auto-merging swh/indexer/metadata_dictionary/__init__.py
CONFLICT (content): Merge conflict in swh/indexer/metadata_dictionary/__init__.py
Patch failed at 0001 metadata_dictionary: Add mappings for pubspec.yaml

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto 2dd2be9c1b...

Already up to date.
Changes applied before test
commit ffa4e5af7fdb1bd4f43d1ebb36ae6c48aa74ac5b
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 12 22:25:05 2022 +0530

    Capture groups

commit 3f72e93f35d9f8de093e1d9af9a46e38b42a52b9
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/359/ for more details.

thanks. one last thing:

swh/indexer/metadata_dictionary/dart.py
54

redundant, None is not an instance of str

This revision now requires changes to proceed.Jul 12 2022, 7:20 PM

Remove redundant check that input string is not Nonetype
Rebase

Build is green

Patch application report for D8079 (id=29334)

Rebasing onto 2dd2be9c1b...

Current branch diff-target is up to date.
Changes applied before test
commit fa67b73d6a88461e76601c3b61ee6d33ba5a2c93
Author: Satvik Vemuganti <vemugantisesha@iitbhilai.ac.in>
Date:   Tue Jul 5 14:33:19 2022 +0530

    metadata_dictionary: Add mappings for pubspec.yaml

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/360/ for more details.

This revision is now accepted and ready to land.Jul 12 2022, 9:56 PM