Page MenuHomeSoftware Heritage

Add mapping of definitions and harvests
ClosedPublic

Authored by TG1999 on Jan 24 2021, 1:26 PM.

Details

Summary

Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

Diff Detail

Repository
rDMFCD ClearlyDefined metadata fetcher
Branch
definitions_mapping
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18795
Build 29107: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 29106: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
This revision now requires changes to proceed.Jan 29 2021, 11:03 AM
TG1999 added inline comments.
swh/clearlydefined/mapping_utils.py
155

source.get("url") may return None, so for that I have used this line, and url should be str, not Optional[str] that will be returned by source.get("url") to prevent mypy error here

158

Same reason for sourc.get("url")

188

The case in test_map_row_with_invalid_ID_without_revision

190

The case in test_map_row_with_invalid_ID_without_json_extension

209–210

The case in test_map_row_with_empty_metadata_string, I have seen in clearcode toolkit DB, some rows might not have metadata at time of scan, it may gradually fetch that data after a while.

swh/clearlydefined/mapping_utils.py
155

If the URL is expected to always be either None or a string, then test if it is None, and if it's not, assert it's a string.

That way we get an error at runtime instead of silently ignoring the unexpected data type.

188

Could you explain this in a comment (as well as the ones below)?

TG1999 marked 5 inline comments as done.

Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not

Build is green

Patch application report for D4931 (id=17744)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit 44494efbc77d6c349985e6909a0179315e551bdf
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Fri Jan 29 19:06:11 2021 +0530

    Add mapping for definitions and harvests
    Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/63/ for more details.

swh/clearlydefined/mapping_utils.py
155

Got it

188

Sure

@vlorentz made the suggested changes, thanks for a detailed review :)

some of your test code doesn't run, see the coverage mark on the right.

And instead of testing the error message (which is brittle), you should raise different exception types.

And it's still missing tests for when the returned flag is expected to be True

swh/clearlydefined/mapping_utils.py
140

when does this happen in practice?

180–181

when does this happen in practice?

210–213
238–239

shouldn't this be an error?

swh/clearlydefined/tests/test_mapping_utils.py
131 ↗(On Diff #17744)

can you refine the return type?

This revision now requires changes to proceed.Jan 29 2021, 2:55 PM
swh/clearlydefined/mapping_utils.py
140

Yes, you are right, this line won't get hit in practice

180–181

Yes you are right, it won't happen in pratcice

238–239

IMO more tools can be used by clearcode toolkit in future and in that case we should save them for future mapping.

swh/clearlydefined/tests/test_mapping_utils.py
131 ↗(On Diff #17744)

Sure

swh/clearlydefined/mapping_utils.py
238–239

then we should either raise an error here so we don't ignore them silenty, and be generic enough to support new tools.

TG1999 added inline comments.
swh/clearlydefined/mapping_utils.py
238–239

Sure

TG1999 added inline comments.
swh/clearlydefined/mapping_utils.py
180–181

Actually this line got hit, when metadata is wrongly formed, should I raise an error for it ??

swh/clearlydefined/mapping_utils.py
180–181

depends when/why it's triggered. If you know for sure that will happen, then you need to handle it because we don't want the script to crash in production.

TG1999 marked 3 inline comments as done.

Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not

Build is green

Patch application report for D4931 (id=17761)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit fea266f254727b8318b64fc96d6deda5804588c5
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Fri Jan 29 22:14:49 2021 +0530

    Add mapping for definitions and harvests
    Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/64/ for more details.

Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not

Build is green

Patch application report for D4931 (id=17767)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit 48b9edf3a0bac26f7632dde605e7ec11f35dbb6f
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 1 15:22:55 2021 +0530

    Add mapping for definitions and harvests
    Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/65/ for more details.

Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
Add error classes for all conditions where metadata can be wrong or for an invalid ID

Build is green

Patch application report for D4931 (id=17768)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit aefcb6e0ce068cf1dd7b0da85012d103894039b5
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 1 15:32:30 2021 +0530

    Add mapping for definitions and harvests
    Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
    Add error classes
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/66/ for more details.

You missed this comment:

Could you also document better your map_* functions? ie. the return type should be more specific than list, and you should explain what that list contains.

And they are all very similar, especially the content of the for loop. As all three are only called from map_harvest, this work could be done in map_harvest. I'm not completely sure, but making them generator functions might help


And for this:

You're parsing/validating list_cd_path in two different places. This increases the risk of bugs and mismatches between the two codes.

Just parse it once and for all in a single place.

you moved the code around, but you're still validating twice in the same function.

swh/clearlydefined/error.py
1 ↗(On Diff #17768)

only 2021

6 ↗(On Diff #17768)

Better name, please.

7–11 ↗(On Diff #17768)

this is not needed, the base exception handles it properly already.

14–35 ↗(On Diff #17768)

Could you add docstrings explaining the meaning of each of these?

swh/clearlydefined/mapping_utils.py
1

(also 2021 or 2020-2021 while we're at it)

18

the argument name should only be "storage", for consistency with the rest of the codebase. And we prefer it to be the first argument (same reason)

60

please rename this. "flag" is synonymous with "boolean", which doesn't convey the meaning of this variable.

(same comment below)

swh/clearlydefined/tests/test_mapping_utils.py
245 ↗(On Diff #17768)

same comment on "flag"

This revision now requires changes to proceed.Feb 1 2021, 12:15 PM

and there's again some unreachable code in the tests; please check this before submitting diffs.

And there are two lines in the with pytest.raises blocks; only the line expected to raise the exception should be in the block.

TG1999 marked 7 inline comments as done.

Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
Add error classes for all conditions where metadata can be wrong or for an invalid ID
Add docstring for error class

Build is green

Patch application report for D4931 (id=17774)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit f329f8e869b0565bd26c917b99608b981d2091ea
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 1 19:13:37 2021 +0530

    Add mapping for definitions and harvests
    Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
    Add error classes for all conditions where metadata can be wrong or for an invalid ID
    Add docstring for error class
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/67/ for more details.

ok I'm accepting this diff, but let's try to to deduplicate these functions after it's merged.

Add mapping for definitions and harvests
Use swh storage instead of sql queries to map sha1 with SWHIDs
Add different type of error classes in error.py
Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and return a valid SWHID, TargetType(revision/content), Origin
If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
Add tests to cover all the cases"

Build is green

Patch application report for D4931 (id=17787)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit 9b4e723924a862eefe9efadebbd0fbbc983e8eb6
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 1 22:16:55 2021 +0530

    Add mapping for definitions and harvests
    Use swh storage instead of sql queries to map sha1 with SWHIDs
    Add different type of errors in error.py
    Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
    If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and return a valid SWHID, TargetType(revision/content), Origin
    If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
    Add tests to cover all the cases
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/68/ for more details.

Add mapping for definitions and harvests
Use swh storage instead of sql queries to map sha1 with SWHIDs
Add different type of errors in error.py
Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and if able to map the defintion then return True,[(SWHID, TargetType(revision/content), Origin)], otherwise return None.
If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
Add tests to cover all the cases

Build is green

Patch application report for D4931 (id=17789)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit a09e460f28b74217efa4f61a44e20226a5868800
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 1 22:21:00 2021 +0530

    Add mapping for definitions and harvests
    Use swh storage instead of sql queries to map sha1 with SWHIDs
    Add different type of errors in error.py
    Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
    If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and return a valid SWHID, TargetType(revision/content), Origin
    If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
    Add tests to cover all the cases
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/69/ for more details.

Add mapping for definitions and harvests
Use swh storage instead of sql queries to map sha1 with SWHIDs
Add different type of exceptions in error.py
Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and if able to map the defintion then return True,[(SWHID, TargetType(revision/content), Origin)], otherwise return None.
If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
Add tests to cover all the cases

Build is green

Patch application report for D4931 (id=17791)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit f1d4923895e27d9df3b9897b2fc8f98811d1925a
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 1 23:01:51 2021 +0530

    Add mapping for definitions and harvests
    Use swh storage instead of sql queries to map sha1 with SWHIDs
    Add different type of exceptions in error.py
    Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
    If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and if able to map the defintion then return True,[(SWHID, TargetType(revision/content), Origin)], otherwise return None.
    If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
    Add tests to cover all the cases
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/70/ for more details.

Add feature to map a row from clearcode toolkit database with content and revision table from software heritage archive
This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.
Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.

Build is green

Patch application report for D4931 (id=17793)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit 7d81301051b9eca6c2fade465b5926aac31d0516
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Tue Feb 2 14:31:29 2021 +0530

    Add feature to map a row from clearcode toolkit database with content and revision table from software heritage archive
    This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.
    Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/71/ for more details.

Add feature to map a row from clearcode toolkit database with content and revision table from software heritage archive

This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.

Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.

Build is green

Patch application report for D4931 (id=17795)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit 3611155bbd705d6b03595941e22ba12c9b5d3e04
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Tue Feb 2 15:06:51 2021 +0530

    Add feature to map a row from clearcode toolkit database with content and revision table from software heritage archive
    
    This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.
    
    Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/72/ for more details.

Map a row from clearcode toolkit with software heritage archive

This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.

Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.

Build is green

Patch application report for D4931 (id=17796)

Rebasing onto 6d70494b71...

Current branch diff-target is up to date.
Changes applied before test
commit 9940287285babe39224cd64d7fddf1507d5589d3
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Tue Feb 2 15:11:12 2021 +0530

    Map a row from clearcode toolkit with software heritage archive
    
    This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.
    
    Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/73/ for more details.

This revision is now accepted and ready to land.Feb 2 2021, 10:44 AM