Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
Details
- Reviewers
douardda vlorentz - Group Reviewers
Reviewers - Commits
- rDMFCD9940287285ba: Map a row from clearcode toolkit with software heritage archive
Diff Detail
- Repository
- rDMFCD ClearlyDefined metadata fetcher
- Branch
- definitions_mapping
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 18885 Build 29256: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 29255: arc lint + arc unit
Event Timeline
swh/clearlydefined/mapping_utils.py | ||
---|---|---|
148 | source.get("url") may return None, so for that I have used this line, and url should be str, not Optional[str] that will be returned by source.get("url") to prevent mypy error here | |
151 | Same reason for sourc.get("url") | |
181 | The case in test_map_row_with_invalid_ID_without_revision | |
183 | The case in test_map_row_with_invalid_ID_without_json_extension | |
202–203 | The case in test_map_row_with_empty_metadata_string, I have seen in clearcode toolkit DB, some rows might not have metadata at time of scan, it may gradually fetch that data after a while. |
swh/clearlydefined/mapping_utils.py | ||
---|---|---|
148 | If the URL is expected to always be either None or a string, then test if it is None, and if it's not, assert it's a string. That way we get an error at runtime instead of silently ignoring the unexpected data type. | |
181 | Could you explain this in a comment (as well as the ones below)? |
Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
Build is green
Patch application report for D4931 (id=17744)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit 44494efbc77d6c349985e6909a0179315e551bdf Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Fri Jan 29 19:06:11 2021 +0530 Add mapping for definitions and harvests Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/63/ for more details.
some of your test code doesn't run, see the coverage mark on the right.
And instead of testing the error message (which is brittle), you should raise different exception types.
And it's still missing tests for when the returned flag is expected to be True
swh/clearlydefined/mapping_utils.py | ||
---|---|---|
134 | when does this happen in practice? | |
174–175 | when does this happen in practice? | |
204–207 | ||
232–233 | shouldn't this be an error? | |
swh/clearlydefined/tests/test_mapping_utils.py | ||
131 | can you refine the return type? |
swh/clearlydefined/mapping_utils.py | ||
---|---|---|
134 | Yes, you are right, this line won't get hit in practice | |
174–175 | Yes you are right, it won't happen in pratcice | |
232–233 | IMO more tools can be used by clearcode toolkit in future and in that case we should save them for future mapping. | |
swh/clearlydefined/tests/test_mapping_utils.py | ||
131 | Sure |
swh/clearlydefined/mapping_utils.py | ||
---|---|---|
232–233 | then we should either raise an error here so we don't ignore them silenty, and be generic enough to support new tools. |
swh/clearlydefined/mapping_utils.py | ||
---|---|---|
232–233 | Sure |
swh/clearlydefined/mapping_utils.py | ||
---|---|---|
174–175 | Actually this line got hit, when metadata is wrongly formed, should I raise an error for it ?? |
swh/clearlydefined/mapping_utils.py | ||
---|---|---|
174–175 | depends when/why it's triggered. If you know for sure that will happen, then you need to handle it because we don't want the script to crash in production. |
Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
Build is green
Patch application report for D4931 (id=17761)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit fea266f254727b8318b64fc96d6deda5804588c5 Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Fri Jan 29 22:14:49 2021 +0530 Add mapping for definitions and harvests Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/64/ for more details.
Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
Build is green
Patch application report for D4931 (id=17767)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit 48b9edf3a0bac26f7632dde605e7ec11f35dbb6f Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Mon Feb 1 15:22:55 2021 +0530 Add mapping for definitions and harvests Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/65/ for more details.
Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
Add error classes for all conditions where metadata can be wrong or for an invalid ID
Build is green
Patch application report for D4931 (id=17768)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit aefcb6e0ce068cf1dd7b0da85012d103894039b5 Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Mon Feb 1 15:32:30 2021 +0530 Add mapping for definitions and harvests Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not Add error classes Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/66/ for more details.
You missed this comment:
And for this:
you moved the code around, but you're still validating twice in the same function.
swh/clearlydefined/error.py | ||
---|---|---|
1 ↗ | (On Diff #17768) | only 2021 |
6 ↗ | (On Diff #17768) | Better name, please. |
7–11 ↗ | (On Diff #17768) | this is not needed, the base exception handles it properly already. |
14–35 ↗ | (On Diff #17768) | Could you add docstrings explaining the meaning of each of these? |
swh/clearlydefined/mapping_utils.py | ||
1 | (also 2021 or 2020-2021 while we're at it) | |
15 | the argument name should only be "storage", for consistency with the rest of the codebase. And we prefer it to be the first argument (same reason) | |
53 | please rename this. "flag" is synonymous with "boolean", which doesn't convey the meaning of this variable. (same comment below) | |
swh/clearlydefined/tests/test_mapping_utils.py | ||
245 | same comment on "flag" |
and there's again some unreachable code in the tests; please check this before submitting diffs.
And there are two lines in the with pytest.raises blocks; only the line expected to raise the exception should be in the block.
Add mapping for definitions and harvests
Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not
Add error classes for all conditions where metadata can be wrong or for an invalid ID
Add docstring for error class
Build is green
Patch application report for D4931 (id=17774)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit f329f8e869b0565bd26c917b99608b981d2091ea Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Mon Feb 1 19:13:37 2021 +0530 Add mapping for definitions and harvests Add functions map_row, map_definition, map_harvest to check whether swh archive is able to map clearlydefined object or not Add error classes for all conditions where metadata can be wrong or for an invalid ID Add docstring for error class Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/67/ for more details.
ok I'm accepting this diff, but let's try to to deduplicate these functions after it's merged.
Add mapping for definitions and harvests
Use swh storage instead of sql queries to map sha1 with SWHIDs
Add different type of error classes in error.py
Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and return a valid SWHID, TargetType(revision/content), Origin
If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
Add tests to cover all the cases"
Build is green
Patch application report for D4931 (id=17787)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit 9b4e723924a862eefe9efadebbd0fbbc983e8eb6 Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Mon Feb 1 22:16:55 2021 +0530 Add mapping for definitions and harvests Use swh storage instead of sql queries to map sha1 with SWHIDs Add different type of errors in error.py Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest. If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and return a valid SWHID, TargetType(revision/content), Origin If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]). Add tests to cover all the cases Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/68/ for more details.
Add mapping for definitions and harvests
Use swh storage instead of sql queries to map sha1 with SWHIDs
Add different type of errors in error.py
Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and if able to map the defintion then return True,[(SWHID, TargetType(revision/content), Origin)], otherwise return None.
If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
Add tests to cover all the cases
Build is green
Patch application report for D4931 (id=17789)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit a09e460f28b74217efa4f61a44e20226a5868800 Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Mon Feb 1 22:21:00 2021 +0530 Add mapping for definitions and harvests Use swh storage instead of sql queries to map sha1 with SWHIDs Add different type of errors in error.py Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest. If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and return a valid SWHID, TargetType(revision/content), Origin If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]). Add tests to cover all the cases Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/69/ for more details.
Add mapping for definitions and harvests
Use swh storage instead of sql queries to map sha1 with SWHIDs
Add different type of exceptions in error.py
Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest.
If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and if able to map the defintion then return True,[(SWHID, TargetType(revision/content), Origin)], otherwise return None.
If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]).
Add tests to cover all the cases
Build is green
Patch application report for D4931 (id=17791)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit f1d4923895e27d9df3b9897b2fc8f98811d1925a Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Mon Feb 1 23:01:51 2021 +0530 Add mapping for definitions and harvests Use swh storage instead of sql queries to map sha1 with SWHIDs Add different type of exceptions in error.py Add function map_row, and try to check if the ID of the row is valid ID or not, if it is not a valid ID, then raise different exceptions defined in error.py. And if the ID is valid then check if the ID is a definition or harvest. If the given ID is a definition then map the hashes with the archive (sha1 with content and sha1git with revision) and if able to map the defintion then return True,[(SWHID, TargetType(revision/content), Origin)], otherwise return None. If the given ID is a harvest, then check the tool of the harvest (licensee, scancode, clearlydefined) and map accordig to the tool (map_licensee, map_scancode, map_clearlydefined), then return mapping status of that harvest (True if every sha1 of that harvest is mapped and False if failed to do it) and a list of data([SWHID,TargetType,None]). Add tests to cover all the cases Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/70/ for more details.
Add feature to map a row from clearcode toolkit database with content and revision table from software heritage archive
This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.
Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.
Build is green
Patch application report for D4931 (id=17793)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit 7d81301051b9eca6c2fade465b5926aac31d0516 Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Tue Feb 2 14:31:29 2021 +0530 Add feature to map a row from clearcode toolkit database with content and revision table from software heritage archive This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future. Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works. Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/71/ for more details.
Add feature to map a row from clearcode toolkit database with content and revision table from software heritage archive
This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.
Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.
Build is green
Patch application report for D4931 (id=17795)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit 3611155bbd705d6b03595941e22ba12c9b5d3e04 Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Tue Feb 2 15:06:51 2021 +0530 Add feature to map a row from clearcode toolkit database with content and revision table from software heritage archive This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future. Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works. Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/72/ for more details.
Map a row from clearcode toolkit with software heritage archive
This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future.
Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works.
Build is green
Patch application report for D4931 (id=17796)
Rebasing onto 6d70494b71...
Current branch diff-target is up to date.
Changes applied before test
commit 9940287285babe39224cd64d7fddf1507d5589d3 Author: Tushar Goel <tushar.goel.dav@gmail.com> Date: Tue Feb 2 15:11:12 2021 +0530 Map a row from clearcode toolkit with software heritage archive This is to build a mechanism to map a row [path(Primary Key), content(binary data), last_modified_date(timestamp with timezone), map_error(error message while mapping), uuid] from clearcode toolkit database, with software heritage archive using content table for sha1 and revision table for sha1_git and extract required information from that row. Then return list of data that has been mapped and mapping status(if able to map every hash of that row, will return True, else return False) so the row that is not being able to map for now can be stored in a state, and can be mapped in future. Add various exception classes in error.py that can be raised while mapping a row. Check if that row is a definition or harvest and also check if that row does not has invalid path, raise exception if path is invalid. If row is a definiton then map the data using map_definition and if it is a harvest then map it using map_harvest. Use storage instead of sql queries while mapping with the data inside archive. Add tests to cover all the cases and add docstrings to explain how every function works. Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/73/ for more details.