Add Orchestrator
Closed, Public

Authored by TG1999 on Feb 7 2021, 6:52 PM.

Details

Reviewers
vlorentz
Group Reviewers
Reviewers
Commits
rDMFCD89092343ee46: Add Orchestrator
Summary

This builds a mechanism that writes data from the clearcode database that has been mapped to swh storage into swh RawExtrinsicMetadata, and writes data that has not been mapped to
a table, unmapped_data. The orchestration process will run periodically and will only try to map new data entered after the last orchestration run, plus the data that was
not mapped in the last run.

Initialize the tables if they don't exist in the database. Initialize swh storage and add the MetadataAuthority and MetadataFetcher, then map previously unmapped data and get the last run date of the orchestration.
Then read data from clearcode and orchestrate rows from the clearcode DB: if a whole row is mapped, it is written to the metadata storage; if only part or none of its data is matched, the row is stored in the unmapped_data table
for future mapping. If the tool of a row is fossology, skip that row. Modify the function map_row (from now on it also returns a dictionary of data) and add a new function to get the tool type of a row.
Add tests and docstrings.
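
The flow described above could be sketched roughly as follows. This is a minimal in-memory illustration, not the code under review: `Row`, `State`, `orchestrate` and the `try_map` callback are hypothetical names, the row fields are assumed, and plain Python containers stand in for the database tables and swh storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Row:
    """Hypothetical shape of a clearcode row; the real columns may differ."""

    path: str
    tool: str
    last_modified: datetime


@dataclass
class State:
    """In-memory stand-in for unmapped_data, swh storage and the last run date."""

    unmapped: Dict[str, Row] = field(default_factory=dict)
    stored: List[str] = field(default_factory=list)
    last_run: Optional[datetime] = None


def orchestrate(
    state: State, rows: List[Row], try_map: Callable[[Row], bool], now: datetime
) -> None:
    # 1. Retry rows that earlier runs could not map.
    for path, row in list(state.unmapped.items()):
        if try_map(row):
            state.stored.append(path)
            del state.unmapped[path]
    # 2. Handle only rows modified since the previous run.
    for row in rows:
        if state.last_run is not None and row.last_modified <= state.last_run:
            continue
        if row.tool == "fossology":
            continue  # fossology rows are skipped
        if try_map(row):
            state.stored.append(row.path)
        else:
            state.unmapped[row.path] = row  # keep for a later run
    # 3. Remember when this run happened.
    state.last_run = now
```

Passing `now` explicitly, instead of reading the clock inside the function, keeps a sketch like this easy to test.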

Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

Diff Detail

Event Timeline

Build is green

Patch application report for D5036 (id=18276)

Rebasing onto 10e3919718...

Current branch diff-target is up to date.
Changes applied before test
commit a02411238c47f5c883b6e7cd9327582d8322e188
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Fri Feb 19 13:35:56 2021 +0530

    Add Orchestrator
    
    This builds a mechanism that writes data from the clearcode database that has been mapped to swh storage into swh RawExtrinsicMetadata, and writes data that has not been mapped to
    a table, unmapped_data. The orchestration process will run periodically and will only try to map new data entered after the last orchestration run, plus the data that was
    not mapped in the last run.
    
    Initialize the tables if they don't exist in the database. Initialize swh storage and add the MetadataAuthority and MetadataFetcher, then map previously unmapped data and get the last run date of the orchestration.
    Then read data from clearcode and orchestrate rows from the clearcode DB: if a whole row is mapped, it is written to the metadata storage; if only part or none of its data is matched, the row is stored in the unmapped_data table
    for future mapping. If the tool of a row is fossology, skip that row.
    
    Add tests and docstrings.
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/104/ for more details.

Build is green

Patch application report for D5036 (id=18299)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit 589c517a9af88cca193e9bdf6d20bd3daa3ad9b9
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Fri Feb 19 18:15:28 2021 +0530

    Add Orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/108/ for more details.

vlorentz added inline comments.
swh/clearlydefined/orchestrator.py
39–53

in the sql/ dir.

and you shouldn't need ON CONFLICT

101–106

This is neither a good signature nor a good docstring. Readers have no idea what row should be. It's also very easy to pass the wrong fields in row, or to pass them in the wrong order.

Instead, either use a class, or use a dict and document every field.

The name also isn't very explicit; there should at least be a verb or a noun representing an action in function names, because functions do something. orchestor_row sounds like a variable or class name, because it does not describe an action/computation.
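
As an illustration of the "use a class" option, a row type with named, documented fields might look like the sketch below. The class name `ClearcodeRow`, its fields and the helper `is_row_newer_than` are assumptions made for this example, not the actual schema or code of the diff.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ClearcodeRow:
    """One row read from the clearcode database.

    The field names below are illustrative assumptions, not the real schema.
    """

    path: str  # identifier of the harvested item
    metadata: bytes  # raw payload stored for that item
    date: datetime  # timezone-aware modification time of the row


def is_row_newer_than(row: ClearcodeRow, since: datetime) -> bool:
    """Return True if the row was modified after `since`."""
    return row.date > since


# Keyword arguments make it hard to pass fields in the wrong order.
row = ClearcodeRow(
    path="example/path",
    metadata=b"{}",
    date=datetime(2021, 2, 19, tzinfo=timezone.utc),
)
```

The function name starts with a verb-like predicate, so a call site such as `is_row_newer_than(row, last_run)` reads as a question about the row.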

130–140

I still don't. I don't see a change in the code and there are still no comments. (note: this comment applies to the second branch of orchestor_row)

swh/clearlydefined/sql/schema.sql
1–12 ↗(On Diff #18299)
  • wrong filename (see swh-storage)
  • missing indentation
  • missing comments
  • could use new lines between tables for readability
  • missing newline at the end
swh/clearlydefined/tests/data/clearydefined_not_mapped.json
1

why this new file?

swh/clearlydefined/tests/data/def_not_mapped.json
1

why this one?

This revision now requires changes to proceed. Feb 19 2021, 2:14 PM
TG1999 added inline comments.
swh/clearlydefined/orchestrator.py
39–53

Can you elaborate on this?

130–140

I have updated the docstrings. If any other change is required, can you elaborate on that?

swh/clearlydefined/tests/data/clearydefined_not_mapped.json
1

To add unmapped data that will be mapped after insertion in swh storage

swh/clearlydefined/tests/data/def_not_mapped.json
1

To add unmapped data that will be mapped after insertion in swh storage

swh/clearlydefined/orchestrator.py
39–53

all the initialization should be in sql/30-schema.sql, like swh-storage does.

130–140

Mostly, what is the content of mapped/data_list?

It's rather confusing that the variable mapped seems to be related to the row that is not yet mapped.

swh/clearlydefined/tests/data/clearydefined_not_mapped.json
1

can't you reuse the existing test files?

TG1999 added inline comments.
swh/clearlydefined/tests/data/clearydefined_not_mapped.json
1

The data used for mapping_utils can't do exactly what is required here. To reuse it, I would have to edit that data, and then I would also have to edit the test cases for mapping_utils.

swh/clearlydefined/tests/data/clearydefined_not_mapped.json
1

ok then. Could you add a README file in swh/clearlydefined/tests/data/ explaining what each file contains, and the difference between files that look similar?

swh/clearlydefined/orchestrator.py
39–53

Okay, then can I remove init_tables function?

swh/clearlydefined/tests/data/clearydefined_not_mapped.json
1

Sure!

TG1999 marked an inline comment as done.

Change SQL structure
Add readme in tests

Build has FAILED

Patch application report for D5036 (id=18320)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit b5fa152c06de6605ceb085b30524628e7a44d9ae
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 22 13:45:38 2021 +0530

    Add Orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

Link to build: https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/109/
See console output for more information: https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/109/console

TG1999 marked 2 inline comments as done.

Add class for row

Build is green

Patch application report for D5036 (id=18321)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit b6ebd415f4e0722db11d76c502e38364a8af0b73
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 22 13:51:54 2021 +0530

    Add Orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/110/ for more details.

vlorentz added inline comments.
swh/clearlydefined/sql/30-schema.sql
35–38

these comments aren't very helpful, they just rephrase the column's name.

Either make them more descriptive, or remove them.

53

They are not environment variables.

54–55

also not very helpful comments; I don't think we need them (everyone understands it's a key/value)

swh/clearlydefined/tests/data/README.md
1–31

Could you add some line breaks here, and remove redundancy ("scancode_true - This file is used for feeding as input as a mock metadata")?

This revision now requires changes to proceed. Feb 22 2021, 10:21 AM
swh/clearlydefined/mapping_utils.py
43–54

and undo this change

TG1999 added inline comments.
swh/clearlydefined/mapping_utils.py
43–54

metadata = None gives an error

swh/clearlydefined/mapping_utils.py
43–54

Use attr.evolve

swh/clearlydefined/mapping_utils.py
43–54

Where? In the orchestrator?

swh/clearlydefined/mapping_utils.py
43–54

where you need it to be not-None, yes
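
For context, `attr.evolve(inst, **changes)` returns a copy of an attrs instance with the given fields replaced, so the mapping code can keep returning `metadata=None` while the caller fills it in where needed. The sketch below shows the same pattern with the standard library's `dataclasses.replace`, which plays the equivalent role for dataclasses, so that it runs without third-party packages. `MappedEntry` and `with_metadata` are hypothetical names, not objects from the diff.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MappedEntry:
    """Hypothetical stand-in for the object returned by the mapping code."""

    target: str
    metadata: Optional[bytes] = None


def with_metadata(entry: MappedEntry, metadata: bytes) -> MappedEntry:
    """Return a copy of `entry` whose metadata is set; `entry` is unchanged."""
    # With an attrs class this line would be: attr.evolve(entry, metadata=metadata)
    return replace(entry, metadata=metadata)
```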

swh/clearlydefined/orchestrator.py
83–87

In what state?

And what is the returned value?

132–133

If I make this code run unconditionally, tests still pass. So either the conditional is useless or there is a test missing

swh/clearlydefined/tests/test_orchestrator.py
151

This only works if the system's timezone is UTC

TG1999 added inline comments.
swh/clearlydefined/tests/test_orchestrator.py
151

Any suggestion on what should be done here?

Remove Redundancy from Readme
Undo change in mapping_utils

Build is green

Patch application report for D5036 (id=18322)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit 22606c74159f4eb71fc07dbd2994d4abd2783d1f
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Mon Feb 22 15:45:43 2021 +0530

    Add Orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/111/ for more details.

swh/clearlydefined/tests/test_orchestrator.py
151

Make get_last_run_date actually return a datetime instead of a string
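
One way to act on this, sketched with the standard library only: parse the stored value into a timezone-aware datetime once, so later comparisons do not depend on the machine's local timezone. The helper name `parse_last_run_date` is hypothetical, and treating values without an offset as UTC is an assumption of this sketch, not something the diff states.

```python
from datetime import datetime, timezone


def parse_last_run_date(value: str) -> datetime:
    """Parse an ISO 8601 string into a timezone-aware datetime.

    Values without a UTC offset are assumed to be in UTC (an assumption
    made for this sketch).
    """
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

Aware datetimes compare by instant, so a value written with a `+05:30` offset and the same instant expressed in UTC are equal regardless of where the tests run.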

TG1999 added inline comments.
swh/clearlydefined/orchestrator.py
132–133

Hey, your observation is correct. Can you tell me how you inferred that, so I can correct this part of the code? I have asserted that orchestrate_row also returns False (when needed), and I thought line 169 in test_orchestrator was covering this part: we had 1 row in unmapped_data before running the orchestrator again on line 168, and 1 row was left afterwards, so it is not getting deleted either. I am a little bit confused.
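
A test that separates the two branches could look roughly like the sketch below. It assumes the conditional guards the deletion of a row from unmapped_data after a successful mapping; `retry_unmapped_rows` and the dict-based table are hypothetical stand-ins, not the code under review. The idea is to put one row that becomes mappable next to one that does not, so the assertion fails both when the deletion runs unconditionally and when it never runs.

```python
from typing import Callable, Dict


def retry_unmapped_rows(
    unmapped: Dict[str, bytes], try_map: Callable[[str, bytes], bool]
) -> None:
    """Retry every stored row; drop it from `unmapped` only if mapping succeeds."""
    for path, payload in list(unmapped.items()):
        if try_map(path, payload):  # the kind of conditional under discussion
            del unmapped[path]


def test_only_successfully_mapped_rows_are_deleted() -> None:
    unmapped = {"now-mappable": b"{}", "still-unmappable": b"{}"}
    retry_unmapped_rows(unmapped, lambda path, _payload: path == "now-mappable")
    # Unconditional deletion would empty the dict; no deletion would leave two rows.
    assert unmapped == {"still-unmappable": b"{}"}
```

A test with a single row that stays unmapped cannot tell "the condition was false" apart from "the deletion never happens", which may be why the existing assertions passed either way.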

TG1999 marked an inline comment as done.

Change date from string to datetime

Build has FAILED

Patch application report for D5036 (id=18333)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit 3571e87ee94eeb8584cd406162240aa58830034b
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Tue Feb 23 12:51:22 2021 +0530

    Add orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

Link to build: https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/112/
See console output for more information: https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/112/console

Build is green

Patch application report for D5036 (id=18334)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit d8496d2e09cc975805e872cd9f24853738811ed4
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Tue Feb 23 14:19:57 2021 +0530

    Add Orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/113/ for more details.

Use dateutil to parse date

Build is green

Patch application report for D5036 (id=18339)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit fbb1c45c21267554099819400fd2cc2e1eafae41
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Tue Feb 23 16:41:16 2021 +0530

    Add Orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/114/ for more details.

Remove replace timezone from date

Build is green

Patch application report for D5036 (id=18347)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit 1e5f45a8ab59c747aa9d647e8a54748d8255efda
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Tue Feb 23 19:21:38 2021 +0530

    Add Orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/115/ for more details.

Add timezone in mock data

Build is green

Patch application report for D5036 (id=18358)

Rebasing onto 429ca5f59b...

Current branch diff-target is up to date.
Changes applied before test
commit 89092343ee46b3c63ccd24b0965347e167230b59
Author: Tushar Goel <tushar.goel.dav@gmail.com>
Date:   Tue Feb 23 21:18:15 2021 +0530

    Add Orchestrator
    
    Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>

See https://jenkins.softwareheritage.org/job/DMFCD/job/tests-on-diff/116/ for more details.

Aaah, so that's where those dates were from :)

This revision is now accepted and ready to land. Feb 23 2021, 5:38 PM
This revision was automatically updated to reflect the committed changes.