Add luigi tasks
Closed · Public

Authored by vlorentz on Nov 10 2022, 10:38 AM.

Details

Summary

These tasks are geared toward running automatically: they call each
other as needed, and can be imported by workflows defined in other
modules (e.g. the future swh.graph.luigi module).

This heavily reuses the CLI code, so most of the new code consists of:

  • telling Luigi how to deduplicate tasks, and when and how to reuse the output of tasks that already ran
  • adding stamp files to avoid accidentally using a partially written export (one that was interrupted midway)
  • writing the meta.json file, which acts as a final stamp and provides information about the dataset export itself (required for T2579)
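
To illustrate the stamp-file idea from the list above, here is a minimal standard-library sketch. It is not the code from this diff and does not use Luigi: the `Task` class, the `build` helper, the `data.txt` file name and the contents of `meta.json` are all assumptions made for the example. In Luigi itself, a task is by default considered complete when its declared outputs exist, so pointing the output at a stamp written last gives a similar effect.

```python
import json
import tempfile
from pathlib import Path


class Task:
    """Hypothetical stand-in for a Luigi task (not the swh.dataset code)."""

    def __init__(self, out_dir, deps=()):
        self.out_dir = Path(out_dir)
        self.deps = list(deps)
        self.run_count = 0

    @property
    def stamp(self):
        # Assumed file name; the stamp is always the last file written.
        return self.out_dir / "meta.json"

    def complete(self):
        # Data files alone prove nothing: an interrupted run leaves them
        # behind. Only the stamp marks the output as safe to reuse.
        return self.stamp.is_file()

    def run(self):
        self.out_dir.mkdir(parents=True, exist_ok=True)
        (self.out_dir / "data.txt").write_text("exported rows\n")
        self.run_count += 1
        # Write under a temporary name, then rename, so that a reader
        # does not find a half-written stamp.
        tmp = self.out_dir / "meta.json.tmp"
        tmp.write_text(json.dumps({"files": ["data.txt"]}))
        tmp.replace(self.stamp)


def build(task):
    """Run dependencies first, then the task, skipping whatever is complete."""
    for dep in task.deps:
        build(dep)
    if not task.complete():
        task.run()
```

Calling `build()` twice on the same task runs it only once, while a directory that holds data files but no stamp is treated as incomplete and rebuilt.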

Depends on D8828

Test Plan

This is mostly declarative code, and the issues all arise when interfacing
with external services (mostly S3 and Athena), so I do not think writing tests
is worth it.

I played with this code in various scenarios while debugging, so I am
confident task deduplication works fine.

Diff Detail

Repository
rDDATASET Datasets
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32764
Build 51334: Phabricator diff pipeline on Jenkins
Build 51333: arc lint + arc unit

Unit Tests: Failed

Time   Test
0 ms   Jenkins > .tox.py3.lib.python3.7.site-packages.swh.dataset.luigi

.tox/py3/lib/python3.7/site-packages/swh/dataset/luigi.py:116: in <module>
    import luigi
E   ModuleNotFoundError: No module named 'luigi'

Event Timeline

Build has FAILED

Patch application report for D8829 (id=31824)

Could not rebase; Attempt merge onto b2ae082661...

Updating b2ae082..b557e0b
Fast-forward
 mypy.ini               |   3 +
 requirements-luigi.txt |   1 +
 setup.py               |   1 +
 swh/dataset/athena.py  |   4 +-
 swh/dataset/cli.py     |  56 ++++--
 swh/dataset/luigi.py   | 534 +++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 578 insertions(+), 21 deletions(-)
 create mode 100644 requirements-luigi.txt
 create mode 100644 swh/dataset/luigi.py
Changes applied before test
commit b557e0b6a1858feda20332628cf24e90cd23a530
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 10 10:33:42 2022 +0100

    Add luigi tasks
    
    They are more tuned toward running automatically, as they call each
    other as needed, and can be imported by workflows defined in other
    modules (eg. the future swh.graph.luigi module).

commit eea3e15bf7e4a817c07fee63f5262a525b6473e3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 10 10:17:04 2022 +0100

    cli: Move the main code of export_graph to its own function
    
    So it can be reused by a Luigi task

commit 5087a463974e548d78cb169264bb78935428e4a6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 10 10:11:51 2022 +0100

    athena: Fix create_table to work with restricted permissions
    
    For some reason, using a non-existing database works when working with
    credentials with unnecessarily high privileges (though it is not clear
    to me which permissions allow this).

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/159/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/159/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Nov 10 2022, 10:41 AM)
Harbormaster failed remote builds in B32764: Diff 31824!

tox.ini: Add luigi to the extras

Build has FAILED

Patch application report for D8829 (id=31825)

Could not rebase; Attempt merge onto b2ae082661...

Updating b2ae082..c890d2e
Fast-forward
 mypy.ini               |   3 +
 requirements-luigi.txt |   1 +
 setup.py               |   1 +
 swh/dataset/athena.py  |   4 +-
 swh/dataset/cli.py     |  56 ++++--
 swh/dataset/luigi.py   | 534 +++++++++++++++++++++++++++++++++++++++++++++++++
 tox.ini                |   1 +
 7 files changed, 579 insertions(+), 21 deletions(-)
 create mode 100644 requirements-luigi.txt
 create mode 100644 swh/dataset/luigi.py
Changes applied before test
commit c890d2e1ac88d2313f6d3668819520a4fbcba7e5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 10 10:33:42 2022 +0100

    Add luigi tasks
    
    They are more tuned toward running automatically, as they call each
    other as needed, and can be imported by workflows defined in other
    modules (eg. the future swh.graph.luigi module).

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/160/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/160/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Nov 10 2022, 10:45 AM)
Harbormaster failed remote builds in B32765: Diff 31825!

Build is green

Patch application report for D8829 (id=31826)

Could not rebase; Attempt merge onto b2ae082661...

Updating b2ae082..058e568
Fast-forward
 mypy.ini               |   3 +
 requirements-luigi.txt |   1 +
 setup.py               |   1 +
 swh/dataset/athena.py  |   4 +-
 swh/dataset/cli.py     |  56 ++++--
 swh/dataset/luigi.py   | 534 +++++++++++++++++++++++++++++++++++++++++++++++++
 tox.ini                |   3 +
 7 files changed, 581 insertions(+), 21 deletions(-)
 create mode 100644 requirements-luigi.txt
 create mode 100644 swh/dataset/luigi.py
Changes applied before test
commit 058e568492ba8ba495b6366ae19cadc6ce7e5c4f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 10 10:33:42 2022 +0100

    Add luigi tasks
    
    They are more tuned toward running automatically, as they call each
    other as needed, and can be imported by workflows defined in other
    modules (eg. the future swh.graph.luigi module).

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/161/ for more details.

lgtm

one typo inline.

Overall, that feels like luigi is a kinda "distributed makefile" (with a python dsl).

...so I do not think writing tests is worth it.

only time will tell if you are right ;)

swh/dataset/luigi.py:205
This revision is now accepted and ready to land. (Nov 10 2022, 3:43 PM)

Overall, that feels like luigi is a kinda "distributed makefile" (with a python dsl).

definitely, yes!

This revision was automatically updated to reflect the committed changes.