Page MenuHomeSoftware Heritage

Add a quick start section in the documentation and simplify the configuration file loading mechanism in the cli
ClosedPublic

Authored by douardda on Jul 27 2021, 5:53 PM.

Details

Summary

CLI config file: keep only 3 situations:

  • if the --config-file option is given, load the given file from the given path, otherwise
  • if SWH_CONFIG_FILENAME is set, laod this file, otherwise
  • use the default config (hard written in the code).

No more config merging nor looking in several default directories. Keep
it simple to use and document.

Depends on D6015.

Diff Detail

Event Timeline

Build is green

Patch application report for D6031 (id=21804)

Could not rebase; Attempt merge onto 1ae32c0a61...

Updating 1ae32c0..ebfd880
Fast-forward
 docs/index.rst                                     | 133 +++++++-
 swh/provenance/__init__.py                         |  11 +-
 swh/provenance/cli.py                              |  37 +--
 .../{provenancedb_base.py => provenancedb.py}      | 116 ++++---
 .../postgresql/provenancedb_with_path.py           |  75 -----
 .../postgresql/provenancedb_without_path.py        |  66 ----
 swh/provenance/sql/15-flavor.sql                   |   4 +-
 swh/provenance/sql/30-schema.sql                   |  16 +
 swh/provenance/sql/40-funcs.sql                    | 338 +++++++++++++++++++++
 swh/provenance/sql/60-indexes.sql                  |   9 +
 swh/provenance/tests/conftest.py                   |  22 +-
 swh/provenance/tests/test_provenance_db.py         |  22 +-
 12 files changed, 622 insertions(+), 227 deletions(-)
 rename swh/provenance/postgresql/{provenancedb_base.py => provenancedb.py} (77%)
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
 create mode 100644 swh/provenance/sql/40-funcs.sql
Changes applied before test
commit ebfd880fdaa135985fbf977f82af742d9db9dfab
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jul 22 09:32:59 2021 +0200

    Add a quick start section in the documentation
    
    also describe the possible db flavors.

commit b9d7a10f3aee06954ce6a69d667f26cef64c9160
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jul 27 16:03:35 2021 +0200

    Simplify the configuration file loading mechanism in the cli
    
    Keep only 3 situations:
    - if the --config-file option is given, load the given file from the
      given path, otherwise
    - if SWH_CONFIG_FILENAME is set, laod this file, otherwise
    - use the default config (hard written in the code).
    
    No more config merging nor looking in several default directories. Keep
    it simple to use and document.

commit 6d9120c593a87ed5e78f17c86aba3622df59b700
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 21 11:31:50 2021 +0200

    Merge Provenance*DB classes in a single ProvenanceDB
    
    now most of the specific logic is handled in sql functions or
    templatized sql code (e.g. relation_add()), it makes no sense to keep
    two classes.

commit 72b3c87c3c40776b1d78d2982461a55084e01c1c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Jun 25 15:02:46 2021 +0200

    Use stored SQL functions for content_find_{all,one}()
    
    instead of generated specific SQL queries in python. This allows to make Provenance*DB classes easily mergeable (following revision).

commit 0d3aaffcc4b38d2aca65f85fed4d9b95f119a47a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/270/ for more details.

thanks

docs/index.rst
14–16

(same issue below)

22

(same issue below)

24–28

to make it copy-pasteable (and backticks are bad)

32–33
42
52
77–80
129–130
swh/provenance/cli.py
94–95

you forgot these prints

ardumont added inline comments.
docs/index.rst
11
33
47
75

Is there really a need for 2 storage keys, that does not sound correct.
Or that may be a more general indentation problem which makes the config sample confusing here.

fix typos reported by ardumont and vlorentz (thx)

Build is green

Patch application report for D6031 (id=21828)

Could not rebase; Attempt merge onto 509280132c...

Removing swh/provenance/postgresql/provenancedb_without_path.py
Removing swh/provenance/postgresql/provenancedb_with_path.py
Merge made by the 'recursive' strategy.
 docs/index.rst                                     | 135 +++++++-
 swh/provenance/__init__.py                         |  11 +-
 swh/provenance/cli.py                              |  34 +--
 .../{provenancedb_base.py => provenancedb.py}      | 116 ++++---
 .../postgresql/provenancedb_with_path.py           |  75 -----
 .../postgresql/provenancedb_without_path.py        |  66 ----
 swh/provenance/sql/15-flavor.sql                   |   4 +-
 swh/provenance/sql/30-schema.sql                   |  16 +
 swh/provenance/sql/40-funcs.sql                    | 338 +++++++++++++++++++++
 swh/provenance/sql/60-indexes.sql                  |   9 +
 swh/provenance/tests/conftest.py                   |  22 +-
 swh/provenance/tests/test_provenance_db.py         |  22 +-
 12 files changed, 621 insertions(+), 227 deletions(-)
 rename swh/provenance/postgresql/{provenancedb_base.py => provenancedb.py} (77%)
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
 create mode 100644 swh/provenance/sql/40-funcs.sql
Changes applied before test
commit bcec796fed77e05f33bce30c6a676e892eb022eb
Merge: 5092801 bffc000
Author: Jenkins user <jenkins@localhost>
Date:   Wed Jul 28 12:20:29 2021 +0000

    Merge branch 'diff-target' into HEAD

commit bffc000cabac84f807e925b7833d7fc3a8059efc
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jul 22 09:32:59 2021 +0200

    Add a quick start section in the documentation
    
    also describe the possible db flavors.

commit 83861cd322c0bf35a7db70f87e5c4ad235c92631
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jul 27 16:03:35 2021 +0200

    Simplify the configuration file loading mechanism in the cli
    
    Keep only 3 situations:
    - if the --config-file option is given, load the given file from the
      given path, otherwise
    - if SWH_CONFIG_FILENAME is set, laod this file, otherwise
    - use the default config (hard written in the code).
    
    No more config merging nor looking in several default directories. Keep
    it simple to use and document.

commit 6d9120c593a87ed5e78f17c86aba3622df59b700
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 21 11:31:50 2021 +0200

    Merge Provenance*DB classes in a single ProvenanceDB
    
    now most of the specific logic is handled in sql functions or
    templatized sql code (e.g. relation_add()), it makes no sense to keep
    two classes.

commit 72b3c87c3c40776b1d78d2982461a55084e01c1c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Jun 25 15:02:46 2021 +0200

    Use stored SQL functions for content_find_{all,one}()
    
    instead of generated specific SQL queries in python. This allows to make Provenance*DB classes easily mergeable (following revision).

commit 0d3aaffcc4b38d2aca65f85fed4d9b95f119a47a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/272/ for more details.

Build is green

Patch application report for D6031 (id=21830)

Could not rebase; Attempt merge onto 509280132c...

Removing swh/provenance/postgresql/provenancedb_without_path.py
Removing swh/provenance/postgresql/provenancedb_with_path.py
Merge made by the 'recursive' strategy.
 docs/index.rst                                     | 135 +++++++-
 swh/provenance/__init__.py                         |  11 +-
 swh/provenance/cli.py                              |  34 +--
 .../{provenancedb_base.py => provenancedb.py}      | 114 ++++---
 .../postgresql/provenancedb_with_path.py           |  75 -----
 .../postgresql/provenancedb_without_path.py        |  66 ----
 swh/provenance/sql/15-flavor.sql                   |   4 +-
 swh/provenance/sql/30-schema.sql                   |  16 +
 swh/provenance/sql/40-funcs.sql                    | 338 +++++++++++++++++++++
 swh/provenance/sql/60-indexes.sql                  |   9 +
 swh/provenance/tests/conftest.py                   |  22 +-
 swh/provenance/tests/test_provenance_db.py         |  22 +-
 12 files changed, 620 insertions(+), 226 deletions(-)
 rename swh/provenance/postgresql/{provenancedb_base.py => provenancedb.py} (77%)
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
 create mode 100644 swh/provenance/sql/40-funcs.sql
Changes applied before test
commit 73a2892b1095f32aaa296a54cda3224022687aa8
Merge: 5092801 4df608c
Author: Jenkins user <jenkins@localhost>
Date:   Wed Jul 28 12:52:53 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 4df608c3e16e83bf7c8475a86192417d6793dcde
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jul 22 09:32:59 2021 +0200

    Add a quick start section in the documentation
    
    also describe the possible db flavors.

commit 76b53e1c604c4e16789325a8a0ca5c167a0e06ac
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jul 27 16:03:35 2021 +0200

    Simplify the configuration file loading mechanism in the cli
    
    Keep only 3 situations:
    - if the --config-file option is given, load the given file from the
      given path, otherwise
    - if SWH_CONFIG_FILENAME is set, laod this file, otherwise
    - use the default config (hard written in the code).
    
    No more config merging nor looking in several default directories. Keep
    it simple to use and document.

commit a2ce4a30aaba0c1a9110a794d01a50c1aa3cd2c3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 21 11:31:50 2021 +0200

    Merge Provenance*DB classes in a single ProvenanceDB
    
    now most of the specific logic is handled in sql functions or
    templatized sql code (e.g. relation_add()), it makes no sense to keep
    two classes.

commit 72b3c87c3c40776b1d78d2982461a55084e01c1c
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Jun 25 15:02:46 2021 +0200

    Use stored SQL functions for content_find_{all,one}()
    
    instead of generated specific SQL queries in python. This allows to make Provenance*DB classes easily mergeable (following revision).

commit 0d3aaffcc4b38d2aca65f85fed4d9b95f119a47a
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/274/ for more details.

Nice, i've spotted some other typos.
Otherwise, that looks great to me.

I also have one more question inline.

Thanks!

docs/index.rst
10

missed it earlier, sorry.

50

Is that a way to filter what to actually ingest out of the swh archive?
When i read this, i'm a bit confused as the reason behind this step.
I don't know whether we want to answer that question here or not though.

(nevertheless i'm interested by the answer ;)

I checked a bit the sample data but i'm still not entirely sure i get it right:

$ bunzip2 -c sample_10k.csv.bz2 | head -3
e41f50989aeb14840fee40994a9f44bb44f65598,2008-10-06 16:02:56+00:00,4b825dc642cb6eb9a060e54bf8d69288fbee4904
2d35a6b1a8ad8a8048c67acd156143cc292c7cd1,2008-10-06 16:05:28+00:00,4b825dc642cb6eb9a060e54bf8d69288fbee4904
c90c623097227632a0c04099b6461f8b65885b94,2008-10-06 16:05:49+00:00,4b825dc642cb6eb9a060e54bf8d69288fbee4904
65

to ease distinguishing between the content and the file.

122

Build is green

Patch application report for D6031 (id=21833)

Could not rebase; Attempt merge onto 509280132c...

Updating 5092801..ad2190d
Fast-forward
 docs/index.rst                                     | 135 +++++++-
 swh/provenance/__init__.py                         |  11 +-
 swh/provenance/cli.py                              |  34 +--
 .../{provenancedb_base.py => provenancedb.py}      | 114 ++++---
 .../postgresql/provenancedb_with_path.py           |  75 -----
 .../postgresql/provenancedb_without_path.py        |  66 ----
 swh/provenance/sql/15-flavor.sql                   |   4 +-
 swh/provenance/sql/30-schema.sql                   |  16 +
 swh/provenance/sql/40-funcs.sql                    | 338 +++++++++++++++++++++
 swh/provenance/sql/60-indexes.sql                  |   9 +
 swh/provenance/tests/conftest.py                   |  22 +-
 swh/provenance/tests/test_provenance_db.py         |  22 +-
 12 files changed, 620 insertions(+), 226 deletions(-)
 rename swh/provenance/postgresql/{provenancedb_base.py => provenancedb.py} (77%)
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
 create mode 100644 swh/provenance/sql/40-funcs.sql
Changes applied before test
commit ad2190d99519e2efe352ac94d2ff67e31feb1b18
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jul 22 09:32:59 2021 +0200

    Add a quick start section in the documentation
    
    also describe the possible db flavors.

commit 058ed19b0100fe2a2d7abe10a88a76fda7cd43d2
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jul 27 16:03:35 2021 +0200

    Simplify the configuration file loading mechanism in the cli
    
    Keep only 3 situations:
    - if the --config-file option is given, load the given file from the
      given path, otherwise
    - if SWH_CONFIG_FILENAME is set, laod this file, otherwise
    - use the default config (hard written in the code).
    
    No more config merging nor looking in several default directories. Keep
    it simple to use and document.

commit f5e6c283b08eb2f1efd6d598daca250b0e89dbcf
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 21 11:31:50 2021 +0200

    Merge Provenance*DB classes in a single ProvenanceDB
    
    now most of the specific logic is handled in sql functions or
    templatized sql code (e.g. relation_add()), it makes no sense to keep
    two classes.

commit fbc5499eb0e2e77d9f4650599778a8d8cd86f982
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Jun 25 15:02:46 2021 +0200

    Use stored SQL functions for content_find_{all,one}()
    
    instead of generated specific SQL queries in python. This allows to make Provenance*DB classes easily mergeable (following revision).

commit 1c3d6426ebd2d1e4b00a50888b2b3eead5b8eab3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/277/ for more details.

Like i said yesterday, lgtm.

There remains a couple of typos to fix inline.

This revision is now accepted and ready to land.Jul 29 2021, 1:25 PM
aeviso added inline comments.
docs/index.rst
10

I believe it's come (without s) in this case, since the sentence already have the verb does in it

50

The main idea behind doing it this way is to have control over which revisions (and in which order) are going to be processed. I'm not sure this should be part of the module CLI in the end, but we'll use the API method behind this command for the initial load of the db. The idea is that a server process will iterate over the CSV file and feed the clients with the revisions to be processed (the order of the revisions in the file matters and should be pre-processed as desired, typically chronological order). After the initial load, a subscription to the Kafka journal will be used to keep the database updated on the daily bases.

swh/provenance/cli.py
24

The way the configuration file is handled was taken from some other module (I don't recall which one now) and I guess it was part of tenma's work on uniformly handling configurations across modules. Was that work abandoned?

docs/index.rst
10

agreed (i proposed the same fix ;)

50

Thanks for the heads up.

swh/provenance/cli.py
24

Yes, it's part of the large refactoring on our configuration management for our swh modules.
I don't think it got abandoned, just in stand-bye as nobody is working on it currently.

This revision was landed with ongoing or failed builds.Aug 6 2021, 10:58 AM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D6031 (id=21949)

Could not rebase; Attempt merge onto 509280132c...

Updating 5092801..3b145f1
Fast-forward
 docs/index.rst                                     | 135 +++++++-
 swh/provenance/__init__.py                         |  11 +-
 swh/provenance/cli.py                              |  34 +--
 .../{provenancedb_base.py => provenancedb.py}      | 114 ++++---
 .../postgresql/provenancedb_with_path.py           |  75 -----
 .../postgresql/provenancedb_without_path.py        |  66 ----
 swh/provenance/sql/15-flavor.sql                   |   4 +-
 swh/provenance/sql/30-schema.sql                   |  16 +
 swh/provenance/sql/40-funcs.sql                    | 338 +++++++++++++++++++++
 swh/provenance/sql/60-indexes.sql                  |   9 +
 swh/provenance/tests/conftest.py                   |  22 +-
 swh/provenance/tests/test_provenance_db.py         |  22 +-
 12 files changed, 620 insertions(+), 226 deletions(-)
 rename swh/provenance/postgresql/{provenancedb_base.py => provenancedb.py} (77%)
 delete mode 100644 swh/provenance/postgresql/provenancedb_with_path.py
 delete mode 100644 swh/provenance/postgresql/provenancedb_without_path.py
 create mode 100644 swh/provenance/sql/40-funcs.sql
Changes applied before test
commit 3b145f15c2db13d8f9fe766500c19c7112b05f64
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jul 22 09:32:59 2021 +0200

    Add a quick start section in the documentation
    
    also describe the possible db flavors.

commit 058ed19b0100fe2a2d7abe10a88a76fda7cd43d2
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Jul 27 16:03:35 2021 +0200

    Simplify the configuration file loading mechanism in the cli
    
    Keep only 3 situations:
    - if the --config-file option is given, load the given file from the
      given path, otherwise
    - if SWH_CONFIG_FILENAME is set, laod this file, otherwise
    - use the default config (hard written in the code).
    
    No more config merging nor looking in several default directories. Keep
    it simple to use and document.

commit f5e6c283b08eb2f1efd6d598daca250b0e89dbcf
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 21 11:31:50 2021 +0200

    Merge Provenance*DB classes in a single ProvenanceDB
    
    now most of the specific logic is handled in sql functions or
    templatized sql code (e.g. relation_add()), it makes no sense to keep
    two classes.

commit fbc5499eb0e2e77d9f4650599778a8d8cd86f982
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Jun 25 15:02:46 2021 +0200

    Use stored SQL functions for content_find_{all,one}()
    
    instead of generated specific SQL queries in python. This allows to make Provenance*DB classes easily mergeable (following revision).

commit 1c3d6426ebd2d1e4b00a50888b2b3eead5b8eab3
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jun 16 14:35:50 2021 +0200

    Add support for a denormalized version of the provenance DB
    
    in db schema, relation tables (xxx_in_yyy) are denormalized, meaning the
    yyy relation (and the location, if any) are stored as arrays.
    
    Denormalized schema is chosen at db creation time using one of the 2
    "-denormalized" flavors (aka "with-path-denormalized" or
    "without-path-denormalized").

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/285/ for more details.