Page MenuHomeSoftware Heritage

Add standard config support and HTTP auth token for swh-scanner
ClosedPublic

Authored by tenma on Sep 9 2020, 7:58 PM.

Details

Summary
  • impl the standard SWH config scheme, which is:
    • cli switch for config path
    • envvar for config path
    • default config path under ~/.config/swh/global
    • default config mapping in the offending module
  • added HTTP auth token support for requests to the SWH API
  • replaced PosixPath to Path, which astracts away environment detail, in the whole package

Closes T2572

Diff Detail

Repository
rDTSCN Code scanner
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D3900 (id=13744)

Rebasing onto d4f9b2d6d6...

Current branch diff-target is up to date.
Changes applied before test
commit 774cb01129a6430a10e38dc1698776a643ec6847
Author: tenma <tenma+swh@mailbox.org>
Date:   Wed Sep 9 19:32:13 2020 +0200

    Add standard config support and HTTP auth token for swh-scanner
    
    - impl the standard SWH config scheme, which is:
        - cli switch for config path
        - envvar for config path
        - default config path under ~/.config/swh/global
        - default config mapping in the offending module
    - added HTTP auth token support for requests to the SWH API
    - replaced PosixPath to Path, which astracts away environment detail, in the whole package
    
    Closes T2572

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/39/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/39/console

vlorentz added a subscriber: vlorentz.

Thanks!

Could you split this diff/commit into two; first one that replaces PosixPath with Path, then the rest?

Some other specific comments below

swh/scanner/cli.py
17–19

the type should be Dict[str, Any]; because config values can be any type; you just happen to have only one.

21

This name ("config filename") is inconsistent with the name of the argument ("config file").

Additionally, Click automatically handles environment variables, so SWH_CONFIG_FILE is already handed, you just need to add Path(click.get_app_dir("swh")) / "global" as a default for the click option

62–63

I know you did not write this, but does it make sense to convert to Path then convert back to string?

This revision now requires changes to proceed.Sep 10 2020, 9:28 AM
tenma added inline comments.
swh/scanner/cli.py
21
  • Does it really handles the swh config envvar, SWH_CONFIG_FILE or whatever its name? Some swh components e.g. objstorage use config_file = os.environ.get("SWH_CONFIG_FILENAME").
  • Actually I would prefer config_path over config_file so as to be most accurate and distinguish path from file. But config_file is used everywhere in the CLIs.
tenma marked an inline comment as not done.Sep 10 2020, 10:55 AM
swh/scanner/cli.py
21

There is also the question of how much we let click handle, because there is currently much duplication between our click usage and swh.core.config:

  • can we let it handle the default value handling : envvar or default
  • do we let it validate the path: it is currently done by both code
  • bonus note: click.Path valides but doesn't convert it to pathlib.Path and returns either str or bytes, for compat reason I think
anlambert added a subscriber: anlambert.

There is still room for improvements here. Also as @vlorentz said, you should split the diff into two: one for the configuration management and another for the s/PosixPath/Path/ changes.

swh/scanner/cli.py
17–19

The web-client key is not needed here, we already know what we configure here, you should simply write:

DEFAULT_CONFIG: Dict[str, Any] = {
   "api-url": "https://archive.softwareheritage.org/api/1/"
}
19–31

I would rather load configuration this way as:

  • it checks if provided config file exists
  • it enables to take the global config into account using swh.core.config API (load_named_config also loads the global config file)
  • it fallbacks on loading a named config file (<conf_dir>/web/client.yml)
conf = {}

if not config_file:
    config_file = os.environ.get("SWH_CONFIG_FILENAME")

if config_file:
    if not os.path.exists(config_file):
        raise ValueError("Config file %s does not exist" % config_file)
     conf.update(config.load_global_config())
     conf.update(config.read(config_file, DEFAULT_CONFIG))
else:
     conf = config.load_named_config("web/client", DEFAULT_CONFIG)

This could be wrapped in a new swh.core.config function IMHO as it could be reused in other CLI tools.

122
config["api-url"] = parse_url(api_url)
swh/scanner/cli.py
21

OK, so click as of now does not convert path through its click.Path.path_type param which would be a nice feature I could add one day. That would be neat and typing friendly.

swh/scanner/cli.py
19–31

I made a mistake, the named config should be scanner and not web/client.

tenma added inline comments.
swh/scanner/cli.py
17–19

Chose to fully qualify the config in order to streamline config management.

19–31

Did as replied in above comments.

21

Implemented the following config path resolving (descending priority):
CLI switch, CLI auto envvar, default SWH config envvar, default SWH config path

62–63

I suspect an implicit validation but I have no clue, thanks to impure function.
Not critical I guess, so remove.

tenma marked an inline comment as done.

Separated commits, improved config loading

Build has FAILED

Patch application report for D3900 (id=13788)

Could not rebase; Attempt merge onto d4f9b2d6d6...

Updating d4f9b2d..09bd7bf
Fast-forward
 swh/scanner/cli.py                 | 70 +++++++++++++++++++++++++++++---------
 swh/scanner/dashboard/dashboard.py |  8 ++---
 swh/scanner/model.py               | 14 ++++----
 swh/scanner/plot.py                | 13 +++----
 swh/scanner/scanner.py             | 19 ++++++-----
 5 files changed, 79 insertions(+), 45 deletions(-)
Changes applied before test
commit 09bd7bf525c14dbfe620f2aec37394e018ed3ad3
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Sep 10 16:32:26 2020 +0200

    Add standard config support and auth token for swh-scanner
    
    - impl a full SWH config scheme, which loads config like the following
    (descending priority):
        - CLI switch for config path
        - CLI envvar for component config path (click auto_envvar feature)
        - CLI envvar for global config path
        - default global config path under ~/.config/swh/global.yml
        - merge loaded config with default specific config mapping
    - for now the default config path is hardcoded but will move to swh.core.config later
    - url and token config is namespaced with web-api to start having a
    consistent hierarchical config model accross components
    - added simple HTTP auth token support for requests to the SWH API,
    without adding yet another dependency for a small middleware
    
    Close T2572

commit 4b8b0c4d3d65272298fc466b9345a5ed989d33d1
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Sep 10 16:22:58 2020 +0200

    use the right abstraction for paths in swh-scanner
    
    replaced PosixPath to Path, which abstracts away environment detail, in the whole package

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/42/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/42/console

@tenma, could you rebase the diff to master, arc patch is failing otherwise, see below:

17:14 $ arc patch D3900
 INFO  Base commit is not in local repository; trying to fetch.
Branch name arcpatch-D3900 already exists; trying a new name.
Branch name arcpatch-D3900_1 already exists; trying a new name.
Branch name arcpatch-D3900_2 already exists; trying a new name.
Created and checked out branch arcpatch-D3900_3.


    This diff is against commit 4b8b0c4d3d65272298fc466b9345a5ed989d33d1, but
    the commit is nowhere in the working copy. Try to apply it against the
    current working copy state? (9ce6ddc0549045533578aa89bab62af8738832d3 \
    D3919) [Y/n] 

Checking patch swh/scanner/scanner.py...
error: while searching for:


async def run(
    root: Path, api_url: str, source_tree: Tree, exclude_patterns: Set[Any]
) -> None:
    """Start scanning from the given root.


error: patch failed: swh/scanner/scanner.py:151
Hunk #2 succeeded at 164 (offset 2 lines).
Hunk #3 succeeded at 179 (offset 2 lines).
Checking patch swh/scanner/cli.py...
error: while searching for:
# See top-level LICENSE file for more information

import click
import asyncio
import glob
import re
import fnmatch
from pathlib import Path
from typing import Tuple

from .scanner import run
from .model import Tree

error: patch failed: swh/scanner/cli.py:4
error: while searching for:
from .dashboard.dashboard import run_app
from .exceptions import InvalidDirectoryPath

from swh.core.cli import CONTEXT_SETTINGS


@click.group(name="scanner", context_settings=CONTEXT_SETTINGS)
@click.pass_context
def scanner(ctx):
    """Software Heritage Scanner tools."""
    pass


def parse_url(url):

error: patch failed: swh/scanner/cli.py:17
Hunk #3 succeeded at 59 (offset -31 lines).
Hunk #4 succeeded at 74 (offset -31 lines).
error: while searching for:
    "-i", "--interactive", is_flag=True, help="show the result in a dashboard"
)
@click.pass_context
def scan(ctx, root_path, api_url, patterns, format, interactive):
    """Scan a source code project to discover files and directories already
    present in the archive"""
    sre_patterns = set()
    if patterns:
        sre_patterns = {
            reg_obj for reg_obj in extract_regex_objs(Path(root_path), patterns)
        }

    api_url = parse_url(api_url)
    source_tree = Tree(Path(root_path))
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run(root_path, api_url, source_tree, sre_patterns))

    if interactive:
        root = Path(root_path)

error: patch failed: swh/scanner/cli.py:86
Hunk #6 succeeded at 110 (offset -28 lines).
Applying patch swh/scanner/scanner.py with 1 reject...
Rejected hunk #1.
Hunk #2 applied cleanly.
Hunk #3 applied cleanly.
Applying patch swh/scanner/cli.py with 3 rejects...
Rejected hunk #1.
Rejected hunk #2.
Hunk #3 applied cleanly.
Hunk #4 applied cleanly.
Rejected hunk #5.
Hunk #6 applied cleanly.

 Patch Failed! 
Usage Exception: Unable to apply patch!

Also, some tests need to be fixed.

fine with me, but I'll let anlambert finish the review

rebased 2 times master and refactored commits, improving typing

Build has FAILED

Patch application report for D3900 (id=13843)

Rebasing onto f838fed672...

Current branch diff-target is up to date.
Changes applied before test
commit 71a5c44405b1b5b9a7851b93cf69032837faa753
Author: tenma <tenma+swh@mailbox.org>
Date:   Fri Sep 11 15:11:05 2020 +0200

    Add standard config support and auth token for swh-scanner
    
    - impl a full SWH config scheme, which loads config like the following
    (descending priority):
        - CLI switch for config path
        - CLI envvar for component config path (click auto_envvar feature)
        - CLI envvar for global config path
        - default global config path under ~/.config/swh/global.yml
        - merge loaded config with default specific config mapping
    - for now the default config path is hardcoded but will move to swh.core.config later
    - url and token config is namespaced with web-api to start having a
    consistent hierarchical config model accross components
    - added simple HTTP auth token support for requests to the SWH API,
    without adding yet another dependency for a small middleware
    
    Close T2572

commit 544176fafb49d6a39c8c5ef9d47cb1c45250d811
Author: tenma <tenma+swh@mailbox.org>
Date:   Thu Sep 10 19:56:48 2020 +0200

    refactor non-cli code out of cli in swh-scanner
    
    - moved non-cli code including imports out of cli module to scanner module
    - replaced PosixPath to Path, which abstracts away environment detail, in the whole package

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/46/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/46/console

Only included commit about config

Build is green

Patch application report for D3900 (id=13854)

Rebasing onto f838fed672...

Current branch diff-target is up to date.
Changes applied before test
commit 95e0af65495bb3b1af3d6d6f31015ffef11bd4af
Author: tenma <tenma+swh@mailbox.org>
Date:   Fri Sep 11 16:54:38 2020 +0200

    Add standard config support and auth token for swh-scanner
    
    - impl a full SWH config scheme, which loads config like the following
    (descending priority):
        - CLI switch for config path
        - CLI envvar for component config path (click auto_envvar feature)
        - CLI envvar for global config path
        - default global config path under ~/.config/swh/global.yml
        - merge loaded config with default specific config mapping
    - for now the default config path is hardcoded but will move to swh.core.config later
    - url and token config is namespaced with web-api to start having a
    consistent hierarchical config model accross components
    - added simple HTTP auth token support for requests to the SWH API,
    without adding yet another dependency for a small middleware
    
    Close T2572

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/47/ for more details.

Almost good but there is still something that puzzles me regarding global configuration handling. Let me explain.

What we want to achieve here is to share Web API URL and authentication token across several swh components (swh-scanner and swh-web-client at the moment)
without duplicating it in multiple configuration files. So basically we want to put that configuration in the global.yml file.

Then each component can have its specific configuration in a dedicated file (scanner.yml and web-client.yml
file for instance). In the current state of that diff, if you do not provide any config file with the -C option, the
global config file will be used. However if you use the -C option, only the specified config file will be loaded
and the global config file will be ignored. This means we will have to duplicate Web API configuration into
scanner.yml and this is not what we want.

So the global configuration should always be loaded, even if a config file has been provided through the -C option.

The following diff enables to obtain the behavior I described above.

diff --git a/swh/scanner/cli.py b/swh/scanner/cli.py
index ea61a96..7773669 100644
--- a/swh/scanner/cli.py
+++ b/swh/scanner/cli.py
@@ -16,12 +16,6 @@ from swh.core import config
 from swh.core.cli import CONTEXT_SETTINGS
 
 
-# All generic config code should reside in swh.core.config
-DEFAULT_CONFIG_PATH = os.environ.get(
-    "SWH_CONFIG_FILE", os.path.join(click.get_app_dir("swh"), "global.yml")
-)
-
-
 DEFAULT_CONFIG: Dict[str, Any] = {
     "web-api": {
         "url": "https://archive.softwareheritage.org/api/1/",
@@ -68,7 +62,7 @@ def extract_regex_objs(root_path: PosixPath, patterns: Tuple[str]) -> object:
 @click.option(
     "-C",
     "--config-file",
-    default=DEFAULT_CONFIG_PATH,
+    default=os.environ.get("SWH_CONFIG_FILE"),
     type=click.Path(exists=True, dir_okay=False, path_type=str),
     help="YAML configuration file",
 )
@@ -77,8 +71,12 @@ def scanner(ctx, config_file: str):
     """Software Heritage Scanner tools."""
 
     # recursive merge not done by config.read
-    conf = config.read_raw_config(config.config_basepath(config_file))
-    conf = config.merge_configs(DEFAULT_CONFIG, conf)
+    base_conf = config.merge_configs(DEFAULT_CONFIG, config.load_global_config())
+    if config_file:
+        conf = config.read_raw_config(config.config_basepath(config_file))
+        conf = config.merge_configs(base_conf, conf)
+    else:
+        conf = base_conf
 
     ctx.ensure_object(dict)
     ctx.obj["config"] = conf
This revision now requires changes to proceed.Sep 11 2020, 5:52 PM

nitpicking: the config management part does need to be in the same git revision as the auth support.

Otherwise, looks ok

Almost good but there is still something that puzzles me regarding global configuration handling. Let me explain.

What we want to achieve here is to share Web API URL and authentication token across several swh components (swh-scanner and swh-web-client at the moment)
without duplicating it in multiple configuration files. So basically we want to put that configuration in the global.yml file.

I am not sure we really need this. Both the swh-scanner and the swh-web-client need the same information on how to connect the web API, so having a single file is fine, as long as it provides the needed config block, aka the web-api section here.

I'm not very fond of keeping the complex merge from multiple source of configuration in the code. We can have everything fitted in a single, easily identifiable, file. No need for this extra complexity (in the code as well as is the head of the user trying to figure out where to modify the configuration)

For the "end user" use case, needing access the the web API, and possibly some tool-specific config, we just have to keep the tool-specific config statements in a dedicated section. For this use case, having a sensible default path to read the config file from makes sense (typically in the proposed ~/.config/swh directory here).

I see no practical advantage of having one file per service type in this use case.

For the "production" use case (deploy swh components in production or in test environments like docker), have a single config file is also way simpler to manage. Let the configuration management tool (puppet/Dockerfile/docker-compose) do the dirty job, and deploy as many config file as necessary but make sure the config file location is explicitly set.

Almost good but there is still something that puzzles me regarding global configuration handling. Let me explain.

What we want to achieve here is to share Web API URL and authentication token across several swh components (swh-scanner and swh-web-client at the moment)
without duplicating it in multiple configuration files. So basically we want to put that configuration in the global.yml file.

I am not sure we really need this. Both the swh-scanner and the swh-web-client need the same information on how to connect the web API, so having a single file is fine, as long as it provides the needed config block, aka the web-api section here.

I'm not very fond of keeping the complex merge from multiple source of configuration in the code. We can have everything fitted in a single, easily identifiable, file. No need for this extra complexity (in the code as well as is the head of the user trying to figure out where to modify the configuration)

For the "end user" use case, needing access the the web API, and possibly some tool-specific config, we just have to keep the tool-specific config statements in a dedicated section. For this use case, having a sensible default path to read the config file from makes sense (typically in the proposed ~/.config/swh directory here).

I see no practical advantage of having one file per service type in this use case.

For the "production" use case (deploy swh components in production or in test environments like docker), have a single config file is also way simpler to manage. Let the configuration management tool (puppet/Dockerfile/docker-compose) do the dirty job, and deploy as many config file as necessary but make sure the config file location is explicitly set.

All right, accepting the diff then.

This revision is now accepted and ready to land.Sep 14 2020, 1:54 PM