Page MenuHomeSoftware Heritage

Kill implicit configuration: new configuration scheme
Closed, MigratedEdits Locked

Description

Collaborative document : https://hackmd.io/8hxTL4XMQoO2RVKtqFqM2g

Better config system that does not rely on implicit configurations.

Related:

T1758
T1388
T826
T1532
D3953
D3965
D4272
D4731

Revisions and Commits

rDLDG Git loader
D5077
D4128
rDSCH Scheduling utilities
D4284
rDWAPPS Web applications
D5093
rDLDHG Mercurial loader
D5078
D4127
rDENV Development environment
D5102
rDLDBASE Generic VCS/Package Loader
Abandoned
D5071
D4125
D4124
rDCIDX Metadata indexer
D4272
D4140
rDLS Listers
D4141
rDVAU Software Heritage Vault
D5092
D4294
rDCORE Foundations and core functionalities
Needs Review
D3965
rDLDSVN Subversion (SVN) loader
D5076
D5075
D4126

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
tenma renamed this task from Kill implicit configuration to Kill implicit configuration : new configuration scheme.Sep 17 2020, 10:50 AM
tenma claimed this task.
tenma updated the task description. (Show Details)
tenma removed a project: Sprint 2018 12.
tenma added subscribers: ardumont, anlambert.
ardumont renamed this task from Kill implicit configuration : new configuration scheme to Kill implicit configuration: new configuration scheme.Sep 17 2020, 4:13 PM
tenma renamed this task from Kill implicit configuration: new configuration scheme to Kill implicit configuration : new configuration scheme.Sep 17 2020, 5:20 PM
tenma updated the task description. (Show Details)

For the configuration part, I see 2 use cases:

  • rpc servers, loaders, listers, ... (swh services really): They need a simple configuration file to parse and load, with eventual checks on missing keys. In such a case, it should fail early with a clear message about what's missing (so we can fix fast and restart the service). [1]
  • cli: The need is more complex (what i gathered from your irc discussion, which I did not follow entirely). What i understood was a need to share some common part across multiple subcommands of the swh cli (typically the auth-token part). And that a merge policy was thus in order. The fail-fast property is also good here ;)

[1]
Implementation wise, that's were SWHConfig kinda came from. And afaik, that's
not used by the swh cli. Thus, why the D3965 proposal which dropped a lot of
the unused code in the first place sounded sensible to me. There should be no
impact on those services with that diff (well, aside the fact that I need to
update their respective configuration with all that they need so no more
implicit configuration ;)

rpc servers, loaders, listers, ... (swh services really): They need a simple
configuration file to parse and load, with eventual checks on missing keys.
In such a case, it should fail early with a clear message about what's
missing (so we can fix fast and restart the service). [1]

I completely missed and put aside the current complexities we have in our loader declarations...

for example, for the original dvcs loaders, we have the following inheritance
chain:

(SWHConfig <-) BaseLoader <- DVCSLoader <- {GitLoader,GitLoaderFromDisk,GitLoaderFromArchive,HgBundle20Loader,...}

and we also have:

(SWHConfig <-) BaseLoader <- SvnLoader <- {SvnLoaderFromDumpArchive,SvnLoaderFromRemoteDump}

reading: A <- {B,C, } : "B, C, ... inherit from A"

What is complex here is that each layer adds its own subset of default configuration to enrich if not provided.
So maybe, in the end, the need is shared across CLIs and the rest.

ardumont renamed this task from Kill implicit configuration : new configuration scheme to Kill implicit configuration: new configuration scheme.Oct 2 2020, 10:54 AM

Should have said as much here instead...
Nonethess, a way forward for simplifying loaders is described [1]
Implementation proposal is summarized [2] (with implementation diffs reference in summary).

[1] D3965#102053

[2] D3965#102120

Suggestions from irc discussion summary:

  • fallback to default config when the current necessary SWH_CONFIG_FILENAME is not set (instead of current failure)
  • a service without a configuration should be able to run by itself from a REPL (default-config should target swh services that runs in memory)
  • Move the configuration file loading out of the services themselves (loader, lister, indexer, ...) and moves within the scope of the entrypoints instead (swh cli, celery task, gunicorn wsgi, etc...)

Maybe starting a pad/hackmd document would be easier at this point?

Suggestions from irc discussion summary:

  • Move the configuration file loading out of the services themselves (loader, lister, indexer, ...) and moves within the scope of the entrypoints instead (swh cli, celery task, gunicorn wsgi, etc...)

In my mind, this means giving most of our "implementation" classes (e.g. FooLoader, BarLister, etc.) either an explicit "config" parameter or, even better, a set of properly named and typed parameters matching the contents of said configuration.

We should make the entry points parse the configuration file, and populate the arguments to the classes. This also allows overriding some configuration parameters with command line arguments, which we're kinda doing but inconsistently in some CLIs.

This is also consistent with how we're currently using the get_storage, get_objstorage, ... functions: we pass them cls and args directly taken from the contents of the configuration file of the component that's currently being run.

Moving this way would drastically reduce the "cognitive overhead" of test fixtures for all our components, which currently have to : create a tempdir, write yaml to a tempfile, override an environment variable. This also makes it explicit which bit of the config affects which components.

  • fallback to default config when the current necessary SWH_CONFIG_FILENAME is not set (instead of current failure)
  • a service without a configuration should be able to run by itself from a REPL (default-config should target swh services that runs in memory)

I'm not sure we really need these two to happen if the "configuration" of our components ends up being a set of explicit arguments to their initialization function.

fallback to default config when the current necessary SWH_CONFIG_FILENAME is not set (instead of current failure)
a service without a configuration should be able to run by itself from a REPL (default-config should target swh services that runs in memory)

I'm not sure we really need these two to happen if the "configuration" of our components ends up being a set of explicit arguments to their initialization function.

Yes, now that i understand the full extent of what you foresaw, I tend to agree.


I don't yet know if we still need those default config or not with the previous suggestion.

Still, another suggestion which I found reasonable is to add back the required swh services in the default config (they are no longer present).
But those should default to the in-memory implementation. Serving mainly as documentation purposes.

Something like:

DEFAULT_CONFIG: Dict[str, Any] = {
    "max_content_size": 100 * 1024 * 1024,
    "save_data": False,
    "save_data_path": "",
    "storage": {"cls": "memory"},  # before, when we had no in-memory implementation, we had localhost:5002
}
tenma updated the task description. (Show Details)
vsellier added a subscriber: tenma.