Description
Revisions and Commits
Status | Assigned | Task | ||
---|---|---|---|---|
Migrated | gitlab-migration | T1410 Kill implicit configuration: new configuration scheme | ||
Migrated | gitlab-migration | T1386 Refactor indexers' initialization step | ||
Migrated | gitlab-migration | T826 Objects that implicitely load configuration are a nightmare to test | ||
Migrated | gitlab-migration | T1532 Cleanup deprecated configuration code in swh modules | ||
Migrated | gitlab-migration | T1531 Cleanup manifests once deployment is done | ||
Migrated | gitlab-migration | T1525 Deploy all swh modules' latest version |
Event Timeline
For the configuration part, I see 2 use cases:
- rpc servers, loaders, listers, ... (swh services really): They need a simple configuration file to parse and load, with eventual checks on missing keys. In such a case, it should fail early with a clear message about what's missing (so we can fix fast and restart the service). [1]
- cli: The need is more complex (what i gathered from your irc discussion, which I did not follow entirely). What i understood was a need to share some common part across multiple subcommands of the swh cli (typically the auth-token part). And that a merge policy was thus in order. The fail-fast property is also good here ;)
[1]
Implementation wise, that's were SWHConfig kinda came from. And afaik, that's
not used by the swh cli. Thus, why the D3965 proposal which dropped a lot of
the unused code in the first place sounded sensible to me. There should be no
impact on those services with that diff (well, aside the fact that I need to
update their respective configuration with all that they need so no more
implicit configuration ;)
rpc servers, loaders, listers, ... (swh services really): They need a simple
configuration file to parse and load, with eventual checks on missing keys.
In such a case, it should fail early with a clear message about what's
missing (so we can fix fast and restart the service). [1]
I completely missed and put aside the current complexities we have in our loader declarations...
for example, for the original dvcs loaders, we have the following inheritance
chain:
(SWHConfig <-) BaseLoader <- DVCSLoader <- {GitLoader,GitLoaderFromDisk,GitLoaderFromArchive,HgBundle20Loader,...}
and we also have:
(SWHConfig <-) BaseLoader <- SvnLoader <- {SvnLoaderFromDumpArchive,SvnLoaderFromRemoteDump}
reading: A <- {B,C, } : "B, C, ... inherit from A"
What is complex here is that each layer adds its own subset of default configuration to enrich if not provided.
So maybe, in the end, the need is shared across CLIs and the rest.
Should have said as much here instead...
Nonethess, a way forward for simplifying loaders is described [1]
Implementation proposal is summarized [2] (with implementation diffs reference in summary).
[1] D3965#102053
[2] D3965#102120
Suggestions from irc discussion summary:
- fallback to default config when the current necessary SWH_CONFIG_FILENAME is not set (instead of current failure)
- a service without a configuration should be able to run by itself from a REPL (default-config should target swh services that runs in memory)
- Move the configuration file loading out of the services themselves (loader, lister, indexer, ...) and moves within the scope of the entrypoints instead (swh cli, celery task, gunicorn wsgi, etc...)
In my mind, this means giving most of our "implementation" classes (e.g. FooLoader, BarLister, etc.) either an explicit "config" parameter or, even better, a set of properly named and typed parameters matching the contents of said configuration.
We should make the entry points parse the configuration file, and populate the arguments to the classes. This also allows overriding some configuration parameters with command line arguments, which we're kinda doing but inconsistently in some CLIs.
This is also consistent with how we're currently using the get_storage, get_objstorage, ... functions: we pass them cls and args directly taken from the contents of the configuration file of the component that's currently being run.
Moving this way would drastically reduce the "cognitive overhead" of test fixtures for all our components, which currently have to : create a tempdir, write yaml to a tempfile, override an environment variable. This also makes it explicit which bit of the config affects which components.
- fallback to default config when the current necessary SWH_CONFIG_FILENAME is not set (instead of current failure)
- a service without a configuration should be able to run by itself from a REPL (default-config should target swh services that runs in memory)
I'm not sure we really need these two to happen if the "configuration" of our components ends up being a set of explicit arguments to their initialization function.
fallback to default config when the current necessary SWH_CONFIG_FILENAME is not set (instead of current failure)
a service without a configuration should be able to run by itself from a REPL (default-config should target swh services that runs in memory)I'm not sure we really need these two to happen if the "configuration" of our components ends up being a set of explicit arguments to their initialization function.
Yes, now that i understand the full extent of what you foresaw, I tend to agree.
I don't yet know if we still need those default config or not with the previous suggestion.
Still, another suggestion which I found reasonable is to add back the required swh services in the default config (they are no longer present).
But those should default to the in-memory implementation. Serving mainly as documentation purposes.
Something like:
DEFAULT_CONFIG: Dict[str, Any] = { "max_content_size": 100 * 1024 * 1024, "save_data": False, "save_data_path": "", "storage": {"cls": "memory"}, # before, when we had no in-memory implementation, we had localhost:5002 }