Perform the complete refactoring of the archiver mentioned in T512.
Details
- Reviewers
olasd - Group Reviewers
Reviewers - Maniphest Tasks
- T512: Make archiver have a more symmetrical behavior treating storages as potential sources & destinations at the same time
- Commits
- rDSTOCc73941c4d5c9: Remove cli arguments for the archiver launching
rDSTOCf79a679d7c29: Correct config override in the archiver's test
rDSTOCdd49151734f6: Make the worker correclty catch corrupted contents
rDSTOCed9af69596f1: Remove outdated manual test
rDSTOC287da80f7c9d: archiver.worker: Remove additional config
rDSTOC212d9030b443: archiver: Director & Worker arguments are now parsed from conf file
rDSTOC801ff72244e3: archiver: correct the worker when handling ongoing status
rDSTOCf76cd4d93861: Perform a complete rewriting of the archiver
R65:c73941c4d5c9: Remove cli arguments for the archiver launching
R65:f79a679d7c29: Correct config override in the archiver's test
R65:dd49151734f6: Make the worker correclty catch corrupted contents
R65:287da80f7c9d: archiver.worker: Remove additional config
R65:801ff72244e3: archiver: correct the worker when handling ongoing status
R65:f76cd4d93861: Perform a complete rewriting of the archiver
R65:ed9af69596f1: Remove outdated manual test
R65:212d9030b443: archiver: Director & Worker arguments are now parsed from conf file
rDSTOf79a679d7c29: Correct config override in the archiver's test
rDSTOc73941c4d5c9: Remove cli arguments for the archiver launching
rDSTOed9af69596f1: Remove outdated manual test
rDSTO287da80f7c9d: archiver.worker: Remove additional config
rDSTO212d9030b443: archiver: Director & Worker arguments are now parsed from conf file
rDSTO801ff72244e3: archiver: correct the worker when handling ongoing status
rDSTOdd49151734f6: Make the worker correclty catch corrupted contents
rDSTOf76cd4d93861: Perform a complete rewriting of the archiver
Diff Detail
- Repository
- rDSTO Storage manager
- Branch
- T512
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 340 Build 510: Software Heritage Python tests Build 509: arc lint + arc unit
Event Timeline
Note that the current version does not yet change the status of a content when it is found to be corrupted.
Make the worker correclty catch corrupted or missing contents that were referenced as 'present' into the db
Those contents have their status updated in the db and are removed from archival list. Next archival will reschedule new sources to reach retention policy.
See inline comments.
swh/storage/archiver/director.py | ||
---|---|---|
13 | The type name should probably be dict rather than json, as json will never be used for that purpose anyway | |
17–18 | The indentation of that dict looks broken :) | |
63–68 | The worker should really read all of this from its configuration rather than having it passed around as a message.
All in all, all the director should enqueue is the list of objects that might need copies. | |
92–94 | This is okay as a first step but a tiny refactoring of the database should allow us to do that server-side rather than client-side. | |
swh/storage/archiver/tasks.py | ||
15 | As said before, the archiver worker should read the storages, dbconn and archival policy from its config. | |
swh/storage/archiver/worker.py | ||
30–49 | Three arguments to be read from config. | |
76–100 | I don't think this comment is necessary. | |
77 | transfers | |
110–113 | We don't want an ongoing copy to end up as present here: that would mean that we might try to copy from a source that is not there. If the "virtual status" is not used elsewhere, you should probably just fold the timeout logic in this function. | |
133 | spurious debug print |
- The Director and Worker configuration are now parsed using swh.core.config.SWHConfig mixin class
- A bug where ongoing contents could be used as sources have been corrected
swh/storage/archiver/director.py | ||
---|---|---|
92–94 | Thats probably a better idea. I can't get the correct request for that however. What should I use in the WHERE section to count the number of present into the jsonb field ? |
swh/storage/archiver/director.py | ||
---|---|---|
30 | Please make that archiver/director (therefore storing the configuration in /etc/softwareheritage/archiver/director.{ini,yml}) | |
63 | We do not want to pass configuration to workers via Celery, ever. Considering that, I think the add_config instance attribute should be removed. The only actual use of that is overriding the file configuration when running the director on the command line, and we can just override specific configuration items in the director class's __init__ method. This makes the "worker arguments" just a plain list of objects that need archiving. No need for a wrapping dict, or anything else. | |
92–94 | Let's look at that after this huge diff lands. | |
118–119 | This argument is not needed any longer: the configuration file can be loaded from the default Software Heritage paths, that is in /etc or the user home directory. | |
118–119 | The default for the command-line director should be to read the config file. When you're setting defaults for the command-line args, you make the command line _always_ override the contents of the configuration file with the default configuration. What you should do is:
| |
118–119 | You should not need to do that any longer | |
119–120 | There, you can just pass the command-line config. | |
swh/storage/archiver/worker.py | ||
68 | Please make that archiver/worker. | |
70 | I don't think add_config is needed. |
Remove outdated manual test for the archiver
Edit: Hmm, wanted to make another diff for that one, with dependencies, but arc diff decided to update the previous one, as I was on new branch in top of the current diff's branch. Good to know.
See inline comments about CLI parsing.
swh/storage/archiver/director.py | ||
---|---|---|
58 | This function can be removed now. | |
63 | task.delay(batch) | |
69 | task(batch) | |
122–132 | I'm sorry, but I'd rather we removed all the command line arguments than adding this complexity. This function has you duplicate the configuration keys at three places (command line argument names, function argument names, string constants), and realistically it will never be used as we will deploy a configuration file with puppet anyway... Down the line, we should make sure that our "config module" can generate a command-line parser, rather than doing stuff ad-hoc everywhere. Can we please just scrap the command line parsing and finally land this diff? | |
swh/storage/tests/test_archiver.py | ||
121 | There's a typo here | |
127 | Is there still a point in doing this override? |