Page MenuHomeSoftware Heritage

Complete refactoring of the archiver
ClosedPublic

Authored by qcampos on Jul 29 2016, 2:59 PM.

Details

Reviewers
olasd
Group Reviewers
Reviewers
Maniphest Tasks
T512: Make archiver have a more symmetrical behavior treating storages as potential sources & destinations at the same time
Commits
rDSTOCc73941c4d5c9: Remove cli arguments for the archiver launching
rDSTOCf79a679d7c29: Correct config override in the archiver's test
rDSTOCdd49151734f6: Make the worker correclty catch corrupted contents
rDSTOCed9af69596f1: Remove outdated manual test
rDSTOC287da80f7c9d: archiver.worker: Remove additional config
rDSTOC212d9030b443: archiver: Director & Worker arguments are now parsed from conf file
rDSTOC801ff72244e3: archiver: correct the worker when handling ongoing status
rDSTOCf76cd4d93861: Perform a complete rewriting of the archiver
R65:c73941c4d5c9: Remove cli arguments for the archiver launching
R65:f79a679d7c29: Correct config override in the archiver's test
R65:dd49151734f6: Make the worker correclty catch corrupted contents
R65:287da80f7c9d: archiver.worker: Remove additional config
R65:801ff72244e3: archiver: correct the worker when handling ongoing status
R65:f76cd4d93861: Perform a complete rewriting of the archiver
R65:ed9af69596f1: Remove outdated manual test
R65:212d9030b443: archiver: Director & Worker arguments are now parsed from conf file
rDSTOf79a679d7c29: Correct config override in the archiver's test
rDSTOc73941c4d5c9: Remove cli arguments for the archiver launching
rDSTOed9af69596f1: Remove outdated manual test
rDSTO287da80f7c9d: archiver.worker: Remove additional config
rDSTO212d9030b443: archiver: Director & Worker arguments are now parsed from conf file
rDSTO801ff72244e3: archiver: correct the worker when handling ongoing status
rDSTOdd49151734f6: Make the worker correclty catch corrupted contents
rDSTOf76cd4d93861: Perform a complete rewriting of the archiver
Summary

Perform the complete refactoring of the archiver mentioned in T512.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

qcampos updated this revision to Diff 281.Jul 29 2016, 2:59 PM
qcampos retitled this revision from to Complete refactoring of the archiver.
qcampos updated this object.
qcampos edited the test plan for this revision. (Show Details)
qcampos added reviewers: olasd, zack.
qcampos updated this object.Jul 29 2016, 3:01 PM
qcampos edited edge metadata.

Note that the current version does not yet change the status of a content when it is found to be corrupted.

qcampos updated this revision to Diff 285.Jul 29 2016, 4:44 PM

Make the worker correclty catch corrupted or missing contents that were referenced as 'present' into the db

Those contents have their status updated in the db and are removed from archival list. Next archival will reschedule new sources to reach retention policy.

qcampos updated this revision to Diff 286.Jul 29 2016, 4:48 PM
qcampos edited edge metadata.

Correct an import

zack resigned from this revision.Jul 31 2016, 6:04 PM
zack removed a reviewer: zack.
olasd requested changes to this revision.Aug 1 2016, 11:55 AM
olasd edited edge metadata.

See inline comments.

swh/storage/archiver/director.py
23

The type name should probably be dict rather than json, as json will never be used for that purpose anyway

27–28

The indentation of that dict looks broken :)

82–87

The worker should really read all of this from its configuration rather than having it passed around as a message.

  • The access to the same objstorage can be different from one worker to another
  • The archival policy can have changed since we sent the messages in the queue
  • db access can also be different from one worker to another
  • the dbconn string can contain passwords that we don't really want to have flying around

All in all, all the director should enqueue is the list of objects that might need copies.

124–126

This is okay as a first step but a tiny refactoring of the database should allow us to do that server-side rather than client-side.

swh/storage/archiver/tasks.py
16

As said before, the archiver worker should read the storages, dbconn and archival policy from its config.

swh/storage/archiver/worker.py
28

Three arguments to be read from config.

46–72

I don't think this comment is necessary.

48

transfers

82–85

We don't want an ongoing copy to end up as present here: that would mean that we might try to copy from a source that is not there.

If the "virtual status" is not used elsewhere, you should probably just fold the timeout logic in this function.

141

spurious debug print

This revision now requires changes to proceed.Aug 1 2016, 11:55 AM
qcampos updated this revision to Diff 300.Aug 1 2016, 5:36 PM
qcampos edited edge metadata.
qcampos marked 9 inline comments as done.
  • The Director and Worker configuration are now parsed using swh.core.config.SWHConfig mixin class
  • A bug where ongoing contents could be used as sources have been corrected
swh/storage/archiver/director.py
124–126

Thats probably a better idea. I can't get the correct request for that however. What should I use in the WHERE section to count the number of present into the jsonb field ?

olasd requested changes to this revision.Aug 2 2016, 11:12 AM
olasd edited edge metadata.
olasd added inline comments.
swh/storage/archiver/director.py
47

Please make that archiver/director (therefore storing the configuration in /etc/softwareheritage/archiver/director.{ini,yml})

82

We do not want to pass configuration to workers via Celery, ever. Considering that, I think the add_config instance attribute should be removed. The only actual use of that is overriding the file configuration when running the director on the command line, and we can just override specific configuration items in the director class's __init__ method.

This makes the "worker arguments" just a plain list of objects that need archiving. No need for a wrapping dict, or anything else.

124–126

Let's look at that after this huge diff lands.

151

This argument is not needed any longer: the configuration file can be loaded from the default Software Heritage paths, that is in /etc or the user home directory.

152–155

The default for the command-line director should be to read the config file.

When you're setting defaults for the command-line args, you make the command line _always_ override the contents of the configuration file with the default configuration.

What you should do is:

  • drop the defaults
  • make the arguments non-mandatory
  • only override the contents of the configuration file if the command line argument is passed.
163

You should not need to do that any longer

166–167

There, you can just pass the command-line config.

swh/storage/archiver/worker.py
45

Please make that archiver/worker.

47

I don't think add_config is needed.

This revision now requires changes to proceed.Aug 2 2016, 11:12 AM
qcampos updated this revision to Diff 301.Aug 2 2016, 1:02 PM
qcampos edited edge metadata.
qcampos marked 7 inline comments as done.
  • archiver.worker: Remove additional config
qcampos marked an inline comment as done.Aug 2 2016, 1:03 PM
qcampos updated this revision to Diff 302.EditedAug 2 2016, 1:13 PM
qcampos edited edge metadata.

Remove outdated manual test for the archiver

Edit: Hmm, wanted to make another diff for that one, with dependencies, but arc diff decided to update the previous one, as I was on new branch in top of the current diff's branch. Good to know.

qcampos updated this revision to Diff 303.Aug 2 2016, 2:03 PM

Change the way Archiver's config is overriden in the tests

olasd requested changes to this revision.Aug 2 2016, 3:19 PM
olasd edited edge metadata.

See inline comments about CLI parsing.

swh/storage/archiver/director.py
77

This function can be removed now.

94

task.delay(batch)

100

task(batch)

170–180

I'm sorry, but I'd rather we removed all the command line arguments than adding this complexity.

This function has you duplicate the configuration keys at three places (command line argument names, function argument names, string constants), and realistically it will never be used as we will deploy a configuration file with puppet anyway...

Down the line, we should make sure that our "config module" can generate a command-line parser, rather than doing stuff ad-hoc everywhere.

Can we please just scrap the command line parsing and finally land this diff?

swh/storage/tests/test_archiver.py
125

There's a typo here

128

Is there still a point in doing this override?

This revision now requires changes to proceed.Aug 2 2016, 3:19 PM
qcampos updated this revision to Diff 304.Aug 2 2016, 3:45 PM
qcampos edited edge metadata.
qcampos marked 6 inline comments as done.

Remove cli arguments & correct config override in tests

olasd accepted this revision.Aug 3 2016, 10:48 AM
olasd edited edge metadata.

Thanks!

This revision is now accepted and ready to land.Aug 3 2016, 10:48 AM
qcampos updated this revision to Diff 305.Aug 3 2016, 11:07 AM
qcampos edited edge metadata.

Submit identical diff due to merge with master

This revision now requires review to proceed.Aug 3 2016, 11:07 AM
olasd accepted this revision.Aug 3 2016, 12:31 PM
olasd edited edge metadata.
This revision is now accepted and ready to land.Aug 3 2016, 12:31 PM
This revision was automatically updated to reflect the committed changes.