Details

Reviewers

Group Reviewers

Maniphest Tasks

Commits

rDSTOC7afda2189e74: Make the worker asynchronous optional
rDSTOCe32124dd75bf: Change database model and update swh.storage.db api in order to follow those…
rDSTOC0e84344d3a57: Add asynchronous mode
rDSTOCf0388f124e22: Add synchronous first implementation of the archiver
R65:f0388f124e22: Add synchronous first implementation of the archiver
R65:e32124dd75bf: Change database model and update swh.storage.db api in order to follow those…
R65:7afda2189e74: Make the worker asynchronous optional
R65:0e84344d3a57: Add asynchronous mode
rDSTO7afda2189e74: Make the worker asynchronous optional
rDSTO0e84344d3a57: Add asynchronous mode
rDSTOf0388f124e22: Add synchronous first implementation of the archiver
rDSTOe32124dd75bf: Change database model and update swh.storage.db api in order to follow those…

Summary

Add an archiver that purpose is to replicate the object storage into another (or many other) location(s)

Test Plan

Archive objects only when it's needed
- Less copies than expected
- The content is missing and not present
An archive with status 'ongoing' will be rescheduled if its delay have been elapsed.

Diff Detail

Repository

rDSTO Storage manager

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

qcampos updated this revision to Diff 64.May 9 2016, 4:56 PM

qcampos added a task: T240: content archiver.

qcampos retitled this revision from to Content archiver.

qcampos updated this object.

qcampos edited the test plan for this revision. (Show Details)

Herald added a reviewer: Reviewers. · View Herald TranscriptMay 9 2016, 4:56 PM

The current diff is a Work in Progress waiting for some feedback.
ATM, there is no task queue involved and everything would run on the same thread. If everything is correct, it will be the next step.

swh/storage/archiver/copier.py
53–66 ↗	(On Diff #64)	If that there is an error during the process, some files may have been copied. What should be the best option? Mark them as 'present', or just ignore them and reschedule the whole batch after the 'ongoing' maximum delay is elapsed?
swh/storage/archiver/director.py
158–212 ↗	(On Diff #64)	Please, ignore that as it is only used for debug purposes. However, that brings up a question. The table content_archive contains a field status that may be 'missing'. Does that means that everything in the content table will be in the content_data table as 'missing' at the start? The 'missing' status is something we can compute on the fly (in content but not in content_archive) in order to avoid this storage. I don't realize the size it takes, though (but I want to). It may be nothing compared to the whole database and not worth the extra-effort it would require to compute the missing on the fly. Is it?
swh/storage/tests/test_archiver.py
59–72 ↗	(On Diff #64)	I had to do some db commits because the archiver directly uses the local storage's db. As the test class uses it's own cursor, the archiver would not have been aware of these changes otherwise. But that don't seems very nice. Is there another way?

ardumont added a subscriber: ardumont.May 10 2016, 4:03 PM

ardumont added inline comments.

swh/storage/archiver/__init__.py
1 ↗	(On Diff #64)	why the # noqa?
swh/storage/archiver/copier.py
16 ↗	(On Diff #64)	a list of sha1`s` this copier `has`
50 ↗	(On Diff #64)	I think it might work without needing to consume the map. Can you refresh my memory, why do you need to strip the first 2 characters (is it '\\x')? I think this might be due to how you test.
53–66 ↗	(On Diff #64)	As a first approximation, considering nothing is done is reasonable. (If nothing prevents later to add again some existing content)
56 ↗	(On Diff #64)	self.server.content_add(map(lambda c: c['data'], contents)) ?
58 ↗	(On Diff #64)	You may want to avoid writing your doubts in the code and use differential for that (as you also did below). I like to use, TODO (you'll need to think more and improve the code), FIXME (you saw something ugly but this is not the time to fix it) , HACK (you had to do something horribly ugly but did not see any other way around it) followed by some concise description about what this is all about.
64 ↗	(On Diff #64)	Well, it depends on how you see the asynchroneous part work... If archive-director spawns asynchroneous archive-workers which also spawns asynchroneous archive-copiers, it could be simpler to do what you propose (database access here)... But then the abstraction around the content copy state is broken, meaning we have now multiple classes which are in charge of it (first the worker initiates something that the copier must update and close). Or we could have a more complex design where the copier also states that `woohoo i'm done` (by queueing another message)... But then it's complicated... But if the async activity stops at archive-workers meaning only archive-directory spawns async archive-workers. Then what is now is ok (and simpler).
swh/storage/archiver/director.py
29 ↗	(On Diff #64)	which one is `needed as` backup? process`es`
39 ↗	(On Diff #64)	c`o`py`
84 ↗	(On Diff #64)	content`s`
94 ↗	(On Diff #64)	content`s`
99 ↗	(On Diff #64)	content`s` w`h`ich cont`a`ins
115 ↗	(On Diff #64)	storage slaves
145 ↗	(On Diff #64)	archiva`l`
158–212 ↗	(On Diff #64)	Please, ignore that as it is only used for debug purposes. Ok. Either, you could add a comment `NOT FOR PRODUCTION` in the function's docstring or get this code in a bin/ folder at the repository's root (the plus side is that it will be easier to reach from cli). Does that means that everything in the content table will be in the content_data table as 'missing' at the start? Possibly. I see 2 problematics: one is to adapt the existing code to deal with new content as soon as this module is ready. That is, when content_add is triggered, it also adds newly added contents in archival table with `missing` status. the other one being how to make the existing contents be archived (it's another task). But that is for later. I don't realize the size it takes, though (but I want to). To have an approximate size (because it's dayly updated) check https://archive.softwareheritage.org/api/1/stat/counters/. Right now, we are at 2349689600 contents ^^. It may be nothing compared to the whole database and not worth the extra-effort it would require to compute the missing on the fly. Is it? I think that bootstraping this and dealing with prior data are 2 separate tasks.
swh/storage/archiver/worker.py
13 ↗	(On Diff #64)	Indeed but we have abstracted it. Extend swh.scheduler.task.Task instead... (and you'd need to add python3-swh.scheduler as deps in debian/control too).
14 ↗	(On Diff #64)	should probably extend``. content`s`
20 ↗	(On Diff #64)	ha`s` dictio`n`ary
37 ↗	(On Diff #64)	`content's location`
49 ↗	(On Diff #64)	archiva`l`
52 ↗	(On Diff #64)	amou`n`t contain``
82 ↗	(On Diff #64)	If it's an update, call it update ^^
87 ↗	(On Diff #64)	change `its` status
90 ↗	(On Diff #64)	content id
94 ↗	(On Diff #64)	change`s`
111 ↗	(On Diff #64)	We hide the sql queries in db usually, it's clearer to continue that way.
swh/storage/tests/test_archiver.py
59–72 ↗	(On Diff #64)	Using self.conn?

ATM, there is no task queue involved and everything would run on the same thread.

Ok.

Were you able to run your code aside from the tests?

If everything is correct, it will be the next step.

Everything seems correct ^^ (aside for the itsy bitsy tini mini remarks :D)

swh/storage/archiver/__init__.py
1 ↗	(On Diff #64)	Ok, discussing with Quentin, it's due to the pre-commit-hook we have installed which refuses to let the user commits if there are some pep8 violations. And, as the module is imported but in appearance not used, for flake8, it's a fail. We need to find a way to improve on this (or not). I know it's possible to have some setup file for flake8 to ignore some errors (this one is F401). But ignoring this error is too much - http://flake8.readthedocs.io/en/latest/warnings.html#error-codes. We only want to ignore it in that particular case. Also, i found this which relates to this http://stackoverflow.com/questions/31079047/python-pep8-class-in-init-imported-but-not-used. Following the proposed solution, this'd give something like: from .director import ArchiverDirector from .worker import ArchiverWorker from .copier import ArchiverCopier __all__ = ['ArchiverDirector', 'ArchiverWorker', 'ArchiverCopier'] I don't know if this is reasonable or not but it works. As pros, this has the merit to force us to be explicit about what's public... As cons, it's somehow redundant and instead of 18 characters, we have 66... @olasd @zack What do you think?
swh/storage/archiver/copier.py
53–66 ↗	(On Diff #64)	(If nothing prevents later to add again some existing content) What i meant was, as long as you can write and overwrite existing contents again, i think it's ok.

Were you able to run your code aside from the tests?

I've done some tests with a local backup and very small samples but it was on an early version.

I have no idea how the task queue works. Aside the inheritance from swh.scheduler.task.Task, what should I do to the worker to make it asynchronous?

swh/storage/archiver/copier.py
53–66 ↗	(On Diff #64)	Shouldn't create any problem (ATM) as the object storage just write files without checking if they already exists.
56 ↗	(On Diff #64)	Just checked : `RemoteObjStorage::content_add` is linked to `ObjStorage::add_bytes` that only take a single content.
64 ↗	(On Diff #64)	Now that I think about it , the specification is kinda blurry about what precisely is asynchronous. But I understood it as Director spawning async workers that execute copier's code.

In D23#416, @qcampos wrote:

Were you able to run your code aside from the tests?

I've done some tests with a local backup and very small samples but it was on an early version.

Ok. Can you please check with the current version?

I have no idea how the task queue works. Aside the inheritance from swh.scheduler.task.Task, what should I do to the worker to make it asynchronous?

We usually create a task.py module (here that would be swh.storage.archiver.task) in which we declare an inherited Task class .
(https://forge.softwareheritage.org/diffusion/DSCH/browse/master/swh/scheduler/task.py)

This class will contain the async code to execute (here, following the actual code, we'd delegate the job to ArchiveCopier inside the new task in the run function passing every need state information as function parameter for example).

This task declares a queue name to use for receiving message and a method run (for example swh_storage_archive_copier)
The queue is what is used to communicate the job to do (the producer sends a message to a queue, the worker picks up the message and starts working).
The function code in the run is the actual code being executed by a celery worker.
Some tasks example: https://forge.softwareheritage.org/diffusion/61/browse/master/swh/fetcher/googlecode/tasks.py.

Then you'd need to adapt the current ArchiveDirector to produce a message (that says, here, do some work with that batch of contents).
So instead of instantiating ArchiveWorker directory (swh/storage/archiver/director.py line 88), you'd need to retrieve from the app the task you want and delay it (this is the message sending part).
Here is some example: https://forge.softwareheritage.org/diffusion/61/browse/master/swh/fetcher/googlecode/checker_producer.py;9a428e0f2b908db390286a96fe4e559b592edb9b$29

Then you'd also need to adapt how you trigger everything.

Add asynchronous code for the archiver

Change database model and update swh.storage.db api in order to follow those changes
Add synchronous first implementation of the archiver
Add tests for the archiver
Add asynchronous mode
Fix storage instantiation problem
Align ArchiverWorker parameter orders
Extract imports from the method to top-level
Fix docstring

Herald edited edge metadata. · View Herald TranscriptMay 11 2016, 5:22 PM

qcampos marked 7 inline comments as done.May 11 2016, 5:23 PM

qcampos added inline comments.May 11 2016, 5:36 PM

swh/storage/objstorage/api/server.py
68–69 ↗	(On Diff #67)	Thats not clean, and will be in its own commit (its own diff maybe?).

qcampos updated this object.May 11 2016, 5:37 PM

qcampos edited edge metadata.

Add asynchronous mode
Fix storage instantiation problem
Align ArchiverWorker parameter orders
Extract imports from the method to top-level
Fix docstring
Change server mains in order to change default port
Make the worker asynchronous optional

Herald edited edge metadata. · View Herald TranscriptMay 12 2016, 12:45 PM

Change database model and update swh.storage.db api in order to follow those changes
Add synchronous first implementation of the archiver
Add asynchronous mode
Make the worker asynchronous optional

Herald edited edge metadata. · View Herald TranscriptMay 12 2016, 1:09 PM