
Content archiver
ClosedPublic

Authored by qcampos on May 9 2016, 4:56 PM.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

qcampos retitled this revision from to Content archiver.
qcampos updated this object.
qcampos edited the test plan for this revision. (Show Details)

The current diff is a Work in Progress waiting for some feedback.
ATM, there is no task queue involved and everything would run on the same thread. If everything is correct, it will be the next step.

swh/storage/archiver/copier.py
53–66 ↗(On Diff #64)

If there is an error during the process, some files may already have been copied. What would be the best option? Mark them as 'present', or just ignore them and reschedule the whole batch after the 'ongoing' maximum delay has elapsed?

swh/storage/archiver/director.py
158–212 ↗(On Diff #64)

Please ignore this, as it is only used for debugging purposes. However, it brings up a question.

The table content_archive contains a field status that may be 'missing'. Does that mean that everything in the content table will be in the content_data table as 'missing' at the start?
The 'missing' status is something we can compute on the fly (present in content but not in content_archive) in order to avoid this storage.
I don't have a sense of the size it takes, though (but I'd like to). It may be negligible compared to the whole database and not worth the extra effort it would require to compute the missing entries on the fly. Is it?
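To illustrate the alternative being discussed: a minimal sketch of computing the 'missing' contents on the fly with a LEFT JOIN, instead of storing explicit 'missing' rows. This uses sqlite3 as a stand-in for the real PostgreSQL schema, and the table layout is deliberately simplified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Simplified stand-ins for the real content / content_archive tables.
cur.execute("CREATE TABLE content (sha1 TEXT PRIMARY KEY)")
cur.execute("CREATE TABLE content_archive (sha1 TEXT, status TEXT)")
cur.executemany("INSERT INTO content VALUES (?)", [("a",), ("b",), ("c",)])
cur.execute("INSERT INTO content_archive VALUES ('a', 'present')")

# 'missing' = in content but with no row at all in content_archive:
# no need to materialize 'missing' rows, a LEFT JOIN recovers them.
cur.execute("""
    SELECT c.sha1
    FROM content c
    LEFT JOIN content_archive ca ON ca.sha1 = c.sha1
    WHERE ca.sha1 IS NULL
""")
missing = sorted(row[0] for row in cur.fetchall())
print(missing)  # → ['b', 'c']
```

Whether this is cheaper than storing the rows depends on the join cost at the archive's scale, which is exactly the open question above.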

swh/storage/tests/test_archiver.py
59–72 ↗(On Diff #64)

I had to do some db commits because the archiver directly uses the local storage's db. As the test class uses its own cursor, the archiver would not have been aware of these changes otherwise.
But that doesn't seem very nice. Is there another way?

ardumont added inline comments.
swh/storage/archiver/__init__.py
1 ↗(On Diff #64)

why the # noqa?

swh/storage/archiver/copier.py
16 ↗(On Diff #64)

a list of sha1`s`
this copier has

50 ↗(On Diff #64)

I think it might work without needing to consume the map.

Can you refresh my memory: why do you need to strip the first 2 characters (is it '\\x')?
I think this might be due to how you test.

53–66 ↗(On Diff #64)

As a first approximation, considering that nothing is done is reasonable.
(If nothing prevents adding some existing content again later)

56 ↗(On Diff #64)
self.server.content_add(map(lambda c: c['data'], contents))

?

58 ↗(On Diff #64)

You may want to avoid writing your doubts in the code and use differential for that (as you also did below).

I like to use TODO (you'll need to think more and improve the code), FIXME (you saw something ugly but this is not the time to fix it), HACK (you had to do something horribly ugly but did not see any other way around it), followed by some concise description of what this is all about.

64 ↗(On Diff #64)

Well, it depends on how you see the asynchronous part working...

If the archive-director spawns asynchronous archive-workers, which in turn spawn asynchronous archive-copiers, it could be simpler to do what you propose (database access here)...
But then the abstraction around the content copy state is broken, meaning we now have multiple classes in charge of it (first the worker initiates something that the copier must update and close).
Or we could have a more complex design where the copier also announces "woohoo, I'm done" (by queueing another message)... But then it's complicated...

But if the async activity stops at the archive-workers, meaning only the archive-director spawns async archive-workers,
then what is there now is ok (and simpler).

swh/storage/archiver/director.py
29 ↗(On Diff #64)

which one is needed as backup?
process`es`

39 ↗(On Diff #64)

c`o`py`

84 ↗(On Diff #64)

content`s`

94 ↗(On Diff #64)

content`s`

99 ↗(On Diff #64)

content`s`
w`h`ich
cont`a`ins

115 ↗(On Diff #64)

storage slaves

145 ↗(On Diff #64)

archiva`l`

158–212 ↗(On Diff #64)

Please ignore this, as it is only used for debugging purposes.

Ok.
Either you could add a comment NOT FOR PRODUCTION in the function's docstring, or move this code into a bin/ folder at the repository's root (the plus side is that it would be easier to reach from the cli).

Does that mean that everything in the content table will be in the content_data table as 'missing' at the start?

Possibly.

I see two problems:

  • one is adapting the existing code to deal with new content as soon as this module is ready. That is, when content_add is triggered, it also adds the newly added contents to the archival table with a 'missing' status.
  • the other is how to get the existing contents archived (that's another task).

But that is for later.

I don't have a sense of the size it takes, though (but I'd like to).

To get an approximate size (it's updated daily), check https://archive.softwareheritage.org/api/1/stat/counters/.
Right now, we are at 2349689600 contents ^^.

It may be negligible compared to the whole database and not worth the extra effort it would require to compute the missing entries on the fly. Is it?

I think that bootstrapping this and dealing with prior data are two separate tasks.

swh/storage/archiver/worker.py
13 ↗(On Diff #64)

Indeed, but we have abstracted it.
Extend swh.scheduler.task.Task instead... (and you'd need to add python3-swh.scheduler as a dependency in debian/control too).

14 ↗(On Diff #64)

should probably extend``.
content`s`

20 ↗(On Diff #64)

ha`s`
dictio`n`ary

37 ↗(On Diff #64)

content's location

49 ↗(On Diff #64)

archiva`l`

52 ↗(On Diff #64)

amou`n`t
contain``

82 ↗(On Diff #64)

If it's an update, call it update ^^

87 ↗(On Diff #64)

change its status

90 ↗(On Diff #64)

content id

94 ↗(On Diff #64)

change`s`

111 ↗(On Diff #64)

We hide the sql queries in db usually, it's clearer to continue that way.

swh/storage/tests/test_archiver.py
59–72 ↗(On Diff #64)

Using self.conn?

ATM, there is no task queue involved and everything would run on the same thread.

Ok.

Were you able to run your code aside from the tests?

If everything is correct, it will be the next step.

Everything seems correct ^^ (aside from the itsy bitsy teeny weeny remarks :D)

swh/storage/archiver/__init__.py
1 ↗(On Diff #64)

Ok, after discussing with Quentin: it's due to the pre-commit hook we have installed, which refuses to let the user commit if there are any pep8 violations. And since the module is imported but apparently not used, flake8 considers it a failure.

We need to find a way to improve on this (or not).

I know it's possible to have some setup file for flake8 to ignore some errors (this one is F401).
But ignoring this error is too much - http://flake8.readthedocs.io/en/latest/warnings.html#error-codes.
We only want to ignore it in that particular case.

Also, I found this related question: http://stackoverflow.com/questions/31079047/python-pep8-class-in-init-imported-but-not-used. Following the proposed solution, this'd give something like:

from .director import ArchiverDirector
from .worker import ArchiverWorker
from .copier import ArchiverCopier

__all__ = ['ArchiverDirector', 'ArchiverWorker', 'ArchiverCopier']

I don't know if this is reasonable or not, but it works.
As a pro, it has the merit of forcing us to be explicit about what's public...
As a con, it's somewhat redundant: instead of 18 characters, we have 66...

@olasd @zack What do you think?

swh/storage/archiver/copier.py
53–66 ↗(On Diff #64)

(If nothing prevents adding some existing content again later)

What I meant was: as long as you can write and overwrite existing contents again, I think it's ok.

Were you able to run your code aside from the tests?

I've done some tests with a local backup and very small samples but it was on an early version.

I have no idea how the task queue works. Aside from inheriting from swh.scheduler.task.Task, what should I do to the worker to make it asynchronous?

swh/storage/archiver/copier.py
53–66 ↗(On Diff #64)

This shouldn't create any problem (ATM), as the object storage just writes files without checking whether they already exist.

56 ↗(On Diff #64)

Just checked: RemoteObjStorage::content_add is linked to ObjStorage::add_bytes, which only takes a single content.
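A minimal sketch of what that implies for the copier: push the batch one content at a time rather than passing a map. The FakeObjStorage class below is a hypothetical stand-in for RemoteObjStorage, used only to make the sketch self-contained; the real API may differ.

```python
class FakeObjStorage:
    """Stand-in for RemoteObjStorage, recording what was added."""
    def __init__(self):
        self.added = []

    def content_add(self, data):
        # Like the real content_add -> add_bytes chain, accepts ONE content.
        self.added.append(data)

def add_contents(server, contents):
    # One call per content, instead of content_add(map(lambda c: ..., contents))
    for content in contents:
        server.content_add(content['data'])

server = FakeObjStorage()
add_contents(server, [{'data': b'blob1'}, {'data': b'blob2'}])
print(server.added)  # → [b'blob1', b'blob2']
```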

64 ↗(On Diff #64)

Now that I think about it, the specification is kinda blurry about what precisely is asynchronous. But I understood it as the Director spawning async workers that execute the copier's code.

In D23#416, @qcampos wrote:

Were you able to run your code aside from the tests?

I've done some tests with a local backup and very small samples but it was on an early version.

Ok. Can you please check with the current version?

I have no idea how the task queue works. Aside from inheriting from swh.scheduler.task.Task, what should I do to the worker to make it asynchronous?

We usually create a task.py module (here that would be swh.storage.archiver.task) in which we declare an inherited Task class
(https://forge.softwareheritage.org/diffusion/DSCH/browse/master/swh/scheduler/task.py).

This class will contain the async code to execute (here, following the actual code, we'd delegate the job to the ArchiverCopier inside the new task's run function, passing every needed piece of state as a function parameter, for example).

This task declares a queue name to use for receiving messages, and a run method (for example swh_storage_archive_copier).
The queue is what is used to communicate the job to do (the producer sends a message to a queue, the worker picks up the message and starts working).
The code in run is the actual code executed by a celery worker.
Some task examples: https://forge.softwareheritage.org/diffusion/61/browse/master/swh/fetcher/googlecode/tasks.py.
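The task module described above could look roughly like this. This is only a sketch: the Task class below is a minimal stand-in for swh.scheduler.task.Task (itself a celery Task) so the shape runs without celery installed, and the class, queue, and module names are illustrative, not the real SWH ones.

```python
class Task:
    """Minimal stand-in for swh.scheduler.task.Task (a celery Task)."""
    def run(self, *args, **kwargs):
        raise NotImplementedError

# Hypothetical swh/storage/archiver/tasks.py
class ArchiverTask(Task):
    """Archive a batch of contents by delegating the work to the copier."""
    task_queue = 'swh_storage_archive_worker'  # queue the worker listens on

    def run(self, batch):
        # In the real code this would instantiate ArchiverCopier with the
        # batch plus the storage connection info received as parameters
        # (a task must be self-contained: no shared in-process state),
        # then call the copier's run() method.
        return list(batch)  # placeholder for the copier's work

result = ArchiverTask().run(['sha1-a', 'sha1-b'])
print(result)  # → ['sha1-a', 'sha1-b']
```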

Then you'd need to adapt the current ArchiverDirector to produce a message (that says: here, do some work with that batch of contents).
So instead of instantiating ArchiverWorker directly (swh/storage/archiver/director.py line 88), you'd need to retrieve the task you want from the app and delay it (this is the message-sending part).
Here is an example: https://forge.softwareheritage.org/diffusion/61/browse/master/swh/fetcher/googlecode/checker_producer.py;9a428e0f2b908db390286a96fe4e559b592edb9b$29

Then you'd also need to adapt how you trigger everything.
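The producer/consumer flow described above can be sketched in the same spirit. Here queue.Queue stands in for the real message broker, delay() mimics celery's task.delay(), and the task name is made up; this only shows the shape of the message passing, not the real API.

```python
import queue

broker = queue.Queue()  # stand-in for the real message broker

def delay(task_name, *args):
    """Mimic celery's task.delay(): serialize the call as a queued message."""
    broker.put((task_name, args))

# Director side: instead of instantiating ArchiverWorker directly,
# produce one message per batch of contents.
delay('swh.storage.archiver.tasks.ArchiverTask', ['sha1-a', 'sha1-b'])

# Worker side: pick up the message and execute the corresponding task code.
task_name, args = broker.get()
print(task_name, args[0])  # → swh.storage.archiver.tasks.ArchiverTask ['sha1-a', 'sha1-b']
```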

qcampos marked an inline comment as done.
qcampos edited edge metadata.

Add asynchronous code for the archiver

  • Change database model and update swh.storage.db api in order to follow those changes
  • Add synchronous first implementation of the archiver
  • Add tests for the archiver
  • Add asynchronous mode
  • Fix storage instantiation problem
  • Align ArchiverWorker parameter orders
  • Extract imports from the method to top-level
  • Fix docstring
swh/storage/objstorage/api/server.py
68–69 ↗(On Diff #67)

That's not clean, and it will be in its own commit (its own diff maybe?).

qcampos edited edge metadata.
  • Add asynchronous mode
  • Fix storage instantiation problem
  • Align ArchiverWorker parameter orders
  • Extract imports from the method to top-level
  • Fix docstring
  • Change server mains in order to change default port
  • Make the worker asynchronous optional
qcampos edited edge metadata.
  • Change database model and update swh.storage.db api in order to follow those changes
  • Add synchronous first implementation of the archiver
  • Add asynchronous mode
  • Make the worker asynchronous optional
qcampos edited edge metadata.
  • Fix manual test launching
qcampos edited edge metadata.
  • Add synchronous first implementation of the archiver
  • Add asynchronous mode
  • Make the worker asynchronous optional
swh/storage/objstorage/api/server.py
68–69 ↗(On Diff #67)

Done in D25.

This revision is now accepted and ready to land. May 12 2016, 1:30 PM
This revision was automatically updated to reflect the committed changes.
zack changed the visibility from "All Users" to "Public (No Login Required)".