Page MenuHomeSoftware Heritage

Add a random content generator backend
ClosedPublic

Authored by douardda on Mar 27 2019, 11:29 AM.

Details

Summary

this backend is a naive objstorage backend that generates random file
content which sizes follow an approximation of the actual object size
distribution in the archive.

This is dedicated at stress testing other swh-objstorage backends.

Depends on D1302

Diff Detail

Repository
rDOBJS Object storage
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/objstorage/backends/generator.py
21

You may want to do self.data = fobj.read(min(1024, 2*size)), so eliminate consecutive small reads.

/dev/urandom could also be opened once in the constructor.

27

I don't think the filesize argument is useful. Instead, gen_random_file should check if its gen_random_file is not None, and use itertools.repeat(filesize)

53

TIL randint is right inclusive.

56

Should be renamed to gen_random_files

67

generates

80–84

Should be renamed to generator and _generator.

And why a property instead of initializing it in the contructor?

95

I don't like using yield on the rhs, it makes the code less readable. Why not rewrite list_content to use a simple for loop?

This revision now requires changes to proceed.Mar 27 2019, 2:02 PM

Could you also write some tests? Nothing fancy, just to have some test coverage on gen_filesize and gen_random_file

swh/objstorage/backends/generator.py
21

You may want to do self.data = fobj.read(min(1024, 2*size))

makes sense, indeed.

/dev/urandom could also be opened once in the constructor.

I'd rather not keep an open file descriptor for no good reason.

27

not sure it's a big improvement, but hey why not

56

or gen_random_content since it does not generate files, in the end.

80–84

And why a property instead of initializing it in the contructor?

So that RandomGeneratorObjStorage objects are pickle compliant so I can use them in multiprocessing based stuff.

80–84

Should be renamed to generator and _generator.

generator does not really improve the readability of the code IMHO, since it does not give a clue on what this stuff is generating. content_generator could be better for this (but I tend to dislike long identifiers).

95

it makes the code less readable

I disagree.

I find it makes actually more sense to write the list_content that way than what is done in other objstorage classes.

Also, this is a very dumb objstorage for testing purpose, I don't think it makes sense to fine tune it or make it a piece of enjoyable code, if I may.

refactor a bit according to vlorentz' comments and add a few tests

Forgot the comment about a min size

This revision is now accepted and ready to land.Mar 27 2019, 5:06 PM
This revision was landed with ongoing or failed builds.Mar 28 2019, 10:57 AM
This revision was automatically updated to reflect the committed changes.