
Randomize last_update in generated ListedOrigins in fill_test_data
Abandoned, Public

Authored by douardda on Jan 22 2021, 10:50 AM.

Details

Reviewers
vlorentz
Group Reviewers
Reviewers
Summary

Also insert objects in batches of 10k to reduce RAM usage.
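The batching idea in the summary can be sketched roughly as follows. This is not the code from the diff: `batched`, `generate_origins`, `fill`, the `record_batch` callback, the example URLs and the one-year window for `last_update` are all hypothetical names and values chosen for illustration. The point is only that a lazy generator combined with fixed-size batches keeps at most one batch in memory at a time.

```python
import itertools
import random
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

ONE_YEAR = 365 * 24 * 3600  # assumed spread for last_update, in seconds


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def generate_origins(num_origins: int, now: datetime) -> Iterator[dict]:
    """Lazily yield hypothetical origin records with a random last_update."""
    max_ts = int(now.timestamp())
    for i in range(num_origins):
        # Pick a timestamp somewhere in the year before `now`.
        ts = random.randint(max_ts - ONE_YEAR, max_ts)
        yield {
            "url": f"https://example.com/{i:06d}.git",
            "last_update": datetime.fromtimestamp(ts, tz=timezone.utc),
        }


def fill(
    record_batch: Callable[[List[dict]], None],
    num_origins: int,
    batch_size: int = 10_000,
) -> int:
    """Insert origins batch by batch; `record_batch` stands in for the real insert call."""
    total = 0
    now = datetime.now(tz=timezone.utc)
    for batch in batched(generate_origins(num_origins, now), batch_size):
        record_batch(batch)
        total += len(batch)
    return total
```

With 25,000 origins and the default batch size, `record_batch` would be called three times, with 10,000, 10,000 and 5,000 records.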

Depends on D4919

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D4920 (id=17505)

Could not rebase; Attempt merge onto 03460207a1...

Updating 0346020..1c069ca
Fast-forward
 swh/scheduler/cli/simulator.py      | 13 +++++++---
 swh/scheduler/simulator/__init__.py | 51 ++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 24 deletions(-)
Changes applied before test
commit 1c069ca34add6b26d060588abb7958a089cb0735
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jan 21 11:33:19 2021 +0100

    Randomize last_upadte in generated ListedOrigins in fill_test_data
    
    also insert objects by batches of 10k to make it nicer with ram usage.

commit 8a9aaf3942d5585e6af038ebced3cde6faf27c7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jan 21 11:30:21 2021 +0100

    Add a --num-origins option to the fill-test-data cli command

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/245/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/245/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jan 22 2021, 10:55 AM
Harbormaster failed remote builds in B18637: Diff 17505!

I'd like to keep the simulator deterministic. What about adding a CLI option with a seed?

Fix typo and drop unneeded Diff dependency

Build is green

Patch application report for D4920 (id=17505)

Could not rebase; Attempt merge onto b93aa5be2c...

Merge made by the 'recursive' strategy.
 swh/scheduler/cli/simulator.py      | 13 +++++++---
 swh/scheduler/simulator/__init__.py | 51 ++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 24 deletions(-)
Changes applied before test
commit dbca7a7b8a4b041a7092447ee4d91851dc22f711
Merge: b93aa5b 1c069ca
Author: Jenkins user <jenkins@localhost>
Date:   Fri Jan 22 10:36:24 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 1c069ca34add6b26d060588abb7958a089cb0735
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jan 21 11:33:19 2021 +0100

    Randomize last_upadte in generated ListedOrigins in fill_test_data
    
    also insert objects by batches of 10k to make it nicer with ram usage.

commit 8a9aaf3942d5585e6af038ebced3cde6faf27c7e
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jan 21 11:30:21 2021 +0100

    Add a --num-origins option to the fill-test-data cli command

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/255/ for more details.

Build is green

Patch application report for D4920 (id=17515)

Rebasing onto b93aa5be2c...

First, rewinding head to replay your work on top of it...
Applying: Randomize last_update in generated ListedOrigins in fill_test_data
Changes applied before test
commit b48917915cff235c880061afb29a6257f50b4baf
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu Jan 21 11:33:19 2021 +0100

    Randomize last_update in generated ListedOrigins in fill_test_data
    
    also insert objects by batches of 10k to make it nicer with ram usage.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/256/ for more details.

This revision now requires changes to proceed. Jan 22 2021, 11:41 AM

I'd like to keep the simulator deterministic. What about adding a CLI option with a seed?

why not (cli option), but why (keep it deterministic)?

I'd like to keep the simulator deterministic. What about adding a CLI option with a seed?

why not (cli option), but why (keep it deterministic)?

Also, a seed alone will not be enough here: maxts = int(utcnow().timestamp()) also breaks determinism...

I'd like to keep the simulator deterministic. What about adding a CLI option with a seed?

why not (cli option), but why (keep it deterministic)?

Also, a seed alone will not be enough here: maxts = int(utcnow().timestamp()) also breaks determinism...

So to get deterministic behavior, the option should allow hard-setting this last_update time.

why not (cli option), but why (keep it deterministic)?

  1. reproducibility, so we can run the simulator twice with different code, and be sure that differences in behavior are not caused by randomness
  2. if a particular run gives an unexpected result, you can run it again with more logging / instrumentation to see what went wrong

So to get deterministic behavior, the option should allow hard-setting this last_update time.

or generate it from the seed as well
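Both options discussed here (an explicitly given reference time, or one generated from the seed) can be sketched in a few lines. This is an illustration under stated assumptions, not the simulator's code: `random_last_updates`, `BASE_TS` and the one-year window are hypothetical. What matters is that nothing reads the wall clock, so the same arguments always give the same timestamps.

```python
import random
from datetime import datetime, timezone
from typing import List, Optional

ONE_YEAR = 365 * 24 * 3600  # assumed spread for last_update, in seconds
# Arbitrary fixed epoch used only by this sketch (2021-01-01T00:00:00Z).
BASE_TS = 1_609_459_200


def random_last_updates(
    count: int, seed: int, max_ts: Optional[int] = None
) -> List[datetime]:
    """Return `count` pseudo-random last_update values, reproducible from `seed`."""
    rng = random.Random(seed)  # private generator, global random state untouched
    if max_ts is None:
        # No explicit reference time: derive one from the seed
        # instead of calling utcnow().
        max_ts = BASE_TS + rng.randrange(ONE_YEAR)
    return [
        datetime.fromtimestamp(rng.randint(max_ts - ONE_YEAR, max_ts), tz=timezone.utc)
        for _ in range(count)
    ]
```

Calling `random_last_updates(5, seed=42)` twice returns identical lists, and passing `max_ts` pins the upper bound explicitly when a specific date is needed.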

why not (cli option), but why (keep it deterministic)?

  1. reproducibility, so we can run the simulator twice with different code, and be sure that differences in behavior are not caused by randomness

This is not what I call reproducibility, especially for simulations involving randomization. Using a hash of some values as a PRNG looks wrong to me, not because of the good or bad statistical properties of such generators, but because it obscures the intent and method of randomization, which makes the code hard to understand and maintain.

I'd much prefer we use a proper PRNG, with proper seed management, if we really need this level of reproducibility.
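One possible reading of "proper seed management", sketched with the standard library only: a single master seed produces one explicit `random.Random` instance per component, so the source of randomness is visible at each call site. `make_streams` and the component names are hypothetical and do not come from the simulator.

```python
import random
from typing import Dict, Iterable


def make_streams(seed: int, names: Iterable[str]) -> Dict[str, random.Random]:
    """Create one independent generator per named component from one master seed."""
    master = random.Random(seed)
    # Sort names so the child seeds do not depend on the caller's ordering.
    return {name: random.Random(master.getrandbits(64)) for name in sorted(names)}


streams = make_streams(seed=1234, names=["origins", "task_durations"])
origin_rng = streams["origins"]
duration_rng = streams["task_durations"]
```

Because each component draws from its own generator, extra draws in one component do not shift the values seen by another. Adding a new component name can still change the seeds assigned to names that sort after it.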

  1. if a particular run gives an unexpected result, you can run it again with more logging / instrumentation to see what went wrong

To me, the idea of such a simulation stack is not to identify (and fix) singular buggy behaviors.

So to get deterministic behavior, the option should allow hard-setting this last_update time.

or generate it from the seed as well

or make it possible to be explicitly given, if need be.

I'd much prefer we use a proper PRNG, with proper seed management, if we really need this level of reproducibility.

sure

To me, the idea of such a simulation stack is not to identify (and fix) singular buggy behaviors.

Of course. But if you do encounter one, it's helpful to have a way to reproduce it.

ardumont retitled this revision from Randomize last_upadte in generated ListedOrigins in fill_test_data to Randomize last_update in generated ListedOrigins in fill_test_data. Jan 25 2021, 6:26 PM