scanner-benchmark: use os.listdir() instead of os.walk() to avoid symlinks
Closed, Public

Authored by DanSeraf on Feb 4 2021, 2:29 PM.

Event Timeline

Build has FAILED

Patch application report for D5011 (id=17882)

Could not rebase; Attempt merge onto 33a9cd4eb9...

Auto-merging swh/scanner/cli.py
Merge made by the 'recursive' strategy.
 benchmark.py                   | 136 ++++++++++++++
 run_backend.sh                 |  15 ++
 run_benchmark.sh               |  37 ++++
 swh/scanner/backend.py         |  16 +-
 swh/scanner/benchmark_algos.py | 395 +++++++++++++++++++++++++++++++++++++++++
 swh/scanner/cli.py             |  73 ++++++++
 swh/scanner/model.py           |  57 +++++-
 7 files changed, 718 insertions(+), 11 deletions(-)
 create mode 100755 benchmark.py
 create mode 100755 run_backend.sh
 create mode 100755 run_benchmark.sh
 create mode 100644 swh/scanner/benchmark_algos.py
Changes applied before test
commit 4d3001147e4469ca62353bcd681d9a696d596517
Merge: 33a9cd4 ba54311
Author: Jenkins user <jenkins@localhost>
Date:   Thu Feb 4 13:29:34 2021 +0000

    Merge branch 'diff-target' into HEAD

commit ba54311a7c2a7eb16491044a04507f9701b3c57b
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 14:28:31 2021 +0100

    run random algorithm only once

commit aaf3266f05c569bd0f7f30013d455c37df2aaf27
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 14:17:59 2021 +0100

    use os.listdir() instead of os.walk() to avoid symlinks

commit 3d3665a4f5bb77c981a27ee9206a2c92717e82b0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 15:30:54 2021 +0100

    algo_min: delete the upstream directories if a (sub)directory is unknown

commit c42e643aa512cbd8c039be2350159e46d34daa0d
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 13:24:12 2021 +0100

    model: wrong iteration in 'iterate_bfs' function

commit 0d3b5cb86144b87accab7f9a45d6457f457d47d0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 11:13:13 2021 +0100

    make 'set_children_status' works with different kind of nodes

commit b601f382db643ddb0af40c85d1d8fc5065bd7224
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jan 28 16:45:45 2021 +0100

    file_priority: remove children only when the unset directory is known
    
    If the directory is unknown the algorithm should check the downstream
    directories since they could be unknown too.

commit 5e01c09af4c61a309d71adb0d4f61d1766b8a021
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 26 10:10:00 2021 +0100

    retry request in case of backend failure

commit ebad16c02da6bffbc96a623e082a4b5f706d7b1f
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 25 13:48:14 2021 +0100

    algo_min: remove the current node as well

commit 5cd9f762467ece41d7d8e1ae1841e1d24aad45e4
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 18 10:26:06 2021 +0100

    fix: the temporary directory is removed by tempfile

commit 7a289332f73025f94f7f85ab5bd6755b876ebe68
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:12:18 2021 +0100

    print results as a csv

commit 9e4df16d9486a891498124dd4cfb7558c57dfa0c
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:10:39 2021 +0100

    extract repositories in temporary directories

commit 7bd1939949dcbcf0c52b8647f2b1750f2c9d2300
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/95/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/95/console

Harbormaster returned this revision to the author for changes because remote builds failed. Feb 4 2021, 2:30 PM
Harbormaster failed remote builds in B19001: Diff 17882!
zack requested changes to this revision. Feb 4 2021, 2:56 PM
zack added a subscriber: zack.
zack added inline comments.
swh/scanner/benchmark_algos.py
306–309

If you want to avoid symlinks, these don't work, because the documentation (for both) says:

"This follows symbolic links, so both islink() and isfile() can be true for the same path."

You want to add a test before either of these, like: "if os.path.islink(...): ... continue ..."
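
To illustrate the suggestion, here is a minimal sketch of testing islink() before isdir()/isfile(). It is not the code from benchmark_algos.py: the function name `classify_entries` and the returned dictionary layout are hypothetical, and only `root_path` comes from the discussion.

```python
import os


def classify_entries(root_path):
    """Sort the direct children of root_path into symlinks, dirs and files.

    Hypothetical helper for illustration only. islink() is tested first
    because isdir() and isfile() follow symbolic links, so a symlink to a
    directory would otherwise be reported as a directory.
    """
    result = {"symlinks": [], "dirs": [], "files": []}
    for name in sorted(os.listdir(root_path)):
        path = os.path.join(root_path, name)
        if os.path.islink(path):
            # Record the link itself and skip the checks that follow links.
            result["symlinks"].append(name)
            continue
        if os.path.isdir(path):
            result["dirs"].append(name)
        elif os.path.isfile(path):
            result["files"].append(name)
    return result
```

Without the islink() test, a link pointing at a directory would end up in "dirs" and a link pointing at a file in "files".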

This revision now requires changes to proceed. Feb 4 2021, 2:56 PM
swh/scanner/benchmark_algos.py
306–309

Actually, you probably do not want to ignore symlinks completely (I think? It depends on how your tree is then used).

If you want to keep them, you should probably just avoid listing root_path if *it* is a symlink, by calling islink() on it before invoking listdir() on it.
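
A rough sketch of that second option, assuming a recursive listing: symlinks stay in the result as entries, but listdir() is never invoked on them, so a link that points back up the tree cannot cause endless recursion. The function name `list_tree` is hypothetical and this is not the code under review.

```python
import os


def list_tree(root_path):
    """Return every path below root_path without descending into symlinks.

    Hypothetical helper for illustration only. A symlink appears in the
    result (it is appended by the caller) but its target is never listed.
    """
    # Check islink() first: isdir() alone would be true for a symlinked directory.
    if os.path.islink(root_path) or not os.path.isdir(root_path):
        return []
    paths = []
    for name in sorted(os.listdir(root_path)):
        path = os.path.join(root_path, name)
        paths.append(path)
        # Recurse; the guard above stops at symlinks and regular files.
        paths.extend(list_tree(path))
    return paths
```

Note that with this guard a root_path that is itself a symlink yields an empty list, which matches the suggestion of not listing it at all.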

Build has FAILED

Patch application report for D5011 (id=17892)

Could not rebase; Attempt merge onto 33a9cd4eb9...

Auto-merging swh/scanner/cli.py
Merge made by the 'recursive' strategy.
 benchmark.py                   | 136 ++++++++++++++
 run_backend.sh                 |  15 ++
 run_benchmark.sh               |  37 ++++
 swh/scanner/backend.py         |  16 +-
 swh/scanner/benchmark_algos.py | 396 +++++++++++++++++++++++++++++++++++++++++
 swh/scanner/cli.py             |  73 ++++++++
 swh/scanner/model.py           |  57 +++++-
 7 files changed, 719 insertions(+), 11 deletions(-)
 create mode 100755 benchmark.py
 create mode 100755 run_backend.sh
 create mode 100755 run_benchmark.sh
 create mode 100644 swh/scanner/benchmark_algos.py
Changes applied before test
commit 34d1383d95e3a26cd5d2e26aad84dbe624698a80
Merge: 33a9cd4 0806485
Author: Jenkins user <jenkins@localhost>
Date:   Thu Feb 4 15:31:35 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 080648583efcdf14c31af2f42ccc1c86f2745b63
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 16:28:21 2021 +0100

    run random algorithm only once

commit 3004b66787b28cffa1047427876750397f02e06a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 16:27:59 2021 +0100

    use os.listdir() instead of os.walk() to avoid symlinks

commit 3d3665a4f5bb77c981a27ee9206a2c92717e82b0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 15:30:54 2021 +0100

    algo_min: delete the upstream directories if a (sub)directory is unknown

commit c42e643aa512cbd8c039be2350159e46d34daa0d
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 13:24:12 2021 +0100

    model: wrong iteration in 'iterate_bfs' function

commit 0d3b5cb86144b87accab7f9a45d6457f457d47d0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 11:13:13 2021 +0100

    make 'set_children_status' works with different kind of nodes

commit b601f382db643ddb0af40c85d1d8fc5065bd7224
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jan 28 16:45:45 2021 +0100

    file_priority: remove children only when the unset directory is known
    
    If the directory is unknown the algorithm should check the downstream
    directories since they could be unknown too.

commit 5e01c09af4c61a309d71adb0d4f61d1766b8a021
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 26 10:10:00 2021 +0100

    retry request in case of backend failure

commit ebad16c02da6bffbc96a623e082a4b5f706d7b1f
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 25 13:48:14 2021 +0100

    algo_min: remove the current node as well

commit 5cd9f762467ece41d7d8e1ae1841e1d24aad45e4
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 18 10:26:06 2021 +0100

    fix: the temporary directory is removed by tempfile

commit 7a289332f73025f94f7f85ab5bd6755b876ebe68
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:12:18 2021 +0100

    print results as a csv

commit 9e4df16d9486a891498124dd4cfb7558c57dfa0c
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:10:39 2021 +0100

    extract repositories in temporary directories

commit 7bd1939949dcbcf0c52b8647f2b1750f2c9d2300
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/96/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/96/console

Build has FAILED

Patch application report for D5011 (id=17893)

Rebasing onto 33a9cd4eb9...

First, rewinding head to replay your work on top of it...
Applying: scanner experiments
Applying: extract repositories in temporary directories
Applying: print results as a csv
Applying: fix: the temporary directory is removed by tempfile
Applying: algo_min: remove the current node as well
Applying: retry request in case of backend failure
Applying: file_priority: remove children only when the unset directory is known
Applying: make 'set_children_status' works with different kind of nodes
Applying: model: wrong iteration in 'iterate_bfs' function
Applying: algo_min: delete the upstream directories if a (sub)directory is unknown
Applying: check if path is a symlink
Applying: run random algorithm only once
Changes applied before test
commit 6c534b8af6b62468cf8467aa2791f63f1a471958
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 16:47:24 2021 +0100

    run random algorithm only once

commit 3446bb600e3aeca5ddc22b5b9a17eda224996450
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 16:27:59 2021 +0100

    check if path is a symlink
    
    exclude the path if it is a symlink.
    
    - os.listdir() instead of os.walk() to list subdirectories

commit b46c265a776490a6797454e64e5cbc607fba1e94
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 15:30:54 2021 +0100

    algo_min: delete the upstream directories if a (sub)directory is unknown

commit 4cec0aa255ba71479acb7cd58048f697c3ad0aa5
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 13:24:12 2021 +0100

    model: wrong iteration in 'iterate_bfs' function

commit 15cb48637cf708bf15fcab7a6958b2b97bdafe7b
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 11:13:13 2021 +0100

    make 'set_children_status' works with different kind of nodes

commit 3ebcebddc15ac53203c53ac771a501339ff681a8
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jan 28 16:45:45 2021 +0100

    file_priority: remove children only when the unset directory is known
    
    If the directory is unknown the algorithm should check the downstream
    directories since they could be unknown too.

commit d64b0d8d402872de7351b0674bde391efcff8fcf
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 26 10:10:00 2021 +0100

    retry request in case of backend failure

commit ba29deefccf09642d1c006b1e0887f369d87d321
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 25 13:48:14 2021 +0100

    algo_min: remove the current node as well

commit fa7460a9f9a1a291ea43f7af60486c4a362d04d2
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 18 10:26:06 2021 +0100

    fix: the temporary directory is removed by tempfile

commit f7464b81a5169755a5dbcca853a694ccb29ec9e7
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:12:18 2021 +0100

    print results as a csv

commit f0f34283cc77dd0795484f5904918a7bba67e329
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:10:39 2021 +0100

    extract repositories in temporary directories

commit 2d4bf40939653e71d0715a4d3fdba6ce5765991c
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/97/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/97/console

Build has FAILED

Patch application report for D5011 (id=17894)

Rebasing onto 33a9cd4eb9...

First, rewinding head to replay your work on top of it...
Applying: scanner experiments
Applying: extract repositories in temporary directories
Applying: print results as a csv
Applying: fix: the temporary directory is removed by tempfile
Applying: algo_min: remove the current node as well
Applying: retry request in case of backend failure
Applying: file_priority: remove children only when the unset directory is known
Applying: make 'set_children_status' works with different kind of nodes
Applying: model: wrong iteration in 'iterate_bfs' function
Applying: algo_min: delete the upstream directories if a (sub)directory is unknown
Applying: check if path is a symlink
Applying: run random algorithm only once
Changes applied before test
commit 2eca880da64bf5537ac4603a09cd2804c3151d40
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 16:47:24 2021 +0100

    run random algorithm only once

commit 3a6203415be0be7825edd74cd505bb6d14ffb635
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 16:27:59 2021 +0100

    check if path is a symlink
    
    exclude the path if it is a symlink.
    
    - os.listdir() instead of os.walk() to list subdirectories

commit e3e1a96f5913905a42762c672720d1480184f858
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 15:30:54 2021 +0100

    algo_min: delete the upstream directories if a (sub)directory is unknown

commit 5f27ca465bc33d8babc70f8bfb258165934153e0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 13:24:12 2021 +0100

    model: wrong iteration in 'iterate_bfs' function

commit 590fc3252c7aabbdf30f5fce001d45d487a880d7
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 11:13:13 2021 +0100

    make 'set_children_status' works with different kind of nodes

commit d829830b407e06b3bc2624a8552adcddb90278ce
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jan 28 16:45:45 2021 +0100

    file_priority: remove children only when the unset directory is known
    
    If the directory is unknown the algorithm should check the downstream
    directories since they could be unknown too.

commit 4bceda44454777762d5bf677818478a72ad2f624
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 26 10:10:00 2021 +0100

    retry request in case of backend failure

commit 00a2d73a2193406d6fba0a46c91e3098d800d986
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 25 13:48:14 2021 +0100

    algo_min: remove the current node as well

commit 243faa41794f2c5f4182d627bbf3a9dc2e14b75a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 18 10:26:06 2021 +0100

    fix: the temporary directory is removed by tempfile

commit 942d63226f3e589ce0315ec89317118198048a8a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:12:18 2021 +0100

    print results as a csv

commit 88a9d3232e3a04f8e3d96e95ae05de7dc406c87a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:10:39 2021 +0100

    extract repositories in temporary directories

commit 7a55f8962e424771aaf5410d7c11103f8fcdbb7c
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/98/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/98/console

Build has FAILED

Patch application report for D5011 (id=17895)

Could not rebase; Attempt merge onto 33a9cd4eb9...

Auto-merging swh/scanner/cli.py
Merge made by the 'recursive' strategy.
 benchmark.py                   | 136 ++++++++++++++
 run_backend.sh                 |  15 ++
 run_benchmark.sh               |  37 ++++
 swh/scanner/backend.py         |  16 +-
 swh/scanner/benchmark_algos.py | 396 +++++++++++++++++++++++++++++++++++++++++
 swh/scanner/cli.py             |  73 ++++++++
 swh/scanner/model.py           |  57 +++++-
 7 files changed, 719 insertions(+), 11 deletions(-)
 create mode 100755 benchmark.py
 create mode 100755 run_backend.sh
 create mode 100755 run_benchmark.sh
 create mode 100644 swh/scanner/benchmark_algos.py
Changes applied before test
commit 28ceb8e275f88e4fee71fbc725f9afb4360b5d0e
Merge: 33a9cd4 e46e713
Author: Jenkins user <jenkins@localhost>
Date:   Thu Feb 4 16:39:04 2021 +0000

    Merge branch 'diff-target' into HEAD

commit e46e713d2145f69be19e16f5d22a565648e7c0ff
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 16:28:21 2021 +0100

    run random algorithm only once

commit 3004b66787b28cffa1047427876750397f02e06a
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Feb 4 16:27:59 2021 +0100

    use os.listdir() instead of os.walk() to avoid symlinks

commit 3d3665a4f5bb77c981a27ee9206a2c92717e82b0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 15:30:54 2021 +0100

    algo_min: delete the upstream directories if a (sub)directory is unknown

commit c42e643aa512cbd8c039be2350159e46d34daa0d
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 13:24:12 2021 +0100

    model: wrong iteration in 'iterate_bfs' function

commit 0d3b5cb86144b87accab7f9a45d6457f457d47d0
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Feb 2 11:13:13 2021 +0100

    make 'set_children_status' works with different kind of nodes

commit b601f382db643ddb0af40c85d1d8fc5065bd7224
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Jan 28 16:45:45 2021 +0100

    file_priority: remove children only when the unset directory is known
    
    If the directory is unknown the algorithm should check the downstream
    directories since they could be unknown too.

commit 5e01c09af4c61a309d71adb0d4f61d1766b8a021
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 26 10:10:00 2021 +0100

    retry request in case of backend failure

commit ebad16c02da6bffbc96a623e082a4b5f706d7b1f
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 25 13:48:14 2021 +0100

    algo_min: remove the current node as well

commit 5cd9f762467ece41d7d8e1ae1841e1d24aad45e4
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Mon Jan 18 10:26:06 2021 +0100

    fix: the temporary directory is removed by tempfile

commit 7a289332f73025f94f7f85ab5bd6755b876ebe68
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:12:18 2021 +0100

    print results as a csv

commit 9e4df16d9486a891498124dd4cfb7558c57dfa0c
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Tue Jan 12 23:10:39 2021 +0100

    extract repositories in temporary directories

commit 7bd1939949dcbcf0c52b8647f2b1750f2c9d2300
Author: Daniele Serafini <me@danieleserafini.eu>
Date:   Thu Dec 10 23:59:31 2020 +0100

    scanner experiments

Link to build: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/99/
See console output for more information: https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/99/console

This revision is now accepted and ready to land. Feb 4 2021, 5:45 PM
This revision was landed with ongoing or failed builds. Feb 4 2021, 5:46 PM
This revision was automatically updated to reflect the committed changes.