Page MenuHomeSoftware Heritage

scanner: show result with a sunburst chart
ClosedPublic

Authored by DanSeraf on Mar 20 2020, 7:02 PM.

Details

Summary

Implementation of a sunburst chart for the scanner

The sunburst chart show for each directory how many files are present and the percentage of discovered.
result example: raw_result

Diff Detail

Repository
rDTSCN Code scanner
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

I'm quite confused by the data flow in plot.py (as I am often with code dealing with dataframes). Could you add a (large) docstring at the beginning of the module explaining the overall idea, and more comments in the code?

requirements-test.txt
4–6

Should be in requirements.txt, they are used outside tests too.

swh/scanner/model.py
116

directories looks like an accumulator; and it's initialized by the called with the content_count, which is the job done in this function.
It's also confusing to take an argument, then modify in place and return it.

Instead, you should turn this function into a private one (by prepending an understanding) without a return type, and add a public function getDirectoriesInfo(root: PosixPath) -> Dict[PosixPath, List].

This way, callers don't have to initialize the dict.

120–121

Could you document further the format of values?

135

As it's fixed-length, and both values have different meanings, use Tuple[int, int] instead.

140

"A tuple of the total number of contents, and the number of discovered contents."

What do you mean by "discovered"?

swh/scanner/plot.py
6–8 ↗(On Diff #10192)

this group of imports should be after the other ones. See the second item here: https://www.python.org/dev/peps/pep-0008/#imports

swh/scanner/tests/test_model.py
78

forgot to remove this print

This revision now requires changes to proceed.Mar 20 2020, 7:50 PM
swh/scanner/plot.py
71 ↗(On Diff #10192)

also, would be nice to write in this docstring what data is in the dataframe

swh/scanner/model.py
140

It's the number of contents found in the archive in a specific directory.

  • test dataframe generation
  • requested changes

This time I reviewed plot.py.

Most of my comments are to make the code more readable, and don't require in-depth changes

swh/scanner/model.py
116

Should be a private method now, so prefix its name with an underscore

117

This docstring doesn't give more info than the signature of the function. You should either make it more detailed or remove it.

128

no need to return it (same reason as in the other comment)

140

Could you explain this in both docstrings?

swh/scanner/plot.py
14–24 ↗(On Diff #10261)

module docstring must be before imports

27–31 ↗(On Diff #10261)

Could you add a description of each argument? (and possibly change the name df to express what it contains)

33 ↗(On Diff #10261)

"For each directory level stored, the new dataframe will have"?

38 ↗(On Diff #10261)

Could you add an example of inputs, and what the outputs are? (like you did for generate_df)

47 ↗(On Diff #10261)

forgot this print

49–55 ↗(On Diff #10261)

I think I understand what is going on here, but could you add a comment with an example of how the data is reshaped from df to df_tree step-by-step, for future readers?

58 ↗(On Diff #10261)

The data from the first df_trees will be copied into a new dataframe for each run of the loop. This is inefficient (quadratic complexity). You might want to build a list of dataframes and concatenate everything at the end.

70 ↗(On Diff #10261)

dirs_path: List[PosixPath]

88 ↗(On Diff #10261)

You could rename it generate_df_from_dirs.

And you can add a type annotation: dirs: Dict[PosixPath, Tuple[int, int]].

121 ↗(On Diff #10261)

parts_list = list(dir_path.parts)

swh/scanner/tests/test_plot.py
22–24 ↗(On Diff #10261)

Should be in an independent test

41 ↗(On Diff #10261)

Missing tests for build_hierarchical_df

This revision now requires changes to proceed.Mar 26 2020, 2:06 PM

Sorry I didn't do it earlier, but I finally got around to actually running this code, and I found a bug: if two directories have the same name (or if there's one named "total"), then the swh-scanner process exits early, and nothing is shown in the browser.

Also, for consistency with other formats, I would expect --format=sunburst to output a file (like other formats) instead of opening a browser. You could however add a new option --show or --open to immediately a browser.
I don't know if it would be easy to get plotly to write a static file; if not you may ignore this comment for now.

swh/scanner/model.py
120–121

missed this comment

140

and this one

swh/scanner/plot.py
42–95 ↗(On Diff #10300)

Great!

27–31 ↗(On Diff #10261)

renaming df to dataframe isn't more explicit about what it contains :/

swh/scanner/tests/test_plot.py
17 ↗(On Diff #10300)

should rename the test to test_generate_df_from_dirs

This revision now requires changes to proceed.Mar 27 2020, 10:43 AM

Sorry I didn't do it earlier, but I finally got around to actually running this code, and I found a bug: if two directories have the same name (or if there's one named "total"), then the swh-scanner process exits early, and nothing is shown in the browser.

Thanks! I will check a solution for that; anyways, i didn't have problem testing directories that have the same name, could you provide me an example, so i can replicate your test?

Also, for consistency with other formats, I would expect --format=sunburst to output a file (like other formats) instead of opening a browser. You could however add a new option --show or --open to immediately a browser.
I don't know if it would be easy to get plotly to write a static file; if not you may ignore this comment for now.

Yeah, i can easily write the result in a static file.

Thanks! I will check a solution for that; anyways, i didn't have problem testing directories that have the same name, could you provide me an example, so i can replicate your test?

Sure (sorry, I should have give you one earlier):

rm /tmp/scan_test -rf  # cleanup
mkdir /tmp/scan_test/foo -p
swh scanner scan /tmp/scan_test --format sunburst  # works
mkdir /tmp/scan_test/foo/foo -p
swh scanner scan /tmp/scan_test --format sunburst  # works
touch /tmp/scan_test/foo/__init__.py
swh scanner scan /tmp/scan_test --format sunburst  # crash
rm -rf /tmp/scan_test/foo/foo
swh scanner scan /tmp/scan_test --format sunburst  # works again
mkdir /tmp/scan_test/foo/total
swh scanner scan /tmp/scan_test --format sunburst  # crashes again

requested changes
resolved bug due to wrong percentage calculation

ids of sunburst must be unique

Build is green

Patch application report for D2863 (id=10451)

Rebasing onto 4e96d30076...

First, rewinding head to replay your work on top of it...
Applying: model: get information about directories
Applying: changed structure of temp_folder fixture
Applying: new visualization format: sunburst chart
Changes applied before test
commit 980f13f1701072e7180d7da1175d01f1e6128781
Author: Daniele Serafini <danseraf@softwareheritage.org>
Date:   Fri Mar 20 18:58:06 2020 +0100

    new visualization format: sunburst chart
    
    generate a sunburst chart showing for each directory how many
    files are discovered
    
    plot test: creation of dataframe from directories; creation of
    hierarchical dataframe

commit c032c09f3b1927f376368b82de566515cd28fbd5
Author: Daniele Serafini <danseraf@softwareheritage.org>
Date:   Fri Mar 20 18:48:46 2020 +0100

    changed structure of temp_folder fixture

commit 34f814276d59c2a9c039563af1d44aa8d3d24262
Author: Daniele Serafini <danseraf@softwareheritage.org>
Date:   Fri Mar 20 18:45:55 2020 +0100

    model: get information about directories
    
    for each directory store information about how many contents are
    presents and discovered

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/3/ for more details.

vlorentz requested changes to this revision.Apr 1 2020, 3:36 PM

Could you add tests that check this bug doesn't happen again?

This revision now requires changes to proceed.Apr 1 2020, 3:36 PM

Could you add tests that check this bug doesn't happen again?

If i want to make a test i have to raise an exception when the labels (hierarchical_df['id'] in this case) used to generate the sunburst, contains equal values; so i should check every value in labels.
Since now i'm storing all the directory path i think this bug can't happen again.

Since now i'm storing all the directory path i think this bug can't happen again.

With the current code, it can't, indeed. But this kind of test is called a regression test, as it catches future mistakes in the code.

If i want to make a test i have to raise an exception when the labels (hierarchical_df['id'] in this case) used to generate the sunburst, contains equal values; so i should check every value in labels.

Indeed, there isn't much point in that.

This revision is now accepted and ready to land.Apr 2 2020, 11:02 AM

Build is green

Patch application report for D2863 (id=10484)

Rebasing onto 4e96d30076...

Current branch diff-target is up to date.
Changes applied before test
commit a618bf17c0492f427174abc43e5758f4501cbc9a
Author: Daniele Serafini <danseraf@softwareheritage.org>
Date:   Fri Mar 20 18:58:06 2020 +0100

    new visualization format: sunburst chart
    
    generate a sunburst chart showing for each directory how many
    files are present and the percentage of files known
    
    plot test: creation of direcories dataframe; creation of
    hierarchical dataframe

commit e4bb51b7d8949c24992a07e2477445b5e578fb40
Author: Daniele Serafini <danseraf@softwareheritage.org>
Date:   Fri Mar 20 18:48:46 2020 +0100

    changed structure of temp_folder fixture

commit a49c4e5ec053f77e974f65ca25484b1fbe762d00
Author: Daniele Serafini <danseraf@softwareheritage.org>
Date:   Fri Mar 20 18:45:55 2020 +0100

    model: get information about directories
    
    for each directory store information about how many contents are
    presents and discovered

See https://jenkins.softwareheritage.org/job/DTSCN/job/tests-on-diff/4/ for more details.