
misleading 100% known summary in sunburst rendering
Closed, Migrated

Description

Consider the following scenario:

$ swh scanner scan -x '*.git' -f ndjson scikit-learn | head -n 1
{".": {"swhid": "swh:1:dir:1de41371de86ff66c85271ac410097531372b6d1", "known": true}}
$ echo foo > scikit-learn/foo.txt
$ swh scanner scan -x '*.git' -f ndjson scikit-learn | head -n 1
{".": {"swhid": "swh:1:dir:2699c6331bc22d604e524184ce2dd4340e3a1107", "known": false}}
$ swh scanner scan -x '*.git' -f sunburst scikit-learn | head -n 1

The scikit-learn checkout we initially scan is an archived commit, completely known to the archive. We then add one file (foo.txt) whose content is also known to the archive, but whose presence makes the top-level directory unknown: the directory contains only known objects, yet it is itself, as a directory, not in the archive. Scanning works correctly, as the ndjson output shows, but the sunburst rendering reports "100.0%" as the percentage of known content for the root directory, which is misleading.

I believe this is because the 100% is computed only in terms of the number of files known/unknown.

One possible fix is counting in terms of nodes, which would include both files and directories, making the total lower than 100% in cases like this.
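To make the difference concrete, here is a minimal sketch of the two counting strategies. It is not the swh-scanner code: the `Node` class, `iter_nodes`, and `known_percentage` are hypothetical names, and the tiny tree only mimics the scenario above (a known subtree, a known extra file, and an unknown root directory).

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical scan result node; NOT the swh-scanner data model."""
    name: str
    known: bool
    is_dir: bool = False
    children: List["Node"] = field(default_factory=list)


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield the node itself and then all of its descendants."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def known_percentage(root: Node, count_dirs: bool) -> float:
    """Percentage of known nodes, optionally counting directories too."""
    nodes = [n for n in iter_nodes(root) if count_dirs or not n.is_dir]
    if not nodes:
        return 0.0
    return 100.0 * sum(n.known for n in nodes) / len(nodes)


# Root directory is unknown, everything inside it is known.
root = Node(".", known=False, is_dir=True, children=[
    Node("src", known=True, is_dir=True, children=[
        Node("a.py", known=True),
        Node("b.py", known=True),
    ]),
    Node("foo.txt", known=True),
])

files_only = known_percentage(root, count_dirs=False)  # 3 of 3 files
all_nodes = known_percentage(root, count_dirs=True)    # 4 of 5 nodes
print(files_only, all_nodes)
```

Counting files only reports 100.0 for this tree, while counting all nodes reports 80.0, because the unknown root directory now weighs in.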

Event Timeline

zack triaged this task as Low priority. Nov 29 2021, 1:10 PM
zack created this task.

I've tried replacing the content of foo.txt with something unknown to the archive (random garbage), and the sunburst rendering still shows 100.0%.
So the cause could be a rounding error instead.
Either way, the output is misleading and should be fixed.
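
If rounding is the cause, the effect is easy to reproduce in isolation. The sketch below assumes a hypothetical tree of 3000 files with a single unknown one (not the real scikit-learn file count) and hypothetical helper names. It compares one-decimal rounding with a truncating variant that can only print 100.0% when every file is known.

```python
def rounded_label(known: int, total: int) -> str:
    """One-decimal rounding, as a plain format string would do it."""
    return f"{100.0 * known / total:.1f}%"


def truncated_label(known: int, total: int) -> str:
    """Truncate to one decimal using integer arithmetic only."""
    if total == 0:
        return "0.0%"
    tenths = (1000 * known) // total  # percentage in tenths, rounded down
    return f"{tenths // 10}.{tenths % 10}%"


# Hypothetical numbers: 3000 files, exactly one of them unknown.
print(rounded_label(2999, 3000))    # rounds up to 100.0%
print(truncated_label(2999, 3000))  # stays below 100%
print(truncated_label(3000, 3000))  # only a fully known tree shows 100.0%
```

With one-decimal rounding, any tree where at most 0.05% of the files are unknown is displayed as 100.0%, so truncating (or special-casing the "not everything is known" case) would avoid the misleading label.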