
scanner: json output should return both known and unknown files/dirs
Closed, Migrated

Description

The JSON output of the scanner currently looks like this:

$ swh scanner scan -f json .
{
    ".HEADER": "swh:1:cnt:fd8430bc864cfcd5f10e5590f8a447e01b942bfe",
    ".editorconfig": "swh:1:cnt:34c5e9234ec18c69a16828dbc9633a95f0253fe9",
    ".gitattributes": "swh:1:cnt:176a458f94e0ea5272ce67c36bf30b6be9caf623",
    ".github": "swh:1:dir:e8bfe5af39579a7e4898bb23f3a76a72c368cee6",
    ".gitignore": "swh:1:cnt:dec3dca06c8fdc1dd7d426bb148b7f99355eaaed",
    ...
    "src": "swh:1:dir:f3c5e67df5a3b3e812e6331008b7e179865a30fc",
    "tests": "swh:1:dir:506e33bae73858bdf4b90a8f89dee8a32dae9c93"
}

It looks like the semantics are to return only the known files/dirs and to omit the unknown ones.
That is hard to exploit programmatically: from the JSON output alone, one cannot tell what is missing.

The output format should be changed to always output all encountered files/dirs, each with an associated boolean "known" flag.
Also remember that other fields will need to be associated with each encountered file/dir in the future, so we need to leave room (e.g., other keys at the same level as known) to attach more information later.

Event Timeline

zack triaged this task as Normal priority. Apr 15 2020, 1:45 PM
zack created this task.

The new JSON output will look like the following:

$ swh scanner scan -f json /tmp/test
{
    "dir1": {
        "children": {
            "subdir1": {
                "children": {
                    "text.txt": {
                        "known": true,
                        "swhid": "swh:1:cnt:ff5b57b7095eb5d168a36db6552ad2ce1f219bf6"
                    }
                },
                "known": false,
                "swhid": "swh:1:dir:2186fc616a69c749f2c2cad4a581d41d74341cfd"
            }
        },
        "known": false,
        "swhid": "swh:1:dir:5545ec2b4883272ef8c7dec28e0fee4ed96226cf"
    },
    "dir2": {
        "children": {
            "file.tac": {
                "known": false,
                "swhid": "swh:1:cnt:b39cd318115984b9d956fa985d4e33e33254fd85"
            }
        },
        "known": false,
        "swhid": "swh:1:dir:9f2c14e1b009f5f3348a231414dc6074402848bf"
    },
    "file1.py": {
        "known": false,
        "swhid": "swh:1:cnt:c5497b0e874da5c753320531bce81917ebfb6f8d"
    },
    "file2.py": {
        "known": false,
        "swhid": "swh:1:cnt:fed33a30f23eee5a08b41ffc062f0903c4834121"
    }
}

Each path will have its known status and its Software Heritage persistent identifier (SWHID).
If the path is a directory, it will also have a children dictionary holding the contents/directories it contains.
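
To illustrate how a consumer might read this nested format, here is a minimal Python sketch. The walk helper is hypothetical (it is not part of swh-scanner), and the sample is a subset of the output above. It recurses through each children dictionary to rebuild full paths, then lists the unknown ones.

```python
import json
from typing import Any, Dict, Iterator, Tuple

# Subset of the nested output shown above.
SAMPLE = """
{
    "dir1": {
        "children": {
            "subdir1": {
                "children": {
                    "text.txt": {
                        "known": true,
                        "swhid": "swh:1:cnt:ff5b57b7095eb5d168a36db6552ad2ce1f219bf6"
                    }
                },
                "known": false,
                "swhid": "swh:1:dir:2186fc616a69c749f2c2cad4a581d41d74341cfd"
            }
        },
        "known": false,
        "swhid": "swh:1:dir:5545ec2b4883272ef8c7dec28e0fee4ed96226cf"
    },
    "dir2": {
        "children": {
            "file.tac": {
                "known": false,
                "swhid": "swh:1:cnt:b39cd318115984b9d956fa985d4e33e33254fd85"
            }
        },
        "known": false,
        "swhid": "swh:1:dir:9f2c14e1b009f5f3348a231414dc6074402848bf"
    }
}
"""


def walk(tree: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, bool, str]]:
    """Yield (path, known, swhid) for every entry of the nested output.

    Hypothetical helper, written only to illustrate the format.
    """
    for name, entry in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        yield path, entry["known"], entry["swhid"]
        # Only directories carry a "children" dictionary.
        yield from walk(entry.get("children", {}), path)


sample = json.loads(SAMPLE)
unknown = [path for path, known, _ in walk(sample) if not known]
print(unknown)
```

Note that the consumer has to special-case the children key at every level, since it sits next to known and swhid.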

Currently, when scanning a source project, if a directory is known, its contents and subdirectories are not saved in the model. Should I save them?

Thanks @DanSeraf for this proposal. The problem I see is that, to leave room for known/swhid (and other keys in the future) while keeping the natural directory nesting, we need the extra children indirection. Upon reflection, I find that ugly and hard to actually exploit. So I'm circling back and proposing an alternative format with no nesting: all paths are expanded into a flat top-level dictionary, with subkeys for known/unknown, swhids, etc.
It will look like this (first few entries of a Linux kernel tree):

{
    ".": {
        "known": false,
        "swhid": "..."
    },
    "tools": {
        "known": false,
        "swhid": "..."
    },
    "tools/time": {
        "known": true,
        "swhid": "..."
    },
    "tools/time/udelay_test.sh": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi": {
        "known": false,
        "swhid": "..."
    },
    "tools/spi/spidev_fdx.c": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi/Build": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi/spidev_test.c": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi/.gitignore": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi/Makefile": {
        "known": true,
        "swhid": "..."
    }
}

I think this format would be much easier to exploit and would offer a more compact structure. Path names are inflated, since parent paths are repeated, but I think the overhead would be negligible both in absolute terms (on the Linux kernel the output of find . is ~2.5 MB) and in comparison to the JSON overhead of the previous structure.
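
As a rough illustration of the "easier to exploit" point, here is a minimal Python sketch that consumes the flat format. The sample reuses a few of the entries above, with the SWHIDs left elided as in the example. The entries_under helper is hypothetical and does not handle the root "." entry.

```python
import json
from typing import Any, Dict

# A few entries from the flat example above; SWHIDs are elided there too.
flat = json.loads("""
{
    ".": {"known": false, "swhid": "..."},
    "tools": {"known": false, "swhid": "..."},
    "tools/time": {"known": true, "swhid": "..."},
    "tools/time/udelay_test.sh": {"known": true, "swhid": "..."},
    "tools/spi": {"known": false, "swhid": "..."},
    "tools/spi/Build": {"known": true, "swhid": "..."}
}
""")

# Listing unknown paths is a single pass over the top-level dictionary.
unknown = [path for path, entry in flat.items() if not entry["known"]]


def entries_under(flat: Dict[str, Any], directory: str) -> Dict[str, Any]:
    """Return the entries strictly below `directory` (hypothetical helper).

    The root entry "." is not handled here.
    """
    prefix = directory.rstrip("/") + "/"
    return {path: entry for path, entry in flat.items() if path.startswith(prefix)}


spi = entries_under(flat, "tools/spi")
print(unknown)
print(sorted(spi))
```

No recursion is needed, and new keys can be added next to known and swhid without affecting consumers that ignore them.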

What do you think?

Just jumping in: I suggest using ndjson (newline-delimited JSON) instead of a full JSON tree, since it is easier to stream and parse incrementally for large outputs (such as the Linux kernel).
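
A minimal Python sketch of what that could look like, using only the standard library. The record shape is an assumption: each line is one JSON object with the path moved into a "path" key. Both helpers are hypothetical; neither the shape nor the helpers come from swh-scanner.

```python
import io
import json
from typing import Any, Dict, IO, Iterator


def dump_ndjson(flat: Dict[str, Dict[str, Any]], out: IO[str]) -> None:
    """Write one JSON object per line (assumed record shape: path + fields)."""
    for path, entry in flat.items():
        out.write(json.dumps({"path": path, **entry}) + "\n")


def iter_ndjson(stream: IO[str]) -> Iterator[Dict[str, Any]]:
    """Parse records one line at a time, without loading the whole output."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)


# Two entries in the flat format proposed above (SWHIDs elided as there).
flat = {
    "tools": {"known": False, "swhid": "..."},
    "tools/time": {"known": True, "swhid": "..."},
}

buf = io.StringIO()  # stands in for stdout or a pipe
dump_ndjson(flat, buf)
buf.seek(0)

unknown = [rec["path"] for rec in iter_ndjson(buf) if not rec["known"]]
print(unknown)
```

Each record can be processed as soon as its line arrives, so a consumer never needs to hold the whole tree in memory.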