scanner: json output should return both known and unknown files/dirs
Closed, Migrated

Authored By

	zack
	Apr 15 2020, 1:45 PM

Description

The json format output of the scanner returns something like this:

$ swh scanner scan -f json .
{
    ".HEADER": "swh:1:cnt:fd8430bc864cfcd5f10e5590f8a447e01b942bfe",
    ".editorconfig": "swh:1:cnt:34c5e9234ec18c69a16828dbc9633a95f0253fe9",
    ".gitattributes": "swh:1:cnt:176a458f94e0ea5272ce67c36bf30b6be9caf623",
    ".github": "swh:1:dir:e8bfe5af39579a7e4898bb23f3a76a72c368cee6",
    ".gitignore": "swh:1:cnt:dec3dca06c8fdc1dd7d426bb148b7f99355eaaed",
    ...
    "src": "swh:1:dir:f3c5e67df5a3b3e812e6331008b7e179865a30fc",
    "tests": "swh:1:dir:506e33bae73858bdf4b90a8f89dee8a32dae9c93"
}

The semantics appear to be that only known files/dirs are returned, while unknown ones are omitted.
That is hard to exploit programmatically: from the json output alone, one cannot tell what is missing.

The output format should be changed to always output all encountered files/dirs, each with an associated known: boolean flag.
Also remember that in the future other fields will need to be associated with each encountered file/dir, so we need to leave room (e.g., other keys at the same level as known) to attach further information.

Revisions and Commits

rDTSCN Code scanner
	D3085	rDTSCN623a9dbe6157 ndjson output format

Related Objects

Mentioned In: D3085: scanner: ndjson output format
D3069: scanner: json output format

Event Timeline

zack triaged this task as Normal priority. Apr 15 2020, 1:45 PM

zack created this task.

zack updated the task description. Apr 15 2020, 2:07 PM

The new json output will be like the following:

$ swh scanner scan -f json /tmp/test
{
    "dir1": {
        "children": {
            "subdir1": {
                "children": {
                    "text.txt": {
                        "known": true,
                        "swhid": "swh:1:cnt:ff5b57b7095eb5d168a36db6552ad2ce1f219bf6"
                    }
                },
                "known": false,
                "swhid": "swh:1:dir:2186fc616a69c749f2c2cad4a581d41d74341cfd"
            }
        },
        "known": false,
        "swhid": "swh:1:dir:5545ec2b4883272ef8c7dec28e0fee4ed96226cf"
    },
    "dir2": {
        "children": {
            "file.tac": {
                "known": false,
                "swhid": "swh:1:cnt:b39cd318115984b9d956fa985d4e33e33254fd85"
            }
        },
        "known": false,
        "swhid": "swh:1:dir:9f2c14e1b009f5f3348a231414dc6074402848bf"
    },
    "file1.py": {
        "known": false,
        "swhid": "swh:1:cnt:c5497b0e874da5c753320531bce81917ebfb6f8d"
    },
    "file2.py": {
        "known": false,
        "swhid": "swh:1:cnt:fed33a30f23eee5a08b41ffc062f0903c4834121"
    }
}

Each path will have its known status and its Software Heritage persistent identifier (swhid).
If the path is a directory, it will also have a children dictionary with the contents/directories it contains.
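
As an illustration of how a consumer could exploit this nested format, here is a minimal Python sketch that walks the tree recursively and collects the unknown paths. The helper name iter_paths and the trimmed sample data are hypothetical, not part of the scanner; the sketch only assumes the known/swhid/children keys shown above.

```python
import json


def iter_paths(tree, prefix=""):
    """Yield (path, known, swhid) for every entry of the nested output."""
    for name, entry in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        yield path, entry["known"], entry["swhid"]
        # Only directories carry a "children" dictionary.
        yield from iter_paths(entry.get("children", {}), path)


# Trimmed-down copy of the example output above (hypothetical test data).
sample = json.loads("""
{
    "dir1": {
        "children": {
            "subdir1": {
                "children": {
                    "text.txt": {
                        "known": true,
                        "swhid": "swh:1:cnt:ff5b57b7095eb5d168a36db6552ad2ce1f219bf6"
                    }
                },
                "known": false,
                "swhid": "swh:1:dir:2186fc616a69c749f2c2cad4a581d41d74341cfd"
            }
        },
        "known": false,
        "swhid": "swh:1:dir:5545ec2b4883272ef8c7dec28e0fee4ed96226cf"
    },
    "file1.py": {
        "known": false,
        "swhid": "swh:1:cnt:c5497b0e874da5c753320531bce81917ebfb6f8d"
    }
}
""")

unknown = [path for path, known, _ in iter_paths(sample) if not known]
print(unknown)
```

Note that any consumer has to special-case the children key at every level, since it sits alongside known and swhid.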

Currently, when scanning a source project, if a directory is 'known', its contents/directories are not saved in the model. Should I save them?

In T2363#43710, @DanSeraf wrote:

$ swh scanner scan -f json /tmp/test
{
    "dir1": {
        "children": {
            "subdir1": {
                "children": {
                    "text.txt": {
                        "known": true,
                        "swhid": "swh:1:cnt:ff5b57b7095eb5d168a36db6552ad2ce1f219bf6"
                    }

Thanks @DanSeraf for this proposal. I see the problem: to have room for known/swhid and, in the future, other keys, while keeping the natural directory nesting, we need to add the extra children indirection. Upon reflection, I find that ugly and hard to actually exploit. So I'm circling back and proposing an alternative format with no nesting, in which all paths are expanded into a flat top-level dictionary, with subkeys for known/unknown, swhids, etc.
It will look like this (first few entries of a Linux kernel tree):

{
    ".": {
        "known": false,
        "swhid": "..."
    },
    "tools": {
        "known": false,
        "swhid": "..."
    },
    "tools/time": {
        "known": true,
        "swhid": "..."
    },
    "tools/time/udelay_test.sh": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi": {
        "known": false,
        "swhid": "..."
    },
    "tools/spi/spidev_fdx.c": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi/Build": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi/spidev_test.c": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi/.gitignore": {
        "known": true,
        "swhid": "..."
    },
    "tools/spi/Makefile": {
        "known": true,
        "swhid": "..."
    }
}

I think this format would be much easier to exploit and would offer a more compact structure. Path names are inflated in size (as parent paths are repeated), but I think that would be negligible both in absolute terms (on the Linux kernel the output of find . is ~2.5 MB) and in comparison to the json overhead of the previous structure.
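
To make the "easier to exploit" point concrete, here is a minimal Python sketch of two queries against the flat format. The variable names and the sample data are hypothetical (swhids are left elided as in the example above); the only assumptions are the "." root entry and the known/swhid subkeys.

```python
import json
from pathlib import PurePosixPath

# A few entries in the proposed flat format (swhids elided, as above).
flat = json.loads("""
{
    ".": {"known": false, "swhid": "..."},
    "tools": {"known": false, "swhid": "..."},
    "tools/time": {"known": true, "swhid": "..."},
    "tools/time/udelay_test.sh": {"known": true, "swhid": "..."},
    "tools/spi": {"known": false, "swhid": "..."},
    "tools/spi/Build": {"known": true, "swhid": "..."}
}
""")

# Listing unknown paths is a single pass over the top-level dictionary.
unknown = sorted(path for path, entry in flat.items() if not entry["known"])

# Known entries whose parent is unknown, i.e. the top-most known paths.
# The parent of a top-level path such as "tools" is the root entry ".".
topmost_known = sorted(
    path
    for path, entry in flat.items()
    if entry["known"]
    and path != "."
    and not flat[str(PurePosixPath(path).parent)]["known"]
)

print(unknown)
print(topmost_known)
```

Both queries are plain dictionary lookups, with no recursion and no special-cased children key.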

What do you think?

Just jumping in: I suggest using ndjson (newline-delimited json) instead of a full json tree, as the former is easier to stream and parse incrementally for large outputs (like the Linux kernel).
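
A minimal Python sketch of the incremental parsing this would allow. The record layout is an assumption made for illustration (one json object per line, mapping a single path to its known/swhid entry); this thread does not specify the layout the scanner actually emits, and iter_ndjson is a hypothetical helper.

```python
import io
import json


def iter_ndjson(stream):
    """Yield (path, entry) pairs from a newline-delimited json stream.

    ASSUMPTION: each line is a json object of the form
    {"<path>": {"known": <bool>, "swhid": "<swhid>"}}.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for path, entry in record.items():
            yield path, entry


# Stand-in for the scanner output; in practice this could be sys.stdin.
sample = io.StringIO(
    '{"tools": {"known": false, "swhid": "..."}}\n'
    '{"tools/time": {"known": true, "swhid": "..."}}\n'
    '{"tools/spi": {"known": false, "swhid": "..."}}\n'
)

# Each line is decoded on its own, so the full output never has to be
# held in memory as one json document.
unknown = [path for path, entry in iter_ndjson(sample) if not entry["known"]]
print(unknown)
```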

DanSeraf mentioned this in D3069: scanner: json output format. Apr 27 2020, 11:25 AM

DanSeraf mentioned this in D3085: scanner: ndjson output format. Apr 29 2020, 1:17 PM

DanSeraf closed this task as Resolved by committing rDTSCN623a9dbe6157: ndjson output format. Apr 29 2020, 4:40 PM

DanSeraf added a commit: rDTSCN623a9dbe6157: ndjson output format.

This task has been migrated to GitLab.
