# Provenance Index Test Dataset
This directory contains the datasets used by the `test_provenance_heuristics` tests of
the provenance index database.
Each dataset `xxx` consists of several parts:
- a description of a git repository, as a yaml file named `xxx_repo.yaml`,
- a msgpack file containing the storage objects for that repository, from
  which the storage is filled before each test, and
- a set of synthetic files, named `synthetic_xxx_(lower|upper)_<mindepth>.txt`,
  describing the expected content of the provenance database when the
  repository is ingested with the flag `lower` set or not set, and with the
  given `mindepth` value (an integer, most often `1` or `2`).
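The naming scheme of the synthetic files can be decoded mechanically. The following is a minimal sketch, not code from the test suite: `parse_synthetic_name` and `SyntheticName` are hypothetical names, and it assumes that `upper` in the file name corresponds to the `lower` flag not being set.

```python
import re
from typing import NamedTuple

# Matches names such as "synthetic_repo2_lower_1.txt".
# Assumption: the dataset name may itself contain underscores, so it is
# matched greedily up to the final "_(lower|upper)_<mindepth>.txt" suffix.
SYNTHETIC_RE = re.compile(
    r"^synthetic_(?P<dataset>.+)_(?P<flag>lower|upper)_(?P<mindepth>\d+)\.txt$"
)


class SyntheticName(NamedTuple):
    dataset: str
    lower: bool      # True for "lower", False for "upper" (assumed meaning)
    mindepth: int


def parse_synthetic_name(filename: str) -> SyntheticName:
    """Split a synthetic file name into dataset, lower flag and mindepth."""
    match = SYNTHETIC_RE.match(filename)
    if match is None:
        raise ValueError(f"not a synthetic file name: {filename!r}")
    return SyntheticName(
        dataset=match["dataset"],
        lower=match["flag"] == "lower",
        mindepth=int(match["mindepth"]),
    )


print(parse_synthetic_name("synthetic_repo2_lower_1.txt"))
```

A name that does not follow the scheme, such as a template file called `synthetic_repo2_template.txt`, raises `ValueError` in this sketch.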
## Git repos description file
The description of a git repository is a yaml file containing a list of dicts,
each one representing a git revision to add (linearly) to the git repo used as
a base for the dataset. Each dict has a structure like:
``` yaml
- msg: R00
  date: 1000000000
  content:
    A/B/C/a: "content a"
```
This example generates a git commit with the commit message "R00", the
author and committer dates set to 1000000000 (given as a unix timestamp), and
a single file whose path is `A/B/C/a` and whose content is "content a".
The file is parsed to create git revisions in a temporary git repository, in
order of appearance in the yaml file (so one may create a git repository with
'out-of-order' commits).
There is currently no way to create branches or merges.
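Since revisions are created in file order rather than date order, it can help to see which entries are 'out-of-order' before writing the expected results. The sketch below is illustrative only and is not part of `generate_repo.py`: `check_description` is a hypothetical helper, it assumes the yaml file has already been loaded into Python objects (for instance with PyYAML's `yaml.safe_load`), and it assumes every entry carries the `msg`, `date` and `content` keys shown in the example.

```python
from typing import Any


def check_description(revisions: list[dict[str, Any]]) -> list[str]:
    """Return the messages of revisions dated earlier than the entry before them."""
    out_of_order = []
    previous_date = None
    for rev in revisions:
        # Assumption: all three keys are mandatory in every entry.
        missing = {"msg", "date", "content"} - rev.keys()
        if missing:
            raise ValueError(
                f"revision {rev.get('msg')!r} lacks keys: {sorted(missing)}"
            )
        if previous_date is not None and rev["date"] < previous_date:
            out_of_order.append(rev["msg"])
        previous_date = rev["date"]
    return out_of_order


# Same shape as the loaded yaml example, plus a hypothetical second
# revision dated before the first one.
revisions = [
    {"msg": "R00", "date": 1000000000, "content": {"A/B/C/a": "content a"}},
    {"msg": "R01", "date": 999999990, "content": {"A/B/C/a": "content a"}},
]
print(check_description(revisions))
```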
The tool to generate this git repo is `generate_repo.py`:
```
python generate_repo.py --help
Usage: generate_repo.py [OPTIONS] INPUT_FILE OUTPUT_DIR

Options:
  -C, --clean-output / --no-clean-output
  --help                          Show this message and exit.
```
It generates a git repository in `OUTPUT_DIR` and writes a template
`synthetic` file to its standard output, which makes it easier to write the
expected `synthetic` files.
Typical usage will be:
```
python generate_repo.py repo2_repo.yaml repo2 > synthetic_repo2_template.txt
```
Note that the hashes of the git objects (revisions, directories and contents)
depend only on the content of the input yaml file. Calling the tool twice on
the same input file should generate the exact same git repo twice.
Also note that the tool adds a branch at each revision (using the commit
message as branch name), to make it easier to reference any point in the git
history.
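The determinism noted above follows from git object identifiers being hashes of the object's bytes. As a minimal illustration for the simplest object type, the sketch below computes the SHA-1 that git assigns to a blob; `git_blob_id` is a hypothetical helper name. Revision and directory hashes cover more data (tree entries, author, committer, dates, message) and are not reproduced here, and the exact bytes `generate_repo.py` writes for each file (for example, whether a trailing newline is added) are not specified in this README, so the printed value is not claimed to match the hashes listed in the synthetic files.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Compute the SHA-1 git assigns to a blob holding exactly `data`."""
    # A blob is hashed as: "blob <size in bytes>\0" followed by the content.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The identifier is a pure function of the bytes, so repeated calls agree.
first = git_blob_id(b"content a")
second = git_blob_id(b"content a")
print(first)
print(first == second)
```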
## Msgpack dump of the storage
This file contains a set of storage objects (`Revision`, `Content` and
`Directory`) and is usually generated from a local git repository (typically
the one generated by the previous command) using the
`generate_storage_from_git.py` tool:
```
python generate_storage_from_git.py --help
Usage: generate_storage_from_git.py [OPTIONS] GIT_REPO

  simple tool to generate the CMDBTS.msgpack dataset filed used in tests

Options:
  -r, --head TEXT    head revision to start from
  -o, --output TEXT  output file
  --help             Show this message and exit.
```
Typical usage would be, using the git repository `repo2` created previously:
```
python generate_storage_from_git.py repo2
Revision hash for master is 8363e8e98751dc9f264d2fedd6b829ad4b1218b0
Wrote 86 objects in repo2.msgpack
```
### Adding extra visits/snapshots
It is also possible to generate a storage from a git repo with extra origin
visits, using the `--visit` option of the `generate_storage_from_git` tool.
This option expects a yaml file as argument. This file describes the extra
visits (and snapshots) you want to add to the storage.
The format is simple, for example:
```
# a visit pattern scenario for the 'repo_with_merges' repo
- origin: http://repo_with_merges/1/
  date: 1000000015
  branches:
    - R01
```
This will create an `OriginVisit` (at the given date) for the given origin URL
(the `Origin` will be created as well), with a `Snapshot` including the listed
branches.
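When the repository comes from `generate_repo.py`, branch names are the commit messages, so a visit description can be cross-checked against the repo description before generating the storage. The sketch below is only an illustration under that assumption: `unknown_branches` is a hypothetical helper, not an option of `generate_storage_from_git.py`, and both inputs are assumed to be the already-loaded yaml contents.

```python
def unknown_branches(visits, revisions):
    """Map each origin URL to the listed branches that match no commit message."""
    # Assumption: every branch in the repo is named after a commit message.
    known = {rev["msg"] for rev in revisions}
    problems = {}
    for visit in visits:
        missing = [name for name in visit["branches"] if name not in known]
        if missing:
            problems.setdefault(visit["origin"], []).extend(missing)
    return problems


revisions = [{"msg": "R00"}, {"msg": "R01"}]
visits = [
    {"origin": "http://repo_with_merges/1/", "date": 1000000015, "branches": ["R01"]},
    # Hypothetical entry referring to a branch that does not exist.
    {"origin": "http://repo_with_merges/2/", "date": 1000000020, "branches": ["R02"]},
]
print(unknown_branches(visits, revisions))
```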
## Synthetic files
These files describe the expected content of the provenance database for each
revision (in order of ingestion).
The `generate_repo.py` tool produces a template synthetic file like:
```
1000000000.0 b582a17b3fc37f72fc57877616f85c3f0abed064 R00
R00 |       |         | R b582a17b3fc37f72fc57877616f85c3f0abed064 | 1000000000.0
    |       | .       | D a4cb5e6b2831f7e8eef0e6e08e43d642c97303a1 | 0.0
    |       | A       | D 1c8d9fd9afa7e5a2cf52a3db6f05dc5c3a1ca86b | 0.0
    |       | A/B     | D 36876d475197b5ad86ad592e8e28818171455f16 | 0.0
    |       | A/B/C   | D 98f7a4a23d8df1fb1a5055facae2aff9b2d0a8b3 | 0.0
    |       | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0.0
1000000010.0 8259eeae2ff5046f0bb4393d6e894fe6d7e01bfe R01
R01 |       |         | R 8259eeae2ff5046f0bb4393d6e894fe6d7e01bfe | 1000000010.0
    |       | .       | D b3cf11b22c9f93c3c494cf90ab072f394155072d | 0.0
    |       | A       | D baca735bf8b8720131b4bfdb47c51631a9260348 | 0.0
    |       | A/B     | D 4b28979d88ed209a09c272bcc80f69d9b18339c2 | 0.0
    |       | A/B/C   | D c9cabe7f49012e3fdef6ac6b929efb5654f583cf | 0.0
    |       | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0.0
    |       | A/B/C/b | C 50e9cdb03f9719261dd39d7f2920b906db3711a3 | 0.0
[...]
```
In this template, all the contents and directories of each revision are
listed; it is then the user's responsibility to write the expected synthetic
file for a given heuristics configuration. For example, for the
`(lower=True, mindepth=1)` case, the two revisions above are to be adapted as:
```
1000000000 c0d8929936631ecbcf9147be6b8aa13b13b014e4 R00
R00 |       |         | R c0d8929936631ecbcf9147be6b8aa13b13b014e4 | 1000000000
    | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0
1000000010 1444db96cbd8cd791abe83527becee73d3c64e86 R01
R01 |       |         | R 1444db96cbd8cd791abe83527becee73d3c64e86 | 1000000010
    | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | -10
    | R---C | A/B/C/b | C 50e9cdb03f9719261dd39d7f2920b906db3711a3 | 0
```
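The pipe-separated rows of a synthetic file can be read with a few lines of Python. This is a sketch under assumptions, not the parser used by the tests: `parse_row` and the field names of `SyntheticRow` are hypothetical, the letters `R`, `D` and `C` are taken to stand for revision, directory and content, and surrounding whitespace is assumed to be insignificant.

```python
from typing import NamedTuple, Optional


class SyntheticRow(NamedTuple):
    revision: str     # commit message, only set on the row for the revision
    relation: str     # e.g. "R---C"; empty on revision rows and in the template
    path: str
    kind: str         # "R", "D" or "C" (assumed: revision, directory, content)
    sha1: str
    timestamp: float


def parse_row(line: str) -> Optional[SyntheticRow]:
    """Parse one pipe-separated row; return None for any other line."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        return None  # "<date> <sha1> <msg>" header lines, "[...]", blanks
    revision, relation, path, obj, timestamp = fields
    kind, sha1 = obj.split()
    return SyntheticRow(revision, relation, path, kind, sha1, float(timestamp))


sample = """\
1000000010 1444db96cbd8cd791abe83527becee73d3c64e86 R01
R01 | | | R 1444db96cbd8cd791abe83527becee73d3c64e86 | 1000000010
 | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | -10
 | R---C | A/B/C/b | C 50e9cdb03f9719261dd39d7f2920b906db3711a3 | 0
"""
rows = [row for row in map(parse_row, sample.splitlines()) if row is not None]
for row in rows:
    print(row.kind, row.path or "-", row.timestamp)
```

Returning `None` for non-row lines keeps the header lines and the `[...]` elision from interrupting a loop over a whole file.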
