swh/provenance/tests/data/README.md
- This file was added.
# Provenance Index Test Dataset

This directory contains datasets used by the `test_provenance_heuristics` tests of
the provenance index database.

Each dataset `xxx` consists of several parts:

- a description of a git repository as a yaml file named `xxx_repo.yaml`,
- a msgpack file containing storage objects for the given repository, from
  which the storage is filled before each test, and
- a set of synthetic files, named `synthetic_xxx_(lower|upper)_<mindepth>.txt`,
  describing the expected content of the provenance database when the
  repository is ingested with the flag `lower` set or not set, and with the
  given `mindepth` value (an integer, most often `1` or `2`).
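
The naming scheme of the synthetic files can be decoded mechanically. The sketch below is only an illustration: `parse_synthetic_name` and `SyntheticName` are hypothetical names that are not part of the test suite, and it assumes that `upper` in a file name means the `lower` flag is not set.

```python
import re
from typing import NamedTuple, Optional

# Matches names such as "synthetic_repo2_lower_1.txt".
SYNTHETIC_RE = re.compile(
    r"^synthetic_(?P<dataset>.+)_(?P<flag>lower|upper)_(?P<mindepth>\d+)\.txt$"
)


class SyntheticName(NamedTuple):
    dataset: str   # the `xxx` part of the file name
    lower: bool    # assumption: "upper" means the `lower` flag is not set
    mindepth: int


def parse_synthetic_name(filename: str) -> Optional[SyntheticName]:
    """Split a synthetic file name into its parts, or return None if it does not match."""
    m = SYNTHETIC_RE.match(filename)
    if m is None:
        return None
    return SyntheticName(m["dataset"], m["flag"] == "lower", int(m["mindepth"]))
```

With this pattern, a name like `synthetic_repo2_template.txt` is rejected, since it carries neither a `lower`/`upper` marker nor a `mindepth` value.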
## Git repos description file

The description of a git repository is a yaml file which contains a list of dicts,
each one representing a git revision to add (linearly) to the git repo used as a
base for the dataset. Each dict has a structure like:
``` yaml
- msg: R00
  date: 1000000000
  content:
    A/B/C/a: "content a"
```
This example will generate a git commit with the commit message "R00", the
author and committer date 1000000000 (given as a unix timestamp), and a single
file whose path is `A/B/C/a` and whose content is "content a".

The file is parsed to create git revisions in a temporary git repository, in
order of appearance in the yaml file (so one may create a git repository with
'out-of-order' commits).
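
To make the 'out-of-order' remark concrete, here is a small sketch that spots such revisions in a description. It is not taken from `generate_repo.py`: the list is written as a Python literal with the same shape as the yaml example (to avoid depending on a yaml library), the `R01` and `R02` entries are invented for illustration, and `out_of_order` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Same shape as the yaml description: a list of dicts with msg, date and content.
# R01 and R02 are made-up entries; R02 is dated earlier than R01.
revisions: List[Dict[str, Any]] = [
    {"msg": "R00", "date": 1000000000, "content": {"A/B/C/a": "content a"}},
    {"msg": "R01", "date": 1000000010, "content": {"A/B/C/b": "content b"}},
    {"msg": "R02", "date": 1000000005, "content": {"A/B/C/c": "content c"}},
]


def out_of_order(revs: List[Dict[str, Any]]) -> List[str]:
    """Return the messages of revisions dated earlier than the one listed before them."""
    return [
        cur["msg"]
        for prev, cur in zip(revs, revs[1:])
        if cur["date"] < prev["date"]
    ]


print(out_of_order(revisions))  # prints ['R02']
```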

> **vlorentz** (inline comment): I like this, but shouldn't this new DSL be defined outside swh-provenance and be used to generate model objects directly? We could use it to replace the dependency on swh-loader-git in swh-web and swh-vault's tests.
>
> **douardda** (inline comment): Good thinking... can we add a task and do it later? :-) Concerning the dependencies, well I do use swh-loader-git in generate_storage_from_git.py (so it's a dependency in requirements-tests.txt), so...

There is no way of creating branches and merges for now.

The tool to generate this git repo is `generate_repo.py`:
```
python generate_repo.py --help
Usage: generate_repo.py [OPTIONS] INPUT_FILE OUTPUT_DIR

Options:
  -C, --clean-output / --no-clean-output
  --help                          Show this message and exit.
```
It generates a git repository in `OUTPUT_DIR` and writes a template
`synthetic` file to its standard output, which can be used to ease
writing the expected `synthetic` files.

Typical usage will be:
```
python generate_repo.py repo2_repo.yaml repo2 > synthetic_repo2_template.txt
```
Note that the hashes (for revisions, directories and contents) of the git objects
depend only on the content of the input yaml file. Calling the tool twice on
the same input file should generate the exact same git repo twice.
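
The reproducibility follows from how git names its objects: an object id is the SHA-1 of a short header followed by the object's bytes. The sketch below shows this for blobs only; `git_blob_sha1` is a hypothetical helper, not part of the tools described here. Commit ids also cover the author, committer and dates, so reproducible revisions presumably rely on the tool keeping those fixed, which is an assumption rather than something stated above.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Compute the id git assigns to a blob holding `data`."""
    # Git hashes the header "blob <size>\0" followed by the raw content.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The same bytes always give the same id, whatever the repository or time.
print(git_blob_sha1(b"") == git_blob_sha1(b""))  # prints True
```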
## Msgpack dump of the storage

This file contains a set of storage objects (`Revision`, `Content` and
`Directory`) and is usually generated from a local git repository (typically
the one generated by the previous command) using the
`generate_storage_from_git.py` tool:
```
python generate_storage_from_git.py --help
Usage: generate_storage_from_git.py [OPTIONS] GIT_REPO

  simple tool to generate the CMDBTS.msgpack dataset filed used in tests

Options:
  -r, --head TEXT    head revision to start from
  -o, --output TEXT  output file
  --help             Show this message and exit.
```
Typical usage would be, using the git repository `repo2` created previously:

```
python generate_storage_from_git.py repo2
Revision hash for master is 8363e8e98751dc9f264d2fedd6b829ad4b1218b0
Wrote 86 objects in repo2.msgpack
```
## Synthetic files

These files describe the expected content of the provenance database for each
revision (in order of ingestion).

The `generate_repo.py` tool will produce a template synthetic file like:
```
1000000000.0 b582a17b3fc37f72fc57877616f85c3f0abed064 R00
R00 |  |         | R b582a17b3fc37f72fc57877616f85c3f0abed064 | 1000000000.0
    |  | .       | D a4cb5e6b2831f7e8eef0e6e08e43d642c97303a1 | 0.0
    |  | A       | D 1c8d9fd9afa7e5a2cf52a3db6f05dc5c3a1ca86b | 0.0
    |  | A/B     | D 36876d475197b5ad86ad592e8e28818171455f16 | 0.0
    |  | A/B/C   | D 98f7a4a23d8df1fb1a5055facae2aff9b2d0a8b3 | 0.0
    |  | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0.0
1000000010.0 8259eeae2ff5046f0bb4393d6e894fe6d7e01bfe R01
R01 |  |         | R 8259eeae2ff5046f0bb4393d6e894fe6d7e01bfe | 1000000010.0
    |  | .       | D b3cf11b22c9f93c3c494cf90ab072f394155072d | 0.0
    |  | A       | D baca735bf8b8720131b4bfdb47c51631a9260348 | 0.0
    |  | A/B     | D 4b28979d88ed209a09c272bcc80f69d9b18339c2 | 0.0
    |  | A/B/C   | D c9cabe7f49012e3fdef6ac6b929efb5654f583cf | 0.0
    |  | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0.0
    |  | A/B/C/b | C 50e9cdb03f9719261dd39d7f2920b906db3711a3 | 0.0
[...]
```
where all the contents and directories of each revision are listed. It is then
the user's responsibility to write the expected synthetic file for a
given heuristics configuration. For example, the two revisions above are to be
adapted, for the `(lower=True, mindepth=1)` case, as:
```
1000000000 c0d8929936631ecbcf9147be6b8aa13b13b014e4 R00
R00 |       |         | R c0d8929936631ecbcf9147be6b8aa13b13b014e4 | 1000000000
    | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0
1000000010 1444db96cbd8cd791abe83527becee73d3c64e86 R01
R01 |       |         | R 1444db96cbd8cd791abe83527becee73d3c64e86 | 1000000010
    | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | -10
    | R---C | A/B/C/b | C 50e9cdb03f9719261dd39d7f2920b906db3711a3 | 0
```
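
For readers who want to inspect such a file programmatically, the following is a minimal sketch of a reader for the pipe-separated rows. It is not the parser used by the tests: `parse_synthetic` and the field names are hypothetical, the column meanings are inferred from the examples above, and the last column is kept as a plain number without interpreting it.

```python
from typing import Dict, List, Optional

SAMPLE = """\
1000000000 c0d8929936631ecbcf9147be6b8aa13b13b014e4 R00
R00 |       |         | R c0d8929936631ecbcf9147be6b8aa13b13b014e4 | 1000000000
    | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0
1000000010 1444db96cbd8cd791abe83527becee73d3c64e86 R01
R01 |       |         | R 1444db96cbd8cd791abe83527becee73d3c64e86 | 1000000010
    | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | -10
    | R---C | A/B/C/b | C 50e9cdb03f9719261dd39d7f2920b906db3711a3 | 0
"""


def parse_synthetic(text: str) -> List[Dict[str, object]]:
    """Turn the pipe-separated rows of a synthetic file into dicts."""
    rows: List[Dict[str, object]] = []
    current_rev: Optional[str] = None
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue  # header lines, blank lines and "[...]" have no 5 columns
        rev, relation, path, typed_hash, ts = fields
        if rev:
            current_rev = rev  # only the first row of a revision names it
        kind, sha1 = typed_hash.split()
        rows.append(
            {
                "rev": current_rev,
                "relation": relation,  # e.g. "R---C", empty on revision rows
                "path": path,
                "kind": kind,          # "R", "D" or "C" in the examples
                "sha1": sha1,
                "ts": float(ts),
            }
        )
    return rows


rows = parse_synthetic(SAMPLE)
```

Each row inherits the revision label of the nearest preceding row that carries one, which mirrors the way the label is written only once per revision in the file.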