diff --git a/swh/provenance/tests/data/README.md b/swh/provenance/tests/data/README.md
index 3e78ba5..2eb0da0 100644
--- a/swh/provenance/tests/data/README.md
+++ b/swh/provenance/tests/data/README.md
@@ -1,138 +1,166 @@
 # Provenance Index Test Dataset
 
 This directory contains datasets used by `test_provenance_heurstics` tests of
 the provenance index database.
 
 Each dataset `xxx` consist in several parts:
 
 - a description of a git repository as a yaml file named `xxx_repo.yaml`,
 - a msgpack file containing storage objects for the given repository, from
   which the storage is filled before each test using these data, and
 - a set of synthetic files, named `synthetic_xxx_(lower|upper)_.txt`,
   describing the expected result in the provenance database if ingested with
   the flag `lower` set or not set, and the `mindepth` value (integer, most
   often `1` or `2`).
 
 ## Git repos description file
 
 The description of a git repository is a yaml file which contains a list dicts,
 each one representing a git revision to add (linearly) in the git repo used a
 base for the dataset. Each dict consist in a structure like:
 
 ``` yaml
 - msg: R00
   date: 1000000000
   content:
     A/B/C/a: "content a"
 ```
 
 this example will generate a git commit with the commit message "R00", the
 author and committer date 1000000000 (given as a unix timestamp), and a one
 file which path is `A/B/C/a` and content is "content a".
 
 The file is parsed to create git revisions in a temporary git repository, in
 order of appearance in the yaml file (so one may create an git repository with
 'out-of-order' commits).
 
 There is no way of creating branches and merges for now.
 
 The tool to generate this git repo is `generate_repo.py`:
 
 ```
 python generate_repo.py --help
 Usage: generate_repo.py [OPTIONS] INPUT_FILE OUTPUT_DIR
 
 Options:
   -C, --clean-output / --no-clean-output
   --help                          Show this message and exit.
 ```
 
 It generates a git repository in the `OUTPUT_DIR` as well as produces a
 template `synthetic` file on its standard output, which can be used to ease
 writing the expected `synthetic` files.
 
 Typical usage will be:
 
 ```
 python generate_repo.py repo2_repo.yaml repo2 > synthetic_repo2_template.txt
 ```
 
 Note that hashes (for revision, directories and content) of the git objects only
 depends on the content of the input yaml file. Calling the tool twice on the same
 input file should generate the exact same git repo twice.
+Also note that the tool will add a branch at each revision (using the commit
+message as branch name), to make it easier to reference any point in the git
+history.
 
 ## Msgpack dump of the storage
 
 This file contains a set of storage objects (`Revision`, `Content` and
 `Directory`) and is usually generated from a local git repository (typically the
 one generated by the previous command) using the `generate_storage_from_git.py`
 tool:
 
 ```
 python generate_storage_from_git.py --help
 Usage: generate_storage_from_git.py [OPTIONS] GIT_REPO
 
   simple tool to generate the CMDBTS.msgpack dataset filed used in tests
 
 Options:
   -r, --head TEXT    head revision to start from
   -o, --output TEXT  output file
   --help             Show this message and exit.
 ```
 
 Typical usage would be, using the git repository `repo2` created previously:
 
 ```
 python generate_storage_from_git.py repo2
 Revision hash for master is 8363e8e98751dc9f264d2fedd6b829ad4b1218b0
 Wrote 86 objects in repo2.msgpack
 ```
 
+### Adding extra visits/snapshots
+
+It is also possible to generate a storage from a git repo with extra origin
+visits, using the `--visits` option of the `generate_storage_from_git.py` tool.
+
+This option expects a yaml file as argument. This file contains a description of
+extra visits (and snapshots) you want to add to the storage.
+
+The format is simple, for example:
+
+```
+# a visit pattern scenario for the 'repo_with_merges' repo
+
+- origin: http://repo_with_merges/1/
+  date: 1000000015
+  branches:
+    - R01
+
+```
+
+will create an OriginVisit (at given date) for the given origin URL (the Origin
+will be created as well), with a `Snapshot` including the listed
+branches.
+
+
 ## Synthetic files
 
 These files describe the expected content of the provenance database for each
 revision (in order of ingestion).
 
 The `generate_repo.py` tool will produce a template of synthetic file like:
 
 ```
 1000000000.0 b582a17b3fc37f72fc57877616f85c3f0abed064 R00
 R00 |       |         | R b582a17b3fc37f72fc57877616f85c3f0abed064 | 1000000000.0
     |       | .       | D a4cb5e6b2831f7e8eef0e6e08e43d642c97303a1 | 0.0
     |       | A       | D 1c8d9fd9afa7e5a2cf52a3db6f05dc5c3a1ca86b | 0.0
     |       | A/B     | D 36876d475197b5ad86ad592e8e28818171455f16 | 0.0
     |       | A/B/C   | D 98f7a4a23d8df1fb1a5055facae2aff9b2d0a8b3 | 0.0
     |       | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0.0
 1000000010.0 8259eeae2ff5046f0bb4393d6e894fe6d7e01bfe R01
 R01 |       |         | R 8259eeae2ff5046f0bb4393d6e894fe6d7e01bfe | 1000000010.0
     |       | .       | D b3cf11b22c9f93c3c494cf90ab072f394155072d | 0.0
     |       | A       | D baca735bf8b8720131b4bfdb47c51631a9260348 | 0.0
     |       | A/B     | D 4b28979d88ed209a09c272bcc80f69d9b18339c2 | 0.0
     |       | A/B/C   | D c9cabe7f49012e3fdef6ac6b929efb5654f583cf | 0.0
     |       | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0.0
     |       | A/B/C/b | C 50e9cdb03f9719261dd39d7f2920b906db3711a3 | 0.0
 [...]
 ```
 
 where all the content and directories of each revision are listed; it's then
 the responsibility of the user to create the expected synthetic file for a
 given heuristics configuration.
 For example, the 2 revisions above are to be adapted, for the
 `(lower=True, mindepth=1)` case, as:
 
 ```
 1000000000 c0d8929936631ecbcf9147be6b8aa13b13b014e4 R00
 R00 |       |         | R c0d8929936631ecbcf9147be6b8aa13b13b014e4 | 1000000000
     | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | 0
 1000000010 1444db96cbd8cd791abe83527becee73d3c64e86 R01
 R01 |       |         | R 1444db96cbd8cd791abe83527becee73d3c64e86 | 1000000010
     | R---C | A/B/C/a | C 20329687bb9c1231a7e05afe86160343ad49b494 | -10
     | R---C | A/B/C/b | C 50e9cdb03f9719261dd39d7f2920b906db3711a3 | 0
 ```
diff --git a/swh/provenance/tests/data/generate_storage_from_git.py b/swh/provenance/tests/data/generate_storage_from_git.py
index 854a3eb..0d8b0cd 100644
--- a/swh/provenance/tests/data/generate_storage_from_git.py
+++ b/swh/provenance/tests/data/generate_storage_from_git.py
@@ -1,51 +1,114 @@
 # Copyright (C) 2021 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information
 
 from datetime import datetime, timezone
 import os
+from subprocess import check_output
 
 import click
+import yaml
 
 from swh.loader.git.from_disk import GitLoaderFromDisk
+from swh.model.model import (
+    Origin,
+    OriginVisit,
+    OriginVisitStatus,
+    Snapshot,
+    SnapshotBranch,
+    TargetType,
+)
 from swh.storage import get_storage
 
 
 def load_git_repo(url, directory, storage):
     visit_date = datetime.now(tz=timezone.utc)
     loader = GitLoaderFromDisk(
         url=url,
         directory=directory,
         visit_date=visit_date,
         storage=storage,
     )
     return loader.load()
 
 
 def pop_key(d, k):
     d.pop(k)
     return d
 
 
 @click.command()
 @click.option("-o", "--output", default=None, help="output file")
+@click.option(
+    "-v",
+    "--visits",
+    type=click.File(mode="rb"),
+    default=None,
+    help="additional visits to generate.",
+)
 @click.argument("git-repo", type=click.Path(exists=True, file_okay=False))
-def main(output, git_repo):
+def main(output, visits, git_repo):
     "simple tool to generate the git_repo.msgpack dataset file used in some tests"
     if output is None:
         output = f"{git_repo}.msgpack"
     with open(output, "wb") as outstream:
         sto = get_storage(
             cls="memory", journal_writer={"cls": "stream", "output_stream": outstream}
         )
         if git_repo.endswith("/"):
             git_repo = git_repo[:-1]
         reponame = os.path.basename(git_repo)
         load_git_repo(f"https://{reponame}", git_repo, sto)
+
+        if visits:
+            # retrieve all branches from the actual git repo
+            all_branches = {
+                ref: sha1
+                for sha1, ref in (
+                    line.strip().split()
+                    for line in check_output(["git", "-C", git_repo, "show-ref"])
+                    .decode()
+                    .splitlines()
+                )
+            }
+
+            for visit in yaml.full_load(visits):
+                # add the origin (if it already exists, this is a noop)
+                sto.origin_add([Origin(url=visit["origin"])])
+                # add a new visit for this origin
+                visit_id = sto.origin_visit_add(
+                    [
+                        OriginVisit(
+                            origin=visit["origin"],
+                            date=datetime.fromtimestamp(visit["date"], tz=timezone.utc),
+                            type="git",
+                        )
+                    ]
+                )[0].visit
+                # add a snapshot with branches from the input file
+                branches = {
+                    f"refs/heads/{name}".encode(): SnapshotBranch(
+                        target=bytes.fromhex(all_branches[f"refs/heads/{name}"]),
+                        target_type=TargetType.REVISION,
+                    )
+                    for name in visit["branches"]
+                }
+                snap = Snapshot(branches=branches)
+                sto.snapshot_add([snap])
+                # add a "closing" origin visit status update referencing the snapshot
+                status = OriginVisitStatus(
+                    origin=visit["origin"],
+                    visit=visit_id,
+                    date=datetime.fromtimestamp(visit["date"], tz=timezone.utc),
+                    status="full",
+                    snapshot=snap.id,
+                )
+                sto.origin_visit_status_add([status])
+
     click.echo(f"Serialized the storage made from {reponame} in {output}")
 
 
 if __name__ == "__main__":
     main()
diff --git a/swh/provenance/tests/data/repo_with_merges-visits-01.yaml b/swh/provenance/tests/data/repo_with_merges-visits-01.yaml
new file mode 100644
index 0000000..5fb2359
--- /dev/null
+++ b/swh/provenance/tests/data/repo_with_merges-visits-01.yaml
@@ -0,0 +1,34 @@
+# a visit pattern scenario for the 'repo_with_merges' repo
+
+- origin: http://repo_with_merges/1/
+  date: 1000000015
+  branches:
+    - R01
+
+- origin: http://repo_with_merges/1/
+  date: 1000000025
+  branches:
+    - R03
+    - R06
+
+- origin: http://repo_with_merges/2/
+  date: 1000000035
+  branches:
+    - R05
+    - R06
+
+- origin: http://repo_with_merges/1/
+  date: 1000000045
+  branches:
+    - R06
+    - R07
+
+- origin: http://repo_with_merges/1/
+  date: 1000000055
+  branches:
+    - R08
+
+- origin: http://repo_with_merges/2/
+  date: 1000000065
+  branches:
+    - R08