Page MenuHomeSoftware Heritage

D6031.id21953.diff
No OneTemporary

D6031.id21953.diff

diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,10 +1,137 @@
-.. _swh-py-template:
+.. _swh-provenance:
-Software Heritage - Python module template
-==========================================
+Software Heritage Provenance
+============================
-Python module template, used as skeleton to create new modules.
+A provenance index database based on the Software Heritage Archive. This is an
+implementation of the paper `Software Provenance Tracking at the Scale of
+Public Source Code`_ published in `Empirical Software Engineering`_
+This provenance index database is a tool to help answering the question "where
+does this source code artifact come from?", which the main Software Heritage
+Archive cannot easily solve.
+
+
+Quick Start
+-----------
+
+Database creation
+~~~~~~~~~~~~~~~~~
+
+Create a provenance index database (in this example we use pifpaf_ to easily
+set up a test Postgresql database. Adapt the example below to your Postgresql
+setup):
+
+.. code-block:: shell
+
+ eval $(pifpaf run postgresql)
+ swh db create -d provdb provenance
+ swh db init-admin -d provdb provenance
+ swh db init -d provdb provenance
+
+The provenance index DB comes in 2 feature flags, so there are 4 possible flavors. Feature flags are:
+
+- `with-path` / `without-path`: whether the provenance index database will store file path,
+- `normalized` / `denormalized`: whether or not the main relation tables are normalized (see below).
+
+So the possible flavors are:
+
+- `with-path`
+- `without-path`
+- `with-path-denormalized`
+- `without-path-denormalized`
+
+Filling the provenance index database
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This step requires an access to the Software Heritage Archive to retrieve the
+actual data from the Archive.
+
+It currently also needs an input CSV file of revisions and origins to insert in
+the provenance database.
+
+Examples of such files are available in the `provenance public dataset`_.
+
+.. _`provenance public dataset`: https://annex.softwareheritage.org/public/dataset/provenance
+
+.. code-block:: shell
+
+ wget https://annex.softwareheritage.org/public/dataset/provenance/sample_10k.csv.bz2
+ bunzip2 sample_10k.csv.bz2
+
+You need a configuration file, like:
+
+.. code-block:: yaml
+
+ # config.yaml
+ provenance:
+ storage:
+ cls: local
+ db:
+ host: /tmp/tmpifn2ov_j
+ port: 9824
+ dbname: provdb
+ archive:
+ cls: api
+ storage:
+ cls: remote
+ url: http://storage:5002/
+
+Note that you need access to the internal API of a :ref:`swh-storage
+<swh-storage>` instance (here the machine named `storage`) for this.
+
+Then you can feed the provenance index database using:
+
+.. code-block:: shell
+
+ swh provenance -C config.yaml iter-revisions sample_10k.csv
+
+
+This may take a while to complete.
+
+Querying the provenance index database
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using the same config file, you may look for the first known occurrence of a file content:
+
+.. code-block:: shell
+
+ swh provenance -C config.yaml find-first 8a54694c92c944fcb06d73c17743ac72444a5b72
+ swh:1:cnt:8a54694c92c944fcb06d73c17743ac72444a5b72, swh:1:rev:6193fae0668d082d90207f6c9f33d6e8c98dd04a, 2008-10-06 18:32:23+00:00, None, lua/effects/bloodstream/init.lua
+
+
+Or all the known occurrences:
+
+.. code-block:: shell
+
+ swh provenance -C config.yaml find-all 8a54694c92c944fcb06d73c17743ac72444a5b72
+ swh:1:cnt:8a54694c92c944fcb06d73c17743ac72444a5b72, swh:1:rev:6193fae0668d082d90207f6c9f33d6e8c98dd04a, 2008-10-06 18:32:23+00:00, None, lua/effects/bloodstream/init.lua
+ swh:1:cnt:8a54694c92c944fcb06d73c17743ac72444a5b72, swh:1:rev:f0a5078eed8808323b93ed09cddb003dbe2a85e4, 2008-10-06 18:32:23+00:00, None, trunk/lua/effects/bloodstream/init.lua
+ [...]
+
+
+(De)normalized database
+-----------------------
+
+For some relation tables (like the ``content_in_revision`` storing, for each
+content object, in which revision it has been found), the default data schema
+is to store one row for each relation.
+
+For a big database, this can have a significant cost in terms of storage.
+
+So it is possible to store these relations using an array as destination column
+(the ``revision`` column in the case of the ``content_in_revisison`` table).
+
+This can drastically reduce the database storage size, possibly at the price of
+a slight performance hit.
+
+Warning: the denormalized version of the database is still under test and
+validation. Do not use for serious work.
+
+
+.. _`Empirical Software Engineering`: http://link.springer.com/journal/10664
+.. _`Software Provenance Tracking at the Scale of Public Source Code`: http://dx.doi.org/10.1007/s10664-020-09828-5
+.. _pifpaf: https://github.com/jd/pifpaf
.. toctree::
:maxdepth: 2

File Metadata

Mime Type
text/plain
Expires
Nov 5 2024, 5:15 PM (12 w, 4 d ago)
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
3225251

Event Timeline