diff --git a/PKG-INFO b/PKG-INFO index d5b37a9..9c87cc0 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,10 +1,10 @@ Metadata-Version: 1.0 Name: swh.model -Version: 0.0.21 +Version: 0.0.22 Summary: Software Heritage data model Home-page: https://forge.softwareheritage.org/diffusion/DMOD/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Description: UNKNOWN Platform: UNKNOWN diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst index c796a80..7f41d61 100644 --- a/docs/persistent-identifiers.rst +++ b/docs/persistent-identifiers.rst @@ -1,145 +1,145 @@ .. _persistent-identifiers: Persistent identifiers ====================== You can point to objects present in the Software Heritage archive by the means of **persistent identifiers** that are guaranteed to remain stable (persistent) over time. Their syntax, meaning, and usage is described below. Note that they are identifiers and not URLs, even though an URL-based resolver for Software Heritage persistent identifiers is also provided. A persistent identifier can point to any software artifact (or "object") available in the Software Heritage archive. Objects come in different types, and most notably: * contents * directories * revisions * releases * snapshots Each object is identified by an intrinsic, type-specific object identifier that is embedded in its persistent identifier as described below. Object identifiers are strong cryptographic hashes computed on the entire set of object properties to form a `Merkle structure `_. See :ref:`data-model` for an overview of object types and how they are linked together. See :py:mod:`swh.model.identifiers` for details on how intrinsic object identifiers are computed. Syntax ------ Syntactically, persistent identifiers are generated by the ```` entry point of the grammar: .. code-block:: bnf ::= "swh" ":" ":" ":" ; ::= "1" ; ::= "snp" (* snapshot *) | "rel" (* release *) | "rev" (* revision *) | "dir" (* directory *) | "cnt" (* content *) ; ::= 40 * ; (* intrinsic object id, as hex-encoded SHA1 *) ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "a" | "b" | "c" | "d" | "e" | "f" ; Semantics --------- ``:`` is used as separator between the logical parts of identifiers. The ``swh`` prefix makes explicit that these identifiers are related to *SoftWare Heritage*. ``1`` (````) is the current version of this identifier *scheme*; future editions will use higher version numbers, possibly breaking backward compatibility (but without breaking the resolvability of identifiers that conform to previous versions of the scheme). A persistent identifier points to a single object, whose type is explicitly captured by ````: * ``snp`` identifiers points to **snapshots**, * ``rel`` to **releases**, * ``rev`` to **revisions**, * ``dir`` to **directories**, -* ``cnt`` to **releases**. +* ``cnt`` to **contents**. The actual object pointed to is identified by the intrinsic identifier ````, which is a hex-encoded (using lowercase ASCII characters) SHA1 computed on the content and metadata of the object itself, as follows: * for **snapshots**, intrinsic identifiers are computed as per :py:func:`swh.model.identifiers.snapshot_identifier` * for **releases**, as per :py:func:`swh.model.identifiers.release_identifier` * for **revisions**, as per :py:func:`swh.model.identifiers.revision_identifier` * for **directories**, as per :py:func:`swh.model.identifiers.directory_identifier` * for **contents**, the intrinsic identifier is the ``sha1_git`` hash of the multiple hashes returned by :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte sequence obtained by juxtaposing the ASCII string ``"blob"`` (without quotes), a space, the length of the content as decimal digits, a NULL byte, and the actual content of the file. Git compatibility ~~~~~~~~~~~~~~~~~ Intrinsic object identifiers for contents, directories, revisions, and releases are, at present, compatible with the `Git `_ way of `computing identifiers `_ for its objects. A Software Heritage content identifier will be identical to a Git blob identifier of any file with the same content, a Software Heritage revision identifier will be identical to the corresponding Git commit identifier, etc. This is not the case for snapshot identifiers as Git doesn't have a corresponding object type. Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git). Examples -------- * ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content of a file containing the full text of the GPL3 license * ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory containing the source code of the Darktable photography application as it was at some point on 4 May 2017 * ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in the development history of Darktable, dated 16 January 2017, that added undo/redo supports for masks * ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable release 2.3.0, dated 24 December 2016 * ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub Resolution ---------- Persistent identifiers can be resolved using the Software Heritage Web application (see :py:mod:`swh.web`). In particular, the ``/browse/`` endpoint can be given a persistent identifier and will lead to the browsing page of the corresponding object, like this: ``https://archive.softwareheritage.org/browse/``. For example: * ``_ * ``_ * ``_ * ``_ * ``_ diff --git a/swh.model.egg-info/PKG-INFO b/swh.model.egg-info/PKG-INFO index d5b37a9..9c87cc0 100644 --- a/swh.model.egg-info/PKG-INFO +++ b/swh.model.egg-info/PKG-INFO @@ -1,10 +1,10 @@ Metadata-Version: 1.0 Name: swh.model -Version: 0.0.21 +Version: 0.0.22 Summary: Software Heritage data model Home-page: https://forge.softwareheritage.org/diffusion/DMOD/ Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Description: UNKNOWN Platform: UNKNOWN diff --git a/swh.model.egg-info/SOURCES.txt b/swh.model.egg-info/SOURCES.txt index d870641..b45f730 100644 --- a/swh.model.egg-info/SOURCES.txt +++ b/swh.model.egg-info/SOURCES.txt @@ -1,56 +1,58 @@ .gitignore AUTHORS LICENSE MANIFEST.in Makefile Makefile.local README-dev.md requirements-swh.txt requirements.txt setup.py version.txt bin/git-revhash bin/swh-hash-file bin/swh-revhash debian/changelog debian/compat debian/control debian/copyright debian/rules debian/source/format docs/.gitignore docs/Makefile docs/conf.py docs/data-model.rst docs/index.rst docs/persistent-identifiers.rst docs/_static/.placeholder docs/_templates/.placeholder swh/__init__.py swh.model.egg-info/PKG-INFO swh.model.egg-info/SOURCES.txt swh.model.egg-info/dependency_links.txt swh.model.egg-info/requires.txt swh.model.egg-info/top_level.txt swh/model/__init__.py swh/model/exceptions.py swh/model/from_disk.py swh/model/hashutil.py swh/model/identifiers.py swh/model/merkle.py +swh/model/toposort.py swh/model/validators.py swh/model/fields/__init__.py swh/model/fields/compound.py swh/model/fields/hashes.py swh/model/fields/simple.py swh/model/tests/__init__.py swh/model/tests/generate_testdata_from_disk.py swh/model/tests/test_from_disk.py swh/model/tests/test_hashutil.py swh/model/tests/test_identifiers.py swh/model/tests/test_merkle.py +swh/model/tests/test_toposort.py swh/model/tests/test_validators.py swh/model/tests/fields/__init__.py swh/model/tests/fields/test_compound.py swh/model/tests/fields/test_hashes.py swh/model/tests/fields/test_simple.py \ No newline at end of file diff --git a/swh/model/tests/test_toposort.py b/swh/model/tests/test_toposort.py new file mode 100644 index 0000000..66a8ee1 --- /dev/null +++ b/swh/model/tests/test_toposort.py @@ -0,0 +1,99 @@ +# Copyright (C) 2017-2018 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import unittest +from swh.model.toposort import toposort + + +def is_toposorted_slow(revision_log): + """Check (inefficiently) that the given revision log is in any topological + order. + + Complexity: O(n^2). + (Note: It's totally possible to write a O(n) is_toposorted function, + but it requires computing the transitive closure of the input DAG, + which requires computing a topological ordering of that DAG, which + kind of defeats the purpose of writing unit tests for toposort().) + + Args: + revision_log: Revision log as returned by + swh.storage.Storage.revision_log(). + + Returns: + True if the revision log is topologically sorted. + """ + rev_by_id = {r['id']: r for r in revision_log} + + def all_parents(revision): + for parent in revision['parents']: + yield parent + yield from all_parents(rev_by_id[parent]) + + visited = set() + for rev in revision_log: + visited.add(rev['id']) + if not all(parent in visited for parent in all_parents(rev)): + return False + return True + + +class TestToposort(unittest.TestCase): + def generate_log(self, graph): + for node_id, parents in graph.items(): + yield {'id': node_id, 'parents': tuple(parents)} + + def unordered_log(self, log): + return {(d['id'], tuple(d['parents'])) for d in log} + + def check(self, graph): + log = list(self.generate_log(graph)) + topolog = list(toposort(log)) + self.assertEqual(len(topolog), len(graph)) + self.assertEqual(self.unordered_log(topolog), self.unordered_log(log)) + self.assertTrue(is_toposorted_slow(toposort(log))) + + def test_linked_list(self): + self.check({3: [2], + 2: [1], + 1: []}) + + def test_fork(self): + self.check({7: [6], + 6: [4], + 5: [3], + 4: [2], + 3: [2], + 2: [1], + 1: []}) + + def test_fork_merge(self): + self.check({8: [7, 5], + 7: [6], + 6: [4], + 5: [3], + 4: [2], + 3: [2], + 2: [1], + 1: []}) + + def test_two_origins(self): + self.check({9: [8], + 8: [7, 5], + 7: [6], + 6: [4], + 5: [3], + 4: [], + 3: []}) + + def test_three_way(self): + self.check({9: [8, 4, 2], + 8: [7, 5], + 7: [6], + 6: [4], + 5: [3], + 4: [2], + 3: [2], + 2: [1], + 1: []}) diff --git a/swh/model/toposort.py b/swh/model/toposort.py new file mode 100644 index 0000000..b0a7231 --- /dev/null +++ b/swh/model/toposort.py @@ -0,0 +1,43 @@ +# Copyright (C) 2017-2018 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import collections + + +def toposort(revision_log): + """Perform a topological sort on a revision log graph. + + Complexity: O(N) (linear in the length of the revision log) + + Args: + revision_log: Revision log as returned by + swh.storage.Storage.revision_log(). + + Yields: + The revision log sorted by a topological order + """ + in_degree = {} # rev_id -> numbers of parents left to compute + children = collections.defaultdict(list) # rev_id -> children + + # Compute the in_degrees and the parents of all the revisions. + # Add the roots to the processing queue. + queue = collections.deque() + for rev in revision_log: + parents = rev['parents'] + in_degree[rev['id']] = len(parents) + if not parents: + queue.append(rev) + for parent in parents: + children[parent].append(rev) + + # Topological sort: yield the 'ready' nodes, decrease the in degree of + # their children and add the 'ready' ones to the queue. + while queue: + rev = queue.popleft() + yield rev + for child in children[rev['id']]: + in_degree[child['id']] -= 1 + if in_degree[child['id']] == 0: + queue.append(child) diff --git a/version.txt b/version.txt index 943a41a..3a338cb 100644 --- a/version.txt +++ b/version.txt @@ -1 +1 @@ -v0.0.21-0-gbdf26f5 \ No newline at end of file +v0.0.22-0-ga06122e \ No newline at end of file