
Specify the Vitam archiving format
Closed, Migrated

Description

SWH in Vitam

Proposal for archiving SWH in Vitam

It seems unrealistic, and probably unnecessary, to apply the SWH data model directly in Vitam: the main goal of this archiving project is to be able to recover a working SWH dataset, and possibly to find and retrieve a single SWH object from its SWHID with relative ease.

With this main goal in mind, the simplest solution would probably be to pack SWH objects into "packfiles" holding a given number of objects, and to keep the SWHIDs of all objects stored in a packfile in an indexable metadata file attached to it. Together, these two files would form an SWH Archive Unit.

This should make it relatively easy to handle incremental archiving of the SWH Archive: the packfiles are assembled by consuming the SWH Journal (Kafka-based).
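As a rough illustration of the packfile-plus-index idea, the sketch below groups a stream of (SWHID, bytes) pairs into fixed-size batches and writes, for each batch, one data file and one text index listing the SWHIDs it holds. The names `pack_objects` and `content_swhid`, the `.pack`/`.idx` extensions, the batch size and the length-prefixed record layout are all assumptions made for the example; an actual implementation would more likely serialize with msgpack or ORC and read its input from the journal.

```python
import hashlib
import struct
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def content_swhid(data: bytes) -> str:
    """SWHID of a content object: the git blob hash of its bytes."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def pack_objects(
    objects: Iterable[Tuple[str, bytes]],
    out_dir: Path,
    objects_per_pack: int = 1000,  # hypothetical batch size
) -> Iterator[Tuple[Path, Path]]:
    """Write one (packfile, index) pair per batch of objects."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stream = iter(objects)
    pack_no = 0
    while True:
        batch = list(islice(stream, objects_per_pack))
        if not batch:
            break
        pack_path = out_dir / f"{pack_no:08d}.pack"
        index_path = out_dir / f"{pack_no:08d}.idx"
        with pack_path.open("wb") as pack, index_path.open("w") as index:
            for swhid, payload in batch:
                # Assumed layout: 8-byte big-endian size, then the payload.
                pack.write(struct.pack(">Q", len(payload)))
                pack.write(payload)
                # The index is plain text, one SWHID per line.
                index.write(swhid + "\n")
        yield pack_path, index_path
        pack_no += 1
```

Because `pack_objects` is a generator over an arbitrary iterable, the same loop could in principle be fed by a journal consumer, each yielded pair being one candidate SWH Archive Unit.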

These packfiles could use one of several low-level serialization formats, such as msgpack, which is already used in several places in the Software Heritage code base. However, the existing work on swh-dataset gives the opportunity to consider the Apache ORC file format instead.

ORC has the advantage of being a known and well-defined file format (although the export data model remains to be defined), and it provides very good compression as well as tooling to read, parse, filter, etc. the data.

Depending on whether we can add a custom indexer in Vitam to deal with the ORC files produced by swh-dataset, we may be able to get rid of the text-based index file storing the list of SWHIDs included in a given ORC file.


OAIS -- Open Archival Information System

OAIS organizes "information" into "packages" of different kinds, to capture it:

  • when *producing* the information
  • when *archiving* it
  • when *communicating* it

OAIS proposes 3 kinds of packages:

  • Submission Information Package (SIP), crafted by the producer for the archiving system (Vitam here)
  • Archival Information Package (AIP), resulting from the processing of the SIP by the archiving service (Vitam)
  • Dissemination Information Package (DIP), resulting from the processing of one or more AIPs by the archiving service, and aimed at publishing said information.

SEDA -- *Standard d’Échange de Données pour l’Archivage*

SEDA and the MEDONA norm are the official standards for exchanging transactions between archiving services.

Transferring an archive unit to a remote archive service following the SEDA protocol consists of sending a SIP.

Vitam

Archiving platform implementing (among others) OAIS/SEDA protocols.

https://www.programmevitam.fr/pages/documentation/

Data model

:::info
Vitam is based on MongoDB, which has an impact on how the data model is specified.
:::

The data model consists of several "collections" organized in "bases":

  • Identity Base: stores user and application certificates (x509)
    • Certificate Collection
    • PersonalCertificate Collection
  • Logbook Base: operation and lifecycle log of archive units and objects
    • LogbookOperation Collection
    • LogbookLifeCycle Collection
    • LogbookLifeCycleUnit Collection
    • LogbookLifeCycleObjectGroup Collection
    • Offset Collection
  • MetaData Base: metadata on archive units (Unit) and objects (ObjectGroup)
    • Unit Collection
    • ObjectGroup Collection
    • Offset Collection
  • MasterData Base
    • AccessContract Collection
    • AccessionRegisterDetail Collection
    • AccessionRegisterSummary Collection
    • AccessionRegisterSymbolic Collection
    • ArchiveUnitProfile Collection: describes archive unit profiles
    • Agencies Collection
    • Context Collection: describe application contexts(?)
    • FileFormat Collection: describes file formats (filled from the PRONOM base provided by the UK National Archives)
    • FileRules Collection: management rules to compute life cycle and deadline events for archive units
    • Griffin Collection: griffins are a sort of plugin used to run processing on archived binary objects
    • IngestContract Collection: ingestion contracts
    • ManagementContract Collection: management contracts
    • Ontology Collection
    • PreservationScenario Collection: "script" (i.e. a list of griffins) to be executed on archived units (e.g. check formats and generate PDFs from documents)
    • Profile Collection: archiving profiles
    • SecurityProfile Collection
    • VitamSequence Collection: used to generate internal IDs
    • Offset Collection
  • Report Base
    • AuditObjectGroup Collection
    • EliminationActionUnit Collection
    • EliminationActionObjectGroup Collection
    • PreservationReport Collection

SIP

Consists of a zip or tgz file with:

  • a transfer manifest file (_manifest.xml) with information and metadata describing the digital objects and archive units being transferred,
  • a content directory containing said objects.

A SIP must not be larger than 1GB.

A SIP must not have more than 100k objects.
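A minimal sketch of how a producer might respect both limits when batching objects into SIPs is given below. It assumes that 1GB means 10**9 bytes, that object sizes are known up front, and that the manifest and archive compression can be ignored; `split_into_sips` and the constants are hypothetical names, not part of Vitam.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_SIP_BYTES = 1_000_000_000  # 1GB read as 10**9 bytes (assumption)
MAX_SIP_OBJECTS = 100_000      # 100k objects per SIP


def split_into_sips(
    objects: Iterable[Tuple[str, int]],
    max_bytes: int = MAX_SIP_BYTES,
    max_objects: int = MAX_SIP_OBJECTS,
) -> Iterator[List[str]]:
    """Greedily group (name, size) pairs into batches within both limits."""
    batch: List[str] = []
    batch_bytes = 0
    for name, size in objects:
        if size > max_bytes:
            raise ValueError(f"{name} alone exceeds the SIP size limit")
        # Close the current batch if adding this object would break a limit.
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_objects):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(name)
        batch_bytes += size
    if batch:
        yield batch
```

The greedy strategy keeps objects in their input order, which matters if SIPs are built incrementally; it does not try to minimize the number of SIPs.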

manifest file

Consists of:

  • a header: identifies the archive lot and the transfer agreement,
  • a list of the binary objects in the SIP,
  • archiving metadata:
    • ManagementMetadata: metadata for the whole archiving lot,
    • DescriptiveMetadata: logical tree of the included ArchiveUnits,
  • declarations of the source and destination services.
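To make the layout above more concrete, here is a simplified sketch that builds a manifest with those four parts and zips it together with a content directory. It is not a schema-valid SEDA manifest: namespaces and most mandatory fields are omitted, the element names other than ManagementMetadata, DescriptiveMetadata and ArchiveUnit are assumptions loosely based on SEDA 2.x, and `build_manifest`, `build_sip` and the agency placeholders are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict

MANIFEST_NAME = "_manifest.xml"  # file name as given in the text above


def build_manifest(message_id: str, agreement: str, files: Dict[str, bytes]) -> bytes:
    """Build a simplified manifest skeleton (not schema-valid SEDA)."""
    root = ET.Element("ArchiveTransfer")
    # Header: identifies the archive lot and the transfer agreement.
    ET.SubElement(root, "MessageIdentifier").text = message_id
    ET.SubElement(root, "ArchivalAgreement").text = agreement
    package = ET.SubElement(root, "DataObjectPackage")
    names = sorted(files)
    # List of the binary objects in the SIP.
    for i, name in enumerate(names):
        obj = ET.SubElement(package, "BinaryDataObject", id=f"BDO{i}")
        ET.SubElement(obj, "Uri").text = f"content/{name}"
    # Archiving metadata: tree of archive units, then lot-wide metadata.
    descriptive = ET.SubElement(package, "DescriptiveMetadata")
    for i, name in enumerate(names):
        unit = ET.SubElement(descriptive, "ArchiveUnit", id=f"AU{i}")
        content = ET.SubElement(unit, "Content")
        ET.SubElement(content, "Title").text = name
        ref = ET.SubElement(unit, "DataObjectReference")
        ET.SubElement(ref, "DataObjectReferenceId").text = f"BDO{i}"
    ET.SubElement(package, "ManagementMetadata")
    # Declarations of the source and destination services (placeholders).
    ET.SubElement(root, "ArchivalAgency").text = "ARCHIVAL_AGENCY"
    ET.SubElement(root, "TransferringAgency").text = "TRANSFERRING_AGENCY"
    return ET.tostring(root, encoding="utf-8")


def build_sip(manifest: bytes, files: Dict[str, bytes]) -> bytes:
    """Zip the manifest and a content/ directory into an in-memory SIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as sip:
        sip.writestr(MANIFEST_NAME, manifest)
        for name, data in files.items():
            sip.writestr(f"content/{name}", data)
    return buf.getvalue()
```

A real SIP would have to be validated against the SEDA schema and the ingest contract before being submitted to Vitam.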

A "group of objects" in the SEDA norm models the idea that an original artifact (e.g. a photograph) can be transferred as several objects (e.g. different file formats or resolutions). The Object Group is the unit that encompasses the archived object, even when the archived artifacts are multiple.

An object group is declared using DataObjectGroup. Declaring such an object group is mandatory if the archived object consists of several actual objects; otherwise, it is not recommended.

It is possible to add custom fields in the manifest of a SIP, but these need to be declared beforehand in an Ontology.

Event Timeline

douardda created this task.

Proposal from CINES

This proposal is based on a scenario with one AU per SWH Snapshot, stored in one SIP (the submitted package).

A second possible scenario is mentioned as a comment: several SWH Snapshots per SIP, hence several AUs, which could be considered depending on the resulting size and on Vitam's capacities.

In this model, Tag elements in the AU are used to list the SWHIDs of embedded objects, which may not scale well (for example, the SWH Snapshot swh:1:snp:fc7706e4c177714475a4886831486ad0979983ea is related to more than 7 million SWH Content objects, i.e. files).
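A back-of-the-envelope calculation gives an idea of the scale involved. It assumes exactly 7 million contents (the text only gives a lower bound), counts only the raw SWHID characters without any XML markup, and compares the count with the 100k-object SIP limit stated earlier as if each content were its own object.

```python
SWHID_LENGTH = len("swh:1:cnt:") + 40  # prefix plus 40 hex digits = 50 characters
CONTENT_COUNT = 7_000_000              # lower bound quoted for this snapshot
MAX_SIP_OBJECTS = 100_000              # per-SIP object limit stated earlier

# Raw size of the SWHID list alone, ignoring XML element overhead.
tag_bytes = SWHID_LENGTH * CONTENT_COUNT
# How many times the per-SIP object limit the content count represents.
ratio = CONTENT_COUNT // MAX_SIP_OBJECTS

print(f"{tag_bytes / 1e6:.0f} MB of tag text, {ratio}x the per-SIP object limit")
# prints: 350 MB of tag text, 70x the per-SIP object limit
```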

This initial proposal from CINES has not been selected because it de facto normalizes a number of relations of the SWH graph, making it unfit for storage in a solution like Vitam (too many objects, incremental updates hard to manage).