Discuss the project <-> origin mapping
Closed, Migrated

Description

The current db schema maps each "project" to a single "origin" (while keeping the history of this mapping).

This prevents us from having a single project point to, e.g., a tarball directory and a git/hg repository at the same time.

Do we want to record that Python-3.5.0.tar.xz is different from the 3.5.0 tag on hg.python.org?

Event Timeline

olasd raised the priority of this task from to Needs Triage.
olasd updated the task description.
olasd added projects: Developers, Staff.
olasd added subscribers: zack, olasd.
zack triaged this task as Normal priority. Sep 14 2015, 4:46 PM

I've given this problem some thought, and here are my proposals.

The most flexible way to store the Project to Origin mapping is a three-way map:

  • Organization
  • Project
  • Origin

The idea is that each project can be hosted by different organizations, and that we can map those together.
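
As a rough illustration of the three-way map, here is a minimal Python sketch. The `Mapping` class, the `origins_of` helper, and the sample rows (including the tarball URL) are assumptions made up for this example; they are not part of the current schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One row of the hypothetical (organization, project, origin) map."""
    organization: str
    project: str
    origin: str


# Illustrative rows only: names and URLs are placeholders, not real data.
mappings = {
    Mapping("tarball hosting", "cpython", "https://example.org/python-tarballs/"),
    Mapping("hg hosting", "cpython", "https://hg.python.org/cpython"),
}


def origins_of(project: str) -> set[str]:
    """Return every origin recorded for a project, across all organizations."""
    return {m.origin for m in mappings if m.project == project}
```

With rows like these, a single project can point to a tarball directory and a repository at the same time, which the one-project-one-origin mapping does not allow.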

In our current schema, we should create one organization for each way of listing importable artefacts:

  • GitHub
    • GitHub git hosting -> GitHub lister, generates suborganizations for "GitHub organizations"
      • hylang -> no specific lister, "shallow" organization
      • zacchiro
      • olasd
      • ...
    • GitHub asset hosting -> T17
  • Debian

Each GitHub repository would be one project. Forks would be associated with the same project, but through another organization (or even the same organization with a different origin).

Each Debian source package name would also be one project, associated with up to two origins (where applicable): one for the snapshot.d.o organization and one for the archive.d.o organization.

We then need a way to "deduplicate" projects (for instance, to associate Debian's python-hy with GitHub's hylang/hy). I would leave the "automatically generated" projects alone and keep a separate "association table" that is filled in separately.
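
The separate association table could look something like the following Python sketch. The key format `(organization, project name)` and the `equivalents` helper are assumptions for illustration only; the point is that the autogenerated projects are never modified, and the associations live alongside them.

```python
# Autogenerated projects, keyed by (organization, project name).
# These are produced by the listers and left untouched.
autogenerated_projects = {
    ("Debian", "python-hy"),
    ("GitHub", "hylang/hy"),
}

# Hypothetical association table, filled separately from the listers.
# Each pair states that two autogenerated projects are the same project.
associations = [
    (("Debian", "python-hy"), ("GitHub", "hylang/hy")),
]


def equivalents(key):
    """Return all project keys associated with `key`, itself included.

    Associations are treated as symmetric and followed transitively.
    """
    group = {key}
    frontier = [key]
    while frontier:
        current = frontier.pop()
        for a, b in associations:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in group:
                    group.add(dst)
                    frontier.append(dst)
    return group
```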

In T3#1275, @olasd wrote:
  • GitHub
    • GitHub git hosting -> GitHub lister, generates suborganizations for "GitHub organizations"
      • hylang -> no specific lister, "shallow" organization
      • zacchiro
      • olasd
      • ...
    • GitHub asset hosting -> T17

Mulling this over, this could be a bit different:

  • GitHub
    • GitHub hosting
      • GitHub git hosting
      • GitHub asset hosting
    • GitHub organizations
      • hylang
      • debian
      • ...
    • GitHub users
      • olasd
      • zacchiro
      • ...
  • Debian
    • Debian hosting
      • snapshot.debian.org
      • archive.debian.org
      • alioth.debian.org
        • git.debian.org
        • svn.debian.org
        • ...
    • Debian teams (generated from alioth)
      • pkg-foo
      • pkg-bar
      • ...
    • Debian People (generated from alioth in the /users/ hierarchy)
      • olasd
      • baz-guest
      • ...
  • GNU
    • GNU Hosting
      • mirror.gnu.org/gnu/
      • mirror.gnu.org/old-gnu/
    • GNU Projects (generated from the mirror hierarchy)
      • bash
      • glibc
      • ...
  • Apache
    • Apache Hosting
      • archive.apache.org
    • Apache Projects (generated from the archive hierarchy)
      • httpd
      • ...

We probably need to add an "autogenerated" flag to organizations, and a "matching" table like we do for projects.
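
To make the hierarchy and the "autogenerated" flag concrete, here is a minimal Python sketch of the tree. The `Organization` class and its `add` and `path` methods are hypothetical names chosen for this example, and only a small part of the GitHub branch is built.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Organization:
    """A node in the hypothetical organization hierarchy."""
    name: str
    autogenerated: bool = False
    parent: Optional["Organization"] = field(default=None, repr=False)
    children: list["Organization"] = field(default_factory=list, repr=False)

    def add(self, name: str, autogenerated: bool = False) -> "Organization":
        """Create a child node, attach it to this node, and return it."""
        child = Organization(name, autogenerated, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the names from the root down to this node."""
        node, parts = self, []
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " / ".join(reversed(parts))


# Part of the GitHub branch from the outline above.
github = Organization("GitHub")
hosting = github.add("GitHub hosting")
git_hosting = hosting.add("GitHub git hosting")
organizations = github.add("GitHub organizations")
# Nodes created by a lister would carry the autogenerated flag.
hylang = organizations.add("hylang", autogenerated=True)
```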

@zack points out that organization does not feel like the right term anymore.

Possible alternatives:

  • (source) entity
  • umbrella
  • authority
  • source (probably overloaded)

We will also need to define an entity typology:

  • organization (= Software Heritage, Debian, GNU, GitHub, Apache, ...)
  • group of entities (for hierarchy-only entities like "GitHub Hosting")
  • hosting facility (= snapshot.debian.org, GitHub git hosting, ...)
  • group of persons (= GitHub Organization, Debian Team)
  • person (= GitHub User, Debian People)
  • project (= GNU Projects, Apache Projects)
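
One possible way to encode this typology is an enumeration. The sketch below is only an illustration: the `EntityType` name, its members, and the `examples` dictionary are assumptions, with the sample entities taken from the hierarchy discussed earlier.

```python
from enum import Enum


class EntityType(Enum):
    """Hypothetical encoding of the entity typology listed above."""
    ORGANIZATION = "organization"
    GROUP_OF_ENTITIES = "group of entities"
    HOSTING = "hosting facility"
    GROUP_OF_PERSONS = "group of persons"
    PERSON = "person"
    PROJECT = "project"


# Sample classification of entities from the hierarchy.
examples = {
    "Debian": EntityType.ORGANIZATION,
    "GitHub Hosting": EntityType.GROUP_OF_ENTITIES,
    "snapshot.debian.org": EntityType.HOSTING,
    "hylang": EntityType.GROUP_OF_PERSONS,  # a GitHub organization
    "olasd": EntityType.PERSON,             # a GitHub user
    "bash": EntityType.PROJECT,             # a GNU project
}
```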
olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:04 PM