Page MenuHomeSoftware Heritage

DB schema: add missing unicity constraint on origin (type, url)
Closed, MigratedEdits Locked

Event Timeline

olasd raised the priority of this task from to Normal.
olasd updated the task description. (Show Details)
olasd added a project: Developers.
olasd added a subscriber: olasd.
olasd changed the visibility from "All Users" to "Public (No Login Required)".May 13 2016, 5:05 PM
zack added a project: Restricted Project.Feb 12 2017, 6:35 PM
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Feb 12 2017, 6:40 PM

There exist some duplicated origins (well, at least regarding the loader-tar's origins):

softwareheritage=> select * from origin where type='ftp' limit 10;
   id    | type |                      url                       | lister | project
---------+------+------------------------------------------------+--------+---------
 4423668 | ftp  | rsync://ftp.gnu.org/gnu/3dldf                  |        |
 4423671 | ftp  | rsync://ftp.gnu.org/gnu/3dldf                  |        |
 4423974 | ftp  | rsync://ftp.gnu.org/gnu/GNUinfo/Audio/francais |        |
 4423561 | ftp  | rsync://ftp.gnu.org/gnu/a2ps                   |        |
 4423564 | ftp  | rsync://ftp.gnu.org/gnu/a2ps                   |        |
 4423581 | ftp  | rsync://ftp.gnu.org/gnu/acct                   |        |
 4423582 | ftp  | rsync://ftp.gnu.org/gnu/acct                   |        |
 4423610 | ftp  | rsync://ftp.gnu.org/gnu/acm                    |        |
 4423681 | ftp  | rsync://ftp.gnu.org/gnu/adns                   |        |
 4423544 | ftp  | rsync://ftp.gnu.org/gnu/alive                  |        |

Taking as example the origin {rsync://ftp.gnu.org/gnu/3dldf,ftp} with ids 4423668 and 4423671.

At first, this seems to be indeed duplicated origins since the tarball names reference the same 'project name'.

softwareheritage=> select r.metadata#>>'{original_artifact,0,name}' as tarball from revision r inner join occurrence_history occ on r.id=occ.target where occ.origin=4423671;
      tarball
--------------------
 3DLDF-1.1.3.tar.gz
(1 row)

softwareheritage=> select r.metadata#>>'{original_artifact,0,name}' as tarball from revision r inner join occurrence_history occ on r.id=occ.target where occ.origin=4423668;
           tarball
-----------------------------
 3DLDF-1.1.4.tar.gz
 3DLDF-1.1.5.1.tar.gz
 3DLDF-1.1.5.tar.gz
 3DLDF-2.0.1.tar.gz
 3DLDF-2.0.2.tar.gz
 3DLDF-2.0.3-examples.tar.gz
 3DLDF-2.0.3.tar.gz
 3DLDF-2.0.tar.gz
(8 rows)

It might be due to the fact that the loading of the tarballs was anterior to the origin_visit schema updates.

Anyway, the point is that there will be a cleaning (merging?) step needed for that task prior to adding the constraint step.

In T49#12329, @ardumont wrote:

There exist some duplicated origins (well, at least regarding the loader-tar's origins):

softwareheritage=> select * from origin where type='ftp' limit 10;
   id    | type |                      url                       | lister | project
---------+------+------------------------------------------------+--------+---------
 4423668 | ftp  | rsync://ftp.gnu.org/gnu/3dldf                  |        |
 4423671 | ftp  | rsync://ftp.gnu.org/gnu/3dldf                  |        |

Nice catch.

My current thinking on the general topic of "what are origins for distributions/package manager environments" is that in those contexts an origin should be a pair <distributor, package>. So, for instance, <pypi, django>, or <debian, ocaml>.
(It is possible that this wasn't yet clear at the time of the first GNU injection.)

According to that interpretation, the distributor here is GNU, so "3dldf at GNU", "acct at GNU", etc. should indeed all be merged in a single origin.

Comments on the general idea welcome. As well as query suggestions on how to fix this in the DB.

My current thinking on the general topic of "what are origins for distributions/package manager environments" is that in those contexts an origin should be a pair <distributor, package>. So, for instance, <pypi, django>, or <debian, ocaml>.

This sounds reasonable.

I'm wondering if it's not the idea behind the project column (foreign key on table entity's uuid, gnu is one such entity).
And if it's not, maybe it's still related.

'T3 - Discuss the project <-> origin mapping' to the rescue?

(It is possible that this wasn't yet clear at the time of the first GNU injection.)

It was not indeed :D

Ok, i've taken a closer look at the duplications. So far, only duplication in origins with type 'ftp' (from gnu injection) and 'git':

origin# duplicated
ftp32
git5555

Details:

swh-scratch=# select count(distinct url) from origin where type='ftp';
 count
-------
  1205
(1 row)

swh-scratch=# select count(url) from origin where type='ftp';
 count
-------
  1237
(1 row)

swh-scratch=# select count(url) from origin where type='git';
  count
----------
 57605238
(1 row)

swh-scratch=# select count(distinct url) from origin where type='git';                                                                                                                                                                        
  count
----------
 57599683
(1 row)

The other origins are fine (deb, svn):

swh-scratch=# select distinct type from origin;
 type
------
 deb
 ftp
 git
 svn
(4 rows)

swh-scratch=# select count(distinct url) from origin where type='svn';
 count
--------
 381328
(1 row)

swh-scratch=# select count(url) from origin where type='svn';
 count
--------
 381328
(1 row)

swh-scratch=# select count(url)  from origin where type='deb';
 count
-------
 33044
(1 row)

swh-scratch=# select count(distinct url)  from origin where type='deb';
 count
-------
 33044
(1 row)

Note:
swh-scratch is a local db with a snapshot of table origin (from yesterday - 29/03/2017 or so).

olasd claimed this task.

The missing unique constraint has now been added.

  • list duplicate origins
  • remove duplicate origins with no visits
  • merge origin_visit entries (ordered by date)
  • migrate fetch_history and origin_metadata to the new entries
  • remove duplicate origins (with no references left)
  • add unique index on origin(url);