Related Objects
- Mentioned Here
- T3: Discuss the project <-> origin mapping
Event Timeline
There exist some duplicated origins (well, at least regarding the loader-tar's origins):
softwareheritage=> select * from origin where type='ftp' limit 10; id | type | url | lister | project ---------+------+------------------------------------------------+--------+--------- 4423668 | ftp | rsync://ftp.gnu.org/gnu/3dldf | | 4423671 | ftp | rsync://ftp.gnu.org/gnu/3dldf | | 4423974 | ftp | rsync://ftp.gnu.org/gnu/GNUinfo/Audio/francais | | 4423561 | ftp | rsync://ftp.gnu.org/gnu/a2ps | | 4423564 | ftp | rsync://ftp.gnu.org/gnu/a2ps | | 4423581 | ftp | rsync://ftp.gnu.org/gnu/acct | | 4423582 | ftp | rsync://ftp.gnu.org/gnu/acct | | 4423610 | ftp | rsync://ftp.gnu.org/gnu/acm | | 4423681 | ftp | rsync://ftp.gnu.org/gnu/adns | | 4423544 | ftp | rsync://ftp.gnu.org/gnu/alive | |
Taking as example the origin {rsync://ftp.gnu.org/gnu/3dldf,ftp} with ids 4423668 and 4423671.
At first, this seems to be indeed duplicated origins since the tarball names reference the same 'project name'.
softwareheritage=> select r.metadata#>>'{original_artifact,0,name}' as tarball from revision r inner join occurrence_history occ on r.id=occ.target where occ.origin=4423671; tarball -------------------- 3DLDF-1.1.3.tar.gz (1 row) softwareheritage=> select r.metadata#>>'{original_artifact,0,name}' as tarball from revision r inner join occurrence_history occ on r.id=occ.target where occ.origin=4423668; tarball ----------------------------- 3DLDF-1.1.4.tar.gz 3DLDF-1.1.5.1.tar.gz 3DLDF-1.1.5.tar.gz 3DLDF-2.0.1.tar.gz 3DLDF-2.0.2.tar.gz 3DLDF-2.0.3-examples.tar.gz 3DLDF-2.0.3.tar.gz 3DLDF-2.0.tar.gz (8 rows)
It might be due to the fact that the loading of the tarballs was anterior to the origin_visit schema updates.
Anyway, the point is that there will be a cleaning (merging?) step needed for that task prior to adding the constraint step.
Nice catch.
My current thinking on the general topic of "what are origins for distributions/package manager environments" is that in those contexts an origin should be a pair <distributor, package>. So, for instance, <pypi, django>, or <debian, ocaml>.
(It is possible that this wasn't yet clear at the time of the first GNU injection.)
According to that interpretation, the distributor here is GNU, so "3dldf at GNU", "acct at GNU", etc. should indeed all be merged in a single origin.
Comments on the general idea welcome. As well as query suggestions on how to fix this in the DB.
My current thinking on the general topic of "what are origins for distributions/package manager environments" is that in those contexts an origin should be a pair <distributor, package>. So, for instance, <pypi, django>, or <debian, ocaml>.
This sounds reasonable.
I'm wondering if it's not the idea behind the project column (foreign key on table entity's uuid, gnu is one such entity).
And if it's not, maybe it's still related.
'T3 - Discuss the project <-> origin mapping' to the rescue?
(It is possible that this wasn't yet clear at the time of the first GNU injection.)
It was not indeed :D
Ok, i've taken a closer look at the duplications. So far, only duplication in origins with type 'ftp' (from gnu injection) and 'git':
origin | # duplicated |
ftp | 32 |
git | 5555 |
Details:
swh-scratch=# select count(distinct url) from origin where type='ftp'; count ------- 1205 (1 row) swh-scratch=# select count(url) from origin where type='ftp'; count ------- 1237 (1 row) swh-scratch=# select count(url) from origin where type='git'; count ---------- 57605238 (1 row) swh-scratch=# select count(distinct url) from origin where type='git'; count ---------- 57599683 (1 row)
The other origins are fine (deb, svn):
swh-scratch=# select distinct type from origin; type ------ deb ftp git svn (4 rows) swh-scratch=# select count(distinct url) from origin where type='svn'; count -------- 381328 (1 row) swh-scratch=# select count(url) from origin where type='svn'; count -------- 381328 (1 row) swh-scratch=# select count(url) from origin where type='deb'; count ------- 33044 (1 row) swh-scratch=# select count(distinct url) from origin where type='deb'; count ------- 33044 (1 row)
Note:
swh-scratch is a local db with a snapshot of table origin (from yesterday - 29/03/2017 or so).
The missing unique constraint has now been added.
- list duplicate origins
- remove duplicate origins with no visits
- merge origin_visit entries (ordered by date)
- migrate fetch_history and origin_metadata to the new entries
- remove duplicate origins (with no references left)
- add unique index on origin(url);