Page MenuHomeSoftware Heritage

swh-loader-tar origin validities are the current time instead of the mirroring time
Closed, MigratedEdits Locked

Description

select * from origin
         left join origin_visit
         on origin_visit.origin = origin.id
         where origin.id = 4475375

on the softwareheritage database shows that the glibc origin has been visited 187 times. swh-loader-tar should use the time of mirroring instead of the current time to create the occurrences

Event Timeline

Hint: Use the TIMESTAMP file which holds the date of the last rsync.

cat /srv/softwareheritage/mirrors/gnu.org/TIMESTAMP
Fri, 28 Aug 2015 13:13:26 +0200

swh-loader-tar should use the time of mirroring instead of the current time to create the occurrences

The code has been adapted for that.

Now remains the swh data to update in the db.

The gnu urls are of the form 'rsync://', that's the way to have hold on the origins to update.

softwareheritage=> select count(*) from origin where url like 'rsync://%gnu%';
 count
-------
  1237
(1 row)

softwareheritage=> select count(*) from origin where url like 'rsync://%';
 count
-------
  1237
(1 row)

List of impacted origins to update:

softwareheritage=> select count(id) from origin_visit v inner join origin o on o.id=v.origin where o.url like 'rsync://%';
 count
-------
  9772
(1 row)

There are also duplicates that need to be removed.
There should only be one origin_visit.
So the 187 number here should be 1.

Indeed.

softwareheritage=> select count(distinct id) from origin_visit v inner join origin o on o.id=v.origin where o.url like 'rsync://%';
 count
-------
  1237
(1 row)

So a modus operandi here could be:

  • update one origin id (amongst its duplicates) with the right date (cf. TIMESTAMP file's content)
  • delete all other origin_visit with same id where date < TIMESTAMP (maybe <> is better because there could exists date > TIMESTAMP's one)...
zack raised the priority of this task from Low to High.

In a transaction with swhstorage user on softwareheritage db:

-- select the origin ids concerned by the update
select * from origin_visit v join origin o on o.id = v.origin where o.url like 'rsync://%gnu%';

-- delete duplicate entries (at the end of this 1227 entries)
with tarballs as (select id from origin where url like 'rsync://%gnu%')
delete from origin_visit
where visit > 1
and origin in (select * from tarballs);

-- update the entries with the right date
with tarballs as (select id from origin where url like 'rsync://%gnu%')
update origin_visit
set date='Fri, 28 Aug 2015 13:13:26 +0200'
where origin in (select * from tarballs);

-- check the output is as expected (only 1227 entries with all the same mirror dates)
select * from origin_visit v join origin o on o.id = v.origin where o.url like 'rsync://%gnu%';
olasd changed the visibility from "All Users" to "Public (No Login Required)".May 13 2016, 5:08 PM