The tables archives and content_archive should be initialized with, respectively, the archive servers and the content. content_archive should be populated so that content already copied to Banco is marked as 'present'.
Description
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T239 preserve at least 2 copies of each content object
Migrated | gitlab-migration | T240 content archiver
Migrated | gitlab-migration | T482 First swh-storage-archiver run to catch up uffizi
Migrated | gitlab-migration | T412 Bootstrap archiver's database
Migrated | gitlab-migration | T484 List banco's current sha1s for injection in archiver db
Event Timeline
Running on banco:
    INSERT INTO archives(id, url)
    VALUES ('Banco', 'http://banco.softwareheritage.org:5003/');

    begin;

    -- prepare data
    CREATE TABLE content_archive_tmp (
        content_id sha1 REFERENCES content(sha1),
        PRIMARY KEY (content_id)
    );

    \copy content_archive_tmp from 'content-id-by-ctime.after-T7.txt';

    -- insert into the real production table
    insert into content_archive (content_id, archive_id, status, mtime)
    select content_id, 'Banco', 'present',
           '2016-02-04 14:19:59.000000000 +0000'::timestamptz
    from content_archive_tmp;

    -- drop temporary table
    drop table content_archive_tmp;
Note:
- the previous snippet may need changes after the \copy instruction (it has not been tested yet).
- the timestamp used for the insert is the modification time of the .txt file holding the list of sha1s we inject
- the 'archives' table should be renamed to 'archive' to match our naming convention
- archive_id is TEXT; I'm uneasy with that since it is repeated for every content we have (that's quite a lot of repetition), and a simple integer should be enough -> I haven't measured the impact on the archiver code yet, though
- status is also TEXT and could be replaced with an enum or something
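For reference, the note's timestamp can be derived programmatically from the list file; a minimal sketch in Python (the path is the list file from the snippet above and only exists on the machine holding it):

```python
import os
from datetime import datetime, timezone

def file_mtime_utc(path):
    """Return a file's modification time as an aware UTC datetime,
    i.e. the kind of value used for content_archive.mtime above."""
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)

# Example (path from the snippet; adjust locally):
# file_mtime_utc('content-id-by-ctime.after-T7.txt')
```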
This is a failure for now.
After multiple attempts over the weekend, there is not enough disk space for the process to finish.
I see 2 ways to improve this:
- grow /srv/softwareheritage/postgres on prado.
- create the archiver's db on /srv/softwareheritage/postgres-hdd (like some other dbs we keep on the side).
If I understand the hardware correctly, the postgres partition is on SSD and the postgres-hdd partition is on standard (slower) spinning disk.
I'm more inclined toward option 2 because:
- I don't think a slightly slower archiver db would be much of a problem.
- if it doesn't work out, we can always migrate back to option 1
- I have multiple blocking points for option 1 (How much is it reasonable to grow the partition? Do we even have the resources to do so? Also, @olasd showed me how to grow the partition but I don't remember since I did not try...)
'status' is of type archive_status, which is already an enum. I guess Postgres does the right thing internally (storing enum values as integers).
    CREATE TYPE archive_status AS ENUM (
        'missing',
        'ongoing',
        'present'
    );
archive_id is a foreign key. I assumed Postgres would nicely do the job for us. If that's not the case, we clearly need to change it.
When contents are missing, the date is not relevant, so any value will do.
- I totally agree with the archives > archive change.
- As we said on IRC, the foreign key from content_archive.id to content.sha1 makes the creation of a single archiver db quite awkward.
'status' is of type archive_status, which is already an enum. I guess Postgres does the right thing internally (storing enum values as integers).
Yep, sorry, I did not change my remark.
I saw this this morning when I took a closer look.
Good news, then: one less change to do ^^
archive_id is a foreign key. I assumed Postgres would nicely do the job for us. If that's not the case, we clearly need to change it.
I do hope so as well ^^
When contents are missing, the date is not relevant, so any value will do.
ok
I totally agree with the archives > archive change.
ok
As we said on IRC, the foreign key from content_archive.id to content.sha1 makes the creation of a single archiver db quite awkward.
Indeed, we'll wait until we have some more space then ^^.
As we said on IRC, the foreign key from content_archive.id to content.sha1 makes the creation of a single archiver db quite awkward.
As discussed on IRC, we could, as a first approximation, drop this constraint since our identifiers are quite stable (we never delete anything).
Indeed, we'll wait until we have some more space then ^^.
Not necessarily; we need to discuss this, but we could leverage our queue system (or, in a possible future, Kafka) to notify that some new contents have been added, and then update the archiver db from those notifications.
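A rough sketch of that idea, purely illustrative (the notification shape and the function name are made up, not taken from the actual swh code): each "contents added" message from the queue would translate into 'missing' rows for every archive, which the archiver then picks up and copies.

```python
# Hypothetical sketch: turn a "contents added" notification into rows for
# the archiver db. The payload shape and names here are assumptions.

def rows_from_notification(notification, archive_ids):
    """New contents start out 'missing' on each archive, so the archiver
    knows it still has to copy them."""
    return [
        (content_id, archive_id, 'missing')
        for content_id in notification['added']
        for archive_id in archive_ids
    ]

rows = rows_from_notification({'added': ['c1', 'c2']}, ['Banco'])
# Each tuple maps to INSERT INTO content_archive(content_id, archive_id, status).
```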
Otherwise, regarding space: T486 is in progress, and T413 as well (or so I gather from @rdicosmo's email).
By moving softwareheritage-log from the SSD to the HDD (T487), we reclaimed 1.1T on the SSD (which was the initial blocking point).
So now we can try to inject the archiver's bootstrap data again and finally... run it ^^
For info, I ran another failed attempt yesterday (Saturday, the 16th of July 2016).
It stopped before finishing.
Trying the following:
    CREATE TABLE content_archive (
        content_id sha1 REFERENCES content(sha1),
        archive_id archive_id default 'Banco' REFERENCES archives(id),
        status archive_status default 'present',
        mtime timestamptz default '2016-02-04 14:19:59.000000000 +0000'::timestamptz,
        PRIMARY KEY (content_id, archive_id)
    );

    \copy content_archive(content_id) from '/srv/storage/space/lists/todb';
Note: the default values are temporary, just for the duration of the bootstrap.
and to effectively load the data:
    ardumont@prado:/srv/storage/space/lists$ mkfifo todb
    ardumont@prado:/srv/storage/space/lists$ pv content-id-by-ctime.after-T7.txt.gz -s 70g -e -a -t | pigz -dc > todb
It failed at around 500M lines and was stopped for lack of space yet again.
Looks like, for now, the only way is to store this in the other cluster (rotating HDDs, so another mount point).
That effectively means dropping the foreign key constraint on content_id.
And then improving the swh.storage.storage.content_add API function with some way of notifying that new contents were added.
Also, regarding the db: softwareheritage-archiver has been created with the following schema.
First run (TL;DR: too slow, so stopped)
As of Monday the 18th, a data-injection process was running on prado in a tmux session (under ardumont).
It was too slow, so I stopped it just now.
The idea was to use direct injection into an altered content_archive table with default values.
Done: 447,951,000 rows in ~24h.
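For scale, a back-of-the-envelope computation (the ~571M total is the planner's row estimate quoted further down in this thread):

```python
# Throughput of the stopped first run: ~447.95M rows in roughly 24 hours.
rows_done = 447_951_000
elapsed_s = 24 * 3600
rate = rows_done / elapsed_s              # ~5,185 rows per second

# Assuming ~571M rows total (the planner's estimate), the full load would
# have taken well over a day at this pace.
total_rows = 571_246_104
eta_hours = total_rows / rate / 3600      # ~30.6 hours
```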
As postgres user, using psql on softwareheritage-archiver:
    begin;

    DROP TABLE content_archive;

    CREATE TABLE content_archive (
        content_id sha1,
        archive_id archive_id default 'Banco' REFERENCES archive(id),
        status archive_status default 'present',
        mtime timestamptz default '2016-02-04 14:19:59.000000000 +0000'::timestamptz,
        PRIMARY KEY (content_id, archive_id)
    );

    COPY content_archive(content_id) from '/var/lib/postgres/todb';

    -- Alter content_archive to remove default values
    -- ... (yet to be determined)

    commit;

This must have been too slow because of the index being maintained during the load, and maybe the default-value handling...
Second run (so far so good)
As postgres user, using psql on softwareheritage-archiver:
    CREATE TABLE content_archive_tmp (content_id sha1);

    COPY content_archive_tmp(content_id) from '/var/lib/postgres/todb';

    INSERT INTO content_archive(content_id, archive_id, status, mtime)
        SELECT content_id, 'Banco', 'present',
               '2016-02-04 14:19:59.000000000 +0000'::timestamptz
        FROM content_archive_tmp;

    DROP TABLE content_archive_tmp;
And in another tmux pane:
    postgres@prado:~$ pv /srv/storage/space/lists/content-id-by-ctime.after-T7.txt.gz -s 70g -e -a -t | pigz -dc > todb
This goes way faster for now (for the pure copy at least):
    postgres@prado:~$ pv /srv/storage/space/lists/content-id-by-ctime.after-T7.txt.gz -s 70g -e -a -t | pigz -dc > todb
    0:16:12 [15.3MiB/s] ETA 1:02:04

    softwareheritage-archiver=# explain select count(*) from content_archive_tmp;
                                          QUERY PLAN
    ---------------------------------------------------------------------------------------
     Aggregate  (cost=11340915.30..11340915.31 rows=1 width=0)
       ->  Seq Scan on content_archive_tmp  (cost=0.00..9912800.04 rows=571246104 width=0)
    (2 rows)
w00t
According to the documentation, to defer a constraint, said constraint must first be declared DEFERRABLE (which is not the default).
So, first, change those constraints in the original table:
    DROP TABLE content_archive;

    -- make each constraint DEFERRABLE (they are not by default)
    CREATE TABLE content_archive (
        content_id sha1,
        archive_id archive_id REFERENCES archive(id) DEFERRABLE,
        status archive_status,
        mtime timestamptz,
        PRIMARY KEY (content_id, archive_id) DEFERRABLE
    );
Then, only inside a transaction can we adjust those constraints adequately:
    BEGIN;

    SET CONSTRAINTS content_archive_pkey DEFERRED;
    SET CONSTRAINTS content_archive_archive_id_fkey DEFERRED;

    INSERT INTO content_archive(content_id, archive_id, status, mtime)
        SELECT content_id, 'Banco', 'present',
               '2016-02-04 14:19:59.000000000 +0000'::timestamptz
        FROM content_archive_tmp;

    -- revert back to default
    SET CONSTRAINTS content_archive_pkey IMMEDIATE;
    SET CONSTRAINTS content_archive_archive_id_fkey IMMEDIATE;

    COMMIT;
Note:
The constraint names were found using '\d content_archive'.
    softwareheritage-archiver=# \d content_archive
            Table "public.content_archive"
       Column   |           Type           | Modifiers
    ------------+--------------------------+-----------
     content_id | sha1                     | not null
     archive_id | archive_id               | not null
     status     | archive_status           |
     mtime      | timestamp with time zone |
    Indexes:
        "content_archive_pkey" PRIMARY KEY, btree (content_id, archive_id) DEFERRABLE
    Foreign-key constraints:
        "content_archive_archive_id_fkey" FOREIGN KEY (archive_id) REFERENCES archive(id) DEFERRABLE
OK, so: status update. Faster, but still too slow.
About one third done in ~24h or so.
The first part of the copy was a happy moment.
But the insert part after that is not the way to go.
Of course, I did not find the right documentation; hat tip to @zack for noticing my misguided ways.
So effectively, we must:
- use COPY all the way
- drop indexes and constraints altogether
- either keep the default values in the table being populated (we chose that), or rework the inputs (what's read from the todb fifo) to add the missing values
- create the index and constraints after that (it's faster that way according to the docs)
So here it goes:
    DROP TABLE content_archive_tmp;
    DROP TABLE content_archive;

    CREATE TABLE content_archive (
        content_id sha1,
        archive_id archive_id default 'banco',  -- REFERENCES archive(id),
        status archive_status default 'present',
        mtime timestamptz default '2016-02-04 14:19:59.000000000 +0000'::timestamptz
        -- PRIMARY KEY (content_id, archive_id)
    );

    COPY content_archive(content_id) from '/var/lib/postgres/todb';
And on the side still:
    postgres@prado:~$ mkfifo todb
    postgres@prado:~$ pigz -dc /srv/storage/space/lists/content-id-by-ctime.after-T7.txt.gz | pv --progress --timer --eta --rate --average-rate --size 70g > todb
And it's done.
It took around 45 minutes; awesome.
So, next step: recreate the proper setup on the table (drop the temporary defaults, add back the index and constraints):
    \timing

    ALTER TABLE content_archive ALTER COLUMN archive_id DROP DEFAULT;
    ALTER TABLE content_archive ALTER COLUMN status DROP DEFAULT;
    ALTER TABLE content_archive ALTER COLUMN mtime DROP DEFAULT;
    ALTER TABLE content_archive ADD PRIMARY KEY(content_id, archive_id);
    ALTER TABLE content_archive ADD FOREIGN KEY(archive_id) REFERENCES archive(id);
still running...
Primary key done (it took ~7 hours).
    softwareheritage-archiver=# ALTER TABLE content_archive ALTER COLUMN archive_id DROP DEFAULT;
    ALTER TABLE
    Time: 78.389 ms
    softwareheritage-archiver=# ALTER TABLE content_archive ALTER COLUMN status DROP DEFAULT;
    ALTER TABLE
    Time: 2.648 ms
    softwareheritage-archiver=# ALTER TABLE content_archive ALTER COLUMN mtime DROP DEFAULT;
    ALTER TABLE
    Time: 4.029 ms
    softwareheritage-archiver=# alter table content_archive add primary key(content_id, archive_id);
    ALTER TABLE
    Time: 25526460.506 ms
En route for the foreign key... ^^
And we are done
    softwareheritage-archiver=# ALTER TABLE content_archive ADD FOREIGN KEY(archive_id) REFERENCES archive(id);
    ALTER TABLE
    Time: 593917.431 ms
^^
To sum up:
    softwareheritage-archiver=# \d+
                                List of relations
     Schema |      Name       | Type  |   Owner    |  Size  |      Description
    --------+-----------------+-------+------------+--------+------------------------
     public | archive         | table | postgres   | 16 kB  |
     public | content_archive | table | postgres   | 107 GB |
     public | dbversion       | table | swhstorage | 16 kB  | Schema update tracking
    (3 rows)

    softwareheritage-archiver=# \di+
                                       List of relations
     Schema |         Name         | Type  |   Owner    |      Table      | Size  | Description
    --------+----------------------+-------+------------+-----------------+-------+-------------
     public | archive_pkey         | index | postgres   | archive         | 16 kB |
     public | content_archive_pkey | index | postgres   | content_archive | 78 GB |
     public | dbversion_pkey       | index | swhstorage | dbversion       | 16 kB |
    (3 rows)

    softwareheritage-archiver=# \d+ content_archive;
                             Table "public.content_archive"
       Column   |           Type           | Modifiers | Storage  | Stats target | Description
    ------------+--------------------------+-----------+----------+--------------+-------------
     content_id | sha1                     | not null  | extended |              |
     archive_id | archive_id               | not null  | plain    |              |
     status     | archive_status           |           | plain    |              |
     mtime      | timestamp with time zone |           | plain    |              |
    Indexes:
        "content_archive_pkey" PRIMARY KEY, btree (content_id, archive_id)
    Foreign-key constraints:
        "content_archive_archive_id_fkey" FOREIGN KEY (archive_id) REFERENCES archive(id)
Status:
- contents present on uffizi and missing on banco have been added to softwareheritage-archiver.content_archive
- the index on content_archive is (still) being built.
    create unique index concurrently content_archive_pk on content_archive(content_id);
For information, I did not clean up softwareheritage.content_archive_tmp (just in case).
The list of contents present on uffizi but not on banco is stored at /srv/storage/space/lists/content-id-uffizi-not-on-banco.txt.gz