Page MenuHomeSoftware Heritage

Bootstrap archiver's database
Closed, MigratedEdits Locked

Description

The tables archives and content_archive should be initialized with respectively the archives servers, and the content. content_archive should be filled properly to have as 'present' on Banco the already copied content.

Event Timeline

qcampos created this object in space S1 Public.
zack removed a project: Restricted Project.May 31 2016, 11:23 AM

Running on banco:

INSERT INTO archives(id, url)
VALUES('Banco', 'http://banco.softwareheritage.org:5003/');

begin;

-- prepare data

CREATE TABLE content_archive_tmp (
  content_id  sha1 REFERENCES content(sha1),
  PRIMARY KEY (content_id)
);

\copy content_archive_tmp from 'content-id-by-ctime.after-T7.txt';

-- insert into the real production table

insert into content_archive (sha1, archive_id, status, mtime)
select sha1, 'Banco', 'present', '2016-02-04 14:19:59.000000000 +0000'::timestamptz
from content_archive_tmp;

-- drop temporary table

drop table content_archive_tmp;

Note:

  • the previous snippet may be changed after the \copy instruction (since it has not yet been tested).
  • the timestamp used for the insert is the modify time of the .txt file holding the list of sha1s we inject
  • 'archives' table should be renamed 'archive' to respect our naming convention
  • archive_id is TEXT, i'm uneasy with that since we repeat it with every content we have (that's quite a huge repetition), a simple integer should be enough -(> i don't measure yet the impact on the archiver code though)
  • status is also TEXT and could be replaced with an enum or something

This is a failure for now.
After multiple attempts in the week-end, there is not enough space on disk for the process to finish.

I see 2 ways to improve this:

  1. grow /srv/softwareheritage/postgres on prado.
  2. create the archiver's db on /srv/softwareheritage/postgres-hdd (as some other db we have on the side).

If i understand right about the hardware postgres partition is ssd, and postgres-hdd partition is standard disk (so slower).

I'm more inclined to 2 as:

  • I don't think this could be much of a problem to have an archiver db slightly slower.
  • if it's not good we can always migrate back to 1
  • i have multiple blocking points for the solution 1 (How much is it reasonable to grow the partition? Do we even have the resources to do so? Also, @olasd showed me how to grow the partition disk but i don't remember since i did not try...)
In T412#7853, @ardumont wrote:
  • status is also TEXT and could be replaced with an enum or something

'status' is of type archive_status which is already an enum. I guess that Postgres do the right thing with integers.

CREATE TYPE archive_status AS ENUM (
  'missing',
  'ongoing',
  'present'
);
In T412#7853, @ardumont wrote:
  • archive_id is TEXT, i'm uneasy with that since we repeat it with every content we have (that's quite a huge repetition), a simple integer should be enough -(> i don't measure yet the impact on the archiver code though)

archive_id is a foreign key. I assumed that postgres would nicely do the job for us. If it's not the case, we clearly need to change that.

In T412#7853, @ardumont wrote:
  • the timestamp used for the insert is the modify time of the .txt file holding the list of sha1s we inject

When content are missing, the date is not relevant, so any would do the job.


  • I totally agree with the archives > archive change.
  • As we said on irc, the foreign key from content_archive.id to content.sha1 makes the creation of an archiver single db quite uneasy.

'status' is of type archive_status which is already an enum. I guess that Postgres do the right thing with integers.

Yep, sorry i did not change my remark.
I saw this this morning when i took a closer look.
A good news then, one less change to do ^^

archive_id is a foreign key. I assumed that postgres would nicely do the job for us. If it's not the case, we clearly need to change that.

I do hope so as well ^^

When content are missing, the date is not relevant, so any would do the job.

ok

I totally agree with the archives > archive change.

ok

As we said on irc, the foreign key from content_archive.id to content.sha1 makes the creation of an archiver single db quite uneasy.

Indeed, we'll wait to have some more space then ^^.

ardumont added a subtask: Unknown Object (Maniphest Task).Jul 11 2016, 2:27 PM
As we said on irc, the foreign key from content_archive.id to content.sha1 makes the creation of an archiver single db quite uneasy.

As discussed on irc, we could, as a first approximation, drop this constraint since our identifier are quite stable (we never delete anything).

Indeed, we'll wait to have some more space then ^^.

Not necessarily, we need to discuss this but we could leverage our queue system (or in the possible future system kafka) to notify that some new contents have been added. And then update the archiver db from those notifications.

Otherwise, for the space T486 is in progress, and T413 in progress as well (or so i think from @rdicosmo's email)

Moving softwareheritage-log from the ssd to hdd (T487), we reclaimed 1.1T of data on the ssd (which were the initial blocking point).
So now, we can try to inject back the archiver's bootstrap data to finally... run it ^^

ardumont changed the task status from Open to Work in Progress.Jul 16 2016, 10:21 AM

For info, i ran another failed attempt yesterday (Saturday the 16th of July 2016).
This stopped before finishing.

Trying the following:

CREATE TABLE content_archive (
  content_id  sha1 REFERENCES content(sha1),
  archive_id  archive_id default 'Banco'  REFERENCES archives(id),
  status      archive_status default 'present',
  mtime       timestamptz default '2016-02-04 14:19:59.000000000 +0000'::timestamptz,
  PRIMARY KEY (content_id, archive_id)
);

\copy content_archive from '/srv/storage/space/lists/todb';

Note: Default values being temporary the time to bootstrap.

and to effectively load the data:

ardumont@prado:/srv/storage/space/lists$ mkfifo todb
ardumont@prado:/srv/storage/space/lists$ pv content-id-by-ctime.after-T7.txt.gz -s 70g -e -a -t | pigz -dc > todb

It failed around 500M lines and was stopped for missing space yet again.

Looks like, for now, the only way is to store this is in the other cluster (hdd rotating spin, so other mount point).
Thus effectively dropping the constraint about the foreign key on content_id.

And then improve the swh.storage.storage.content_add api function with some way of notifying we added new contents.

Related:

  • swh-environment: 151e779
  • swh-core: 23ee17f
  • swh-storage-testdata: c6b8add
  • swh-storage: 73139f9

(forgot to add the Related keyword in commits... and already pushed so better luck next time...)

Also in regards to db, softwareheritage-archiver has been created with the following schema.

First run (TL; DR - too slow so stopped)

As of Monday the 18th on prado, was running in a tmux session (under ardumont) a process to inject data.

This was too slow so i stopped it just now.
The idea was to use direct injection in an altered content_archive table with default values.
Done 447951000 in ~24h.

As postgres user, using psql on softwareheritage-archiver:

begin;

DROP TABLE content_archive;

CREATE TABLE content_archive (
  content_id  sha1,
  archive_id  archive_id default 'Banco'  REFERENCES archive(id),
  status      archive_status default 'present',
  mtime       timestamptz default '2016-02-04 14:19:59.000000000 +0000'::timestamptz,
  PRIMARY KEY (content_id, archive_id)
);

COPY content_archive(content_id) from '/var/lib/postgres/todb';

-- Alter content_archive to remove default values
-- ... (yet to be determined)

commit;

This must have been too slow for the index creation done at the same time and maybe the default values policy...

second run (so far so good)

As postgres user, using psql on softwareheritage-archiver:

CREATE table content_archive_tmp(content_id sha1);
COPY content_archive_tmp(content_id) from '/var/lib/postgres/todb';

INSERT INTO content_archive(content_id, archive_id, status, mtime)
SELECT content_id, 'Banco', 'present', '2016-02-04 14:19:59.000000000 +0000'::timestamptz
FROM content_archive_tmp;

DROP content_archive_tmp;

And in another tmux pane:

postgres@prado:~$ pv /srv/storage/space/lists/content-id-by-ctime.after-T7.txt.gz -s 70g -e -a -t | pigz -dc > todb

This goes way faster for now (for the pure copy at least):

postgres@prado:~$ pv /srv/storage/space/lists/content-id-by-ctime.after-T7.txt.gz -s 70g -e -a -t | pigz -dc > todb
0:16:12 [15.3MiB/s] ETA 1:02:04

softwareheritage-archiver=# explain select count(*) from content_archive_tmp;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=11340915.30..11340915.31 rows=1 width=0)
   ->  Seq Scan on content_archive_tmp  (cost=0.00..9912800.04 rows=571246104 width=0)
(2 rows)

w00t

w00t

and the first part is done ^^

0:41:23 [15.8MiB/s]

According to documentation, to defer a constraint, first said constraint must be deferrable (which it is not the default).

So changing first in the original table those constraints:

DROP TABLE content_archive;

-- make each constraint deferrable by default (they are not by default)
CREATE TABLE content_archive (
  content_id  sha1,
  archive_id  archive_id REFERENCES archive(id) DEFERRABLE,
  status      archive_status,
  mtime       timestamptz,
  PRIMARY KEY (content_id, archive_id) DEFERRABLE 
);

Then, only inside the transaction can we change those constraints property adequately:

BEGIN;

SET CONSTRAINTS content_archive_pkey DEFERRED; 
SET CONSTRAINTS content_archive_archive_id_fkey DEFERRED;

INSERT INTO content_archive(content_id, archive_id, status, mtime)
SELECT content_id, 'Banco', 'present', '2016-02-04 14:19:59.000000000 +0000'::timestamptz
FROM content_archive_tmp;

-- revert back to default
SET CONSTRAINTS content_archive_pkey IMMEDIATE; 
SET CONSTRAINTS content_archive_archive_id_fkey IMMEDIATE;

COMMIT;

Note:
The constraint names were found using '\d content_archive'.

softwareheritage-archiver=# \d content_archive
          Table "public.content_archive"
   Column   |           Type           | Modifiers
------------+--------------------------+-----------
 content_id | sha1                     | not null
 archive_id | archive_id               | not null
 status     | archive_status           |
 mtime      | timestamp with time zone |
Indexes:
    "content_archive_pkey" PRIMARY KEY, btree (content_id, archive_id) DEFERRABLE
Foreign-key constraints:
    "content_archive_archive_id_fkey" FOREIGN KEY (archive_id) REFERENCES archive(id) DEFERRABLE
ardumont renamed this task from Bootstrap database's archival tables to Bootstrap archiver's database.Jul 19 2016, 4:00 PM

Ok so, status, faster but still too slow.
One third done in ~24h or so.

The first part of the copy was a happy moment.
But the insert part after that is not the way to go.

Of course, i did not find the right documentation, hat tip to @zack to notice my misguided ways.

So effectively, we must:

  • use copy all the way
  • drop index and constraints altogether
  • either keep the default values in table to populate (we choose that), either rework the inputs (what's read from the todb fifo) to add the missing values
  • create index and constraints after that (it's faster that way according to the doc)

So here it goes:

DROP TABLE content_archive_tmp;
DROP TABLE content_archive;

CREATE TABLE content_archive (
  content_id  sha1,
  archive_id  archive_id default 'banco', -- REFERENCES archive(id),
  status      archive_status default 'present',
  mtime       timestamptz default '2016-02-04 14:19:59.000000000 +0000'::timestamptz
--  PRIMARY KEY (content_id, archive_id)
);

COPY content_archive(content_id) from '/var/lib/postgres/todb';

And on the side still:

postgres@prado:~$ mkfifo todb
postgres@prado:~$ pigz -dc /srv/storage/space/lists/content-id-by-ctime.after-T7.txt.gz | pv --progress --timer --eta --rate --average-rate --size 70g > todb

And it's done.
It took around 45min, awesome.

So next step, we recreate the right information on table (index, constraint, default values):

\timing
ALTER TABLE content_archive ALTER COLUMN archive_id DROP DEFAULT;
ALTER TABLE content_archive ALTER COLUMN status DROP DEFAULT;
ALTER TABLE content_archive ALTER COLUMN mtime DROP DEFAULT;
ALTER TABLE content_archive ADD PRIMARY KEY(content_id, archive_id);
ALTER TABLE content_archive ADD FOREIGN KEY(archive_id) REFERENCES archive(id);

still running...

Primary key done

softwareheritage-archiver=# ALTER TABLE content_archive ALTER COLUMN archive_id DROP DEFAULT;
ALTER TABLE
Time: 78.389 ms
softwareheritage-archiver=# ALTER TABLE content_archive ALTER COLUMN status DROP DEFAULT;
ALTER TABLE
Time: 2.648 ms
softwareheritage-archiver=# ALTER TABLE content_archive ALTER COLUMN mtime DROP DEFAULT;
ALTER TABLE
Time: 4.029 ms
softwareheritage-archiver=# alter table content_archive add primary key(content_id, archive_id);

ALTER TABLE
Time: 25526460.506 ms

En route for the foreign key... ^^

And we are done

softwareheritage-archiver=# ALTER TABLE content_archive ADD FOREIGN KEY(archive_id) REFERENCES archive(id);
ALTER TABLE
Time: 593917.431 ms

^^

Sum up:

softwareheritage-archiver=# \d+
                                List of relations
 Schema |      Name       | Type  |   Owner    |  Size  |      Description
--------+-----------------+-------+------------+--------+------------------------
 public | archive         | table | postgres   | 16 kB  |
 public | content_archive | table | postgres   | 107 GB |
 public | dbversion       | table | swhstorage | 16 kB  | Schema update tracking
(3 rows)

softwareheritage-archiver=# \di+
                                     List of relations
 Schema |         Name         | Type  |   Owner    |      Table      | Size  | Description
--------+----------------------+-------+------------+-----------------+-------+-------------
 public | archive_pkey         | index | postgres   | archive         | 16 kB |
 public | content_archive_pkey | index | postgres   | content_archive | 78 GB |
 public | dbversion_pkey       | index | swhstorage | dbversion       | 16 kB |
(3 rows)

softwareheritage-archiver=# \d+ content_archive;
                              Table "public.content_archive"
   Column   |           Type           | Modifiers | Storage  | Stats target | Description
------------+--------------------------+-----------+----------+--------------+-------------
 content_id | sha1                     | not null  | extended |              |
 archive_id | archive_id               | not null  | plain    |              |
 status     | archive_status           |           | plain    |              |
 mtime      | timestamp with time zone |           | plain    |              |
Indexes:
    "content_archive_pkey" PRIMARY KEY, btree (content_id, archive_id)
Foreign-key constraints:
    "content_archive_archive_id_fkey" FOREIGN KEY (archive_id) REFERENCES archive(id)
ardumont reopened this task as Work in Progress.Jul 21 2016, 1:10 PM

After some discussion, we need to rework the schema some more.

status:

  • contents present in uffizi and missing in banco added in softwareheritage-archiver.content_archive
  • index currently being built on content_archive (still).
create unique index concurrently content_archive_pk on content_archive(content_id);

For information, I did not cleanup softwareheritage.content_archive_tmp (just in case)

Lists contents not present on banco but on uffizi is stored on /srv/storage/space/lists/content-id-uffizi-not-on-banco.txt.gz

The primary key has been added and the database should now match the schema.

zack removed a subtask: Unknown Object (Maniphest Task).Jul 29 2016, 12:04 AM