Yes, we are hitting the same problem.
Feb 13 2018
Basic checks on the archive are fine:
softwareheritage=> select count(*) from origin_visit
    inner join origin on origin_visit.origin = origin.id
    where origin.type = 'hg';
 count
--------
 126678
(1 row)

softwareheritage=> select count(*) from origin o
    inner join origin_visit ov on o.id = ov.origin
    where type = 'hg' and url like '%googlecode%'
    and ov.snapshot_id = 16;  # empty snapshot
 count
--------
 126661
(1 row)
Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.
FYI all loaded repositories point to an empty snapshot.
Feb 12 2018
FYI all loaded repositories point to an empty snapshot.
Some errors speak for themselves (OSError, error during extraction), some do not.
I'm currently digging into this and will open dedicated tasks when deemed necessary.
Out of 127k (127048) only ~125k (124899, query on swh db) are referenced.
Feb 9 2018
yay
This is now running on our swh-workers, with the scheduling running on saatchi:
Feb 6 2018
Feb 5 2018
It appears that in this case, the properties must be changed not to the symlink but to its source.
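The workaround described above can be sketched in Python. This is a hypothetical helper, not the actual loader code, and it assumes the "property" in question is something like a permission bit:

```python
import os
import stat
import tempfile

def chmod_resolving_symlink(path, mode):
    """Apply a mode change to the symlink's source (target) rather
    than the symlink itself, since the property belongs there."""
    target = os.path.realpath(path)  # resolve the link to its source
    os.chmod(target, mode)
    return target

# Hypothetical usage: make the *source* of a link executable.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "script.sh")
link = os.path.join(workdir, "link.sh")
with open(source, "w") as f:
    f.write("#!/bin/sh\necho ok\n")
os.symlink(source, link)
target = chmod_resolving_symlink(link, 0o755)
```

The point is only the resolution step: operating on `path` directly would change the link, not the file it points to.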
It's more an empty-repository case than a repository starting its commit range at 0...
Feb 2 2018
This is in stand-by during the snapshot migration.
Dec 21 2017
Dec 20 2017
Dec 14 2017
P202 checked and ok locally.
Now asked for review as it will remove data from the main db.
After discussion with the team, it has been decided to remove from the re-scheduling the svn dumps whose compressed size exceeds 2Gib.
This reflects the same decision taken for git repositories.
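The cut-off can be expressed as a small filter. The 2 GiB limit comes from the decision above; the file names and the (path, compressed size) layout are made up for illustration:

```python
# 2 GiB threshold decided for the svn-dump re-scheduling.
TWO_GIB = 2 * 1024 ** 3

def schedulable(dumps, limit=TWO_GIB):
    """Keep only the (path, compressed_size) pairs small enough
    to be rescheduled."""
    return [path for path, size in dumps if size <= limit]

# Hypothetical dump listing.
dumps = [("repo-a.svndump.gz", 512 * 1024 ** 2),
         ("repo-b.svndump.gz", 3 * 1024 ** 3)]
print(schedulable(dumps))  # ['repo-a.svndump.gz']
```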
Dec 13 2017
Dec 11 2017
Scheduled back from saatchi (as I needed the producer credentials to access the queue properties):
Dec 9 2017
Dec 1 2017
Nov 23 2017
The old tool is id 7, the new one is 9:
Nov 22 2017
Depends on T761
Bogus mimetype values are identified by the following query:
softwareheritage=> select count(*) from content_mimetype
    where mimetype LIKE '[%'
    or mimetype like '' and indexer_configuration_id=7;
 count
-------
 50733
(1 row)
Nov 16 2017
Status:
- Final listing of bogus values: /srv/storage/space/lists/indexer/mimetype/sha1-with-bogus-values.txt.gz (50733)
- Queue reached the sane point.
- workers stopped.
Nov 15 2017
I am waiting for the queue to drop to 10000, as that will avoid rescheduling the 10000 already done (well, except for the new bogus values :)
There might be other bogus values in the stats that I haven't noticed.
I don't see how I can easily check this though, since we don't have the sha1 provenance yet.
Off the top of my head, I would say that I forgot to clean up those bogus values after the initial runs around December 2016.
Nov 13 2017
The directories have been corrected, but the bogus file entries have not been deleted yet.
This is now done on uffizi.
Potentially remaining sub-tasks before closing this:
Nov 12 2017
Updated the SQL to also delete objects from the tables that reference them, e.g., the indexer ones.
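The ordering this implies (referencing rows first, referenced objects second) can be illustrated with a toy sqlite3 schema; the table names are simplified stand-ins for the real content/content_mimetype tables, not the production SQL:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content (id INTEGER PRIMARY KEY, sha1 TEXT);
    CREATE TABLE content_mimetype (
        id INTEGER REFERENCES content(id), mimetype TEXT);
    INSERT INTO content VALUES (1, 'abc'), (2, 'def');
    INSERT INTO content_mimetype VALUES (1, 'text/plain'), (2, '[%');
""")
# Delete the referencing indexer rows first, then the objects
# themselves, so no dangling references are left behind.
bogus = [(2,)]
db.executemany("DELETE FROM content_mimetype WHERE id = ?", bogus)
db.executemany("DELETE FROM content WHERE id = ?", bogus)
remaining = db.execute("SELECT id FROM content").fetchall()
print(remaining)  # [(1,)]
```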
Nov 8 2017
The replication is now functional, indexes have been recreated, and the frontend now points to the new database.
thanks for the review, I've updated the SQL query accordingly
Nov 6 2017
create extension pglogical;

select pglogical.create_node(
    node_name := 'prado',
    dsn := 'host=prado.internal.softwareheritage.org port=5433 dbname=softwareheritage');

select pglogical.replication_set_add_table('default', 'content', true);
As a comment on 4, the object_id column is per-table, so you should avoid carrying it over to the skipped_content table.
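A toy sqlite3 sketch of that point, with made-up columns: each table generates its own object_id, so a copy into skipped_content should select only the payload columns and let the destination assign a fresh id:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content (
        object_id INTEGER PRIMARY KEY AUTOINCREMENT, sha1 TEXT);
    CREATE TABLE skipped_content (
        object_id INTEGER PRIMARY KEY AUTOINCREMENT, sha1 TEXT);
    INSERT INTO content (sha1) VALUES ('aaa'), ('bbb'), ('ccc');
""")
# Copy only the payload column; object_id is per-table and must not
# be carried over from content.
db.execute("""
    INSERT INTO skipped_content (sha1)
    SELECT sha1 FROM content WHERE sha1 = 'ccc'
""")
row = db.execute(
    "SELECT object_id, sha1 FROM skipped_content").fetchone()
print(row)  # (1, 'ccc') -- a fresh id, not content's 3
```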