Query: Advanced Search


ok, here are the building blocks I've prepared to resolve this task, as a step by step recipe:

FTR, the query I've used to generate the stats is:

mimetype-stats.sql (133 B)

(the encoding there is needed due to T818)

and here's the actual disk usage (!= compressed size)

/srv/softwareheritage/scratch/lists $ xzcat oversize-contents.txt.xz | while read id ; do du $(swh-ls-obj $id) ; done | cut -f 1 | paste -sd+ | bc
16847900696

those are KB (1024-byte blocks, as reported by du), so the total disk usage is ~15.7 TiB (not bad!)
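
The conversion of the `du` total above can be double-checked with a small helper. This is only a sketch: `kb_to_tib` is a hypothetical name, and it assumes `du` reported 1024-byte blocks (the GNU default), so the result is in TiB rather than decimal TB.

```shell
# Convert a size given in 1 KiB blocks (du's default unit on GNU systems)
# to TiB, printed with two decimals.
# kb_to_tib is a hypothetical helper name, not part of any existing tool.
kb_to_tib() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / (1024 * 1024 * 1024) }'
}

# Total taken from the du pipeline output above.
kb_to_tib 16847900696
```

Dividing by 1000 three times instead would give the figure in decimal units, which is why the two conventions differ by roughly 7% at this scale.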

as a datapoint, at the time of writing the total (uncompressed) size occupied by content objects that are larger than our current limit is as follows:

Reopened since a subtask (or child task) is still open (T676).

It got restarted 2 weeks ago (Monday 18th September 2017).
It just finished (Monday 2nd October 2017).

I have now backfilled the rrd files in munin with historical data grabbed from the content table.

The content table has a nice ctime field that will allow us to regenerate historical data. I'm looking into this now.
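
Regenerating history from ctime amounts to counting objects per time bucket and then accumulating those counts. A minimal sketch of the accumulation step, assuming the per-day counts were already exported from the content table as tab-separated `day`, `count` lines (for instance with a query grouping on `date_trunc('day', ctime)`); `cumulative_counts` and the sample values are hypothetical, and feeding the result into the rrd files is not shown.

```shell
# Turn "day<TAB>new_objects" lines (sorted by day) into
# "day<TAB>running_total" lines.
# cumulative_counts is a hypothetical helper name.
cumulative_counts() {
    awk -F '\t' 'BEGIN { OFS = "\t" } { total += $2; print $1, total }'
}

# Hypothetical sample input, not real data from the content table.
printf '2017-04-07\t10\n2017-04-08\t5\n2017-04-09\t7\n' | cumulative_counts
```

The input has to be sorted by day for the running total to be meaningful; an `ORDER BY` in the exporting query would take care of that.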

My current point of view is thus: we've been bitten by inconsistencies between primary and replica before, so I think the counts should run on the primary and get replicated through the standard means to replicas, even if that means stressing the primary a bit more.

In T719#13453, @zack wrote:

From what you wrote I'm assuming you plan to run the cron count on the replica also when in production.

Sounds viable and good to me. (From what you wrote I'm assuming you plan to run the cron count on the replica also when in production.)

Well, it turns out that pg_stat_user_tables is pretty bad as well, just in a different way than pg_class is...

Ingestion is now done, after multiple (re)schedulings.

And of course now there's a discrepancy between the graphs (exported from statistics on the main database) and the counter (exported from real-time statistics on the replica database, on which vacuum has never been run)...

All updates performed.

softwareheritage=> select origin.id, count(distinct visit) from origin left join origin_visit ov on ov.origin = origin.id where type = 'ftp' group by origin.id having count(distinct visit) <> 1;
┌────┬───────┐
│ id │ count │
├────┼───────┤
└────┴───────┘
(0 rows)

Not with the current public API: the referential integrity of occurrences generated by the GNU injection is not verified, so the occurrences for "not version 1.2.4" are unreachable.

So with the API can we access directly version 1.3.12? aka object_id: 2994581

An update on this: it is still a work in progress.

All the directories should now have been corrected.

Directories have all been updated.

The queries updating the directory table with conflicting directory entries are in progress.

The query updating the directory_entry_dir table is in progress.


Activity dates (2017, newest first): Nov 5, Nov 4, Nov 3, Oct 26, Oct 21, Oct 20, Oct 3, Oct 2, Sep 15, Sep 2, Sep 1, Jul 26, Jul 24, Jun 6, May 30, May 12, May 4, Apr 26, Apr 24, Apr 20, Apr 19, Apr 7.
