
Make the front page "archive size" graphs consistent with one another
Closed, Migrated

Description

The archive size graphs on the front page of the archive are misleading in two ways:

  • the revision and origin graphs truncate the Y axis (so the bottom of the axis is not zero, which makes the increase look much larger than it actually is)
  • the revision and origin graphs don't have the same timescale as the content graph.
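One way to remove both inconsistencies is to compute the axis limits once and apply them to every graph: a single time range for all of them, and a Y axis whose lower bound is always zero. The sketch below is only an illustration of that idea; `shared_axis_limits` is a hypothetical helper, not part of the archive's code, and the sample points are taken from the historical counts posted later in this task.

```python
from datetime import date


def shared_axis_limits(series_by_name):
    """Return (x_min, x_max, y_limits) for a set of graphs.

    series_by_name maps a graph name to a list of (date, count) points.
    All graphs share one time range, and every Y axis starts at zero.
    """
    all_dates = [d for points in series_by_name.values() for d, _ in points]
    x_min, x_max = min(all_dates), max(all_dates)
    # Each graph keeps its own Y scale, but the lower bound is always 0.
    y_limits = {
        name: (0, max(count for _, count in points))
        for name, points in series_by_name.items()
    }
    return x_min, x_max, y_limits


# Two points per series, copied from the counts posted in this task.
series = {
    "revisions": [(date(2016, 9, 14), 644_628_800), (date(2020, 5, 27), 1_744_034_936)],
    "origins": [(date(2018, 1, 24), 71_814_787), (date(2020, 5, 27), 123_781_438)],
}
x_min, x_max, y_limits = shared_axis_limits(series)
```

The returned limits can then be passed to whatever plotting layer draws the graphs, so that no graph picks its own truncated range.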

I'm not sure which way we should fix these issues (I guess @rdicosmo needs to make an executive decision here), but the current inconsistency is a common source of confusion.

The main obstacle to solving this issue is that we don't really have the old data for the revision counts and origin counts.

We don't even have proper data to "backfill" the graph for counts of revisions: while we initially recreated the graph for contents by using the ctime field on that table, we don't have such a field on other object types (and ctimes on contents are a pain, because their meaning across multiple mirrors of the archive is quite unclear, and they're not intrinsic, so we're unlikely to want to generalize them).
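For reference, rebuilding a growth curve from a creation timestamp amounts to counting objects per period and keeping a running total. The sketch below shows that idea on plain Python values; it is not the query that was actually run against the content table, the function name is hypothetical, and the timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime
from itertools import accumulate


def cumulative_by_month(ctimes):
    """Turn per-object creation timestamps into (YYYY-MM, running total) pairs."""
    per_month = Counter(t.strftime("%Y-%m") for t in ctimes)
    months = sorted(per_month)
    totals = accumulate(per_month[m] for m in months)
    return list(zip(months, totals))


# Made-up timestamps, for illustration only.
sample_ctimes = [
    datetime(2016, 1, 5), datetime(2016, 1, 20),
    datetime(2016, 3, 2), datetime(2016, 3, 9), datetime(2016, 3, 30),
]
monthly_totals = cumulative_by_month(sample_ctimes)
```

Months with no new objects are simply absent from the result, so a plotting step would need to carry the previous total forward across them.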

We could probably backfill the origin graph by using the date of the first (successful) visit of each origin.
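A minimal sketch of that backfill, under stated assumptions: visits are available as `(origin_url, visit_date, status)` tuples, and a status of `"full"` marks a successful visit. Both the tuple layout and the helper names are hypothetical, and the sample visits are made up.

```python
from bisect import bisect_right
from datetime import date


def first_successful_visits(visits):
    """Map each origin to the date of its earliest successful visit.

    Assumption of this sketch: status == "full" means the visit succeeded.
    """
    first = {}
    for origin, visit_date, status in visits:
        if status != "full":
            continue
        if origin not in first or visit_date < first[origin]:
            first[origin] = visit_date
    return first


def origins_known_by(first_visits, day):
    """Count origins whose first successful visit happened on or before day."""
    dates = sorted(first_visits.values())
    return bisect_right(dates, day)


# Made-up visits, for illustration only.
visits = [
    ("https://example.org/a", date(2016, 3, 1), "failed"),
    ("https://example.org/a", date(2016, 4, 1), "full"),
    ("https://example.org/a", date(2017, 1, 1), "full"),
    ("https://example.org/b", date(2016, 6, 1), "full"),
    ("https://example.org/c", date(2016, 7, 1), "failed"),
]
first = first_successful_visits(visits)
```

Evaluating `origins_known_by` at regular dates gives the backfilled curve. Origins that were never visited successfully, like the third one above, do not appear in it.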

And, in the end, we need to decide if we want these graphs to last "forever" or if a snapshot of the last year is sufficient.

Event Timeline

olasd triaged this task as Wishlist priority. Sep 21 2020, 5:28 PM
olasd created this task.

According to the slides repo, historical counts follow:

Date            revisions      origins
2015-09-01              0            0
2016-09-14    644,628,800   25,258,776
2016-11-24    704,845,952   53,488,904
2017-05-10    780,882,048   58,257,484
2017-09-26    853,277,241   65,546,644
2018-01-24    943,061,517   71,814,787
2018-03-25    980,390,191   83,797,945
2018-10-04  1,126,348,335   85,202,432
2019-01-27  1,248,389,319   88,288,721
2019-06-27  1,326,776,432   89,301,694
2019-09-22  1,379,380,527   90,231,104
2020-01-01  1,414,420,369   91,400,586
2020-02-06  1,428,955,761   91,512,130
2020-04-07  1,590,436,149  107,875,943
2020-05-17  1,717,420,203  121,172,621
2020-05-27  1,744,034,936  123,781,438
In T2619#49514, @olasd wrote:

This is an amazing, unexpected contribution of the Internet Archive to Software Heritage! :-)

Below are the consolidated graphs using the historical data posted in https://forge.softwareheritage.org/T2619#49513

Below is the consolidated graph augmented with historical data retrieved from the Internet Archive https://forge.softwareheritage.org/T2619#49514.

More data can be extracted from the archived snapshots of the main page of swh.org (https://web.archive.org/web/*/https://www.softwareheritage.org/), but loading each snapshot is quite slow.
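Since loading snapshots is slow, it may help to list them first and fetch only one per month. The sketch below assumes the Wayback Machine CDX server endpoint and its `output`, `from`, `to`, `filter` and `collapse` parameters; the helper names are hypothetical, and extracting the counter values from each fetched page is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url, year_from, year_to):
    """Build a CDX query returning at most one successful capture per month."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(year_from),
        "to": str(year_to),
        "filter": "statuscode:200",
        "collapse": "timestamp:6",  # first 6 digits of the timestamp = YYYYMM
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_rows):
    """Convert CDX JSON rows (header row first) into Wayback snapshot URLs."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in rows]


def fetch_snapshot_urls(page_url, year_from, year_to):
    """Query the CDX server (needs network access) and return snapshot URLs."""
    with urlopen(cdx_query_url(page_url, year_from, year_to)) as response:
        return snapshot_urls(json.load(response))
```

For example, `fetch_snapshot_urls("www.softwareheritage.org/", 2016, 2020)` would return the list of snapshot URLs to load, one per month at most.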

Go for the Internet Archive version, and I think the points you have are enough; even if there are bumps, it's much better than what we currently have!

Go for the Internet Archive version

More precisely, the second graph contains the counter values posted by @olasd plus those retrieved from the Internet Archive.
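Combining the two sources comes down to merging two lists of dated points into one sorted series. The sketch below is a hypothetical illustration, not the script used for the graph: `merge_points` is an invented name, giving precedence to the posted values on a shared date is an arbitrary choice, and the `from_archive` values are placeholders rather than real counters.

```python
from datetime import date


def merge_points(primary, secondary):
    """Merge two lists of (date, count) points into one date-sorted list.

    When both sources have a point for the same date, the value from
    primary wins; that precedence is an arbitrary choice for this sketch.
    """
    merged = dict(secondary)
    merged.update(dict(primary))
    return sorted(merged.items())


# Revision counts copied from the table posted in this task.
posted = [(date(2016, 9, 14), 644_628_800), (date(2016, 11, 24), 704_845_952)]
# Placeholder values standing in for counters read from archived pages.
from_archive = [(date(2016, 10, 15), 650_000_000), (date(2016, 11, 24), 700_000_000)]
revisions = merge_points(posted, from_archive)
```

Points found later in other historical sources can be folded in the same way.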

and I think the points you have are enough; even if there are bumps, it's much better than what we currently have!

I think so too, this looks consistent. Plus we could still add more points in the future if we find other historical data.

Great to see this! Let's go :-)