Page MenuHomeSoftware Heritage

Investigate scheduler journal client discrepancies
Closed, MigratedEdits Locked

Description

T2993 deployed and first run completed.

But, we face discrepancies:

  • have 98M [1] origins referenced in the cache table but we should have around 131M [2] origins. That difference ~33M is currently unexplained and high.
  • Checking for example on "linux" origin, the cache data computed is off as well [3]

[1]

softwareheritage-scheduler=> select now(), count(*) from origin_visit_stats;
              now              |  count
-------------------------------+----------
 2021-01-28 08:34:40.152554+00 | 98231002

[2]

softwareheritage=> select count(distinct origin) from origin_visit_status where status in ('full', 'partial');

[3]

softwareheritage-scheduler=> select * from origin_visit_stats where url='https://github.com/torvalds/linux';
                url                | visit_type |         last_eventful         | last_uneventful |          last_failed          | last_notfound |               last_snapshot                | last_scheduled
-----------------------------------+------------+-------------------------------+-----------------+-------------------------------+---------------+--------------------------------------------+----------------
 https://github.com/torvalds/linux | git        | 2017-09-07 18:43:13.021746+00 |                 | 2018-08-23 11:53:06.553328+00 |               | \x3e3045be901bacc7594176e79ba13fe030f601e2 |
(1 row)

softwareheritage=> select * from origin_visit_status where origin=2 order by date desc limit 10;
 origin | visit |             date              | status  | metadata |                  snapshot                  | type
--------+-------+-------------------------------+---------+----------+--------------------------------------------+------
      2 |    67 | 2020-09-21 21:55:01.586191+00 | full    |          | \xc7beb2432b7e93c4cf6ab09cd194c7c1998df2f9 |
      2 |    67 | 2020-09-21 19:15:24.238712+00 | created |          |                                            |
      2 |    66 | 2020-09-21 17:12:11.930011+00 | partial |          |                                            |
      2 |    66 | 2020-09-21 17:07:41.94459+00  | created |          |                                            |
      2 |    65 | 2020-08-24 11:51:54.472736+00 | full    |          | \xb16664848afbd3e867e8fce516ef15c1772679b2 |
      2 |    65 | 2020-08-24 09:22:41.181224+00 | created |          |                                            |
      2 |    64 | 2020-03-19 23:29:59.614232+00 | full    |          | \x89eed60d46be8b8963a1a2268762aee5bbb41038 |
      2 |    63 | 2020-01-20 19:50:46.750039+00 | full    |          | \xcabcc7d7bf639bbe1cc3b41989e1806618dd5764 |
      2 |    62 | 2019-12-16 13:44:56.685885+00 | ongoing |          |                                            |
      2 |    61 | 2019-08-25 14:04:07.603463+00 | full    |          | \xeb8087624d47f6e8ee89692df041b2f568fb0e5f |

Related to T2993

Event Timeline

Using a semi-trivial journal client, I've dumped the data from swh-journal regarding the linux origin visit statuses:

{'origin': 'https://github.com/torvalds/linux', 'visit': 57, 'date': datetime.datetime(2018, 12, 27, 7, 47, 0, 154971, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'|\xc1\x9bJ{Y\xf0\xbb\xa2\x06\xb0,z\xacM\xf8b\x9f\xa6R', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 25, 'date': datetime.datetime(2017, 10, 8, 11, 54, 25, 582463, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x1b2xW\xf7.\x8c\xde\xb4\xd1wHrt\xaf\x18\xb2i\x03\x12', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 28, 'date': datetime.datetime(2017, 11, 21, 19, 37, 42, 517795, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x86=F\x9d&1<Cau\xb4\xdb\x04)6u\x97+\xf9l', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 51, 'date': datetime.datetime(2018, 9, 30, 15, 54, 18, 68701, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xdf\x1dX\xdb\xafoJ!!=\xced\x1aLg\xabN>.\x97', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 18, 'date': datetime.datetime(2017, 9, 11, 5, 49, 58, 300518, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b' \xec\x8f!\xe9\xb7@\xb5\x9b\xbd\x04\xe68C\x08\x95{\x98\x97\xd5', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 67, 'date': datetime.datetime(2020, 9, 21, 19, 15, 24, 238712, tzinfo=datetime.timezone.utc), 'status': 'created', 'snapshot': None, 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 35, 'date': datetime.datetime(2018, 7, 5, 17, 18, 5, 610308, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 47, 'date': datetime.datetime(2018, 9, 18, 6, 38, 28, 791590, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 64, 'date': datetime.datetime(2020, 3, 19, 23, 29, 59, 614232, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x89\xee\xd6\rF\xbe\x8b\x89c\xa1\xa2&\x87b\xae\xe5\xbb\xb4\x108', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 21, 'date': datetime.datetime(2017, 9, 12, 18, 12, 35, 344511, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xe5\xb1\xbd8\xd5\xb0l\xf9\x9f1L\xf89\xd5Mi`\xe1f\xa1', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 42, 'date': datetime.datetime(2018, 9, 7, 20, 8, 54, 581862, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 39, 'date': datetime.datetime(2018, 8, 31, 6, 32, 17, 742578, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 56, 'date': datetime.datetime(2018, 10, 21, 5, 22, 50, 678295, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'-A\x80Y7\xba1\xd3\xbbm\xb3|\xfa\x8a\x9a#j\xa9\x81\x0f', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 27, 'date': datetime.datetime(2017, 11, 5, 22, 45, 54, 59797, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'l\xd9\x08\x9f\xaa\x91x\x9f\xb4A\xee\xbb\x90$;\x0f\xea\xd6 s', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 29, 'date': datetime.datetime(2018, 3, 8, 0, 0, 53, 641972, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 26, 'date': datetime.datetime(2017, 10, 18, 6, 25, 9, 481005, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x83&\xea\xba\xfa\x96\xfd\x12\xc5\xfb\xd9sU\xd3r\xcdz\xb8>#', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 17, 'date': datetime.datetime(2017, 9, 10, 17, 35, 20, 158515, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xf3\xbb\xa4\n\x96G{\xc4\xe16\xca\xa3\xbf\x95\xfeL\xa0\xe0\x12\xa2', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 66, 'date': datetime.datetime(2020, 9, 21, 17, 12, 11, 930011, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 53, 'date': datetime.datetime(2018, 10, 4, 18, 58, 58, 156154, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'w\t\x86-\xaa\xe70\xe3\xa0C\xb5F\x88$K}\xf2\x13@]', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 7, 'date': datetime.datetime(2016, 8, 29, 18, 55, 54, 153721, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'k\xd6i\x93\x83\x9d\xc8\x97\xaa\x15\xa4C\xc4\xe3\xb9\x16O\x81\x14\x99', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 32, 'date': datetime.datetime(2018, 7, 4, 9, 1, 25, 333541, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 37, 'date': datetime.datetime(2018, 8, 23, 11, 53, 6, 553328, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 60, 'date': datetime.datetime(2019, 8, 5, 2, 16, 12, 840378, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x8e\x9cP;]\xe5H\x8d\x17\xcfX\xb8\xce\x1c|_\x82x\xd7\x07', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 65, 'date': datetime.datetime(2020, 8, 24, 9, 22, 41, 181224, tzinfo=datetime.timezone.utc), 'status': 'created', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 13, 'date': datetime.datetime(2017, 9, 7, 18, 43, 13, 21746, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'>0E\xbe\x90\x1b\xac\xc7YAv\xe7\x9b\xa1?\xe00\xf6\x01\xe2', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 15, 'date': datetime.datetime(2017, 9, 9, 17, 18, 54, 307789, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xad\xd0\xa0#b\xf3\xe5\x93\x14\xe8\x0f\xa6\x01\xed\xd6\t*d\xbal', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 55, 'date': datetime.datetime(2018, 10, 13, 19, 9, 33, 773125, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b"L\xc3,\xd1V'4\xe1=\xb5\x96)\xc8z\x1e:\xe6\x1bv\xa8", 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 43, 'date': datetime.datetime(2018, 9, 8, 21, 36, 3, 892238, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 24, 'date': datetime.datetime(2017, 10, 6, 4, 27, 51, 146287, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b"u\x9a\xef\x82/\x07\xefy\ne\x05\xbfk'[W\x088\xde\x0e", 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 6, 'date': datetime.datetime(2016, 8, 17, 9, 16, 22, 52065, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'i\x03\xf8h\xdfm\x94\xa4D\x81\x8bP\xbe\xcdH5\xb2\x9b\xe2t', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 46, 'date': datetime.datetime(2018, 9, 18, 2, 22, 45, 500261, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 4, 'date': datetime.datetime(2016, 6, 18, 1, 22, 24, 808485, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xce!\xf3\x17\xd9\xfdt\xbbJ\xf3\x1b\x06 r@\x03\x1fK%\x16', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 50, 'date': datetime.datetime(2018, 9, 24, 6, 23, 8, 81958, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'I(\xe6k=Z7\x07]\x15\xda\xae\xcd\x9a>!\xab\xf2I\xcf', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 40, 'date': datetime.datetime(2018, 8, 31, 18, 53, 23, 407469, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 5, 'date': datetime.datetime(2016, 8, 14, 12, 10, 0, 536702, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xfe\x0e\xac\x19\x14\x1f\xdc\xdf\x03\x9e\x8fZ\xce^A\xb9\xa29\x8aI', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 41, 'date': datetime.datetime(2018, 9, 6, 23, 54, 1, 808678, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 38, 'date': datetime.datetime(2018, 8, 23, 22, 43, 57, 639257, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 16, 'date': datetime.datetime(2017, 9, 10, 5, 29, 1, 462971, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x88{\x850\x81\xdd\xff.\xa2\x1b5\xaeYF\x1b}(\xe794', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 36, 'date': datetime.datetime(2018, 8, 1, 7, 26, 22, 603055, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 67, 'date': datetime.datetime(2020, 9, 21, 21, 55, 1, 586191, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xc7\xbe\xb2C+~\x93\xc4\xcfj\xb0\x9c\xd1\x94\xc7\xc1\x99\x8d\xf2\xf9', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 65, 'date': datetime.datetime(2020, 8, 24, 11, 51, 54, 472736, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xb1fd\x84\x8a\xfb\xd3\xe8g\xe8\xfc\xe5\x16\xef\x15\xc1w&y\xb2', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 33, 'date': datetime.datetime(2018, 7, 4, 12, 3, 2, 684804, tzinfo=datetime.timezone.utc), 'status': 'ongoing', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 9, 'date': datetime.datetime(2016, 9, 14, 10, 36, 21, 505296, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'@\xa58\x1e+l\x0c\x04w\\[~{7(L:\xff\xc1)', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 2, 'date': datetime.datetime(2016, 2, 23, 18, 5, 23, 312045, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'&\xbe\xfd\xbfK9=\x1e\x03\xaa\x80\xf2\xa9U\xbc8\xb2A\xa8\xac', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 54, 'date': datetime.datetime(2018, 10, 13, 5, 8, 9, 273129, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b"L\xc3,\xd1V'4\xe1=\xb5\x96)\xc8z\x1e:\xe6\x1bv\xa8", 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 58, 'date': datetime.datetime(2019, 3, 30, 10, 44, 8, 484859, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'J\xaeZ\x00Dj\xa0i\xd90gbIG\xb5\x0e\xb2\xffv\xdb', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 34, 'date': datetime.datetime(2018, 7, 4, 12, 3, 59, 959837, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x91\xec\xbck=\x9e$;+\x1f\x9f\xc9aN\xd6\x85\xfe\xc4sr', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 19, 'date': datetime.datetime(2017, 9, 11, 18, 0, 15, 37345, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b' \xec\x8f!\xe9\xb7@\xb5\x9b\xbd\x04\xe68C\x08\x95{\x98\x97\xd5', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 48, 'date': datetime.datetime(2018, 9, 22, 13, 51, 35, 886745, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'v\x8ai\xf2\x05\xd4\xfa4i\x0c>\x9a4\x12\x08A\x19O?\xa8', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 10, 'date': datetime.datetime(2016, 9, 23, 10, 14, 2, 169862, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'"R\xb4\xd4\x9b\x9exn\xb7w\xa0\tzB\xe5\x1cq\x93\xbb\x9c', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 20, 'date': datetime.datetime(2017, 9, 12, 6, 6, 34, 703343, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b' \xec\x8f!\xe9\xb7@\xb5\x9b\xbd\x04\xe68C\x08\x95{\x98\x97\xd5', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 30, 'date': datetime.datetime(2018, 6, 1, 12, 48, 0, 902827, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 63, 'date': datetime.datetime(2020, 1, 20, 19, 50, 46, 750039, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xca\xbc\xc7\xd7\xbfc\x9b\xbe\x1c\xc3\xb4\x19\x89\xe1\x80f\x18\xddWd', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 8, 'date': datetime.datetime(2016, 9, 7, 8, 44, 47, 861875, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xc0j\x96_\x85_Ms\xc8O\xbe\xfd\x85\x9f}\xf5\x07\x18}\x9c', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 59, 'date': datetime.datetime(2019, 6, 21, 19, 28, 48, 774654, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\xcb\x0eVD\x8f\xea<i\xa2y%9\x89e\xf8\x07\xdc\xba\x8f$', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 66, 'date': datetime.datetime(2020, 9, 21, 17, 7, 41, 944590, tzinfo=datetime.timezone.utc), 'status': 'created', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 44, 'date': datetime.datetime(2018, 9, 9, 12, 40, 48, 143883, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 11, 'date': datetime.datetime(2017, 2, 16, 7, 53, 39, 467657, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': b'\x1a\x88\x93\xe6\xa8oDN\x8b\xe8\xe7\xbd\xa6\xcb4\xfb\x175\xa0\x0e', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 62, 'date': datetime.datetime(2019, 12, 16, 13, 44, 56, 685885, tzinfo=datetime.timezone.utc), 'status': 'ongoing', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 22, 'date': datetime.datetime(2017, 9, 13, 6, 26, 36, 580675, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x82\xd9\x11q\xabGJP\xb4e=I\xaf\x9f\xab\x1b\x9f\x1a\xf9\xc1', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 49, 'date': datetime.datetime(2018, 9, 23, 1, 14, 36, 694248, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'v\x8ai\xf2\x05\xd4\xfa4i\x0c>\x9a4\x12\x08A\x19O?\xa8', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 31, 'date': datetime.datetime(2018, 7, 1, 3, 51, 21, 433801, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 14, 'date': datetime.datetime(2017, 9, 9, 5, 14, 33, 466107, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\x1e\xf6\xb3q\xc8WF\xe2`\xaf<\x84\x13\xf8\x02\xfd\xd7,\xec\x9d', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 12, 'date': datetime.datetime(2017, 5, 4, 19, 40, 9, 336451, tzinfo=datetime.timezone.utc), 'status': 'full', 'snapshot': b'\\\x9bEM\xae\x06\x82\x81\xd4\x84E\xc7\x9a$\x08\xa2\xc2C\xbc\xa2', 'type': 'git', 'metadata': None}
{'origin': 'https://github.com/torvalds/linux', 'visit': 45, 'date': datetime.datetime(2018, 9, 17, 6, 15, 20, 340219, tzinfo=datetime.timezone.utc), 'status': 'partial', 'snapshot': None, 'type': 'git', 'metadata': None}

A few observations:

  • some visit statuses don't have a type (?);
  • the data doesn't look complete
  • the contents of the origin visit stats table still doesn't match what the journal apparently contains

The first two issues make me think there's a (hidden) issue with the default settings of the journal client.

I've restarted the journal client with:

  • debug logging
  • a bumped up max.message.size (matching what was set in the backfill, see P934)
  • a bumped up batch_size (10000 instead of the default of 200)

For now, this seems to

  • return more messages
  • return all the more recent messages

This is all quite promising.

I've also added a test with "actual production" origin visit status messages (D4981) which doesn't seem to reproduce the prod issue, which is some more food for the "something is missing on the journal client side".

The dump using a bumped up max.message.size completed without issue. Thanks to the bumped up batch_size, it only took 2 hours.

In that dump, all expected messages are present. Looks like the messages without type haven't been compacted away yet.

I then reset the offsets of my client, and restarted the dump with default settings for max.message.size and batch_size. Looks like I didn't reproduce any of the previous issues with missing messages, until the client ran out of disk space for logs :(

I'm completely stumped.

I've added the following to the swh-scheduler-journal-client config on saatchi:

message.max.bytes: 100000000
batch_size: 5000

I've also patched the journal client to ignore messages without a type:

interesting_messages = [
    msg for msg in messages[msg_type] if "type" in msg and msg["status"] not in ("created", "ongoing")
]

if not interesting_messages:
    return

and restarted the client in debug mode.

After landing rDSCHecab745a5f20, we got partway there, but we still had only 98m origins in the visit stats table.

After checking the diff between origin lists in the swh-scheduler db and in swh-storage, I thought there was only origins with relatively high origin ids missing.

This led me to believe that the hardcoded limit of 100m origins in swh.storage.backfill would have been the culprit, but some origins higher than 100m got backfilled properly.

After (finally!) checking the backfill logs (on getty:/srv/softwareheritage/backfill-2021-01), looks like a bunch of backfill processes have been interrupted with the following traceback:

2021-01-26T15:09:27 INFO     swh.storage.backfill Processing origin_visit_status range 91132000 to 91133000
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.11.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 552, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 66, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 247, in write_additions
    self._write_addition(object_type, object_)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 235, in _write_addition
    self.send(topic, key=key, value=dict_)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 175, in send
    topic=topic, key=kafka_key, value=value_to_kafka(value),
BufferError: Local: Queue full
$ zgrep --files-with-matches BufferError logs/origin_visit_status/*
logs/origin_visit_status/100000000.log.gz
logs/origin_visit_status/110000000.log.gz
logs/origin_visit_status/120000000.log.gz
logs/origin_visit_status/60000000.log.gz
logs/origin_visit_status/80000000.log.gz
logs/origin_visit_status/90000000.log.gz

Which means that the visit statuses for around 50-60m origins haven't been backfilled with a type key, explaining the discrepancy...

I'll re-run the backfill on these broken shards (ouch!), and we'll see how that finally goes...

I had an immediate resurgence of the issue when restarting the backfill of the missing chunks, so I hot-patched a retry on BufferErrors in swh.journal.writer.kafka. Glancing at sentry, we never had this issue in prod. The hot patch seems to work; I'll write tests on that and submit it as a diff tomorrow.

The backfill is now complete, and the scheduler journal client has processed its backlog of messages.

Time to compare origin lists again...

After the full backfill:

softwareheritage-scheduler=> select count(*) from origin_visit_stats;
   count
-----------
 138073057
(1 ligne)
$ psql service=swh -c "\copy (select url from origin where exists (select 1 from origin_visit_status where origin=id and status in ('full', 'partial')) order by url) to stdout csv" > archive_origins

$ wc -l archive_origins                                                                                                                                                                                                                                                                    
138041707 archive_origins

The amount of origins seems to be consistent in both lists.

I'm currently dumping the list of origins from the scheduler, to do an actual comparison, but this is very promising!

olasd claimed this task.

That's all been working consistently for months now, closing!