
gitlab lister: make full listing on large instance more robust to concurrency writings
Open, Normal, Public

Description

Investigate and fix:

Jun 30 09:00:54 worker15 python3[23590]: [2019-06-30 09:00:54,448: ERROR/ForkPoolWorker-2] Task swh.lister.gitlab.tasks.RangeGitLabLister[474d600e-ff5c-43b0-83f9-afc29b1cfd88] raised unexpected: IntegrityError('(psycopg2.IntegrityError) duplicate key value violates unique constraint "gitlab_repo_pkey"\nDETAIL:  Key (uid)=(debian/nathanruiz-guest/apt) already exists.\n',)
                                         Traceback (most recent call last):
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
                                             context)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 450, in do_execute
                                             cursor.execute(statement, parameters)
                                         psycopg2.IntegrityError: duplicate key value violates unique constraint "gitlab_repo_pkey"
                                         DETAIL:  Key (uid)=(debian/nathanruiz-guest/apt) already exists.


                                         The above exception was the direct cause of the following exception:

                                         Traceback (most recent call last):
                                           File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 382, in trace_task
                                             R = retval = fun(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/swh/scheduler/task.py", line 45, in __call__
                                             return super().__call__(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 641, in __protected_call__
                                             return self.run(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/swh/lister/gitlab/tasks.py", line 36, in range_gitlab_lister
                                             lister.run(min_bound=start, max_bound=end)
                                           File "/usr/lib/python3/dist-packages/swh/lister/core/page_by_page_lister.py", line 123, in run
                                             checks=check_existence)
                                           File "/usr/lib/python3/dist-packages/swh/lister/core/lister_base.py", line 492, in ingest_data
                                             injected = self.inject_repo_data_into_db(models_list)
                                           File "/usr/lib/python3/dist-packages/swh/lister/core/lister_base.py", line 435, in inject_repo_data_into_db
                                             injected_repos[m['uid']] = self.db_inject_repo(m)
                                           File "/usr/lib/python3/dist-packages/swh/lister/core/lister_base.py", line 372, in db_inject_repo
                                             sql_repo = self.db_query_equal('uid', model_dict['uid'])
                                           File "/usr/lib/python3/dist-packages/swh/lister/core/lister_base.py", line 335, in db_query_equal
                                             .filter(key == value).first()
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/query.py", line 2659, in first
                                             ret = list(self[0:1])
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/query.py", line 2457, in __getitem__
                                             return list(res)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/query.py", line 2760, in __iter__
                                             self.session._autoflush()
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 1303, in _autoflush
                                             util.raise_from_cause(e)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
                                             reraise(type(exception), exception, tb=exc_tb, cause=cause)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 186, in reraise
                                             raise value
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 1293, in _autoflush
                                             self.flush()
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 2019, in flush
                                             self._flush(objects)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 2137, in _flush
                                             transaction.rollback(_capture_exception=True)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
                                             compat.reraise(exc_type, exc_value, exc_tb)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 186, in reraise
                                             raise value
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 2101, in _flush
                                             flush_context.execute()
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
                                             rec.execute(self)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
                                             uow
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/persistence.py", line 174, in save_obj
                                             mapper, table, insert)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/orm/persistence.py", line 767, in _emit_insert_statements
                                             execute(statement, multiparams)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 914, in execute
                                             return meth(self, multiparams, params)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
                                             return connection._execute_clauseelement(self, multiparams, params)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
                                             compiled_sql, distilled_params
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
                                             context)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
                                             exc_info
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
                                             reraise(type(exception), exception, tb=exc_tb, cause=cause)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 185, in reraise
                                             raise value.with_traceback(tb)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
                                             context)
                                           File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 450, in do_execute
                                             cursor.execute(statement, parameters)
                                         sqlalchemy.exc.IntegrityError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (psycopg2.IntegrityError) duplicate key value violates unique constraint "gitlab_repo_pkey"
                                         DETAIL:  Key (uid)=(debian/nathanruiz-guest/apt) already exists.
                                          [SQL: 'INSERT INTO gitlab_repo (name, full_name, html_url, origin_url, origin_type, last_seen, task_id, uid, instance) VALUES (%(name)s, %(full_name)s, %(html_url)s, %(origin_url)s, %(origin_type)s, %(last_seen)s, %(task_id)s, %(uid)s, %(instance)s)'] [parameters: {'instance': 'debian', 'last_seen': datetime.datetime(2019, 6, 30, 9, 0, 36, 155540), 'origin_url': 'https://salsa.debian.org/nathanruiz-guest/apt.git', 'full_name': 'nathanruiz-guest/apt', 'name': 'apt', 'html_url': 'https://salsa.debian.org/nathanruiz-guest/apt', 'task_id': None, 'origin_type': 'git', 'uid': 'debian/nathanruiz-guest/apt'}]
Jun 30 09:00:54 worker15 python3[23574]: [2019-06-30 09:00:54,518: INFO/MainProcess] Received task: swh.lister.gitlab.tasks.RangeGitLabLister[71da1490-b1ac-4d93-bc7f-5402472e05d1]

@douardda and I may have encountered such occurrences already.
If memory serves me well, it was possibly due to overlapping range intervals.
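
To make the overlap hypothesis concrete, here is a minimal sketch. It assumes each task lists pages from min_bound to max_bound inclusive, which the call lister.run(min_bound=start, max_bound=end) in the traceback suggests but does not confirm. split_range and overlaps are hypothetical helpers written for this illustration, not part of swh.lister.

```python
def split_range(first, last, size):
    """Split the inclusive page range [first, last] into disjoint inclusive chunks."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [
        (start, min(start + size - 1, last))
        for start in range(first, last + 1, size)
    ]


def overlaps(ranges):
    """Return the pairs of consecutive inclusive ranges that share at least one page."""
    ordered = sorted(ranges)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b[0] <= a[1]]


# With inclusive bounds, a chunk ending where the next one starts lists that
# page twice, so two workers may try to insert the same uid.
naive = [(1, 10), (10, 20), (20, 25)]
disjoint = split_range(1, 25, 10)

print(overlaps(naive))     # the two boundary pages are shared
print(overlaps(disjoint))  # no shared page
```

If the bounds turn out to be half-open in the actual lister, the naive ranges above would already be disjoint and the cause would have to be sought elsewhere.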

In any case, this must be dealt with:

  • preferably by checking the range computations to avoid overlap
  • as a fallback (if the source of the error is not found, for example), by trapping those errors and making sure the main process continues, so that the listing has no holes
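
The fallback could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for the PostgreSQL/SQLAlchemy stack seen in the traceback, the table is reduced to two columns, inject_repos is a hypothetical helper, and the second uid is made up for the example.

```python
import sqlite3


def inject_repos(conn, repos):
    """Insert repos one by one, skipping rows whose uid already exists.

    Returns (inserted_uids, skipped_uids) so the caller can log what was skipped.
    """
    inserted, skipped = [], []
    for repo in repos:
        try:
            conn.execute(
                "INSERT INTO gitlab_repo (uid, origin_url) VALUES (?, ?)",
                (repo["uid"], repo["origin_url"]),
            )
        except sqlite3.IntegrityError:
            # The uid was already written (concurrent worker or overlapping
            # range): record it and carry on with the rest of the page.
            skipped.append(repo["uid"])
        else:
            inserted.append(repo["uid"])
    conn.commit()
    return inserted, skipped


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gitlab_repo (uid TEXT PRIMARY KEY, origin_url TEXT)")

page = [
    {"uid": "debian/nathanruiz-guest/apt",
     "origin_url": "https://salsa.debian.org/nathanruiz-guest/apt.git"},
    # Hypothetical second entry, for illustration only.
    {"uid": "debian/example/other",
     "origin_url": "https://salsa.debian.org/example/other.git"},
]

first = inject_repos(conn, page[:1])  # one task writes the first repo
second = inject_repos(conn, page)     # an overlapping task sees it again
print(first, second)
```

One difference matters for the real setup: in PostgreSQL a failed statement aborts the enclosing transaction, so each insert would need its own savepoint (for instance via SQLAlchemy's Session.begin_nested()) before the error can be trapped and the loop continued.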

Event Timeline

ardumont triaged this task as Normal priority.
ardumont added a project: Lister.
ardumont renamed this task from "gitlab lister: full listing on large instance does not handle correctly concurrency writings" to "gitlab lister: make full listing on large instance more robust to concurrency writings". Mon, Jul 1, 10:16 AM