bitbucket lister does not work
Closed, Resolved, Public

Description

As discussed on IRC, the plan was to deploy the bitbucket lister if it works, or to open a task as a fallback (as per my proposal ;).
With the current swh-lister (tag ~0.0.27 or so), the bitbucket lister fails to execute properly.

Expected behavior

  1. listing properly
  2. no error in logs
  3. new cache entries in lister's bitbucket_repo table
  4. new scheduling tasks (load-hg, load-git) in scheduler db

What really happens

After scheduling a task for that lister, the task fails as shown below.

Details

Scheduling the task within the docker-env:

SCHEDULER_API_URL=http://localhost:5008/; swh scheduler --url $SCHEDULER_API_URL task add list-bitbucket-full --policy recurring api_baseurl='https://api.bitbucket.org/2.0'
Created 1 tasks

Task 2276
  Next run: just now (2019-06-18 12:46:34+00:00)
  Interval: 90 days, 0:00:00
  Type: list-bitbucket-full
  Policy: recurring
  Args:
  Keyword args:
    api_baseurl: 'https://api.bitbucket.org/2.0'

Letting it run:

swh-lister_1                  | [2019-06-18 12:46:36,547: ERROR/ForkPoolWorker-1] Task swh.lister.bitbucket.tasks.FullBitBucketRelister[32a85cb7-eb3b-41b8-b663-dce60e4fbaba] raised unexpected: ValueError("Can't partition an empty range")
swh-lister_1                  | Traceback (most recent call last):
swh-lister_1                  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
swh-lister_1                  |     R = retval = fun(*args, **kwargs)
swh-lister_1                  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/scheduler/task.py", line 45, in __call__
swh-lister_1                  |     return super().__call__(*args, **kwargs)
swh-lister_1                  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
swh-lister_1                  |     return self.run(*args, **kwargs)
swh-lister_1                  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/lister/bitbucket/tasks.py", line 34, in full_bitbucket_relister
swh-lister_1                  |     ranges = lister.db_partition_indices(split or GROUP_SPLIT)
swh-lister_1                  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/lister/core/indexing_lister.py", line 116, in db_partition_indices
swh-lister_1                  |     raise ValueError("Can't partition an empty range")
swh-lister_1                  | ValueError: Can't partition an empty range

I don't know the reason yet (could it be an API change?); the goal of this task is to analyze and fix it.

Event Timeline

ardumont triaged this task as Normal priority. Jun 18 2019, 2:58 PM
ardumont created this task.

Heads up: a priori, there is a bootstrap step missing.
@olasd told me we may need to run an incremental lister first (thanks).
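
A missing bootstrap would be consistent with the traceback above: a full relisting has to split the already-known index range into chunks, and with nothing listed yet there is no range to split. The sketch below only illustrates that failure mode. `partition_indices` is a hypothetical helper, it assumes integer indices, and it is not the actual `db_partition_indices` code from swh-lister.

```python
from typing import List, Optional, Tuple


def partition_indices(
    first: Optional[int], last: Optional[int], partitions: int
) -> List[Tuple[int, int]]:
    """Split the inclusive range [first, last] into contiguous (low, high) chunks.

    Hypothetical helper for illustration only. `first` and `last` stand for
    the smallest and largest index already stored by the lister; both are
    None when nothing has been listed yet.
    """
    if first is None or last is None:
        # Nothing in the table yet: there is no range to split.
        raise ValueError("Can't partition an empty range")
    if partitions < 1:
        raise ValueError("partitions must be >= 1")

    span = last - first + 1
    step = max(1, -(-span // partitions))  # ceiling division

    ranges = []
    low = first
    while low <= last:
        high = min(low + step - 1, last)
        ranges.append((low, high))
        low = high + 1
    return ranges
```

With an empty table (both bounds `None`) the call raises the same `ValueError` message as in the log, whereas a populated range such as `partition_indices(1, 10, 3)` splits normally.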

Indeed, the incremental lister does more but stops very soon (only 213 origins are found, and the main bitbucket instance is much larger than that).

More analysis is needed.

@douardda As discussed orally: I was unsure whether the lister had ever run, so I checked, and yes, it did.
We have data in the swh-lister db (~214 entries listed in table bitbucket_repo).
It ran around 2017-06-28 13:13:30.077108 (most probably before the scheduler existed).

So I believe the full lister implementation never worked (it looks like a copy/paste from the github lister's; the db_partition_indices implementation cannot work here).
The incremental one is mostly ok though (as already mentioned, its listing is not complete).

But somehow it stops, because the expected next link returned by the API call is not found after a certain iteration.

Tryout within docker gives the following log output:

...06-20 17:45:27,466: INFO/ForkPoolWorker-1] stopping after index 2008-09-07T21:41:16.564922+00:00, no next link found

Nevertheless, executing that query manually (from an interactive Python shell), I do get data back with a 'next' link...
I'm wondering if it's not a bitbucket API limitation on their side, a kind of rate limit implementation.

In [64]: url4='https://api.bitbucket.org/2.0/repositories?after=2008-09-07T21:41:16.564922+00:00'

In [65]: d4 = requests.get(url4)

In [66]: d4
Out[66]: <Response [200]>

In [67]: data = d4.json()

In [68]: data.keys()
Out[68]: dict_keys(['pagelen', 'values', 'next'])

In [69]: data['next']
Out[69]: 'https://api.bitbucket.org/2.0/repositories?after=2008-09-07T21%3A41%3A16.564922%2B00%3A00'
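
A 200 response carrying a 'next' key does not by itself show that the pagination advances. As a minimal sketch, using only the two URLs from the session above, the percent-encoded 'next' link can be decoded and compared with the URL that was requested. `decoded_query` is a hypothetical helper name introduced for this illustration.

```python
from urllib.parse import unquote, urlsplit


def decoded_query(url: str) -> str:
    """Return the query string of `url` with percent-escapes decoded.

    unquote() is used rather than unquote_plus() so that the literal "+"
    of the UTC offset is kept instead of being turned into a space.
    """
    return unquote(urlsplit(url).query)


# Both URLs are copied verbatim from the session above.
requested = (
    "https://api.bitbucket.org/2.0/repositories"
    "?after=2008-09-07T21:41:16.564922+00:00"
)
next_link = (
    "https://api.bitbucket.org/2.0/repositories"
    "?after=2008-09-07T21%3A41%3A16.564922%2B00%3A00"
)

# True: once decoded, the 'next' link carries the same `after` index
# as the request that produced it.
same_index = decoded_query(requested) == decoded_query(next_link)
print(same_index)
```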

So I believe the full lister implementation never worked (it looks like a copy/paste from the github lister's; the db_partition_indices implementation cannot work here).

Thinking further, that might not be the case, as the bitbucket lister was among the first listers built on the current scaffolding (if not the very first?).
In any case, it no longer works today. So it might be time to fix it, or to remove it altogether (if it is unused) and fix only the incremental one.

After discussing with @olasd, it's not a good idea to remove it.
The full lister is here to fill in the holes of the incremental lister.
So definitely, fixing it is the right way to go.

Looking further into it with olasd, we might have a way to fix the full lister.
I'll open a diff for that (D1629).

I'm wondering if it's not a bitbucket API limitation on their side, a kind of rate limit implementation.

Nope, I was wrong again.
Taking an even closer look, it turns out that the listing stops because the last 2 API requests, even though different, return the same next link (same next pagination index).
This triggers the termination condition check we have in the lister, which stops the iteration...
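
The termination check described above can be sketched as follows. This is an illustration under assumptions, not the swh-lister code: `iter_pages` and the injected `fetch` callable are hypothetical names, and each page is assumed to be a dict with an optional 'next' key, like the API response shown earlier.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_pages(fetch: Callable[[str], Page], start_url: str) -> Iterator[Page]:
    """Yield pages by following 'next' links (hypothetical sketch).

    Iteration stops when a page has no 'next' link, or when it returns
    the same 'next' link as the previous page did, since following it
    again would not make any progress.
    """
    url: Optional[str] = start_url
    previous_next: Optional[str] = None
    while url is not None:
        page = fetch(url)
        yield page
        next_url = page.get("next")
        if next_url is None or next_url == previous_next:
            break
        previous_next = next_url
        url = next_url


# Fake API for illustration: requests "u1" and "u2" are different,
# yet both answer with the same next link "u2".
fake_api = {
    "u0": {"values": [1], "next": "u1"},
    "u1": {"values": [2], "next": "u2"},
    "u2": {"values": [3], "next": "u2"},
}
pages = list(iter_pages(fake_api.__getitem__, "u0"))
```

In this fake run the loop fetches three pages and then stops at "u2", which mirrors how the real listing ends early when the API repeats a pagination index.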

So a priori, not a regression on our part, which is always a good thing to know \m/.

I'll open a diff to try and improve the pagination behavior for that peculiar case (D1634).

ardumont changed the task status from Open to Work in Progress. Jun 21 2019, 3:31 PM
ardumont claimed this task.