Those two core functions are called by the principal lister method,
:py:meth:`Lister.run`, found in the base class.
:py:meth:`get_pages` is the guts of the lister. It takes no arguments and must produce
data pages. An iterator is fine here, as the :py:meth:`Lister.run` method only means to
iterate over it in a single pass. This method gets its input from a network request to a
remote service's endpoint to retrieve the data we long for.
How tricky this is depends on whether the data is adequately structured for our purpose.
Here you may have to show off your data scraping skills, or just consume a well-designed
API. Those aspects are discussed more specifically in the section
:ref:`handling-specific-topics`.
In any case, we want the data we return to be usefully filtered and structured. The
easiest way to create an iterator is to use the ``yield`` keyword. Yield each data page
you have structured in accordance with the page type you have declared. The page type
exists only for static type checking of data passed from :py:meth:`get_pages` to
:py:meth:`get_origins_from_page`; you can choose whatever fits the bill.
:py:meth:`get_origins_from_page` is simpler. For each individual software origin you
have received in the page, you convert and yield a :py:class:`ListedOrigin` model
object. This datatype has the following mandatory fields:
* lister id: you generally fill this with the value of :py:attr:`self.lister_obj.id`
* visit type: the type of software distribution format the service provides, for use by
  a corresponding loader. It is an identifier, so you have to either use an existing
  value or craft a new one if you get off the beaten track and tackle a new software
  source. In the latter case, you will have to discuss the name with the core developers.
  Example: Phabricator is a forge that can handle Git or SVN repositories. The visit
  type would be "git" when listing such a repo that provides a Git URL that we can load.
* origin URL: a URL that, combined with the visit type, will serve as the input of
  the loader.
This datatype can also be further detailed with the following optional fields:
* last update date: freshness information on this origin, which is useful to the
  scheduler for optimizing its scheduling decisions. Fill it if the service provides it
  at no substantial additional runtime cost, e.g. in the same request.
* extra loader arguments: extra parameters to be passed to the loader for it to be
  able to load the origin. This is needed, for example, when additional context is
  required along with the URL to effectively load from the origin.
See the definition of :swh_web:`ListedOrigin <browse/swh:1:rev:03460207a17d82635ef5a6f12358392143eb9eef/?origin_url=https://forge.softwareheritage.org/source/swh-scheduler.git&path=swh/scheduler/model.py&revision=03460207a17d82635ef5a6f12358392143eb9eef#L134-L177>`.
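To illustrate the conversion described above, here is a minimal sketch rather than the actual implementation. The ``ListedOrigin`` dataclass below is a simplified stand-in for the real model so that the snippet runs on its own; its field names, the page layout (``clone_url``, ``updated_at``) and the example URLs are assumptions. In a real lister this would be a method reading the id from :py:attr:`self.lister_obj.id`.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Iterator, List, Optional
from uuid import UUID, uuid4


@dataclass
class ListedOrigin:
    """Simplified stand-in for the real model (field names are assumptions)."""

    lister_id: UUID
    visit_type: str
    url: str
    last_update: Optional[datetime] = None
    extra_loader_arguments: Dict[str, Any] = field(default_factory=dict)


# Hypothetical page type: a list of repository descriptions.
NewForgePage = List[Dict[str, Any]]


def get_origins_from_page(
    lister_id: UUID, page: NewForgePage
) -> Iterator[ListedOrigin]:
    """Convert each repository of a page into a ListedOrigin."""
    for repo in page:
        last_update = repo.get("updated_at")
        yield ListedOrigin(
            lister_id=lister_id,
            visit_type="git",
            url=repo["clone_url"],
            # Only filled when the service gave the date in the same response.
            last_update=datetime.fromisoformat(last_update) if last_update else None,
        )


page = [
    {
        "clone_url": "https://forge.example.org/foo.git",
        "updated_at": "2021-03-01T12:00:00+00:00",
    },
    {"clone_url": "https://forge.example.org/bar.git"},
]
origins = list(get_origins_from_page(uuid4(), page))
```

The second repository carries no date, so its ``last_update`` stays ``None`` instead of costing an extra request.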
Now that we have shown how those two methods operate, let's put it together by showing
how they fit in the principal :py:meth:`Lister.run` method::

    def run(self) -> ListerStats:
        full_stats = ListerStats()

        try:
            for page in self.get_pages():
                full_stats.pages += 1
                origins = self.get_origins_from_page(page)
                full_stats.origins += self.send_origins(origins)
                self.commit_page(page)
        finally:
            self.finalize()
            if self.updated:
                self.set_state_in_scheduler()

        return full_stats

:py:meth:`Lister.send_origins` is the method that sends listed origins to the scheduler.
The :py:class:`ListerStats` datastructure, defined alongside the base lister class, is
used to compute the number of listed pages and origins in a single lister run. It is
useful both for the scheduler, which automatically collects this information, and for
testing the lister.
You see that the bulk of a lister run consists of streaming data gathered from the
remote service to the scheduler. This is done under a ``try...finally`` construct to
have the lister state reliably recorded in case of an unhandled error. We will explain
the role of the remaining methods and attributes appearing here in the next section, as
it is related to the lister state.
.. _handling-lister-state:
Handling lister state
---------------------
With what we have covered so far you can write a stateless lister. Unfortunately,
some services provide too much data to efficiently deal with it in a one-shot fashion.
Listing a given software source can take several hours or days to process. A lister
can also produce valid output for a while, then fail on an unexpected condition and
have to start over. As we want to be able to resume the listing process from a given
element, provided by the remote service and guaranteed to be ordered, such as a date or
a numeric identifier, we need to deal with state.
The remaining part of the lister API is reserved for dealing with lister state.
If the service to list has no pagination, then the data set to handle is small enough to
not require keeping lister state. In the opposite case, you will have to determine which
piece of information should be recorded in the lister state. As said earlier, we
recommend declaring a dataclass for the lister state::

    @dataclass
    class NewForgeListerState:
        current: str = ""

    class NewForgeLister(Lister[NewForgeListerState, NewForgePage]):
        ...

A pair of methods, :py:meth:`state_from_dict` and :py:meth:`state_to_dict`, are used to
respectively import lister state from the scheduler and export lister state to the
scheduler. Some fields, such as dates, cannot be serialized to the scheduler as is, so
their conversion needs to be handled in these methods.
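As an illustration of the date case, here is a minimal sketch under stated assumptions: the state holds a single hypothetical ``last_seen`` date, and ISO 8601 strings are chosen as the serialized form. The two helpers are written as plain functions so the snippet runs on its own; in a lister they would be the methods named above.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class NewForgeListerState:
    """Hypothetical state: date of the most recent origin seen so far."""

    last_seen: Optional[datetime] = None


def state_from_dict(d: Dict[str, Any]) -> NewForgeListerState:
    """Import the state from its serialized form, parsing the date if present."""
    last_seen = d.get("last_seen")
    if last_seen is not None:
        last_seen = datetime.fromisoformat(last_seen)
    return NewForgeListerState(last_seen=last_seen)


def state_to_dict(state: NewForgeListerState) -> Dict[str, Any]:
    """Export the state, turning the date into an ISO 8601 string."""
    d = asdict(state)
    if d["last_seen"] is not None:
        d["last_seen"] = d["last_seen"].isoformat()
    return d


state = NewForgeListerState(datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc))
serialized = state_to_dict(state)
restored = state_from_dict(serialized)
```

An empty dictionary, as found on a first run, yields the default state with no date.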
Where is the state used? Taking the general case of a paginating service, the lister
state is used at the beginning of the :py:meth:`get_pages` method to initialize the
variables associated with the last listing progress. That way we can start from an
arbitrary element, or just the first one if there is no last lister state.
The :py:meth:`commit_page` method is called on successful page processing, after the new
origins are sent to the scheduler. Here you should mainly update the lister state by
taking into account the new page processed, e.g. advance a date or serial field.
Finally, upon either completion or error, the :py:meth:`finalize` method is called. There
you must set the attribute :py:attr:`updated` to ``True`` if you were successful in
advancing in the listing process. To do this you will commonly retrieve the latest saved
lister state from the scheduler and compare it with your current lister state. If the
lister state was updated, the current lister state will ultimately be recorded in the
scheduler.
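The interplay of these three methods can be sketched as follows. This is only a toy model, assuming a numeric identifier as the resume point: ``StatefulListerSketch`` is a stand-in class and not the real base class, ``saved_state`` plays the role of the state stored in the scheduler, and ``repo_ids`` plays the role of the remote service.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class NewForgeListerState:
    current: int = 0  # highest repository id processed so far (hypothetical)


Page = List[Dict[str, int]]


class StatefulListerSketch:
    """Stand-in class mimicking only the state handling of a lister."""

    def __init__(self, saved_state: NewForgeListerState, repo_ids: List[int]):
        self.saved_state = saved_state  # stands for the state kept by the scheduler
        self.state = NewForgeListerState(current=saved_state.current)
        self.updated = False
        self.repo_ids = repo_ids  # stands for the remote service

    def get_pages(self) -> Iterator[Page]:
        # Resume after the last recorded id, two repositories per page.
        todo = [i for i in self.repo_ids if i > self.state.current]
        for start in range(0, len(todo), 2):
            yield [{"id": i} for i in todo[start : start + 2]]

    def commit_page(self, page: Page) -> None:
        # Called once the origins of the page were sent: advance the state.
        self.state.current = max(self.state.current, max(r["id"] for r in page))

    def finalize(self) -> None:
        # Flag progress only if we went further than the saved state.
        if self.state.current > self.saved_state.current:
            self.updated = True


lister = StatefulListerSketch(NewForgeListerState(current=2), repo_ids=[1, 2, 3, 4, 5])
pages = []
for page in lister.get_pages():
    pages.append(page)
    lister.commit_page(page)
lister.finalize()
```

Repositories 1 and 2 are skipped because the saved state already covers them, and ``updated`` ends up set since the listing advanced to id 5.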
We have now seen the stateful lister API. Note that some listers may implement more
flexibility in the use of lister state. Some allow an ``incremental`` parameter that
governs whether or not we will do a stateful listing. It is up to you to support
additional functionality if it seems relevant.
.. _handling-specific-topics:
Handling specific topics
------------------------
Here is a quick coverage of common topics left out from lister construction and
:py:meth:`get_pages` descriptions.
Sessions
^^^^^^^^
When requesting a web service repeatedly, most parameters, including headers, do not
change and can be set up once initially. We recommend setting up a session, e.g. an HTTP
session, as an instance attribute so that further requesting code can focus on what
really changes. Some ubiquitous HTTP headers include "Accept", to be set to the service
response format, and "User-Agent", for which we provide a recommended value
:py:const:`USER_AGENT` to be imported from :py:mod:`swh.lister`. Authentication is also
commonly provided through headers, so you can also set it up in the session.
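The idea can be sketched with the standard library alone, so that the snippet runs without extra dependencies. With the ``requests`` library the equivalent would be a ``requests.Session`` whose ``headers`` mapping is updated once. The class name, the placeholder ``USER_AGENT`` value and the bearer-token scheme are assumptions made for the example.

```python
import urllib.request
from typing import Optional

# Placeholder value: a real lister would import USER_AGENT from swh.lister.
USER_AGENT = "example-lister/0.1 (hypothetical)"


class SessionHolder:
    """Hypothetical stand-in for a lister keeping a pre-configured opener."""

    def __init__(self, token: Optional[str] = None):
        # Headers set here are sent with every request made through the opener.
        self.opener = urllib.request.build_opener()
        headers = [("Accept", "application/json"), ("User-Agent", USER_AGENT)]
        if token:
            # Assumed authentication scheme; services differ on this point.
            headers.append(("Authorization", f"Bearer {token}"))
        self.opener.addheaders = headers


holder = SessionHolder(token="secret")
configured = dict(holder.opener.addheaders)
```

Requesting code would then only call ``holder.opener.open(url)`` with the URL that changes from one request to the next.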
Transport error handling
^^^^^^^^^^^^^^^^^^^^^^^^
We generally recommend logging every error that cannot be handled, along with the
response content, and then immediately stopping the listing by doing an equivalent of
:py:meth:`Response.raise_for_status` from the ``requests`` library. As for rate-limiting
errors, our strategy is to use a flexible decorator that handles the retrying for us.
It is based on the ``tenacity`` library and accessible as :py:func:`throttling_retry` from
:py:mod:`swh.lister.utils`.
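To make the retrying idea concrete, here is a hand-written sketch of a retry decorator with exponential backoff. It is not :py:func:`throttling_retry` and does not use ``tenacity``; the ``RateLimited`` exception, the decorator name and its parameters are all hypothetical. The ``sleep`` function is injectable so the example can run without actually waiting.

```python
import functools
import time


class RateLimited(Exception):
    """Hypothetical error raised when the service answers "too many requests"."""


def retry_on_rate_limit(max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry the decorated function on RateLimited, doubling the wait each time."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except RateLimited:
                    if attempt == max_attempts - 1:
                        raise  # give up and let the error stop the listing
                    sleep(base_delay * 2**attempt)  # exponential backoff

        return wrapper

    return decorator


delays = []
calls = {"count": 0}


@retry_on_rate_limit(max_attempts=4, base_delay=1.0, sleep=delays.append)
def fetch_page():
    # Simulated service: rate-limited twice, then answers normally.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited()
    return {"repositories": []}


result = fetch_page()
```

Any other exception is left untouched and propagates immediately, which matches the recommendation to stop the listing on errors that cannot be handled.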
Pagination
^^^^^^^^^^
This one is a moving target. You have to understand how the pagination mechanics of the
particular service work. Here are some guidelines, though. The identifier of the next
page may be minimal (an id to pass as a query parameter), compound (a set of such
parameters) or complete (a whole URL). If the service provides the next URL, use it. This
piece of information may be found either in the response body or in a header. Once it is
identified, you still have to implement the logic of requesting and extracting it in a
loop, and of quitting the loop when there is no more data to fetch.
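For the case where the service provides a complete next URL in the response body, the loop could look like the sketch below. The ``repositories`` and ``next`` keys, the example URLs and the ``fetch`` callable are assumptions; ``fetch`` stands for whatever performs the actual request and decodes the body, and is replaced here by a dictionary lookup so that the snippet runs offline.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Page = List[Dict[str, Any]]


def iter_pages(
    fetch: Callable[[str], Dict[str, Any]], first_url: str
) -> Iterator[Page]:
    """Yield pages one by one, following the "next" URL until it is absent."""
    url: Optional[str] = first_url
    while url is not None:
        body = fetch(url)
        yield body["repositories"]
        url = body.get("next")  # no "next" key means this was the last page


# Fake service with two pages, keyed by URL.
FAKE_SERVICE = {
    "https://forge.example.org/api/repos": {
        "repositories": [{"name": "foo"}],
        "next": "https://forge.example.org/api/repos?page=2",
    },
    "https://forge.example.org/api/repos?page=2": {
        "repositories": [{"name": "bar"}],
    },
}

pages = list(iter_pages(FAKE_SERVICE.__getitem__, "https://forge.example.org/api/repos"))
```

When the next URL comes from a header instead, only the line extracting ``url`` changes; the loop and its exit condition stay the same.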
Page results
^^^^^^^^^^^^
First, when retrieving page results, which involves some protocols and parsing logic,
please make sure that any deviation from what was expected results in an informative
error. You also have to simplify the results, both by filtering with request parameters
if the service supports it, and by extracting from the response only the information
needed into a structured page. This all makes for easier debugging.
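A minimal sketch of both points, assuming a JSON body with a ``repositories`` list whose entries carry ``clone_url`` and ``updated_at`` (all hypothetical names): unexpected shapes raise an error that says what was wrong and shows the offending data, and only the needed fields are kept.

```python
from typing import Any, Dict, List


def extract_page(body: Any) -> List[Dict[str, str]]:
    """Keep only the fields we need, failing loudly on unexpected input."""
    if not isinstance(body, dict) or "repositories" not in body:
        raise ValueError(f"Unexpected response body, no 'repositories' key: {body!r}")
    page = []
    for repo in body["repositories"]:
        try:
            page.append({"url": repo["clone_url"], "updated_at": repo["updated_at"]})
        except KeyError as e:
            # Name the missing field and show the entry to ease debugging.
            raise ValueError(f"Repository entry lacks field {e}: {repo!r}") from e
    return page


body = {
    "repositories": [
        {"clone_url": "https://forge.example.org/foo.git", "updated_at": "2021-03-01", "stars": 3}
    ]
}
page = extract_page(body)
```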
Misc files
^^^^^^^^^^
There are also a few files that need to be modified outside of the lister directory, namely:
* :file:`/setup.py` to add your lister to the end of the list in the *setup* section::

      entry_points="""
          [swh.cli.subcommands]
          lister=swh.lister.cli
          [swh.workers]
          lister.bitbucket=swh.lister.bitbucket:register
          lister.cgit=swh.lister.cgit:register
          ..."""

* :file:`/swh/lister/tests/test_cli.py` to get a default set of parameters in scheduler-related tests.
* :file:`/README.md` to reference the new lister.
* :file:`/CONTRIBUTORS` to add your name.
Testing your lister
-------------------
When developing a new lister, it's important to test. For this, add the tests
(check :file:`swh/lister/*/tests/`) and register the celery tasks in the main