Changeset View
Standalone View
swh/lister/cgit/lister.py
- This file was added.
# Copyright (C) 2019 the Software Heritage developers | |||||
# License: GNU General Public License version 3, or any later version | |||||
# See top-level LICENSE file for more information | |||||
import random | |||||
import logging | |||||
from bs4 import BeautifulSoup | |||||
import requests | |||||
from urllib.parse import urlparse | |||||
from .models import CGitModel | |||||
from swh.lister.core.simple_lister import SimpleLister | |||||
from swh.lister.core.lister_transports import ListerOnePageApiTransport | |||||
class CGitLister(ListerOnePageApiTransport, SimpleLister): | |||||
ardumont: Don't know if you saw, but for pagination purposes, IIRC, the `PageByPageLister`.
Gitlab lister… | |||||
Done Inline ActionsI checked PageByPageLister (used in gitlab) and SWHIndexingHttpLister (used in github). They can be used here but :
2)Current technique which I used fits quite well in the code, if we change to `PageByPageLister` then it would require a lot of revamping of code
FWIW, I don't think using PageByPageLister would serve any help. I would like to know you opinion on this. Btw why do we have both PageByPageLister and SWHIndexingHttpLister , they serve quite similar purpose. nahimilega: I checked `PageByPageLister` (used in gitlab) and `SWHIndexingHttpLister` (used in github). | |||||
Done Inline Actions
I'll take a closer look at your diff.
(*urk* at the name, we should drop the SWH prefix as it's redundant within swh modules ;) ardumont: > FWIW, I don't think using PageByPageLister would serve any help. I would like to know you… | |||||
MODEL = CGitModel | |||||
LISTER_NAME = 'cgit' | |||||
PAGE = None | |||||
Done Inline ActionsWhy not None? ardumont: Why not None? | |||||
url_prefix_present = True | |||||
Done Inline ActionsWe should stop using different names around. That way, we can push the following check systematically done here (about the trailing /). ardumont: We should stop using different names around.
Latest other lister named that api_baseurl.
Maybe… | |||||
Done Inline ActionsNow I took help of urllib.parse, so there is no need of that check anymore nahimilega: Now I took help of `urllib.parse`, so there is no need of that check anymore | |||||
Done Inline ActionsI am not sure what you meant here, shall I put a comment in init() telling that base_url means api_base_url. nahimilega: I am not sure what you meant here, shall I put a comment in __init__() telling that base_url… | |||||
Done Inline ActionsI meant (i think because that's an old comment) to name that url url. Please explicit the argument in the constructor (__init__) docstring. Please keep the instance as second parameter to have something consistent with other listers. That'd give: def __init__(self, url, instance, url_prefix... I did not read the reasoning behind the url_prefix below yet. ardumont: I meant (i think because that's an old comment) to name that url `url`.
Please explicit the… | |||||
def __init__(self, url, instance=None, url_prefix=None, | |||||
Done Inline ActionsWhy is this needed? If it's not to work around some common gotcha with cgit, we should not disturb configuration data. If it's for the sake of uniformity, that uniformity should be enforced on the cgit lister configuration, not here. zack: Why is this needed?
If it's not to work around some common gotcha with cgit, we should not… | |||||
Done Inline ActionsThis is for if someone enters the base url as http://abc.com . nahimilega: This is for if someone enters the base url as http://abc.com .
It will convert it to http… | |||||
Done Inline ActionsThis is also done in phabricator lister, anlambert recommended me to do that in pharicator lister. So I used it here as well. nahimilega: This is also done in phabricator lister, anlambert recommended me to do that in pharicator… | |||||
override_config=None): | |||||
"""Inits Class with PAGE url and origin url prefix. | |||||
Args: | |||||
url (str): URL of the CGit instance. | |||||
Done Inline ActionsTry to explicit a bit why you need that url_prefix. ardumont: Try to explicit a bit why you need that `url_prefix`.
We know now but we will forget.
When… | |||||
Done Inline ActionsI have also mentioned it in commit message nahimilega: I have also mentioned it in commit message | |||||
instance (str): Name of cgit instance. | |||||
url_prefix (str): Prefix of the origin_url. Origin link of the | |||||
repos of some special instances do not match | |||||
the url of the repository page, they have origin | |||||
Done Inline ActionsI did not get that part. Can't urllib.parse.urlparse help a bit? ardumont: I did not get that part.
Are you sure it works for multiple instances?
Can't `urllib.parse. | |||||
Done Inline ActionsAlso, i saw the previous exchange about this, still not clear to me what this does. That will have the advantage to somehow document it? (I'm still interesting by the explanations though ;) ardumont: Also, i saw the previous exchange about this, still not clear to me what this does.
Can you… | |||||
Done Inline ActionsNow I have extracted that part into a function find_netloc() and also written test cases for it. nahimilega: Now I have extracted that part into a function `find_netloc()` and also written test cases for… | |||||
url in the format <url_prefix>/<repo_name>. | |||||
""" | |||||
Done Inline Actionsself.url_prefix = url if url_prefix is None else url_prefix ardumont: self.url_prefix = url if url_prefix is None else url_prefix | |||||
self.PAGE = url | |||||
if url_prefix is None: | |||||
self.url_prefix = url | |||||
Done Inline Actionsdictionaries. ardumont: dictionaries. | |||||
self.url_prefix_present = False | |||||
else: | |||||
Done Inline ActionsA wee bit more detail of the algo used to parse the output would be welcome. ardumont: A wee bit more detail of the algo used to parse the output would be welcome. | |||||
Not Done Inline Actionsself.url_netloc = find_netloc(urlparse(self.PAGE)) ardumont: self.url_netloc = find_netloc(urlparse(self.PAGE)) | |||||
Not Done Inline ActionsI don't think this a good idea because urllib object of self.PAGE is also used in line 50 (and also in line 47 for which the comment originally was) nahimilega: I don't think this a good idea because urllib object of `self.PAGE` is also used in line 50… | |||||
Not Done Inline ActionsI don't understand your reply but not too worry. ardumont: I don't understand your reply but not too worry.
You can keep the code as is.
| |||||
self.url_prefix = url_prefix | |||||
if not self.url_prefix.endswith('/'): | |||||
self.url_prefix += '/' | |||||
Done Inline ActionsAs this snippet BeautifulSoup(...) used quite a lot, i'd extract this in a function and use that function call instead (including in tests). So can you please refactor to something like: repo_soup = make_repo_soup(response.text) ardumont: As this snippet `BeautifulSoup(...)` used quite a lot, i'd extract this in a function and use… | |||||
url = urlparse(self.PAGE) | |||||
Done Inline Actionsthis way of indenting looks odd, wouldn't: BeautifulSoup(response.text, features="html.parser") \ .find('div', {"class": "content"}) be more customary? (Note: I haven't checked what PEP8 has to say about either version.) zack: this way of indenting looks odd, wouldn't:
```
BeautifulSoup(response.text, features="html. | |||||
self.url_netloc = find_netloc(url) | |||||
Not Done Inline ActionsI forgot to ask before, why do you need to call this like that? Why not super().__init__...? (i think i saw why but still, can you explicit?) Also, why isn't it in the first part of the constructor? ardumont: I forgot to ask before, why do you need to call this like that?
Why not `super().__init__...`? | |||||
if not instance: | |||||
instance = url.hostname | |||||
self.instance = instance | |||||
ListerOnePageApiTransport .__init__(self) | |||||
SimpleLister.__init__(self, override_config=override_config) | |||||
Done Inline ActionsDrop the all the expression you use everywhere. Find repositories metadata by parsing the html page (response's raw content). If there are links in the html page, retrieve those repositories metadata from those pages as well. Return the repositories as list of dictionaries. ardumont: Drop the `all the` expression you use everywhere.
Simplify this to something like:
```
Find… | |||||
Done Inline ActionsSorry, now I realise all the is redundant in the docstring. nahimilega: Sorry, now I realise `all the` is redundant in the docstring. | |||||
Done Inline ActionsArgs: response (Response): http api request response Returns: repository origin urls (as dict) included in the response ymmv but that's the gist of it, i think. ardumont: ```
Args:
response (Response): http api request response
Returns:
repository origin… | |||||
def list_packages(self, response): | |||||
Done Inline Actionsitems in the dict should be only one extra indent level vlorentz: items in the dict should be only one extra indent level | |||||
Done Inline ActionsOk, talking a bit between us, we are wondering whether this is needed. We could have a more lightweight approach first. That'd have the benefit to:
ardumont: Ok, talking a bit between us, we are wondering whether this is needed.
Going that way (priority… | |||||
Done Inline ActionsBefore choosing this method of visiting every repo page, I did some research, here is the result: For https://cgit.freedesktop.org/ It has some clone link as https://gitlab.freedesktop.org/xdg/shared-mime-info and some as https://anongit.freedesktop.org/git/pulseaudio/pavucontrol.git.bup. There is no comman pattern which could be exploited to reconstruct the clone link. Neither any info related to clone link is available on the base_url page. Hence I decided to take the approach to visit every page nahimilega: Before choosing this method of visiting every repo page, I did some research, here is the… | |||||
Done Inline ActionsI just now thought of a clever approach which is lightweight and will do the work in a matter of seconds( unlike ~1 hour). The repositories in cgit instance are divided into groups with a group heading (in pale white colour as in https://cgit.freedesktop.org/) For all the members of the group the clone link is in format <some_url_comman_to_all members_of_a_group>/<name>.git To find the <some_url_comman_to_all members_of_a_group> we need to visit the page of only one repository for a group. This would drastically decrease the no. of requests and make the code faster. (although would increase the quantity of code) I will test this method and inform you about the result. Does this method sound good? nahimilega: I just now thought of a clever approach which is lightweight and will do the work in a matter… | |||||
Done Inline Actions
which then will be out of scope for that lister.
This repo does not list anything so that's it.
That's the loader's concern, not the lister's. Some listers currently list private repositories for example. It's not an issue.
Well, as a first approximation, i'd say let's go towards removing the visit step anyway (and compute basic git clone url, self.get_url sounds good enough IIRC). It's good to start incrementally and not overthink everything. Mostly, aside my remarks on the diff (that i'd like you to take into accounts nonetheless), the lister is almost ready.
Which is a reasonable choice.
I'd say yes. ardumont: > For https://cgit.freedesktop.org/
> It has some clone link as https://gitlab.freedesktop. | |||||
"""List the actual cgit instance origins from the response. | |||||
Find repositories metadata by parsing the html page (response's raw | |||||
content). If there are links in the html page, retrieve those | |||||
repositories metadata from those pages as well. Return the | |||||
repositories as list of dictionaries. | |||||
Done Inline Actionsadd a trailing comma after this line, it will make diff that in the future might add an additional line nicer/shorter zack: add a trailing comma after this line, it will make diff that in the future might add an… | |||||
Done Inline ActionsDo we have roughly an idea of the number of pages we can encounter in different cgit instances? ardumont: Do we have roughly an idea of the number of pages we can encounter in different cgit instances? | |||||
Not Done Inline ActionsI think its 275, in http://hdiff.luite.com/cgit/ nahimilega: I think its 275, in http://hdiff.luite.com/cgit/ | |||||
Not Done Inline ActionsSorry, i was unclear, I meant the average number of pages. ardumont: Sorry, i was unclear, I meant the average number of pages.
The instance you mentioned is so far… | |||||
Done Inline ActionsMost of the cgit instances don't have a pagination. nahimilega: Most of the cgit instances don't have a pagination.
I think the average would be around 2-3… | |||||
Done Inline ActionsFollowing the logic presented in comments below, here you want to exclude the first page so: repos.extend(self.get_repos(pages[1:]) Note:
ardumont: Following the logic presented in comments below, here you want to exclude the first page so… | |||||
Args: | |||||
Done Inline ActionsIndentation is still off to me. ardumont: Indentation is still off to me. | |||||
response (Response): http api request response. | |||||
Returns: | |||||
List of repository origin urls (as dict) included in the response. | |||||
Done Inline Actionsindentation is off. ardumont: indentation is off.
| |||||
""" | |||||
Done Inline ActionsI still don't think indentation is correct here, can you please show me what would be the correct indentation here nahimilega: I still don't think indentation is correct here, can you please show me what would be the… | |||||
Done Inline ActionsIn doubt, look at how generally the source code is written first in this very repository (and in others if not enough). Just align with the beginning of the line (at current indentation level): repos_details.append({ 'name': repo_name, 'time': time, 'origin_url': origin_url, }) ardumont: In doubt, look at how generally the source code is written first in this very repository (and… | |||||
repos_details = [] | |||||
Done Inline ActionsWhy the need to shuffle? ardumont: Why the need to shuffle? | |||||
Done Inline ActionsI first saw this shuffle in pypi lister, so to maintain uniformity I also used it in gnu and cran lister and here too. Although I don't really know its importance. nahimilega: I first saw this shuffle in pypi lister, so to maintain uniformity I also used it in gnu and… | |||||
Done Inline ActionsOk. I don't remember the reasons though, too bad. ardumont: Ok.
I don't remember the reasons though, too bad. | |||||
repos = get_repo_list(response.text) | |||||
url_soup = make_soup(response.text) | |||||
pages = self.get_pages(url_soup) | |||||
if len(pages) > 1: | |||||
Done Inline ActionsOne last modification, extract this in a method. def _yield_repo_from_responses(self, response): """Yield repositories from request responses... Args: ... Yields: ... """ html = response.text yield from get_repo_list(html) pages = self.get_pages(make_soup(html)) if len(pages) > 1: yield from self.get_repos_from_pages(pages[1:]) def list_packages(self, response): """... <existing docstring> """ for repo in self._list_repo_from_response(response): ... ardumont: One last modification, extract this in a method.
To avoid the `list` wrapping.
```
def… | |||||
repos.extend(list(self.get_repos_from_pages(pages[1:]))) | |||||
for repo in repos: | |||||
repo_name = repo.a.text | |||||
origin_url = self.find_origin_url(repo, repo_name) | |||||
Done Inline Actionsthis way of typesetting args/return is not consistent with our python coding guidelines, see, e.g., https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html zack: this way of typesetting args/return is not consistent with our python coding guidelines, see, e. | |||||
try: | |||||
Done Inline ActionsHere I don't think this is the best way to do it. It looks too messy but I can't use s.find_all('/') because find_all is method present in bs4. Can you please recommend some smart approach to do it nahimilega: Here I don't think this is the best way to do it. It looks too messy but I can't use s.find_all… | |||||
Done Inline Actions(part1, part2, next_url) = self.base_url.split('/', 2) vlorentz: `(part1, part2, next_url) = self.base_url.split('/', 2)` | |||||
Done Inline ActionsThanks, @vlorentz, This is really neat of doing it nahimilega: Thanks, @vlorentz, This is really neat of doing it | |||||
Done Inline Actions*neat way of doing it nahimilega: *neat way of doing it | |||||
time = repo.span['title'] | |||||
except Exception: | |||||
time = None | |||||
if origin_url is not None: | |||||
repos_details.append({ | |||||
'name': repo_name, | |||||
Done Inline Actionsif not pages: ardumont: `if not pages:` | |||||
'time': time, | |||||
'origin_url': origin_url, | |||||
}) | |||||
Done Inline ActionsWhat's the format of a['href']? Is that a href without the absolute url? ardumont: What's the format of a['href']?
Is that a href without the absolute url?
Is that true for all… | |||||
Done Inline Actionsa['href'] is of string format. It is without absolute url, it is in relative to the url_netloc I computed in my code. For example for http://git.savannah.gnu.org/cgit/, all the URL are respect to http://git.savannah.gnu.org
It is true for all the instance I saw so far(all mentioned in T1835 nahimilega: a['href'] is of string format.
It is without absolute url, it is in relative to the… | |||||
Done Inline Actionsgreat, thanks. ardumont: great, thanks. | |||||
random.shuffle(repos_details) | |||||
Done Inline ActionsAlso why not using a list comprehension here? return [self.get_url(page) for page in pages] ardumont: Also why not using a list comprehension here?
```
return [self.get_url(page) for page in… | |||||
return repos_details | |||||
def find_origin_url(self, repo, repo_name): | |||||
Done Inline Actionspages ([str]): .. ardumont: pages ([str]): .. | |||||
"""Finds the origin url for a repository | |||||
Args: | |||||
Done Inline Actionspages (except (missing space) ardumont: `pages (except` (missing space) | |||||
repo (Beautifulsoup): Beautifulsoup object of the repository | |||||
Done Inline ActionsReturns: List of beautifulsoup of all the repositories (url) row present in all the pages (except the first one). Drop the of the html code of, we mention it's a beautifulsoup object, which as far as i got it is an xml/html parser. ardumont: ```
Returns:
List of beautifulsoup of all the repositories (url) row present in all the… | |||||
row present in base url. | |||||
repo_name (str): Repository name. | |||||
Done Inline ActionsYou are not retrieving pages here, you are fetching repositories from pages (all the pages you parsed in the first html page). So either:
ardumont: You are not retrieving pages here, you are fetching repositories from pages (all the pages you… | |||||
Returns: | |||||
string: origin url. | |||||
Done Inline Actionsinstance ardumont: instance | |||||
Done Inline ActionsRequest the available reposfrom the pages. This yields the available repositories found as beautiful object representation. ardumont: Request the `available repos`from the pages. This yields the available repositories found as… | |||||
""" | |||||
if self.url_prefix_present: | |||||
return self.url_prefix + repo_name | |||||
return self.get_url(repo) | |||||
Done Inline ActionsYields (if you follow the snippet proposed below). ardumont: Yields (if you follow the snippet proposed below). | |||||
def get_pages(self, url_soup): | |||||
"""Find URL of all pages. | |||||
Done Inline Actionsas before: this doesn't conform with Napoleon style zack: as before: this doesn't conform with Napoleon style | |||||
Done Inline Actionsthis looks familiar... ardumont: this looks familiar...
Please refactor this within a method or a function. | |||||
Done Inline Actionsthis looks familiar... ardumont: this looks familiar...
Please refactor this within a method or a function. | |||||
Finds URL of pages that are present by parsing over the HTML of | |||||
Done Inline ActionsHeads up, code review stopped here! ardumont: Heads up, code review stopped here! | |||||
pagination present at the end of the page. | |||||
Args: | |||||
url_soup (Beautifulsoup): a beautifulsoup object of base URL | |||||
Done Inline Actionsdoctests should be written like this: >>> find_origin_url('http://git.savannah.gnu.org/cgit/fbvbconv-py.git/') 'https://git.savannah.gnu.org/git/fbvbconv-py.git' with no indent vlorentz: doctests should be written like this:
```
>>> find_origin_url('http://git.savannah.gnu. | |||||
Returns: | |||||
list: URL of pages present for a cgit instance | |||||
Done Inline ActionsYou can use a generator here: for page in pages: response = requests.get(page) if not response.ok: # deal with error as warning without impeding the listing to finish logger.warn('Failed to retrieve repositories from page %s', page) continue yield from get_repo_list(response.text) Also, remove the trick about removing the first page. ardumont: You can use a generator here:
```
for page in pages:
response = requests.get(page)
if… | |||||
""" | |||||
pages = url_soup.find('div', {"class": "content"}).find_all('li') | |||||
if not pages: | |||||
return [self.PAGE] | |||||
return [self.get_url(page) for page in pages] | |||||
def get_repos_from_pages(self, pages): | |||||
"""Find repos from all pages. | |||||
Request the available repos from the pages. This yields | |||||
the available repositories found as beautiful object representation. | |||||
Args: | |||||
pages ([str]): list of urls of all pages present for a | |||||
Done Inline Actionssame: Napoleon mismatch zack: same: Napoleon mismatch | |||||
particular cgit instance. | |||||
Yields: | |||||
List of beautifulsoup object of repository (url) rows | |||||
present in pages(except first). | |||||
""" | |||||
ardumontUnsubmitted Done Inline ActionsRemove extra line. Start directly from the last docstring comment. ardumont: Remove extra line. Start directly from the last docstring comment.
We do always have the extra… | |||||
for page in pages: | |||||
Done Inline Actionssame vlorentz: same | |||||
Done Inline ActionsI think this is indented too much (at least the first line, not sure about the rest w.r.t. what doctest expects) zack: I think this is indented too much (at least the first line, not sure about the rest w.r.t. what… | |||||
response = requests.get(page) | |||||
if not response.ok: # deal with error as warning without impeding | |||||
# the listing to finish | |||||
ardumontUnsubmitted Done Inline Actionscomment here is not necessarily used, i added it in the diff comment to explicit it to you ;) ardumont: comment here is not necessarily used, i added it in the diff comment to explicit it to you ;) | |||||
logging.warning('Failed to retrieve repositories from page %s', | |||||
page) | |||||
Done Inline ActionsHow come response is already in the right format for this lister? ardumont: How come `response` is already in the right format for this lister? | |||||
continue | |||||
yield from get_repo_list(response.text) | |||||
def get_url(self, repo): | |||||
"""Finds url of a repo page. | |||||
Finds the url of a repo page by parsing over the html of the row of | |||||
that repo present in the base url. | |||||
Args: | |||||
repo (Beautifulsoup): a beautifulsoup object of the repository | |||||
row present in base url. | |||||
Returns: | |||||
string: The url of a repo. | |||||
""" | |||||
suffix = repo.a['href'] | |||||
return self.url_netloc + suffix | |||||
def get_model_from_repo(self, repo): | |||||
Done Inline ActionsNapoleon zack: Napoleon | |||||
"""Transform from repository representation to model. | |||||
""" | |||||
return { | |||||
'uid': self.PAGE + repo['name'], | |||||
'name': repo['name'], | |||||
'full_name': repo['name'], | |||||
'html_url': repo['origin_url'], | |||||
'origin_url': repo['origin_url'], | |||||
'origin_type': 'git', | |||||
Done Inline Actions"""Find all origin urls per repository by... ardumont: """Find all origin urls per repository by... | |||||
Done Inline ActionsFind repositories (as beautifulsoup object) available within the server response. ardumont: Find repositories (as beautifulsoup object) available within the server response.
| |||||
'time_updated': repo['time'], | |||||
'instance': self.instance, | |||||
} | |||||
Done Inline ActionsDrop this, the first line proposed is enough (i think). ardumont: Drop this, the first line proposed is enough (i think). | |||||
Done Inline Actionsa beautiful soup object repo representation ardumont: `a beautiful soup object repo representation` | |||||
def transport_response_simplified(self, repos_details): | |||||
"""Transform response to list for model manipulation. | |||||
Done Inline ActionsAll possible origin urls for a repository (dict with key 'protocol', value the associated url). ardumont: All possible origin urls for a repository (dict with key 'protocol', value the associated url). | |||||
""" | |||||
return [self.get_model_from_repo(repo) for repo in repos_details] | |||||
Done Inline ActionsList all the repositories as beautifulsoup object within the response. ardumont: List all the repositories as beautifulsoup object within the response. | |||||
def find_netloc(url): | |||||
Done Inline ActionsThese are just one line function and not much of logic is present in these functions, they are just simple HTML parsing. Do we need tests for these functions too? nahimilega: These are just one line function and not much of logic is present in these functions, they are… | |||||
Done Inline ActionsYes, to avoid decreasing too much the coverage. ardumont: Yes, to avoid decreasing too much the coverage.
If they are not much logic, they are simple to… | |||||
"""Finds the network location from then url. | |||||
URL in the repo are relative to the network location part of base | |||||
URL, so we need to compute it to reconstruct URLs. | |||||
Done Inline ActionsWhat's repo_url's type? please, mention it in the docstring. Args: repo_url (<TYPE>): ... Please, do so in the current diff everywhere it's missing. In case the type is from a tiers-party module, it's fine to use that as the type (for example response (Response): ...). ardumont: What's repo_url's type?
please, mention it in the docstring.
```
Args:
repo_url (<TYPE>)… | |||||
Args: | |||||
Done Inline ActionsInstantiates a beautiful soup object from the response object. ardumont: Instantiates a beautiful soup object from the response object. | |||||
url (urllib): urllib object of url. | |||||
Returns: | |||||
string: Scheme and Network location part in the base URL. | |||||
Example: | |||||
For url = https://git.kernel.org/pub/scm/ | |||||
>>> find_netloc(url) | |||||
'https://git.kernel.org' | |||||
""" | |||||
return '%s://%s' % (url.scheme, url.netloc) | |||||
def get_repo_list(response): | |||||
"""Find repositories (as beautifulsoup object) available within the server | |||||
response. | |||||
Args: | |||||
response (Response): server response | |||||
Done Inline Actionsorigin_urls (Dict): All possible origin urls for a repository (key 'protocol', value the associated url) ardumont: origin_urls (Dict): All possible origin urls for a repository (key 'protocol', value the… | |||||
Returns: | |||||
List all repositories as beautifulsoup object within the response. | |||||
Done Inline ActionsUrl (str) with the highest... ardumont: Url (str) with the highest... | |||||
""" | |||||
repo_soup = make_soup(response) | |||||
Done Inline ActionsI've already mentioned i'd prefer this named repo_soup or plain repo (as long as we mentioned in the docstring what the type is expected). ardumont: I've already mentioned i'd prefer this named repo_soup or plain repo (as long as we mentioned… | |||||
return repo_soup \ | |||||
.find('div', {"class": "content"}).find_all("tr", {"class": ""}) | |||||
def make_soup(response): | |||||
"""Instantiates a beautiful soup object from the response object. | |||||
""" | |||||
return BeautifulSoup(response, features="html.parser") |
Don't know if you saw, but for pagination purposes, IIRC, the PageByPageLister.
Gitlab lister uses it.
Maybe this one could?
I did not check the rest yet (just saw the diff update with the pagination support).