swh/lister/cgit/lister.py
- This file was added.
# Copyright (C) 2019 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

import random
from bs4 import BeautifulSoup
from collections import defaultdict
import requests
import urllib.parse

from .models import CGitModel
from swh.lister.core.simple_lister import SimpleLister
from swh.lister.core.lister_transports import ListerOnePageApiTransport


class CGitLister(ListerOnePageApiTransport, SimpleLister):
ardumont: Don't know if you saw, but for pagination purposes, IIRC, the `PageByPageLister`. The Gitlab lister uses it. Maybe this one could? I did not check the rest yet (just saw the diff update with the pagination support).
nahimilega: I checked `PageByPageLister` (used in gitlab) and `SWHIndexingHttpLister` (used in github). They can be used here, but the current technique I used fits quite well in the code; if we change to `PageByPageLister` then it would require a lot of revamping of code. FWIW, I don't think using `PageByPageLister` would serve any help. I would like to know your opinion on this. Btw, why do we have both `PageByPageLister` and `SWHIndexingHttpLister`? They serve quite similar purposes.
ardumont: I'll take a closer look at your diff. (*urk* at the name, we should drop the SWH prefix as it's redundant within swh modules ;)
    MODEL = CGitModel
    LISTER_NAME = 'cgit'
ardumont: Why not None?
    def __init__(self, base_url, instance=None, override_config=None):
        if not base_url.endswith('/'):
ardumont: We should stop using different names around; the latest other listers named that `api_baseurl`. That way, we can push the following check (about the trailing /) systematically. Maybe…
nahimilega: Now I took help of `urllib.parse`, so there is no need of that check anymore.
nahimilega: I am not sure what you meant here, shall I put a comment in `__init__()` telling that base_url means api_base_url?
ardumont: I meant (I think, because that's an old comment) to name that url `url`. Please explicit the argument in the constructor (`__init__`) docstring. Please keep the instance as second parameter to have something consistent with other listers. That'd give: `def __init__(self, url, instance, url_prefix...`. I did not read the reasoning behind the url_prefix below yet.
            base_url = base_url + '/'
zack: Why is this needed? If it's not to work around some common gotcha with cgit, we should not disturb configuration data. If it's for the sake of uniformity, that uniformity should be enforced on the cgit lister configuration, not here.
nahimilega: This is for if someone enters the base url as http://abc.com. It will convert it to http://abc.com/.
nahimilega: This is also done in the phabricator lister; anlambert recommended me to do that in the phabricator lister, so I used it here as well.
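[editor's note] For reference, a small sketch of why the trailing slash matters when later joining relative paths with `urllib.parse.urljoin` (the URLs are hypothetical, not from the diff):

```python
from urllib.parse import urljoin

# Without the trailing slash, urljoin replaces the last path segment:
print(urljoin('http://abc.com/cgit', 'repo.git'))   # http://abc.com/repo.git
# With the trailing slash, the relative path is appended under the base path:
print(urljoin('http://abc.com/cgit/', 'repo.git'))  # http://abc.com/cgit/repo.git
```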
        self.base_url = base_url
        if not instance:
            instance = urllib.parse.urlparse(base_url).hostname
        self.instance = instance

        ListerOnePageApiTransport.__init__(self)
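[editor's note] A minimal sketch of the instance-name fallback above: `urlparse(...).hostname` extracts the host part of the base url (example url is illustrative):

```python
from urllib.parse import urlparse

# when no instance name is given, derive it from the base url's hostname
instance = urlparse('http://git.savannah.gnu.org/cgit/').hostname
print(instance)  # git.savannah.gnu.org
```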
ardumont: Try to explicit a bit why you need that `url_prefix`. We know now but we will forget. When…
nahimilega: I have also mentioned it in the commit message.
        SimpleLister.__init__(self, override_config=override_config)

    def list_packages(self, response):
        """List the actual cgit instance origins from the response.
ardumont: I did not get that part. Are you sure it works for multiple instances? Can't `urllib.parse.urlparse` help a bit?
ardumont: Also, I saw the previous exchange about this, still not clear to me what this does. Can you extract it into a function? That will have the advantage to somehow document it. (I'm still interested in the explanations though ;)
nahimilega: Now I have extracted that part into a function `find_netloc()` and also written test cases for it.
""" | |||||
repos_details = [] | |||||
soup = BeautifulSoup( | |||||
ardumont: `self.url_prefix = url if url_prefix is None else url_prefix`
            response.text,
            features="html.parser").find('div', {"class": "content"})
        repos = soup.find_all("tr", {"class": ""})
ardumont: dictionaries.
        for repo in repos:
            repo_name = repo.a.text
ardumont: A wee bit more detail of the algo used to parse the output would be welcome.
ardumont: `self.url_netloc = find_netloc(urlparse(self.PAGE))`
nahimilega: I don't think this is a good idea because the urllib object of `self.PAGE` is also used in line 50 (and also in line 47, for which the comment originally was).
ardumont: I don't understand your reply, but not to worry. You can keep the code as is.
            repo_url = self.get_url(repo)
            origin_url = find_origin_url(repo_url)
            try:
ardumont: As this snippet `BeautifulSoup(...)` is used quite a lot, I'd extract it into a function and use that function call instead (including in tests). So can you please refactor to something like: `repo_soup = make_repo_soup(response.text)`
                time = repo.span['title']
zack: this way of indenting looks odd, wouldn't:
```
BeautifulSoup(response.text, features="html.parser") \
    .find('div', {"class": "content"})
```
be more customary? (Note: I haven't checked what PEP8 has to say about either version.)
            except Exception:
ardumont: I forgot to ask before, why do you need to call this like that? Why not `super().__init__...`? (I think I saw why but still, can you explicit?) Also, why isn't it in the first part of the constructor?
                time = None
            if origin_url is not None:
                repos_details.append({
                    'name': repo_name,
                    'time': time,
                    'origin_url': origin_url
ardumont: Drop the `all the` expression you use everywhere. Simplify this to something like:
```
Find repositories metadata by parsing the html page (response's raw
content). If there are links in the html page, retrieve those
repositories metadata from those pages as well. Return the repositories
as list of dictionaries.
```
nahimilega: Sorry, now I realise `all the` is redundant in the docstring.
                })
ardumont:
```
Args:
    response (Response): http api request response

Returns:
    repository origin urls (as dict) included in the response
```
ymmv but that's the gist of it, i think.
vlorentz: items in the dict should be only one extra indent level
ardumont: Ok, talking a bit between us, we are wondering whether this is needed. We could have a more lightweight approach first. That'd have the benefit to: … Going that way (priority…
nahimilega: Before choosing this method of visiting every repo page, I did some research; here is the result: for https://cgit.freedesktop.org/ it has some clone links as https://gitlab.freedesktop.org/xdg/shared-mime-info and some as https://anongit.freedesktop.org/git/pulseaudio/pavucontrol.git.bup. There is no common pattern which could be exploited to reconstruct the clone link, neither is any info related to the clone link available on the base_url page. Hence I decided to take the approach to visit every page.
nahimilega: I just now thought of a clever approach which is lightweight and will do the work in a matter of seconds (unlike ~1 hour). The repositories in a cgit instance are divided into groups with a group heading (in pale white colour, as in https://cgit.freedesktop.org/). For all the members of a group, the clone link is in the format `<some_url_common_to_all_members_of_a_group>/<name>.git`. To find the `<some_url_common_to_all_members_of_a_group>` we need to visit the page of only one repository per group. This would drastically decrease the no. of requests and make the code faster (although it would increase the quantity of code). I will test this method and inform you about the result. Does this method sound good?
ardumont:
> For https://cgit.freedesktop.org/ it has some clone links as https://gitlab.freedesktop.org/…

…which then will be out of scope for that lister. This repo does not list anything, so that's it. That's the loader's concern, not the lister's: some listers currently list private repositories, for example. It's not an issue. Well, as a first approximation, I'd say let's go towards removing the visit step anyway (and compute a basic git clone url; `self.get_url` sounds good enough IIRC). It's good to start incrementally and not overthink everything. Mostly, aside from my remarks on the diff (that I'd like you to take into account nonetheless), the lister is almost ready. Which is a reasonable choice. I'd say yes.
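[editor's note] The group-based idea discussed above could be sketched roughly like this. The helper name `build_clone_urls`, the input shape, and the URLs are all hypothetical; it only assumes the common prefix has already been discovered by visiting one repo page per group:

```python
def build_clone_urls(groups):
    """Derive clone urls from one known prefix per group, instead of
    visiting every repository page.

    groups maps a group name to (common_prefix, [repo_names]).
    """
    clone_urls = {}
    for prefix, repo_names in groups.values():
        for name in repo_names:
            # group members share a common prefix: <prefix>/<name>.git
            clone_urls[name] = '%s/%s.git' % (prefix.rstrip('/'), name)
    return clone_urls

urls = build_clone_urls({
    'pulseaudio': ('https://anongit.freedesktop.org/git/pulseaudio/',
                   ['pavucontrol', 'paprefs']),
})
print(urls['pavucontrol'])
# https://anongit.freedesktop.org/git/pulseaudio/pavucontrol.git
```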
        random.shuffle(repos_details)
        return repos_details

    def get_url(self, repo):
        """
        Finds the url of a repo page by parsing over the html of the row of
zack: add a trailing comma after this line, it will make diffs that in the future might add an additional line nicer/shorter
ardumont: Do we have roughly an idea of the number of pages we can encounter in different cgit instances?
nahimilega: I think it's 275, in http://hdiff.luite.com/cgit/
ardumont: Sorry, I was unclear, I meant the average number of pages. The instance you mentioned is so far…
nahimilega: Most of the cgit instances don't have a pagination. I think the average would be around 2-3…
ardumont: Following the logic presented in comments below, here you want to exclude the first page, so: `repos.extend(self.get_repos(pages[1:]))` Note: …
        that repo present in the base url.
ardumont: Indentation is still off to me.
        The url of the repo is present under the href attribute of the name
        of the url. This function removes any suffix from the base url and
        attaches the suffix that is found in the href tag of the name.

        For example, for base_url = https://git.kernel.org/pub/scm/ it will
ardumont: indentation is off.
        convert it into https://git.kernel.org and then attach the suffix
        for a particular repo.
nahimilega: I still don't think indentation is correct here, can you please show me what would be the correct indentation here?
ardumont: In doubt, look at how generally the source code is written first in this very repository (and in others if not enough). Just align with the beginning of the line (at current indentation level):
```
repos_details.append({
    'name': repo_name,
    'time': time,
    'origin_url': origin_url,
})
```
ardumont: Why the need to shuffle?
nahimilega: I first saw this shuffle in the pypi lister, so to maintain uniformity I also used it in the gnu and cran listers and here too. Although I don't really know its importance.
ardumont: Ok. I don't remember the reasons though, too bad.
        Args:
            repo - a beautifulsoup object of the html code of the repo row
                   present in base url
ardumont: One last modification, extract this in a method, to avoid the `list` wrapping:
```
def _yield_repo_from_responses(self, response):
    """Yield repositories from request responses...

    Args:
        ...

    Yields:
        ...
    """
    html = response.text
    yield from get_repo_list(html)
    pages = self.get_pages(make_soup(html))
    if len(pages) > 1:
        yield from self.get_repos_from_pages(pages[1:])

def list_packages(self, response):
    """... <existing docstring>
    """
    for repo in self._list_repo_from_response(response):
        ...
```
        Returns:
            url of a repo
        """
        suffix = repo.a['href']
        position = self.base_url.index('/') + 2
        second = self.base_url[position:].index('/')
zack: this way of typesetting args/return is not consistent with our python coding guidelines, see, e.g., https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
        next_url = self.base_url[:second + position]
nahimilega: Here I don't think this is the best way to do it. It looks too messy, but I can't use `s.find_all('/')` because `find_all` is a method present in bs4. Can you please recommend some smart approach to do it?
vlorentz: `(part1, part2, next_url) = self.base_url.split('/', 2)`
nahimilega: Thanks, @vlorentz, this is a really neat way of doing it.
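[editor's note] For what it's worth, `urllib.parse.urlparse` can also recover the scheme and netloc directly, which sidesteps the manual index arithmetic entirely (the base url below is a hypothetical example):

```python
from urllib.parse import urlparse

base_url = 'https://git.kernel.org/pub/scm/'  # hypothetical example
parsed = urlparse(base_url)
# rebuild just "<scheme>://<netloc>", dropping any path suffix
next_url = '%s://%s' % (parsed.scheme, parsed.netloc)
print(next_url)  # https://git.kernel.org
```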
        return next_url + suffix

    def get_model_from_repo(self, repo):
        """Transform from repository representation to model
        """
        return {
ardumont: `if not pages:`
            'uid': self.base_url + repo['name'],
            'name': repo['name'],
            'full_name': repo['name'],
            'html_url': repo['origin_url'],
ardumont: What's the format of a['href']? Is that a href without the absolute url? Is that true for all…
nahimilega: a['href'] is of string format. It is without the absolute url; it is relative to the url_netloc I computed in my code. For example, for http://git.savannah.gnu.org/cgit/, all the URLs are with respect to http://git.savannah.gnu.org. It is true for all the instances I saw so far (all mentioned in T1835).
ardumont: great, thanks.
            'origin_url': repo['origin_url'],
ardumont: Also, why not use a list comprehension here?
```
return [self.get_url(page) for page in pages]
```
            'origin_type': 'git',
            'time_updated': repo['time'],
        }
ardumont: `pages ([str]): ..`
    def transport_response_simplified(self, response):
        """Transform response to list for model manipulation
ardumont: `pages (except` (missing space)
ardumont:
```
Returns:
    List of beautifulsoup of all the repositories (url) row present in
    all the pages (except the first one).
```
Drop the `of the html code of`; we mention it's a beautifulsoup object, which as far as I got it is an xml/html parser.
""" | |||||
return [self.get_model_from_repo(repo) for repo in response] | |||||
ardumont: You are not retrieving pages here, you are fetching repositories from pages (all the pages you parsed in the first html page). So either: …
def find_origin_url(repo_url):
ardumont: instance
""" | |||||
ardumont: Request the `available repos` from the pages. This yields the available repositories found as beautifulsoup object representation.
    Finds the origin url for a particular repo by parsing over the page of
    that repo

    Args:
        repo_url - url of the repo

    Returns:
ardumont: Yields (if you follow the snippet proposed below).
        origin url for the repo
zack: as before: this doesn't conform with Napoleon style
ardumont: this looks familiar... Please refactor this within a method or a function.
    example:
ardumont: Heads up, code review stopped here!
    >> find_origin_url('http://git.savannah.gnu.org/cgit/fbvbconv-py.git/')
    >> 'https://git.savannah.gnu.org/git/fbvbconv-py.git'
    """
vlorentz: doctests should be written like this:
```
>>> find_origin_url('http://git.savannah.gnu.org/cgit/fbvbconv-py.git/')
'https://git.savannah.gnu.org/git/fbvbconv-py.git'
```
with no indent
    response = requests.get(repo_url)
    soup = BeautifulSoup(response.text, features="html.parser")
ardumont: You can use a generator here:
```
for page in pages:
    response = requests.get(page)
    if not response.ok:
        # deal with error as warning without impeding the listing to finish
        logger.warn('Failed to retrieve repositories from page %s', page)
        continue
    yield from get_repo_list(response.text)
```
Also, remove the trick about removing the first page.
    origin_urls = find_all_origin_url(soup)
    return priority_origin_url(origin_urls)
def find_all_origin_url(soup):
    """
    Finds all the origin urls for a particular repo by parsing over the html
    of the repo page

    Args:
        soup - a beautifulsoup object of the html code of the repo

    Returns:
        dictionary of all possible origin urls with their protocol as key

    example
zack: same: Napoleon mismatch
    if soup is beautifulsoup object of the html code at
    http://git.savannah.gnu.org/cgit/fbvbconv-py.git/

    >> find_all_origin_url(soup)
    >> {'https': 'https://git.savannah.gnu.org/git/fbvbconv-py.git',
        'ssh': 'ssh://git.savannah.gnu.org/srv/git/fbvbconv-py.git',
        'git': 'git://git.savannah.gnu.org/fbvbconv-py.git'}
    """
ardumont: Remove extra line. Start directly from the last docstring comment. We do always have the extra…
    origin_urls = defaultdict(dict)
vlorentz: same
zack: I think this is indented too much (at least the first line, not sure about the rest w.r.t. what doctest expects)
    found_clone_word = False
    for i in soup.find_all('tr'):
ardumont: comment here is not necessarily used, I added it in the diff comment to explicit it to you ;)
        if found_clone_word:
            link = i.text
ardumont: How come `response` is already in the right format for this lister?
            protocol = link[:link.find(':')]
            origin_urls[protocol] = link
        if i.text == 'Clone':
            found_clone_word = True
    return origin_urls
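[editor's note] As a self-contained illustration of the "Clone section" scan above, here is the same state-machine logic run over plain row texts instead of a live soup (the URLs are hypothetical):

```python
# row texts as they would come out of soup.find_all('tr') on a repo page:
# a 'Clone' heading row followed by one row per clone link
rows = [
    'Clone',
    'https://git.example.org/git/repo.git',
    'git://git.example.org/repo.git',
]

origin_urls = {}
found_clone_word = False
for text in rows:
    if found_clone_word:
        # key each clone link by its protocol prefix (text before ':')
        origin_urls[text[:text.find(':')]] = text
    if text == 'Clone':
        found_clone_word = True

print(origin_urls)
# {'https': 'https://git.example.org/git/repo.git',
#  'git': 'git://git.example.org/repo.git'}
```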
def priority_origin_url(origin_url):
    """
    Finds the highest priority link from all the origin urls for a particular
    repo

    Priority order is https > http > git > ssh

    Args:
        origin_urls - a dictionary of origin links with their protocol as key

    Returns:
        url with the highest priority
    """
    for protocol in ['https', 'http', 'git', 'ssh']:
zack: Napoleon
        if protocol in origin_url:
            return origin_url[protocol]
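[editor's note] A quick, standalone check of the priority logic above (same function body, hypothetical urls):

```python
def priority_origin_url(origin_url):
    # priority order: https > http > git > ssh
    for protocol in ['https', 'http', 'git', 'ssh']:
        if protocol in origin_url:
            return origin_url[protocol]

urls = {
    'ssh': 'ssh://git.example.org/repo.git',
    'git': 'git://git.example.org/repo.git',
    'https': 'https://git.example.org/repo.git',
}
print(priority_origin_url(urls))  # https://git.example.org/repo.git
```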
ardumont: `origin_urls (Dict): All possible origin urls for a repository (key 'protocol', value the associated url)`
ardumont: `Url (str) with the highest...`
Done Inline Actions"""Find all origin urls per repository by... ardumont: """Find all origin urls per repository by... | |||||
ardumont: `a beautiful soup object repo representation`
ardumont: All possible origin urls for a repository (dict with key 'protocol', value the associated url).
ardumont: What's repo_url's type? Please, mention it in the docstring:
```
Args:
    repo_url (<TYPE>): ...
```
Please, do so in the current diff everywhere it's missing. In case the type is from a third-party module, it's fine to use that as the type (for example `response (Response): ...`).
ardumont: I've already mentioned I'd prefer this named repo_soup or plain repo (as long as we mention in the docstring what type is expected).
nahimilega: These are just one-line functions and not much logic is present in them; they are just simple HTML parsing. Do we need tests for these functions too?
ardumont: Yes, to avoid decreasing too much the coverage. If there is not much logic, they are simple to…
ardumont: Find repositories (as beautifulsoup object) available within the server response.
ardumont: Drop this, the first line proposed is enough (I think).
ardumont: List all the repositories as beautifulsoup object within the response.
ardumont: Instantiates a beautiful soup object from the response object.