
Implementation of Gogs Lister
Closed, Migrated, Edits Locked

Description

Gogs hosts git repositories, so it should be easy to "load" them into swh.
Its base URL is https://try.gogs.io/api/v1, and the paginated endpoint for listing projects is https://try.gogs.io/api/v1/repos/search (the default limit is 10 results per page, and you can also limit the number of results yourself). The URL of the next page is found in the "Link" response header, which returns two comma-separated URLs: the first is the next page and the second is the last page. A little Python logic extracts them easily. The API requires a token, which I have already obtained; it needs to be sent as a parameter in the request. Their documentation says they want their API to resemble GitHub's v3 API, so I think the implementation of the Gogs lister will be similar.
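As a rough illustration of the "Link" header handling described above, here is a minimal sketch. The function names `parse_link_header` and `next_page_url` are hypothetical, and the header layout (`<url>; rel="next", <url>; rel="last"`) is assumed to follow the usual GitHub-style convention; the HTTP request itself (with the token passed as a parameter) is left out.

```python
import re
from typing import Dict, Optional

# Matches one entry of the form: <https://example/page=2>; rel="next"
_LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="?([^",;\s]+)"?')


def parse_link_header(header: str) -> Dict[str, str]:
    """Map each rel value (e.g. "next", "last") to its URL."""
    return {rel: url for url, rel in _LINK_RE.findall(header)}


def next_page_url(header: str) -> Optional[str]:
    """Return the URL of the next page, or None when there is none."""
    return parse_link_header(header).get("next")
```

A lister could call `next_page_url(response.headers.get("Link", ""))` after each request and stop once it returns `None`.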

Plan:

  • D8160: Implement full-mode lister
  • D8218: Implement incremental-mode lister
  • T1721#88903: Run within docker-dev (on the developer's machine)
  • T1721#88915: Open upstream forge ticket to reference the gogs api misbehavior T4423
  • T4478: Deploy to staging
  • Call for public review
  • If green light, deploy to production

Event Timeline

faux triaged this task as Low priority. May 16 2019, 2:46 PM
faux created this task.
faux created this object in space S1 Public.

Hey @ardumont, I noticed the following problem:

If we follow the Gogs v1 API, the last page for the try.gogs.io repo search is 28 (with page size = 20)

But if we open the explore section, they go up to 685 pages instead! (with page size = 20)

To clearly understand the problem, I skimmed through Gogs source code and found ExploreRepos and Search functions which are responsible for these results. Both of them seem to use db.SearchRepositoryByName and the differences are in terms of args like OwnerID, UserID, OrderBy and Private.

Tomorrow, I'll properly go through their code and post updates here. Feel free to add your thoughts on this :)

they go up to 685 pages instead

Added a screenshot so that you don't have to create an account to see it:

Maybe it'd be worth opening an issue upstream regarding this behavior.
It might just be a problem on this testing instance (from the name, I'm under the impression it's a testing instance, not a production one).

Maybe it'd be worth opening an issue upstream regarding this behavior.

Done https://github.com/gogs/docs-api/issues/34

It might just be a problem on this testing instance

Actually, I just did some experiments to find out the difference. I created 4 types of repos that are supported by Gogs:

  • private
  • public (Link)
  • unlisted: publicly accessible via direct link but not via search or APIs (Link)
  • mirror (Link)

https://try.gogs.io/api/v1/repos/search?q=KShivendu&page=1 returns only the public repo and the mirror repo, while the "explore" section shows these along with the "unlisted" repos that the user created. This seems like the expected behavior.

So probably we can just move ahead with our current approach and make changes later on if required.

Maybe it'd be worth opening an issue upstream regarding this behavior.

Done https://github.com/gogs/docs-api/issues/34

Great.

It might just be a problem on this testing instance

Actually, I just did some experiments to find out the difference.
[...]
https://try.gogs.io/api/v1/repos/search?q=KShivendu&page=1 only returns the public
repo and the mirror repo. While the "explore" section shows these along with the
"unlisted" repos that the user created. This seems like the expected behavior.

Nice investigation ;)

So, yes, that looks like the expected behavior. I'd say it's worth mentioning this
inside the issue you opened upstream. That might trigger a faster "yes, it's the
expected behavior".

Related to this, isn't there some metadata about the nature of a repository in the
API's JSON response? (I gather that's a no, but bear with me ;)

So probably we can just move ahead with our current approach and make changes later on
if required.

yes, sounds like a good plan.

I forgot to mention it to you explicitly (you may have seen this very task's
description update though). Can you please make sure you run the lister within "docker"?
You will find what you need regarding listers in swh-environment:/docker/conf/lister.yml
(add the gogs lister entry there).

And make sure that the lister is doing what you designed it to do.

Don't hesitate to push a diff with the changes needed to add that lister's support within
docker (tagging this task - Related to T1721 - as well in the diff change(s) you
eventually open).

Thanks in advance.

I just ran the lister using swh lister run -l gogs url=https://try.gogs.io/api/v1/ api_token=xxx

But got 500 Server Error: Internal Server Error for url: https://try.gogs.io/api/v1/repos/search?page=17

After some experiments, I found that the repo at position 162 is the exact culprit.

So, assuming that we want to crawl as many repos as possible, I suggest the following change in the lister:

  • It should crawl the provided Gogs instance with page size 20.
  • But, whenever it encounters 500, it should list the repos on the page one by one. (i.e. temporarily set page size = 1)
  • While page size is 1 and the lister encounters 500, it should just skip the repo.
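The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch(page, page_size)` is a hypothetical callable standing in for the HTTP request, `ServerError` is a hypothetical stand-in for a 500 response, and an empty page is assumed to mean the end of the listing (a real lister would follow the "Link" header instead).

```python
from typing import Callable, Iterator, List, Tuple


class ServerError(Exception):
    """Hypothetical stand-in for an HTTP 500 response."""


Fetch = Callable[[int, int], List[dict]]  # (page, page_size) -> repos


def _one_by_one(fetch: Fetch, page: int, page_size: int) -> Tuple[List[dict], bool]:
    """Re-list a failing page with page size 1, skipping repos that still fail.

    Returns the repos found and whether the end of the listing was reached.
    """
    first = (page - 1) * page_size + 1  # first position covered by the page
    found: List[dict] = []
    for position in range(first, first + page_size):
        try:
            repos = fetch(position, 1)
        except ServerError:
            continue  # this single repo breaks the API: skip it
        if not repos:
            return found, True
        found.extend(repos)
    return found, False


def list_with_fallback(fetch: Fetch, page_size: int = 20) -> Iterator[dict]:
    """List repos page by page, falling back to page size 1 on a 500."""
    page = 1
    while True:
        try:
            repos = fetch(page, page_size)
        except ServerError:
            repos, exhausted = _one_by_one(fetch, page, page_size)
            yield from repos
            if exhausted:
                return
        else:
            if not repos:
                return
            yield from repos
        page += 1
```

The cost of the fallback is at most `page_size` extra requests per failing page, so the common case keeps the larger page size.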

What do you think? :)

I just ran the lister using swh lister run -l gogs url=https://try.gogs.io/api/v1/ api_token=xxx

Great ^.

But got 500 Server Error: Internal Server Error for url: https://try.gogs.io/api/v1/repos/search?page=17
...

For the issue you found, it's worth opening a dedicated forge issue to track the effort.

Also open an upstream issue with the Gogs maintainers to mention that their API is behaving erratically (and reference
that upstream issue in the newly opened swh forge issue).

Depending on their reactivity then, you will either have nothing to do or have to open your suggested lister change as diff ;)

Cheers,

ardumont updated the task description.
ardumont updated the task description.
bchauvet added a parent task: Unknown Object (Maniphest Task). Aug 4 2022, 9:51 AM
bchauvet mentioned this in Unknown Object (Maniphest Task).

worth opening a dedicated forge issue

Done. T4423

open an upstream issue

Done. https://github.com/gogs/gogs/issues/7124

Depending on their reactivity then

Umm. Based on what I can see in the existing issues, it's unlikely that they will find time to reply anytime soon (probably not in less than a few weeks).
Furthermore, I emailed the author and also sent him a LinkedIn message on 1 Aug regarding the original issue (the difference between the two endpoints), but I didn't get a response on either platform :(

So, do you still suggest waiting for a few days? :)

Also, what else can I do if I have to wait on this one? Work on other listers that I've proposed?

ardumont raised the priority of this task from Low to Normal. Aug 4 2022, 3:59 PM
ardumont added a project: Archive coverage.

worth opening a dedicated forge issue

Done. T4423

open an upstream issue

Done. https://github.com/gogs/gogs/issues/7124

*thumbs up*, thanks.

So, do you still suggest waiting for a few days? :)

They may be busy, so we might as well wait a bit, yes.

Also, what else can I do if I have to wait on this one? Work on other listers that I've proposed?

Well, you can try to give this lister an incremental behavior, if that's possible.
It's currently a stateless lister, which we could make stateful by keeping the last paginated link we saw or something.
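A minimal sketch of what "keeping the last paginated link we saw" could look like. Everything here is hypothetical: `ListerState`, `list_incremental`, and the `fetch_page(url)` callable (returning the repos of a page plus the next link, or `None` on the last page) are illustrative names, not the actual swh lister API. It also assumes new repositories only ever appear at the end of the listing.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]  # (repos, next link or None)


@dataclass
class ListerState:
    """State persisted between runs (hypothetical)."""

    last_seen_next_link: Optional[str] = None


def list_incremental(
    fetch_page: Callable[[str], Page], start_url: str, state: ListerState
) -> Iterator[dict]:
    """Resume from the last paginated link seen, or from the start."""
    url = state.last_seen_next_link or start_url
    while url is not None:
        repos, next_url = fetch_page(url)
        yield from repos
        if next_url is not None:
            # Remember how far we got so that the next run can resume here.
            state.last_seen_next_link = next_url
        url = next_url
```

With this design, a later run revisits the last page reached previously, so repos from that page are listed again; that is deliberate, since the page may have filled up in the meantime.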

If it's not possible, then we can strike the box like I suggested in the description.
And then you can move on to some other listers you proposed indeed.

Cheers,

ardumont changed the status of subtask T4478: Deploy Gogs lister to staging from Open to Work in Progress. Sep 13 2022, 11:31 AM