
Deploy gitlab instance lister to infra
Closed, Resolved (Public)

Description

Deploy gitlab lister through our scheduling utilities:

  • package and deploy new python3-swh.lister
  • create db for gitlab lister
  • default configuration (swh-site)
  • private configuration (swh-private-data with token, db password)
  • swh_lister_gitlab.pp (swh-site, derived from swh_lister_github.pp)

Event Timeline

ardumont created this task.

great, thanks for working on this!

a couple of questions:

  1. i guess this is about deploying the gitlab.com lister, right? or do you plan to also deploy listers for other gitlab instances? (of course it's great that we can support any instances we want, just wanted to clarify what's the specific purpose of this task)
  2. this is a more far-reaching question, but I wonder about the long-term sustainability of having a different DB for each lister. Is that scalable? I wonder if it wouldn't be better to have a single DB for all listers. I don't want to hold up the deployment of the gitlab lister just for this, so go ahead as you feel better. But maybe take the chance to discuss this with other sysadms. If appropriate, we can consolidate things later.

Hello,

i guess this is about deploying the gitlab.com lister, right?

Bluntly, it's a subtask of the ingest gitlab.com lister ;) So gitlab.com it is, yes. [1]

My previous comment [2] about other instances might have misled you (as i kind of got sidetracked when i wrote that ;).

or do you plan to also deploy listers for other gitlab instances? (of course it's great that we can support any instances we want, just wanted to clarify what's the specific purpose of this task)

I could open another task for the other instances indeed [2].

this is a more far-reaching question, but I wonder about the long-term sustainability of having a different DB for each lister.

Oh yeah, i see your point.
I just went along with the current way of doing things as i was pretty new to the lister stack.

Just to clarify, for the gitlab instance, it is indeed one db for all gitlab instances (that was not clear to me at first). [3]

Is that scalable? I wonder if it wouldn't be better to have a single DB for all listers.

That's a good point.

For now, I guess i can only say it depends on the nature of the lister though.
The Github, Bitbucket and Gitlab ones have a simple db schema (1 table, <something>_repo), so they could be aggregated together.

The debian one's model is not that simple though (multiple tables). And as far as you explained it to me, there is a chance the pypi one will be as well.
But i don't know yet if that will share anything with the debian lister.

So yeah, discussion would be good.

I don't want to hold up the deployment of the gitlab lister just for this, so go ahead as you feel better.

Yes, thanks, finishing what i start makes me feel better :)

But maybe take the chance to discuss this with other sysadms.

Right.

If appropriate, we can consolidate things later.

Indeed.

[1] Also note that the gitlab lister is about listing 'gitlab' instances. Gitlab.com is just the biggest known (so far).
My point being, I will have to deploy the gitlab lister once. Then i'll have to configure the scheduler to create the recurring gitlab.com listing tasks (2, i think).
That's the only instance (here gitlab.com) specific part.
That's what was implied in the not so detailed message 'some data inserted in db' in the description.
I'll try to clarify the task's intent.

[2] https://forge.softwareheritage.org/T1111#20926

[3] All listed origins from any instance will be there. Each origin has an instance column (gitlab(.com), inria, debian, freedesktop, gnome, etc...) in its model to tell the instances apart.
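
To illustrate [3], here is a minimal sketch of the "one table, one instance column" idea. It uses an in-memory SQLite database rather than the real PostgreSQL setup, and the table name `gitlab_repo`, the column names and the example URLs are all assumptions made for illustration, not the actual swh.lister schema.

```python
import sqlite3


def make_db():
    """Create an in-memory db with one hypothetical table for all gitlab instances."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE gitlab_repo (
            instance   TEXT NOT NULL,  -- e.g. 'gitlab', 'debian', 'gnome'
            full_name  TEXT NOT NULL,  -- e.g. 'group/project'
            origin_url TEXT NOT NULL,
            PRIMARY KEY (instance, full_name)
        )
        """
    )
    return conn


def add_origin(conn, instance, full_name, origin_url):
    """Record one listed origin for the given instance."""
    conn.execute(
        "INSERT INTO gitlab_repo (instance, full_name, origin_url) VALUES (?, ?, ?)",
        (instance, full_name, origin_url),
    )


def origins_for(conn, instance):
    """Return the origin urls of one instance, ordered by project name."""
    rows = conn.execute(
        "SELECT origin_url FROM gitlab_repo WHERE instance = ? ORDER BY full_name",
        (instance,),
    )
    return [row[0] for row in rows]


conn = make_db()
# Placeholder hosts: the same project name can exist on several instances.
add_origin(conn, "gitlab", "b/proj", "https://gitlab.example.org/b/proj")
add_origin(conn, "gitlab", "a/proj", "https://gitlab.example.org/a/proj")
add_origin(conn, "debian", "a/proj", "https://debian.example.org/a/proj")
```

With this layout, filtering on the instance column (for example `origins_for(conn, "debian")`) is enough to separate origins per instance, so a new instance does not need a new table or a new db.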

Cheers,

ardumont renamed this task from Deploy gitlab lister to infra to Deploy gitlab instance lister to infra + start listing gitlab.com. Jul 18 2018, 10:24 AM
ardumont updated the task description.

I may even split this into 2 tasks (deploy gitlab lister, start listing gitlab.com)...
I'm a big fan of separation of concerns ;)

In T1137#21093, @zack wrote:
  1. this is a more far-reaching question, but I wonder about the long-term sustainability of having a different DB for each lister. Is that scalable? I wonder if it wouldn't be better to have a single DB for all listers. I don't want to hold up the deployment of the gitlab lister just for this, so go ahead as you feel better. But maybe take the chance to discuss this with other sysadms. If appropriate, we can consolidate things later.

I think it's more convenient to keep using one database per lister: it lets us have listers that use separate tables (e.g. Debian) without them stomping on one another; it also makes it easier to start over from scratch by just erasing the database.

I don't foresee any scalability issues with this: the only net cost of a new database is a few files in the postgresql cluster, and we can reconsider when we have a hundred of them ;) From a deployment standpoint, it might be sensible to have the listers share credentials for access to the database, which would avoid having an ever-increasing list of passwords in our manifests... but again I don't foresee that being an issue for a while.

ardumont renamed this task from Deploy gitlab instance lister to infra + start listing gitlab.com to Deploy gitlab instance lister to infra. Jul 18 2018, 4:30 PM
ardumont closed this task as Resolved.
ardumont updated the task description.