Git loader

Recent Activity

Thu, Jun 16

anlambert added a revision to T4311: Package and deploy dulwich 0.20.43 in production: D7996: loader: Bump dulwich and remove no longer valid comments.
Thu, Jun 16, 1:47 PM · System administration, Git loader
olasd closed T4311: Package and deploy dulwich 0.20.43 in production as Resolved.

All production loaders have been restarted now.

Thu, Jun 16, 11:36 AM · System administration, Git loader

Fri, Jun 10

ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Let's do this the other way around: closing this as I'm done.
Please reopen if you need something else.

Fri, Jun 10, 9:05 AM · System administration, Git loader
ardumont closed T4283: Load https://github.com/chromium/chromium with a higher packfile size limit as Resolved.
Fri, Jun 10, 9:05 AM · System administration, Git loader

Thu, Jun 9

ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

@vlorentz I don't have anything left to do, can I close it now?

Thu, Jun 9, 6:10 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

And the 2nd fork ingestion is done as well:

swhworker@worker17:~$ url=https://github.com/Tomahawkd/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork2.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'
Enumerating objects: 12661350, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 12661350 (delta 140), reused 135 (delta 135), pack-reused 12661159
INFO:swh.loader.git.loader:Listed 15230 refs for repo https://github.com/Tomahawkd/chromium
INFO:swh.loader.git.loader.GitLoader:Fetched 12661351 objects; 2 are new
self.statsd.constant_tags: {'visit_type': 'git', 'incremental_enabled': True, 'has_parent_origins': True, 'has_parent_snapshot': True, 'has_previous_snapshot': False}
self.parent_origins: [Origin(url='https://github.com/chromium/chromium', id=b'\xa9\xf66\xa1/\\\xc3\\\xa4\x18+\r\xe7L\x91\x94\xe9\x00\x96J')]
{'status': 'eventful'} for origin 'https://github.com/Tomahawkd/chromium'
        Command being timed: "swh loader run git https://github.com/Tomahawkd/chromium lister_name=github lister_instance_name=github pack_size_bytes=34359738368"
        User time (seconds): 62323.33
        System time (seconds): 3001.76
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 25:03:29
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 29352136
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 8
        Minor (reclaiming a frame) page faults: 10355329
        Voluntary context switches: 265156
        Involuntary context switches: 265330
        Swaps: 0
        File system inputs: 2048
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
Thu, Jun 9, 6:01 PM · System administration, Git loader
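
As a sanity check on the figures in the run above, the following sketch converts the pack size limit and the /usr/bin/time numbers into more readable units. The values are copied from the output; the variable names are illustrative only, and "kbytes" is assumed to mean units of 1024 bytes.

```python
# Values copied from the `/usr/bin/time -v` output above.
pack_size_bytes = 34359738368
user_s = 62323.33
system_s = 3001.76
elapsed_s = 25 * 3600 + 3 * 60 + 29  # 25:03:29 wall clock
max_rss_kib = 29352136               # assumption: "kbytes" means KiB

# Pack size limit expressed in GiB.
pack_size_gib = pack_size_bytes / 2**30

# CPU share is (user + system) time divided by wall-clock time.
cpu_percent = 100 * (user_s + system_s) / elapsed_s

# Peak resident memory expressed in GiB.
max_rss_gib = max_rss_kib / 2**20

print(f"pack size limit: {pack_size_gib:.0f} GiB")
print(f"CPU usage: {cpu_percent:.1f}%")
print(f"peak memory: {max_rss_gib:.2f} GiB")
```

The computed CPU share matches the 72% reported by the tool, and the peak memory of the fork ingestion comes out just under 28 GiB.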
olasd added a comment to T4311: Package and deploy dulwich 0.20.43 in production.

I've restarted the staging workers (loader_git and loader_high_priority) with the new dulwich version

Thu, Jun 9, 5:36 PM · System administration, Git loader
vlorentz added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

heh, ok, so it's indeed because github sends us way too much

Thu, Jun 9, 2:41 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Status: the second fork ingestion is done (ahead of the other one, which is still ongoing) [1]

Thu, Jun 9, 2:28 PM · System administration, Git loader

Wed, Jun 8

ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

That's still a pretty big packfile, ~12.6G [1]... I'm pondering whether I should stop it,
install the new python3-dulwich that olasd packaged, and trigger it again...

Wed, Jun 8, 4:32 PM · System administration, Git loader
ardumont added a comment to T4311: Package and deploy dulwich 0.20.43 in production.

fwiw, jenkins is python3-dulwich aware.

I don't see the point of that for packages that can be backported with no changes, which is what I had done before, so I admit I hadn't even looked.

Wed, Jun 8, 4:25 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

So the first fork ingestion finished and took less time.

Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.

In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?

jsyk, I've edited the file accordingly and triggered another fork ingestion:

swhworker@worker17:~$ url=https://github.com/Tomahawkd/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork2.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'
Wed, Jun 8, 4:02 PM · System administration, Git loader
olasd added a comment to T4311: Package and deploy dulwich 0.20.43 in production.

fwiw, jenkins is python3-dulwich aware.

Wed, Jun 8, 3:59 PM · System administration, Git loader
ardumont added a comment to T4311: Package and deploy dulwich 0.20.43 in production.

fwiw, jenkins is python3-dulwich aware.

Wed, Jun 8, 3:52 PM · System administration, Git loader
olasd added a comment to T4311: Package and deploy dulwich 0.20.43 in production.

I checked that the swh.loader.git tests are green with the new dulwich version.

Wed, Jun 8, 3:51 PM · System administration, Git loader
olasd changed the status of T4311: Package and deploy dulwich 0.20.43 in production from Open to Work in Progress.
Wed, Jun 8, 3:43 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

@vlorentz I also encountered [1] this morning which might explain the large packfile...

Wed, Jun 8, 3:30 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

So the first fork ingestion finished and took less time.

Wed, Jun 8, 3:20 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Note that the first repo run took 134:40:21 (after multiple iterations, so possibly more than that), so even if the fork ingestion takes ~10h, that would already be much quicker ¯\_(ツ)_/¯ (it has been running for ~52min now)

Wed, Jun 8, 3:10 PM · System administration, Git loader

Tue, Jun 7

ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Which one has that many more commits, the initial one?

Yes

If so, I would expect the fork to be loaded way faster, since they should have a shared history at some point in the past.

I would have expected it not to run out of memory (which was the point of the manual load), and it already failed that test

Tue, Jun 7, 4:28 PM · System administration, Git loader
vlorentz added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Which one has that many more commits, the initial one?

Tue, Jun 7, 4:24 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

initial load of a different repository, which has 338k more commits

Tue, Jun 7, 4:16 PM · System administration, Git loader
vlorentz added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

initial load of a different repository, which has 338k more commits

Tue, Jun 7, 3:36 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.

In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?

Tue, Jun 7, 3:34 PM · System administration, Git loader
vlorentz added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.

Tue, Jun 7, 3:30 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Loader crashed with memory issues. Probably too many loads running in parallel.
Currently stopping the worker's other processes to let this one finish (I'll restart it).

Tue, Jun 7, 3:12 PM · System administration, Git loader
anlambert triaged T4311: Package and deploy dulwich 0.20.43 in production as Normal priority.
Tue, Jun 7, 2:45 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Triggered a run to ingest a fork (extra arguments needed with the CLI) on a production worker:

Tue, Jun 7, 10:59 AM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Success for production worker [1]. Staging worker is still working on it.

Tue, Jun 7, 9:25 AM · System administration, Git loader

Thu, Jun 2

ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Heads up: ingestion is still ongoing, with fairly stable memory consumption.

Thu, Jun 2, 1:40 PM · System administration, Git loader

Wed, Jun 1

ardumont moved T4283: Load https://github.com/chromium/chromium with a higher packfile size limit from code-review/await-feedback to deployed/landed/monitoring on the System administration board.
Wed, Jun 1, 5:37 PM · System administration, Git loader
ardumont moved T4283: Load https://github.com/chromium/chromium with a higher packfile size limit from in-progress to code-review/await-feedback on the System administration board.
Wed, Jun 1, 5:37 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

Status update: both worker1.staging and worker17 are past the pack file limit step where they usually crash \o/:

Wed, Jun 1, 5:16 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

I've started a 32g experiment in worker1.staging and 64g in worker17.

Wed, Jun 1, 4:24 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

8g (pack size limit) was not enough either; it broke on both workers ¯\_(ツ)_/¯.
We have no clue what the size limit should be, so I'm clearly taking shots in the dark.
I've started a 32g experiment in worker1.staging and 64g in worker17.
We will see.

Wed, Jun 1, 4:04 PM · System administration, Git loader
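
For reference, the size labels used in these experiments can be turned into values for the pack_size_bytes argument with a small helper like the one below. This is only a sketch: pack_size_limit is a hypothetical name, and reading "g" as GiB is an assumption, although it is consistent with the 32g run in this feed being launched with pack_size_bytes=34359738368.

```python
def pack_size_limit(label: str) -> int:
    """Convert a label such as '32g' to a byte count, reading 'g' as GiB.

    Hypothetical helper for illustration; not part of swh.loader.git.
    """
    if not label.endswith("g"):
        raise ValueError(f"unsupported size label: {label!r}")
    return int(label[:-1]) * 2**30


# The three limits tried in the comments above.
for label in ("8g", "32g", "64g"):
    print(f"{label}: pack_size_bytes={pack_size_limit(label)}")
```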
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

worker17 is complaining as well but differently somehow.
same version for both though [2].

Wed, Jun 1, 2:58 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

OK, as expected, it does not work as is [1] ;)
Second run then with twice the actual pack file limit [2].

Wed, Jun 1, 2:54 PM · System administration, Git loader
ardumont changed the status of T4283: Load https://github.com/chromium/chromium with a higher packfile size limit from Open to Work in Progress.
Wed, Jun 1, 2:35 PM · System administration, Git loader
ardumont added a comment to T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.

I've triggered a run on worker1.staging [1] and worker17 as is for now.
We'll see for the pack file size limit after that run fails (if it does).

Wed, Jun 1, 2:35 PM · System administration, Git loader

May 30 2022

vlorentz triaged T4283: Load https://github.com/chromium/chromium with a higher packfile size limit as Low priority.
May 30 2022, 3:41 PM · System administration, Git loader
vlorentz added a parent task for T3273: Use "fork" relationships to speed-up initial load of large repositories: T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.
May 30 2022, 3:41 PM · Origin-GitHub, Origin-GitLab, Git loader, Extrinsic metadata, Core Loader
vlorentz added a subtask for T4283: Load https://github.com/chromium/chromium with a higher packfile size limit: T3273: Use "fork" relationships to speed-up initial load of large repositories.
May 30 2022, 3:41 PM · System administration, Git loader
vlorentz created T4283: Load https://github.com/chromium/chromium with a higher packfile size limit.
May 30 2022, 3:40 PM · System administration, Git loader

May 20 2022

vlorentz added revisions to T4219: Investigate why GitHub fork detection did not bring a speed-up: D7873: Add an unweighted average for filtered_objects + fix existing metric name, D7876: Log summary of filtered objects in store_data.
May 20 2022, 3:54 PM · Origin-GitHub, Git loader
vlorentz added a revision to T4219: Investigate why GitHub fork detection did not bring a speed-up: D7871: Add metrics in store_data on ratios of objects already stored.
May 20 2022, 1:48 PM · Origin-GitHub, Git loader
vlorentz added a comment to T4219: Investigate why GitHub fork detection did not bring a speed-up.

I did some profiling early this week, and found that when incrementally loading a linux fork we already visited:

May 20 2022, 10:55 AM · Origin-GitHub, Git loader

May 16 2022

vlorentz added a comment to T4219: Investigate why GitHub fork detection did not bring a speed-up.

This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".

May 16 2022, 3:33 PM · Origin-GitHub, Git loader
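
The idea in the comment above can be sketched roughly as follows. This is not the actual swh.loader.git code: snapshots are modeled as plain dicts mapping branch names to object ids, and the function names are hypothetical. The point is only that the set of already-known objects is the union of both snapshots, so anything reachable from either one does not need to be fetched again.

```python
from typing import Dict, Set, Tuple

Snapshot = Dict[bytes, bytes]  # branch name -> target object id (simplified)


def known_targets(*snapshots: Snapshot) -> Set[bytes]:
    """Union of the object ids targeted by the branches of all snapshots."""
    known: Set[bytes] = set()
    for snapshot in snapshots:
        known.update(snapshot.values())
    return known


def plan_fetch(
    remote_refs: Snapshot, origin_snapshot: Snapshot, parent_snapshot: Snapshot
) -> Tuple[Set[bytes], Set[bytes]]:
    """Return (wants, haves) for an incremental fetch.

    haves: ids known from the origin's last snapshot AND its parent's.
    wants: ids advertised by the remote that are not already known.
    """
    haves = known_targets(origin_snapshot, parent_snapshot)
    wants = {target for target in remote_refs.values() if target not in haves}
    return wants, haves


# Toy example: the fork has one new commit (o2) on top of known history.
parent_snapshot = {b"refs/heads/main": b"p1"}
origin_snapshot = {b"refs/heads/main": b"o1"}
remote_refs = {b"refs/heads/main": b"o2", b"refs/heads/upstream": b"p1"}

wants, haves = plan_fetch(remote_refs, origin_snapshot, parent_snapshot)
print(wants)
print(sorted(haves))
```

In this toy case only the new head is requested, while the branch that still points at the parent's history is skipped.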

May 13 2022

ardumont added a subtask for T4219: Investigate why GitHub fork detection did not bring a speed-up: T4242: Deployed loader.git v1.8.
May 13 2022, 6:01 PM · Origin-GitHub, Git loader
ardumont added a parent task for T4242: Deployed loader.git v1.8: T4219: Investigate why GitHub fork detection did not bring a speed-up.
May 13 2022, 6:01 PM · System administration, Git loader
ardumont closed T4243: Deploy loader.metadata credentials for high and oneshot loaders as Resolved.
May 13 2022, 5:59 PM · System administration, Metadata Loaders, Git loader