Load https://github.com/chromium/chromium with a higher packfile size limit
Closed, Migrated

Description

Most occurrences of https://sentry.softwareheritage.org/share/issue/bbcb3aef5b974dac9a3194f7bf8ede87/ happen on chromium forks.

Now that we have fork detection, a single successful load of https://github.com/chromium/chromium might unstick all its forks. Unfortunately, we cannot load that repository either, not even from its previous full snapshot.

I think we should try loading https://github.com/chromium/chromium manually as a one-time thing to get it going again (future loads of this repository should succeed too, assuming we visit it often enough).

Event Timeline

I've triggered a run on worker1.staging [1] and worker17 [2] as is for now.
We'll look at the pack file size limit after that run fails (if it does).

swhworker@worker1:~$ url=https://github.com/chromium/chromium; /usr/bin/time -v swh loader run git $url | tee chromium-20220601-01.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/chromium/chromium' with type 'git'

[2]

swhworker@worker17:~$ url=https://github.com/chromium/chromium; /usr/bin/time -v swh loader run git $url | tee chromium-20220601-01.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/chromium/chromium' with type 'git'
ardumont changed the task status from Open to Work in Progress.Jun 1 2022, 2:35 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

Ok, as expected, it does not work as is [1] ;)
Second run, then, with twice the current pack file limit [2].

[2]

swhworker@worker1:~$ url=https://github.com/chromium/chromium; /usr/bin/time -v swh loader run git $url pack_size_bytes=8589934592 | tee chromium-20220601-02.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/chromium/chromium' with type 'git'
...

[1]

swhworker@worker1:~$ url=https://github.com/chromium/chromium; /usr/bin/time -v swh loader run git $url | tee chromium-20220601-01.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/chromium/chromium' with type 'git'
Enumerating objects: 18243310, done.
Counting objects: 100% (5895/5895), done.
Compressing objects: 100% (3180/3180), done.
ERROR:swh.loader.git.loader.GitLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 374, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 318, in fetch_data
    self.origin.url, base_repo, do_progress
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 240, in fetch_pack_from_origin
    progress=do_activity,
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 2087, in fetch_pack
    progress,
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 915, in _handle_upload_pack_tail
    SIDE_BAND_CHANNEL_PROGRESS: progress,
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 674, in _read_side_band64k_data
    cb(pkt)
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 228, in do_pack
    f"Pack file too big for repository {origin_url}, "
OSError: Pack file too big for repository https://github.com/chromium/chromium, limit is 4294967296 bytes, current size is 4294959115, would write 8192
{'status': 'failed'} for origin 'https://github.com/chromium/chromium'
        Command being timed: "swh loader run git https://github.com/chromium/chromium"
        User time (seconds): 563.53
        System time (seconds): 343.31
        Percent of CPU this job got: 43%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 34:25.20
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 22185788
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1707154
        Minor (reclaiming a frame) page faults: 22250583
        Voluntary context switches: 1920190
        Involuntary context switches: 94371
        Swaps: 0
        File system inputs: 68880160
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
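
The failure above comes from a size guard in the loader's pack callback: each chunk received from the server is appended to a buffer, and the load aborts as soon as the next write would push the buffer past the limit. A minimal sketch of that kind of guard, assuming an in-memory buffer and a hypothetical `make_do_pack` helper (the real `do_pack` in swh/loader/git/loader.py may well differ; only the error message format is taken from the traceback):

```python
import io


def make_do_pack(origin_url: str, pack_buffer: io.BytesIO, pack_size_bytes: int):
    """Return a callback that appends pack data while enforcing a size limit.

    Hypothetical helper for illustration; not the loader's actual code.
    """

    def do_pack(data: bytes) -> None:
        cur_size = pack_buffer.tell()
        would_write = len(data)
        # Refuse the write if it would push the buffer past the limit.
        if cur_size + would_write > pack_size_bytes:
            raise IOError(
                f"Pack file too big for repository {origin_url}, "
                f"limit is {pack_size_bytes} bytes, current size is {cur_size}, "
                f"would write {would_write}"
            )
        pack_buffer.write(data)

    return do_pack
```

With the figures from the log, 4294959115 + 8192 = 4294967307, which is just over the 4294967296 limit, hence the error.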

worker17 fails as well, but differently: a MemoryError while reading the HTTP response [1].
Both workers run the same loader version, though [2].

Anyway, there is no point in waiting for the same issue, so I'm triggering the same run as on staging (it might take a while to finish, so...).

[2]

ii  python3-swh.loader.git 1.9.0-1~swh1~bpo10+1 all          Software Heritage Git loader

[1]

swhworker@worker17:~$ url=https://github.com/chromium/chromium; /usr/bin/time -v swh loader run git $url | tee chromium-20220601-01.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/chromium/chromium' with type 'git'
ERROR:swh.loader.git.loader.GitLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 374, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 318, in fetch_data
    self.origin.url, base_repo, do_progress
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 240, in fetch_pack_from_origin
    progress=do_activity,
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 2076, in fetch_pack
    "git-upload-pack", url, data=req_data.getvalue()
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 1952, in _smart_request
    resp, read = self._http_request(url, headers, data)
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 2181, in _http_request
    "POST", url, headers=req_headers, body=data
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 72, in request
    **urlopen_kw)
  File "/usr/lib/python3/dist-packages/urllib3/request.py", line 150, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 616, in urlopen
    **response_kw)
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 525, in from_httplib
    **response_kw)
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 209, in __init__
    self._body = self.read(decode_content=decode_content)
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 438, in read
    data = self._fp.read()
  File "/usr/lib/python3.7/http/client.py", line 468, in read
    return self._readall_chunked()
  File "/usr/lib/python3.7/http/client.py", line 580, in _readall_chunked
    return b''.join(value)
MemoryError
{'status': 'failed'} for origin 'https://github.com/chromium/chromium'
        Command being timed: "swh loader run git https://github.com/chromium/chromium"
        User time (seconds): 448.15
        System time (seconds): 319.74
        Percent of CPU this job got: 52%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 24:26.73
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 29481236
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 7362239
        Voluntary context switches: 161768
        Involuntary context switches: 39326
        Swaps: 0
        File system inputs: 8
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

8g (pack size limit) was not enough either; it broke on both workers ¯\_(ツ)_/¯.
We have no clue what the size limit should be, so I'm clearly taking shots in the dark.
I've started a 32g experiment on worker1.staging and a 64g one on worker17.
We will see.
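
For reference, the `pack_size_bytes` values passed on the command line in these runs are plain binary multiples. A small sketch of the conversion (the `gib_to_bytes` helper is hypothetical and only there to show the arithmetic):

```python
GIB = 1024 ** 3  # number of bytes in one gibibyte


def gib_to_bytes(gib: int) -> int:
    """Convert a size in GiB to the byte count expected by pack_size_bytes."""
    return gib * GIB


# Default limit seen in the error message, then the 8g, 32g and 64g experiments.
for gib in (4, 8, 32, 64):
    print(f"{gib}g -> pack_size_bytes={gib_to_bytes(gib)}")
```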

64g was a bit too much for worker17 [1]: it ran out of memory (the write failed with "No space left on device"), so fail!
The staging worker seems to be taking a nicer path (still up and running), so
I've started that same ingestion (32g pack size limit) on worker17 now.

[1]

swhworker@worker17:~$ url=https://github.com/chromium/chromium; /usr/bin/time -v swh loader run git $url pack_size_bytes=68719476736 | tee chromium-20220601-04-pack-size-limit-64g.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/chromium/chromium' with type 'git'
Enumerating objects: 18243673, done.
Counting objects: 100% (1536/1536), done.
Compressing objects: 100% (939/939), done.
ERROR:swh.loader.git.loader.GitLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 374, in load
    more_data_to_fetch = self.fetch_data()
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 318, in fetch_data
    self.origin.url, base_repo, do_progress
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 240, in fetch_pack_from_origin
    progress=do_activity,
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 2087, in fetch_pack
    progress,
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 915, in _handle_upload_pack_tail
    SIDE_BAND_CHANNEL_PROGRESS: progress,
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 674, in _read_side_band64k_data
    cb(pkt)
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 233, in do_pack
    pack_buffer.write(data)
  File "/usr/lib/python3.7/tempfile.py", line 903, in write
    rv = file.write(s)
OSError: [Errno 28] No space left on device
{'status': 'failed'} for origin 'https://github.com/chromium/chromium'
        Command being timed: "swh loader run git https://github.com/chromium/chromium pack_size_bytes=68719476736"
        User time (seconds): 409.20
        System time (seconds): 398.17
        Percent of CPU this job got: 48%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 27:35.21
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 58774716
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 155
        Minor (reclaiming a frame) page faults: 10859992
        Voluntary context switches: 178535
        Involuntary context switches: 18268
        Swaps: 0
        File system inputs: 30112
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
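
Since the traceback ends in a temporary file write, one cheap safeguard before a manual run is to check that the temporary directory can hold a pack of the requested size. This is only a sketch: `has_room_for_pack` is a hypothetical helper, and it assumes the loader's temporary file lives in Python's default temp directory, which may not match the worker's actual configuration (if that directory is RAM-backed, the space checked is memory).

```python
import shutil
import tempfile
from typing import Optional


def has_room_for_pack(pack_size_bytes: int, directory: Optional[str] = None) -> bool:
    """Return True if `directory` has at least `pack_size_bytes` free.

    Defaults to Python's temp directory (an assumption about where the
    pack buffer ends up).
    """
    target = directory or tempfile.gettempdir()
    free = shutil.disk_usage(target).free
    return free >= pack_size_bytes


if not has_room_for_pack(64 * 1024 ** 3):
    print("not enough free space for a 64g pack limit")
```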

Status update: both worker1.staging and worker17 are now past the pack file limit
step where they usually crash \o/ [1].

So the current chromium ingestion retrieves a pack file of ~18.2M objects (the "Total"
line in the log counts objects, not bytes).

Their memory use is now far more reasonable than it was while the ingestion was
starting up [2] (virt/rss: ~2g/2g now vs ~56g/21g during initialization).

[1]

swhworker@worker1:~$ url=https://github.com/chromium/chromium; /usr/bin/time -v swh loader run git $url pack_size_bytes=34359738368 | tee chromium-20220601-pack-size-32g.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/chromium/chromium' with type 'git'
Enumerating objects: 18243617, done.
Counting objects: 100% (1476/1476), done.
Compressing objects: 100% (893/893), done.
Total 18243617 (delta 622), reused 705 (delta 572), pack-reused 18242141
INFO:swh.loader.git.loader:Listed 28831 refs for repo https://github.com/chromium/chromium
...

[2] from htop:

4091208 swhworker  20   0 2225M 2185M  5240 S  0.0  9.1 52:32.93 │                    └─ /usr/bin/python3 /usr/bin/swh loader run git https://github.com/chromium/chromium pack_size_bytes=34359738368

Heads up: the ingestion is still ongoing, and memory consumption is quite stable.

worker17:

PID     USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
3808758 swhworker  20   0  2272   304   240 S  0.0  0.0  0:00.00 │  │              └─ time -v swh loader run git https://github.com/chromium/chromium pack_size_bytes=34359738368
3808760 swhworker  20   0 2290M 2260M 10764 S  0.0  3.5  1h47:20 │  │                 └─ python3 /usr/bin/swh loader run git https://github.com/chromium/chromium pack_size_bytes=34359738368

worker1.staging:

PID     USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
4091206 swhworker  20   0  2276    52    52 S  0.0  0.0  0:00.00 │                 └─ /usr/bin/time -v swh loader run git https://github.com/chromium/chromium pack_size_bytes=34359738368
4091208 swhworker  20   0 2320M 1618M  5096 S  0.0  6.8  2h17:39 │                    └─ /usr/bin/python3 /usr/bin/swh loader run git https://github.com/chromium/chromium pack_size_bytes=34359738368

Success for production worker [1]. Staging worker is still working on it.

[1] worker17

swhworker@worker17:~$ url=https://github.com/chromium/chromium; /usr/bin/time -v swh loader run git $url pack_size_bytes=34359738368 | tee chromium-20220601-03-pack-size-limit-32g.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/chromium/chromium' with type 'git'
Enumerating objects: 18243862, done.
Counting objects: 100% (1723/1723), done.
Compressing objects: 100% (1094/1094), done.
Total 18243862 (delta 717), reused 890 (delta 607), pack-reused 18242139
INFO:swh.loader.git.loader:Listed 28832 refs for repo https://github.com/chromium/chromium
INFO:swh.loader.git.loader.GitLoader:Fetched 18243863 objects; 6260568 are new
{'status': 'eventful'} for origin 'https://github.com/chromium/chromium'
        Command being timed: "swh loader run git https://github.com/chromium/chromium pack_size_bytes=34359738368"
        User time (seconds): 102415.56
        System time (seconds): 5484.61
        Percent of CPU this job got: 22%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 134:40:21
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 58810708
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 35
        Minor (reclaiming a frame) page faults: 23570033
        Voluntary context switches: 535303
        Involuntary context switches: 307983
        Swaps: 0
        File system inputs: 33808
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Triggered a run to ingest a fork (extra arguments needed with the cli) on production worker:

swhworker@worker17:~$ url=https://github.com/thebigbrain/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-04-pack-size-limit-32g-fork.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/thebigbrain/chromium' with type 'git'
Enumerating objects: 10930922, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 10930922 (delta 140), reused 135 (delta 135), pack-reused 10930731
INFO:swh.loader.git.loader:Listed 15020 refs for repo https://github.com/thebigbrain/chromium

The loader crashed with memory issues, probably because too many loads were running in parallel.
I'm currently stopping the worker's other processes to let this one finish (I'll restart it).

swhworker@worker17:~$ url=https://github.com/thebigbrain/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-04-pack-size-limit-32g-fork.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/thebigbrain/chromium' with type 'git'
Enumerating objects: 10930922, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 10930922 (delta 140), reused 135 (delta 135), pack-reused 10930731
INFO:swh.loader.git.loader:Listed 15020 refs for repo https://github.com/thebigbrain/chromium
ERROR:swh.loader.git.loader.GitLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 377, in load
    self.store_data()
  File "/usr/lib/python3/dist-packages/swh/loader/git/base.py", line 80, in store_data
    for obj in self.get_contents():
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 414, in get_contents
    for raw_obj in self.iter_objects(b"blob"):
  File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 404, in iter_objects
    PackData.from_file(self.pack_buffer, self.pack_size)
  File "/usr/lib/python3/dist-packages/dulwich/pack.py", line 1386, in _walk_all_chains
    for result in self._follow_chain(offset, type_num, None):
  File "/usr/lib/python3/dist-packages/dulwich/pack.py", line 1444, in _follow_chain
    unpacked = self._resolve_object(offset, obj_type_num, base_chunks)
  File "/usr/lib/python3/dist-packages/dulwich/pack.py", line 1435, in _resolve_object
    unpacked.obj_chunks = apply_delta(base_chunks, unpacked.decomp_chunks)
MemoryError
{'status': 'failed'} for origin 'https://github.com/thebigbrain/chromium'
        Command being timed: "swh loader run git https://github.com/thebigbrain/chromium lister_name=github lister_instance_name=github pack_size_bytes=34359738368"
        User time (seconds): 6907.23
        System time (seconds): 619.13
        Percent of CPU this job got: 62%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:19:27
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 21273060
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 75
        Minor (reclaiming a frame) page faults: 6586107
        Voluntary context switches: 14848
        Involuntary context switches: 90856
        Swaps: 0
        File system inputs: 10352
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.

In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?
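
A minimal sketch of the debug output being asked for, assuming only that the loader object exposes the two attributes named above. The `describe_fork_detection` helper and the stand-in loader are hypothetical; in practice the two lines would be printed directly at the end of `prepare`.

```python
from types import SimpleNamespace


def describe_fork_detection(loader) -> str:
    """Format the two attributes that show whether a fork was detected."""
    tags = loader.statsd.constant_tags
    parents = loader.parent_origins
    return (
        f"self.statsd.constant_tags: {tags}\n"
        f"self.parent_origins: {parents}"
    )


# Stand-in object for illustration only; a real GitLoader would be used instead.
fake_loader = SimpleNamespace(
    statsd=SimpleNamespace(constant_tags={"visit_type": "git"}),
    parent_origins=[],
)
print(describe_fork_detection(fake_loader))
```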

I've already restarted the process. The packfile sent was a large one (~10.9M objects [1]), but much smaller than for the initial load (~18.2M objects [2]), if I read those logs correctly.

[1]

Total 10930922 (delta 140), reused 135 (delta 135), pack-reused 10930731

[2]

Total 18243617 (delta 622), reused 705 (delta 572), pack-reused 18242141
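
To compare the two summary lines above without eyeballing them, they can be parsed into numbers. Note that the "Total" figure in git's summary is an object count, not a size in bytes. The sketch below uses a hypothetical `parse_pack_summary` helper and a regular expression written against the exact line format shown in these logs.

```python
import re

# Matches lines such as:
# "Total 10930922 (delta 140), reused 135 (delta 135), pack-reused 10930731"
SUMMARY_RE = re.compile(
    r"Total (?P<total>\d+) \(delta (?P<delta>\d+)\), "
    r"reused (?P<reused>\d+) \(delta (?P<reused_delta>\d+)\), "
    r"pack-reused (?P<pack_reused>\d+)"
)


def parse_pack_summary(line: str) -> dict:
    """Extract the object counts from a git pack summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a pack summary line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


fork = parse_pack_summary(
    "Total 10930922 (delta 140), reused 135 (delta 135), pack-reused 10930731"
)
parent = parse_pack_summary(
    "Total 18243617 (delta 622), reused 705 (delta 572), pack-reused 18242141"
)
print(f"fork/parent object ratio: {fork['total'] / parent['total']:.2f}")
```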

initial load of a different repository, which has 338k more commits

Which one has that many more commits, the initial one? If so, I would expect the fork to load much faster, since they should share history at some point in the past.

My point was mostly to say "no, not immediately" to your question [1]: the process had already been restarted for some time (before the notification on the sysadm channel).
I'll do it if that fails again.

And I thought the packfile logs were interesting. If they are not, please explain a bit, because I don't quite see why.

[1] > In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?

> Which one has that many more commits, the initial one?

Yes

> If so, I would expect the fork to load much faster, since they should share history at some point in the past.

I would have expected it not to run out of memory (which was the point of the manual load), and it already failed that test.

yes, ok so we are aligned then.

Note that the first repo run took 134:40:21 (after multiple attempts, so maybe more than that overall), so even if the fork ingestion takes ~10h, that would already be much quicker ¯\_(ツ)_/¯ (it has been running for ~52min now [1]).

[1]

21327 swhworker  20   0 9124M 9096M 12804 R 27.4 18.8 52:58.02 │  │                 └─ python3 /usr/bin/swh loader run git https://github.com/thebigbrain/chromium lister_name=github lister_instance_name=github pack_size_bytes=34359738368

Well, it finished and took ~20h [1]; still a win compared to the initial ingestion's 134h...

[1]

swhworker@worker17:~$ url=https://github.com/thebigbrain/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-04-pack-size-limit-32g-fork.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/thebigbrain/chromium' with type 'git'
WARNING:swh.storage.proxies.retry:Retrying RPC call
WARNING:swh.storage.proxies.retry:Retrying RPC call
Enumerating objects: 10930922, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 10930922 (delta 140), reused 135 (delta 135), pack-reused 10930731
INFO:swh.loader.git.loader:Listed 15020 refs for repo https://github.com/thebigbrain/chromium
INFO:swh.loader.git.loader.GitLoader:Fetched 10930923 objects; 3 are new
{'status': 'eventful'} for origin 'https://github.com/thebigbrain/chromium'
        Command being timed: "swh loader run git https://github.com/thebigbrain/chromium lister_name=github lister_instance_name=github pack_size_bytes=34359738368"
        User time (seconds): 53568.28
        System time (seconds): 2469.61
        Percent of CPU this job got: 75%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 20:43:24
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 21274320
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 17
        Minor (reclaiming a frame) page faults: 6065975
        Voluntary context switches: 200471
        Involuntary context switches: 213563
        Swaps: 0
        File system inputs: 21200
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
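
The elapsed times reported by `/usr/bin/time -v` can be compared directly once converted to seconds. A small sketch, with a hypothetical `parse_elapsed` helper that handles both the `h:mm:ss` and `m:ss` forms seen in these logs:

```python
def parse_elapsed(value: str) -> float:
    """Convert an 'h:mm:ss' or 'm:ss' elapsed field to seconds."""
    seconds = 0.0
    # Each field is worth 60 times the one to its right.
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


initial_load = parse_elapsed("134:40:21")  # first chromium/chromium load
fork_load = parse_elapsed("20:43:24")      # thebigbrain/chromium fork load
speedup = initial_load / fork_load
print(f"fork load was about {speedup:.1f}x faster")
```

This only compares wall-clock time for two single runs on a shared worker, so it is an indication rather than a benchmark.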

So the first fork ingestion finished and took less time.

> Looks like either the loader didn't detect it is a fork, or github sent a large packfile anyway.
>
> In swh/loader/git/loader.py at the end of the prepare function, could you print self.statsd.constant_tags and self.parent_origins, to see which it is?

jsyk, I've edited the file accordingly and triggered another fork ingestion:

swhworker@worker17:~$ url=https://github.com/Tomahawkd/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork2.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'

@vlorentz I also encountered [1] this morning which might explain the large packfile...

[1] T4311

That's still a pretty big packfile, ~12.6M objects [1]... I'm pondering whether I should stop it,
install the new python3-dulwich that olasd packaged, and trigger it again...

[1]

INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'
Enumerating objects: 12661350, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 12661350 (delta 140), reused 135 (delta 135), pack-reused 12661159
INFO:swh.loader.git.loader:Listed 15230 refs for repo https://github.com/Tomahawkd/chromium

I've updated dulwich on worker17 and triggered another fork ingestion instead.
It does not change much: the packfile still holds ~11M objects.
I'll let this rest now.

swhworker@worker17:~$ url=https://github.com/Innerface/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork3.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Innerface/chromium' with type 'git'
Enumerating objects: 11165662, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 11165662 (delta 140), reused 135 (delta 135), pack-reused 11165471
INFO:swh.loader.git.loader:Listed 15333 refs for repo https://github.com/Innerface/chromium

Status: a second fork ingestion (Innerface/chromium) is done, ahead of the Tomahawkd/chromium one, which is still ongoing [1].

[1] @vlorentz Note the output of the print statements you asked for:

swhworker@worker17:~$ export SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml
swhworker@worker17:~$ url=https://github.com/Innerface/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork3.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Innerface/chromium' with type 'git'
Enumerating objects: 11165662, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 11165662 (delta 140), reused 135 (delta 135), pack-reused 11165471
INFO:swh.loader.git.loader:Listed 15333 refs for repo https://github.com/Innerface/chromium
INFO:swh.loader.git.loader.GitLoader:Fetched 11165663 objects; 1 are new
self.statsd.constant_tags: {'visit_type': 'git', 'incremental_enabled': True, 'has_parent_origins': True, 'has_parent_snapshot': True, 'has_previous_snapshot': False}
self.parent_origins: [Origin(url='https://github.com/chromium/chromium', id=b'\xa9\xf66\xa1/\\\xc3\\\xa4\x18+\r\xe7L\x91\x94\xe9\x00\x96J')]
{'status': 'eventful'} for origin 'https://github.com/Innerface/chromium'
        Command being timed: "swh loader run git https://github.com/Innerface/chromium lister_name=github lister_instance_name=github pack_size_bytes=34359738368"
        User time (seconds): 53276.86
        System time (seconds): 2492.25
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 21:29:12
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1842708
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 5961267
        Voluntary context switches: 195355
        Involuntary context switches: 224013
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

heh, ok, so it's indeed because github sends us way too much
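
The reasoning behind that conclusion can be sketched as a small decision function over the tags and the "Fetched N objects; M are new" line from the log. The `diagnose_large_pack` helper and its 1% threshold are assumptions made for illustration, not part of the loader.

```python
def diagnose_large_pack(tags: dict, fetched: int, new: int) -> str:
    """Tell apart 'fork not detected' from 'server sent too much anyway'."""
    detected = bool(tags.get("has_parent_origins")) and bool(
        tags.get("has_parent_snapshot")
    )
    if not detected:
        return "fork not detected by the loader"
    # Arbitrary threshold (assumption): under 1% new objects means the pack
    # was almost entirely made of objects already in the archive.
    if fetched and new / fetched < 0.01:
        return "fork detected; server still sent a near-full packfile"
    return "fork detected; packfile mostly contained new objects"


# Values copied from the Innerface/chromium run.
tags = {
    "visit_type": "git",
    "incremental_enabled": True,
    "has_parent_origins": True,
    "has_parent_snapshot": True,
    "has_previous_snapshot": False,
}
print(diagnose_large_pack(tags, fetched=11165663, new=1))
```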

And the remaining fork ingestion (Tomahawkd/chromium) is done as well:

swhworker@worker17:~$ url=https://github.com/Tomahawkd/chromium; /usr/bin/time -v swh loader run git $url lister_name=github lister_instance_name=github pack_size_bytes=34359738368 | tee chromium-20220607-05-pack-size-limit-32g-fork2.txt
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/Tomahawkd/chromium' with type 'git'
Enumerating objects: 12661350, done.
Counting objects: 100% (191/191), done.
Compressing objects: 100% (56/56), done.
Total 12661350 (delta 140), reused 135 (delta 135), pack-reused 12661159
INFO:swh.loader.git.loader:Listed 15230 refs for repo https://github.com/Tomahawkd/chromium
INFO:swh.loader.git.loader.GitLoader:Fetched 12661351 objects; 2 are new
self.statsd.constant_tags: {'visit_type': 'git', 'incremental_enabled': True, 'has_parent_origins': True, 'has_parent_snapshot': True, 'has_previous_snapshot': False}
self.parent_origins: [Origin(url='https://github.com/chromium/chromium', id=b'\xa9\xf66\xa1/\\\xc3\\\xa4\x18+\r\xe7L\x91\x94\xe9\x00\x96J')]
{'status': 'eventful'} for origin 'https://github.com/Tomahawkd/chromium'
        Command being timed: "swh loader run git https://github.com/Tomahawkd/chromium lister_name=github lister_instance_name=github pack_size_bytes=34359738368"
        User time (seconds): 62323.33
        System time (seconds): 3001.76
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 25:03:29
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 29352136
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 8
        Minor (reclaiming a frame) page faults: 10355329
        Voluntary context switches: 265156
        Involuntary context switches: 265330
        Swaps: 0
        File system inputs: 2048
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@vlorentz I don't have anything left to do, can i close it now?

ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.

Let's do this the other way around, closing this as i'm done.
Please reopen if you need something else.

Cheers,