#+title: Analyze and try to reduce loader-git memory consumption
#+author: vsellier, ardumont
The current git loader's memory consumption grows with the size of the repository. It
fetches a single packfile of all references unknown to the archive (filtered by the last
snapshot's references), then parses that packfile multiple times to load, in order,
contents, directories, revisions and releases, and finally creates a snapshot of the
visit.
While this memory consumption is not a problem for small to medium repositories, it can
become one on large repositories, in either of two ways:
1. The single packfile retrieved at the beginning of the loading is too big (> 4 GiB),
   which makes the ingestion fail immediately. Nothing has been ingested and the visit
   is marked as failed. If that happens too often (thrice consecutively, iirc), the
   origin ends up disabled, so it is no longer scheduled (until it gets listed again).
2. The ingestion starts but, due to concurrency with other loading processes, the
   ingestion process gets killed. Some objects have been ingested, but neither a
   snapshot nor a finalized visit exists. The latter is problematic for scheduling
   further visits of that origin. Nonetheless, if a further visit somehow happens, it
   will skip the already ingested objects.
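For reference, the single-packfile flow above can be sketched as follows. All names
(fetch_packfile, iter_pack_objects, ...) are hypothetical stand-ins for the loader's
dulwich-based internals, not its actual API:

#+begin_src python
# Rough sketch of the current flow: one packfile, parsed once per object
# type. fetch_packfile and iter_pack_objects are hypothetical stand-ins.

OBJECT_TYPES = ["content", "directory", "revision", "release"]

def load_origin(remote_refs, known_refs, fetch_packfile, iter_pack_objects):
    """Fetch one big packfile, then parse it once per object type."""
    wanted = sorted(set(remote_refs) - set(known_refs))
    # Single fetch of everything unknown: > 4 GiB on large repositories,
    # which is exactly failure mode 1 above.
    packfile = fetch_packfile(wanted)
    loaded = []
    for object_type in OBJECT_TYPES:  # several passes over the same packfile
        for obj in iter_pack_objects(packfile, object_type):
            loaded.append((object_type, obj))
    return loaded  # the real loader then snapshots the visit
#+end_src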
A first naive attempt was made to iterate over the packfile once, keep a dict of the
references, and drop the packfile reference immediately [1]. This failed: memory
consumption spiked even further (though it did have the advantage of killing the loading
very fast). The conclusion of this attempt is that iterating over the packfile multiple
times (one iteration per object type of our model) is actually not the problem.
[1] https://forge.softwareheritage.org/D6377
Another attempt was to modify the git loader so the ingestion fetches multiple packfiles
([2] [3], with a slight loader-core change required [4]). This has the advantage of
naturally taking care of 1. It is done by asking for intervals of unknown remote refs,
starting with the tags (in natural order) and then the branches [5]. The natural order
on tags sounds like a proper way to start, since it should incrementally load the
repository following its history [3]. If we don't follow the history [2], we could first
fetch a huge packfile (with mostly everything in it, thus back to square one). This
assumes the repository has tags (which should mostly be the case). The only limitation
seen with this approach is that we now talk to the server repeatedly to retrieve
information. FWIW, this is what the mercurial loader currently does without issues (it
is by the way very stable now, and not as memory-greedy as it used to be, hence the
motivation to align the git loader with it).
[2] https://forge.softwareheritage.org/D6386
[3] https://forge.softwareheritage.org/D6392
[4] https://forge.softwareheritage.org/D6380
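The interval-based fetching above can be sketched minimally as follows. The batch size
and the natural-order key are assumptions for illustration, not the actual values or
code from [2]/[3]:

#+begin_src python
import re

def natural_key(ref):
    # Natural order: digit runs compare numerically, so that
    # "refs/tags/v1.2" sorts before "refs/tags/v1.10".
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", ref)]

def ref_batches(remote_refs, known_refs, batch_size=100):
    """Yield intervals of unknown refs: tags in natural order, then branches."""
    unknown = [r for r in remote_refs if r not in set(known_refs)]
    tags = sorted((r for r in unknown if r.startswith("refs/tags/")),
                  key=natural_key)
    branches = sorted(r for r in unknown if not r.startswith("refs/tags/"))
    ordered = tags + branches
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Each yielded interval becomes one (hopefully small) packfile negotiation
# with the server, instead of a single fetch of everything.
#+end_src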
[5] Another idea (not followed through) would be to ingest some known special references
(with assumed high connectivity in the graph, e.g. "HEAD", "refs/heads/master",
"refs/heads/main", "refs/heads/develop", ... others?) as the last references. The
reasoning is that those are assumed to be the main part of the repository, hence the
highly connected part of the graph. Starting with those highly connected references
would, on large repositories, immediately yield a huge packfile (back to square one
again). Starting with the other references instead and dealing with those at the end
sounds like a bit more work to fill in the holes (but hopefully not too much). That
could be yet another optimization, which could also help when the repository has no
tags.
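That ordering could look like the following sketch; the SPECIAL_LAST list is purely
illustrative, taken from the examples above:

#+begin_src python
# Sketch of the ordering from note [5]: assumed highly connected refs go
# last, so the preceding small batches have already covered most of the
# history they point to.

SPECIAL_LAST = ("HEAD", "refs/heads/master", "refs/heads/main",
                "refs/heads/develop")

def order_refs(unknown_refs):
    tags = sorted(r for r in unknown_refs if r.startswith("refs/tags/"))
    special = [r for r in unknown_refs if r in SPECIAL_LAST]
    rest = sorted(r for r in unknown_refs
                  if r not in SPECIAL_LAST and not r.startswith("refs/tags/"))
    return tags + rest + special
#+end_src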
Another consideration we have not completely followed through yet is to use a depth
parameter (in the current internal tool used to talk to the server), as it is not clear
which depth value would be decent and satisfying enough for all repositories out there.
It is not to be excluded though: composed with the previous points, it could be a
further optimization reducing the loader's work.
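One way to sidestep choosing a single depth value for all repositories would be
iterative deepening. This is only a sketch under the assumption that the fetch tool can
report whether the fetched history is complete; fetch_with_depth is hypothetical:

#+begin_src python
def deepening_fetch(refs, fetch_with_depth, initial_depth=64, max_depth=8192):
    """Fetch with exponentially increasing depth until history is complete.

    fetch_with_depth(refs, depth) is a hypothetical callable returning True
    once the fetched history is complete; depth=None means a full fetch.
    """
    depth = initial_depth
    while depth <= max_depth:
        if fetch_with_depth(refs, depth):
            return depth
        depth *= 2  # exponential deepening bounds the number of round trips
    fetch_with_depth(refs, None)  # give up on shallowness: full fetch
    return None
#+end_src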
Any thoughts?