#+title: Analyze and try to reduce loader-git memory consumption
#+author: vsellier, ardumont
The current git loader consumes a lot of memory depending on the size of the repository.
It fetches the full packfile of unknown references (filtered by the last snapshot's
references), then parses the packfile multiple times to load, in order, contents,
directories, revisions and releases, and finishes by creating one snapshot for the
visit.
References in this context are the resolved tips of branches (e.g. refs/heads/master,
...) or tags (e.g. refs/tags/...).
While the memory consumption is not a problem for small (< 200 refs) to medium
repositories (<= 500 refs), it can become one on large repositories (> 500 refs):
1. The currently unique packfile, retrieved at the beginning of the loading, is too big
(> 4 GiB), which immediately fails the ingestion. Nothing gets done. The visit is marked
as failed. If that happens too often (thrice consecutively iirc), the origin ends up
disabled, so no longer scheduled (up until it's listed again).
2. The ingestion starts but, due to concurrency with other loading processes, the
ingestion process gets killed. That means partial ingestion of objects got done, but
no snapshot nor finalized visit. This is actually the major problem currently.
Current deployment details also imply heavy disk i/o (which creates ceph problems
down the line).
3. Point 2. is also problematic for scheduling further visits for that origin.
Nonetheless, if further visits happen somehow, those will skip already ingested
objects (which will still have been retrieved again though, without a partial snapshot
in between).
To solve these problems, some work has been investigated and tried.
A first naive attempt was made to iterate over the packfile once and keep a dict of
the references (to drop the packfile reference immediately) [1]. This failed as the
memory consumption spiked even further. It had the advantage of killing the loading very
fast. So the conclusion of this attempt is that iterating over the packfile multiple
times (one iteration for each type of object of our model) is actually not the problem.
[1] https://forge.softwareheritage.org/D6377
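For the record, my reading of that naive approach (a hedged sketch, not the actual D6377
diff; iter_pack_objects is a hypothetical placeholder) is roughly the following: keeping
every parsed object resident in a dict is exactly what makes the memory spike.
#+begin_src python
# Hedged sketch of the single-iteration idea from [1]. All objects end up
# resident in memory at once, instead of being streamed per object type.

def load_packfile_in_one_pass(packfile):
    objects = {"content": {}, "directory": {}, "revision": {}, "release": {}}
    for obj in iter_pack_objects(packfile):   # iter_pack_objects: hypothetical
        objects[obj.type][obj.id] = obj        # everything stays in memory
    # The packfile reference can be dropped here, but the dicts now hold it all,
    # hence the even higher memory consumption observed.
    return objects
#+end_src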
Another attempt was to modify the git loader to make the ingestion fetch multiple
packfiles ([2] [3], with a slight loader-core change required [4]). This has the
advantage of naturally taking care of 1. (no more huge packfile). This is done by asking
for intervals of unknown remote refs, starting with the tags (in natural order), then
the branches [5]; a rough sketch follows the footnotes below. The natural order on tags
sounds like a proper way to incrementally load the repository following its history [3].
If we don't follow the history (only [2]), we could first fetch a huge packfile (with
mostly everything in it), thus back to square one. This assumes there are tags in the
repository (which should mostly be the case). The only limitation seen for that approach
is that we now continually discuss with the server to retrieve information (so a time
trade-off) during the loading. FWIW, this continuous server discussion approach is
what's currently done with the mercurial loader without issues (which is very stable now
and not as greedy in memory as it used to be, hence one other motivation to align the
git loader behavior).
[2] https://forge.softwareheritage.org/D6386 (ingest in multiple packfile fetches)
[3] https://forge.softwareheritage.org/D6392 (follow refs in order)
[4] https://forge.softwareheritage.org/D6380. This core loader adaptation could be
dropped if we rework the git loader to let it do partial snapshots after each packfile
consumption (so, during the main ingestion, multiple partial snapshots prior to the
final one). Right now, it cannot as-is, as missing references during ongoing ingestion
prevent the partial snapshot from being built. Such an adaptation would take care of
point 3. (and make subsequent visits do less work even in case of failure). Thanks to
@vlorentz who made me realize this; after a couple of nights sleeping on it, it clicked!
[5] Another idea (not followed through) would be to ingest some known special references
which are assumed highly connected within the graph (e.g. "HEAD", "refs/heads/master",
"refs/heads/main", "refs/heads/develop", ... others?) at the end of the loading. The
reasoning is that we assume those are the main connected part of the repository, so
starting with those would end up with a huge packfile immediately (with large
repositories at least, back to the initial problem). If we start with the other
references first, then deal with those at the end, only a bit more work would be needed
to fill in the blanks. That could yet be another optimization, which would also help if
there are no tags in the repository.
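To make the batched-fetch idea concrete, here is a hedged sketch (assuming the same kind
of hypothetical helpers as above, list_remote_refs, fetch_packfile, ingest_packfile and
build_snapshot; the real implementation is in [2] and [3]):
#+begin_src python
# Hedged sketch of the multiple-packfile ingestion from [2]/[3]: unknown refs
# are fetched in small intervals, tags first, then branches. Helper names are
# hypothetical placeholders, not the actual swh.loader.git API.

def batched(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def load_in_batches(origin_url, known_shas, batch_size=200):
    refs = list_remote_refs(origin_url)   # {ref_name: sha}
    unknown = {name: sha for name, sha in refs.items() if sha not in known_shas}

    # Tags first (plain sort here as a stand-in for the "natural order" of [3]),
    # then the branches.
    tags = sorted(name for name in unknown if name.startswith("refs/tags/"))
    branches = sorted(name for name in unknown if not name.startswith("refs/tags/"))

    for ref_batch in batched(tags + branches, batch_size):
        wants = [unknown[name] for name in ref_batch]
        packfile = fetch_packfile(origin_url, wants=wants)  # several small fetches
        ingest_packfile(packfile)
        # With the partial-snapshot rework evoked in [4], a partial snapshot
        # could also be recorded here, after each consumed packfile.

    return build_snapshot(refs)
#+end_src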
Another consideration we did not follow completely through yet was to use a depth
parameter (in the current internal lib used to discuss with the server). It's not
completely clear what actual depth number would be relatively decent and satisfying
enough for all repositories out there, so it's been slightly tested but dismissed for
now due to that question. It's not to be excluded though. It may simply be that this
solution, composed with the previous points, would just be a deeper optimization on
reducing the loader's work (the part walking the git graph).
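Purely as a hedged illustration of what using that depth parameter could look like (the
concrete value is precisely the open question, and fetch_packfile/ingest_packfile remain
the hypothetical helpers used above):
#+begin_src python
# Hedged sketch only: a shallow fetch bounded by some depth. Which DEPTH value
# would be decent for all repositories is exactly what remains unclear.

DEPTH = 100  # arbitrary guess

def load_shallow(origin_url, wants):
    packfile = fetch_packfile(origin_url, wants=wants, depth=DEPTH)
    ingest_packfile(packfile)
    # History older than DEPTH commits from the requested refs is simply not
    # fetched, so this mostly reduces the graph-walking work rather than
    # replacing the batching approach above.
#+end_src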
As another optimization point to further align the git loader with the mercurial loader,
it would be interesting to start using the extid table to map what's considered git ids
(which will change) with the revision/release ids, then start using this mapping to
actually filter known refs across origins (and not only against the last snapshot of the
same origin, as currently). That optimization gave a good boost in actually doing less
work for ingesting mercurial forks. Filtering known refs (revision/release) early enough
would further reduce the packfile (when the extid table is actually filled enough, that
is).
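A hedged sketch of what such filtering could look like; the extid_get_from_extid call
and the "git" extid type are my assumptions about the swh.storage interface and should
be double checked:
#+begin_src python
# Hedged sketch: before fetching, drop remote refs whose git id is already
# mapped to a known target through the extid table, whatever the origin that
# ingested it. Storage call name and extid type are assumptions to verify.

def filter_refs_known_via_extid(storage, remote_refs):
    """remote_refs: {ref_name: git_sha1} -> subset whose targets are still unknown."""
    candidates = list(set(remote_refs.values()))
    known = {
        extid.extid
        for extid in storage.extid_get_from_extid("git", candidates)  # assumed API
    }
    return {name: sha for name, sha in remote_refs.items() if sha not in known}
#+end_src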
Having described all that, I'm fairly convinced that some approaches described here are
a possible way forward which is better than the current status quo.
Shall we proceed?
Any thoughts, pros or cons arguments welcomed.
Cheers,