Hello,
The current git loader consumes a lot of memory depending on the size
of the repository. It fetches the full packfile of unknown
references/refs (filtered by the last snapshot's references), then
parses the packfile multiple times to load, in order, contents,
directories, revisions, releases, and finishes by creating one snapshot
for the visit.
References in this context are resolved tips of branches (e.g.
refs/heads/master, ...) or tags (e.g. refs/tags/...).
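To give an idea of the filtering mentioned above, here is a minimal
Python sketch assuming a dulwich-style determine_wants callback
(hypothetical variable names, not the actual loader code):

    def determine_wants(remote_refs, last_snapshot_targets):
        # remote_refs: {ref_name: sha} as advertised by the git server.
        # last_snapshot_targets: shas already targeted by the last
        # snapshot of this origin; only ask for what we don't know yet.
        return [
            sha
            for ref, sha in remote_refs.items()
            if sha not in last_snapshot_targets
        ]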
While the memory consumption is not a problem for small (< 200 refs)
to medium repositories (<= 500 refs), this can become one on large
repositories (> 500 refs):
1. The single packfile currently retrieved at the beginning of the
loading is too big (> 4 GiB), which immediately fails the ingestion.
Nothing gets done. The visit is marked as failed. If that happens too
often (thrice consecutively iirc), the origin ends up disabled, so no
longer scheduled (up until it's listed again).
2. The ingestion starts but, due to concurrency with other loading
processes, the ingestion process gets killed. That means partial
ingestion of objects got done, but no snapshot nor finalized visit.
This is actually a major problem regarding the scheduling of further
visits for that origin. Nonetheless, if a further visit somehow
happens, it will skip the already ingested objects (which will still
have been retrieved again though, *without* a partial snapshot in
between visits).
3. Prior to the process getting killed, high memory pressure in the
current deployment situation implies that at some point heavy disk i/o
happens (to reduce memory pressure). That actually creates other ceph
problems down the line, which can possibly cascade to an outage, see
the events we faced during the holidays for example... It's a work in
progress outside the scope of this email. But still, that could happen
again. In the mean time, the proposed solutions could incidentally
prevent this.
To solve these problems, some work has been investigated and tried.
A first naive attempt was made to iterate over the packfile once and
keep a dict of the references (to drop the packfile reference
immediately) [1]. This failed as the memory consumption spiked even
further. It had the advantage of killing the loading very fast. So the
conclusion of this attempt is that iterating over the packfile multiple
times (one iteration for each type of object of our model) is actually
not the problem.
[1] https://forge.softwareheritage.org/D6377
Another attempt was to modify the git loader to make the ingestion
fetch multiple packfiles ([2] [3], with a slight loader-core change
required [4]). This has the advantage of naturally taking care of 1.
(no more huge packfile). This is done by asking for intervals of
unknown remote refs, starting with the tags (in natural order), then
the branches [5]. The natural order on tags sounds like a proper way to
start incrementally loading the repository following its history [3].
If we don't follow the history (only [2]), we could fetch a huge
packfile first (with mostly everything in it), thus back to square one.
This does assume tags exist in the repository (which should mostly be
the case). The only limitation seen for that approach is that we now
regularly discuss with the server to retrieve information during the
loading. It's a trade-off.
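To illustrate the batching, here is a rough Python sketch (hypothetical
helper names, not the actual diffs' code), assuming a callback that
fetches one packfile restricted to a given set of refs:

    def batched_refs(unknown_refs, batch_size=200):
        # Intervals of unknown remote refs: tags first, in natural
        # (sorted) order, then the branches.
        tags = sorted(r for r in unknown_refs
                      if r.startswith("refs/tags/"))
        branches = sorted(r for r in unknown_refs
                          if not r.startswith("refs/tags/"))
        for ordered in (tags, branches):
            for i in range(0, len(ordered), batch_size):
                yield ordered[i:i + batch_size]

    def ingest(origin_url, unknown_refs, fetch_packfile, load_objects):
        # fetch_packfile() asks the server for one packfile restricted
        # to the given refs; load_objects() parses it to load contents,
        # directories, revisions and releases as before.
        for refs in batched_refs(unknown_refs):
            load_objects(fetch_packfile(origin_url, refs))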
FWIW, this regular server discussion approach is what's currently done
with the mercurial loader without issues. That loader is very stable
now and not as greedy in memory as it used to be (hence one more
motivation to align the git loader's behavior). It did the sourceforge
hg origins ingestion without issues, and the bitbucket origins
ingestion still ongoing is running smoothly (ETA ~1 week or so).
[2] https://forge.softwareheritage.org/D6386 (ingest in multiple
packfile fetches)
[3] https://forge.softwareheritage.org/D6392 (follow tag refs in order,
then branches)
[4] https://forge.softwareheritage.org/D6380. This core loader
adaptation actually allows the git loader (well, DVCS loaders) to
create a partial visit targeting a snapshot after each internal
ingestion loop (so for git, after each packfile consumption). This
makes sure that we create incremental snapshots prior to the final one
(if we reach the end of the ingestion). Such adaptation takes care of
point 2. (and makes subsequent visits do less work even in case of
failures). Thanks to @vlorentz who made me realize this on that diff's
review. After a couple of nights sleeping on it, it finally made sense!
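Conceptually, the loop that [4] enables looks like this (a minimal
sketch with hypothetical names, not the loader-core API):

    def ingest_in_batches(batches, load_batch, build_snapshot,
                          store_visit_status):
        # After each internal ingestion loop (for git: after each
        # packfile), record a partial visit targeting an intermediate
        # snapshot, so a killed process still leaves something usable.
        snapshot = None
        for batch in batches:
            load_batch(batch)
            snapshot = build_snapshot()
            store_visit_status(status="partial", snapshot=snapshot)
        store_visit_status(status="full", snapshot=snapshot)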
[5] Another idea (not followed through) would be to delay until the
end the ingestion of some special references which are assumed highly
connected within the graph. Typically, e.g. "HEAD",
"refs/heads/master", "refs/heads/main", "refs/heads/develop", ...
(others?). The hypothesis is that these references are the main
connected part of the repository. So starting with those would end up
with a huge packfile immediately (so with large repositories at least,
we are back to the initial problem). On the contrary, if we start with
the other references first, then deal with the special ones at the end,
only a bit more work would be needed to fill in the blanks. That could
yet be another optimization which would maybe help if there are no tags
in the repository, for example.
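A possible sketch of that reordering (the list of special refs below is
only a guess):

    SPECIAL_REFS = ("HEAD", "refs/heads/master", "refs/heads/main",
                    "refs/heads/develop")

    def reorder(refs):
        # Deal with the (assumed) highly connected refs last so the
        # earlier packfiles stay small.
        others = [r for r in refs if r not in SPECIAL_REFS]
        special = [r for r in refs if r in SPECIAL_REFS]
        return others + special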
Another consideration we did not follow completely through yet was to
use a depth parameter (in the current internal lib used to discuss with
the server). It's not completely clear what actual depth number would
be relatively decent and satisfying enough for all repositories out
there. It's been slightly tested but dismissed for now due to that
question. It's not to be excluded though. It may simply be that this
solution, composed with the previous points, could be a deeper
optimization on reducing the loader's work (especially the part walking
the git graph).
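For reference, a shallow fetch with dulwich would look roughly like
this, assuming a recent dulwich version where the client's fetch
accepts a depth argument (a hedged sketch; the right depth value
remains the open question):

    from dulwich.client import get_transport_and_path

    def shallow_fetch(url, local_repo, depth=50):
        # Ask the server for a shallow history: the packfile only
        # contains objects within `depth` commits of the advertised
        # refs.
        client, path = get_transport_and_path(url)
        return client.fetch(path, local_repo, depth=depth)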
Not necessarily memory consumption related but still, as another
optimization point to further align the git loader with the mercurial
loader, it would be interesting to start using the extid table to map
what's considered git ids (which will change afaiui) to the
revision/release id. Using this mapping would allow filtering even
further what we already ingested across origins (known refs, and not
only those from the last snapshot of the same origin as currently).
That optimization reduced the mercurial loader's work quite a lot,
especially regarding mercurial forks. In the git loader's case,
filtering known refs (revision/release) early enough would actually
reduce the packfiles further (providing the extid table is actually
filled enough, that is).
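In loader terms, that would add one more filtering step, roughly (with
a hypothetical extid lookup function):

    def filter_known_refs(remote_refs, extid_lookup):
        # Drop refs whose git id is already mapped to a revision or a
        # release in the extid table, whichever origin ingested it
        # first, on top of the current last-snapshot filtering.
        return {
            ref: sha
            for ref, sha in remote_refs.items()
            if extid_lookup(sha) is None
        }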
As a heads up, there are ongoing runs between staging nodes (patched
venv with diffs [2] [3] and [4]) and production nodes. Stats will be
updated along the way when more runs conclude [6]. In the mean time,
the first runs already confirm what's been described in this email:
"it works", less memory is used and it's not necessarily slower.
Current runs actually show that staging nodes finish faster than
production ones (even though they are "smaller" machines, they are
also less loaded, so it might be just that). All of this without
impacting the current snapshot id computation, meaning the final
snapshot of the "full" visit remains the same between production and
patched staging nodes (when the visit actually finishes on the
production side).
[6] https://forge.softwareheritage.org/T3625#71604
Having described the problem, possible solutions, implemented some of
them, and with the first runs actually confirming they work, I'm now
fairly convinced that we have a way forward to improve the git loader.
With at least the first diffs in review, it will end up being less
greedy in terms of memory, thus more concurrency friendly. Which in
turn should help in making our lag subside even faster (we are keeping
up [7] btw, but not as fast as we'd like).
[7] https://grafana.softwareheritage.org/goto/crFAS4Dnk?orgId=1
Any thoughts, pros or cons arguments prior to actually driving this
all the way through?
Thanks in advance for any feedback, cheers,
@vsellier and @ardumont
--
tony / Antoine R. Dumont (@ardumont)
-----------------------------------------------------------------
gpg fingerprint BF00 203D 741A C9D5 46A8 BE07 52E2 E984 0D10 C3B8