Page MenuHomeSoftware Heritage

package.loader: Flush and clear regularly internal proxy storage states
AbandonedPublic

Authored by ardumont on Apr 11 2020, 12:14 PM.

Details

Summary

To avoid the internal states of proxy storages to grow out of proportions.
D3012 and this diff should be enough to replace D2973 altogether.

Related to T2352

Test Plan

tox

Diff Detail

Event Timeline

Build is green

Patch application report for D3014 (id=10700)

Rebasing onto 0c35beca3c...

Current branch diff-target is up to date.
Changes applied before test
commit 3cef5b1ba32c4507cd5228486efb0662bb2bcd1d
Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
Date:   Sat Apr 11 12:10:20 2020 +0200

    package.loader: Flush and clear regularly inner worker state
    
    To avoid the inner state of proxy storages to grow out of proportions.
    
    Related to T2352

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/19/ for more details.

Isn't it already configurable in the buffering proxy?

And if not, it probably should be

Isn't it already configurable in the buffering proxy?

... ¯\_(ツ)_/¯ ...

but yeah, it is already so it's redundant with the buffer proxy behavior.

... (more thinking) ...

Not the filtering proxy though (right now it can grow infinitely).

But it'd be better to flush at the same time we clean the filter inner state.

TIL that the filter has an internal state too.

I think that internal state is useless now; as the buffering proxy deduplicates objects in its buffer.

TIL that the filter has an internal state too.

It keeps a set of objects (objects_seen: Dict[str, Set[bytes]] = {},
object_type to 1 hash id) already seen through the loading (to filter prior to
actual call to the missing endpoint).

Right now, it's never cleared. That diff fixes that when calling
self.storage.clear_buffers.

TIL that the filter has an internal state too.

It keeps a set of objects (objects_seen: Dict[str, Set[bytes]] = {},
object_type to 1 hash id) already seen through the loading (to filter prior to
actual call to the missing endpoint).

Right now, it's never cleared. That diff fixes that when calling
self.storage.clear_buffers.

Yeah I saw that. What I'm saying is, we can remove objects_seen entirely

Yeah I saw that. What I'm saying is, we can remove objects_seen entirely

I understood ;)

I'm just not sure i agree with the conclusion yet (doing my weekly report ;)

Does objects_seen have a purpose, other than avoiding duplicates in the buffer proxy?

Does objects_seen have a purpose, other than avoiding duplicates in the buffer proxy?

Its initial purpose was to avoid calling too often self.storage.<object-type>_missing with the same list of objects over and over.
But that might be completely peanuts, no idea at what point it's useful or not.

I'm tempted to cleanup the filter indeed.
That's be one less state management worry.

;)

Hmm, good point. But the relevant pages are probably cached by pg or the server kernel; so that's only sparing the network requests

Hmm, good point. But the relevant pages are probably cached by pg or the server kernel; so that's only sparing the network requests

Yes, that's also a good point ;)

I'm tempted to cleanup the filter indeed.
That's be one less state management worry.

Well, I've opened D3016 for that purpose ;)

Abandon in favor of D3016 (there is no longer state in the filter proxy).
And the buffer proxy already flushes when threshold per object type is reached.
So no longer needed.