Page MenuHomeSoftware Heritage

Stop processing packfiles before sending objects
ClosedPublic

Authored by olasd on Feb 25 2021, 4:18 PM.

Details

Summary

Since its creation, the git loader would process the packfile downloaded
from the remote repository, to make an index of all objects, filtering
them before sending them on to the storage. Since this functionality has
been implemented as a filter proxy in the storage API itself, the
built-in filtering by the git loader is now redundant.

The way the filtering was implemented in the loader would run through
the packfile six times: once for the basic object id indexing, once to
get content ids, then once for each object type. This change removes the
first two runs. By eschewing the double filtering, we should also reduce
the load on the backend storage (we would call the <object_type>_missing
endpoints twice).

Finally, as this change removes the global index of objects, and sends
the converted objects to the storage as soon as they're read, the memory
usage decreases substantially for large loads.

Test Plan

Integration tests unchanged

Diff Detail

Repository
rDLDG Git loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D5147 (id=18409)

Rebasing onto 11fe5b2770...

Current branch diff-target is up to date.
Changes applied before test
commit 076abc14fff85aa5ab687820932ee5c6332c953c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:42:21 2021 +0100

    Stop processing packfiles before sending objects
    
    Since its creation, the git loader would process the packfile downloaded
    from the remote repository, to make an index of all objects, filtering
    them before sending them on to the storage. Since this functionality has
    been implemented as a filter proxy in the storage API itself, the
    built-in filtering by the git loader is now redundant.
    
    The way the filtering was implemented in the loader would run through
    the packfile six times: once for the basic object id indexing, once to
    get content ids, then once for each object type. This change removes the
    first two runs. By eschewing the double filtering, we should also reduce
    the load on the backend storage (we would call the <object_type>_missing
    endpoints twice).
    
    Finally, as this change removes the global index of objects, and sends
    the converted objects to the storage as soon as they're read, the memory
    usage decreases substantially for large loads.

commit e66c0927023fcb8eb098e5f1f2ea1705aa11f9d7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 4 16:54:15 2021 +0100

    Drop unused get_fetch_history_result methods

commit 7baa8117050ed4eb1f616422351851866b4252f1
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 16:16:01 2021 +0100

    Drop unused type: ignore annotation

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/87/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/87/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 25 2021, 4:18 PM
Harbormaster failed remote builds in B19487: Diff 18409!

That simplifies readability ;)

lgtm (can't accept while it's still in draft ¯\_(ツ)_/¯)

swh/loader/git/loader.py
266

(unrelated to this diff, more of a general question regarding loader)

Do we keep the self.log method attribute or do we want to generalize what's done elsewhere with the module level logger variable instead?

swh/loader/git/from_disk.py
15

16:18:53 swh/loader/git/from_disk.py:15: error: Module 'dulwich.errors' has no attribute 'EmptyFileException'

Looks like this is the gist of the current build failure.
Just revert it?

That type: ignore annotation wasn't that useless after all

Build is green

Patch application report for D5147 (id=18415)

Rebasing onto 11fe5b2770...

Current branch diff-target is up to date.
Changes applied before test
commit 61afbc56b0356f31fd12a6c44713d16617db48da
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:42:21 2021 +0100

    Stop processing packfiles before sending objects
    
    Since its creation, the git loader would process the packfile downloaded
    from the remote repository, to make an index of all objects, filtering
    them before sending them on to the storage. Since this functionality has
    been implemented as a filter proxy in the storage API itself, the
    built-in filtering by the git loader is now redundant.
    
    The way the filtering was implemented in the loader would run through
    the packfile six times: once for the basic object id indexing, once to
    get content ids, then once for each object type. This change removes the
    first two runs. By eschewing the double filtering, we should also reduce
    the load on the backend storage (we would call the <object_type>_missing
    endpoints twice).
    
    Finally, as this change removes the global index of objects, and sends
    the converted objects to the storage as soon as they're read, the memory
    usage decreases substantially for large loads.

commit 5e434d6f6e1c2f072477deb50f073333a0e9f45b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 4 16:54:15 2021 +0100

    Drop unused get_fetch_history_result methods

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/89/ for more details.

olasd requested review of this revision.Feb 25 2021, 5:52 PM
swh/loader/git/loader.py
266

I think the self.log variable isn't that useful anymore (and is even somewhat of a liability, if it's initialized improperly...). We used to add celery task informations there, but now we do that either with sentry's integration or via our custom LogHandler, so centralizing the loader logger definition isn't that useful anymore.

ardumont added inline comments.
swh/loader/git/loader.py
266

ok so that could be cleaned up at some point.

good to know. Thanks.

Note: s/method attribute/instance attribute/

This revision is now accepted and ready to land.Feb 25 2021, 6:00 PM