Stop processing packfiles before sending objects
ClosedPublic
Actions

Authored by olasd on Feb 25 2021, 4:18 PM.

Details

Reviewers

ardumont

Group Reviewers

Reviewers

Commits

rDLDG61afbc56b035: Stop processing packfiles before sending objects
rDLDG5e434d6f6e1c: Drop unused get_fetch_history_result methods

Summary

Since its creation, the git loader would process the packfile downloaded
from the remote repository, to make an index of all objects, filtering
them before sending them on to the storage. Since this functionality has
been implemented as a filter proxy in the storage API itself, the
built-in filtering by the git loader is now redundant.

The way the filtering was implemented in the loader would run through
the packfile six times: once for the basic object id indexing, once to
get content ids, then once for each object type. This change removes the
first two runs. By eschewing the double filtering, we should also reduce
the load on the backend storage (we would call the <object_type>_missing
endpoints twice).

Finally, as this change removes the global index of objects, and sends
the converted objects to the storage as soon as they're read, the memory
usage decreases substantially for large loads.

Test Plan

Integration tests unchanged

Diff Detail

Repository

rDLDG Git loader

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

olasd created this revision.Feb 25 2021, 4:18 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptFeb 25 2021, 4:18 PM

Build has FAILED

Patch application report for D5147 (id=18409)

Rebasing onto 11fe5b2770...

Current branch diff-target is up to date.

Changes applied before test

commit 076abc14fff85aa5ab687820932ee5c6332c953c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:42:21 2021 +0100

    Stop processing packfiles before sending objects
    
    Since its creation, the git loader would process the packfile downloaded
    from the remote repository, to make an index of all objects, filtering
    them before sending them on to the storage. Since this functionality has
    been implemented as a filter proxy in the storage API itself, the
    built-in filtering by the git loader is now redundant.
    
    The way the filtering was implemented in the loader would run through
    the packfile six times: once for the basic object id indexing, once to
    get content ids, then once for each object type. This change removes the
    first two runs. By eschewing the double filtering, we should also reduce
    the load on the backend storage (we would call the <object_type>_missing
    endpoints twice).
    
    Finally, as this change removes the global index of objects, and sends
    the converted objects to the storage as soon as they're read, the memory
    usage decreases substantially for large loads.

commit e66c0927023fcb8eb098e5f1f2ea1705aa11f9d7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 4 16:54:15 2021 +0100

    Drop unused get_fetch_history_result methods

commit 7baa8117050ed4eb1f616422351851866b4252f1
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 16:16:01 2021 +0100

    Drop unused type: ignore annotation

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/87/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/87/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 25 2021, 4:18 PM

Harbormaster failed remote builds in B19487: Diff 18409!

That simplifies readability ;)

lgtm (can't accept while it's still in draft ¯\_(ツ)_/¯)

swh/loader/git/loader.py
266	(unrelated to this diff, more of a general question regarding loader) Do we keep the `self.log` method attribute or do we want to generalize what's done elsewhere with the module level `logger` variable instead?

ardumont added inline comments.Feb 25 2021, 5:18 PM

swh/loader/git/from_disk.py
15	16:18:53 swh/loader/git/from_disk.py:15: error: Module 'dulwich.errors' has no attribute 'EmptyFileException' Looks like this is the gist of the current build failure. Just revert it?

That type: ignore annotation wasn't that useless after all

Build is green

Patch application report for D5147 (id=18415)

Rebasing onto 11fe5b2770...

Current branch diff-target is up to date.

Changes applied before test

commit 61afbc56b0356f31fd12a6c44713d16617db48da
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:42:21 2021 +0100

    Stop processing packfiles before sending objects
    
    Since its creation, the git loader would process the packfile downloaded
    from the remote repository, to make an index of all objects, filtering
    them before sending them on to the storage. Since this functionality has
    been implemented as a filter proxy in the storage API itself, the
    built-in filtering by the git loader is now redundant.
    
    The way the filtering was implemented in the loader would run through
    the packfile six times: once for the basic object id indexing, once to
    get content ids, then once for each object type. This change removes the
    first two runs. By eschewing the double filtering, we should also reduce
    the load on the backend storage (we would call the <object_type>_missing
    endpoints twice).
    
    Finally, as this change removes the global index of objects, and sends
    the converted objects to the storage as soon as they're read, the memory
    usage decreases substantially for large loads.

commit 5e434d6f6e1c2f072477deb50f073333a0e9f45b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 4 16:54:15 2021 +0100

    Drop unused get_fetch_history_result methods

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/89/ for more details.

Harbormaster completed remote builds in B19493: Diff 18415.Feb 25 2021, 5:52 PM

olasd requested review of this revision.Feb 25 2021, 5:52 PM

olasd added inline comments.Feb 25 2021, 5:55 PM

swh/loader/git/loader.py
266	I think the `self.log` variable isn't that useful anymore (and is even somewhat of a liability, if it's initialized improperly...). We used to add celery task informations there, but now we do that either with sentry's integration or via our custom LogHandler, so centralizing the loader logger definition isn't that useful anymore.