Page MenuHomeSoftware Heritage

core.loader: Log information about origin currently being ingested
ClosedPublic

Authored by ardumont on Sep 22 2020, 5:33 PM.

Details

Summary

Prior to this commit, we "info" logged no meaningful information [1] about the
origin being loaded.

[1] The only information we have for now is what configuration file we loaded which
i don't find satisfying... D4011

Test Plan

tox

Diff Detail

Event Timeline

Build is green

Patch application report for D4012 (id=14154)

Rebasing onto 452fa224f9...

Current branch diff-target is up to date.
Changes applied before test
commit 45189cc35882d2238be8db5a7e965fe1f414e98c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Sep 22 17:31:31 2020 +0200

    core.loader: Log information about origin currently being ingested
    
    Prior to this commit, we "info" logged no meaningful information about the
    origin being loaded. With the following, we currently will have more context
    about the status, type and origin being loaded.
    
    This concerns the dvcs loaders for now (git, svn, mercurial)

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/301/ for more details.

swh/loader/core/loader.py
283 ↗(On Diff #14154)

package loaders currently return the status and the snapshot id...
i'm in a mind to unify this as well, adding the snapshot id as well here...

Build is green

Patch application report for D4012 (id=14157)

Rebasing onto b5d607989b...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-82-D4012.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/82/ for more details.

Build is green

Patch application report for D4012 (id=14158)

Rebasing onto 452fa224f9...

Current branch diff-target is up to date.
Changes applied before test
commit b563bd610e92564eb74d1b8106fe09a07e8acc44
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Sep 22 17:31:31 2020 +0200

    core.loader: Log information about origin currently being ingested
    
    Prior to this commit, we "info" logged no meaningful information about the
    origin being loaded. With the following, we currently will have more context
    about the status, type and origin being loaded.
    
    This concerns the dvcs loaders for now (git, svn, mercurial)

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/302/ for more details.

I don't think the origin url and visit type should be sent in the task result; they're arguments of the task already.

If we want them logged by the worker when the task ends (which I agree would be useful), then we should improve logging on the worker/celery side to show some of the task arguments (for instance, if there's a "url" argument) instead / in addition of the task id.

swh/loader/core/loader.py
318–320 ↗(On Diff #14158)

I think this log line should be at the beginning of the load method.

swh/loader/core/loader.py
318–320 ↗(On Diff #14158)

yes, it'd be fine earlier indeed.

something, something, history, something, error earlier which would have made that message logs something even though nothing meaningful would have really happened... Then again, conflation and all that...

I'll improve it, thanks.

I don't think the origin url and visit type should be sent in the task
result; they're arguments of the task already.

Yes, ok.

Except that information is not immediately seen in logs output, or is it?

If we want them logged by the worker when the task ends (which I agree would
be useful), then we should improve logging on the worker/celery side to show
some of the task arguments (for instance, if there's a "url" argument)
instead / in addition of the task id.

Yes, that sounds like the right place. Although, I don't really know what code
to change now. My first thought would be to look into the scheduler though (but
not quite sure where in there yet).


To be clear, my main issue today, when I try to look through our logs to
investigate or plain read what's going on (after a deployment for example), I
don't have any clues immediately...

In my mind, the kibana information is not enough by itself, so i think i need
to cross information with say sentry to have some more context... It's
currently quite frustrating... up to an eventual point of, "oh well, I have
some other urgent matters somewhere else..." (sometimes I push through but
sometimes, I fail).

douardda added inline comments.
swh/loader/core/loader.py
279 ↗(On Diff #14158)

I'm not very fond of using an "assert" here: as always when you add an assertion in the code: what is the expected behavior of the loading pipeline if this assertion is false?

And why in this place did you feel it's necessary to add it?

In D4012#99525, @olasd wrote:

I don't think the origin url and visit type should be sent in the task result; they're arguments of the task already.

If we want them logged by the worker when the task ends (which I agree would be useful), then we should improve logging on the worker/celery side to show some of the task arguments (for instance, if there's a "url" argument) instead / in addition of the task id.

+1

ok, i'll adapt
i would not be against a nudge in the right direction to actually improve the logging

swh/loader/core/loader.py
279 ↗(On Diff #14158)

I don't like it either but i kinda i don't trust entirely that loader...
It's been too long since we had a look into it...
(if everything is fine tests wise, that does mean everything will be fine in the other loaders git, ...)

But anyway, the gist of what olasd said means i will adapt that diff with just the one log command and revert the rest.
Then i'll have to dig in into the scheduler code to check how to make those celery/worker thing actually do the log improvment about the url ;)

swh/loader/core/loader.py
279 ↗(On Diff #14158)

that does not* mean...

Simplify to just one log statement

(the other improvment should go deeper in the stack)

Build is green

Patch application report for D4012 (id=14223)

Rebasing onto 7ed1927b99...

First, rewinding head to replay your work on top of it...
Applying: core.loader: Log information about origin currently being ingested
Changes applied before test
commit c5ce6ef3d34d234ea1cb5937a7345c6da2e5fda1
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 24 16:02:46 2020 +0200

    core.loader: Log information about origin currently being ingested
    
    Prior to this commit, we "info" logged no meaningful information about the
    origin being loaded. With the following commit, we will be able to tell what
    origin/visit-type is happening.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/303/ for more details.

To be clear, my main issue today, when I try to look through our logs to
investigate or plain read what's going on (after a deployment for example), I
don't have any clues immediately...

In my mind, the kibana information is not enough by itself, so i think i need
to cross information with say sentry to have some more context... It's
currently quite frustrating... up to an eventual point of, "oh well, I have
some other urgent matters somewhere else..." (sometimes I push through but
sometimes, I fail).

Yeah, I totally get it.

Our logging handler swh.core.logger.JournalHandler already knows how to pull some metadata from the celery tasks:

https://forge.softwareheritage.org/source/swh-core/browse/master/swh/core/logger.py;v0.3.0$44-49

This data appears as extra fields in the logs handled by systemd-journald (you can use, for instance, journalctl -f SWH_TASK_NAME=swh.loader.git.tasks.UpdateGitRepository), and therefore end up as named fields in the messages stored in kibana (I guess the field name ends up lowercased there).

rDCORE806f583 has dropped task arguments from the default systemd logging because:

  • some tasks have long lists of arguments that cause problems to the logging pipeline (because elasticsearch has a limit of 1000 fields per index, so flattening large argument lists can make it explode)
  • some tasks have arguments that can't be properly serialized at the bottom of the logging pipeline (elasticsearch doesn't like arbitrary bytes)

I think it would work to hardcode a few task argument fields, now that we've normalized a lot of them:

  • task.kwargs['url'] would be useful for all loaders and some listers
  • task.kwargs['instance'] would be useful for some listers too.

I've started doing this and documenting it a little better, I'll submit a diff ASAP.

I would prefer to have a hook in the celery logging infrastructure to add *explicit* extra fields to the logging calls (because doing it at the handler level makes this only happen with the systemd journald handler, that we have in prod, but it will not happen in other contexts), but I'm not even sure that's doable. Using the "ambient" current_task is a bit ugly but it works...

swh/loader/core/loader.py
308–310 ↗(On Diff #14223)

I think using self.visit.type would be more generic and accurate.

You can also use the opportunity to add message to the assert statements complaining if prepare_origin_visit didn't populate self.origin or self.visit (and put your logging statement after the asserts).

Adapt according to suggestion

swh/loader/core/loader.py
308–310 ↗(On Diff #14223)

Awesome suggestion, thanks.

you can also use...

correct

...or self.visit

that's set by our _store_origin_visit now but yeah, i'll do that.

Build is green

Patch application report for D4012 (id=14257)

Rebasing onto 7ed1927b99...

Current branch diff-target is up to date.
Changes applied before test
commit bdd29d41e0aed612da349efd10d80c6243e43c85
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Sep 24 16:02:46 2020 +0200

    core.loader: Log information about origin currently being ingested
    
    Prior to this commit, we "info" logged no meaningful information about the
    origin being loaded. With the following commit, we will be able to tell what
    origin/visit-type is happening.
    
    This also adds some extra assertion failure message to explicit why the
    self.origin and self.visit should be set and where to look if not.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/304/ for more details.

This revision is now accepted and ready to land.Sep 25 2020, 9:00 AM

Our logging handler swh.core.logger.JournalHandler already knows how to pull some metadata from the celery tasks:
...

Thanks a lot for clarifying that part!