
Kernel oops in i40e driver on hypervisor3
Closed, Migrated, Edits Locked

Description

The kernel on the hypervisor3 machine oopsed inside the network driver. The machine came back up after a cold boot cycle.

Full kernel log of the oops:

I could only find somewhat old bug reports describing analogous bugs, with no obvious fix. Let's keep an eye out.

Event Timeline

olasd triaged this task as Unbreak Now! priority.
olasd created this task.

Some post-4.15 commits seem to fix this kind of issue.

07d44190a38939adfec6177a6e1b683417da291f in particular is a good candidate:

commit 07d44190a38939adfec6177a6e1b683417da291f
Author: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Date:   Mon Dec 18 05:17:25 2017 -0500

    i40e/i40evf: Detect and recover hung queue scenario
    
    In VFs, there is a known issue which can cause writebacks
    to not occur when interrupts are disabled and there are
    less than 4 descriptors resulting in TX timeout. Timeout
    can also occur due to lost interrupt.
    
    The current implementation for detecting and recovering
    from hung queues in the PF is problematic because it actually
    actively encourages lost interrupts.  By triggering a SW
    interrupt, interrupts are forced on.  If we are already in
    napi_poll and an interrupt fires, napi_poll will not be
    rescheduled and the interrupt is effectively lost; thereby
    potentially *causing* hung queues.
    
    This patch checks whether packets are being processed between
    every watchdog cycle and determine potential hung queue and
    fires triggers SW interrupt only for that particular queue.
    
    Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
    Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>