View Issue Details

ID: 0000003
Project: Xen made easy
Category: kernel-xen
View Status: public
Last Update: 2013-05-04 14:05
Reporter: Steven Haigh
Assigned To: Steven Haigh
Priority: high
Severity: major
Reproducibility: random
Status: closed
Resolution: fixed
Summary: 0000003: DomU network goes offline with Frag is bigger than frame; fatal error; disabling device
Description: DomU randomly loses networking, with the following printed to the Dom0 /var/log/messages:

Apr 25 12:09:25 hosting kernel: vif vif-4-0 vif.crc: Frag is bigger than frame.
Apr 25 12:09:25 hosting kernel: vif vif-4-0 vif.crc: fatal error; disabling device
Apr 25 12:09:25 hosting kernel: br0: port 5(vif.crc) entered disabled state

There is then no way to re-enable networking for the affected DomU. It has to be shut down and created again.

This is unrelated to the previous MAX_SKB_FRAGS problem that has been patched.
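
On a Dom0 with several guests it can help to list which vif devices have hit this error. The following is a minimal sketch, assuming the log lines have exactly the format quoted in the description; the function name disabled_vifs is made up for illustration and is not part of any Xen tooling.

```shell
#!/bin/bash
# Hypothetical helper: read syslog text on stdin and print, once each,
# the two names that follow "vif" on every
# "fatal error; disabling device" line.
# Assumes the line format quoted in the description above.
disabled_vifs() {
    sed -n 's/.*kernel: vif \([^ ]*\) \([^ :]*\): fatal error; disabling device.*/\1 \2/p' |
        sort -u
}
```

Run it on the Dom0 as `disabled_vifs < /var/log/messages`. Fed the three log lines from the description, it prints `vif-4-0 vif.crc`.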
Tags: No tags attached.
External Reference

Activities

Steven Haigh

2013-04-26 18:54

administrator   ~0000002

Some patches circulating on the xen-devel list apparently address this. For the moment, disabling GSO (generic segmentation offload) is supposed to work around the problem.

This is done by:
$ ethtool -K eth0 gso off
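
To confirm the setting took effect, the current GSO state can be read back with `ethtool -k` (lowercase k, which lists offload features). This is a small sketch rather than a definitive check: the function name gso_state is an assumption for illustration, and it relies on ethtool printing a `generic-segmentation-offload:` line.

```shell
#!/bin/bash
# Hypothetical helper: read `ethtool -k <iface>` output on stdin and
# print the first word of the GSO state, normally "on" or "off".
gso_state() {
    awk -F': *' '/^[[:space:]]*generic-segmentation-offload:/ {
        split($2, w, " ")   # drop trailing notes such as "[fixed]"
        print w[1]
        exit
    }'
}
```

Usage: `ethtool -k eth0 | gso_state` should print `off` once the workaround is applied.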

I'll keep following this up:
1) See whether the patches will be merged into the kernel.org kernel, and in which versions.
2) Try to find out whether they will be merged into a RHEL kernel (if they haven't been already).

This may take some time.

Abbas

2013-04-26 19:11

reporter   ~0000003

I guess it is better to disable offloading altogether.
I have reproduced this on XenServer too.

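#!/bin/bash
# Turn off every listed offload feature on each Ethernet interface.
if_modes="rx tx sg tso ufo gso"
# Interface names come from the ifconfig lines that mention "Ethernet".
for iface in $(ifconfig | awk '$0 ~ /Ethernet/ { print $1 }'); do
    for if_mode in ${if_modes}; do
        # Errors from features a device does not support are discarded.
        ethtool -K "$iface" "$if_mode" off 2>/dev/null
    done
done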

Steven Haigh

2013-04-26 19:20

administrator   ~0000004

I believe the only one that triggers this bug is GSO. From what I understand, the rest of these work as expected.

There are currently 4 patches for this, 2 for the frontend (DomU) and 2 for the backend (Dom0), that may resolve it completely, along with the previous MAX_SKB_FRAGS issue.

netfront:
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=697089dc13c52d668322ac6cb8548520de27ed0e
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=9ecd1a75d977e2e8c48139c7d3efed183f898d94

netback:
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=2810e5b9a7731ca5fce22bfbe12c96e16ac44b6f
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=03393fd5cc2b6cdeec32b704ecba64dbb0feae3c

There are also 3 other required patches that I'm reviewing:
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e2d617c0ccf658a55552955f07018ecfa0135210
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=7158ff6d0c6aa3724fb51c6c11143d31e166eb1f
http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=27f852282ab9a028f57da96d05c26f38c424a315

This is not a simple fix, and some of these patches will go straight into the mainline / stable kernel.org releases, so I don't want to duplicate work here either.

Steven Haigh

2013-04-27 03:36

administrator   ~0000013

I've put up a test kernel RPM at:
http://au1.mirror.crc.id.au/repo/el6/testing

Files:
http://au1.mirror.crc.id.au/repo/el6/testing/kernel-xen-3.8.8-2.el6xen.x86_64.rpm
http://au1.mirror.crc.id.au/repo/el6/testing/kernel-xen-firmware-3.8.8-2.el6xen.x86_64.rpm

Testing welcome. This replaces the patch I previously applied to kernels to increase the value of MAX_SKB_FRAGS, so anything that used to trigger a network disconnect before that patch should be tested with this kernel.

Steven Haigh

2013-04-27 03:38

administrator   ~0000014

I have lodged a xen-netfront bug with Red Hat at:
https://bugzilla.redhat.com/show_bug.cgi?id=957231

It will be up to them to resolve the xen-netfront issues upstream. The patches included in this test kernel *should* stop the problem from occurring even without the xen-netfront patches - but it will work better with them.

Abbas

2013-04-27 06:08

reporter   ~0000017

Seems like they have made the bug private.

Steven Haigh

2013-04-27 08:03

administrator   ~0000018

Yeah... I noticed that. Probably because it's a kernel bug. I can't override it and make it public either...

Steven Haigh

2013-04-28 02:52

administrator   ~0000030

Patch is also present in kernel-xen-3.8.10-1 currently in the testing repo:
http://au1.mirror.crc.id.au/repo/el6/testing

If no bug reports are filed against either 3.8.8-2 or 3.8.10-1 in the next few days, I'll push 3.8.10-1 to the main repo.

Steven Haigh

2013-05-04 14:05

administrator   ~0000043

Fixed in kernel-xen-3.8.8-2 onwards.
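
To check whether a given host is already on a fixed build, the kernel string can be compared against 3.8.8-2. The sketch below assumes the kernel string starts with a version-release prefix, as in the RPM names above (for example 3.8.8-2.el6xen.x86_64), and that GNU `sort -V` is available. The function names are illustrative only.

```shell
#!/bin/bash
FIXED_IN="3.8.8-2"   # first fixed kernel-xen build, per the note above

# Reduce a string such as "3.8.8-2.el6xen.x86_64" to "3.8.8-2".
# Fails if the string does not start with a version-release prefix.
version_release() {
    [[ $1 =~ ^[0-9]+(\.[0-9]+)*-[0-9]+ ]] && printf '%s\n' "${BASH_REMATCH[0]}"
}

# Succeed if the given kernel string is at or above FIXED_IN.
has_fix() {
    local cur lowest
    cur=$(version_release "$1") || return 2
    # Version-sort the two strings; FIXED_IN must be the lower (or equal).
    lowest=$(printf '%s\n' "$FIXED_IN" "$cur" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED_IN" ]
}
```

Usage, on the host running kernel-xen: `has_fix "$(uname -r)" && echo "fix present"`.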

Issue History

Date Modified Username Field Change
2013-04-26 18:41 Steven Haigh New Issue
2013-04-26 18:41 Steven Haigh Status new => assigned
2013-04-26 18:41 Steven Haigh Assigned To => Steven Haigh
2013-04-26 18:43 Steven Haigh Description Updated
2013-04-26 18:54 Steven Haigh Note Added: 0000002
2013-04-26 19:11 Abbas Note Added: 0000003
2013-04-26 19:20 Steven Haigh Note Added: 0000004
2013-04-27 03:36 Steven Haigh Note Added: 0000013
2013-04-27 03:36 Steven Haigh Status assigned => feedback
2013-04-27 03:38 Steven Haigh Note Added: 0000014
2013-04-27 03:38 Steven Haigh Status feedback => assigned
2013-04-27 03:39 Steven Haigh Status assigned => feedback
2013-04-27 06:08 Abbas Note Added: 0000017
2013-04-27 08:03 Steven Haigh Note Added: 0000018
2013-04-27 08:03 Steven Haigh Status feedback => assigned
2013-04-27 12:03 Steven Haigh Status assigned => feedback
2013-04-28 02:52 Steven Haigh Note Added: 0000030
2013-04-28 02:52 Steven Haigh Status feedback => assigned
2013-04-28 02:52 Steven Haigh Status assigned => feedback
2013-05-04 14:05 Steven Haigh Note Added: 0000043
2013-05-04 14:05 Steven Haigh Status feedback => assigned
2013-05-04 14:05 Steven Haigh Status assigned => closed
2013-05-04 14:05 Steven Haigh Resolution open => fixed