NXP / Freescale ARM: Data stopping after 20 minutes

Postby Guest »

Hi,

I have a weird problem with Xillybus running on an NXP i.MX8 board (with PCIe of course). I transfer a lot of data for 10-20 minutes, and then communication suddenly stops. Nothing else looks wrong; data just stops flowing. When I press CTRL-C, I get the following errors in the kernel log:

[ 1311.254166] xillybus_pcie 0000:01:00.0: Hardware failed to respond to close command, therefore left in messy state.
[ 1331.333182] xillybus_pcie 0000:01:00.0: Removed 5 device files.
[ 1331.438611] xillybus_pcie 0000:01:00.0: Failed to quiesce the device on exit.

If I reload Xillybus' kernel modules with rmmod and insmod, I get these messages, and everything is working fine until the next time:

[ 1340.649186] xillybus_pcie 0000:01:00.0: assign IRQ: got 0
[ 1340.650980] xillybus_pcie 0000:01:00.0: enabling bus mastering
[ 1340.680392] xillybus_pcie 0000:01:00.0: Created 5 device files.

More weird stuff: With an old BSP for the board, which is based upon kernel 4.9.51, this problem doesn't exist. With a newer BSP (kernel 4.14.78) there's this problem.

I wanted to see if it's PCIe related, so I plugged in a plain network card, and sent a lot of data through it. It ran overnight with no apparent problems, but this appeared in the kernel log:

[ 1354.057495] NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
[ 1354.403492] e1000e 0000:01:00.0 eth1: Reset adapter unexpectedly
[ 1358.338544] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

It looks like a kernel problem, but what is it?

Thanks in advance!
Guest
 

Re: NXP / Freescale ARM: Data stopping after 20 minutes

Postby support »

Hello,

There is a problem with some revisions of the Linux driver for the Designware PCIe controller, which is the PCIe module in i.MX processors (some or all, I'm not sure).

What happens is that the driver handles MSI interrupts from the PCIe device improperly, which causes some of them to be missed. This results in a deadlock: the PCIe device (Xillybus in our case) has sent an interrupt request and waits for it to be handled, but the request has been lost, and the host keeps waiting for exactly that interrupt.

This is not a Xillybus-specific issue. The kernel log from the network card test shows that the adapter had to be reset after its transmit queue timed out, most likely because of a lost interrupt.
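
One rough way to look for this kind of stall is to watch the interrupt counters in /proc/interrupts while the transfer is running. The Python sketch below is only an illustration: the helper names are made up for this example, and the string to match ("xillybus_pcie" here) is an assumption that should be checked against the actual contents of /proc/interrupts on the board. A counter that stops increasing while data is still expected is a hint, not proof, of a lost interrupt.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: (total_count, description)}."""
    totals = {}
    for line in text.splitlines()[1:]:      # skip the CPU0 CPU1 ... header row
        irq, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        n = 0
        while n < len(fields) and fields[n].isdigit():
            n += 1                          # leading numeric fields are per-CPU counts
        total = sum(int(f) for f in fields[:n])
        totals[irq.strip()] = (total, " ".join(fields[n:]))
    return totals


def watch(name="xillybus_pcie", interval=5.0):
    """Print a note whenever the counters matching `name` did not advance.

    `name` is an assumed match string; adjust it to what the system shows.
    Runs until interrupted.
    """
    prev = None
    while True:
        with open("/proc/interrupts") as f:
            totals = parse_interrupts(f.read())
        count = sum(c for c, desc in totals.values() if name in desc)
        if prev is not None and count == prev:
            print("no new interrupts for", name, "in the last", interval, "seconds")
        prev = count
        time.sleep(interval)
```

Calling watch() in a second terminal during a transfer is enough for a quick check.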

There are three commits in the mainline kernel tree addressing this issue:

https://github.com/torvalds/linux/commi ... 2b641eae7f
https://github.com/torvalds/linux/commi ... 0a314aa6ee
https://github.com/torvalds/linux/commi ... 91a463fcc5

From the description of the first commit: "The dwc driver is showing an interesting level of brokeness, as it insists on using the enable/disable set of registers to mask/unmask MSIs, meaning that an MSIs being generated while the interrupt is in that "disabled" state will simply be lost."
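
To illustrate the difference that the commit message describes, here is a toy model in Python. It is not the driver's code and does not reflect the real DesignWare register layout; the class and method names are invented for this example. It only shows why an MSI that arrives while the vector is "disabled" is gone for good, whereas one that arrives while the vector is "masked" is delivered once the mask is lifted.

```python
class ToyMsiController:
    """Toy model of a single MSI vector (hypothetical, for illustration only)."""

    def __init__(self, use_mask):
        self.use_mask = use_mask  # True: mask/unmask semantics, False: enable/disable
        self.blocked = False      # vector is currently masked or disabled
        self.pending = False      # request latched while masked
        self.delivered = 0        # interrupts that reached the handler

    def block(self):
        self.blocked = True

    def unblock(self):
        self.blocked = False
        if self.pending:          # only the mask variant can have latched a request
            self.pending = False
            self.delivered += 1

    def device_raises_msi(self):
        if not self.blocked:
            self.delivered += 1
        elif self.use_mask:
            self.pending = True   # remembered until the vector is unmasked
        # with enable/disable semantics the request is silently dropped


def run(use_mask):
    """Raise one MSI while the vector is blocked; return how many got delivered."""
    ctrl = ToyMsiController(use_mask)
    ctrl.block()
    ctrl.device_raises_msi()
    ctrl.unblock()
    return ctrl.delivered


if __name__ == "__main__":
    print("enable/disable:", run(use_mask=False))  # the interrupt is lost
    print("mask/unmask:   ", run(use_mask=True))   # the interrupt arrives late
```

In the enable/disable case the device has already sent its one and only request, so both sides end up waiting, which matches the deadlock described above.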

I suggest applying them to your kernel. That will most likely solve the problem.

Regards,
Eli