Xillybus/PCIE crashing after a while of running

Post by Guest »

Hello,

I am using Xillybus to transfer a stream of data to my host Linux machine. I have 3 device files: one high-throughput data stream from FPGA -> host, a mem_read device to transfer information to the host, and a mem_addr_write device from host -> FPGA. I am running into an issue where the stream runs for a while, a few minutes or so, and then suddenly stops. Digging through the kern.log file, it seems that a bunch of bus errors occur that don't stop the stream at first, until eventually it fails and is unable to recover. Notably, xillybus_pcie is throwing "xillybus_pcie 0000:01:00.0: Hardware failed to respond to close command, therefore left in messy state.", which the site says is because the FPGA is trying to reconfigure itself. The whole thing is running on custom hardware. I am not sure if this is something to be resolved in the Xillybus driver or something in the FPGA.

Thanks for any help. I have attached the relevant logs below. As you can tell from the timestamps, it runs OK for a while, though with errors, until eventually it stops completely. I can also see the FPGA's xillybus_heartbeat LED stop when this happens.

Oct 7 21:45:09 localhost kernel: [ 4.093283] addr: ffffffc0e7a23000
Oct 7 21:45:09 localhost kernel: [ 4.093286] dma_addr: 167a23000
Oct 7 21:45:09 localhost kernel: [ 4.144163] xillybus_pcie 0000:01:00.0: Created 3 device files.
Oct 7 21:45:09 localhost kernel: [ 4.175544] zram: Added device: zram0
Oct 7 21:45:09 localhost kernel: [ 4.178389] zram: Added device: zram1
Oct 7 21:45:09 localhost kernel: [ 4.183644] zram: Added device: zram2
Oct 7 21:45:09 localhost kernel: [ 4.184134] zram: Added device: zram3
Oct 7 21:45:09 localhost kernel: [ 4.220648] zram0: detected capacity change from 0 to 519598080
Oct 7 21:45:09 localhost kernel: [ 4.280170] Adding 507416k swap on /dev/zram0. Priority:5 extents:1 across:507416k SS
Oct 7 21:45:09 localhost kernel: [ 4.297946] zram1: detected capacity change from 0 to 519598080
Oct 7 21:45:09 localhost kernel: [ 4.309892] Adding 507416k swap on /dev/zram1. Priority:5 extents:1 across:507416k SS
Oct 7 21:45:09 localhost kernel: [ 4.313403] zram2: detected capacity change from 0 to 519598080
Oct 7 21:45:09 localhost kernel: [ 4.371894] Adding 507416k swap on /dev/zram2. Priority:5 extents:1 across:507416k SS
Oct 7 21:45:09 localhost kernel: [ 4.375704] zram3: detected capacity change from 0 to 519598080
Oct 7 21:45:09 localhost kernel: [ 4.393180] Adding 507416k swap on /dev/zram3. Priority:5 extents:1 across:507416k SS
Oct 7 21:45:09 localhost kernel: [ 4.469904] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Oct 7 21:45:09 localhost kernel: [ 4.471135] eth0: 0xffffff800a39c000, 00:04:4b:ea:43:a5, IRQ 399
Oct 7 21:45:09 localhost kernel: [ 4.602426] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Oct 7 21:45:10 localhost kernel: [ 5.243329] tegra-xusb 70090000.xusb: Upgrade port 0 to USB3.0
Oct 7 21:45:10 localhost kernel: [ 5.243335] tegra-xusb 70090000.xusb: Upgrade port 1 to USB3.0
Oct 7 21:45:10 localhost kernel: [ 5.342857] usb usb2: usb_suspend_both: status 0
Oct 7 21:45:11 localhost kernel: [ 6.183299] fuse init (API version 7.26)
Oct 7 21:45:13 localhost kernel: [ 8.050648] tegra-xusb 70090000.xusb: entering ELPG
Oct 7 21:45:13 localhost kernel: [ 8.051672] tegra-pmc: PMC tegra_pmc_utmi_phy_enable_sleepwalk : port 1, speed 0
Oct 7 21:45:13 localhost kernel: [ 8.051858] tegra-pmc: PMC tegra_pmc_utmi_phy_enable_sleepwalk : port 2, speed 0
Oct 7 21:45:13 localhost kernel: [ 8.053399] tegra-xusb 70090000.xusb: entering ELPG done
Oct 7 21:45:56 localhost kernel: [ 51.699921] r8168: eth0: link up
Oct 7 21:45:56 localhost kernel: [ 51.700458] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Oct 7 21:47:15 localhost kernel: [ 130.429507] pcieport 0000:00:02.0: AER: Corrected error received: id=0018
Oct 7 21:47:15 localhost kernel: [ 130.429518] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)
Oct 7 21:47:15 localhost kernel: [ 130.439805] pcieport 0000:00:02.0: device [10de:0faf] error status/mask=00000001/00002000
Oct 7 21:47:15 localhost kernel: [ 130.448492] pcieport 0000:00:02.0: [ 0] Receiver Error (First)
Oct 7 21:47:15 localhost kernel: [ 130.962768] FAN rising trip_level:1 cur_temp:51750 trip_temps[2]:61000
Oct 7 21:47:46 localhost kernel: [ 161.202672] FAN rising trip_level:2 cur_temp:61500 trip_temps[3]:71000
Oct 7 21:48:32 localhost kernel: [ 207.122652] FAN rising trip_level:3 cur_temp:72000 trip_temps[4]:82000
Oct 7 21:48:49 localhost kernel: [ 225.011114] Setting nominal refresh + timings.
Oct 7 21:49:30 localhost kernel: [ 265.362714] FAN rising trip_level:4 cur_temp:82000 trip_temps[5]:140000
Oct 7 21:50:52 localhost kernel: [ 347.876522] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.876606] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver ID)
Oct 7 21:50:52 localhost kernel: [ 347.888262] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00000081/00002000
Oct 7 21:50:52 localhost kernel: [ 347.896735] pcieport 0000:00:01.0: [ 0] Receiver Error (First)
Oct 7 21:50:52 localhost kernel: [ 347.903669] pcieport 0000:00:01.0: [ 7] Bad DLLP
Oct 7 21:50:52 localhost kernel: [ 347.910454] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.910478] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.910482] pcieport 0000:00:01.0: AER: Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.910501] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.910504] pcieport 0000:00:01.0: AER: Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.910522] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.910526] pcieport 0000:00:01.0: AER: Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.910544] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.910547] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.910565] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.910569] pcieport 0000:00:01.0: AER: Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.910586] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.910590] pcieport 0000:00:01.0: AER: Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.910607] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.910610] pcieport 0000:00:01.0: AER: Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.918020] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.918026] pcieport 0000:00:01.0: AER: Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.918050] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.918053] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.918071] pcieport 0000:00:01.0: can't find device of ID0010
Oct 7 21:50:52 localhost kernel: [ 347.918075] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.918083] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:52 localhost kernel: [ 347.933288] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:52 localhost kernel: [ 347.942187] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:52 localhost kernel: [ 347.949515] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:52 localhost kernel: [ 347.949520] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:52 localhost kernel: [ 347.949525] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.949533] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:52 localhost kernel: [ 347.963549] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:52 localhost kernel: [ 347.972555] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:52 localhost kernel: [ 347.980049] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:52 localhost kernel: [ 347.980054] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:52 localhost kernel: [ 347.980059] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 347.980067] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:52 localhost kernel: [ 347.994264] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:52 localhost kernel: [ 348.003497] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:52 localhost kernel: [ 348.011110] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:52 localhost kernel: [ 348.011114] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:52 localhost kernel: [ 348.011120] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:52 localhost kernel: [ 348.011128] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:52 localhost kernel: [ 348.023063] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:52 localhost kernel: [ 348.026681] tegra_soctherm 700e2000.soctherm: soctherm: trip temperature 2147483647 forced to 127000
Oct 7 21:50:53 localhost kernel: [ 348.031661] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:53 localhost kernel: [ 348.038728] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:53 localhost kernel: [ 348.038733] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:53 localhost kernel: [ 348.038739] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:53 localhost kernel: [ 348.038747] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:53 localhost kernel: [ 348.050855] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:53 localhost kernel: [ 348.059396] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:53 localhost kernel: [ 348.066306] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:53 localhost kernel: [ 348.066310] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:53 localhost kernel: [ 348.066315] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:53 localhost kernel: [ 348.066323] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:53 localhost kernel: [ 348.078068] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:53 localhost kernel: [ 348.086575] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:53 localhost kernel: [ 348.093383] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:53 localhost kernel: [ 348.093387] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:53 localhost kernel: [ 348.093392] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:53 localhost kernel: [ 348.093400] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:53 localhost kernel: [ 348.105157] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:53 localhost kernel: [ 348.113520] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:53 localhost kernel: [ 348.120326] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:53 localhost kernel: [ 348.120330] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:53 localhost kernel: [ 348.120334] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:53 localhost kernel: [ 348.120342] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:53 localhost kernel: [ 348.132108] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:53 localhost kernel: [ 348.140595] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:53 localhost kernel: [ 348.147599] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:53 localhost kernel: [ 348.147604] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:53 localhost kernel: [ 348.147608] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:53 localhost kernel: [ 348.147616] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:53 localhost kernel: [ 348.159428] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:53 localhost kernel: [ 348.167793] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:53 localhost kernel: [ 348.174700] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:53 localhost kernel: [ 348.174704] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:50:53 localhost kernel: [ 348.174708] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:50:53 localhost kernel: [ 348.174717] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:50:53 localhost kernel: [ 348.186459] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:50:53 localhost kernel: [ 348.194869] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:50:53 localhost kernel: [ 348.201665] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:50:53 localhost kernel: [ 348.201668] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:51:06 localhost kernel: [ 361.935481] xillybus_pcie 0000:01:00.0: Hardware failed to respond to close command, therefore left in messy state.
Oct 7 21:51:43 localhost kernel: [ 398.160225] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0010
Oct 7 21:51:43 localhost kernel: [ 398.185833] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver ID)
Oct 7 21:51:43 localhost kernel: [ 398.196955] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00000081/00002000
Oct 7 21:51:43 localhost kernel: [ 398.198950] tegra_soctherm 700e2000.soctherm: soctherm: trip temperature 2147483647 forced to 127000
Oct 7 21:51:43 localhost kernel: [ 398.205482] pcieport 0000:00:01.0: [ 0] Receiver Error
Oct 7 21:51:43 localhost kernel: [ 398.211963] pcieport 0000:00:01.0: [ 7] Bad DLLP
Oct 7 21:51:43 localhost kernel: [ 398.219171] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:51:43 localhost kernel: [ 398.219180] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:51:43 localhost kernel: [ 398.233804] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:51:43 localhost kernel: [ 398.242214] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:51:43 localhost kernel: [ 398.249067] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:51:43 localhost kernel: [ 398.249074] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:52:37 localhost kernel: [ 452.428688] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0010
Oct 7 21:52:37 localhost kernel: [ 452.454094] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver ID)
Oct 7 21:52:37 localhost kernel: [ 452.464774] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00000001/00002000
Oct 7 21:52:37 localhost kernel: [ 452.473301] pcieport 0000:00:01.0: [ 0] Receiver Error
Oct 7 21:52:37 localhost kernel: [ 452.474754] tegra_soctherm 700e2000.soctherm: soctherm: trip temperature 2147483647 forced to 127000
Oct 7 21:52:37 localhost kernel: [ 452.479569] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:52:37 localhost kernel: [ 452.479731] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:52:37 localhost kernel: [ 452.492481] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:52:37 localhost kernel: [ 452.500943] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:52:37 localhost kernel: [ 452.507794] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:52:37 localhost kernel: [ 452.507798] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:53:04 localhost kernel: [ 479.283375] FAN cooling trip_level:3 cur_temp:71750 trip_temps[4]:82000
Oct 7 21:53:43 localhost kernel: [ 518.943905] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0010
Oct 7 21:53:43 localhost kernel: [ 518.969285] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver ID)
Oct 7 21:53:43 localhost kernel: [ 518.980028] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00000001/00002000
Oct 7 21:53:43 localhost kernel: [ 518.988618] pcieport 0000:00:01.0: [ 0] Receiver Error
Oct 7 21:53:43 localhost kernel: [ 518.994889] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:53:43 localhost kernel: [ 518.995983] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:53:43 localhost kernel: [ 519.007925] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:53:43 localhost kernel: [ 519.016997] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:53:43 localhost kernel: [ 519.024877] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:53:43 localhost kernel: [ 519.024885] pcieport 0000:00:01.0: AER: Device recovery failed
Oct 7 21:55:23 localhost kernel: [ 618.325007] pcieport 0000:00:01.0: AER: Multiple Corrected error received: id=0010
Oct 7 21:55:23 localhost kernel: [ 618.350525] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver ID)
Oct 7 21:55:23 localhost kernel: [ 618.361269] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00000001/00002000
Oct 7 21:55:23 localhost kernel: [ 618.369853] pcieport 0000:00:01.0: [ 0] Receiver Error
Oct 7 21:55:23 localhost kernel: [ 618.376113] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Oct 7 21:55:23 localhost kernel: [ 618.377174] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
Oct 7 21:55:23 localhost kernel: [ 618.391859] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
Oct 7 21:55:23 localhost kernel: [ 618.400272] pcieport 0000:00:01.0: [14] Completion Timeout (First)
Oct 7 21:55:23 localhost kernel: [ 618.407129] pcieport 0000:00:01.0: broadcast error_detected message
Oct 7 21:55:23 localhost kernel: [ 618.407139] pcieport 0000:00:01.0: AER: Device recovery failed

Re: Xillybus/PCIE crashing after a while of running

Post by support »

Hello,

The "Hardware failed to respond to close command" error means that the driver has lost contact with the FPGA. By far the most common reason for this is indeed that the FPGA has been reconfigured, but in your case it's clearly a poor PCIe link.

All those AER messages from the pcieport driver are a result of some low-level error detected on the physical link between the FPGA and the host. This could be plain bit errors or loss of lock. The PCIe protocol managed to work around some of these, but in the end it failed.
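As a rough aid for reading a log like the one above, here is a minimal Python sketch that tallies the AER reports by severity and by the status bits that were set. The function names (`decode_status`, `summarize_aer`) are made up for this illustration, and the bit names are only those that the log itself spells out (bits 0, 7 and 14). Any other bit is reported by number rather than guessed.

```python
import re
from collections import Counter

# Bit names copied from the log excerpt in this thread; other bits are
# deliberately left unnamed and show up as "bit N".
CORRECTED_BITS = {0: "Receiver Error", 7: "Bad DLLP"}
UNCORRECTED_BITS = {14: "Completion Timeout"}

SEVERITY_RE = re.compile(r"PCIe Bus Error: severity=(.+?), type=([^,]+),")
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_status(status_hex, names):
    """Return a name (or 'bit N') for every bit set in the status word."""
    status = int(status_hex, 16)
    return [names.get(bit, f"bit {bit}") for bit in range(32) if (status >> bit) & 1]


def summarize_aer(lines):
    """Count (severity, error name) pairs over an iterable of log lines.

    Each 'PCIe Bus Error' line is paired with the next 'error status/mask'
    line; unrelated lines in between are ignored.
    """
    counts = Counter()
    severity = None
    for line in lines:
        match = SEVERITY_RE.search(line)
        if match:
            severity = match.group(1)
            continue
        match = STATUS_RE.search(line)
        if match and severity is not None:
            names = CORRECTED_BITS if severity == "Corrected" else UNCORRECTED_BITS
            for name in decode_status(match.group(1), names):
                counts[(severity, name)] += 1
            severity = None
    return counts
```

It can be fed an open file, e.g. `summarize_aer(open("/var/log/kern.log"))`, where the path is an assumption about where kern.log lives on the system. A steady count of corrected Receiver Error entries that is followed by uncorrected Completion Timeout entries matches the picture described here: the link works around low-level errors for a while and then fails.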

These error messages should never appear on properly working hardware. The place to look is the physical link. I also suggest verifying that the MGT's reference clock is delivered properly to the FPGA, and that its jitter specification meets requirements. I would also verify that the voltage that feeds the reference clock generator is within the required range, and is as clean (noise free) as required by its datasheet.
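A quick software-side check that can go along with the hardware inspection is to compare the negotiated link speed and width with the maximum the device advertises. The sketch below is only an illustration with made-up function names. It assumes a kernel that exposes the `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` attributes in the device's sysfs directory. Older kernels may not have them, in which case the returned dictionary is simply empty.

```python
from pathlib import Path

# sysfs attribute names; their presence depends on the kernel version.
LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def read_link_attrs(device_dir):
    """Return the link attributes found under a PCI device's sysfs directory."""
    base = Path(device_dir)
    found = {}
    for name in LINK_ATTRS:
        path = base / name
        if path.is_file():
            found[name] = path.read_text().strip()
    return found


def link_is_degraded(attrs):
    """True if the negotiated speed or width differs from the advertised maximum.

    Returns False when the attributes are missing, so an empty result from
    read_link_attrs() must not be read as a healthy link.
    """
    return (
        attrs.get("current_link_speed") != attrs.get("max_link_speed")
        or attrs.get("current_link_width") != attrs.get("max_link_width")
    )
```

For the device in the log this would be called as `read_link_attrs("/sys/bus/pci/devices/0000:01:00.0")`. A link that trained at a lower speed or a narrower width than the maximum is another hint of a marginal physical connection, but a link at full speed and width does not rule one out.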

Regards,
Eli