Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 12:46 am
by Guest
Hey, I am trying to get Xillybus working with a Jetson Nano 900-13448-0020-00 talking to an Artix7 100t FPGA. I compiled the Jetson kernel with Xillybus enabled, and I have the Xillybus demo running on the FPGA. They are hooked up via the 4-lane PCIe on the Jetson Nano, with the PCIERST pin tied to the PCIE_PERST_B_LS. When I run lspci on my Nano, I see the device, with the vendor and device IDs I would expect, and in the syslog I can actually see the driver attempting to start up the device and failing. I'm really at a loss on where to debug this, as the problem could be anywhere along the chain.

I would think that because I can properly load the PCIe device and get its ID and whatnot, it is not a hardware problem. Probing the heartbeat GPIO_LED pins, I can see it starting to tick as the reset is enabled, so I am not sure what is causing this error. I went into the xillybus_core.c stuff and poked around in the kernel and found the error messages, but really low-level PCIe stuff is beyond me. Maybe it is a device tree setup issue? I'm not really sure which interrupt to provide. It could also be that the Jetson Nano seems to have a PCIe bridge in the way, so maybe it's related to that? Here is my Xillybus-relevant syslog, if it's any help.

Jan 28 07:58:18 turro kernel: [ 1.036066] xillybus_pcie 0000:01:00.0: enabling device (0000 -> 0002)
Jan 28 07:58:18 turro kernel: [ 1.036333] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.054102] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.054128] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.071865] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.071883] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.089617] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.089635] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.107367] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.107384] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.112780] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.112797] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.112801] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.112817] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.112823] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.112838] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.112843] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.112859] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.112864] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.112879] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.112884] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.112899] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Jan 28 07:58:18 turro kernel: [ 1.112903] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
Jan 28 07:58:18 turro kernel: [ 1.112908] xillybus_pcie 0000:01:00.0: Lost sync with interrupt messages. Stopping.
Jan 28 07:58:18 turro kernel: [ 1.138534] xillybus_pcie 0000:01:00.0: No response from FPGA. Aborting.

Thanks a bunch for the help

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 6:36 am
by support
Hello,

These error messages indicate that interrupts from the FPGA arrived and were handled properly, but that DMA writes from the FPGA to the processor's RAM failed. That means that they were ignored or lost on the way, or written to the wrong address.

There is also an indication that memory writes made to the FPGA's allocated segment (BAR) were successful, but possibly arrived with wrong data.

So this is definitely a low-level hardware problem related to PCIe, most likely improper setup of some of the components on the way. Unless the FPGA implementation failed to meet timing constraints or something of that sort, in which case anything can happen.

So if the FPGA implementation went through OK (which I assume it did, as you're probably at the stage of the initial bundle), I suggest trying with any other PCIe device, and see how that goes -- a PCIe-based NIC is usually the easiest thing to get your hands on.

And since you mentioned a PCIe switch somewhere in the food chain: maybe it isn't configured properly...? Just a wild guess.

Regards,
Eli

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 1:51 pm
by Guest
hmm ok, thanks for the insight.

This is on a custom board of mine, so I can't directly swap in another PCIe device (haha, maybe not the wisest decision). With it being a custom carrier board, I was also worried that I messed up the routing somehow. I figured that because I could at least see the PCIe device on my Jetson Nano, it would be fine on that layer. Perhaps when it runs faster the signals aren't maintained? They are fairly short traces, though.

I also noticed in my syslog that whenever I get that "Malformed message + Nack sent" error, I also get one of these nearby.

Sep 3 01:12:36 localhost kernel: [ 4.044081] mc-err: (0) csw_afiw: EMEM address decode error
Sep 3 01:12:36 localhost kernel: [ 4.044084] mc-err: status = 0x20010031; addr = 0x74f62000
Sep 3 01:12:36 localhost kernel: [ 4.044086] mc-err: secure: no, access-type: write, SMMU fault: none

Based on what you said about the FPGA being unable to write to the processor's RAM, I suspect that it's trying and something is going awry. Also, when I first removed and added the module with rmmod and modprobe (so I could see the log), it worked and I got output, but now whenever I unload and reload the module, my device crashes. Weird, it kind of remembers through power cycles that I did that before.

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 2:29 pm
by support
Hello,

Given that this is a custom board, I would usually put my money on the PCIe switch. For example, it might not know which port to send the packet to when the FPGA requests a DMA write to the processor's memory.

But when you get error messages along the lines of "address decode error", it might very well be that something is going wrong with the conversion from virtual to physical memory (or vice versa). Or is there some kind of IOMMU on these processors that may possibly block access from the PCIe port? Not that it should, because the memory segment is allocated properly, but still.

In short, this isn't an easy one. As you've surely noticed already.

Regards,
Eli

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 3:19 pm
by Guest
My current running theory is that while the ISR is going through, the DMA packet is being written to the wrong address. The Xillybus driver looks where it was supposed to be written, sees a bunch of zeros (explaining why the malformed message is going off, and all the 0s in it), so it reattempts the message 10 times, each time just looking at an empty buffer. At the same time the DMA is trying to write to an address it shouldn't be. It could be because of the switch or because of what you said about the conversion. Maybe a general PCIe question, but is there a way I can check where the driver is looking for the buffer?

Haha, glad to be validated. I was worried that I was just being dumb.

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 3:51 pm
by support
Hello,

The part about the driver finding all zeros and retrying 10 times is correct. Whether the write request packet has the wrong address or is discarded on the way is still the question.

To make things more interesting, I should mention that the ISR request is a DMA write request to a certain address with a certain value (MSI). From a bus traffic perspective, there is no distinction between a DMA write and an MSI ISR request.

As for the address of the buffer: The failure is with the message buffer. You might add printks in xillybus_core.c's xilly_get_dma_buffers() function. I would do it after this part:

Code:
         ep->msgbuf_addr = s->salami;
         ep->msgbuf_dma_addr = dma_addr;
         ep->msg_buf_size = bytebufsize;


msgbuf_addr is the virtual address assigned to the message buffer, and msgbuf_dma_addr is the bus ("physical") address given to the device. There's always the chance that the FPGA didn't receive the relevant register writes correctly (which has never happened as far as I know, but still). Since it responds reliably to the request to show a life sign, though, odds are that the writes to the FPGA's register space are OK.

Regards,
Eli

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 5:32 pm
by Guest
well I got the dma address, here are some updated logs

Sep 3 09:44:02 localhost kernel: [ 3.725082] xillybus_pcie 0000:01:00.0: enabling device (0000 -> 0002)
Sep 3 09:44:02 localhost kernel: [ 3.725230] IN GET DMA BUFFERS
Sep 3 09:44:02 localhost kernel: [ 3.725232] addr: ffffffc0f00a5000
Sep 3 09:44:02 localhost kernel: [ 3.725234] dma_addr: 1700a5000
Sep 3 09:44:02 localhost kernel: [ 3.725258] xillybus_pcie 0000:01:00.0: Malformed message (skipping): opcode=0, channel=000, dir=0, bufno=000, data=0000000
Sep 3 09:44:02 localhost kernel: [ 3.725262] xillybus_pcie 0000:01:00.0: Sending a NACK on counter 0 (instead of b) on entry 0
<repeats 9 more times>
Sep 3 09:44:02 localhost kernel: [ 3.725379] xillybus_pcie 0000:01:00.0: Lost sync with interrupt messages. Stopping.
Sep 3 09:44:02 localhost kernel: [ 3.725399] mc-err: (0) csw_afiw: EMEM address decode error
Sep 3 09:44:02 localhost kernel: [ 3.725401] mc-err: status = 0x20010031; addr = 0x700a5000
Sep 3 09:44:02 localhost kernel: [ 3.725404] mc-err: secure: no, access-type: write, SMMU fault: none
Sep 3 09:44:02 localhost kernel: [ 3.829977] xillybus_pcie 0000:01:00.0: No response from FPGA. Aborting.

It looks like the error occurs on an attempt to decode addr = 0x700a5000, which is exactly the value of dma_addr (1700a5000) when you truncate it to 32 bits. I think this might be an issue specific to the Nvidia platform, because the file that this error comes from is a Tegra-specific piece of source code.
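The address match can be checked with a few lines of C. This is only a sketch of the arithmetic, using the two values from the log above; low32() and check_log_values() are names made up for illustration and are not part of the Xillybus driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep only the low 32 bits of a 64-bit bus address.
 * This is what effectively happens if a device emits write packets with
 * 32-bit addressing for a buffer whose address needs more than 32 bits. */
static uint32_t low32(uint64_t dma_addr)
{
    return (uint32_t)(dma_addr & 0xffffffffULL);
}

/* Compare the two values copied from the log above. */
static void check_log_values(void)
{
    const uint64_t logged_dma_addr = 0x1700a5000ULL; /* "dma_addr:" line */
    const uint32_t logged_mc_err   = 0x700a5000U;    /* "mc-err: ... addr =" line */

    /* The faulting address equals the DMA address with bit 32 dropped. */
    assert(low32(logged_dma_addr) == logged_mc_err);
}
```

Calling check_log_values() passes silently, which is consistent with the idea that the write reached the memory controller with the upper address bits missing.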

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 6:35 pm
by support
Hello,

This is significant progress. I think this is proof enough that the PCIe write packet made its way to the processor. This match of addresses can't be a coincidence.

It very much appears like the physical address space of this processor is 64 bits, as dma_addr takes more than 32 bits to represent. What seems to have happened, is that the FPGA was instructed to produce packets with 32 bit addressing despite this. Never mind the PCIe spec issues for this, but I suggest checking if this is indeed the case, and possibly force 64-bit addressing.

The place to change is this part of xilly_probe() in xillybus_pcie.c:

Code:
   if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) {
      endpoint->dma_using_dac = 0;
   } else if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
      endpoint->dma_using_dac = 1;
   } else {
      dev_err(endpoint->dev, "Failed to set DMA mask. Aborting.\n");
      return -ENODEV;
   }


To make a long story short, set dma_using_dac to 1 in any case, and see if that makes any difference.

Fingers crossed.

Eli

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 7:41 pm
by Guest
Whoa, like magic, all the dev devices popped up and are working! Yeah, I just set that 0 to a 1, and the demo project started moving data back and forth.

Seems like it was a 64-bit vs. 32-bit thing. Idk how sustainable the solution is, as you said, but yeah, for now I am happy enough that it's working. Thanks for all the help, Eli.

Charles

Re: Xillybus problems on Jetson Nano with Artix 7

Posted: Thu Sep 03, 2020 7:54 pm
by support
Hello,

I'm glad it did the trick. Odds are that this was actually the problem.

The thing is that pci_set_dma_mask() should fail when called with DMA_BIT_MASK(32) as its argument if the PCIe bus is 64 bits. Apparently your platform is 64 bits, and yet this call didn't fail. It seems like poor porting of the PCIe kernel framework to your platform. So setting dma_using_dac to 1 skips the guesswork and selects the correct option.
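To illustrate why the 32-bit option is the wrong one for the address seen in this thread, here is a small user-space sketch. It is not driver code: the DMA_BIT_MASK() macro is re-declared locally on the assumption that it mirrors the kernel's definition, and fits_mask() and pick_dma_using_dac() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the macro's logic (assumed to mirror the kernel's version). */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper: nonzero if every set bit of addr lies inside mask. */
static int fits_mask(uint64_t addr, uint64_t mask)
{
    return (addr & ~mask) == 0;
}

/* Hypothetical helper: the dma_using_dac value this address would call for.
 * 0 means 32-bit addressing is enough, 1 means 64-bit addressing is needed. */
static int pick_dma_using_dac(uint64_t addr)
{
    return fits_mask(addr, DMA_BIT_MASK(32)) ? 0 : 1;
}

/* Check against the bus address printed in the thread's log. */
static void check_logged_address(void)
{
    const uint64_t logged_dma_addr = 0x1700a5000ULL;

    assert(!fits_mask(logged_dma_addr, DMA_BIT_MASK(32)));
    assert(pick_dma_using_dac(logged_dma_addr) == 1);
}
```

For the logged address 1700a5000, pick_dma_using_dac() yields 1, which agrees with the manual change that got the demo working. An address such as 0x700a5000, which fits in 32 bits, would yield 0.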

Regards,
Eli