The Xillybus Forum

by **Guest** »

Thanks Eli,

This was very helpful and helped me understand how it works. You're doing a great job here.

Peace

by **support** »

Hi,

First of all, the latency figures you gave make sense of the PCIe delay explanation. The reason the write operation is so quick is that it's a "posted request", meaning that a TLP packet is sent, and there's no need to wait for a response. So it's quick. The read operations requires waiting for the whole back and forth.

As for TLP packeting, there isn't much to do: Any read or write operation made by the processor ends up with a single TLP packet sent, handling that single operation. As far as I know, there isn't a single processor out there which has the ability to go more efficient than that. The rationale is simple: If you want efficiency, use DMA. So there's no point in implementing anything smarter on the processor.

So it doesn't looks like there is much you can do software-wise to improve this. There is no "kernel TLP packeting".

As for how DMA can help, I'll give Xillybus' interrupt handling routine as an example: Usually, interrupt service routines (ISRs) read status registers from the hardware to determine the whereabouts of the interrupt. Since I wanted to avoid reads in my driver, I turned the whole concept around: Before the hardware issues an interrupt, it fills a dedicated RAM buffer with information about why the interrupt was issued. The ISR reads from this buffer, which is in RAM, so there's no PCIe reads made. It does confirm the reception of the interrupt with a write operation to the BAR region, but as you've seen, that's almost costless.

I don't know how well this may fit your application.

DMA may sound a bit scary, but it's actually easier to implement DMA writes in FPGA logic than responding to read requests from the host. You form a TLP write packet on the FPGA, and submit it through the PCIe core's interface, supplied by the FPGA's vendor. But I suppose that you used some sample design implementing plain register I/O, so implementing anything on the PCIe interface is an obstacle.

Regards,
Eli

by **Guest** »

Thanks for the prompt response Eli,

Well, about the latency, I have measured it by a timer on the device side and a single word write takes 6 clock cycles on average while a single word read takes 160 clock cycles (=3us @50MHz). My host side profiling confirms the same numbers. So I was wondering what is causing the imbalanced latencies and I think the TLP protocol you mentioned explains it to some extent.

Anyways, I looked into assembly codes and I found out that my read is a simple "mov" instruction. I don't think kernel does anything to handle this mov instruction because it's on a PCI region. So here is my guess: When using mmap to get a pointer to the device and directly read from it, PCIe treats it as an I/O operation (or a simple PCI bus read), and doesn't do any packeting whatsoever. On the other hand, when we use a PCI device driver (as you do in Xillybus Project) kernel uses TLP packeting, etc. This is only my guess, so correct me if I'm wrong.

Now my question is why would DMA improve latency? As far as I understand DMA, it should only improve throughput. Also, what component is exactly doing the DMA operations? The device or the bridge? I have a CPU connected to an FPGA via Intel NM10 express chipset. So, does the FPGA take the DMA load (I mean it should be implemented as PCI IP core, right?) or the NM10? Sorry for too many vague questions. I don't have any experience with DMA.

Thanks Again

by **support** »

Hi,

Could you please define how awful the read latency is? Just for my own curiosity...?

What probably causes the slow response is the PCIe mechanism for reading: The host sends a TLP packet requesting the read operation, and the peripheral responds with a TLP containing the information (a completion packet). As far as I know, the processor core stalls during this time. How long time this takes depends on the packet relaying mechanism.

Direct read and writes by the host should be used for register interface, if performance is of any importance. If significant data transport is necessary, DMA is the way to go. In particular when the data needs to go from the peripheral to the host.

I wouldn't put my money on improving something on the kernel level. But actual latency measurements could be helpful to tell if there's any chance.

I hope this helped.
Eli

by **Guest** »

Hello,

Thank you for the nice and neat PCIe article.

I am working with PCIe and I'm using mmap to be able to directly write and read. Everything works fine, however my read latency is awful. I needed to understand what's happening in underlying layers and this article gave me a basic understanding to start with. I suppose the non-posted nature of read is causing the long delay. You mention that this is "rightfully avoided" in modern systems, but I don't understand what you exactly mean by that. Can anyone help me understand or find out what's going on in the underlying layers of Linux kernels, and if there is anything I can do to optimize it? Here is a simple piece of code that shows how to use mmap to communicate with a PCIe device in Linux.

Code: Select all: int fd = open("/dev/mem", O_RDWR|O_SYNC); int pci_bar0 = 0x0a0b0c0d; // This is the PCI device base address you can find with lspci or /proc/bus/pci/devices char* ptr = mmap(0, MMAP_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, pci_bar0); if(ptr == MAP_FAILED) perror("MMAP FAILED\n"); else printf("PCI BAR0 0x0000 = 0x%4x\n", ptr); // Now you can read or write like a simple pointer. Of course, you should know the address map of your device *(ptr+offset) = something-to-write; read-something = *(ptr+offset);

The Xillybus Forum

mmap and reading latency

Post a reply

Expand view Topic review: mmap and reading latency

Re: mmap and reading latency

Re: mmap and reading latency

Re: mmap and reading latency

Re: mmap and reading latency

mmap and reading latency