Xillybus RTT performance

Questions and discussions about the Xillybus IP core and drivers

Xillybus RTT performance

Postby h314 »

Hi all,

In my project, I am connecting a Xilinx Virtex-7 VC707 evaluation board to an Nvidia Jetson TX1 through PCIe.
I am using your Xillybus PCIe IP core for the communication.
I configured the FPGA with the demo design in xillybus-eval-virtex7-2.0c.zip,
and ran the following C code snippet to measure the round-trip time of the stream ports in the loopback configuration.
----
unsigned int DataWrite = 123;
unsigned int DataRead;

/* Pass the addresses of the buffers, not their values */
write(fdw32, (void *) &DataWrite, sizeof(unsigned int));
read (fdr32, (void *) &DataRead, sizeof(unsigned int));
----
My measurement shows about 11 ms for the write and read pair.

I am wondering whether there is a technique to reduce this time to a couple of microseconds.
Note that I am using this write and read to send a command and receive an acknowledgment, so at this stage my goal is not to send a huge amount of data to the FPGA.

Thanks
h314
 
Posts: 12

Re: Xillybus RTT performance

Postby support »

Hello,

Generally speaking, the latency that Xillybus adds is negligible compared with the operating system's. The 11 ms you're seeing is probably a result of the autoflush mechanism (presumably 10 ms plus a small delta).

I suggest taking a look at this page:

http://xillybus.com/doc/xillybus-latency

If you happen to be using a Linux kernel between 3.18 and 4.6, I also suggest trying the Xillybus driver available for download on the website instead. There was a slight bug in the in-kernel Xillybus driver in that version range, which can have some effect on synchronization. The "Getting Started" guide for Linux walks through the steps for installing the driver.

Regards,
Eli
support
 
Posts: 802

Re: Xillybus RTT performance

Postby h314 »

Thank you Eli for the prompt reply.

Yes, it seems the kernel had that bug.
I fixed it, and by calling write() with zero-size data after the normal write(), the round-trip time (RTT) is now between 0.7 ms and 1 ms.

I am wondering if it is possible to further reduce this RTT.
h314
 
Posts: 12

Re: Xillybus RTT performance

Postby support »

Hi,

That still sounds like far too much latency. How are you measuring the time? For example, if you measure the entire execution time of the program (with the Linux "time" command), you're including a lot of program setup operations that are irrelevant.

Regards,
Eli
support
 
Posts: 802

Re: Xillybus RTT performance

Postby h314 »

Hi Eli,

Thanks for your reply.
My time-measuring code is shown below; I am not using the Linux "time" command.
I measured two completely different RTTs on two different systems, shown below. Interestingly, the second system is up to 20 times faster.
I am wondering how I can find the bottleneck in the first system.

This is my code for measuring the RTT of the loopback in the demo design (i.e., xillybus-eval-virtex7-2.0c.zip):
-------------------------------------------Snippet code starts here---------------------------------------------
unsigned int tmpU32DataWrite = 124;
unsigned int tmpU32DataRead;
double hardware_start, hardware_end, hardware_execution_time;
int rc;

for (int i = 0; i < 20; i++) {
    hardware_start = getTimestamp();  /* getTimestamp() returns microseconds */

    rc = write(fdw32, (void *) &tmpU32DataWrite, sizeof(tmpU32DataWrite));
    rc = write(fdw32, (void *) &tmpU32DataWrite, 0);  /* zero-length write forces a flush */
    rc = read (fdr32, (void *) &tmpU32DataRead, sizeof(tmpU32DataRead));

    hardware_end = getTimestamp();
    hardware_execution_time = (hardware_end - hardware_start) / 1000;  /* us -> ms */
    printf("stream 32 loopback RTT %.6lf ms \n", hardware_execution_time);
}
-------------------------------------------Snippet code ends here---------------------------------------------

System 1: Nvidia Jetson TX1 with Linux kernel 4.4.38 and the Xillybus driver module downloaded from your website and compiled manually (not the in-kernel driver).
The Xilinx Virtex-7 VC707 evaluation board is connected to the PCIe slot on the Jetson board.

code output:
stream 32 loopback RTT 1.221000 ms
stream 32 loopback RTT 0.684000 ms
stream 32 loopback RTT 0.638000 ms
stream 32 loopback RTT 0.551000 ms
stream 32 loopback RTT 0.831000 ms
stream 32 loopback RTT 0.628000 ms
stream 32 loopback RTT 0.598000 ms
stream 32 loopback RTT 0.614000 ms
stream 32 loopback RTT 0.521000 ms
stream 32 loopback RTT 0.523000 ms
stream 32 loopback RTT 0.563000 ms
stream 32 loopback RTT 0.538000 ms
stream 32 loopback RTT 0.537000 ms
stream 32 loopback RTT 0.549000 ms
stream 32 loopback RTT 0.475000 ms
stream 32 loopback RTT 1.045000 ms
stream 32 loopback RTT 0.383000 ms
stream 32 loopback RTT 0.156000 ms
stream 32 loopback RTT 0.172000 ms
stream 32 loopback RTT 0.291000 ms

System 2: Xilinx ZCU102 board with Linux kernel 4.9 and the Xillybus driver module downloaded from your website and compiled manually (not the in-kernel driver).
The Xilinx Virtex-7 VC707 evaluation board is connected to the PCIe slot on the ZCU102 board.

code output:

stream 32 loopback RTT 0.061000 ms
stream 32 loopback RTT 0.048000 ms
stream 32 loopback RTT 0.046000 ms
stream 32 loopback RTT 0.045000 ms
stream 32 loopback RTT 0.047000 ms
stream 32 loopback RTT 0.045000 ms
stream 32 loopback RTT 0.045000 ms
stream 32 loopback RTT 0.045000 ms
stream 32 loopback RTT 0.045000 ms
stream 32 loopback RTT 0.044000 ms
stream 32 loopback RTT 0.045000 ms
stream 32 loopback RTT 0.044000 ms
stream 32 loopback RTT 0.045000 ms
stream 32 loopback RTT 0.045000 ms
stream 32 loopback RTT 0.044000 ms
stream 32 loopback RTT 0.046000 ms
stream 32 loopback RTT 0.044000 ms
stream 32 loopback RTT 0.044000 ms
stream 32 loopback RTT 0.044000 ms
stream 32 loopback RTT 0.046000 ms



Thanks
h314
 
Posts: 12

Re: Xillybus RTT performance

Postby support »

Hello,

System 2 shows the kind of result I keep hearing about from people who optimize for low latency. Why System 1 wobbles like that is an interesting question, and it definitely looks like an operating-system issue: same hardware, same driver, completely different result.

The thing to keep in mind is that when the read() call is reached, the process goes to sleep, waiting for data to arrive. When the data arrives, the process becomes runnable again, but the operating system doesn't have to give it the processor immediately, or in fact at any specified time. It's up to the scheduler to decide.

So it could be that the scheduler on System 1 delays the execution for whatever reason. That makes even more sense if the processor had another process actively running while the test was made, which I suppose wasn't the case.

Anyhow, what happens if you call usleep(10) on system 1 instead of read()? That should put the process to sleep for 10 us. What do you actually measure?

Regards,
Eli
support
 
Posts: 802

Re: Xillybus RTT performance

Postby h314 »

Thanks for the reply.

The RTT on the first system when calling usleep(10) instead of read() is as follows:
stream 32 loopback RTT 0.505000 ms
stream 32 loopback RTT 0.333000 ms
stream 32 loopback RTT 0.360000 ms
stream 32 loopback RTT 0.335000 ms
stream 32 loopback RTT 0.326000 ms
stream 32 loopback RTT 0.317000 ms
stream 32 loopback RTT 0.337000 ms
stream 32 loopback RTT 0.319000 ms
stream 32 loopback RTT 0.316000 ms
stream 32 loopback RTT 0.379000 ms
stream 32 loopback RTT 0.308000 ms
stream 32 loopback RTT 0.328000 ms
stream 32 loopback RTT 0.321000 ms
stream 32 loopback RTT 0.324000 ms
stream 32 loopback RTT 0.296000 ms
stream 32 loopback RTT 0.295000 ms
stream 32 loopback RTT 0.299000 ms
stream 32 loopback RTT 0.289000 ms
stream 32 loopback RTT 1.202000 ms
stream 32 loopback RTT 0.200000 ms

The system also runs other processes related to the GPU on the Jetson TX1.
For now, I think I should accept this as the overhead my platform adds to the RTT, and focus on utilizing the maximum bandwidth to transfer data from the system's main memory to the DDR3 attached to the FPGA.
I will ask those questions in another thread on the forum.

Thanks.
h314
 
Posts: 12

Re: Xillybus RTT performance

Postby h314 »

Hello,

It seems the Nvidia Jetson TX1 is in a low-power ("interactive") mode, running at 102 MHz, and when a process starts executing it takes a little while for the CPU frequency to ramp up to maximum performance.
If we run the code about 1000 times, the RTT drops to 0.04 ms, which is reasonable.
Another way is to change the frequency through the "scaling governor" in the Linux kernel, which can also improve the RTT.

Thanks
Mohammad
h314
 
Posts: 12

