Higher Performance


Post by h314 » Fri Jul 13, 2018 4:24 pm

Hello,

In my project, I have connected the Xilinx Virtex-7 VC707 evaluation board to an Nvidia Jetson TX1, which has only one x4 PCIe slot.
I managed to get about 390Mb/s of bandwidth when reading and writing data to/from the DDR3 memory on the Virtex board.

Is it possible to achieve faster data transfer by using a wider PCIe bus?
If so, how can I get access to generate such an IP core?
According to the website, it is available upon request.

Thanks
Mohammad

Re: Higher Performance

Post by support » Fri Jul 13, 2018 5:19 pm

Hello,

First, I'd like to say that when stating the PCIe link's width in the context of bandwidth, please also state the negotiated link speed (in GT/s). In fact, it's a good idea to check the negotiated link width as well (with lspci -vv).

Also, it's not clear if you got 390 MBytes/s or 390 Mbits/s.

Either way, your result is too low, even if the negotiated link was Gen1 x 4. The expected bandwidth for the latter is 800 MBytes/s. So there's obviously a problem.
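For reference, the arithmetic behind these figures can be sketched as follows. This is an illustration only: the function names are made up for this example, and it assumes the 8b/10b line encoding that Gen1 and Gen2 links use. It gives the raw ceiling before packet overhead (1000 MBytes/s for Gen1 x4); the 800 MBytes/s quoted above is the lower, practical figure, presumably after that overhead.

```c
#include <stdio.h>

/* Raw data rate of a PCIe Gen1/Gen2 link in MBytes/s, before packet overhead.
 * mt_per_s: line rate per lane in megatransfers/s (2500 for 2.5 GT/s).
 * Assumption: 8b/10b encoding, i.e. 8 data bits per 10 transferred bits. */
long pcie_raw_mbytes_per_s(long mt_per_s, int lanes)
{
    long raw_mbits = mt_per_s * lanes * 8 / 10; /* strip the encoding overhead */
    return raw_mbits / 8;                       /* bits to bytes */
}

/* Convert Mbits/s to MBytes/s, to resolve the "390 Mb/s" ambiguity. */
double mbits_to_mbytes(double mbits_per_s)
{
    return mbits_per_s / 8.0;
}

/* Print the figures discussed in this thread. */
void print_link_figures(void)
{
    printf("Gen1 x4 raw: %ld MBytes/s\n", pcie_raw_mbytes_per_s(2500, 4));
    printf("Gen2 x4 raw: %ld MBytes/s\n", pcie_raw_mbytes_per_s(5000, 4));
    printf("390 Mbits/s = %.2f MBytes/s\n", mbits_to_mbytes(390.0));
}
```

Read as Mbits/s, the reported 390 would be under 50 MBytes/s, which is why the unit matters here.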

First, I'd like to mention this page, which is a checklist of topics to look at:

http://xillybus.com/doc/bandwidth-guidelines

Since you're using an ARM-based platform, item #4 of this page might be extra relevant (looking at the CPU usage).

Second, I don't know how you reduced the lane width to x4, but if you made changes in Xilinx's PCIe block, you might have reduced the application bus clock to 125 MHz (instead of the original 250 MHz), which might cause a bandwidth reduction.

Regards,
Eli

Re: Higher Performance

Post by h314 » Sat Jul 14, 2018 10:09 am

Hello Eli,

Thank you for your quick reply.

1- Based on LnkSta, I think the negotiated link speed is 2.5 GT/s and the width is x4. These are the two lines of the "lspci -vv" output that show this:
----LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
----LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

2- I am using a simple adapter cable to convert PCIe x8 to x4 (without any specific changes in the Xilinx PCIe block).
3- To clarify the bandwidth utilization, let me briefly explain my project. I am connecting an FPGA board (Virtex 707) to an Nvidia Jetson TX1 board to create a heterogeneous embedded system. My goal is to study the benefits of FPGA accelerators over embedded GPUs in image processing and machine learning applications.
So I am implementing the following functions for working with the FPGA:
--fpga_malloc() --> reserves memory space in the on-board DDR3 of the Virtex 707.
--fpga_memcpy() --> transfers data between the main memory on the Nvidia board and the DDR3 on the FPGA board.
--accerator_run() --> executes the task on the FPGA. The accelerator is designed with Vivado-HLS. At this stage, the accelerator communicates only with the DDR3 on the FPGA board, using a multi-port protocol, so it can have around 16 floating-point streams with the DDR3.
--fpga_free() --> frees the allocated memory.

On the FPGA, I use a couple of AXI-DMAs to send/receive streams of data to/from the memory-mapped DDR3 on the FPGA board. A Microblaze acts as a controller that receives commands over PCIe from the host program running on the Nvidia CPU. It then configures the AXI-DMAs and sends the appropriate acknowledgements back to the host program. In my previous post, the Microblaze controller and the other logic in the FPGA ran at 100 MHz. (Note that at the moment I have just one clock domain.)

This is the simple host code example
------------Code snippet starts here------------------
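#define N 1048576  /* number of elements per transfer */
/* Reserve input and output buffers in the DDR3 on the FPGA board */
uint32_t *a_fpga = fpga_alloc(N*sizeof(DATA_TYPE), IN);
uint32_t *b_fpga = fpga_alloc(N*sizeof(DATA_TYPE), OUT);

/* Time one host-to-FPGA copy; the handshaking with the Microblaze is included */
start = getTimestamp();
status = fpga_memcpy(a, a_fpga, N*sizeof(DATA_TYPE), IN);
end = getTimestamp();
execution_time = (end-start)/1000.0;  /* 1000.0 avoids truncation if the timestamps are integers */
printf("FPGA_memcpy write execution time %.6lf ms\n", execution_time);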
....
------------Code snippet Ends here------------------
These are the execution times and the resulting bandwidth (note that the bandwidth figures also include the overhead of the handshaking between the host program and the firmware on the Microblaze).

If the frequency is 100 MHz
exe time = 10.839 ms, bandwidth = (1048576 * 4 bytes) / 10.839 ms = 386.964 MByte/s (ideal: 400 MByte/s with a 32-bit bus)

If the frequency is 125 MHz
exe time = 8.787 ms, bandwidth = (1048576 * 4 bytes) / 8.787 ms = 477.33 MByte/s (ideal: 500 MByte/s with a 32-bit bus)

If the frequency is 150 MHz
exe time = 7.428 ms, bandwidth = (1048576 * 4 bytes) / 7.428 ms = 564.661 MByte/s (ideal: 600 MByte/s with a 32-bit bus)

If the frequency is 200 MHz
exe time = 6.046 ms, bandwidth = (1048576 * 4 bytes) / 6.046 ms = 693.732 MByte/s (ideal: 800 MByte/s with a 32-bit bus)
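The calculation behind these figures can be sketched in C as follows. The function names are invented for this illustration, and 1 MByte is taken as 10^6 bytes, which is what the numbers above imply.

```c
#include <stdio.h>

/* Measured bandwidth in MByte/s (1 MByte = 10^6 bytes) for a transfer
 * of the given size that took time_ms milliseconds. */
double measured_mbytes_per_s(double bytes, double time_ms)
{
    return bytes / (time_ms * 1000.0);
}

/* Ideal rate in MByte/s of a bus that moves bus_bits per clock cycle. */
double ideal_mbytes_per_s(double freq_mhz, int bus_bits)
{
    return freq_mhz * bus_bits / 8.0;
}

/* Print measured vs. ideal bandwidth for one test run of this thread. */
void print_efficiency(double freq_mhz, double time_ms)
{
    double bytes = 1048576.0 * 4.0; /* N elements of 4 bytes each */
    double got = measured_mbytes_per_s(bytes, time_ms);
    double ideal = ideal_mbytes_per_s(freq_mhz, 32);

    printf("%3.0f MHz: %.3f of %.0f MByte/s (%.1f%%)\n",
           freq_mhz, got, ideal, 100.0 * got / ideal);
}
```

With the measurements above, the ratio of measured to ideal bandwidth goes from roughly 97% at 100 MHz to roughly 87% at 200 MHz.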

My design does not work at 250 MHz.

Note that I prefer to run my design in the FPGA at 100 MHz to reduce power consumption, and most of the single- and double-precision floating-point designs generated with Vivado-HLS require a frequency of 100 MHz. So I intend to gain speed and bandwidth by using more hardware threads and memory ports (or lanes) instead of increasing the frequency of one port.

----the output of "lspci -vv" for more information -----------------

00:01.0 PCI bridge: NVIDIA Corporation Device 0fae (rev a1) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 368
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Memory behind bridge: 13000000-130fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Subsystem: NVIDIA Corporation Device 0000
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/2 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
Mapping Address Base: 00000000fee00000
Capabilities: [80] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag+ RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #0, PowerLimit 0.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Off, PwrInd On, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet+ LinkState+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=30us PortTPowerOnTime=70us
Kernel driver in use: pcieport

01:00.0 Unassigned class [ff00]: Xilinx Corporation Device ebeb (rev 07)
Subsystem: Xilinx Corporation Device ebeb
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin ? routed to IRQ 440
Region 0: Memory at 13000000 (64-bit, non-prefetchable) [size=128]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000001734b4000 Data: 0000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis-, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Device Serial Number 00-00-00-00-00-00-00-00
Kernel driver in use: xillybus_pcie
Kernel modules: xillybus_pcie
---------------------------



Regards,
Mohammad

Re: Higher Performance

Post by support » Sat Jul 14, 2018 10:26 am

Hello,

The LnkSta line in the lspci output indicates that the link runs on Gen1 x 4, so it should support 800 MB/s.
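Reading these two values off the LnkSta line can be automated. Below is a minimal sketch in C, assuming the line format that appears in the lspci output above (other lspci versions may print the fields differently, in which case the parse simply fails). `parse_lnksta` is a name made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Extract the negotiated speed (GT/s) and lane width from one line of
 * "lspci -vv" output. Returns 1 on success, 0 if the line holds no
 * LnkSta entry in the expected format. */
int parse_lnksta(const char *line, double *gt_per_s, int *width)
{
    /* "LnkSta2:" lines are skipped, since they lack the "LnkSta:" substring */
    const char *p = strstr(line, "LnkSta:");

    if (p == NULL)
        return 0;

    return sscanf(p, "LnkSta: Speed %lfGT/s, Width x%d",
                  gt_per_s, width) == 2;
}
```

Feeding it the endpoint's LnkSta line from the output above yields a speed of 2.5 and a width of 4, that is, Gen1 x 4.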

As for the bandwidth utilization, I kind of lost you. You show that the bandwidth depends (quite linearly) on the clock you feed the Microblaze processor with. So what does the Xillybus link have to do with this?

Regards,
Eli

Re: Higher Performance

Post by h314 » Sat Jul 14, 2018 10:42 am

Hi,

Thanks for your reply.

What I expect from Xillybus is a wider bus (for example, a 128-bit bus), so that I can use 4 streams of 32 bits each at the same time to transfer data to the DDR3 memory, which supports multi-port access.
Then, at a frequency of 100 MHz, I can achieve 4*400 MByte/s = 1.6 GByte/s.
But it seems that I cannot reach this limit with the present configuration.


Regards,
Mohammad

Re: Higher Performance

Post by support » Sat Jul 14, 2018 10:54 am

Hello,

Xillybus will indeed allow for a 1.6 GB/s link with a revision B IP core (and upgrading the PCIe link to Gen2 x 4). You don't need the XL revision (i.e. the one with 128 bit PCIe interface) for this purpose, and you wouldn't be able to utilize its bandwidth potential, because you only have 4 lanes.

But I can't see how that's related to the tests you're presenting. They seem to indicate the bottleneck isn't Xillybus.

Regards,
Eli

Re: Higher Performance

Post by h314 » Sat Jul 14, 2018 11:11 am

Hi,

I think there is no real bottleneck in the system. I only have the constraint that the design runs at 100 MHz.

This is my understanding: with a frequency of 100 MHz and a 32-bit bus, the maximum bandwidth that I can achieve is 400 MByte/s. Even if Xillybus provides a higher speed, the data has to wait in the FIFOs until the design consumes it.

Now, if I use more than one stream in the design (each running at 100 MHz), then I expect to be able to consume the data at the same pace as PCIe delivers it.
If I am right, this is the underlying idea that Xilinx uses in its multi-port memory interface. For example, with the four HP ports of a Zynq running at 100 MHz, one can achieve up to 1.6 GByte/s (4*400 MByte/s) if the DDR memory supports that.
Now, I am wondering if a similar approach is applicable to Xillybus. If so, I should modify my design accordingly.
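To make this reasoning concrete, here is a rough sketch in C of the model I have in mind. The function names are invented for the example, and the model is a simplification: it assumes every stream's consumer runs at its full rate and that the link is shared without loss.

```c
/* Rate at which one stream's consumer can absorb data, in MByte/s. */
long stream_mbytes_per_s(long freq_mhz, int width_bits)
{
    return freq_mhz * width_bits / 8;
}

/* Smallest number of parallel streams whose combined rate reaches the
 * link rate (integer division, rounded up). */
long streams_needed(long link_mbytes_per_s, long per_stream_mbytes_per_s)
{
    return (link_mbytes_per_s + per_stream_mbytes_per_s - 1)
           / per_stream_mbytes_per_s;
}

/* Aggregate rate: the slower of the link and the sum of the consumers. */
long aggregate_mbytes_per_s(long link, long per_stream, long streams)
{
    long consumers = per_stream * streams;

    return consumers < link ? consumers : link;
}
```

In this model, a 32-bit stream at 100 MHz absorbs 400 MByte/s, so two streams are enough for an 800 MByte/s link and four for a 1.6 GByte/s link. Streams beyond that add nothing, because the link becomes the limit.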


Regards,
Mohammad

Re: Higher Performance

Post by support » Sat Jul 14, 2018 7:14 pm

Hello,

If you generate a custom IP core (at the IP Core Factory, on the website) with additional streams, you can indeed duplicate your own processing machine, and connect each machine to a separate stream. This way you can utilize the full bandwidth that your IP core allows, divided between the streams in use.
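On the host side, each stream is accessed as a device file of its own, so each one needs its own write (or read) loop, typically in a separate thread or process. As a minimal sketch, a loop that writes a whole buffer with plain POSIX write() may look as follows. `write_all` is a name chosen for this illustration, not a function of the Xillybus driver.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly len bytes to fd, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on failure. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted before any data moved: retry */
            return -1;     /* real error, errno was set by write() */
        }
        if (n == 0)
            return -1;     /* no progress: give up rather than spin */

        p += n;            /* short write: continue with the remainder */
        len -= (size_t)n;
    }
    return 0;
}
```

The file descriptor would come from open() on the stream's device file. The file name depends on the stream names chosen when generating the custom IP core, so something like /dev/xillybus_stream_a is only a hypothetical example.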

In your case, the aggregate bandwidth is 800 MB/s, or 1.6 GB/s if you use a revision B IP core and modify the PCIe block to work at 5 GT/s (Gen 2) for each lane.

Regards,
Eli

Re: Higher Performance

Post by h314 » Wed Jul 18, 2018 11:10 am

Hello Eli,
Thanks for your post.

I used two 32-bit Xillybus streams and four Boost threads (two for reading and two for writing) to utilize all the bandwidth, and I managed to reach almost 733.013 MByte/s out of 800 MByte/s.
The difference is the overhead of one RTT due to the handshaking through the Microblaze and the Linux kernel on the Jetson, which I mentioned in my other post about the Linux overhead on RTT on the Jetson. I am trying to find the problem in Linux on the Jetson TX1 to reduce this overhead.

However, my main problem is the statistical behaviour of the bandwidth utilisation: it varies between 468.00 and 733.013 MByte/s (most of the time it is close to 733.013 MByte/s).
I think this is the impact of the Linux scheduler and the other processes on the system, which make it impossible to guarantee maximum memory bandwidth utilisation.
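To quantify this variation, the measurement can be repeated and summarized. A small sketch in C follows. The struct and function names are made up for this example, and any sample values fed to it other than the two extremes mentioned above would be illustrative, not measured.

```c
#include <stddef.h>

/* Summary of repeated bandwidth measurements, in MByte/s. */
struct bw_stats {
    double min;
    double max;
    double mean;
};

/* Summarize n bandwidth samples. n must be at least 1. */
struct bw_stats summarize(const double *samples, size_t n)
{
    struct bw_stats s = { samples[0], samples[0], 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (samples[i] < s.min)
            s.min = samples[i];
        if (samples[i] > s.max)
            s.max = samples[i];
        sum += samples[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

Comparing the mean with the maximum over many runs shows how often the slow case actually occurs, which is more informative than the two extremes alone.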

I am wondering how I can use a revision B IP core to increase the bandwidth utilization.

Thanks
Mohammad

Re: Higher Performance

Post by support » Wed Jul 18, 2018 11:22 am

Hello,

This page supplies information on revision B:

http://xillybus.com/doc/revision-b-xl

Regards,
Eli
