PCIe data rate

Comments and questions related to the "Down to the TLP" pages

PCIe data rate

Postby Guest » Thu May 17, 2012 9:18 pm

Hey. Great site you have.

I have a question. My SP605 has 1x lane PCIe. Can I get 250 MB/s with it? Is it possible?
Guest
 

Re: PCIe data rate

Postby support » Thu May 17, 2012 11:25 pm

Hello,

The raw bit rate of a Gen1 1x lane is 2.5 Gbit/s, which after 8b/10b encoding leaves 2.0 Gbit/s of actual data; divided by 8, the upper bound for real data is indeed 250 MB/s. But now let's make a simple calculation of the overhead for upstream packets (from FPGA to host).

The host controls the maximal number of payload bytes each packet (TLP) can carry through the Max_Payload_Size field in the PCIe configuration space. It's quite common to have this set to 128 bytes, so let's assume that's the case.

Now the overhead: Each TLP has a header of three or four DWORDs (4 bytes each), depending on whether the addressing is 32- or 64-bit. Let's take the modest case of 32-bit addressing, so the transaction layer's overhead is 12 bytes. A TLP may also carry a digest (ECRC), but since there's rarely a good reason to use it, I'll assume it's absent.

The data link layer adds 6 more bytes: a two-byte header containing a reserved field and the TLP sequence number (used for acknowledgement), and a four-byte LCRC, which is how the packet's integrity is checked.

It's plausible to assume that real-life implementations add another two bytes, so that the entire transmission chain can be implemented with 32-bit alignment, but since I don't know how Xilinx implemented their interface, I'll assume only 6 bytes of data layer overhead.

So all in all, there are 18 bytes of overhead in the optimistic case, for 128 bytes of maximal payload. The efficiency is therefore 128/(18+128) = 87.7%. Applied to the 250 MB/s byte stream, we have a maximal real throughput of 219 MB/s.
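The arithmetic above can be sketched in a few lines of Python. This is just the optimistic case described so far: 32-bit addressing, no ECRC, and 6 bytes of data link layer overhead (the possible extra 2 alignment bytes are left out).

```python
# Sketch of the upstream efficiency estimate. Assumes: 12-byte TLP
# header (32-bit addressing), no ECRC, 6 bytes of data link layer
# overhead, and Max_Payload_Size = 128 bytes.

def tlp_efficiency(max_payload=128, tlp_header=12, dll_overhead=6):
    """Fraction of link bytes that carry actual payload."""
    return max_payload / (max_payload + tlp_header + dll_overhead)

line_rate_mb_s = 250.0  # 2.0 Gbit/s effective, divided by 8 bits/byte
eff = tlp_efficiency()
print(f"Efficiency: {eff:.1%}")                         # ~87.7%
print(f"Throughput: {line_rate_mb_s * eff:.0f} MB/s")   # ~219 MB/s
```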

This overhead calculation doesn't include other packets, such as flow control credit announcements, power management packets, acknowledgement packets etc., which consume some amount of bandwidth as well.

As for the other direction (host to FPGA), things become significantly harder to estimate. This is because the FPGA requests data to be read, and the host may choose to chop the answer (the completion) into several TLPs. There's a limit on how finely a completion can be chopped, but it's not practical to derive a performance bound from the worst-case bandwidth utilization the host might choose.

There is also another factor which may slow down the downstream significantly: The FPGA may or may not send a new read request before the previous one(s) have completed. Having multiple requests pending is possible only if the FPGA can handle completion data arriving out of order, since the packet ordering rules apply only to packets belonging to the same request.

So all in all, the only relatively solid answer I can offer is: Don't expect more than 220 MB/s from FPGA to host. As for the other direction, it's still limited by the maximal payload rule per packet, but it can be slowed down further by other factors.

Hope this helped,
Eli
support
 
Posts: 733
Joined: Tue Apr 24, 2012 3:46 pm

Re: PCIe data rate

Postby Guest » Mon Feb 03, 2014 9:38 am

Hi, Eli --

Can you give some insight into how much real-world upstream (FPGA to host) bandwidth should be expected from a VC707 board sending data to an x64 Windows system? I need to see 2000 MB/sec reliably for my application to work the way I'm hoping, and it seems like this should be achievable (5.0GT/s x 8 lanes for Gen2 PCIe on VC707, times 8/10, and allowing a conservative 50% for protocol overhead = 2 GBytes/sec). However, most of your documentation suggests that about 800 MB/sec is a more realistic maximum rate for this sort of hardware. Any pointers?
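The back-of-envelope estimate in parentheses, spelled out (the 50% allowance for protocol overhead is a conservative guess, not a measured figure):

```python
# Gen2 signals at 5.0 GT/s per lane with 8b/10b encoding
# (10 line bits per data byte); x8 lanes; assume 50% is lost
# to protocol overhead and other inefficiencies.
line_rate_t_s = 5.0e9       # transfers/s per lane
lanes = 8

data_bits_s = line_rate_t_s * lanes * 8 / 10   # after 8b/10b
raw_bytes_s = data_bits_s / 8                  # 4e9 bytes/s
usable_b_s = raw_bytes_s * 0.5                 # 50% overhead allowance
print(f"{usable_b_s / 1e9:.0f} GB/s")          # 2 GB/s
```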
Guest
 

Re: PCIe data rate

Postby support » Mon Feb 03, 2014 10:56 am

Hi,

The figures given on Xillybus' site relate to Xillybus' IP core, which feeds data in a convenient way into user-space applications. Since actual computer programs can handle considerably less data than PCIe can deliver, Xillybus' data rates haven't been pushed higher; there has been no real-life demand for it.

Xillybus' internal data bus is 32 bits wide, running at a core clock of 250 MHz (on Virtex-7), so we have 4 bytes x 250 MHz = 1 GB/s, but with the PCIe packet overhead, we're down to ~800 MB/s. It's of course possible to widen the bus, and this will be done when someone really needs it. So far, most real-life applications have been struggling with making sense of 200-400 MB/s.

As for the general capability of PCIe, my experience is that throughput scales linearly from the ~200 MB/s available on a Gen1 1x lane. Doubling the data rate to account for Gen2, and multiplying by 8 lanes (for your Gen2 x8), you should be able to see 3200 MB/s. This can be useful for short bursts of data, or for communication between PCIe cards on the board. But unless you have a supercomputer over there, odds are the computer will fall far behind processing the data if the flow is supposed to be sustained.
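That linear scaling rule of thumb can be written down explicitly. The starting figure of ~200 MB/s for a Gen1 x1 link comes from the overhead calculation earlier in this thread; the linear scaling itself is an empirical assumption, not a spec guarantee.

```python
# Rule-of-thumb practical throughput: start from ~200 MB/s on a
# Gen1 x1 link and scale linearly with generation speed and lanes.

GEN1_X1_MB_S = 200  # practical figure for one Gen1 lane

def practical_rate_mb_s(gen=1, lanes=1):
    # Gen2 doubles the per-lane data rate relative to Gen1
    per_lane = GEN1_X1_MB_S * (2 if gen == 2 else 1)
    return per_lane * lanes

print(practical_rate_mb_s(gen=2, lanes=8))  # 3200 MB/s
```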

Regards,
Eli
support
 

Re: PCIe data rate

Postby Guest » Mon Feb 03, 2014 8:58 pm

eli wrote:The figures given on Xillybus' site relate to Xillybus' IP core, which feeds data in a convenient way into user-space applications. Since actual computer programs can handle considerably less data than PCIe can deliver, Xillybus' data rates haven't been pushed higher; there has been no real-life demand for it.


Well, there is now :)

Xillybus' internal data bus is 32 bits wide, running at a core clock of 250 MHz (on Virtex-7), so we have 4 bytes x 250 MHz = 1 GB/s, but with the PCIe packet overhead, we're down to ~800 MB/s. It's of course possible to widen the bus, and this will be done when someone really needs it. So far, most real-life applications have been struggling with making sense of 200-400 MB/s. As for the general capability of PCIe, my experience is that throughput scales linearly from the ~200 MB/s available on a Gen1 1x lane. Doubling the data rate to account for Gen2, and multiplying by 8 lanes (for your Gen2 x8), you should be able to see 3200 MB/s. This can be useful for short bursts of data, or for communication between PCIe cards on the board. But unless you have a supercomputer over there, odds are the computer will fall far behind processing the data if the flow is supposed to be sustained.


In this case, I'm interested in running some "embarrassingly parallel" algorithms on continuous streams from a set of four 16-bit 250 MS/s ADCs. Either a fast GPU or AVX2 on the CPU should be able to handle this without too much trouble. If that turns out to be impractical, I can also get away with recording non-contiguous 32 GB blocks of data to system RAM and saving them to disk for non-realtime crunching, but I still need gap-free streaming at 2 GB/sec within each block.
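For what it's worth, the 2 GB/s figure follows directly from the ADC parameters; a quick sanity check:

```python
# Required streaming bandwidth for the setup described above:
# four ADCs, 16 bits (2 bytes) per sample, 250 MS/s each.
n_adcs = 4
bytes_per_sample = 2
samples_per_sec = 250e6

rate_b_s = n_adcs * bytes_per_sample * samples_per_sec
print(f"{rate_b_s / 1e9:.1f} GB/s")  # 2.0 GB/s
```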

Sounds like the 32-bit internal bus width is a showstopper in this case (why'd you do that?) but I've got to say I'm blown away by the user-friendliness of your docs and the open-ended evaluation license terms. I'll definitely keep an eye on the Xillybus IP for future projects that aren't so bandwidth-intensive. Thanks for the help!
Guest
 

Re: PCIe data rate

Postby support » Mon Feb 03, 2014 9:20 pm

Hi,

At 2 GB/s, I would definitely keep the CPU out of the way, except for managing the traffic. In other words, let the FPGA write the data to a DMA buffer in RAM, and hand over the buffer to a GPU directly, without touching its content. Xillybus wouldn't fit in here anyhow.

As for the non-real-time option, I would usually suggest sampling the data into DDR memory on the FPGA board, and then transporting it to the PC at a slower pace through a Xillybus stream. This has the advantage of not straining the PC hardware's real-time performance. But I'm not sure it's easy to find an FPGA board with 32 GB of RAM on it.

So indeed, it looks like Xillybus isn't going to help here. One can't make everyone happy all the time. ;)

Regards,
Eli
support
 

