Hello,
The real bitrate of a 1x lane is 2.0 Gbit/s: the line rate is 2.5 Gbit/s, but the 8b/10b encoding leaves only 80% of that for data. Divided by 8, the upper bound for actual data is indeed 250 MB/s. But now let's make a simple calculation regarding the overhead for upstream packets (from FPGA to host).
The host controls the maximal number of payload bytes each packet (TLP) can have through the MaxPayload field in the PCIe configuration space. It's quite common to have this set to 128 bytes, so let's assume this is the case.
Now the overhead: Each TLP has a header of three or four DWORDs (4 bytes each), depending on whether the addressing is 32 or 64 bit. Let's go for the modest case of 32-bit addressing, so the transaction layer's overhead is 12 bytes. There's also the possibility of a TLP digest (ECRC), but since it doesn't make much sense to use it, I'll assume it's not there.
The data link layer adds 6 additional bytes: a two-byte header containing a reserved field and the TLP sequence number (used for acknowledgement), and a four-byte LCRC, which is how the packet's integrity is checked.
It's plausible to assume that real-life implementations add another two bytes, so that the entire transmission chain can be implemented with 32-bit alignment, but since I don't know how Xilinx implemented their interface, I'll assume only 6 bytes of data link layer overhead.
So all in all, there are 18 bytes of overhead in the optimistic case, against a maximal payload of 128 bytes. The efficiency is therefore 128/(18+128) = 87.7%. Applied to the 250 MB/s byte stream, this gives a maximal real throughput of 219 MB/s.
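If you'd like to play with these numbers, here's a minimal Python sketch that repeats the calculation for a few common MaxPayload settings. The overhead constants are just the assumptions made above (3-DWORD header, no ECRC, 6 bytes of data link layer framing):

# Back-of-the-envelope TLP efficiency, under the assumptions above.
LANE_BYTES_PER_SEC = 250e6   # 1x lane, after 8b/10b encoding
TLP_OVERHEAD = 12 + 6        # transaction layer header + data link layer

for max_payload in (128, 256, 512):
    efficiency = max_payload / (max_payload + TLP_OVERHEAD)
    throughput = LANE_BYTES_PER_SEC * efficiency
    print(f"MaxPayload {max_payload} B: {efficiency:.1%}, ~{throughput / 1e6:.0f} MB/s")

# MaxPayload 128 B: 87.7%, ~219 MB/s
# MaxPayload 256 B: 93.4%, ~234 MB/s
# MaxPayload 512 B: 96.6%, ~242 MB/s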
This overhead calculation doesn't include other packets, such as credit announcements, power management packets, acknowledgement packets etc., which consume some bandwidth as well.
As for the other direction (host to FPGA), things become significantly more complicated to estimate. This is because the FPGA requests data to be read, and the host may choose to chop the answer (the completion) into several TLPs. There's a limit on how finely a completion can be chopped, but a worst-case bound derived from it wouldn't say much about real performance, since the actual bandwidth utilization depends on the chopping the host happens to choose.
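Just to illustrate how much the chopping matters, here's a small sketch. The 64- and 128-byte chunk sizes are illustrative assumptions (in the spirit of a Read Completion Boundary), not a statement about what any specific host does:

# Efficiency when a read's data arrives in completion TLPs carrying at
# most `chunk` payload bytes each. 18 bytes of overhead per TLP, as above.
TLP_OVERHEAD = 12 + 6

def completion_efficiency(read_size, chunk):
    n_tlps = -(-read_size // chunk)   # ceiling division
    return read_size / (read_size + n_tlps * TLP_OVERHEAD)

for chunk in (64, 128):
    print(f"512-byte read in {chunk}-byte completions: "
          f"{completion_efficiency(512, chunk):.1%}")

# 512-byte read in 64-byte completions: 78.0%
# 512-byte read in 128-byte completions: 87.7%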
There is also another factor which may slow down the downstream significantly: The FPGA may or may not choose to send a request to the host before the previous one(s) have been completed. Having multiple requests pending is possible only when the FPGA can handle data arriving out of order, since the packet ordering rules apply only to packets belonging to the same request.
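To see why pending requests matter, consider the latency-bandwidth product: with a single outstanding request, throughput is capped by the request size divided by the round-trip latency, no matter how fast the link is. The latency and request size below are made-up example figures, picked only to make the point:

# Throughput ceiling imposed by the number of outstanding read requests.
LINK_MB_S = 219            # the per-TLP efficiency ceiling from above
ROUND_TRIP_SEC = 2.5e-6    # assumed request-to-completion latency
REQUEST_BYTES = 512        # assumed read request size

for outstanding in (1, 2, 4):
    ceiling = outstanding * REQUEST_BYTES / ROUND_TRIP_SEC / 1e6
    print(f"{outstanding} outstanding: ~{min(ceiling, LINK_MB_S):.0f} MB/s")

# 1 outstanding: ~205 MB/s
# 2 outstanding: ~219 MB/s
# 4 outstanding: ~219 MB/s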
So all in all, the only relatively solid answer I can offer is: Don't expect more than 220 MB/s from FPGA to host. As for the other direction, it's still limited by the per-packet maximal payload rule, but it can be slowed down further by the other factors mentioned above.
Hope this helped,
Eli