by support »
The general answer is that this is platform dependent. In theory, a root complex governed by the CPU may wait for some data to accumulate before generating a write TLP. Even more theoretically, memcpy() could be implemented to hint to the hardware that the data should be packed.
But practically, what I've seen is that each read/write operation directed towards a PCIe device results in a separate TLP. At best, if a 64-bit word is read or written, this TLP will span two DWORDs. I doubt anyone has seen larger TLPs resulting from data transfers initiated directly by software. And memcpy() is just an optimized loop of reads followed by writes.
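To make the "loop of reads followed by writes" point concrete, here is a minimal sketch of what a word-by-word copy to a memory-mapped region boils down to. The function name mmio_write_words() is made up for illustration, and the destination pointer is assumed to come from whatever mapping call the platform provides. The code only shows that the copy is a series of separate 32-bit stores. Whether each store becomes its own TLP is up to the platform, as said above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of any real driver API): copy n_words
 * 32-bit words from ordinary memory to a memory-mapped region.
 *
 * The volatile qualifier prevents the compiler from merging or dropping
 * the stores, so every loop iteration is one 32-bit store. The return
 * value is the number of separate store operations issued. Per the
 * observation above, each of these typically ends up as its own write
 * TLP, but that depends on the platform.
 */
static size_t mmio_write_words(volatile uint32_t *dst,
                               const uint32_t *src,
                               size_t n_words)
{
    size_t stores = 0;

    assert(dst != NULL && src != NULL);

    for (size_t i = 0; i < n_words; i++) {
        dst[i] = src[i];   /* one 32-bit store per iteration */
        stores++;
    }
    return stores;
}
```

Calling this with a 1024-word buffer issues 1024 separate stores. Nothing in the loop tells the hardware that the data belongs together.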
There is a reason why there is no data packing: In a properly designed driver, software reads and writes to PCIe devices should be used only to access registers. Bulk data transfers should be done with DMA. Hence it doesn't make much sense to optimize non-DMA data transfers. I don't expect this to appear in future processors.
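A rough back-of-the-envelope calculation shows why DMA is the better tool for bulk data. The sketch below just counts how many write TLPs are needed for a given amount of data and payload size per TLP. The function name tlp_count() is made up, and the 256-byte payload used in the comparison is an assumed value: the real limit is the Max_Payload_Size negotiated on the link, which differs between systems.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of write TLPs needed to move total_bytes
 * when each TLP carries at most payload_bytes of data. This is plain
 * ceiling division, and it ignores alignment and boundary-crossing rules.
 */
static size_t tlp_count(size_t total_bytes, size_t payload_bytes)
{
    assert(payload_bytes > 0);
    return (total_bytes + payload_bytes - 1) / payload_bytes;
}
```

For a 4096-byte buffer, tlp_count(4096, 4) gives 1024 TLPs for 32-bit software writes and tlp_count(4096, 8) gives 512 for 64-bit writes, while tlp_count(4096, 256) gives 16 for a DMA engine using the assumed 256-byte payload. Since every TLP carries a header of three or four DWORDs regardless of its payload, the small-write case spends a large share of the link on overhead.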
I hope this clarifies the issue.
Eli