Since it’s a PC, it’s likely that the CPU itself performs a simple write operation on its own bus, and that the memory controller chipset, which is connected to the CPU’s bus, has the direct connection to the PCIe bus. So what happens is that the chipset (which, in PCIe terms, functions as a Root Complex) generates a Memory Write packet for transmission over the bus. This packet consists of a header, which is either 3 or 4 32-bit words long (depending on whether 32-bit or 64-bit addressing is used), and one 32-bit word containing the data to be written. This packet simply says “write this data to this address”.
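To check my reading of the quoted passage, here is a minimal sketch of how I understand the header of such a single-word Memory Write to be laid out. The function name `build_mwr_header` and its parameters are my own invention for illustration, and the sketch assumes a one-DW payload with all four byte enables set and every optional field (traffic class, attributes, digest) left at zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills hdr[] with a Memory Write header for a
 * one-DW payload and returns the header length in DWs (3 or 4).
 * Assumption: TC, attributes, TD and EP are all zero. */
static size_t build_mwr_header(uint32_t hdr[4], uint64_t addr,
                               uint16_t requester_id, uint8_t tag)
{
    assert((addr & 0x3u) == 0);            /* address must be DW-aligned */

    int is64 = (addr >> 32) != 0;          /* 4-DW header only above 4 GB */
    uint32_t fmt = is64 ? 0x3u : 0x2u;     /* 011b = 4DW+data, 010b = 3DW+data */

    /* DW0: Fmt in bits 31:29, Type = 00000b (memory), Length = 1 DW */
    hdr[0] = (fmt << 29) | 1u;
    /* DW1: Requester ID, Tag, Last DW BE = 0000b, First DW BE = 1111b */
    hdr[1] = ((uint32_t)requester_id << 16) | ((uint32_t)tag << 8) | 0x0Fu;

    if (is64) {
        hdr[2] = (uint32_t)(addr >> 32);   /* upper 32 address bits */
        hdr[3] = (uint32_t)addr;           /* lower 32 address bits */
        return 4;
    }
    hdr[2] = (uint32_t)addr;               /* 32-bit address */
    return 3;
}
```

The 32-bit or 64-bit choice is made purely from the address value here; the data word itself would follow the header as one more DW.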
Can you expound on the above for a memcpy to PCIe memory? For a "simple" memcpy that performs a series of 64-bit stores to PCIe memory, I would assume that one Memory Write packet is generated per store, so that the number of stores needed to complete the memcpy equals the number of Memory Write packets. Is this correct? What would change if the PCIe memory were mapped as write-combining? Does the number of Memory Write packets decrease? And what if memcpy were optimized with SSE instructions? Would that make any difference?
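To make the question concrete, here is a rough sketch of the two copy loops I have in mind. The names `copy_u64` and `copy_sse2` are hypothetical, and the destination is just a pointer that stands in for a mapped PCIe region. The code only counts store instructions; whether each store ends up as its own Memory Write packet is exactly what I am asking.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics; x86 only */
#include <stddef.h>
#include <stdint.h>

/* Copies n bytes using 64-bit stores; returns the number of stores issued.
 * dst is volatile so the compiler keeps one store per iteration. */
static size_t copy_u64(volatile uint64_t *dst, const uint64_t *src, size_t n)
{
    assert(n % 8 == 0);
    size_t stores = 0;
    for (size_t i = 0; i < n / 8; i++) {
        dst[i] = src[i];
        stores++;
    }
    return stores;
}

/* Copies n bytes using 128-bit SSE2 stores; returns the number of stores.
 * Unaligned variants are used so the sketch works on any buffer; with
 * guaranteed 16-byte alignment the aligned load/store forms would do. */
static size_t copy_sse2(void *dst, const void *src, size_t n)
{
    assert(n % 16 == 0);
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t stores = 0;
    for (size_t i = 0; i < n / 16; i++) {
        _mm_storeu_si128(&d[i], _mm_loadu_si128(&s[i]));
        stores++;
    }
    return stores;
}
```

For a 64-byte copy the first loop issues 8 stores and the second issues 4, so if my one-packet-per-store assumption holds, the SSE version would halve the packet count.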
Finally, regarding reading the PCIe configuration space with inb/outb during bus enumeration: do these port I/O requests also get converted into TLPs?
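For reference, the access I mean is the legacy port-based mechanism, where an address word is written to port 0xCF8 and the data is then transferred through port 0xCFC. As far as I understand it, the address word is composed as in the sketch below; the function name `pci_config_address` is hypothetical, and the code only computes the value without touching any port.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the 32-bit value written to the CONFIG_ADDRESS port (0xCF8):
 * bit 31 = enable, bits 23:16 = bus, 15:11 = device, 10:8 = function,
 * bits 7:2 = register offset (DW-aligned, low two bits cleared). */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev,
                                   uint8_t func, uint8_t offset)
{
    assert(dev < 32 && func < 8);
    return 0x80000000u
         | ((uint32_t)bus  << 16)
         | ((uint32_t)dev  << 11)
         | ((uint32_t)func << 8)
         | (uint32_t)(offset & 0xFCu);
}
```

In kernel code this value would be written to 0xCF8 with a 32-bit port write, followed by a port read or write at 0xCFC. My question is what the chipset puts on the PCIe link in response.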