Xillybus bandwidth can't reach more than 75MB/s

Postby mjuzwiak »

Hello,

I'm not sure if this is the right place to ask, but I have no idea where else I can get help.
I'm currently running Xillybus on a Virtex-5 (ML507 prototyping board).
In my design, I need to send data continuously at ~120 MB/s from the PC to the FPGA, and at ~1 MB/s from the FPGA to the PC.

I generated a core with the IP Core Factory (specification below) and wrote a simple C program that writes and captures data and measures the time.
The device in my core is rated at 150 MB/s towards the FPGA, but I only get ~75 MB/s.
The situation is the same with my own devices (xillybus_md_upload, xillybus_md_upload_my, xillybus_md_download, xillybus_md_download_my) and with the default Xillybus devices (xillybus_write_32 and xillybus_read_32).
I'm using 32-bit Ubuntu 12.04 on an Athlon 64 X2 4800+ CPU.

In the code below I try to receive the same data as I send, but I have also tried sending a lot of data to the FPGA and receiving only 1 byte per 160K bytes sent.

Any help will be appreciated.

Core specification:
Code:

------- /dev/xillybus_read_32

  Upstream (FPGA to host):
    Data width: 32 bits
    DMA buffers: 32 x 128 kB = 4 MB
    Flow control: Asynchronous, select() and non-blocking read() supported
    Seekable: No

------- /dev/xillybus_write_32

  Downstream (host to FPGA):
    Data width: 32 bits
    DMA buffers: 32 x 128 kB = 4 MB
    Flow control: Asynchronous
    Seekable: No
    FPGA RAM for DMA acceleration: 4 segments x 512 bytes = 2 kB

------- /dev/xillybus_read_8

  Upstream (FPGA to host):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Asynchronous, select() and non-blocking read() supported
    Seekable: No

------- /dev/xillybus_write_8

  Downstream (host to FPGA):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Asynchronous
    Seekable: No
    FPGA RAM for DMA acceleration: None

------- /dev/xillybus_mem_8

  Upstream (FPGA to host):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Synchronous
    Seekable: Yes, with 5 address bits

  Downstream (host to FPGA):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Synchronous
    Seekable: Yes, with 5 address bits
    FPGA RAM for DMA acceleration: None

------- /dev/xillybus_md_download

  Downstream (host to FPGA):
    Data width: 32 bits
    DMA buffers: 32 x 128 kB = 4 MB
    Flow control: Asynchronous
    Seekable: No
    FPGA RAM for DMA acceleration: 4 segments x 512 bytes = 2 kB

------- /dev/xillybus_md_download_my

  Downstream (host to FPGA):
    Data width: 32 bits
    DMA buffers: 64 x 128 kB = 8 MB
    Flow control: Asynchronous
    Seekable: No
    FPGA RAM for DMA acceleration: 4 segments x 512 bytes = 2 kB

------- /dev/xillybus_md_upload

  Upstream (FPGA to host):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Synchronous
    Seekable: No

------- /dev/xillybus_md_upload_my

  Upstream (FPGA to host):
    Data width: 8 bits
    DMA buffers: 4 x 128 bytes = 512 bytes
    Flow control: Asynchronous, select() and non-blocking read() supported
    Seekable: No


C program code:
Code:
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/time.h> //needed for gettimeofday()

#define MEGABYTE (1024*1024)
#define SQUARE_100 40000

//#define DEBUG_MODE
//#define WORD_CHECK

//to calculate the size of the data sent, multiply words_number by fpga_word_width
//
//words_number should:
// - be greater than or equal to fpga_response_factor
// - be divisible by fpga_response_factor

#define words_number (MEGABYTE*10*10)
#define fpga_word_width 4
#define fpga_response_factor 0.25 //must match the value defined in the FPGA

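//note: this check only compiles when fpga_response_factor is an integer,
//because the preprocessor cannot evaluate floating-point values such as 0.25
#ifdef WORD_CHECK
#if words_number < fpga_response_factor
#error Not enough words_number or fpga_response_factor too big
#elif (words_number%fpga_response_factor)!= 0
#error words_number not divisible by fpga_response_factor
#endif
#endif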

int writeFd;
int readFd;

int rcv_started = 0;
int snd_started = 0;

int start_sending = 0;

char *mem;


const long memsize = (words_number)*fpga_word_width; //in fact - sendsize
const long rcvsize = words_number/fpga_response_factor;

struct timeval start_write, start_read, end_write, end_read;
long mtime_write, mtime_read;

void* rcvThread(void* arg)
{
   long rc;
   int num;
   long bytes_read = 0;
   long seconds, useconds;
   rcv_started = 1;
   char* buff;

   buff = malloc(rcvsize);
   if(buff == 0)
   {
      perror("Cannot allocate memory for receive");
      return NULL; //nothing to read into, so give up
   }


   #ifdef DEBUG_MODE
      printf("Read thread started!\n");
      fflush(stdout);
   #endif

   gettimeofday(&start_read, NULL);

   while (1) {
      #ifdef DEBUG_MODE
      printf("Bytes read: %ld \n", bytes_read);
      #endif
      rc = read(readFd, buff+bytes_read, rcvsize-bytes_read);

      if ((rc < 0) && (errno == EINTR))
         continue;

      if (rc < 0) {
         perror("read() failed");
         break;
      }

      if (rc == 0) {
         fprintf(stderr, "Reached read EOF.\n");
         break;
      }

      //count the bytes only after the error checks, so that -1 is never added
      bytes_read += rc;

      if(bytes_read >= rcvsize)
      {
         //count elapsed time
         gettimeofday(&end_read, NULL);
         seconds  = end_read.tv_sec  - start_read.tv_sec;
         useconds = end_read.tv_usec - start_read.tv_usec;
         mtime_read = ((seconds) * 1000 + useconds/1000.0) + 0.5;
         #ifdef DEBUG_MODE
         printf("Data read done!");
         fflush(stdout);
         #endif
         break;
      }
   }

   return NULL;
}


void* sndThread(void* arg)
{
   int rc;
   long seconds, useconds;

   snd_started = 1;

   gettimeofday(&start_write, NULL);

      long bytes_written = 0;

      while (bytes_written < memsize) {
         #ifdef DEBUG_MODE
         printf("Bytes: %ld \n", bytes_written);
         #endif
         rc = write(writeFd, mem+bytes_written, memsize-bytes_written);

         if ((rc < 0) && (errno == EINTR))
            continue;

         if (rc < 0) {
            perror("write() failed");
            break;
         }

         if (rc == 0) {
            fprintf(stderr, "Reached write EOF (?!)\n");
            break;
         }

         bytes_written += rc;

      }

      //flushing
      while(1) {
         rc = write(writeFd, mem+bytes_written,0);

         if((rc < 0) && (errno == EINTR))
            continue;

         if(rc < 0)
         {
            perror("Flush failed!");
            printf("bytes: %ld\n\n", bytes_written);
            break;
         }
         break;
      }

      gettimeofday(&end_write, NULL);
      seconds  = end_write.tv_sec  - start_write.tv_sec;
       useconds = end_write.tv_usec - start_write.tv_usec;
      mtime_write = ((seconds) * 1000 + useconds/1000.0) + 0.5;

      #ifdef DEBUG_MODE
      printf("Writing finished!: %ld written", bytes_written);
      fflush(stdout);
      #endif


   return NULL;
}



int main(int argc, char* argv[])
{

   pthread_t rcvThreadId, sndThreadId;

   if (argc != 3)
   {
      printf("Usage: %s write_file read_file\n", argv[0]);
      exit(1);
   }

   writeFd = open(argv[1], O_WRONLY);
   if (writeFd < 0) {
      perror("Failed to open write file");
      exit(1);
    }

   readFd = open(argv[2], O_RDONLY);
   if (readFd < 0) {
      perror("Failed to open read file");
      exit(1);
    }

   mem = (char*)malloc(memsize);
   if(mem == 0)
   {
      perror("Failed to allocate memory");
      exit(1);
   }

   if (pthread_create(&sndThreadId, NULL, sndThread, NULL)) {
      perror("Failed to create send thread");
      exit(1);
   }



   pthread_join(sndThreadId, NULL);
   printf("Write time: %ld ms\n", mtime_write);
   fflush(stdout);

   if (pthread_create(&rcvThreadId, NULL, rcvThread, NULL)) {
      perror("Failed to create receive thread");
      exit(1);
   }

   pthread_join(rcvThreadId, NULL);
   printf("Receive time: %ld ms\n", mtime_read);
   fflush(stdout);

   return 0;
}



FPGA code (the part that sends data back for my device):
Code:

   always @ (posedge bus_clk)
   begin
   upload_fifo_wr_en = 0;
      if(!user_r_md_upload_open && !user_w_md_download_open)
         counter = 0;
      else if(user_w_md_download_wren == 1'b1) begin
         counter = counter + 1;
      end
   
      if(counter == 40000) begin
         counter = 0;
         upload_fifo_wr_en = 1;
      end
   
   end

   assign user_r_md_upload_eof = 0;
   
     fifo_8x2048 md_upload
     (
      .clk(bus_clk),
      .srst(!user_r_md_upload_open && !user_w_md_download_open),
      .din(7),
      .wr_en(upload_fifo_wr_en),
      .rd_en(user_r_md_upload_rden),
      .dout(user_r_md_upload_data),
      .full(user_w_md_download_full),
      .empty(user_r_md_upload_empty)
      );


Re: Xillybus bandwidth can't reach more than 75MB/s

Postby support »

Hello,

Yes, this is the correct place to ask. (:

There is no reason why 150 MB/s shouldn't be reached. What you didn't mention, is where the data is going. The first thing I would look at, is if the destination FIFO is being emptied by the logic at the required rate. The most plausible explanation is that the rate you see is the rate at which data is being fetched from that FIFO. If this is the case, it's the user_w_md_download_full signal that goes high and hence stalls the data. You may try to assign zero to it, rather than connecting it to the FIFO. This will break the system's functionality, but show clearly if this was the issue or not.

It's not 100% clear from your question which of the devices you tried with. Anyhow, for fast data transmission, only 32-bit wide interfaces should be used. It appears like you went for /dev/xillybus_md_download, which is fine.

Besides, the 1:160000 ratio between the data rates is explained by the code you attached: You're counting the 32-bit data items from PC to FPGA, and send a single byte every time the counter reaches 40000. One byte for each chunk of 40000 words of 4 bytes each, that's exactly the ratio.
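
The ratio arithmetic can be checked with a minimal sketch in C. The two constants mirror the posted FPGA code and core specification (one byte sent back per 40000 received words, 4 bytes per 32-bit word). The function names are hypothetical and are not part of any Xillybus API.

```c
#include <stdio.h>

/* Values taken from the posted FPGA code and core specification. */
#define WORDS_PER_RESPONSE 40000L  /* counter threshold in the Verilog */
#define BYTES_PER_WORD     4L      /* 32-bit downstream data width */

/* Hypothetical helper: host-to-FPGA bytes needed to trigger one upstream byte. */
long host_bytes_per_response_byte(void)
{
    return WORDS_PER_RESPONSE * BYTES_PER_WORD;
}

/* Hypothetical helper: upstream bytes expected after `sent` bytes went downstream.
   Integer division drops any partially filled chunk of 40000 words. */
long expected_response_bytes(long sent)
{
    return sent / host_bytes_per_response_byte();
}
```

For the 400 MB transfer in the posted program (419430400 bytes), `expected_response_bytes()` gives 2621, assuming the counter is never reset in the middle of the transfer.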

I hope this helps.
Eli

Re: Xillybus bandwidth can't reach more than 75MB/s

Postby Guest »

Thank you for the reply.

I've tried all of the devices shown :)
Assigning zero to user_w_md_download_full has no effect; the time is still the same.

I generated a new core and edited xillybus_read_32 and xillybus_write_32 to use large buffers (see below). It's now a little faster (~10 MB/s).
Each of them now has 64 x 1 MB DMA buffers.
I ran some tests:

Code:
data sent - time    - bandwidth
64MB      - 121ms   - 528MB/s
128MB     - 776ms   - 167MB/s
256MB     - 2310ms  - 110MB/s
512MB     - 5396ms  - 94MB/s
1024MB    - 11525ms - 89MB/s
2048MB    - 23851ms - 85MB/s
4096MB    - 48497ms - 84MB/s


I ignore the incoming data in the FPGA.
I'm not sure, but write() probably returns once the DMA buffers are filled, not once all the data has been transferred. That would explain why 64 MB is so fast.
I tried 64-bit Ubuntu 12.04 and got the same results.
I'm going crazy with this. Could it be a poor motherboard? It's a Gigabyte GA-M61SME-S2 ( http://www.gigabyte.us/products/product ... id=2507#sp ).

Of course, I can generate a core with ~1 GB of DMA buffers, but I'm almost sure that won't fix the problem, and it's not a good approach.
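
One way to separate the buffering effect from the sustained rate is to look at the incremental rate between successive rows of the table above, rather than the overall rate of each row. The sketch below does that in C. The numbers are copied from the table; the function names are hypothetical.

```c
#include <stddef.h>

/* Measurements copied from the table above: megabytes sent and elapsed ms. */
static const double size_mb[] = { 64, 128, 256, 512, 1024, 2048, 4096 };
static const double time_ms[] = { 121, 776, 2310, 5396, 11525, 23851, 48497 };

/* Overall rate for row i in MB/s: total size divided by total time. */
double overall_rate(size_t i)
{
    return size_mb[i] / (time_ms[i] / 1000.0);
}

/* Incremental rate in MB/s between row i-1 and row i (i must be >= 1):
   the extra data divided by the extra time it took. */
double incremental_rate(size_t i)
{
    double extra_mb = size_mb[i] - size_mb[i - 1];
    double extra_s  = (time_ms[i] - time_ms[i - 1]) / 1000.0;
    return extra_mb / extra_s;
}
```

With these figures, the incremental rate between the 2048 MB and 4096 MB rows comes out at roughly 83 MB/s, which suggests that the sustained rate is close to the large-transfer figures and that the 64 MB result mostly reflects data sitting in the DMA buffers.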

My devices now:
Code:
------- /dev/xillybus_read_32 or \\.\xillybus_read_32

  Upstream (FPGA to host):
    Data width: 32 bits
    DMA buffers: 64 x 1 MB = 64 MB
    Flow control: Asynchronous, select() and non-blocking read() supported (on Linux)
    Seekable: No

------- /dev/xillybus_write_32 or \\.\xillybus_write_32

  Downstream (host to FPGA):
    Data width: 32 bits
    DMA buffers: 64 x 1 MB = 64 MB
    Flow control: Asynchronous
    Seekable: No
    FPGA RAM for DMA acceleration: 8 segments x 512 bytes = 4 kB

------- /dev/xillybus_read_8 or \\.\xillybus_read_8

  Upstream (FPGA to host):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Asynchronous, select() and non-blocking read() supported (on Linux)
    Seekable: No

------- /dev/xillybus_write_8 or \\.\xillybus_write_8

  Downstream (host to FPGA):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Asynchronous
    Seekable: No
    FPGA RAM for DMA acceleration: None

------- /dev/xillybus_mem_8 or \\.\xillybus_mem_8

  Upstream (FPGA to host):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Synchronous
    Seekable: Yes, with 5 address bits

  Downstream (host to FPGA):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Synchronous
    Seekable: Yes, with 5 address bits
    FPGA RAM for DMA acceleration: None


Re: Xillybus bandwidth can't reach more than 75MB/s

Postby support »

Hi,

First, yes: since the stream is asynchronous, a write() call may return immediately without the data actually having been sent to the FPGA. If you want to time the arrival of the data, it's enough to include the close() call in the measurement, since it returns only after the data has been sent (or after a timeout).
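
A minimal sketch of such a measurement in C, with close() inside the timed region, could look like the following. The helper names `write_all` and `timed_send_ms` are hypothetical; the sketch relies only on the behavior described above (close() returning after the data has been sent) and uses gettimeofday() as the posted program does.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all `len` bytes, retrying on EINTR and short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t rc = write(fd, buf + done, len - done);
        if (rc < 0 && errno == EINTR)
            continue;
        if (rc <= 0)
            return -1;
        done += (size_t)rc;
    }
    return 0;
}

/* Hypothetical helper: milliseconds taken by write + close on `path`,
   or -1.0 on any error. The clock stops only after close() returns. */
double timed_send_ms(const char *path, const char *buf, size_t len)
{
    struct timeval t0, t1;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1.0;

    gettimeofday(&t0, NULL);
    if (write_all(fd, buf, len) < 0) {
        close(fd);
        return -1.0;
    }
    if (close(fd) < 0)          /* close() is part of the timed region */
        return -1.0;
    gettimeofday(&t1, NULL);

    return (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_usec - t0.tv_usec) / 1000.0;
}
```

Calling `timed_send_ms("/dev/xillybus_write_32", buf, len)` would then give a figure that covers the whole transfer rather than just the time needed to fill the DMA buffers.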

The situation you have there is indeed odd. You shouldn't need to allocate DMA buffers with a total larger than 1 MB to achieve full throughput. And the chances that something is wrong with your hardware are extremely slim.

I took a look at the C code. It looks like you're allocating huge buffers in memory. Memory allocation can slow things down significantly, not to mention what happens if pages are swapped out to disk. I would suggest allocating a relatively small buffer (say, 512 kB) and reusing it several times to reach your data transmission goal.

I would also suggest using the dd Linux utility for measuring data rates. This is how I check these figures.

Regards,
Eli

Re: Xillybus bandwidth can't reach more than 75MB/s

Postby mjuzwiak »

Hello,

I had the same concern (about allocating a huge amount of memory) earlier, and it was the first thing I tested some time ago.
Anyway, I checked it once again: I changed the code to send small chunks of data (512 KB to 1 MB) in a loop. Same result.
I also used dd to copy 1 GB of data. Here's the report:

Code:
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 11.7009 s, 89.6 MB/s


and the second time:

Code:
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 11.8763 s, 88.3 MB/s


I ignore all incoming data in the FPGA; I've assigned user_w_write_32_full to 0.
Maybe I should change the Linux distro? Or could it be a problem with the clock on the FPGA (I don't think so, since PCIe transmission wouldn't work at all in that case)? I connected a differential 100 MHz clock to PCIE_REFCLK_P and PCIE_REFCLK_N.

Re: Xillybus bandwidth can't reach more than 75MB/s

Postby support »

Hi,

Let me first say that what you have there is extremely rare. Xillybus is running on a lot of platforms, and I've never seen anything like this.

You mentioned something about a 100 MHz clock to PCIE_REFCLK. The clock you should use is the one coming from the motherboard, not just some oscillator. Have you changed the pinout relative to the demo bundle you downloaded?

I can see that you used 512-byte blocks in your dd attempt. Could you please try it again with bs=16k or something? Just to be sure it's for real?

Also, please add the dd command you used. As we are in a debugging session, the devil is probably somewhere in some very tiny detail...
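
The block size can be read off the dd report quoted earlier: it is the number of bytes copied divided by the number of records. A small C sketch of that arithmetic follows; the function names are hypothetical, and the figures in the note after it are the ones from that report.

```c
/* Hypothetical helper: block size implied by a dd report
   ("<records>+0 records out" and "<bytes> bytes copied"). */
long long dd_block_size(long long bytes_copied, long long records)
{
    return bytes_copied / records;
}

/* Hypothetical helper: number of full records (one write() call each)
   needed for the same amount of data at a different block size. */
long long dd_records_for(long long bytes_copied, long long block_size)
{
    return bytes_copied / block_size;
}
```

For the report above, `dd_block_size(1048576000LL, 2048000LL)` gives 512, and `dd_records_for(1048576000LL, 16384LL)` gives 64000, i.e. 32 times fewer write() calls with bs=16k.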

I can see a few possibilities:

(1) There is something overlooked in the xillydemo.v file. Could you please send the file you're running with to Xillybus' main email?
(2) For some reason, the stream you're using is synchronous, despite what it appears to be. These things can happen when you've downloaded one IP core and then another, and thought that the new ngc file is in use, while it's actually the old one. The data rates you have there are typical for synchronous streams.
(3) Unlikely (unless you're using the wrong reference clock): Your PCIe hardware is faulty at the physical level, causing a high rate of retransmissions of packets. I would expect a lot of other trouble if this was the case.

Regards,
Eli

Re: Xillybus bandwidth can't reach more than 75MB/s

Postby Guest »

Hello,

I'm really not surprised that I hit a rare case ;)

Clock:
I've downloaded the version for the ML506 and changed the UCF to:

Code:
INST "*/pcie_ep0/pcie_blk/SIO/.pcie_gt_wrapper_i/GTD[0].GT_i" LOC = GTX_DUAL_X0Y2;
NET  "PCIE_REFCLK_P"       LOC = "AF4"  ;
NET  "PCIE_REFCLK_N"       LOC = "AF3"  ;
NET  "PCIE_PERST_B_LS"     LOC = "AC24"  ;


AF4 and AF3 are taken from the UCF for my Virtex-5 board:

Code:
NET  PCIE_CLK_QO_N        LOC="AF3";   # Bank 118, MGTREFCLKN_118, GTP_DUAL_X0Y1
NET  PCIE_CLK_QO_P        LOC="AF4";   # Bank 118, MGTREFCLKP_118, GTP_DUAL_X0Y1


Detailed info: http://www.xilinx.com/support/documenta ... /ug347.pdf page 43

dd command:

'file' is a 1.1 GB file previously generated by dd.
Code:
$ dd if=file of=/dev/xillybus_write_32
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 12.2857 s, 87.4 MB/s


and

Code:
$ dd if=/dev/zero of=/dev/xillybus_write_32 bs=16k count=100KB
100000+0 records in
100000+0 records out
1638400000 bytes (1.6 GB) copied, 18.2155 s, 89.9 MB/s


dd consumes ~16% of the CPU.

I downloaded the example project for the ML506 again, so my xillydemo.v is very close to the example; I just assigned the full signal to 0.
Re 1. Sending it right now.
Re 2. As I said above, I downloaded the example project and, before building, replaced the files with those of my latest core (with large buffers). The dd results above come from my new core. Could large DMA buffers now be causing the low bandwidth?
Re 3. Ouch. Should I look for a new motherboard? Can we check the retransmission rate?

Re: Xillybus bandwidth can't reach more than 75MB/s

Postby mjuzwiak »

Forgot to log in above.

I also forgot to mention:
I connected PCIE_PERST_B_LS to a DIP switch.

Mathew

Re: Xillybus bandwidth can't reach more than 75MB/s

Postby support »

Hi,

I took a look at the xillybus.v file you sent me. There is indeed nothing special about it. Your pinout also seems OK.

Large DMA buffers can't reduce bandwidth, and in fact there is nothing, except synchronous streams, that can explain that bandwidth. Really.

I would also be quite surprised if it turned out to be a poor PCIe link. But I would make a quick try on another computer, if that is possible for you, even though I wouldn't put my money on this being the issue.

In all previous cases of this kind of black magic that I've had, it turned out that there was some confusion between old and new files. So to be absolutely sure that there really is a problem, please try this, even though it appears that you've been through it already:

(1) Please create a new custom core for Virtex-5, based upon the default configuration. Change nothing except for the name of xillybus_write_32, to something else, say xillybus_write_test.
(2) Then adopt the downloaded core into a freshly downloaded demo project (replace the files), and adjust it: Rewire the full signal as you previously did, and search-replace write_32 with write_test.
(3) Adjust the UCF file
(4) Implement the project and load it to the FPGA
(5) Run the dd test you did above.

The idea behind this checkup is that we'll be absolutely sure that what is loaded into the FPGA is what we think it is. If the problem remains, we start to suspect the hardware, so the next step is to try another FPGA board and/or computer.

Regards,
Eli

Re: Xillybus bandwidth can't reach more than 75MB/s

Postby mjuzwiak »

Hello,

I've done the steps you requested.

$ dd if=/dev/zero of=/dev/xillybus_write_test bs=16kB count=64kB
64000+0 records in
64000+0 records out
1024000000 bytes (1.0 GB) copied, 13.5872 s, 75.4 MB/s


Code:
------- /dev/xillybus_read_test

  Upstream (FPGA to host):
    Data width: 32 bits
    DMA buffers: 32 x 128 kB = 4 MB
    Flow control: Asynchronous, select() and non-blocking read() supported
    Seekable: No

------- /dev/xillybus_write_test

  Downstream (host to FPGA):
    Data width: 32 bits
    DMA buffers: 32 x 128 kB = 4 MB
    Flow control: Asynchronous
    Seekable: No
    FPGA RAM for DMA acceleration: 4 segments x 512 bytes = 2 kB

------- /dev/xillybus_read_8

  Upstream (FPGA to host):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Asynchronous, select() and non-blocking read() supported
    Seekable: No

------- /dev/xillybus_write_8

  Downstream (host to FPGA):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Asynchronous
    Seekable: No
    FPGA RAM for DMA acceleration: None

------- /dev/xillybus_mem_8

  Upstream (FPGA to host):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Synchronous
    Seekable: Yes, with 5 address bits

  Downstream (host to FPGA):
    Data width: 8 bits
    DMA buffers: 4 x 4 kB = 16 kB
    Flow control: Synchronous
    Seekable: Yes, with 5 address bits
    FPGA RAM for DMA acceleration: None



I updated the BIOS to the latest version, with no change. I also tried assigning the IRQ manually, still with no effect.
I'll try changing the PC.
I'm going on holiday tomorrow for one week; I hope we can continue when I get back.

Thank you very much,
Mathew