The Xillybus Forum

by **support** »

Hello,

I now see that you've basically followed the HLS howto on Xillybus' site. In that case, there is no flow control problem (assuming that you copied the FPGA code correctly). The connection between xillybus_wrapper and the FIFOs (the empty and full signals, actually) makes sure that the HLS'ed code stops when the input FIFO is empty or the output FIFO is full. So there should be no problems on that front.

So simply put, if there is any bottleneck, it will slow things down, but it won't harm the correct operation of the coprocessing flow.
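To illustrate that point, here is a toy cycle-based model of a FIFO whose full and empty flags are respected on both sides. It is only a sketch: `simulate_fifo`, `SimResult` and the cycle model are made up for illustration and have nothing to do with the actual Xillybus or HLS logic.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Toy model of a FIFO with "full" and "empty" flags.
// Illustration only, not Xillybus or HLS code.
struct SimResult {
    std::vector<int> received;  // items in the order the consumer saw them
    long cycles;                // cycles needed to move all items
};

// The producer offers one item per cycle but stalls while the FIFO is full.
// The consumer takes one item every `consumer_period` cycles and stalls
// while the FIFO is empty. Assumes capacity > 0 and consumer_period > 0.
SimResult simulate_fifo(int n_items, std::size_t capacity, int consumer_period) {
    std::deque<int> fifo;
    SimResult result{{}, 0};
    int next_item = 0;

    while (static_cast<int>(result.received.size()) < n_items) {
        // Consumer side: acts only on its own cycles, and only if not empty.
        if (result.cycles % consumer_period == 0 && !fifo.empty()) {
            result.received.push_back(fifo.front());
            fifo.pop_front();
        }
        // Producer side: writes only if an item is left and the FIFO isn't full.
        if (next_item < n_items && fifo.size() < capacity) {
            fifo.push_back(next_item++);
        }
        ++result.cycles;
    }
    return result;
}
```

Running it with a slow consumer (a larger `consumer_period`) takes more cycles, but every item still arrives, and in order. That is the sense in which a bottleneck slows things down without breaking anything.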

I would suggest taking a look at the VideoCapture object over there. What happens if data isn't fetched as fast as it's captured? Does it get stuck, by any chance?

The 95% CPU consumption indeed indicates that the processors are very busy, and if they don't handle data quickly enough, maybe the video capture object stops being friendly? Does this object have a way to say it has overflowed? Some error indication? I would suggest looking at that.

Regards,
Eli

by **Guest** »

Just a little correction to the Vivado HLS code above: I used rows = 240 and cols = 240, which was for a different configuration that I tested earlier with a corresponding 240*240 input from Linux. Setting rows = 500 and cols = 500 will do for an input of 500*500 from Linux. However, the problem remains for each configuration.
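For reference, the amount of data the HLS block expects per frame can be worked out from the posted loop: it consumes three 32-bit words (12 RGB bytes) for every four pixels and emits one 32-bit word for every four result pixels. A small sketch of that arithmetic follows. The function names are hypothetical, and it assumes cols is a multiple of 4.

```cpp
#include <cstdint>

// Hypothetical helpers, based on reading the posted HLS loop:
// 3 input words (12 RGB bytes) are consumed per 4 pixels, and
// 1 output word (4 result bytes) is produced per 4 pixels.
// Both assume that cols is a multiple of 4.
constexpr std::uint32_t input_words_per_frame(std::uint32_t rows, std::uint32_t cols) {
    return rows * (cols / 4) * 3;
}

constexpr std::uint32_t output_words_per_frame(std::uint32_t rows, std::uint32_t cols) {
    return rows * (cols / 4);
}

// Byte counts as seen by the host side (4 bytes per 32-bit word).
constexpr std::uint32_t input_bytes_per_frame(std::uint32_t rows, std::uint32_t cols) {
    return input_words_per_frame(rows, cols) * 4;
}

constexpr std::uint32_t output_bytes_per_frame(std::uint32_t rows, std::uint32_t cols) {
    return output_words_per_frame(rows, cols) * 4;
}
```

For rows = cols = 500 this gives 750000 bytes in and 250000 bytes out per frame, which the host has to match exactly on each iteration. Note also that the output loop in the posted code writes a fixed 125 words per row, which corresponds to cols = 500 only.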

by **Guest** »

Hi again,
Thanks for your reply. I am really excited to hear that you'll be releasing revised HLS bundles. This will hopefully make things easier for people like me in new projects. Great initiative!

Now, I did try out your suggestions for improving my code, and the program now runs for somewhat more iterations. I have moved the code for opening the FIFOs outside the infinite loops, and I don't close any FIFOs inside loops. I have also removed the intermediate buf variable and used framepointer directly in the loops that write data (to the FPGA, and to the FIFO between parent and child).

However, my webcam feed still freezes after a given number of iterations. I put a cout statement in the loop that writes to the FPGA in the parent process, and this is where the process stops every time. Here is the updated code from the previous post.

Code: Select all: [code] #include <stdio.h> #include <sys/types.h> #include <unistd.h> #define MAX_COUNT 10 #include <iostream> #include <stdlib.h> #include <stdio.h> #include <opencv2/opencv.hpp> #include <opencv2/core/core.hpp> #include <opencv2/highgui/highgui.hpp> #include <opencv2/imgproc/imgproc.hpp> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> using namespace cv; using namespace std; int i = 0; VideoCapture cap(0); void ChildProcess(void); /* child process prototype */ void ParentProcess(void); /* parent process prototype */ int main() { pid_t pid; pid = fork(); if (pid != 0) ChildProcess(); else ParentProcess(); } void ChildProcess(void) { int rows = 500; int cols = 500; int nchan = 1; int fd2 = open("/dev/xillybus_read_32",O_RDONLY); int fd3 = open("/root/vidpipe",O_WRONLY); for(;;) { Mat img(rows,cols,CV_8UC1); ///Reading the result from FPGA here int totalbytes = rows*cols*nchan; int buflen = cols*nchan; int ret; uchar buf[buflen]; uchar datarray[totalbytes]; int j; int k = 0; int num = totalbytes/buflen; int bread = 0; while(bread<totalbytes) { ret=read(fd2,buf,buflen); for ( j = 0 ; j<= (ret-1);j++ ) { datarray[j+k] = buf[j]; } k = k+ret; bread = bread+ret; } img.data = datarray; ///Sending results back to the parent here /// uchar *framepointer = img.data; int bwritten = 0; int ret2; while(bwritten<totalbytes) { ret2 = write(fd3,framepointer,buflen); framepointer = framepointer + ret2; bwritten = bwritten+ret2; } } } void ParentProcess(void) { int check; int totalbytes; int buflen; int count = 0; int rows = 500; int cols = 500; int nchan = 1; int fd = open("/dev/xillybus_write_32",O_WRONLY); if (fd < 1) { cout<<"open failed"<<endl; } int fd1 = open("/root/vidpipe",O_RDONLY); for(;;) { i++; cout<<"Parent "<<i<<endl; Mat img(rows,cols,CV_8UC1); Mat frame; cap >> frame; totalbytes = frame.total()*frame.elemSize(); buflen = cols*nchan; uchar *framepointer = frame.data; int bwritten = 0; int ret; 
///Writing Input image to FPGA here/// /*** This loop is where the process gets stuck ***/ while(bwritten<totalbytes) { ret = write(fd,framepointer,buflen); framepointer = framepointer + ret; bwritten = bwritten+ret; } //having or removing this has apparently no effect write(fd,NULL,0); /// Receive binary map from child here /// totalbytes = rows*cols*nchan; int ret2; uchar buf2[buflen]; uchar datarray[totalbytes]; int j; int k = 0; int bread = 0; while(bread<totalbytes) { ret2=read(fd1,buf2,buflen); for ( j = 0 ; j<= (ret2-1);j++ ) { datarray[j+k] = buf2[j]; } k = k+ret2; bread = bread+ret2; } img.data = datarray; // Overlay results on the original image here for( int p = 1; p <= img.rows; p++ ) { for( int q = 1; q <= img.cols; q++ ) { if( img.at<uchar>(p,q) == 1 ) { circle( frame, Point( p, q ), 5, Scalar(255,0,0), 2, 8, 0 ); } } } imshow( "Received image in parent", frame ); waitKey(1); frame.release(); } } [/code]

So, I am pretty sure that it is due to the lack of flow control handling in the C code for the FPGA that I created in Vivado HLS.

Here is my code for that too.

Code: Select all: #include <math.h> #include <stdint.h> #include "xilly_debug.h" #include "ap_cint.h" #include "ap_utils.h" void xillybus_wrapper(int *in, int *out) { #pragma AP interface ap_fifo port=in #pragma AP interface ap_fifo port=out #pragma AP interface ap_ctrl_none port=return int x1, x2, x3; uint8_t y1,y2,y3,y4; int res; uint8_t bytes[12]; int thresh = 1000; float k = 0.04; const int rows = 240; const int cols = 240; uint8 rfirst[cols + 2] = {0}; uint8 rproc[cols + 2] = {0}; uint8 rnext[cols + 2] = { 0 }; uint8 rcurr[cols+2] = { 0 }; int ix2_1[cols + 2] = { 0 }; int ix2_2[cols + 2] = { 0 }; int ix2_3[cols + 2] = { 0 }; int iy2_1[cols + 2] = { 0 }; int iy2_2[cols + 2] = { 0 }; int iy2_3[cols + 2] = { 0 }; int ixy_1[cols + 2] = { 0 }; int ixy_2[cols + 2] = { 0 }; int ixy_3[cols + 2] = { 0 }; float ix2_f[cols] = { 0 }; float iy2_f[cols] = { 0 }; float ixy_f[cols] = { 0 }; float r;float cimg_1[cols + 2] = { 0 }; float cimg_2[cols + 2] = { 0 }; float cimg_3[cols + 2] = { 0 }; uint8 output[cols] = { 0 }; int ix; int iy;int ct; for (int i = 1; i <= (rows + 2); i++) { //Read rows and covert to grayscale--for i = 1 load two rows else load 1 row each time. //At any instant there are 3 rows available for each calculation i.e horizontal and vertical //derivatives, smoothed derivatives,and corner maps before and after non-maxima suppression. 
if (i == 1) { for (int u=1;u<=(cols)/4;u++) { x1 = *in++; bytes[0] = (x1 >> 24) & 0xFF; bytes[1] = (x1 >> 16) & 0xFF; bytes[2] = (x1 >> 8) & 0xFF; bytes[3] = x1 & 0xFF; x2 = *in++; bytes[4] = (x2 >> 24) & 0xFF; bytes[5] = (x2 >> 16) & 0xFF; bytes[6] = (x2 >> 8) & 0xFF; bytes[7] = x2 & 0xFF; x3 = *in++; bytes[8] = (x3 >> 24) & 0xFF; bytes[9] = (x3 >> 16) & 0xFF; bytes[10] = (x3 >> 8) & 0xFF; bytes[11] = x3 & 0xFF; rproc[ct] = (bytes[0] + bytes[1] + bytes[2])/3; rproc[ct+1] = (bytes[3] + bytes[7] + bytes[6])/3; rproc[ct+2] = (bytes[5]+bytes[4]+bytes[11])/3; rproc[ct+3] = (bytes[10]+bytes[9]+bytes[8])/3; ct = ct + 4; } ct = 1; for (int v=1;v<=(cols)/4;v++) { x1 = *in++; bytes[0] = (x1 >> 24) & 0xFF; bytes[1] = (x1 >> 16) & 0xFF; bytes[2] = (x1 >> 8) & 0xFF; bytes[3] = x1 & 0xFF; x2 = *in++; bytes[4] = (x2 >> 24) & 0xFF; bytes[5] = (x2 >> 16) & 0xFF; bytes[6] = (x2 >> 8) & 0xFF; bytes[7] = x2 & 0xFF; x3 = *in++; bytes[8] = (x3 >> 24) & 0xFF; bytes[9] = (x3 >> 16) & 0xFF; bytes[10] = (x3 >> 8) & 0xFF; bytes[11] = x3 & 0xFF; rnext[ct] = (bytes[0] + bytes[1] + bytes[2])/3; rnext[ct+1] = (bytes[3] + bytes[7] + bytes[6])/3; rnext[ct+2] = (bytes[5]+bytes[4]+bytes[11])/3; rnext[ct+3] = (bytes[10]+bytes[9]+bytes[8])/3; ct = ct + 4; } } else if (i > 1 && i <= (rows - 1)) { ct = 1; //here load one row from the RGB to Grayscale module for (int r = 1; r <= (cols)/4; r++) { x1 = *in++; bytes[0] = (x1 >> 24) & 0xFF; bytes[1] = (x1 >> 16) & 0xFF; bytes[2] = (x1 >> 8) & 0xFF; bytes[3] = x1 & 0xFF; x2 = *in++; bytes[4] = (x2 >> 24) & 0xFF; bytes[5] = (x2 >> 16) & 0xFF; bytes[6] = (x2 >> 8) & 0xFF; bytes[7] = x2 & 0xFF; x3 = *in++; bytes[8] = (x3 >> 24) & 0xFF; bytes[9] = (x3 >> 16) & 0xFF; bytes[10] = (x3 >> 8) & 0xFF; bytes[11] = x3 & 0xFF; rcurr[ct] = (bytes[0] + bytes[1] + bytes[2])/3; rcurr[ct+1] = (bytes[3] + bytes[7] + bytes[6])/3; rcurr[ct+2] = (bytes[5]+bytes[4]+bytes[11])/3; rcurr[ct+3] = (bytes[10]+bytes[9]+bytes[8])/3; ct = ct + 4; } for (int j = 0; j <= cols - 1; j++) { 
rfirst[j + 1] = rproc[j + 1]; rproc[j + 1] = rnext[j + 1]; rnext[j + 1] = rcurr[j+1]; } } else if (i == rows) { for (int j = 0; j <= cols - 1; j++) { rfirst[j + 1] = rproc[j + 1]; rproc[j + 1] = rnext[j + 1]; rnext[j + 1] = 0; } } //Calculating horizontal and vertical derivatives ix,iy,ix2,iy2 and ixy if (i >= 1 && i <= rows) { for (int j = 0; j <= cols - 1; j++) { ix2_1[j + 1] = ix2_2[j + 1]; ix2_2[j + 1] = ix2_3[j + 1]; iy2_1[j + 1] = iy2_2[j + 1]; iy2_2[j + 1] = iy2_3[j + 1]; ixy_1[j + 1] = ixy_2[j + 1]; ixy_2[j + 1] = ixy_3[j + 1]; } for (int m = 0; m <= (cols - 1); m++) { ix = abs(rfirst[m] - rfirst[m + 2] + rproc[m] - rproc[m + 2] + rnext[m] - rnext[m + 2]); iy = abs(rfirst[m] - rnext[m] + rfirst[m + 1] - rnext[m + 1] + rfirst[m + 2] - rnext[m + 2]); ix2_3[m + 1] = pow(ix,2); iy2_3[m + 1] = pow(iy,2); ixy_3[m + 1] = ix*iy; } } else if (i == rows + 1) { for (int j = 0; j <= cols - 1; j++) { ix2_1[j + 1] = ix2_2[j + 1]; ix2_2[j + 1] = ix2_3[j + 1]; iy2_1[j + 1] = iy2_2[j + 1]; iy2_2[j + 1] = iy2_3[j + 1]; ixy_1[j + 1] = ixy_2[j + 1]; ixy_2[j + 1] = ixy_3[j + 1]; ix2_3[j + 1] = 0; iy2_3[j + 1] = 0; ixy_3[j + 1] = 0; } } //filtering ix2,iy2 and ixy if (i > 1 && i <= rows + 1) { for (int j = 0; j <= cols - 1; j++) { cimg_1[j + 1] = cimg_2[j + 1]; cimg_2[j + 1] = cimg_3[j + 1]; cimg_3[j + 1] = 0; } for (int m = 1; m <= cols; m++) { ix2_f[m - 1] = (0.0113*ix2_1[m - 1] + 0.0838*ix2_1[m] + 0.0113*ix2_1[m + 1] + 0.0838*ix2_2[m - 1] + 0.6193*ix2_2[m] + 0.0838*ix2_2[m + 1] + 0.0113*ix2_3[m - 1] + 0.0838*ix2_3[m] + 0.0113*ix2_3[m + 1]); iy2_f[m - 1] = (0.0113*iy2_1[m - 1] + 0.0838*iy2_1[m] + 0.0113*iy2_1[m + 1] + 0.0838*iy2_2[m - 1] + 0.6193*iy2_2[m] + 0.0838*iy2_2[m + 1] + 0.0113*iy2_3[m - 1] + 0.0838*iy2_3[m] + 0.0113*iy2_3[m + 1]); ixy_f[m - 1] = (0.0113*ixy_1[m - 1] + 0.0838*ixy_1[m] + 0.0113*ixy_1[m + 1] + 0.0838*ixy_2[m - 1] + 0.6193*ixy_2[m] + 0.0838*ixy_2[m + 1] + 0.0113*ixy_3[m - 1] + 0.0838*ixy_3[m] + 0.0113*ixy_3[m + 1]); r = (((ix2_f[m - 1])*( iy2_f[m - 1]) - 
pow(ixy_f[m - 1],2)) - k*pow((ix2_f[m - 1] + iy2_f[m - 1]),2) ); if (r > thresh) { cimg_3[m] = r; } } } else if (i == rows + 2) { for (int j = 0; j <= cols - 1; j++) { cimg_1[j + 1] = cimg_2[j + 1]; cimg_2[j + 1] = cimg_3[j + 1]; cimg_3[j + 1] = 0; } } //non maxima suppression of corner map if (i > 2) { for (int m = 1; m <= cols; m++) { if (!((cimg_2[m] > cimg_2[m - 1]) && (cimg_2[m] > cimg_2[m + 1]) && (cimg_2[m] > cimg_1[m - 1]) && (cimg_2[m] > cimg_1[m]) && (cimg_2[m] > cimg_1[m + 1]) && (cimg_2[m] > cimg_3[m - 1]) && (cimg_2[m] > cimg_3[m]) && (cimg_2[m] > cimg_3[m + 1]))) { output[m - 1] = 0; } else { output[m - 1] = 1; } } ct = 0; for (int p=1;p<=125;p++) { //packing output bytes to an int res = (output[ct+3] << 24) | (output[ct+2] << 16) | (output[ct+1] << 8) | output[ct]; *out++ = res; ct = ct+4; } } } }

Again, please give me a hint about what to do about flow handling here. This might sound silly, but the reason I am asking for hints over and over is that I am doing this project by learning things on the go as they come up.

Another thing: following your suggestion, I looked at the output of top in a separate terminal while my main program was running, and it showed that this process was taking up >= 95% of the CPU the entire time. So, could the problem be due to this too?

What I am trying to do here is to identify the point (FPGA or Linux) where my problem lies. Once that is done, I can take it from there.

Thanks in advance.

by **support** »

Hello,

I'll start with some shameless promotion, since you're into HLS: Later this year (2016), an HLS-friendly revision of the Xillybus bundles will be available. It will allow simple block design connections between Xillybus and AXI-Streaming based HLS blocks in Vivado. I understand that you've struggled your way through the plumbing already, but in case someone else is about to start a project...

OK, so to your questions.

First, I'm glad to see that you divided the work into two processes. A lot of people miss this point. However, in the code you submitted, you pass the data from the child process back to the parent, so it kinda misses the point of having separate processes, which is letting each one run freely: One is pushing data, one is collecting it.
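A minimal sketch of that pattern, with one process that only pushes and one that only collects. An ordinary pipe stands in for the FPGA data path here, which is an assumption made so that the example runs anywhere. `push_and_collect` is a hypothetical name, and EINTR handling is left out for brevity.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>

// Sketch: the child only pushes data, the parent only collects it.
// A pipe stands in for the FPGA data path (assumption for illustration).
// Returns the number of bytes collected, or -1 on error.
long push_and_collect(std::size_t total_bytes) {
    int fds[2];
    if (pipe(fds) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        // Pusher: never reads, so it never waits for results.
        close(fds[0]);
        unsigned char chunk[4096] = {0};
        std::size_t sent = 0;
        while (sent < total_bytes) {
            std::size_t want = total_bytes - sent;
            if (want > sizeof(chunk)) want = sizeof(chunk);
            ssize_t n = write(fds[1], chunk, want);
            if (n <= 0) _exit(1);
            sent += static_cast<std::size_t>(n);
        }
        close(fds[1]);
        _exit(0);
    }

    // Collector: never writes, so it drains data as soon as it arrives.
    close(fds[1]);
    unsigned char chunk[4096];
    long got = 0;
    for (;;) {
        ssize_t n = read(fds[0], chunk, sizeof(chunk));
        if (n < 0) { got = -1; break; }
        if (n == 0) break;  // the pusher closed its end
        got += n;
    }
    close(fds[0]);

    int status = 0;
    waitpid(pid, &status, 0);
    return got;
}
```

Since neither side ever waits for the other's direction of traffic, the amount of data in flight can exceed what the channel buffers without anything getting stuck.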

The way it's written, you've indeed added a layer of storage that collects the data as soon as possible, instead of letting it dwell in the DMA buffers. But that would have worked OK anyhow, given that the DMA buffers are large enough to hold the intermediate data. Since you have some 250 kB there, it's not a problem.

One thing that's apparently wrong, is that you open /dev/xillybus_read_32 outside the loop, but close it inside the for-loop. That can't be right. I suppose you don't want to close it at all. Or any file descriptor. It's not clear why you open and close the pipe file either.

Another thing I noted is that in the parent process, you do the flushing inside the write() loop with a "write(fd,NULL,0);". That is not required, and actually wrong. You want to flush, if at all, after the whole image has been sent. That is, after the write() loop. But that doesn't cause any problem, it just slows things down a little.
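Something along these lines, as a rough sketch: the whole image goes out in a loop that copes with partial writes and errors, and the zero-length write() is issued once afterwards. `write_all` and `send_frame` are hypothetical helper names, not part of any Xillybus API.

```cpp
#include <unistd.h>
#include <cerrno>
#include <cstddef>

// Write exactly `len` bytes, handling partial writes and errors.
// Returns 0 on success, -1 on error (errno is set by write()).
int write_all(int fd, const unsigned char *buf, std::size_t len) {
    std::size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, just retry
            return -1;                     // never advance on an error
        }
        sent += static_cast<std::size_t>(n);
    }
    return 0;
}

// Send one frame, then flush once. The zero-length write() is the same
// flush request that appears in the posted code.
int send_frame(int fd, const unsigned char *frame, std::size_t len) {
    if (write_all(fd, frame, len) < 0) return -1;
    if (write(fd, nullptr, 0) < 0) return -1;  // flush after the whole image
    return 0;
}
```

Checking for a negative return value matters here: in the posted loops, a -1 from write() would be added to both the pointer and the byte counter.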

As a side note: It's not required to copy from buf to datarray while reading from fd2. You could just move the pointer given in the read() call. Again, a slight slowdown.
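A sketch of what moving the pointer could look like, reading straight into the destination with no intermediate buffer. `read_all` is a hypothetical helper name.

```cpp
#include <unistd.h>
#include <cerrno>
#include <cstddef>

// Read exactly `len` bytes straight into `dest`, advancing the pointer
// passed to read() instead of copying through an intermediate buffer.
// Returns 0 on success, -1 on error or if the stream ends early.
int read_all(int fd, unsigned char *dest, std::size_t len) {
    std::size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, dest + got, len - got);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, just retry
            return -1;
        }
        if (n == 0) return -1;  // EOF before the full frame arrived
        got += static_cast<std::size_t>(n);
    }
    return 0;
}
```

With a Mat allocated as in the posted code, the destination could presumably be img.data itself, with rows*cols*nchan as the length, so that datarray isn't needed either.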

If there's a problem on the FPGA side, I can't comment. What I can say, is that if the interface FIFOs' empty and full signals are respected by the code generated by HLS, there should be no data flow issues whatsoever.

Checking the CPU consumption with "top" is always a good idea. But again, if the data flow control is done correctly on the FPGA side as just mentioned, there should be no problems even if the CPU is completely loaded. It just makes the show less impressive.

Hope this helped a bit.

Regards,
Eli

by **Guest** »

OK, so I wrote code for detecting corners in an image on the ZedBoard, with the corner detection implemented in the FPGA. This code was written in C in Vivado HLS. The rest of the code, which transmits the image to the FPGA via Xillybus, runs on Xillinux.
It is written in C++ and uses the OpenCV library to read and display images.
The parent process writes to the FPGA; the child then reads the result from the FPGA and sends it back to the parent via a named FIFO.
The code is given below:

Code: Select all: #include <stdio.h> #include <sys/types.h> #include <unistd.h> #include <iostream> #include <stdlib.h> #include <opencv2/opencv.hpp> #include <opencv2/core/core.hpp> #include <opencv2/highgui/highgui.hpp> #include <opencv2/imgproc/imgproc.hpp> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> using namespace cv; using namespace std; int i = 0; VideoCapture cap(0); void ChildProcess(void); /* child process prototype */ void ParentProcess(void); /* parent process prototype */ int main() { pid_t pid; pid = fork(); if (pid != 0) ChildProcess(); else ParentProcess(); } void ChildProcess(void) { int fd2 = open("/dev/xillybus_read_32",O_RDONLY); if (fd2 < 1) { perror("open error"); } for(;;) { ///Read result from FPGA--500*500 bytes array/// int rows = 500; int cols = 500; int nchan = 1; int totalbytes = rows*cols*nchan; int buflen = cols*nchan; int ret; uchar buf[buflen]; uchar datarray[totalbytes]; Mat img(rows,cols,CV_8UC1); int j; int k = 0; int bread = 0; while(bread<totalbytes) { ret=read(fd2,buf,buflen); for ( j = 0 ; j<= (ret-1);j++ ) { datarray[j+k] = buf[j]; } k = k+ret; bread = bread+ret; } img.data = datarray; cout<<"Fpga receive succeeded"<<endl; close(fd2); int fd3 = open("/root/vidpipe",O_WRONLY); if (fd3 < 1) { perror("open error"); } totalbytes = img.total()*img.elemSize(); buflen = (img.cols); uchar *framepointer = img.data; int bwritten = 0; int ret2; uchar* buf2; buf2 = framepointer; while(bwritten<totalbytes) { ret2 = write(fd3,buf2,buflen); buf2 = buf2 + ret2; bwritten = bwritten+ret2; } cout<<"child to parent send succeeded"<<endl; close(fd3); } } void ParentProcess(void) { int check; int fd; int totalbytes; int buflen; int count = 0; fd = open("/dev/xillybus_write_32",O_WRONLY); if (fd < 1) { perror("open error"); } for(;;) { i++; cout<<"iteration "<<i<<endl; Mat frame; cap >> frame; frame = frame(Rect(10,10,500,500)); totalbytes = frame.total()*frame.elemSize(); buflen = (frame.cols); uchar 
*framepointer = frame.data; int bwritten = 0; int ret; uchar* buf; buf = framepointer; int num = totalbytes/buflen; while(bwritten<totalbytes) { ret = write(fd,buf,buflen); write(fd,NULL,0); buf = buf + ret; bwritten = bwritten+ret; cout<<"--------Loop1--------"<<endl; } ///Receive processed frame from FPGA/// int rows = 500; int cols = 500; int nchan = 1; totalbytes = rows*cols*nchan; int ret2; int fd1 = open("/root/vidpipe",O_RDONLY); if (fd1 < 1) { perror("open error"); } uchar buf2[buflen]; uchar datarray[totalbytes]; Mat img(rows,cols,CV_8UC1); int j; int k = 0; int bread = 0; while(bread<totalbytes) { ret2=read(fd1,buf2,buflen); for ( j = 0 ; j<= (ret-1);j++ ) { datarray[j+k] = buf2[j]; } k = k+ret2; bread = bread+ret2; } img.data = datarray; close(fd1); for( int p = 1; p <= img.rows; p++ ) { for( int q = 1; q <= img.cols; q++ ) { if( img.at<uchar>(p,q) == 1 ) { circle( frame, Point( p, q ), 5, Scalar(255,0,0), 2, 8, 0 ); } } } cout<<"Process frame receive succeeded"<<endl; imshow( "Received image in parent", frame ); waitKey(1); cout<<"--------Parent Ended --------"<<endl; } }

This code works fine on a single image (a 500*500*3 RGB image). I tested it on a chessboard image, and the result was a set of corners which I overlaid on the original image. Now, my actual goal is to use a stream of images from a webcam and overlay the resulting corners from the FPGA (a 500*500 binary image, with 1's representing corners) on the image.

Here comes the problem: The webcam stream runs for a while and then freezes, and the process appears to be blocking in the terminal (with the cursor blinking, of course).
The interesting thing is that for a given resolution, like 500*500 or 640*480, it runs for a particular number of iterations and then blocks. Sometimes the process appears to terminate, and when I restart it, it starts from the same number of iterations where it stopped. This has left me really confused.

Now, I cannot find what's wrong with what I am doing here.

Is it because my FPGA cannot keep up with the data rate, or because the FIFOs are full, etc.?
Or is it because the process running on Linux eats up all the resources?
Or is there something in the way I am using the child and parent processes that messes everything up?

I have tried changing to a smaller resolution for the stream, and this makes it run for more iterations, but it still freezes eventually. I have tried changing the buffer length, but this has no effect.

Please give me a hint as to what I should try. My C, system programming and FPGA skills are of intermediate level. I will explore any possible hint from your side to improve it. I have been stuck on this for weeks now.


Using Xillybus for a vision application
