Reading the OpenGL backbuffer to system memory

Sometimes you need to read back the OpenGL backbuffer, or another framebuffer. In my case the question was how to read back a downsampled framebuffer, or rather its texture, to system memory. There are different methods for this, and I wrote a small benchmark to test them on various systems.

UPDATE #1: A thing I had totally overlooked is that you can actually blit the backbuffer to the downsampled framebuffer directly, saving you the FBO overhead (memory, setup, rendering time). I've updated the page and the code to reflect that.
UPDATE #2: The code now works on Windows and Linux (via GLX). Maybe I'll port it to GLES2 for the Raspberry Pi too... :)
UPDATE #3: The code works on the Raspberry Pi now using OpenGL ES 2.0. Reading the framebuffer is quite slow though, even on an overclocked device. It's not the actual reading that is slow, but that we have to render to another framebuffer first and back to the screen. At the moment all buffers and textures use RGBA8888, which might slow things down; I'll try RGB565 in the future. Also there are no PBOs and no glGetTexImage(), so only the glReadPixels method works... If you have any ideas on how to make the code faster, I'd love to hear them.
UPDATE #4: I tinkered around with color formats a bit and updated the Pi's firmware and MESA implementation. I tried changing the FBO's color format to 16 bit, but that didn't change much. The secret seems to lie in using an RGB565 EGL display/framebuffer format with no alpha, no depth, no stencil, and in removing the glDiscardFramebuffer calls. The FBO backbuffer and the downsampled FBO still use RGBA8888. Together with a screen size of 640x480 this nearly quadrupled the frame rate; blitting the FBO to the screen now seems to be much, much faster... I updated the code accordingly, did some cleanup, and found a safer way to get function addresses on Linux systems.

Prerequisites

I needed to downsample the framebuffer. This can be done by blitting the backbuffer to a smaller framebuffer, which is then read back. For setting up framebuffers, see here. In your rendering loop, do:

// set viewport to window size
glViewport(0, 0, width, height);

// draw stuff here

// blit backbuffer to downsampled buffer
glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, smallId);
glBlitFramebuffer(0, 0, width, height, 0, 0, smallWidth, smallHeight, GL_COLOR_BUFFER_BIT, GL_LINEAR);


// read back buffer content here (Method #1, #2, #3)
// unbind again
glBindFramebuffer(GL_FRAMEBUFFER, 0);


Blitting the framebuffer with glBlitFramebuffer isn't much faster than binding a texture and rendering a quad to another framebuffer, but it needs much less setup and you don't find yourself setting up projection or modelview matrices and trashing the state.

Method #1 - glReadPixels

The first method uses a standard glReadPixels. This has been the common way to read the backbuffer since ancient OpenGL times. Do (depending on your framebuffer format):

// bind downsampled buffer for reading
glBindFramebuffer(GL_FRAMEBUFFER, smallId);

glReadPixels(0, 0, smallWidth, smallHeight, GL_RGBA, GL_UNSIGNED_BYTE, downsampleData);
// unbind buffer again
glBindFramebuffer(GL_FRAMEBUFFER, 0);

The data will end up in downsampleData in system memory. Note that you need to allocate space for that data beforehand!

Method #2 - glGetTexImage

The second method does not read the actual buffer, but the texture attached to it. That can be done using glGetTexImage. Do (depending on your framebuffer format):

// bind downsampled texture
glBindTexture(GL_TEXTURE_2D, downsampleTextureId);

// read from bound texture to CPU
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, downsampleData);
// unbind texture again
glBindTexture(GL_TEXTURE_2D, 0);


The data will end up in downsampleData in system memory again. Note that you need to allocate space for that data before!

Method #3 - PixelBufferObjects

This is the most complicated method, but slightly faster than the previous two. You read the downsampled framebuffer using glReadPixels again, but now you read it into a Pixel Buffer Object (PBO) asynchronously. For that, you set up two PBOs and alternate between them, so the GPU can keep rendering into one while you process the other. The download of the data to system memory happens via DMA and does not block the CPU. To do this, first create two PBOs:

int readIndex = 0;
int writeIndex = 1;
GLuint pbo[2];


// create PBOs to hold the data. this allocates memory for them too
glGenBuffers(2, pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[0]);
glBufferData(GL_PIXEL_PACK_BUFFER, smallWidth * smallHeight * smallDepth, nullptr, GL_STREAM_READ);

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1]);
glBufferData(GL_PIXEL_PACK_BUFFER, smallWidth * smallHeight * smallDepth, nullptr, GL_STREAM_READ);
// unbind buffers for now
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);


Now that you have created the PBOs, go to your rendering loop and read the data back there:

// bind downsampled fbo
downsample->bind();
// swap PBOs each frame
writeIndex = (writeIndex + 1) % 2;
readIndex = (writeIndex + 1) % 2;
// bind PBO to read pixels. This buffer is being copied from GPU to CPU memory
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[writeIndex]);
// copy from framebuffer to PBO asynchronously. it will be ready in the NEXT frame
glReadPixels(0, 0, downsample->getWidth(), downsample->getHeight(), GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
// now read other PBO which should be already in CPU memory
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[readIndex]);
// map buffer so we can access it
downsampleData = (unsigned char *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (downsampleData) {
    // ok. the data is now available and we could process it or copy it somewhere
    // ...
    // unmap the buffer again
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    downsampleData = nullptr;
}
// back to conventional pixel operation
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
// unbind downsampled fbo
downsample->unbind();
// in the next frame readIndex and writeIndex will be swapped and we'll read from what is currently writeIndex...

The data will end up in downsampleData in system memory. This method seems to be slightly faster on some systems, but the gain depends heavily on the frame rate.

Results

The results depend on the system, the method used and the read color format. Here are my benchmark results. The first column is regular rendering to the backbuffer, no FBOs involved. Each method was then tested first without actually reading the buffer (just rendering to the downsampled FBO), and then reading the downsampled buffer to system memory in two different formats:
| System | No FBOs | #1 no reading | #1 BGRA | #1 RGBA | #2 no reading | #2 BGRA | #2 RGBA | #3 no reading | #3 BGRA | #3 RGBA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intel Core i7 960, Nvidia Quadro 600, Driver 310.90, OpenGL 4.3, 640x480 | 0.39 | 0.54 | 0.70 | 0.67 | 0.54 | 0.88 | 0.87 | 0.55 | 0.64 | 0.55 |
| Intel Core i5 480M, AMD Radeon 5650M, Driver Catalyst 13.1, OpenGL 4.2, 640x480 | 0.30 | 0.41 | 0.80 | 0.77 | 0.41 | 0.81 | 0.81 | 0.42 | 0.51 | 0.49 |
| Intel Core i5 480M, Integrated HD Graphics, Driver 8.15.10.2827, OpenGL 2.1, 640x480 | 0.73 | 0.76 | 1.75 | 1.52 | 0.78 | 5.26 | 2.56 | 0.79 | 1.77 | 1.70 |

Windows 7 x64, frame time in ms, vsync off, data size 160x90x32bit

| System | No FBOs | #1 no reading | #1 BGRA | #1 RGBA | #2 no reading | #2 BGRA | #2 RGBA | #3 no reading | #3 BGRA | #3 RGBA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intel Core i5 480M, Integrated HD Graphics, Driver Mesa DRI Intel Ironlake Mobile, OpenGL 2.1, GLX 1.4, MESA 9.0.2, 640x480 | 1.13 | 1.46 | 3.25 | 16.91 | 1.97 | 3.40 | 16.77 | 1.89 | 4.22 | 17.05 |
| Intel Core i3-2120T, Integrated HD Graphics 2000, Driver Mesa DRI Intel Sandybridge Desktop x86/MMX/SSE2, OpenGL 3.0, GLX 1.4, MESA 9.0.2, 640x480 | 0.78 | 1.14 | 1.30 | 1.77 | 1.13 | 1.29 | 1.55 | 1.13 | 1.30 | 1.77 |

Ubuntu 12.10 x64, frame time in ms, vsync off ("export vblank_mode=0"), data size 160x90x32bit

| System | No FBOs | #1 no reading | #1 BGRA | #1 RGBA | #2 no reading | #2 BGRA | #2 RGBA | #3 no reading | #3 BGRA | #3 RGBA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intel Core i5 480M, Integrated HD Graphics, Driver Mesa DRI Intel Ironlake Mobile, OpenGL 2.1, GLX 1.4, MESA 17.0.7, 640x480 | 1.08 | 1.04 | 1.48 | 1.42 | 1.03 | 1.44 | 1.42 | 1.04 | 1.07 | 1.07 |

Ubuntu 16.04 x64, frame time in ms, vsync off ("export vblank_mode=0"), data size 160x90x32bit

| System | No FBOs | #1 no reading | #1 BGRA | #1 RGBA | #2 no reading | #2 BGRA | #2 RGBA | #3 no reading | #3 BGRA | #3 RGBA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intel Core i7 7500U, Mesa DRI Intel(R) HD Graphics 620 (Kabylake GT2), OpenGL 3.0, GLX 1.4, MESA 11.2.0, 640x480 | 0.20 | 0.22 | 0.28 | 0.29 | 0.22 | 0.29 | 0.28 | 0.22 | 0.25 | 0.24 |

Ubuntu 16.04.3 x64, frame time in ms, vsync off ("export vblank_mode=0"), data size 160x90x32bit

| System | No FBOs | #1 no reading | #1 BGRA | #1 RGBA | #2 no reading | #2 BGRA | #2 RGBA | #3 no reading | #3 BGRA | #3 RGBA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Raspberry Pi, OpenGL ES 2.0, EGL 1.4, 640x480 | 2.37 | 4.49 | 4.54 | 6.25 | - | - | - | - | - | - |

Raspbian (updates: March 25th, 2013), frame time in ms, vsync off (eglSwapInterval(0)), data size 160x90x32bit

Source code

The source code, a CMake build file and the benchmark results can be found on GitHub. Building should work at least on Windows 7, Ubuntu 12.10 (via GLX on X11) and on a current Raspbian for the Raspberry Pi (via EGL). The beef is in the Test_... classes. To do a benchmark, run the application from the command line and wait for it to finish. It'll spit out frame time values for the different methods and color formats.
I'd love to get your feedback on how else to read to system memory or maybe speed up the methods. Just drop me a comment.

Comments

jdobmeier said…
With some SoCs, in particular the RPi, you can get true DMA as opposed to mere memory-mapped I/O. A call to glMapBuffer returns a pointer to the actual buffer in GPU address space (since all memory is, in principle, mutually accessible via the front side bus), so you can avoid even the latency associated with transfer over the PCI bus. This "issue" on GitHub demonstrates one method: https://github.com/raspberrypi/firmware/issues/85 There is also a lot of info in the bare metal section of the RPi forum. Also here: http://www.cl.cam.ac.uk/freshers/raspberrypi/tutorials/os/screen01.html

If you do end up porting what you have above I would be willing to help test. I am also working toward getting this same functionality implemented for the Pi to support some GPGPU experiments.
Kim said…
Sounds nice. Thanks for the info!
There's a reason I'm actually trying this (as you can grasp from the blog posts): to get XBMC to render all GL-rendered stuff to my ambilight via boblight (not only videos). I threw together those examples to convince the XBMC people (and myself) that performance is not an issue.
I'm currently porting to the Pi, but I wasn't familiar with EGL before, there's no glBlitFramebuffer, eglGetProcAddress doesn't work properly, I'm trying out new things and so on... :) So it takes time...
You should be able to drop me an email now if you're interested. I can send you the code package when it's done. Or just check the page from time to time. No deadlines though ;)
Kim said…
@jdobmeier: The test code is working, but the problem is that glMapBuffer only supports vertex or index buffers, no PBOs or other stuff AND only supports writes, so even rendering to a vertex buffer is of no use. All the examples you mentioned show that reading the actual framebuffer (what is on screen) is possible, but not just any FBO. The trick in my code is supposed to be that the more powerful GPU does the downsampling and frees the CPU from that task... So your approach actually doesn't make sense in my case...
If you have any idea on how to get the address of an FBO attachment that would be great though.
Kim said…
Btw. The documentation of the GL_OES_mapbuffer extension is here: http://www.khronos.org/registry/gles/extensions/OES/OES_mapbuffer.txt
Gregory Tetard said…
hello,

Nadnerb from raspberry pi forum made a guide on setting up boblight on a Raspberry Pi with, hopefully, every step you need to take to get it up and running.
This includes the hardware and software elements and uses 50 WS2801 LEDs.

here after is the link:

http://bit.ly/pi_boblight
Bim said…
Thanks for the info, but it does not seem like the boblight forks or the omxplayer thingie capture from OpenGL, which is what I wanted...
I wanted boblight not only when playing videos, but also in the menu, visualization etc.
D. Cerisano said…
Not sure why, but following this thread just reduced my GL to H264 transcoder to almost nothing. Time for bed.
Anton said…
Why not do glReadPixels after glUnmapBuffer? Same speed, minus one buffer.
Bim said…
Not sure what you mean. The idea is to hide the latency of copying to CPU memory, and not stall the CPU and GPU. Therefore 2 buffers are used, one that is used to copy from GPU to CPU memory asynchronously (writeIndex) and the other that is already copied and can be read from (readIndex). These buffers are swapped every frame.
With only one buffer you'd stall the GPU.
You are right, whether this is a real gain depends on your GPU + driver + system and probably on how long your rendering takes...
Moe Ho said…
Thanks for the good info! Read through the code, couldn't get it to work on 16.04, but I'm running amdgpu pro drivers... get this error:

build$ ./read_test
GLX is supported.
XF86 VideoMode extension version is 2.2
24 video modes found.
20 suitable framebuffer modes found. Using config #0:
Color buffer: R8G8B8A8, no multisampling
Depth buffer: D24S0
Double buffering
glXCreateContextAttribs is available.
Trying to get an OpenGL 3.0 context... worked.
Got a direct context.
Failed to bind OpenGL function "glXSwapInterval"!
Failed to get all function bindings!
Segmentation fault (core dumped)

I'll look around and see if I see something simple.

Had a question: I've been working with GLFW, GLEW, OpenCV and sometimes OpenCL. I've found a few workarounds, but nothing really direct to render to the backbuffer and use that rendering in an OpenCV UMat (image) that still resides on the GPU. I'm trying to avoid the GPU-to-CPU transfer, since I plan on doing some post-processing on the GPU with the CV library... so I often have to transfer it right back to the GPU to process it. Thanks in advance!
Bim said…
Not sure where it crashes by looking at the output, but if you find out, pull requests are welcome :)
Not sure how OpenCV works nowadays, but back when I was using it, it was CPU-only. If it supports OpenCL now, sharing OpenGL textures with OpenCL is possible (see: https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/gl_sharing.html and https://stackoverflow.com/questions/8824269/gl-cl-interoperability-shared-texture).
Best of luck.