Archived post by Wyeth

Good timing Luiz!

OK, draw calls. First off, the conversation is somewhat meaningless without knowing what target hardware we’re looking at, so this will all just be generalities. A draw call is basically the CPU saying to the GPU, “Hey, I know the general state of the following set of triangles, draw them.” It covers the type of things you’re drawing, the buffer to draw them to, their material state (is it opaque, translucent, whatever), and other such nonsense. Not all draw calls are created equal. Drawing 100 objects with an identical material is cheaper than drawing 100 objects with disparate materials, because there is a renderer state change between each draw call that takes time. In broad strokes, though, there’s little difference between a single object with 100 materials and 100 separate objects. However, 100 objects with an identical material will be significantly less expensive than 100 objects with varying render states.
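
The state-change point can be sketched with a toy batching example. This is not Unreal code; the `Draw` struct, the integer material IDs, and the idea of counting rebinds are all illustrative assumptions:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw record: which material/render state a mesh uses.
struct Draw { int materialId; int meshId; };

// Every time the material differs from the previous draw, the driver has
// to rebind render state between the two calls -- that's the slow part.
int countStateChanges(const std::vector<Draw>& draws) {
    int changes = 0;
    for (size_t i = 1; i < draws.size(); ++i)
        if (draws[i].materialId != draws[i - 1].materialId) ++changes;
    return changes;
}

// Grouping identical materials together means 100 objects sharing one
// material pay for a single state setup instead of up to 100.
void sortByMaterial(std::vector<Draw>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
        [](const Draw& a, const Draw& b) { return a.materialId < b.materialId; });
}
```

Sorting submissions by material like this is the basic idea behind the draw-sorting most renderers do internally; the point is that interleaved materials pay the state-change tax on every call.
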
The “size” of the draw call matters. There is a fixed overhead in asking the driver to do work. Submitting a single draw call for 3 vertices means the GPU is doing almost nothing, in essence stalled while the CPU submits its commands, stacks the draw calls into a command buffer, all that stuff. The “sweet spot” is different on every platform, but a broad, completely out of the ass generality is that there is little difference between asking the GPU to draw one triangle or 10,000, because the time to submit the draw call is longer than any work the GPU would do for it. On modern hardware, that number might even be 100,000. Breaking up a mesh, trading increased CPU burden for reduced GPU work, is rarely a win, within reason (depending on your total scene budget and hardware).
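
Here is a toy cost model of that fixed-overhead point. The per-call and per-triangle constants are invented, purely to show the shape of the tradeoff, not measurements from any real platform:

```cpp
#include <algorithm>

// Toy model: each draw call has a fixed CPU submission overhead; each
// triangle has a tiny GPU cost. CPU and GPU work overlap, so the frame
// takes as long as whichever side is the bottleneck. The constants are
// made up -- the point is the shape, not the numbers.
double frameTimeMs(long long drawCalls, long long triangles,
                   double cpuPerCallUs = 25.0,  // assumed submit overhead
                   double gpuPerTriNs = 1.0) {  // assumed per-triangle cost
    double cpuMs = drawCalls * cpuPerCallUs / 1000.0;
    double gpuMs = triangles * gpuPerTriNs / 1.0e6;
    return std::max(cpuMs, gpuMs);
}
```

Under this model, one call drawing 3 triangles costs exactly the same as one call drawing 10,000, because the fixed submission overhead dominates either way; and at a few thousand calls the CPU side swamps the triangle count entirely.
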

One thing that’s unintuitive is that, in the case of dynamic lighting or shadowing, most things in Unreal are actually drawn three separate times in the render pipeline: once to determine early Z (can I abandon work on all these pixels entirely by discarding them early in the pipe?), once to render the object itself, then a third time to render into a shadow buffer that’s projected for dynamic shadows. That’s why if you type “stat RHI” (the platform-independent implementation that talks to all the various graphics APIs) you will see your draw call number go down by 3 when you delete a single tree or whatever.
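
The per-pass arithmetic can be sketched like so. This is a deliberate simplification; real pass counts vary with lighting setup, cascaded shadow maps, translucency, and so on, and the two-pass no-shadow case is an assumption for illustration:

```cpp
// A dynamically lit, shadow-casting object is submitted once per pass,
// so each visible object contributes three draw calls; without dynamic
// shadows, assume two (depth prepass + base pass). Simplified on purpose.
enum Pass { DepthPrepass, BasePass, ShadowDepth, NumPasses };

int sceneDrawCalls(int visibleObjects, bool dynamicShadows) {
    int passesPerObject = dynamicShadows ? NumPasses : 2;
    return visibleObjects * passesPerObject;
}
```

So deleting one shadow-casting tree drops the "stat RHI" draw count by exactly 3, not 1.
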
The big thing to understand is that depending on how slow your CPU is, or how high a frame rate you’re targeting, draw calls are almost always your bottleneck in some form or another. On our first iPhone demo, given the state of the driver’s performance and the CPU speed on that device, we basically got 50 to 100 draw calls per scene, so combining assets was paramount. On Robo Recall, we had a reasonable GPU as a minimum spec and a pretty poor CPU, so draw calls were absolutely crucial there too. We used LODs extensively that would take an expensive building with 4-8 material IDs down to one at a near LOD, just to minimize draw calls. We stayed around 1500 or less at all times, mostly because the GPU was happy to render a few million vertices (1-2 million) but the time it took the CPU to do the draw call work was absolutely the bottleneck.

On a modern blazing fast i7, you might be able to submit 5k or 10k draw calls in that timeframe, and then render 20 or 40 million triangles on a Titan, and neither really cares. There’s a huge delta out there, but draw calls are typically more of a problem than triangle throughput. Again, caveats abound and every platform is different.
Another piece of the pie is occlusion culling. Early in the pipeline, the CPU compares bounding boxes to decide whether objects are completely covered by other ones. This is known as occlusion culling, and it’s determined solely by object bounds, which are generally owned by the CPU. It’s a chance to discard work before it ever makes it to the command buffer (the list of draw calls), which is ideal, and so it’s quite important for performance as a “pre draw call” optimization. Lots of small objects, IF they tend to occlude each other, can be valuable, but it’s something to be profiled. In the case of a small room full of books, rarely would a book occlude another book, and so the optimization is lost.
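
A crude sketch of the idea, with a big caveat: the real visibility test in Unreal involves depth-based GPU queries, not simple box containment. This stand-in only illustrates the part that matters here, that the decision is made on bounds, before any draw call is ever recorded:

```cpp
#include <vector>

struct AABB { float min[3], max[3]; };

// Crude stand-in for the real test: treat an object as occluded when its
// bounds sit entirely inside an occluder's bounds. (The engine actually
// uses depth-based queries; the point is only that culling happens on
// bounds, upstream of the command buffer.)
bool contains(const AABB& outer, const AABB& inner) {
    for (int i = 0; i < 3; ++i)
        if (inner.min[i] < outer.min[i] || inner.max[i] > outer.max[i])
            return false;
    return true;
}

// Only objects that survive culling are turned into draw calls.
int visibleCount(const std::vector<AABB>& objects, const AABB& occluder) {
    int visible = 0;
    for (const AABB& b : objects)
        if (!contains(occluder, b)) ++visible;
    return visible;
}
```

In the room-of-books case, almost no book's bounds ever land inside another book's, so nearly everything survives culling and the optimization buys you nothing.
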

In general, when thinking about optimization your thought process is:
– Do objects occlude each other? If not, combine them accordingly.
– At a distance, is it valuable to submit multiple draw calls for different material states, or can I combine them into a single material state to make the draw call more efficient from far away, since the difference in render state (cloth vs. opaque vs. masked or whatever) matters less at distance?
– Am I submitting a reasonable number of vertices that all share the same material or render state? If not, should I combine them?
– At a distance, am I drawing the minimum number of vertices needed to maintain the silhouette, and not a vertex more, ensuring that the GPU is always in sync with or ahead of the CPU and not waiting on draw submissions?
– What’s my total vertex count? Am I exceeding the comfort zone of my minimum hardware spec with regard to total vertex throughput?
In the case of a small room, at a VR min spec (low-end i5, NVIDIA 960), I’d be OK with about a thousand draw calls and maybe 1.5 million triangles. After that, it’s all profiling and measuring time.
Everything I just wrote is full of caveats and absolutely and completely wrong depending on what your target platform is. The only real way to proceed is try shit and when it’s too slow, figure out why.