include::meta/VK_NVX_device_generated_commands.txt[]

*Last Modified Date*::
    2017-07-25
*Contributors*::
  - Pierre Boudier, NVIDIA
  - Christoph Kubisch, NVIDIA
  - Mathias Schott, NVIDIA
  - Jeff Bolz, NVIDIA
  - Eric Werness, NVIDIA
  - Detlef Roettger, NVIDIA
  - Daniel Koch, NVIDIA
  - Chris Hebert, NVIDIA

This extension allows the device to generate a number of critical commands
for command buffers.

When rendering a large number of objects, the device can be leveraged to
implement a number of critical functions, like updating matrices, or
implementing occlusion culling, frustum culling, front-to-back sorting, etc.
Implementing those on the device does not require any special extension,
since an application is free to define its own data structures, and just
process them using shaders.

However, if the application desires to quickly kick off the rendering of the
final stream of objects, then unextended Vulkan forces the application to
read back the processed stream and issue graphics commands from the host.
For very large scenes, the synchronization overhead and the cost to generate
the command buffer can become the bottleneck.
This extension allows an application to generate a device-side stream of
state changes and commands, and convert it efficiently into a command buffer
without having to read it back on the host.

Furthermore, it allows incremental changes to such command buffers by
manipulating only partial sections of a command stream -- for example
pipeline bindings.
Unextended Vulkan requires re-creation of entire command buffers in such
scenarios, or updates synchronized on the host.

The intended usage for this extension is for the application to:

  * create its objects as in unextended Vulkan
  * create a slink:VkObjectTableNVX, and register the various Vulkan objects
    that are needed to evaluate the input parameters.
  * create a slink:VkIndirectCommandsLayoutNVX, which lists the
    elink:VkIndirectCommandsTokenTypeNVX it wants to dynamically change as
    an atomic command sequence.
    This step likely involves some internal device code compilation, since
    the intent is for the GPU to generate the command buffer in the
    pipeline.
  * fill the input buffers with the data for each of the inputs it needs.
    Each input is an array that will be filled with indices into the object
    table, instead of using CPU pointers.
  * set up a target secondary command buffer
  * reserve command buffer space via flink:vkCmdReserveSpaceForCommandsNVX
    in a target command buffer at the position where the generated
    commands should be executed.
  * call flink:vkCmdProcessCommandsNVX to create the actual device commands
    for all sequences based on the array contents into a provided target
    command buffer.
  * execute the target command buffer like a regular secondary command
    buffer

For each draw/dispatch, the following can be specified:

  * a different pipeline state object
  * a number of descriptor sets, with dynamic offsets
  * a number of vertex buffer bindings, with an optional dynamic offset
  * a different index buffer, with an optional dynamic offset

Applications should: register a small number of objects, and use dynamic
offsets whenever possible.

While the GPU can be faster than a CPU at generating the commands, the
generation may not happen asynchronously, therefore the primary use-case is
generating "`less`" total work (occlusion culling, classification to use
specialized shaders, etc.).

=== New Object Types

  * slink:VkObjectTableNVX
  * slink:VkIndirectCommandsLayoutNVX

=== New Flag Types

  * tlink:VkIndirectCommandsLayoutUsageFlagsNVX
  * tlink:VkObjectEntryUsageFlagsNVX

=== New Enum Constants

Extending elink:VkStructureType:

  ** ename:VK_STRUCTURE_TYPE_OBJECT_TABLE_CREATE_INFO_NVX
  ** ename:VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_NVX
  ** ename:VK_STRUCTURE_TYPE_CMD_PROCESS_COMMANDS_INFO_NVX
  ** ename:VK_STRUCTURE_TYPE_CMD_RESERVE_SPACE_FOR_COMMANDS_INFO_NVX
  ** ename:VK_STRUCTURE_TYPE_DEVICE_GENERATED_COMMANDS_LIMITS_NVX
  ** ename:VK_STRUCTURE_TYPE_DEVICE_GENERATED_COMMANDS_FEATURES_NVX

Extending elink:VkPipelineStageFlagBits:

  ** ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX

Extending elink:VkAccessFlagBits:

  ** ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX
  ** ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX

=== New Enums

  * elink:VkIndirectCommandsLayoutUsageFlagBitsNVX
  * elink:VkIndirectCommandsTokenTypeNVX
  * elink:VkObjectEntryUsageFlagBitsNVX
  * elink:VkObjectEntryTypeNVX

=== New Structures

  * slink:VkDeviceGeneratedCommandsFeaturesNVX
  * slink:VkDeviceGeneratedCommandsLimitsNVX
  * slink:VkIndirectCommandsTokenNVX
  * slink:VkIndirectCommandsLayoutTokenNVX
  * slink:VkIndirectCommandsLayoutCreateInfoNVX
  * slink:VkCmdProcessCommandsInfoNVX
  * slink:VkCmdReserveSpaceForCommandsInfoNVX
  * slink:VkObjectTableCreateInfoNVX
  * slink:VkObjectTableEntryNVX
  * slink:VkObjectTablePipelineEntryNVX
  * slink:VkObjectTableDescriptorSetEntryNVX
  * slink:VkObjectTableVertexBufferEntryNVX
  * slink:VkObjectTableIndexBufferEntryNVX
  * slink:VkObjectTablePushConstantEntryNVX

=== New Functions

  * flink:vkCmdProcessCommandsNVX
  * flink:vkCmdReserveSpaceForCommandsNVX
  * flink:vkCreateIndirectCommandsLayoutNVX
  * flink:vkDestroyIndirectCommandsLayoutNVX
  * flink:vkCreateObjectTableNVX
  * flink:vkDestroyObjectTableNVX
  * flink:vkRegisterObjectsNVX
  * flink:vkUnregisterObjectsNVX
  * flink:vkGetPhysicalDeviceGeneratedCommandsPropertiesNVX

=== Issues

1) How to name this extension?

*RESOLVED*: `VK_NVX_device_generated_commands`

As usual, one of the hardest issues ;)

Alternatives: `VK_gpu_commands`, `VK_execute_commands`,
`VK_device_commands`, `VK_device_execute_commands`, `VK_device_execute`,
`VK_device_created_commands`, `VK_device_recorded_commands`,
`VK_device_generated_commands`

2) Should we use serial tokens or redundant sequence description?

Similarly to slink:VkPipeline, signatures have the most likelihood to be
cross-vendor adoptable.
They also benefit from being processable in parallel.

3) How to name the sequence description?

stext:ExecuteCommandSignature is a bit long.
Maybe just stext:ExecuteSignature, or actually more following Vulkan
nomenclature: slink:VkIndirectCommandsLayoutNVX.

4) Do we want to provide code:indirectCommands inputs with layout or at
code:indirectCommands time?

Separate layout from data as Vulkan does.
Provide full flexibility for code:indirectCommands.

5) Should the input be provided as SoA or AoS?

It is desirable for the application to reuse the list of objects and render
them with some kind of an override.
This can be done by just selecting a different input for a push constant or
a descriptor set, if they are defined as independent arrays.
If the data was interleaved, this would not be as easily possible.

Allowing input divisors can also reduce the conservative command buffer
allocation.
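
A minimal sketch of why independent arrays make such an override cheap, written as plain C with no Vulkan dependency. The `TokenStream` and `SequenceInputs` types and the `with_push_constant_override` helper are hypothetical names used only for illustration; in the extension each stream would roughly correspond to one slink:VkIndirectCommandsTokenNVX input.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one independent input array per token (SoA). */
typedef struct TokenStream {
    const uint32_t *objectIndices; /* indices into the object table */
    uint32_t        count;         /* number of entries in the array */
} TokenStream;

typedef struct SequenceInputs {
    TokenStream pipelines;
    TokenStream descriptorSets;
    TokenStream pushConstants;
} SequenceInputs;

/* Derive the inputs for a second pass that reuses every stream of `base`
 * except the push constant stream. Only a pointer is swapped; no
 * per-object data is copied or re-interleaved. */
static SequenceInputs with_push_constant_override(SequenceInputs base,
                                                  TokenStream pushConstants)
{
    /* The override must cover the same number of entries. */
    assert(pushConstants.count == base.pushConstants.count);
    base.pushConstants = pushConstants;
    return base;
}
```

With an interleaved (AoS) layout the same override would require rewriting every record, since the push constant data would be embedded between the other inputs.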

6) How do we know the size of the GPU command buffer generated by
flink:vkCmdProcessCommandsNVX?

pname:maxSequencesCount can give an upper estimate, even if the actual count
is sourced from the GPU buffer at (buffer, countOffset).
As such pname:maxSequencesCount must always be set correctly.

Developers are encouraged to make good use of
slink:VkIndirectCommandsLayoutNVX's ptext:pTokens[].divisor, as divisors
allow less conservative storage costs.
Pipeline changes on a per-draw basis, in particular, can be costly
memory-wise.
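
The effect of a divisor on input storage can be sketched as follows. This assumes that the input entry read by sequence `s` is `s / divisor`, so that `divisor` consecutive sequences share one entry; the helper names are hypothetical and the code is plain C, not extension API.

```c
#include <assert.h>
#include <stdint.h>

/* Entry of a token's input array that sequence `s` reads, assuming
 * `divisor` consecutive sequences share one entry. */
static uint32_t input_entry_for_sequence(uint32_t s, uint32_t divisor)
{
    assert(divisor > 0);
    return s / divisor;
}

/* Number of entries the input array needs in order to cover
 * maxSequencesCount sequences (rounded up). */
static uint32_t input_entries_required(uint32_t maxSequencesCount,
                                       uint32_t divisor)
{
    assert(divisor > 0);
    return maxSequencesCount / divisor + (maxSequencesCount % divisor != 0);
}
```

Under that assumption, 1000 sequences with a pipeline token divisor of 64 need 16 pipeline entries instead of 1000.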

7) How to deal with dynamic offsets in DescriptorSets?

Maybe an additional token etext:VK_EXECUTE_DESCRIPTOR_SET_OFFSET_COMMAND_NVX
that works for a "`single dynamic buffer`" descriptor set and then uses (32
bit tableEntry + 32 bit offset).

Added a dynamicCount field, making the input variable-sized.
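
As a sketch of the resulting variable-sized input, assume each descriptor set entry is a 32-bit object table index followed by `dynamicCount` 32-bit dynamic offsets. The function names below are hypothetical, and the layout is an assumption drawn from the discussion above rather than a normative statement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per entry for a descriptor set token: one 32-bit table index
 * plus dynamicCount 32-bit offsets (assumed layout). */
static size_t descriptor_set_entry_stride(uint32_t dynamicCount)
{
    return sizeof(uint32_t) * (1u + (size_t)dynamicCount);
}

/* Write entry `i` into a tightly packed input array of such entries. */
static void write_descriptor_set_entry(uint32_t *input, uint32_t i,
                                       uint32_t dynamicCount,
                                       uint32_t tableIndex,
                                       const uint32_t *dynamicOffsets)
{
    assert(input != NULL);
    assert(dynamicCount == 0 || dynamicOffsets != NULL);
    uint32_t *entry = input + (size_t)i * (1u + dynamicCount);
    entry[0] = tableIndex;
    for (uint32_t d = 0; d < dynamicCount; ++d)
        entry[1 + d] = dynamicOffsets[d];
}
```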

8) Should we allow updates to the object table, similar to DescriptorSet?

Desired yes, people may change "`material`" shaders and not want to recreate
the entire register table.
However, the developer must ensure not to overwrite a registered objectIndex
while it is still being used.

9) Should we allow dynamic state changes?

Seems a bit excessive for a "`per-draw`" type of scenario, but the GPU could
partition work itself with viewport/scissor...

10) How do we allow re-using already "`filled`" code:indirectCommands
buffers?

Just use a slink:VkCommandBuffer for the output, and it can be reused
easily.

11) How portable should such re-use be?

Same as a secondary command buffer.

12) Should sequenceOrdered be part of IndirectCommandsLayout or
flink:vkCmdProcessCommandsNVX?

Seems better for IndirectCommandsLayout, as that is when most heavy lifting
in terms of internal device code generation is done.

13) Under which conditions is flink:vkCmdProcessCommandsNVX legal?

Options:

a) on the host command buffer like a regular draw call

b) flink:vkCmdProcessCommandsNVX makes use of slink:VkCommandBufferBeginInfo
   and serves as flink:vkBeginCommandBuffer / flink:vkEndCommandBuffer
   implicitly.

c) The pname:targetCommandBuffer must be inside the "`begin`" state already
   at the moment of being passed.
   This very likely suggests a new tlink:VkCommandBufferUsageFlags
   etext:VK_COMMAND_BUFFER_USAGE_DEVICE_GENERATED_BIT.

d) The pname:targetCommandBuffer must reserve space via a new function.

Used a) and d).

14) What if different pipelines have different DescriptorSetLayouts at a
certain set unit that mismatches in code:token.dynamicCount?

Considered legal, as long as the maximum dynamic count of all used
DescriptorSetLayouts is provided.

15) Should we add "`strides`" to input arrays, so that "`Array of
Structures`" type setups can be supported more easily?

Maybe provide a usage flag for packed tokens stream (all inputs from same
buffer, implicit stride).

No, given that the performance test was worse.

16) Should we allow re-using the target command buffer directly, without
need to reset command buffer?

YES: new API flink:vkCmdReserveSpaceForCommandsNVX.

17) Is flink:vkCmdProcessCommandsNVX copying the input data or referencing
it?

There are multiple implementations possible:

  * one could have some emulation code that parses the inputs, and generates
    an output command buffer, therefore copying the inputs.
  * one could just reference the inputs, and have the processing done in
    pipe at execution time.

If the data is mandated to be copied, then it puts a penalty on
implementations that could process the inputs directly in pipe.
If the data is "`referenced`", then it allows both types of implementation.

The inputs are "`referenced`", and should not be modified after the call to
flink:vkCmdProcessCommandsNVX and until after the rendering of the target
command buffer is finished.

18) Why is this `NVX` and not `NV`?

To allow early experimentation and feedback.
We expect that a version with a refined design as a multi-vendor variant
will follow up.

19) Should we make the availability for each token type a device limit?

Only distinguish between graphics/compute for now, further splitting up may
lead to too much fragmentation.

20) When can the pname:objectTable be modified?

Similar to the other inputs for flink:vkCmdProcessCommandsNVX, only when all
device access via flink:vkCmdProcessCommandsNVX or execution of the target
command buffer has completed can an object at a given objectIndex be
unregistered or re-registered again.

21) Which buffer usage flags are required for the buffers referenced by
flink:vkCmdProcessCommandsNVX?

Reuse the existing ename:VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT for:

  * slink:VkCmdProcessCommandsInfoNVX::pname:sequencesCountBuffer
  * slink:VkCmdProcessCommandsInfoNVX::pname:sequencesIndexBuffer
  * slink:VkIndirectCommandsTokenNVX::pname:buffer

22) In which pipeline stage does the device-generated command expansion
happen?

flink:vkCmdProcessCommandsNVX is treated as if it occurs in a separate
logical pipeline from either graphics or compute, and that pipeline only
includes ename:VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, a new stage
ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX, and
ename:VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT.
This new stage has two corresponding new access types,
ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX and
ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX, used to synchronize reading
the buffer inputs and writing the command buffer memory output.
The output written in the target command buffer is considered to be consumed
by the ename:VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT pipeline stage.

Thus, to synchronize from writing the input buffers to executing
flink:vkCmdProcessCommandsNVX, use:

  * pname:dstStageMask = ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX
  * pname:dstAccessMask = ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX

To synchronize from executing flink:vkCmdProcessCommandsNVX to executing the
generated commands, use:

  * pname:srcStageMask = ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX
  * pname:srcAccessMask = ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX
  * pname:dstStageMask = ename:VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
  * pname:dstAccessMask = ename:VK_ACCESS_INDIRECT_COMMAND_READ_BIT

When flink:vkCmdProcessCommandsNVX is used with a pname:targetCommandBuffer
of `NULL`, the generated commands are immediately executed and there is
implicit synchronization between generation and execution.

23) What if most token data is "`static`", but we frequently want to render
a subsection?

Added "`sequencesIndexBuffer`".
This makes it easier to sort and filter what should actually be processed.
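
One way to picture this: the token data stays untouched, and a compact list of sequence indices selects the subset to process. The sketch below builds such a list on the host in plain C. `build_sequence_index_list` and the visibility array are hypothetical; in practice the list would more likely be written by a device-side culling pass into the buffer used as pname:sequencesIndexBuffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the indices of all visible sequences into outIndices, in
 * ascending order, and return how many were written.
 * outIndices must have room for sequenceCount entries. */
static uint32_t build_sequence_index_list(const uint8_t *visible,
                                          uint32_t sequenceCount,
                                          uint32_t *outIndices)
{
    assert(visible != NULL && outIndices != NULL);
    uint32_t used = 0;
    for (uint32_t s = 0; s < sequenceCount; ++s) {
        if (visible[s])
            outIndices[used++] = s;
    }
    return used;
}
```

The returned count plays the role of the actual sequence count, while the total number of sequences remains a valid upper bound for pname:maxSequencesCount.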

=== Example Code

Open-source samples illustrating the usage of the extension can be found at
the following locations:

https://github.com/nvpro-samples/gl_vk_threaded_cadscene/blob/master/doc/vulkan_nvxdevicegenerated.md

https://github.com/NVIDIAGameWorks/GraphicsSamples/tree/master/samples/vk10-kepler/BasicDeviceGeneratedCommandsVk

[source,c]
---------------------------------------------------

  // set up the secondary command buffer
    vkBeginCommandBuffer(generatedCmdBuffer, &beginInfo);
    // ... set up its state as usual

  // insert the reservation (there can only be one per command buffer)
  // where the generated calls should be filled into
    VkCmdReserveSpaceForCommandsInfoNVX reserveInfo = { VK_STRUCTURE_TYPE_CMD_RESERVE_SPACE_FOR_COMMANDS_INFO_NVX };
    reserveInfo.objectTable = objectTable;
    reserveInfo.indirectCommandsLayout = deviceGeneratedLayout;
    reserveInfo.maxSequencesCount = myCount;
    vkCmdReserveSpaceForCommandsNVX(generatedCmdBuffer, &reserveInfo);

    vkEndCommandBuffer(generatedCmdBuffer);

  // trigger the generation at some point in another primary command buffer
    VkCmdProcessCommandsInfoNVX processInfo = { VK_STRUCTURE_TYPE_CMD_PROCESS_COMMANDS_INFO_NVX };
    processInfo.objectTable = objectTable;
    processInfo.indirectCommandsLayout = deviceGeneratedLayout;
    processInfo.maxSequencesCount = myCount;
    // set the target of the generation (if null we would directly execute with mainCmd)
    processInfo.targetCommandBuffer = generatedCmdBuffer;
    // provide input data
    processInfo.indirectCommandsTokenCount = 3;
    processInfo.pIndirectCommandsTokens = myTokens;

  // If you modify the input buffer data referenced by VkCmdProcessCommandsInfoNVX,
  // ensure you have added the appropriate barriers prior to the generation process.
  // When regenerating the content of the same reserved space, ensure prior operations have completed.

    VkMemoryBarrier memoryBarrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
    memoryBarrier.srcAccessMask = ...;
    memoryBarrier.dstAccessMask = VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX;

    vkCmdPipelineBarrier(mainCmd,
                         /*srcStageMask*/VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         /*dstStageMask*/VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX,
                         /*dependencyFlags*/0,
                         /*memoryBarrierCount*/1,
                         /*pMemoryBarriers*/&memoryBarrier,
                         ...);

    vkCmdProcessCommandsNVX(mainCmd, &processInfo);
    ...
  // execute the secondary command buffer and ensure the processing that modifies command-buffer content
  // has completed

    memoryBarrier.srcAccessMask = VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX;
    memoryBarrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;

    vkCmdPipelineBarrier(mainCmd,
                         /*srcStageMask*/VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX,
                         /*dstStageMask*/VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                         /*dependencyFlags*/0,
                         /*memoryBarrierCount*/1,
                         /*pMemoryBarriers*/&memoryBarrier,
                         ...);
    vkCmdExecuteCommands(mainCmd, 1, &generatedCmdBuffer);

---------------------------------------------------

=== Version History

 * Revision 3, 2017-07-25 (Chris Hebert)
   - Correction to specification of dynamicCount for push_constant token in
     VkIndirectCommandsLayoutNVX.
     Stride was incorrectly computed as dynamicCount was not treated as byte
     size.
 * Revision 2, 2017-06-01 (Christoph Kubisch)
   - header compatibility break: add missing _TYPE to
     VkIndirectCommandsTokenTypeNVX and VkObjectEntryTypeNVX enums to follow
     Vulkan naming convention
   - behavior clarification: only allow a single work provoking token per
     sequence when creating a slink:VkIndirectCommandsLayoutNVX
 * Revision 1, 2016-10-31 (Christoph Kubisch)
   - Initial draft