VK_NVX_device_generated_commands.txt
include::meta/VK_NVX_device_generated_commands.txt[]

*Last Modified Date*::
    2017-07-25
*Contributors*::
  - Pierre Boudier, NVIDIA
  - Christoph Kubisch, NVIDIA
  - Mathias Schott, NVIDIA
  - Jeff Bolz, NVIDIA
  - Eric Werness, NVIDIA
  - Detlef Roettger, NVIDIA
  - Daniel Koch, NVIDIA
  - Chris Hebert, NVIDIA

This extension allows the device to generate a number of critical commands
for command buffers.

When rendering a large number of objects, the device can be leveraged to
implement a number of critical functions, like updating matrices, or
implementing occlusion culling, frustum culling, front-to-back sorting, etc.
Implementing those on the device does not require any special extension,
since an application is free to define its own data structures and process
them using shaders.

However, if the application desires to quickly kick off the rendering of the
final stream of objects, then unextended Vulkan forces the application to
read back the processed stream and issue graphics commands from the host.
For very large scenes, the synchronization overhead and the cost of
generating the command buffer can become the bottleneck.
This extension allows an application to generate a device-side stream of
state changes and commands, and convert it efficiently into a command buffer
without having to read it back on the host.

Furthermore, it allows incremental changes to such command buffers by
manipulating only partial sections of a command stream -- for example
pipeline bindings.
Unextended Vulkan requires re-creation of entire command buffers in such a
scenario, or updates synchronized on the host.

The intended usage for this extension is for the application to:

  * create its objects as in unextended Vulkan
  * create a slink:VkObjectTableNVX, and register the various Vulkan objects
    that are needed to evaluate the input parameters.
  * create a slink:VkIndirectCommandsLayoutNVX, which lists the
    elink:VkIndirectCommandsTokenTypeNVX it wants to dynamically change as
    an atomic command sequence.
    This step likely involves some internal device code compilation, since
    the intent is for the GPU to generate the command buffer in the
    pipeline.
  * fill the input buffers with the data for each of the inputs it needs.
    Each input is an array whose elements are indices into the object
    table, instead of CPU pointers.
  * set up a target secondary command buffer
  * reserve command buffer space via flink:vkCmdReserveSpaceForCommandsNVX
    in a target command buffer at the position where the generated
    commands are to be executed.
  * call flink:vkCmdProcessCommandsNVX to create the actual device commands
    for all sequences based on the array contents into a provided target
    command buffer.
  * execute the target command buffer like a regular secondary command
    buffer

For each draw/dispatch, the following can be specified:

  * a different pipeline state object
  * a number of descriptor sets, with dynamic offsets
  * a number of vertex buffer bindings, with an optional dynamic offset
  * a different index buffer, with an optional dynamic offset

Applications should: register a small number of objects, and use dynamic
offsets whenever possible.

While the GPU can be faster than a CPU at generating the commands,
generation may not happen asynchronously; the primary use-case is therefore
generating "`less`" total work (occlusion culling, classification to use
specialized shaders, etc.).
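
The following is a minimal host-side sketch, not part of the extension, of
how the input arrays described in the usage steps above relate to the object
table: each element of an input array is an index into the table rather than
a CPU pointer.
All type and function names (`ModelObject`, `ModelObjectTable`,
`ModelInputs`, `modelResolveSequence`) are hypothetical stand-ins chosen for
illustration; an actual implementation resolves these indices on the device.

[source,c]
---------------------------------------------------
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a registered object; a real object table holds
   Vulkan handles such as pipelines or descriptor sets. */
typedef struct {
    const char *name;
} ModelObject;

/* Hypothetical model of an object table: a flat array addressed by index. */
typedef struct {
    const ModelObject *entries;
    uint32_t           count;
} ModelObjectTable;

/* One independent input array per token; element s of each array holds the
   object-table index used by sequence s. */
typedef struct {
    const uint32_t *pipelineIndices;
    const uint32_t *descriptorSetIndices;
    uint32_t        sequenceCount;
} ModelInputs;

typedef struct {
    const ModelObject *pipeline;
    const ModelObject *descriptorSet;
} ModelResolvedSequence;

/* Looks up the objects referenced by sequence s.
   Returns 0 on success, or -1 if s or a table index is out of range. */
int modelResolveSequence(const ModelObjectTable *table,
                         const ModelInputs *in,
                         uint32_t s,
                         ModelResolvedSequence *out)
{
    assert(table != NULL && in != NULL && out != NULL);
    if (s >= in->sequenceCount)
        return -1;

    uint32_t p = in->pipelineIndices[s];
    uint32_t d = in->descriptorSetIndices[s];
    if (p >= table->count || d >= table->count)
        return -1;

    out->pipeline      = &table->entries[p];
    out->descriptorSet = &table->entries[d];
    return 0;
}
---------------------------------------------------

Because the inputs are plain indices, the same arrays can be written by a
shader on the device without any knowledge of host addresses.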

=== New Object Types

  * slink:VkObjectTableNVX
  * slink:VkIndirectCommandsLayoutNVX

=== New Flag Types

  * tlink:VkIndirectCommandsLayoutUsageFlagsNVX
  * tlink:VkObjectEntryUsageFlagsNVX

=== New Enum Constants

Extending elink:VkStructureType:

  ** ename:VK_STRUCTURE_TYPE_OBJECT_TABLE_CREATE_INFO_NVX
  ** ename:VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_NVX
  ** ename:VK_STRUCTURE_TYPE_CMD_PROCESS_COMMANDS_INFO_NVX
  ** ename:VK_STRUCTURE_TYPE_CMD_RESERVE_SPACE_FOR_COMMANDS_INFO_NVX
  ** ename:VK_STRUCTURE_TYPE_DEVICE_GENERATED_COMMANDS_LIMITS_NVX
  ** ename:VK_STRUCTURE_TYPE_DEVICE_GENERATED_COMMANDS_FEATURES_NVX

Extending elink:VkPipelineStageFlagBits:

  ** ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX

Extending elink:VkAccessFlagBits:

  ** ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX
  ** ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX

=== New Enums

  * elink:VkIndirectCommandsLayoutUsageFlagBitsNVX
  * elink:VkIndirectCommandsTokenTypeNVX
  * elink:VkObjectEntryUsageFlagBitsNVX
  * elink:VkObjectEntryTypeNVX

=== New Structures

  * slink:VkDeviceGeneratedCommandsFeaturesNVX
  * slink:VkDeviceGeneratedCommandsLimitsNVX
  * slink:VkIndirectCommandsTokenNVX
  * slink:VkIndirectCommandsLayoutTokenNVX
  * slink:VkIndirectCommandsLayoutCreateInfoNVX
  * slink:VkCmdProcessCommandsInfoNVX
  * slink:VkCmdReserveSpaceForCommandsInfoNVX
  * slink:VkObjectTableCreateInfoNVX
  * slink:VkObjectTableEntryNVX
  * slink:VkObjectTablePipelineEntryNVX
  * slink:VkObjectTableDescriptorSetEntryNVX
  * slink:VkObjectTableVertexBufferEntryNVX
  * slink:VkObjectTableIndexBufferEntryNVX
  * slink:VkObjectTablePushConstantEntryNVX

=== New Functions

  * flink:vkCmdProcessCommandsNVX
  * flink:vkCmdReserveSpaceForCommandsNVX
  * flink:vkCreateIndirectCommandsLayoutNVX
  * flink:vkDestroyIndirectCommandsLayoutNVX
  * flink:vkCreateObjectTableNVX
  * flink:vkDestroyObjectTableNVX
  * flink:vkRegisterObjectsNVX
  * flink:vkUnregisterObjectsNVX
  * flink:vkGetPhysicalDeviceGeneratedCommandsPropertiesNVX

=== Issues

1) How to name this extension?

*RESOLVED*: `VK_NVX_device_generated_commands`

As usual, one of the hardest issues ;)

Alternatives: `VK_gpu_commands`, `VK_execute_commands`,
`VK_device_commands`, `VK_device_execute_commands`, `VK_device_execute`,
`VK_device_created_commands`, `VK_device_recorded_commands`,
`VK_device_generated_commands`

2) Should we use serial tokens or redundant sequence description?

Similarly to slink:VkPipeline, signatures are the most likely to be
adoptable across vendors.
They also benefit from being processable in parallel.

3) How to name sequence description?

stext:ExecuteCommandSignature is a bit long.
Maybe just stext:ExecuteSignature, or, following Vulkan nomenclature more
closely: slink:VkIndirectCommandsLayoutNVX.

4) Do we want to provide code:indirectCommands inputs with layout or at
code:indirectCommands time?

Separate layout from data as Vulkan does.
Provide full flexibility for code:indirectCommands.

5) Should the input be provided as SoA or AoS?

It is desirable for the application to reuse the list of objects and render
them with some kind of an override.
This can be done by just selecting a different input for a push constant or
a descriptor set, if they are defined as independent arrays.
If the data were interleaved, this would not be as easily possible.

Allowing input divisors can also reduce the conservative command buffer
allocation.

6) How do we know the size of the GPU command buffer generated by
flink:vkCmdProcessCommandsNVX?

pname:maxSequencesCount can give an upper estimate, even if the actual count
is sourced from the GPU buffer at (buffer, countOffset).
As such pname:maxSequencesCount must always be set correctly.

Developers are encouraged to make good use of the
slink:VkIndirectCommandsLayoutNVX's ptext:pTokens[].divisor, as divisors
allow less conservative storage costs.
Pipeline changes on a per-draw basis in particular can be costly
memory-wise.

7) How to deal with dynamic offsets in DescriptorSets?

Maybe an additional token
etext:VK_EXECUTE_DESCRIPTOR_SET_OFFSET_COMMAND_NVX that works for a
"`single dynamic buffer`" descriptor set and then use (32-bit tableEntry +
32-bit offset).

Added a dynamicCount field, variable-sized input.

8) Should we allow updates to the object table, similar to DescriptorSet?

Yes, this is desired: people may change "`material`" shaders and not want to
recreate the entire register table.
However the developer must ensure not to overwrite a registered objectIndex
while it is still being used.

9) Should we allow dynamic state changes?

Seems a bit excessive for a "`per-draw`" type of scenario, but the GPU could
partition work itself with viewport/scissor...

10) How do we allow re-using already "`filled`" code:indirectCommands
buffers?

Just use a slink:VkCommandBuffer for the output, and it can be reused
easily.

11) How portable should such re-use be?

Same as a secondary command buffer.

12) Should sequenceOrdered be part of IndirectCommandsLayout or
flink:vkCmdProcessCommandsNVX?

Seems better for IndirectCommandsLayout, as that is when most heavy lifting
in terms of internal device code generation is done.

13) Under which conditions is flink:vkCmdProcessCommandsNVX legal?

Options:

a) on the host command buffer like a regular draw call

b) flink:vkCmdProcessCommandsNVX makes use of
slink:VkCommandBufferBeginInfo and serves as flink:vkBeginCommandBuffer /
flink:vkEndCommandBuffer implicitly.

c) The pname:targetCommandBuffer must be inside the "`begin`" state already
at the moment of being passed.
This very likely suggests a new tlink:VkCommandBufferUsageFlags
etext:VK_COMMAND_BUFFER_USAGE_DEVICE_GENERATED_BIT.

d) The pname:targetCommandBuffer must reserve space via a new function.

Used a) and d).

14) What if different pipelines have different DescriptorSetLayouts at a
certain set unit that mismatches in code:token.dynamicCount?

Considered legal, as long as the maximum dynamic count of all used
DescriptorSetLayouts is provided.

15) Should we add "`strides`" to input arrays, so that "`Array of
Structures`" type setups can be supported more easily?

Maybe provide a usage flag for packed tokens stream (all inputs from same
buffer, implicit stride).

No, given that the performance test results were worse.

16) Should we allow re-using the target command buffer directly, without
need to reset command buffer?

YES: new API flink:vkCmdReserveSpaceForCommandsNVX.

17) Is flink:vkCmdProcessCommandsNVX copying the input data or referencing
it?

There are multiple implementations possible:

  * one could have some emulation code that parses the inputs, and generates
    an output command buffer, therefore copying the inputs.
  * one could just reference the inputs, and have the processing done in
    pipe at execution time.

If the data is mandated to be copied, then it puts a penalty on
implementations that could process the inputs directly in pipe.
If the data is "`referenced`", then it allows both types of implementation.

The inputs are "`referenced`", and should not be modified after the call to
flink:vkCmdProcessCommandsNVX and until after the rendering of the target
command buffer is finished.

18) Why is this `NVX` and not `NV`?

To allow early experimentation and feedback.
We expect that a refined design will follow as a multi-vendor variant.

19) Should we make the availability for each token type a device limit?

Only distinguish between graphics/compute for now, further splitting up may
lead to too much fragmentation.

20) When can the pname:objectTable be modified?

Similar to the other inputs for flink:vkCmdProcessCommandsNVX, only when all
device access via flink:vkCmdProcessCommandsNVX or execution of the target
command buffer has completed can an object at a given objectIndex be
unregistered or re-registered again.

21) Which buffer usage flags are required for the buffers referenced by
flink:vkCmdProcessCommandsNVX?

Reuse the existing ename:VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT for:

  * slink:VkCmdProcessCommandsInfoNVX::pname:sequencesCountBuffer
  * slink:VkCmdProcessCommandsInfoNVX::pname:sequencesIndexBuffer
  * slink:VkIndirectCommandsTokenNVX::pname:buffer

22) In which pipeline stage does the device generated command expansion
happen?

flink:vkCmdProcessCommandsNVX is treated as if it occurs in a separate
logical pipeline from either graphics or compute, and that pipeline only
includes ename:VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, a new stage
ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX, and
ename:VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT.
This new stage has two corresponding new access types,
ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX and
ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX, used to synchronize reading
the buffer inputs and writing the command buffer memory output.
The output written in the target command buffer is considered to be consumed
by the ename:VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT pipeline stage.

Thus, to synchronize from writing the input buffers to executing
flink:vkCmdProcessCommandsNVX, use:

  * pname:dstStageMask = ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX
  * pname:dstAccessMask = ename:VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX

To synchronize from executing flink:vkCmdProcessCommandsNVX to executing the
generated commands, use:

  * pname:srcStageMask = ename:VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX
  * pname:srcAccessMask = ename:VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX
  * pname:dstStageMask = ename:VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
  * pname:dstAccessMask = ename:VK_ACCESS_INDIRECT_COMMAND_READ_BIT

When flink:vkCmdProcessCommandsNVX is used with a pname:targetCommandBuffer
of `NULL`, the generated commands are immediately executed and there is
implicit synchronization between generation and execution.

23) What if most token data is "`static`", but we frequently want to render
a subsection?

Added "`sequencesIndexBuffer`".
This makes it easier to sort and filter what should actually be processed.
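
Issues 5, 6 and 23 refer to input divisors and to the sequencesIndexBuffer.
The sketch below models both on the host under two assumptions that this
appendix does not state explicitly: that a token with divisor `d` reads
input element `s / d` (integer division) for sequence `s`, and that the
optional index buffer remaps the sequence number before that lookup.
The function names are hypothetical and not part of the API.

[source,c]
---------------------------------------------------
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed rule: sequence s reads element s / divisor of a token's input
   array, so consecutive sequences share one element when divisor > 1. */
uint32_t modelInputElement(uint32_t s, uint32_t divisor)
{
    assert(divisor > 0);
    return s / divisor;
}

/* Number of input elements needed to cover sequenceCount sequences under
   the rule above (ceiling division without overflow). */
uint32_t modelInputElementCount(uint32_t sequenceCount, uint32_t divisor)
{
    assert(divisor > 0);
    return sequenceCount / divisor + (sequenceCount % divisor != 0 ? 1u : 0u);
}

/* Assumed rule: when an index array is supplied, the s-th processed
   sequence is taken from it; otherwise sequences run in natural order. */
uint32_t modelSequenceIndex(const uint32_t *sequencesIndex, uint32_t s)
{
    return sequencesIndex != NULL ? sequencesIndex[s] : s;
}
---------------------------------------------------

Under these assumptions, a pipeline token with a divisor of 4 needs 3 input
elements to cover 10 sequences, and a mostly static token stream can be
filtered by writing only the indices of the visible sequences into the index
array.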

=== Example Code

Open-source samples illustrating the usage of the extension can be found at
the following locations:

https://github.com/nvpro-samples/gl_vk_threaded_cadscene/blob/master/doc/vulkan_nvxdevicegenerated.md

https://github.com/NVIDIAGameWorks/GraphicsSamples/tree/master/samples/vk10-kepler/BasicDeviceGeneratedCommandsVk

[source,c]
---------------------------------------------------

// set up the secondary command buffer
vkBeginCommandBuffer(generatedCmdBuffer, &beginInfo);
// ... set up its state as usual

// insert the reservation (there can only be one per command buffer)
// where the generated calls should be filled into
VkCmdReserveSpaceForCommandsInfoNVX reserveInfo = { VK_STRUCTURE_TYPE_CMD_RESERVE_SPACE_FOR_COMMANDS_INFO_NVX };
reserveInfo.objectTable = objectTable;
reserveInfo.indirectCommandsLayout = deviceGeneratedLayout;
reserveInfo.maxSequencesCount = myCount;
vkCmdReserveSpaceForCommandsNVX(generatedCmdBuffer, &reserveInfo);

vkEndCommandBuffer(generatedCmdBuffer);

// trigger the generation at some point in another primary command buffer
VkCmdProcessCommandsInfoNVX processInfo = { VK_STRUCTURE_TYPE_CMD_PROCESS_COMMANDS_INFO_NVX };
processInfo.objectTable = objectTable;
processInfo.indirectCommandsLayout = deviceGeneratedLayout;
processInfo.maxSequencesCount = myCount;
// set the target of the generation (if null we would directly execute with mainCmd)
processInfo.targetCommandBuffer = generatedCmdBuffer;
// provide input data
processInfo.indirectCommandsTokenCount = 3;
processInfo.pIndirectCommandsTokens = myTokens;

// If you modify the input buffer data referenced by VkCmdProcessCommandsInfoNVX,
// ensure you have added the appropriate barriers prior to the generation process.
// When regenerating the content of the same reserved space, ensure prior operations have completed

VkMemoryBarrier memoryBarrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
memoryBarrier.srcAccessMask = ...;
memoryBarrier.dstAccessMask = VK_ACCESS_COMMAND_PROCESS_READ_BIT_NVX;

vkCmdPipelineBarrier(mainCmd,
    /*srcStageMask*/VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    /*dstStageMask*/VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX,
    /*dependencyFlags*/0,
    /*memoryBarrierCount*/1,
    /*pMemoryBarriers*/&memoryBarrier,
    ...);

vkCmdProcessCommandsNVX(mainCmd, &processInfo);
// ...
// ensure the processing that modifies command-buffer content has completed,
// then execute the secondary command buffer

memoryBarrier.srcAccessMask = VK_ACCESS_COMMAND_PROCESS_WRITE_BIT_NVX;
memoryBarrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;

vkCmdPipelineBarrier(mainCmd,
    /*srcStageMask*/VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX,
    /*dstStageMask*/VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
    /*dependencyFlags*/0,
    /*memoryBarrierCount*/1,
    /*pMemoryBarriers*/&memoryBarrier,
    ...);
vkCmdExecuteCommands(mainCmd, 1, &generatedCmdBuffer);

---------------------------------------------------

=== Version History

  * Revision 3, 2017-07-25 (Chris Hebert)
    - Correction to specification of dynamicCount for push_constant token in
      VkIndirectCommandsLayoutNVX.
      Stride was incorrectly computed as dynamicCount was not treated as byte
      size.
  * Revision 2, 2017-06-01 (Christoph Kubisch)
    - header compatibility break: add missing _TYPE to
      VkIndirectCommandsTokenTypeNVX and VkObjectEntryTypeNVX enums to follow
      Vulkan naming convention
    - behavior clarification: only allow a single work provoking token per
      sequence when creating a slink:VkIndirectCommandsLayoutNVX
  * Revision 1, 2016-10-31 (Christoph Kubisch)
    - Initial draft