# Implementation on Nvidia GPUs

This document references useful information for implementing and optimizing on Nvidia GPUs.

## Integer instruction bug

The integer fused-multiply-add instruction with carry-in may
be incorrectly compiled from PTX prior to CUDA 11.5.1:
https://forums.developer.nvidia.com/t/wrong-result-returned-by-madc-hi-u64-ptx-instruction-for-specific-operands/196094

Test case from: https://github.com/tickinbuaa/CudaTest/blob/master/main.cu

```C++
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

__device__
inline void mac_with_carry(uint64_t &lo, uint64_t &hi, const uint64_t &a, const uint64_t &b, const uint64_t &c) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        printf("GPU calculation input: a = %lx b = %lx c = %lx\n", a, b, c);
    }
    // lo = low 64 bits of a*b + c (sets the carry flag), hi = high 64 bits of a*b + carry
    asm("mad.lo.cc.u64 %0, %2, %3, %4;\n\t"
        "madc.hi.u64 %1, %2, %3, 0;\n\t"
        :"=l"(lo), "=l"(hi): "l"(a), "l"(b), "l"(c));
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        printf("GPU calculation result: hi = %lx low = %lx\n", hi, lo);
    }
}

__global__
void test(uint64_t *out, uint32_t num){
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num) {
        return;
    }
    uint64_t a = 0x42737a020c0d6393UL;
    uint64_t b = 0xffffffff00000001UL;
    uint64_t c = 0xc999e990f3f29c6dUL;
    mac_with_carry(out[tid << 1], out[(tid << 1) + 1], a, b, c);
}

int main() {
    uint64_t *d_out;
    uint32_t num = 1;
    cudaMalloc(&d_out, num * 2 * sizeof(uint64_t));
    const uint32_t BLOCK_SIZE = 256;
    uint32_t block_num = (num + BLOCK_SIZE - 1) / BLOCK_SIZE;
    test<<<block_num, BLOCK_SIZE>>>(d_out, num);
    cudaDeviceSynchronize();
    // Reference result computed on the CPU with 128-bit arithmetic
    unsigned __int128 a = 0x42737a020c0d6393UL;
    unsigned __int128 b = 0xffffffff00000001UL;
    unsigned __int128 c = 0xc999e990f3f29c6dUL;
    unsigned __int128 result = a * b + c;
    printf("Cpu result: hi:%lx low:%lx\n", (uint64_t)((result >> 64) & 0xffffffffffffffffUL), (uint64_t)(result & 0xffffffffffffffffUL));
    cudaFree(d_out);
    return 0;
}
```

## The hidden XMAD instruction

There is a "hidden" instruction called XMAD on Nvidia GPUs described in
- Optimizing Modular Multiplication for NVIDIA’s Maxwell GPUs\
  Niall Emmart, Justin Luitjens, Charles Weems and Cliff Woolley\
  https://ieeexplore.ieee.org/abstract/document/7563271

On Maxwell and Pascal GPUs (SM 5.x and 6.x), there was no native 32-bit integer multiplication, probably due to die size constraints.
So 32-bit multiplication was built from 16-bit multiply-adds (XMAD), with some PTX->SASS compiler pattern matching to detect optimal XMAD
scheduling.
Starting from Volta (SM 7.0) and Turing (SM 7.5, RTX 20XX), there is hardware integer multiplication again:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

Code to generate the proper XMAD is available in:
- https://github.com/NVlabs/xmp/blob/0052dbb/src/include/ptx/PTXInliner_impl.h#L371-L384
- https://github.com/NVlabs/CGBN/blob/e8b9d26/include/cgbn/arith/asm.cu#L131-L142
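
For illustration, here is a minimal sketch (not taken from the repositories above, names invented here) of the 16-bit decomposition that XMAD exploits: the low 32 bits of a 32x32-bit product expressed as 16-bit multiply-adds, the kind of shape the PTX->SASS compiler pattern-matches into XMAD sequences on Maxwell/Pascal.

```C++
#include <cstdint>

// Illustrative sketch only: a*b mod 2^32 decomposed into 16-bit partial products.
// On Maxwell/Pascal, each 16x16-bit multiply-add below maps to an XMAD at the SASS level.
__device__ __forceinline__ uint32_t mul_lo_via_16bit(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;
    // a*b mod 2^32 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 16)
    return a_lo * b_lo + ((a_lo * b_hi + a_hi * b_lo) << 16);
}
```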

## Double-precision floating point arithmetic

There are some recent papers exploring the use of the 52-bit mantissa of a float64 to accelerate elliptic curve cryptography.
This is similar to the AVX approaches on CPUs.

- Faster Modular Exponentiation Using Double Precision Floating Point Arithmetic on the GPU\
  Niall Emmart, Fangyu Zheng, Charles Weems\
  https://ieeexplore.ieee.org/document/8464792

- DPF-ECC: Accelerating Elliptic Curve Cryptography with Floating-Point Computing Power of GPUs\
  Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, C. Weems\
  https://ieeexplore.ieee.org/document/9139772

Unfortunately float64 arithmetic is extremely slow on Nvidia GPUs, except on Tesla-class GPUs, due to market segmentation.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#architecture-8-x

SM 8.0 corresponds to a Tesla A100, and SM 8.6 to an RTX 30X0 or Quadro AX000.

> A Streaming Multiprocessor (SM) consists of:
>   - 64 FP32 cores for single-precision arithmetic operations in devices of compute capability 8.0\
>        and 128 FP32 cores in devices of compute capability 8.6, 8.7 and 8.9,
>   - 32 FP64 cores for double-precision arithmetic operations in devices of compute capability 8.0\
>        and **2 FP64 cores** in devices of compute capability 8.6, 8.7 and 8.9
>   - **64 INT32 cores** for integer math

Hence Nvidia chose to replace 30 FP64 cores with 64 FP32 cores on consumer GPUs. This is an understandable business decision, since graphics and machine learning workloads use and are benchmarked on FP32, with FP64 being used mostly in scientific and simulation workloads. However for blockchain, it's important for decentralization that as much as possible can run on consumer hardware. Tesla cards are $10k, so we want to optimize for consumer GPUs with their 1/32 FP64-to-INT32 throughput ratio.

So assuming 1 cycle per instruction on the matching core, an SM can issue 64 INT32 instructions per cycle but only 2 FP64 instructions, hence the 1/32 throughput ratio.

Concretely, to emulate a 64x64->128 extended-precision multiplication we need 4 32-bit multiplications (and fused additions):
```
      a₁a₀
*     b₁b₀
---------------------------
      a₀b₀
    a₁b₀
    a₀b₁
  a₁b₁
```
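
A minimal sketch of this schoolbook decomposition in CUDA C++ (illustrative only, not taken from a specific library; a production version would typically fuse the additions and carries with the mad.lo.cc/madc PTX family shown earlier):

```C++
#include <cstdint>

// Illustrative sketch: 64x64 -> 128-bit multiplication from four 32x32 -> 64-bit
// partial products, following the diagram above.
__device__ __forceinline__ void mul_64x64_128(uint64_t a, uint64_t b,
                                              uint64_t &hi, uint64_t &lo) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;               // a₀b₀
    uint64_t p10 = a1 * b0;               // a₁b₀
    uint64_t p01 = a0 * b1;               // a₀b₁
    uint64_t p11 = a1 * b1;               // a₁b₁

    uint64_t mid = p10 + (p00 >> 32);     // cannot overflow 64 bits
    mid += p01;                           // may overflow: track the carry
    uint64_t carry = (mid < p01) ? 1u : 0u;

    lo = (mid << 32) | (uint32_t)p00;
    hi = p11 + (mid >> 32) + (carry << 32);
}
```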

Assuming we need only 2 FP64 instructions for the 64x64->128 integer multiplication (mul.lo and mul.hi), the throughput ratio would be:
`1/32 (base throughput) * 4 (INT32 mul instructions) * 1/2 (FP64 mul instructions) = 1/16`

In reality:
- we use a 52-bit mantissa, so we would have computed only 104 bits
- there is extra addition/subtraction, shifting and masking involved
- this significantly increases the chances of mistakes, and formal verification or fuzzing on GPUs isn't easy

## Code generation considerations

### Parameter passing:
- https://reviews.llvm.org/D118084
  > The motivation for this change is to allow SROA to eliminate local copies in more cases. Local copies that make it to the generated PTX present a substantial performance hit, as we end up with all threads on the GPU rushing to access their own chunk of very high-latency memory.

Direct parameter passing is easier to analyze but not worthwhile for large aggregates.
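
As a hypothetical illustration (the struct and function names are invented here, not from the review above), a multi-limb big integer passed by value creates an aggregate copy that SROA must eliminate, while passing a pointer avoids the copy at the cost of harder analysis:

```C++
#include <cstdint>

// Hypothetical illustration: a 256-bit integer as a 4-limb aggregate. If SROA cannot
// scalarize a by-value copy, it lands in high-latency local memory in the generated PTX.
struct BigInt256 { uint64_t limbs[4]; };

// By-value aggregate parameter: easier to analyze, but relies on SROA to remove the copy.
__device__ uint64_t fold_by_value(BigInt256 a) {
    return a.limbs[0] ^ a.limbs[1] ^ a.limbs[2] ^ a.limbs[3];
}

// By-pointer parameter: no aggregate copy, but aliasing makes analysis harder.
__device__ uint64_t fold_by_pointer(const BigInt256 *a) {
    return a->limbs[0] ^ a->limbs[1] ^ a->limbs[2] ^ a->limbs[3];
}
```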

### Important optimization passes:

- https://www.llvm.org/docs/Passes.html
- GVN, global value numbering, to remove redundant loads
- mem2reg, which promotes memory into registers; memory is expensive on GPUs
  > This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form.

  https://stackoverflow.com/a/66082008
- SROA, Scalar Replacement of Aggregates, to remove local copies and allocas. Static index accesses help,
  as mentioned in https://discourse.llvm.org/t/nvptx-calling-convention-for-aggregate-arguments-passed-by-value/

  https://github.com/llvm/llvm-project/issues/51734#issuecomment-981047833
  > Local loads/stores on GPU are expensive enough to be worth quite a few extra instructions.
- https://github.com/apc-llc/nvcc-llvm-ir

Note: the dead code/instruction elimination passes might remove inline ASM that is not marked sideeffect/volatile.
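
For example (a minimal sketch, not from the codebase), `asm volatile` in CUDA C++ maps to LLVM inline assembly marked `sideeffect`, which keeps the dead-code-elimination passes from dropping it even when its result appears unused:

```C++
#include <cstdint>

// Illustrative sketch: the "volatile" qualifier marks the inline PTX as having
// side effects, so dead-code-elimination passes will not remove it.
__device__ __forceinline__ uint64_t add_keep(uint64_t a, uint64_t b) {
    uint64_t r;
    asm volatile("add.u64 %0, %1, %2;" : "=l"(r) : "l"(a), "l"(b));
    return r;
}
```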

Ordering GVN before SROA: https://reviews.llvm.org/D111471

If we use "normal" instructions instead of inline assembly, this thread links to many LLVM-internal discussions
on the passes that optimize them into add-with-carry: https://github.com/llvm/llvm-project/issues/31102
We have (a sketch contrasting both styles follows this list):
- InstCombine, for instruction combining (see also: https://reviews.llvm.org/D8889, https://reviews.llvm.org/D124698, https://github.com/llvm/llvm-project/issues/39832)
- CodegenPrepare, for ISA-specific codegen
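
As an illustrative sketch (the `add_128_*` names are invented here), the two styles look like this: with the "normal" version it is up to InstCombine/CodegenPrepare and ptxas to recover the carry chain, while the inline-asm version pins down add.cc/addc explicitly.

```C++
#include <cstdint>

// "Normal" instructions: the optimizer must recognize the unsigned-overflow
// comparison as a carry to emit add.cc.u64/addc.u64.
__device__ void add_128_normal(uint64_t a_lo, uint64_t a_hi,
                               uint64_t b_lo, uint64_t b_hi,
                               uint64_t &r_lo, uint64_t &r_hi) {
    r_lo = a_lo + b_lo;
    uint64_t carry = (r_lo < a_lo) ? 1 : 0;   // carry-out pattern the passes must spot
    r_hi = a_hi + b_hi + carry;
}

// Inline assembly: the carry chain is explicit in PTX.
__device__ void add_128_asm(uint64_t a_lo, uint64_t a_hi,
                            uint64_t b_lo, uint64_t b_hi,
                            uint64_t &r_lo, uint64_t &r_hi) {
    asm("add.cc.u64 %0, %2, %3;\n\t"
        "addc.u64   %1, %4, %5;\n\t"
        : "=l"(r_lo), "=l"(r_hi)
        : "l"(a_lo), "l"(b_lo), "l"(a_hi), "l"(b_hi));
}
```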

### LLVM NVPTX or Nvidia libNVVM

- https://docs.nvidia.com/cuda/libnvvm-api/index.html
- https://docs.nvidia.com/pdf/libNVVM_API.pdf
- https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html
- https://docs.nvidia.com/cuda/pdf/NVVM_IR_Specification.pdf

⚠ NVVM IR is based on LLVM 7.0.1 IR, which dates from December 2018.
There are a couple of caveats:
- LLVM 7.0.1 is usually not available in package repositories, making installation difficult
- There was an ABI-breaking bug making the 7.0.1 and 7.1.0 versions messy (https://www.phoronix.com/news/LLVM-7.0.1-Released)
- LLVM 7.0.1 does not have LLVMBuildCall2 and relies on the deprecated LLVMBuildCall, meaning
  that supporting both it and the latest LLVM (for the AMDGPU and SPIR-V backends) will likely have heavy costs
- When generating an add-with-carry kernel with inline ASM calls from LLVM-14,
  if the LLVM IR is passed as bitcode,
  the kernel content is silently discarded; this does not happen with the built-in add.
  It is unclear whether LLVMBuildCall2 or an inline ASM incompatibility causes the issue
- When generating an add-with-carry kernel with inline ASM calls from LLVM-14,
  if the LLVM IR is passed as textual IR, the code is refused with NVVM_ERROR_INVALID_IR

Hence, using the LLVM NVPTX backend instead of libNVVM is likely the sustainable way forward.

### Register pressure

See this AMD paper: https://dl.acm.org/doi/pdf/10.1145/3368826.3377918 \
However, if we want to reduce register pressure we need to spill to local memory, which is also expensive.

## Parallel reductions

Batch elliptic curve point addition `r = P₀ + P₁ + ... + Pₙ` and
multi-scalar multiplication (MSM) `r = [k₀]P₀ + [k₁]P₁ + ... + [kₙ]Pₙ`
are reduction operations.

There is a wealth of resources regarding optimized implementations of those; a minimal warp-shuffle reduction sketch is given after the resource lists below.
The baseline is provided by: [Optimizing Parallel Reduction in CUDA, Mark Harris](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf)
Then on later architectures:
- https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/
- https://www.irisa.fr/alf/downloads/collange/talks/collange_warp_synchronous_19.pdf

Other interesting resources:
- https://on-demand.gputechconf.com/gtc/2017/presentation/s7622-Kyrylo-perelygin-robust-and-scalable-cuda.pdf \
  This explains in great detail the cooperative groups features,
  with examples in reduction kernels
- https://github.com/umfranzw/cuda-reduction-example \
  This explains and uses overlapping streams for latency hiding
- https://vccvisualization.org/teaching/CS380/CS380_fall2020_lecture_25.pdf \
  SHFL instruction
- https://unum.cloud/post/2022-01-28-reduce/
  - https://github.com/ashvardanian/ParallelReductionsBenchmark \
    This provides an overview and benchmark code across CPU (AVX2, OpenMP, TBB), OpenCL, CUDA (cuBLAS, Thrust, CUB)
- https://diglib.eg.org/bitstream/handle/10.2312/egt20211037/CUDA_day2.pdf
  - https://cuda-tutorial.github.io/part3_22.pdf
  - https://github.com/CUDA-Tutorial/CodeSamples
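
Below is a minimal warp-level reduction sketch using shuffle intrinsics, in the style described in the Kepler blog post above (the kernel and names are illustrative, not from a specific library). For batch point addition or MSM, the `uint64_t` addition would be replaced by modular field or elliptic curve point addition.

```C++
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative warp-level sum reduction using shuffle intrinsics.
// Each call reduces the 32 values held by a warp without using shared memory.
__device__ __forceinline__ uint64_t warp_reduce_sum(uint64_t val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xFFFFFFFFu, val, offset);
    return val;  // lane 0 holds the warp-wide sum
}

// One atomic per warp accumulates the partial sums into a single result.
// The caller must zero-initialize *out before launching the kernel.
__global__ void reduce_sum(const uint64_t *in, uint64_t *out, uint32_t n) {
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    uint64_t v = (tid < n) ? in[tid] : 0;
    v = warp_reduce_sum(v);
    if ((threadIdx.x & 31u) == 0)
        atomicAdd(reinterpret_cast<unsigned long long*>(out),
                  static_cast<unsigned long long>(v));
}
```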