FP8 is ~100 TFLOPS faster when the kernel name has "cutlass" in it

#China's secretive #Tianhe-3 #supercomputer uses a homegrown hybrid #CPU and rivals US systems with 1.57 #Exaflops of performance. The #NUDT #MT3000 features a heterogeneous architecture that combines general-purpose CPU cores with 96 control cores and 1,536 accelerator cores. The MT-3000 processor reportedly achieves 11.6 FP64 #TFLOPS of peak performance and a power efficiency of 45.4 #GigaFLOPS/Watt at an operating frequency of 1.20 GHz. https://www.tomshardware.com/tech-industry/supercomputers/chinas-secretive-tianhe-3-supercomputer-uses-homegrown-hybrid-cpu-rivals-us-systems-with-157-exaflops-of-performance-report #hpc #sanctions