Interesting developments in subquadratic alternatives to self-attention-based transformers for long-sequence modeling (32k tokens and more).
Hyena Hierarchy: Towards Larger Convolutional Language Models
https://arxiv.org/abs/2302.10866
They propose to replace the quadratic self-attention layers with an operator built from implicitly parametrized long-kernel 1D convolutions (minimal sketch below).
#DeepLearning #LLMs #PaperThread
1/4
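To give a flavor of that operator, here is a minimal PyTorch sketch (not the paper's code): the attention token mixer is replaced by elementwise gating around a long FFT convolution, order 2 here, whereas the actual Hyena operator stacks several such gates and parametrizes the filters implicitly (see the next post). Class name, branch names, and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn


class GatedLongConvBlock(nn.Module):
    """Illustrative Hyena-flavoured token mixer (order 2): y = x2 * fftconv(x1 * v, h).
    The dense attention matrix is replaced by a long convolution plus elementwise
    gating; the real Hyena operator uses implicitly parametrized filters."""

    def __init__(self, d_model: int, seq_len: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 3 * d_model)   # x1, x2, v branches
        self.out_proj = nn.Linear(d_model, d_model)
        # Explicit kernel for brevity; Hyena generates it with a small network.
        self.h = nn.Parameter(0.02 * torch.randn(d_model, seq_len))

    @staticmethod
    def fftconv(u, h):
        # Causal convolution via FFT: zero-pad to 2L to avoid circular wrap-around.
        L = u.shape[-1]
        n = 2 * L
        y = torch.fft.irfft(torch.fft.rfft(u, n=n) * torch.fft.rfft(h, n=n), n=n)
        return y[..., :L]

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        x1, x2, v = self.in_proj(x).chunk(3, dim=-1)
        u = (x1 * v).transpose(1, 2)           # (batch, d_model, seq_len)
        y = self.fftconv(u, self.h).transpose(1, 2)
        return self.out_proj(x2 * y)


x = torch.randn(2, 1024, 256)                   # (batch, seq_len, d_model)
print(GatedLongConvBlock(256, 1024)(x).shape)   # torch.Size([2, 1024, 256])
```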
Hyena models generalize well on synthetic "reasoning" tasks, such as associative recall, that are traditionally challenging for attention-free architectures.
The implicit parametrization of the convolution kernel makes it possible to use kernels as long as the input sequence while keeping the number of trainable parameters fixed (sketch below).
Furthermore, computing those convolutions with FFTConv brings the number of floating-point operations (FLOPs) down from O(L²) to O(L log L) in the sequence length L.
2/4
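To make the implicit parametrization concrete, a minimal sketch (not the paper's exact filter): a small MLP maps positional features of each time step to the kernel values, so the parameter count is set by the MLP width rather than by the kernel length. The sinusoidal features, layer sizes, and exponential windowing below are illustrative choices.

```python
import torch
import torch.nn as nn


class ImplicitLongFilter(nn.Module):
    """Illustrative implicitly parametrized convolution kernel: an MLP maps
    positional features of each step t in [0, L) to kernel values h[t], so the
    kernel can span the whole input sequence while the trainable parameter
    count depends only on the MLP width, not on L."""

    def __init__(self, d_model: int, n_freqs: int = 16, d_hidden: int = 64):
        super().__init__()
        self.n_freqs = n_freqs
        d_pos = 1 + 2 * n_freqs                      # t plus sin/cos features
        self.mlp = nn.Sequential(
            nn.Linear(d_pos, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, L: int):
        t = torch.linspace(0, 1, L)[:, None]         # normalized positions, (L, 1)
        freqs = torch.arange(1, self.n_freqs + 1)[None, :]
        z = 2 * torch.pi * t * freqs
        feats = torch.cat([t, torch.sin(z), torch.cos(z)], dim=-1)   # (L, d_pos)
        h = self.mlp(feats)                           # (L, d_model)
        decay = torch.exp(-torch.linspace(0.0, 4.0, L))[:, None]     # soft window
        return (h * decay).transpose(0, 1)            # (d_model, L)


h = ImplicitLongFilter(d_model=256)(L=32768)
print(h.shape)   # torch.Size([256, 32768]); parameter count is independent of L
```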
Unfortunately, the reduced FLOP count of Hyena layers does not necessarily translate into competitive wall-clock performance, because long-kernel FFT convolutions typically struggle to use hardware accelerators (GPUs, TPUs) efficiently.
In particular, FlashAttention-2 transformers remain competitive at fairly long input sequence lengths thanks to their highly optimized fused kernels (toy benchmark below).
3/4
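A toy benchmark along these lines illustrates the FLOPs-vs-walltime gap; shapes and iteration counts are arbitrary, the FFT convolution is the naive unfused PyTorch version, and `F.scaled_dot_product_attention` dispatches to a fused FlashAttention-style kernel on recent GPUs when the inputs allow it. Actual numbers will vary with hardware and sizes.

```python
import time

import torch
import torch.nn.functional as F


def bench(fn, iters=10):
    # Warm up, then time with explicit GPU synchronization.
    for _ in range(3):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters


B, H, L, D = 1, 16, 32768, 64
q, k, v = (torch.randn(B, H, L, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))
u = torch.randn(B, H * D, L, device="cuda")        # conv input, fp32
filt = torch.randn(H * D, L, device="cuda")        # long kernel, fp32


def attention():
    # Dispatches to a fused (FlashAttention-style) kernel when available.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)


def fftconv():
    # Unfused causal FFT convolution: three large kernel launches plus extra memory traffic.
    n = 2 * L
    y = torch.fft.irfft(torch.fft.rfft(u, n=n) * torch.fft.rfft(filt, n=n), n=n)
    return y[..., :L]


print(f"attention: {bench(attention) * 1e3:.1f} ms | fftconv: {bench(fftconv) * 1e3:.1f} ms")
```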
However, the following paper:
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
https://arxiv.org/abs/2311.05908
https://github.com/HazyResearch/flash-fft-conv
shows that FFTConv can be implemented efficiently on GPU tensor cores, making the Hyena architecture more competitive in wall-clock time (usage sketch below).
This could be a game changer for long-sequence "reasoning" and recall tasks in LLMs, DNA sequence analysis, and beyond.
4/4
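For completeness, the library exposes a drop-in convolution module. The snippet below is reconstructed from memory of the README, so treat the import path, constructor arguments, FFT-size convention, and expected dtypes as assumptions and check the repository before relying on them.

```python
import torch

# Assumed API, reconstructed from memory of the flash-fft-conv README;
# verify names, dtypes, and the FFT-size convention against the repository.
from flashfftconv import FlashFFTConv

B, D, L = 4, 768, 8192
u = torch.randn(B, D, L, device="cuda", dtype=torch.bfloat16)   # input signal
k = torch.randn(D, L, device="cuda", dtype=torch.float32)       # long filter

# FFT size set to 2 * L here to cover causal zero-padding (assumption).
conv = FlashFFTConv(2 * L, dtype=torch.bfloat16).cuda()
y = conv(u, k)    # fused FFT -> pointwise multiply -> inverse FFT on tensor cores
print(y.shape)    # expected: (B, D, L)
```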