Being HyperAttentive in Times of AI
When mere attention isn't enough! (not even in reading the Title here)
As input contexts grow larger and larger, there is a need for a better computational solution.
Researchers at Yale University and Google have come up with a breakthrough that promises far more capable LLMs that are, at the same time, much easier to handle.
They have been able to achieve ‘approximately’ linear time even when the entries of the LLM’s attention matrix are unbounded or very large — cases that would otherwise mandate quadratic time.
To achieve this, the researchers have introduced two parameters (a small numeric sketch follows this list) :-
The maximum column norm of the normalised attention matrix
The ratio of row norms in the un-normalised attention matrix before and after detecting and removing large entries
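To make these two quantities concrete, here is a minimal Python sketch. It is illustrative only: the toy matrix, the softmax normalisation, and the top-k rule standing in for "detecting large entries" are my own assumptions, not the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                                   # toy sequence length and head dimension
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

scores = Q @ K.T / np.sqrt(d)                 # attention logits
A_unnorm = np.exp(scores)                     # un-normalised attention matrix
A = A_unnorm / A_unnorm.sum(axis=1, keepdims=True)   # row-normalised (softmax) matrix

# Parameter 1: maximum column norm of the normalised attention matrix
max_col_norm = np.linalg.norm(A, axis=0).max()

# Parameter 2: ratio of row norms of the un-normalised matrix before and
# after removing the "large" entries (here: top-k per row, a stand-in for
# whatever detection rule the actual algorithm uses)
k = 2
mask = np.zeros_like(A_unnorm, dtype=bool)
top_idx = np.argsort(A_unnorm, axis=1)[:, -k:]
np.put_along_axis(mask, top_idx, True, axis=1)

row_norms_before = np.linalg.norm(A_unnorm, axis=1)
row_norms_after = np.linalg.norm(np.where(mask, 0.0, A_unnorm), axis=1)
row_norm_ratio = (row_norms_before / row_norms_after).max()

print(f"max column norm: {max_col_norm:.3f}")
print(f"row-norm ratio:  {row_norm_ratio:.3f}")
```

Intuitively, when both quantities stay small the attention matrix is well behaved enough for a near-linear-time approximation to apply.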
This HyperAttention model was tested on various large-context datasets, and what it achieved was tremendous:-
With a context length of 32k on ChatGLM2, inference was found to be 50% faster.
The perplexity of the LLM increased only modestly, from 5.6 to 6.3.
NB: Context length is the maximum number of tokens an LLM can accommodate in a single pass, while perplexity measures how well the model predicts text (lower is better).
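For readers who want the perplexity numbers above in concrete terms, here is a small sketch of the usual computation — the exponential of the average negative log-likelihood per token. The function name and toy probabilities are mine, purely for illustration.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy example: log-probabilities a model assigned to each observed token
log_probs = [math.log(p) for p in (0.25, 0.10, 0.40, 0.05)]
print(f"perplexity: {perplexity(log_probs):.2f}")
```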
Up until now, LLMs were typically modified using sparse or low-rank matrix approximations to mitigate the cost of the quadratic-time attention layer.
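As a rough illustration of the low-rank idea (not any specific prior method), the sketch below approximates a dense attention matrix with a truncated SVD. The SVD here is only to show approximation quality — real methods build the low-rank factors without ever forming the full n × n matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 512, 64, 16               # sequence length, head dim, target rank
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Exact attention materialises an n x n matrix: O(n^2) time and memory.
A = np.exp(Q @ K.T / np.sqrt(d))
A /= A.sum(axis=1, keepdims=True)
out_exact = A @ V

# Low-rank idea: replace A by a rank-r factorisation so the product with V
# costs O(n * r * d) instead of O(n^2 * d).
U, s, Wt = np.linalg.svd(A, full_matrices=False)
out_lowrank = (U[:, :r] * s[:r]) @ (Wt[:r] @ V)

rel_err = np.linalg.norm(out_exact - out_lowrank) / np.linalg.norm(out_exact)
print(f"relative error of the rank-{r} approximation: {rel_err:.3e}")
```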
The researchers’ work is characterised by the following:-
Modularity in design
A mechanism based on Locality Sensitive Hashing (LSH) to detect large entries in the attention matrix (see the sketch after this list)
Kernel Density Estimation (KDE) for simple, linear-time approximation of the attention computation
Adoption of Fast Matrix Multiplication
54X faster without causal masking
5.4X faster with causal masking
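To give a feel for the LSH ingredient mentioned in the list above, here is a toy sketch of random-hyperplane hashing that scores only the query/key pairs landing in the same bucket. It is my own simplified illustration (the bucket count, hashing scheme, and explicit loop are assumptions), not the authors' actual sortLSH routine, which sorts tokens by bucket to obtain block structure rather than looping.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, n_bits = 16, 8, 3             # toy sizes; 2**n_bits hash buckets
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

# Random-hyperplane hashing: vectors pointing in similar directions tend to
# land in the same bucket, so the large q·k scores cluster by bucket.
planes = rng.normal(size=(d, n_bits))

def lsh_bucket(X):
    bits = (X @ planes) > 0
    return bits @ (1 << np.arange(n_bits))   # pack the sign bits into an integer

q_buckets = lsh_bucket(Q)
k_buckets = lsh_bucket(K)

# Score only the query/key pairs that share a bucket: a sparse stand-in for
# the dense n x n attention matrix.
sparse_scores = {}
for i in range(n):
    for j in np.flatnonzero(k_buckets == q_buckets[i]):
        sparse_scores[(i, int(j))] = float(Q[i] @ K[j] / np.sqrt(d))

print(f"scored {len(sparse_scores)} of {n * n} query/key pairs")
```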
Simplifying the attention mechanism in this fashion opens the door to lower computational requirements and much faster inference and training.
The NLP folks must be happier and paying their due attention!