Parallel generation from autoregressive LMs
Para ll el!
Well, not exactly: use a fast LM to propose the next words first
https://arxiv.org/abs/2302.01318
#NLProc #Generation #inference
#DeepMind
The story is very simple
Autoregressive models predict each word given the previous ones, which is annoying and, with a strong model, slow
Instead, they propose to use a fast model to predict the next words
Then check, in parallel, whether the strong model agrees with all of those words
For a bit more detail:
q - the strong (target) model
p - the poor (draft) model
p generates n draft tokens x1..xn
q then calculates their probabilities (did I say in parallel?)
We accept each token if q also gives it high enough probability (eq in fig)
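The acceptance rule from the paper is a modified rejection sampling test: a drafted token x̃ at position t is kept with probability

```latex
\Pr[\text{accept } \tilde{x}] \;=\; \min\!\left(1,\; \frac{q(\tilde{x} \mid x_{<t})}{p(\tilde{x} \mid x_{<t})}\right)
```

so tokens the strong model likes at least as much as the draft did are always kept, and overconfident draft tokens are kept only proportionally.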
What if we reject?
We sample a replacement token from an adjusted distribution (the normalized positive part of q - p)
and the poor model's remaining predictions for that round get thrown away
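Putting the whole round together, here is a minimal sketch with toy stand-in distributions (the model functions, vocab size, and k are my assumptions for illustration, not the paper's setup); with a real transformer, step 2 is a single parallel forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size (an assumption for illustration)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy stand-ins for the two models: any prefix -> distribution function works.
def p_draft(prefix):   # fast "poor" model p
    return softmax(np.cos(np.arange(V) + len(prefix)))

def q_target(prefix):  # strong model q
    return softmax(np.sin(np.arange(V) + 2 * len(prefix)))

def speculative_step(prefix, k=4):
    """One round of speculative sampling: p drafts k tokens, q verifies."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    draft = []
    for _ in range(k):
        draft.append(int(rng.choice(V, p=p_draft(prefix + draft))))

    # 2) The target model scores every drafted position; with a real
    #    transformer this is one parallel forward pass over prefix + draft.
    accepted = []
    for i, x in enumerate(draft):
        ctx = prefix + draft[:i]
        p, q = p_draft(ctx), q_target(ctx)
        if rng.random() < min(1.0, q[x] / p[x]):  # accept with prob min(1, q/p)
            accepted.append(x)
        else:
            # 3) On rejection: resample from the residual (q - p)+, renormalized,
            #    and throw away the rest of the draft.
            residual = np.maximum(q - p, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(V, p=residual)))
            return prefix + accepted

    # 4) All k drafts accepted: q samples one bonus token for free.
    accepted.append(int(rng.choice(V, p=q_target(prefix + accepted))))
    return prefix + accepted
```

Each call extends the prefix by between 1 and k+1 tokens, which is exactly where the speedup comes from: the strong model runs once per round instead of once per token.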
Great speedups, simple and clean.
If you noticed, they chose a rather particular way of rechoosing on disagreement, the normalized difference between q and p; this looks a lot like contrastive decoding, right?
That particular choice is what makes the overall output provably match sampling from q; if you do not care about exactness, though, nothing stops you from resampling some other way at this point, or am I missing something
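A quick check of why the residual choice gives exactly q (my derivation, following the paper's rejection-sampling argument):

```latex
\Pr[X = x]
  = \underbrace{p(x)\min\!\left(1, \tfrac{q(x)}{p(x)}\right)}_{\text{accepted}}
  + \underbrace{\Pr[\text{reject}]\cdot
      \frac{\big(q(x) - p(x)\big)_+}{\sum_y \big(q(y) - p(y)\big)_+}}_{\text{resampled}}
```

Since $\Pr[\text{reject}] = 1 - \sum_y \min\big(p(y), q(y)\big) = \sum_y \big(q(y) - p(y)\big)_+$, the two terms sum to $\min\big(p(x), q(x)\big) + \big(q(x) - p(x)\big)_+ = q(x)$.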
Apparently something quite similar was proposed back in November; getting hot in here
https://arxiv.org/abs/2211.17192