The newFormer is introduced,
but what do we really know about it?
@ari and others
imagine a new large-scale architecture &
ask how you would interpret its abilities and behaviours
https://arxiv.org/abs/2308.00189
#deepRead #NLProc #MachineLearning
@ari We NLPers have become a complex systems science (like studying the brain or the weather)
We won't examine every part or understand the math of every module; instead, we try to make sense of the system at different levels of granularity
@ari Once, they say, we might have trained many variants of such networks, but that era is long past; it is too costly now
(I am not yet convinced this is true; network dynamics, BabyLM-scale models, etc. might still tell us some things without full pretraining, but it remains to be seen)
For more on BabyLM
http://babylm.github.io
Or network dynamics
https://arxiv.org/abs/2109.06096
@ari This leaves us to discover (not design!) what those models are capable of.
To do that, we must start by testing what newFormer does, and build up from there to the how and why
Specifically, we need to test the function by trying out inputs and checking the outputs
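A minimal sketch of what "trying out inputs and checking the outputs" could look like as black-box behavioural probing. The `model` function here is a hypothetical stand-in (a toy string transform, since newFormer is imaginary); in practice it would wrap the real system's inference API:

```python
# Black-box behavioural probing sketch: we treat the model purely as a
# function from inputs to outputs, and check which behaviours hold.

def model(prompt: str) -> str:
    # Hypothetical placeholder for newFormer inference; a toy transform
    # so this sketch actually runs.
    return prompt[::-1]

def probe(model, cases):
    """Run named (input, predicate) probes and record which behaviours hold."""
    results = {}
    for name, prompt, predicate in cases:
        output = model(prompt)
        results[name] = predicate(output)
    return results

# Each probe names a behaviour we hypothesize, without peeking inside.
cases = [
    ("returns_text", "hello", lambda out: isinstance(out, str)),
    ("length_preserved", "hello", lambda out: len(out) == len("hello")),
]

print(probe(model, cases))  # e.g. {'returns_text': True, 'length_preserved': True}
```

The point of the design is that probes only ever see the input/output behaviour, which is exactly the level of granularity we are stuck with when we discover rather than design.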
@ari Fortunately, we have some uncommon advantages that complex systems rarely offer
First, we know every detail of it, already an advantage over other large-scale efforts (e.g. the brain)
Second, we can run experiments
@ari How do you think we can tackle this challenge? We all know those models grow; what are the right questions to ask?
And what would generalize? (e.g., I am quite sure our recent work below would not stay true forever; how would we know it is wrong in the new model, or make long-lasting deductions?)