Transcript of YouTube Video: Jacob Andreas | What Learning Algorithm is In-Context Learning?

The following is a summary and article generated by AI based on a transcript of the video "Jacob Andreas | What Learning Algorithm is In-Context Learning?". Due to the limitations of AI, please be careful to verify the correctness of the content.

Video Transcript
00:00
Hi everyone, I'm Jacob. I work mostly on natural language processing at MIT. Today I'm going to be talking about something a little different, which is the worst possible way to fit, or to estimate the parameters of, a linear model, and what this can teach us about modern NLP systems and language generation systems.

00:19
Here we go. Okay, so to start off: everybody's favorite introductory estimation problem, linear regression. We have some x's, we have some y's, a new x comes along, and we want to be able to predict the corresponding y from this x under whatever the data-generating process is. And, as every undergraduate now knows, the way to do this is to take all of your x's and put them into a matrix, take all of your y's and put them into a matrix, and then open up ChatGPT, paste your x's and your y's into ChatGPT, and ask it to give you a solution to this problem.

00:51
I actually wasn't going to do this for real, but I did it when I was putting these slides together, and it produces this very long, very authoritative-sounding answer; there's a bunch of intermediate computations and math here. If you take the actual predictor it gives you at the end and drop it back into this problem, it looks a little bit like this. So: not necessarily a reliable way to estimate linear regression models right now.

01:16
And, more generally, this is obviously a kind of ridiculous thing to do. We already have very old, very good algorithms for solving these kinds of problems, so it doesn't seem like the sort of thing we would necessarily need to ask a language model to do, or where it would even be an interesting question whether language models, text generation systems like the kind we have today, are able to do it at all.
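For reference, the "very old, very good" algorithm being alluded to is ordinary least squares, whose textbook closed-form solution (not something shown on the slides) is

    w_hat = (Xᵀ X)⁻¹ Xᵀ y,   where X stacks the x's as rows and y stacks the corresponding outputs.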

01:46
Nevertheless, the kinds of things we can do with these big, open-ended text generation systems are becoming more and more general, and the models themselves are becoming more and more capable. Here's an example from what's now an oldish paper, from 2020, showing that these models can do real open-ended text generation. Here's a particular language model (we'll talk more specifically about what these language models are in a minute) that is being prompted: it's asked to predict what text should come next after the text shown in light gray at the top of the slide. We give it the title of a hypothetical news article, we give it the subtitle of a hypothetical news article, and then we ask it to generate the text of the article itself. It gives you this long paragraph that is compatible with the title, describing the sort of thing you would expect to be in an article with that title, written in the style of a news article, and more or less internally coherent.

02:47
It's also worth noting that the factual content of this article is wrong: it describes, as prompted, a real split that happened in the Methodist Church, but the identities of the individual groups that came out of the schism, while plausible-sounding, are totally made up.

03:05
Now, we can do more with these text generation systems than just generate long blocks of text. One of the other surprising capabilities we've seen, as people have started to play around with these things, is that they can do instruction following. You can take, again, a generic next-word prediction system, give it as input the text "Explain the moon landing to a six-year-old in a few sentences," and a good enough text predictor will actually respond by following this instruction and generating some text that specifically explains the moon landing, or answers a question like "Why do birds migrate south for the winter?", and so on and so forth.

03:47
And maybe the last surprising capability we've seen in these models, as they have scaled up and as people have started to play around with them more, is a phenomenon that has come to be referred to as in-context learning. What's happening here is that rather than just giving the model the beginning of a document we want it to complete, and rather than giving it an explicit textual description of the task we want it to perform, we just give it some examples of that task being performed on other kinds of inputs.

04:19
So here we're asking it to do a grammatical error correction task, and we do this by placing, as the input to the model, the string "poor English input: I eated the purple berries / good English output: I ate the purple berries / poor English input: Thank you for picking me as your designer, I'd appreciate it / good English output:" and filling in all of these pieces like this. Again, all of the text shown in light gray on the slide is text that we, as a human user of this system, are providing, and notice the structure of this input: we have some human-written bad examples and good examples, some human-written inputs and outputs, and we've ended with an input. What the model is going to generate next is the desired output, and in this case what we get out of the model is in fact a corrected version of the last sentence we asked about.

05:11
So we can specify tasks that we want these language generation systems, these next-word prediction systems, to perform, not in the form of natural language instructions but in the form of tiny, tiny datasets for the new tasks we're asking the models to do. This turns out to be something that was not even necessarily planned ahead of time but was discovered in these models around 2020, and it's something you can do for a wide variety of tasks: you can do it even for simple arithmetic, you can do it for spelling correction, for translation, for that grammatical error correction, and we'll see a couple more examples later on.

05:52
And I think the important thing to note here is that in some cases this is more effective than providing a precise natural language description of the task you want the model to perform: models actually get better at doing these kinds of things when you give them explicit examples in the input and just ask them to generate more examples, or to generate the output pieces of those examples from that same distribution. Like I said before, this phenomenon has come to be referred to as in-context learning, and what that name is meant to evoke is that there is a kind of learning going on, something that looks like machine learning as we're used to thinking of it, or parameter estimation as we're used to thinking of it, but that is happening not as part of training one of these prediction systems but in the course of figuring out how to generate text, how to complete one of these documents.

06:47
And so what this talk is going to be about, the big question I want to ask today, is basically: what on Earth is going on here? What is the relationship between this particular phenomenon and learning as we're used to thinking about it in a machine learning setting, and what can we say about the reliability of this process, or the limits of this process?

07:08
I also want to note, before I go on, that everything I'm going to be talking about today was led by my student Ekin Akyürek, with a big collaboration from some folks at Google and Stanford as well: Dale Schuurmans and Denny Zhou at Google, and Tengyu Ma, who is joint between Google and Stanford.

07:23
The broad outline of this talk is, first, to figure out, or discuss, just what this in-context learning phenomenon is and what possible hypotheses we might build about what's going on when models are doing this text generation process. Then we're going to ask, for the kinds of models we have today, algorithmically, what kinds of things they might be able to do and what kinds of things they might not be able to do when presented with these sorts of in-context learning problems; this is all going to be in-principle argument, just from the structure of language models as they exist today, without looking at any real models. Then we'll actually go out and train some models and try to figure out what's really going on under the hood, and then talk about the implications of this more broadly for research on NLP and on language models.

08:13
So, starting with this first question: what exactly is this in-context learning phenomenon, and how might we describe this behavior? To talk about this, let's start by being a little more precise about what these language models are, what something like ChatGPT is. Fundamentally, one of these models is just a next-word prediction system, like the thing that sits on your phone as you're writing a text message and tries to guess what word you're going to type next.

08:44
What that means in practice is that we're modeling a distribution over strings, typically natural language strings, but now in practice anything you can write in Unicode. That distribution is represented via the normal chain-rule decomposition, one token at a time, with some kind of parametric model that's just trying to predict every word given all of the words that came before. So, concretely, thinking about what one of these in-context learning examples looks like, we're going to try to model a joint distribution over, for example, input-output pairs with some separator tokens and some input markers, by just asking: what's the probability of the first input, then the probability of some separator symbol given that first input, then the probability of the first output given what came before, and so on and so forth.
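In symbols, the chain-rule decomposition being described here is the standard one (not written out on the slide): for a token sequence x_1, ..., x_T, the model defines

    p(x_1, ..., x_T) = p(x_1) * p(x_2 | x_1) * ... * p(x_T | x_1, ..., x_{T-1}),

with each conditional term given by the same parametric next-token predictor.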

09:31
And, as you might expect, the very beginning of this generation procedure, the first step in this decomposition, is going to be super high entropy: the model doesn't know what it's supposed to be doing at all. If it has actually learned the task we're asking it to perform, then by the end, when we're asking it to predict the x's, it'll know the distribution of x's, and when we're asking it to predict the y's, it'll know pretty much precisely the distribution of y's given x's.

09:56
So how do we actually model this distribution in practice, in the kinds of language models we're using today? Pretty much every language model that is in widespread use, that you'll encounter, that people are working with in the research community right now, is a neural network, and it's a neural network of a very specific kind called a Transformer. There are a lot of low-level details about these Transformer models that I'm not going to talk about at all right now, but there are some high-level details that are important for understanding how these models work, so I want to talk through those things.

10:36
At a high level, when I'm given an input like this one, which contains goat misspelled, an arrow, goat spelled correctly, a comma, snake misspelled, an arrow (I guess some of the alignments are off here, but hopefully the correspondence is clear), the first thing we're going to do is take each of these natural language tokens, the word "goat" or the arrow symbol or whatever, and assign it what's called an embedding: we're going to map each of these words, using some fixed dictionary that is itself learned, to a high-dimensional parameter vector. So, if you've heard of word embeddings: what we're giving this model as input is an embedding of each of these words in sequence.

11:21
The first thing we're going to do, once we've represented each of the words in the input as some high-dimensional vector, is perform what's called attention. At every position in this input we're going to pick up the embedding we've computed so far, and we're going to take a linear combination of all of the word embeddings in the sentence up to the word we're predicting right now, where the weights in that linear combination are themselves some learned function of the input. Effectively, when we're looking at this word "goat", the model decides how strongly it wants to look at each of the other words in the input up to this point (that's the reason this is called an attention mechanism), pools them according to how strongly it's looking at each of them, and then builds some new representation, in context, of this word "goat" that incorporates information from all of the other words that have come before. We can do a little extra processing on top of this so we don't lose the identity of the word we had as input, but at a high level, think of the first thing a Transformer does, this attention mechanism, as just taking a weighted combination of all of the other representations we've built up to this point.

12:38
And we're going to do this in parallel for every token in the input. So we compute a representation of "goat" in this way, and "goat" gets to look at all three tokens to its left; we compute a representation of, for example, this last arrow character, and that last arrow character gets to look at the entire sentence we've seen so far. As a result of this entire process, we get a sequence of these attention vectors derived from our original input.
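To make the attention step concrete, here is a minimal single-head causal self-attention in NumPy. This is an illustrative sketch of the generic mechanism described above (real models add multiple heads, positional information, layer normalization, and other details), not the exact parameterization used in the talk:

    import numpy as np

    def causal_self_attention(X, Wq, Wk, Wv):
        """Single-head causal self-attention over a sequence of embeddings.

        X is a (T, d) array with one embedding per token; Wq, Wk, Wv are (d, d)
        learned projection matrices. Each position pools a weighted combination
        of the representations at or before it, with the weights themselves
        computed from the input.
        """
        T, d = X.shape
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d)                 # how strongly each token looks at each other token
        mask = np.tril(np.ones((T, T), dtype=bool))   # tokens may only attend to themselves and to the left
        scores = np.where(mask, scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over attended positions
        return weights @ V                            # weighted combination of value vectors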

13:07
Once we've done this, we're just going to apply some other learned, nonlinear transformation to each of these new vectors we've built up. This step doesn't look at any other piece of the input at all; it happens locally, in parallel, at every position of the input, and it works basically the way a normal feed-forward neural network works, if you've seen those, except that we're doing it for every word simultaneously. These two steps together, the attention step followed by this multi-layer perceptron, this generic nonlinear transformation, we're going to refer to as a Transformer layer.

13:44
We can take these Transformer layers and stack a bunch of them on top of each other in sequence: we do this once, and then we do it a second time, where now, when we do the attention step, we're attending not to the original word embeddings but to all of the representations we built at the last step of the procedure. In practice, for big models, you're doing this tens or hundreds of times to eventually build up a representation of the entire input sequence. What we get at the end, again, is a sequence of the outputs of those final MLP steps, one for every token in the input sentence.

14:25
The last thing we're going to do is take the very last hidden representation, the last output from the last of these Transformer layers, and try to decode it into a distribution over possible next tokens using basically something that looks like a log-linear model. Every word in the vocabulary, just like at the embedding layer, is associated with some vector; we compute dot products between those vectors and this last hidden representation and then exponentiate and normalize to get a distribution over y's. What we hope, in the context of these kinds of in-context learning problems, is that the distribution we get at the end places a lot of mass on the correct spelling of "snake" and not a lot of mass on all of the other words in our vocabulary.
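Putting the pieces together, here is a toy forward pass in the same style, reusing the causal_self_attention sketch above. Again this is a simplified illustration under assumptions of my own (tied input/output embeddings, ReLU feed-forward layers, no layer normalization or positional encodings), not a faithful description of any particular production model:

    def transformer_lm_next_token_probs(tokens, emb, layers):
        """Embed the tokens, apply a stack of (attention + MLP) layers, then score
        every vocabulary item against the final hidden state.

        tokens: list of integer token ids.
        emb:    (V, d) embedding table, reused here as the output projection.
        layers: list of dicts with keys 'Wq', 'Wk', 'Wv', 'W1', 'b1', 'W2', 'b2'.
        """
        H = emb[tokens]                                    # (T, d) input embeddings
        for layer in layers:
            A = causal_self_attention(H, layer['Wq'], layer['Wk'], layer['Wv'])
            H = H + A                                      # residual keeps the identity of each input word
            Z = np.maximum(0.0, H @ layer['W1'] + layer['b1'])   # per-position feed-forward step
            H = H + Z @ layer['W2'] + layer['b2']
        logits = emb @ H[-1]                               # dot product of every word vector with the last hidden state
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()                         # exponentiate and normalize: the next-token distribution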

15:12
Good. So, having defined this model architecture, how do we train these models? In the most basic sense, the training is super easy: you go out and scrape as much text data as you can get your hands on, you crawl the entire internet, you pay a bunch of people to sit in a room and write nicely structured documents for you, and then you just do maximum likelihood estimation with this model on that gigantic dataset.
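Maximum likelihood estimation here just means minimizing the per-token cross-entropy of the model on the training corpus; as a rough sketch (the standard formulation, not quoted from the talk), the objective over the corpus is

    L(θ) = - Σ_sequences Σ_t log p_θ(x_t | x_1, ..., x_{t-1}),

optimized with stochastic gradient methods over the scraped data.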

15:37
A lot of the challenges here are really engineering challenges more than modeling challenges, and a lot of the reason that, say, OpenAI is able to train these models and people in academia are not has to do with the cost of the computational resources you need to do this at a scale that makes it effective.

15:57
One thing I'll say that's not going to be part of this talk at all: in most modern language models, especially the ones that look and talk like chatbots, there's another step that goes on after this process, learning via something that looks like reinforcement learning from human feedback. That's not really going to be relevant for what we're talking about today; the relationship between that step and some of the capabilities I was talking about at the beginning is still a little unclear, but it is part of the modern recipe.

16:28
Okay, so this is the basic Transformer architecture. What happens when we actually do this, when we fit this maximum likelihood objective to the entire internet? What happens is that you get a model that is very, very good at a very large number of different things. Here are some examples from a recent paper showing good performance on the bar exam (although apparently those bar numbers are actually really sketchy, and maybe not something to take seriously), and pretty good performance, if not top-human-level performance, across a wide variety of things requiring both a little bit of reasoning and quite a lot of knowledge: the LSAT, the SAT, the GRE, both the math sections and the verbal sections. This has been in the news; I assume many people have seen it before. And I think it was surprising (surprising is maybe not even a strong enough word) to many of us in the field, both that just doing this at a large enough scale would get you here, that we would get here as quickly as we did, and that the large enough scale turns out to be only all of the text that humanity has ever written, and not something much more than that.

17:43
Okay, good. So among the things that happen when you train a model in this way, as we were discussing before, is this in-context learning phenomenon. Now we can talk a little more precisely about what ICL is, or what it would take to be a good in-context learner. It's something like being able to take a sequence of inputs that looks like this and have an output distribution that places a lot of probability, in this context, on the French word for cheese, and maybe on other plausible variations of it.

18:18
Similarly, if we want to do not machine translation but sentiment analysis, we want to be able to plug in some examples of sentences associated with sentiments, assign them natural-language labels, and have the model predict the appropriate ones (I guess there's some text getting cut off here), but in the right way.

18:36
And finally, one of the surprising things that happens is that you can do this not just with well-defined problems that you expect to see occurring many times in the training data, like the sentiment analysis problem or the machine translation problem, but with weird, made-up problems that surely nobody in the world has ever asked one of these models to solve before. The thing on the slide, this particular combination of symbols and meanings, is something Ekin made up the first time he gave a version of this talk, and the particular language model he asked was able to produce the right output here the first time.

19:16
So, coming back to the original question at the very beginning of the talk: what is actually going on here, under the hood, in virtue of which these models are able to solve problems like this? I think the initial reaction from most people in the language processing and machine learning communities was that learning was probably not the right way to describe what was actually going on in these models, and that it was maybe better to think of it as something like identification. Here's a quote from Reynolds and McDonell that I think captures this hypothesis about what's going on pretty nicely:

19:53
"Few-shot prompting is not really learning of skills but just locating the skills that the model already has. This is most obvious for translation, where you can't learn a new language from five examples." And I think the translation case is a clear example of where this is the right way of thinking about these models: you really can't learn French from five examples. Skipping ahead here: certainly, if there's an English word that you've never seen before and the model can produce the right French output, some of the knowledge about the relationship between English and French does not reside in the in-context training set but has to come from the model's background knowledge.

20:30
But when we think about these more algebraic, puzzle-solving problems, it's a little less clear that this is the right way of thinking about things. I think we can be pretty confident that, at least then (maybe not now), this particular example never appeared in the training data for any of these models; maybe other things with a similar flavor, a similar grammar-induction structure, did, but not exactly this one. And so, competing with the hypothesis we had on the previous slide, is another hypothesis: that this in-context learning thing is quote-unquote real learning. That is, the model is capable of taking a training set that fully determines the behavior you want to get out, and using that training set to index, within some reasonably large hypothesis class, the actual function you want the model to implement, even if that's a function it never saw executed anywhere at training time.

21:27
And so the thing we set out to do in this project was basically to answer that question: to figure out whether it is possible, even in principle, for something like one of these Transformer models to make a version of this hypothesis true, to get them to do real learning, and then to actually figure out what's going on in real models. Yeah, question?

21:56
[Audience question.] That is a great question. I am not sure; we could try it. I'm not going to be able to produce, off the top of my head, the full exploration that we did here, but that's a good point, and there are also examples of cases where providing this kind of extraneous information actually confuses the model. I'm sure it's also not the case that you can get 100% accuracy on things in this general class. Other questions before we go on?

22:32
Okay, cool. So the first question we're going to ask is whether it's possible for these Transformer-type models to do real learning, whatever that means, even in principle: to take the hypothesis we had before and figure out whether there's some class of functions and some class of models for which it is true. And we're going to do this by looking at those linear models we saw at the very beginning of the talk, just because this is a case where we know exactly what the space of algorithmic solutions looks like, it's very easy to generate data, and it's a simple enough problem that you don't have to go out and use something trained on the entire internet; you can actually do all of the learning yourself.

23:16
One other important caveat to make here: in some sense, the question of whether a Transformer can implement function X, even when function X is "train some little machine learning model on the inside", was already answered decades ago. We know that neural networks are general-purpose function approximators when made sufficiently large and trained on enough data. But the constructions you use to do that kind of universal approximation are kind of ridiculous; they blow up exponentially in the size of the input and the complexity of the function. So really what we're trying to figure out here is whether you can do this not just at all, but with models that are the kind of size and shape of the ones we're actually using today, the size and shape that are showing all of those real-world results I was showing you before.

24:09
The citation is cut off here, but it is to a paper by Garg et al. at Stanford, who, around the same time we were doing this, asked a similar set of behavioral questions about what kinds of functions, extensionally, you are able to learn in this framework.

24:25
Okay. So the way we're going to go about looking at this: rather than taking our translation problems from before and asking models to solve translation via this word-embedding step and so on, we're just going to generate some input-output pairs. We're going to sample a random weight vector (I guess I'm showing this as one-dimensional here, but these are going to be higher-dimensional problems later on), and we're just going to train the model in exactly the same way. And I guess for now we're not even training models; we're just asking how we can parameterize them to begin with. So: can we parameterize some Transformer in such a way that, when we give it this 0 as input, it predicts a 0 as output?

25:01
One thing to note here: in all of the experiments I'm going to show later on, we are going to train these models both to predict y's given x's and to predict x's given the whole history of interactions that have happened before. Here we're not going to worry about actually modeling the input distribution at all; we're just going to worry about modeling that conditional distribution. I think it's actually an open question how much this matters for real trained models.

25:30
Good. So, not to belabor this point too much: we've had linear regression, and solutions to linear regression problems, since the 1800s, and we know there are lots of algorithms you might pack into one of these Transformers that will solve them. You can do this by directly inverting the relevant matrix, or by gradient descent on some kind of least-squares objective, or something else that looks like that. And so the question is whether we can take any of these algorithms and pack them into one of these models.
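As a concrete reference point for the two families of algorithms just mentioned, here is a minimal NumPy sketch of both (standard textbook procedures; the step size and step count are illustrative choices, not taken from the talk):

    def ols_closed_form(X, y):
        """Directly invert the relevant matrix: solve (X^T X) w = X^T y."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    def least_squares_gradient_descent(X, y, steps=100, lr=0.01):
        """Batch gradient descent on the least-squares objective 0.5 * ||X w - y||^2."""
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            w = w - lr * X.T @ (X @ w - y)   # gradient step
        return w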

26:03
And it turns out that you can do this. The details are sort of mechanical, and I'm just going to give a high-level flavor of one of these constructions, but it turns out you can show two things. First, if you're iteratively trying to fit these linear models via SGD or via batch gradient descent, you can do that using a relatively shallow model, a model with depth that scales proportionally to the number of steps of SGD you want to do, using a reasonable number of these attention mechanisms and a reasonable number of these layers. And second, you can do this not just via SGD but by directly constructing that final predictor via a sequence of rank-1 updates to the (XᵀX)⁻¹ matrix.

27:02
Again, there's a lot of low-level mechanical stuff, but just to give a high-level flavor of how this works: we're going to define a little calculus of operations that we can implement inside a Transformer for moving data around and applying linear transformations to it, and once you have these things you can just chain them together and implement really any algorithm you choose that bottoms out in these operations. So what are the things we need to do, for example, paradigmatically, to implement gradient descent, or even just the first step of gradient descent, on this objective?

27:29
Well, one thing we're going to need to be able to do, if most of the processing is going to get done by these multi-layer perceptron units, these feed-forward layers we were talking about before, is just to consolidate our data: we need to get our x's and our y's into the same part of the representation space so that the model can work on them. We're going to call this a "move" operation, and it turns out that you can implement this move operation pretty simply using the attention mechanism we saw before, which can just pick up a piece of the hidden state from one input and move it to the next time step. So the first thing we're going to do for SGD here, for example, is just accumulate the x's in with our y's.

28:07
Another thing you need to do is apply affine transformations. For example, on this first step we have some initial guess at the weight vector, and we need to dot it with our initial feature vector to figure out what our error is, so we need some way of doing affine transformations. Affine transformations are, in some sense, the only thing these multi-layer perceptron units can do, so this is very, very easy to implement using the MLP layer we were showing before.

28:35
And then the last thing, the thing that turns out to be the fussiest piece of this, is computing dot products between pieces of these feature vectors, or between feature vectors across time. You need this, for example, to scale your input by your prediction error when you're doing an SGD step. It turns out you can also do this, by exploiting a bunch of very low-level details in the way these MLP layers are defined in practice in real-world Transformer models. This is very inside baseball, and probably not worth going into in too much detail (if the details would make sense to you, you've already read the paper), but basically you can get the nonlinearity in that MLP to pretty well approximate element-wise multiplication of a couple of these vectors, or their scaling by a scalar, if you get things very close to zero and do things in the right way.

29:27
And once you've done this, once you've defined a Transformer chunk that does this move operation, a Transformer chunk that does this affine operation, and a Transformer chunk that does this dot-product operation, then all you need to do to show that a Transformer can implement SGD is write the program, in terms of these operations, that actually does so. Here, for example, is the one that does SGD; you can write a corresponding one for doing the sequence of rank-one updates with the Sherman-Morrison formula. And that's all you need to do to generate these kinds of in-principle demonstrations that Transformers can do real in-context learning.
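For orientation, here is what those two target algorithms compute at each step, written out directly in NumPy (standard formulas; this is the arithmetic the constructed programs are meant to reproduce, not code from the paper):

    def sgd_step(w, x, y, lr):
        """One step of SGD on 0.5 * (w.x - y)^2: scale the input by the prediction error."""
        return w - lr * (w @ x - y) * x

    def sherman_morrison_update(A_inv, x):
        """Rank-one update of (X^T X)^{-1} after appending a new example x."""
        Ax = A_inv @ x
        return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

Initializing A_inv to (εI)⁻¹ for a small ε and maintaining the running vector b = Xᵀy lets you read off a (lightly ridge-regularized) least-squares solution w = A_inv @ b after every new example.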

30:06
One other nice detail here is that all three of those operations I was showing before can actually be consolidated into one big, generic read-transform-write operation that you can implement in a generic way. And there has been some follow-up work since we put this out, developing generalized versions of that generic Transformer read-attend-write operator, to make, in some sense, nicer programming languages for describing what goes on, or what kinds of functions you can implement with these Transformers, and coming up with tweaks to this architecture.

30:42
Good. So, maybe unsurprisingly, but it's nice to see: it is in fact possible to parameterize these models so that they implement real algorithms that we've written down ahead of time, and thus it's possible, at least in theory, that the Transformers we're seeing out in the real world are doing real ICL for at least some subset of these in-context learning problems.

31:03
And so, of course, the natural question now is what happens in practice when we actually train these models on the distribution we assumed in the previous part of the talk: do they actually converge to the kinds of solutions we were expecting to find here? We can ask this in a couple of different ways. One: do you get the same kinds of generalizations that would be predicted if the model were implementing some version of real learning as we've described it? And second: what's the relationship between a model's ability to discover the solution and lower-level details about the capacity of the model, how big it is, and other implementational things?

31:46
So we're going to try to answer both of these, and we're going to do it by taking this basic learning setup we assumed for constructing in-context learners before and actually training models on this data. We're going to sample some weight vectors, sample some input vectors, and generate some y's using the weight vectors and the input vectors that we sampled. In particular, we're going to construct a dataset of sequences where each sequence was generated by a single one of these weight vectors: here I have a bunch of y's that were generated from some particular w, sampling my x's from some normal distribution that's going to stay the same throughout. I'm going to present the model with the sequence and ask it to make a prediction at every step of the sequence, and when it gets to the end of the sequence, I'm going to sample a new weight vector and give it a sequence of x-y pairs generated by that new weight vector. And we can do this over and over again: this data is fake, it's easy to generate, we can generate lots of it, and you can train actually smallish models, as we're going to see later on, to do pretty well on this objective.
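A minimal sketch of the data-generation process just described, with illustrative defaults of my own for the dimension, sequence length, and noise level:

    def make_icl_regression_sequence(d=8, n_examples=40, noise_std=0.0, rng=None):
        """One training sequence: a fresh weight vector per sequence, x's drawn
        from a fixed normal distribution, and y = w . x (plus optional noise).
        The model is shown (x1, y1, x2, y2, ...) and asked to predict each item."""
        rng = rng or np.random.default_rng()
        w = rng.normal(size=d)
        xs = rng.normal(size=(n_examples, d))
        ys = xs @ w + noise_std * rng.normal(size=n_examples)
        return xs, ys, w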

32:57
So what happens when you do this? The first thing we're going to look at here is just the quality of the predictions the model is making.

33:07
Notice (the axis labels got chopped off here, but we'll explain them in a minute) that when we're sampling from the model, at no point are we asking it to explicitly exhibit an inferred parameter vector; we're just asking it to predict our y's from our x's. For all of the plots I'm going to show now, we're looking at eight-dimensional problems, so, unsurprisingly, with fewer than eight examples in the input the model can't possibly know exactly which noiseless linear function it's trying to learn, and you get some high error rate that declines gradually over time. So this curve is just the squared error for the predictions made by the model; all of the other lines I'm going to show now are not agreement with the ground truth but agreement with other prediction algorithms that we can fit to the same data sets we're handing to our Transformer model to do ICL. So we might... oh yeah, question?

34:13
[Audience question.] Sorry? Yeah, just one epoch, so it just sees x, y, x, y, x, y in sequence, without repetition. Other questions about the setup? Okay.

34:29
Good. So what we're going to ask now is not just how well this model fits the ground-truth labeling function, but how well it matches other learning algorithms we might train on the same data. We might hypothesize, and in fact people in the NLP community have hypothesized, that the right way to think about ICL is that it's doing something like nearest neighbors. So we can fit some nearest-neighbor models, and those don't actually seem to do a very good job at all of describing the kinds of predictions this Transformer makes. You can do SGD, and various versions of SGD and gradient descent where you look at all the examples in a batch or you look at them one at a time; all of these agree relatively well with the model's predictions early on, diverge around that critical example, and then start to fit better again later on.

35:21
You can also ask about more standard regression algorithms (I was going to say old-school, but really just standard). Here we're looking at regularized, that is, ridge, regression; this is a noiseless problem, so in principle you shouldn't need regularization, but nevertheless it gives a really pretty good fit to our data. And, maybe unsurprisingly, if you just do ordinary least squares, taking the minimum-norm solution in this fewer-than-eight-examples regime, it is an almost perfect fit to the predictions these models are making. So they really do seem to be behaving, without saying anything yet about the computations that support that behavior, like these OLS-type models. And this maybe should not surprise us: if this is really our data generation process, we know that the minimum-Bayes-risk predictions are going to come from a weight vector that looks like this, and so if we could really drive this loss function all the way down to zero, what we would have to do, in this noiseless regime, would be to make predictions that look exactly like this.
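For reference, the minimum-norm least-squares solution referred to here is a standard formula (not something spelled out on the slide): with fewer examples than dimensions, the smallest-norm weight vector that fits the data exactly is w_hat = Xᵀ (X Xᵀ)⁻¹ y; once there are at least as many independent examples as dimensions, it coincides with the usual w_hat = (XᵀX)⁻¹ Xᵀ y.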

36:25
We can stress-test this a little bit by changing the data generation process so that the minimum-risk predictor looks a little different. For example, we can add some noise to our y's, so that now the right thing to do is to behave as though you're regularized a little bit. And we know that, for a particular distribution of weight vectors and a particular distribution of noise being added to these y's, there's a nice closed form for the ground-truth predictor in terms of those two things: you can do something that looks like ridge regression, with a parameter determined by these two noise scales.
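Concretely, the closed form being referred to is the standard Bayesian linear regression result (the variances below are placeholders for whatever values were used in the experiments): if the weight vectors are drawn w ~ N(0, τ²I) and the observations are y = wᵀx + ε with ε ~ N(0, σ²), then the minimum-Bayes-risk prediction uses the ridge estimate

    w_hat = (XᵀX + (σ²/τ²) I)⁻¹ Xᵀ y,

so the regularization strength is fixed by the ratio of the two noise scales.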

37:05
And when you do this and ask which predictor gives the best fit to the model, you again see exactly the pattern you want to see: when there's no noise at all, the fit to OLS is better than everything else, and as you start to add noise, either to your w's or to your y's given w's, the predictor that best fits the predictions you actually get out of this Transformer is exactly the one you would expect. So this is just saying that the finding from before is robust: these models, at least at the scale we're training them, really do give you exactly the predictor you want for these kinds of linear regression problems. And this agrees with that Garg et al. paper I was talking about before, which does similar kinds of experiments, also in the presence of sparse x's, sparse w's, things like that.

38:02
A more interesting question we can ask here is whether this always happens, or whether, in the presence of stronger computational constraints than we've assumed up to this point, models are still able to fit these distributions perfectly, or whether they do something different.

38:22
And the cool thing that happens here is that you actually get different fits to different algorithms as a function of model size. For the very shallowest models, if we only give them one layer, they are not perfectly described, but best described, as doing something that looks like a single step of (I think I've covered this up with the legend, but) batch gradient descent on the input. As you make these models bigger, they look a little bit less like they're doing gradient descent and a little bit more like they're doing proper least squares, and as you make them big enough, like we've been doing before, you converge to that OLS solution. You can also look at this as a function of the hidden size of the model, the size of these embedding vectors that we're passing up, and there you see a less clear but similar kind of trend, where for very, very small hidden sizes you have a slightly better, but not great, fit from these SGD-type predictors, and at bigger sizes you have a better fit from the right ones.

39:23
And so we can think of there being a couple of phases, or regimes, in model architecture space that describe the kinds of algorithmic solutions these models find to this regression problem, and we're going to look later on specifically at this SGD regime and try to figure out better what's going on there.

39:48
The last question we're going to ask about these trained models, before we do that, is just whether we can figure out anything at all about what's going on under the hood. So far, all of the characterization of real models we've done has been extensional, in terms of their functional form and not in terms of their internal computations. But it's reasonable to ask, given that we have these constructions lying around telling us what intermediate quantities you might need to compute in order to solve these problems, whether we can see any evidence that trained models are actually computing the relevant intermediate quantities.

40:26
To do this, we're going to run what's now called, in the ML literature, a probing experiment. We take our original trained model, freeze its parameters, and fit some teeny little model, either just a single linear readout or a tiny multi-layer perceptron, and we train this probe model to predict what the real optimal w-hat was, for the problem being presented in the input, from the internal states of the Transformer model. So, basically: can I just look at one of these hidden states and recover, with some predictor that's not powerful enough to solve the linear regression problem on its own, the predictor that we think the model is using to actually make its predictions?
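A minimal sketch of the linear-readout version of such a probe, under the simplifying assumption that we collect one hidden state per probed context (the probe is itself just ridge regression, deliberately too weak to solve the task from scratch):

    def fit_linear_probe(H, W_true, reg=1e-3):
        """Fit a linear map from frozen hidden states to the target weight vectors.

        H:      (N, d_hidden) hidden states collected from the frozen Transformer,
                one row per probed context/position.
        W_true: (N, d_task) the least-squares weight vector for each context.
        Returns a (d_hidden, d_task) probe; evaluate it on held-out contexts
        as H_test @ probe and compare against the true weight vectors.
        """
        d = H.shape[1]
        return np.linalg.solve(H.T @ H + reg * np.eye(d), H.T @ W_true)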

41:12
You can do this for different kinds of intermediate quantities you might want to look for. The natural one is just this weight vector: as we said before, we're never actually showing the model any weight vectors and we're never asking it to generate a weight vector, but if you try to probe for this weight vector in its intermediate states, you find that you can recover it, and you can actually do it pretty well with a linear readout right near the end of the network, which I think is exactly what we would have expected from the little construction we gave before.

41:40
As a sanity check, you can try doing this on a model that is trained to perform some task that requires you to look at all of the inputs and outputs but is not linear regression, and for those, what we're calling control tasks, no matter where you look in the network, you don't acquire the ability to recover w as accurately as we did here. So that's some evidence that, in the course of making the predictions we were looking at before, this weight vector actually gets encoded by the models that are making those predictions, which is a cool thing to find.

42:15
We can ask about other relevant intermediate quantities, like the product of our X matrix and our y's, and if we do this the story is a little muddier. It's definitely not linearly encoded; maybe it's non-linearly encoded, and the evidence that it's non-linearly encoded is the gap between the control task and the real task, which is not as big as the one we were just looking at. But to the extent that there's anything going on, it's going on much earlier in the network: the model computes this around layer seven or layer eight and then hangs on to it. Again, this is a little more speculative, and I wouldn't take it as dispositive of the model computing this quantity, but it is what we would expect if it were implementing one of the algorithms we looked at before, because this is a first step you need to do in order to compute the w later on.

43:05

um good so to come back to all of these

43:08

questions uh that we were sort of asking

43:11

at the beginning of this talk uh right

43:13

what is in context learning and the sort

43:14

of main hypotheses that we want to

43:16

discriminate here are whether it's task

43:18

identification or quote unquote real

43:20

learning um and in the context of a very

43:22

very very simple real learning problem

43:24

uh We've shown that you know it's

43:26

possible at least in principle that

43:27

models are really doing it that you

43:29

can get smallish Transformers uh to

43:32

implement real learning algorithms and

43:34

we've presented both sort of Behavioral

43:35

evidence and at least preliminary

43:38

um uh representational evidence that

43:41

this is actually what's being

43:42

implemented by these models at least at

43:44

the scale that we're looking at

43:45

so in I guess the little bit of time

43:47

that I have left before I switch over to

43:49

questions

43:50

um one thing that's been really fun this

43:51

was obviously a question that was like

43:52

very much on people's mind uh I guess a

43:57

year ago last summer when we started

43:58

doing this project and there's been a

44:00

ton of work that's come out in this

44:02

space both kind of concurrently with uh

44:04

with what we were doing here uh and

44:05

since this paper came out

44:07

um so you know some natural questions

44:09

that you might have at the end of

44:11

everything that I've been showing here

44:12

are whether these constructions we've

44:15

given are actually the right ones or

44:16

whether especially given that you know

44:18

we were getting pretty good results from

44:20

like two layer Transformers four layer

44:22

Transformers whether we can do these

44:23

things more efficiently whether we can

44:25

say anything uh you know sort of

44:27

theoretically about the conditions under

44:28

which we're actually going to recover

44:30

the the real learning solution

44:32

um and how this relates to data

44:35

distributions both in these kinds of

44:37

synthetic models and in Real Models um

44:39

and what's nice is that we're starting

44:40

to get answers actually to all of these

44:42

questions so shortly after

44:45

um uh we did this uh there was a paper

44:48

actually also from another group at

44:49

Google that I guess had not been talking

44:51

to our Google collaborators uh showing a

44:53

very similar thing showing that you

44:54

could get standard Transformers uh to

44:57

implement in this case uh fitting these

45:00

linear regression problems via a

45:03

single step of gradient descent uh and

45:06

they do it in a very different way they

45:07

use the attention mechanism actually to

45:09

compute all of the dot products that you

45:10

need to compute and this allows them to

45:12

do it in a single layer with a single

45:15

attention head

45:16

um and so if you think back to those

45:18

kind of experimental results we had

45:19

before right showing that in the single

45:22

layer regime we're looking more like a

45:24

one step of gradient descent model than

45:26

anything else here is now some evidence

45:28

or here is an example of a way in which

45:31

you can solve this problem by

45:34

parameterizing a Transformer that would

45:36

generate exactly that behavior for you

45:38

um there's one kind of important caveat

45:39

here which is that this is not actually

45:41

using the kind of real models that

45:42

people train in practice but something

45:44

with a slightly simpler and slightly

45:46

less expressive attention mechanism and

45:49

this might account for the little Gap

45:51

that we were seeing between our SGD

45:53

predictors and uh and the real Model

45:55

Behavior
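Here is a minimal numpy sketch of that equivalence (an illustration of the idea, not that paper's actual parameterization): a softmax-free attention head whose query is the test input, whose keys are the in-context x's, and whose values are the y's reproduces exactly the prediction of one gradient-descent step from w = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 8, 32, 0.05
X = rng.normal(size=(n, d))        # in-context inputs
y = X @ rng.normal(size=d)         # in-context labels from a random w
x_q = rng.normal(size=d)           # query input

# Prediction after ONE gradient step on squared loss from w = 0:
w1 = (2 * lr / n) * X.T @ y
pred_gd = x_q @ w1

# Same prediction from a single linear (softmax-free) attention head:
# query = x_q, keys = x_i, values = y_i, output = sum_i (x_q . x_i) * y_i.
attn_scores = X @ x_q              # the dot products computed by attention
pred_attn = (2 * lr / n) * attn_scores @ y

assert np.allclose(pred_gd, pred_attn)
```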

45:56

um even cooler very recently some folks

45:58

at Berkeley I think Spencer is now at UC

46:01

Davis but showed that not only can you

46:03

do this but you are under certain

46:04

conditions guaranteed to converge to

46:08

exactly this linear self-attention one

46:10

step of gradient descent solution so uh

46:13

you know we can say now a little bit

46:15

more precisely uh the conditions under

46:17

which in context learning really is real

46:19

learning and you can actually uh

46:22

guarantee this up front using this

46:23

linear self-attention model

46:25

um finally all of these interesting

46:27

questions about data sets

46:30

um so one and I think this is a very

46:32

recent paper but but cool thing that

46:34

came out right

46:35

um You can imagine that if you only ever

46:38

trained uh your model at training time

46:41

on outputs from a single weight Vector

46:44

uh the right thing to do kind of no

46:46

matter what is to not do any in context

46:47

learning just memorize that weight

46:49

vector and use that weight Vector to

46:50

predict whatever y's you're going to

46:52

see at test time

46:54

um and so there's a sort of question of

46:55

like how much diversity in that

46:57

training set you really need to see

46:59

before you switch over to being uh a

47:03

real learner as opposed to something

47:04

that's memorized a fixed family of weight

47:06

vectors

47:07

um and so now some empirical evidence

47:08

that you do get exactly that kind of

47:10

again phase transition as a function of

47:13

the diversity of the data set that for

47:15

small W's you remember a small number of

47:17

W's you memorize all the W's for a

47:19

sufficiently large number uh even finite

47:22

of W's that you get to see over and over

47:24

again during training time eventually uh

47:26

you learn to uh to do in context

47:29

learning instead I think this is

47:31

especially interesting because it sort

47:32

of goes against that

47:34

um you know min Bayes risk story that I

47:35

was talking about before when you're in

47:37

the regime of having a finite number of

47:39

uh W's that you've ever seen at training

47:41

time uh probably the right thing to do

47:43

is eventually you know at least sort of

47:45

in a Bayesian sense

47:46

um is to just memorize that large finite

47:48

set of W's models don't do that and at

47:51

some point they learn the the solution

47:52

that generalizes instead
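A sketch of the data-generating setup behind that diversity experiment (the names and sizes here are hypothetical, just to make the two regimes concrete): during training, every prompt's labels come from a weight vector drawn from a finite pool, and the question is how large that pool has to be before the trained model generalizes to fresh w's in context.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_examples = 8, 16

def make_prompt(w, sigma=0.1):
    """Build one in-context prompt (X, y) from a fixed weight vector w."""
    X = rng.normal(size=(n_examples, d))
    y = X @ w + sigma * rng.normal(size=n_examples)
    return X, y

def training_stream(num_tasks, steps):
    """Finite task pool: a small num_tasks rewards memorizing the W's,
    a large num_tasks should push the model toward a general in-context learner."""
    pool = rng.normal(size=(num_tasks, d))
    for _ in range(steps):
        w = pool[rng.integers(num_tasks)]
        yield make_prompt(w)

# At evaluation time, draw a fresh w that was never in the pool and check
# whether the model's in-context predictions still track the least-squares fit.
```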

47:55

um and finally we can start to ask these

47:57

kinds of questions about Real Models

47:58

thinking back to the very beginning of

48:00

the talk right we have this kind of back

48:01

and forth between is this just task

48:03

identification uh is this uh real

48:06

learning uh and now some empirical

48:08

evidence that in fact uh it's a

48:10

mixture of both and that you can by

48:11

changing the label distribution or uh

48:14

the kinds of instructions that you

48:15

provide up front uh actually induce

48:18

models to behave more like task

48:19

identifiers or more like uh in context

48:22

Learners uh again just by sort of

48:24

manipulating the inputs
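One common way to probe this in real models is to manipulate the demonstrations themselves, for example by flipping or randomizing the labels in a few-shot prompt: a pure task identifier should barely care, while a "real learner" should track the corrupted mapping. A rough sketch of that kind of prompt construction (the sentiment template is purely illustrative):

```python
def build_prompt(examples, query, flip_labels=False):
    """Few-shot prompt builder; flipping labels tests whether the model is
    really learning the input-label mapping or just identifying the task."""
    lines = []
    for text, label in examples:
        if flip_labels:
            label = "negative" if label == "positive" else "positive"
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)
```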

48:26

um finally oh man everything moved

48:28

around on the slide but one thing that

48:29

we started to look at uh that's a sort

48:31

of next direction that I'm super excited

48:32

about

48:33

um is generalizing to more interesting

48:36

kinds of prediction problems right so we

48:38

can ask models now not just to produce a single

48:40

categorical label for these things uh

48:42

but more structured kinds of outputs

48:44

that maybe start to look a little bit

48:45

more uh like the kind of real machine

48:47

translation examples and and other text

48:49

generation examples uh that we saw at

48:51

the beginning of this uh of this talk

48:53

um uh what we're looking at specifically

48:55

is in context learning of uh finite

48:58

automata and other sort of formal

48:59

languages and here it turns out that at

49:02

least empirically right these models are

49:03

also very very good at doing this uh in

49:05

context you can get close to perfect

49:07

accuracy
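As a concrete picture of that setup, here is a minimal sketch of how such an in-context automaton task might be generated (the token format and sizes are assumptions for illustration): each episode samples a hidden DFA, streams input-symbol and resulting-state pairs, and the model is scored on predicting the state tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dfa(num_states=4, alphabet_size=3):
    """Random deterministic automaton: transition[state, symbol] -> next state."""
    return rng.integers(num_states, size=(num_states, alphabet_size))

def make_icl_sequence(transition, length=32):
    """One in-context episode: a stream of (symbol, resulting state) pairs from
    a single hidden DFA; the model must infer the transitions to predict states."""
    state, tokens = 0, []
    for _ in range(length):
        sym = rng.integers(transition.shape[1])
        state = transition[state, sym]
        tokens += [f"a{sym}", f"s{state}"]   # interleave inputs and outputs
    return tokens
```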

49:08

um and interestingly this seems to be a

49:10

task that separates uh these

49:13

Transformers the sort of current

49:14

architecture from a lot of the other uh

49:18

work that has been done in recent years

49:19

uh trying to propose some new new kinds

49:23

of models that are easier to

49:24

train uh have lower computational

49:26

budgets things like that

49:28

um and so

49:30

ICL right this this in context learning

49:33

thing seems not only to be a sort of

49:35

surprising property of these uh these

49:37

models but not something that you

49:39

necessarily get for free at scale in any

49:41

sufficiently large neural sequence

49:43

predictor uh there is something special

49:45

about Transformers and so the question

49:46

is what what is that thing uh and

49:49

ultimately if we can answer that

49:50

question uh we will hopefully know what

49:53

we need to do to figure out what the

49:54

next generation of model architectures

49:56

looks like

49:57

um with that I will wrap up uh you know

49:59

as always most of the credit goes to

50:01

Ekin and the rest equally

50:03

distributed among our collaborators this

50:05

was a super fun collaboration

50:06

um and happy to answer any questions
