The following is an AI-generated summary and article based on a transcript of the video "Jacob Andreas | What Learning Algorithm is In-Context Learning?". Because of the limitations of AI transcription and summarization, please verify the accuracy of the content for yourself.
00:00 | hi everyone uh I'm Jacob I work on uh |
---|---|
00:03 | mostly natural language processing at |
00:04 | MIT |
00:05 | um today I'm going to be talking about |
00:07 | something a little different which is uh |
00:09 | the worst possible way to fit uh or to |
00:11 | estimate the parameters of a linear |
00:12 | model and what this can teach us about |
00:14 | uh sort of modern NLP systems and |
00:17 | language generation systems |
00:19 | um here we go okay so to start off right |
00:21 | everybody's favorite introductory |
00:23 | estimation problem linear regression we |
00:25 | have some x's we have some y's a new X |
00:27 | comes along and we want to be able to |
00:29 | predict uh the corresponding y from this |
00:31 | x under whatever the data data |
00:33 | generating process is |
00:35 | um right and so as every undergraduate |
00:37 | now knows uh the way to do this is you |
00:39 | take all of your x's and you put them |
00:41 | into a matrix you take all your y's you |
00:42 | put them into a matrix and then you open |
00:45 | up ChatGPT and you paste your x's |
00:46 | and your y's into ChatGPT and you ask it |
00:49 | to give you a solution to this problem |
00:51 | um I actually I wasn't going to do this |
00:53 | for real but I did this when I was |
00:54 | putting these slides together and it |
00:55 | produces this very long very |
00:57 | authoritative sounding answer there's a |
00:59 | bunch of you know sort of intermediate |
01:00 | computations and math here and if you |
01:03 | take the actual predictor that it gives |
01:04 | you at the end and you drop it back into |
01:06 | this problem |
01:08 | um it looks a little bit like this so |
01:11 | not necessarily a reliable way to uh to |
01:14 | estimate linear regression models right |
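For contrast with the ChatGPT experiment just described, here is a minimal sketch of the classical way to fit the same model. The use of numpy and the specific shapes are illustrative assumptions, not something prescribed in the talk:

```python
import numpy as np

# Minimal sketch: fit a linear model to (x, y) pairs the classical way,
# instead of pasting them into a chat model. Shapes are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))           # 20 examples, 3 features
w_true = rng.normal(size=3)
y = X @ w_true                         # noiseless targets

# Ordinary least squares: w_hat = argmin_w ||Xw - y||^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = rng.normal(size=3)
print(x_new @ w_hat)                   # prediction for a new x
```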
01:16 | now and you know more generally I guess |
01:19 | this is obviously a kind of ridiculous |
01:21 | thing to do we have already uh very very |
01:24 | old uh very very good algorithms for |
01:27 | solving these kinds of problems and so |
01:28 | it doesn't seem like the sort of thing |
01:30 | uh that we might necessarily uh need to |
01:34 | ask a language model to do or or where |
01:36 | it would be an interesting question to |
01:37 | ask whether language models text |
01:40 | generation systems uh like the kind that |
01:42 | we have today are able to do it at all |
01:46 | um nevertheless uh the kinds of things |
01:49 | that we can do with these big open-ended |
01:51 | text generation systems uh are becoming |
01:55 | you know sort of like more and more |
01:57 | General uh these models themselves are |
01:59 | becoming more and more capable right so |
02:01 | here's an example from what's now you |
02:03 | know an oldish paper 2020 uh showing |
02:05 | that these models can do real sort of |
02:07 | open-ended uh text generation so here's |
02:09 | a particular language model and we'll |
02:11 | talk more specifically about what these |
02:12 | language models are in a minute uh but |
02:14 | that is being prompted right it's asked |
02:16 | to predict what text should come next |
02:18 | after uh this text that we have in light |
02:21 | gray up at the top of the slide so we |
02:23 | give it the title of uh sort of |
02:25 | hypothetical news article we give it the |
02:26 | subtitle of a hypothetical news article |
02:29 | and then we ask it to generate the text |
02:30 | of the article itself and it gives you |
02:32 | this long paragraph right that uh are |
02:35 | you know both sort of compatible with |
02:37 | the title describing the sort of thing |
02:38 | uh that you would expect to be in an |
02:40 | article with that title written in the |
02:42 | style of a news article uh and more or |
02:45 | less internally coherent um it's also |
02:47 | worth noting that uh this article is or |
02:50 | the sort of factual content of this |
02:51 | article is wrong uh that it's describing |
02:54 | you know as prompted a sort of real |
02:55 | split that happened in the Methodist |
02:57 | Church but the identities of the |
02:58 | individual sort of groups that came out |
03:00 | of the Schism uh while plausible |
03:02 | sounding are totally made up |
03:05 | um now we can do more with these text |
03:08 | generation systems than just generate |
03:10 | long blocks of text |
03:12 | um one of the other surprising |
03:14 | capabilities uh that we've seen as |
03:17 | people have started to play around with |
03:18 | these things is that they can do |
03:20 | instruction following right so you can |
03:22 | take again a sort of generic uh next |
03:25 | word prediction system uh you can say |
03:28 | give it as input the text explained the |
03:30 | moon landing to a six-year-old in a few |
03:31 | sentences |
03:33 | um and a good enough text predictor will |
03:35 | actually respond by following this |
03:37 | instruction and by generating some text |
03:38 | that specifically explains the moon |
03:40 | landing or answers a question why do |
03:42 | birds migrate south for the winter |
03:44 | um and so on and so forth |
03:46 | um |
03:47 | and maybe the last sort of surprising |
03:50 | capability that we've seen in these |
03:52 | models as they have scaled up and as |
03:56 | people have started to play around with |
03:57 | them more |
03:58 | um is a phenomenon that is coming to |
04:01 | be referred to as in context learning |
04:03 | and so what's happening here is that |
04:05 | rather than just giving this model uh |
04:08 | the sort of beginning of a document that |
04:09 | we want it to complete and rather than |
04:11 | giving it an explicit textual |
04:12 | description of the task that we wanted |
04:14 | to perform we just give it some examples |
04:17 | of that task being performed on other |
04:19 | kinds of inputs so here we're asking it |
04:21 | to do some sort of grammatical error |
04:23 | correction task and we do this by |
04:26 | placing you know as the input to this |
04:28 | model the string poor English input I |
04:30 | eated the purple berries good English |
04:32 | output I ate the purple berries poor |
04:35 | English input thank you for picking me |
04:36 | as your designer I'd appreciate it but |
04:38 | good English output and we fill in all |
04:40 | of these pieces like this so again all |
04:42 | of this text that we're showing in light |
04:43 | gray on the slide is uh text that we as |
04:46 | a human user of this system are |
04:48 | providing and notice that the sort of |
04:50 | structure of this input right is that we |
04:52 | have some human written bad examples |
04:54 | good examples some human written inputs |
04:56 | and outputs |
04:57 | um uh We've Ended here with an input and |
05:01 | what the model is going to generate next |
05:02 | is a desired output and in this case |
05:04 | what we get out of the model is in fact |
05:07 | a corrected version of the last sentence |
05:09 | that we asked for and so we can specify |
05:11 | tasks that we want these language |
05:14 | generation systems these next word |
05:16 | prediction systems to perform uh again |
05:18 | not sort of in in the form of natural |
05:20 | language instructions but in the form of |
05:22 | tiny tiny data sets for these new tasks |
05:24 | that we're asking these models to |
05:26 | generate and this turns out uh you know |
05:29 | and this was something that was not even |
05:31 | necessarily planned out ahead of time |
05:32 | but discovered in these models uh around |
05:34 | 2020 uh and it's something that you can |
05:36 | do for a wide variety of tasks uh you |
05:39 | can do it even for simple arithmetic |
05:41 | kinds of things you can do it for |
05:43 | spelling correction you can do it for |
05:44 | translation you can do it for that |
05:46 | grammatical error correction and we'll |
05:47 | see a couple more examples later on um |
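Concretely, an in-context learning prompt like the grammar-correction one above is just a string in which the "training set" precedes the query. A rough sketch follows; the helper and template below are illustrative, not the format any particular model requires:

```python
def make_icl_prompt(examples, query):
    """Assemble a few-shot prompt; the model's continuation is read as the answer."""
    parts = []
    for bad, good in examples:
        parts.append(f"poor English input: {bad}\ngood English output: {good}")
    parts.append(f"poor English input: {query}\ngood English output:")
    return "\n".join(parts)

prompt = make_icl_prompt(
    examples=[("I eated the purple berries.", "I ate the purple berries.")],
    query="Thank you for picking me as your designer. I'd appreciate it.",
)
# `prompt` is handed to the language model as its input; whatever it generates
# next is taken to be the corrected sentence.
```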
05:52 | um and I think you know the important |
05:53 | thing to note here is that in some cases |
05:55 | this is uh |
05:58 | more effective right than providing a |
06:01 | precise natural language description of |
06:03 | the task that you want the model to |
06:04 | perform that they actually get better at |
06:06 | doing these kinds of things when you |
06:08 | give them uh explicit examples in the |
06:11 | input uh and just ask them to generate |
06:13 | more examples or generate you know the |
06:15 | output pieces of those examples uh from |
06:17 | that same distribution um and so this |
06:19 | you know like I said before this |
06:21 | phenomenon is coming to be referred to |
06:22 | as in context learning and what this is |
06:25 | meant to evoke is that there is a kind |
06:27 | of learning that looks like machine |
06:30 | learning as we're used to thinking of it |
06:32 | or parameter estimation as we're used to |
06:33 | thinking of it but that is happening now |
06:35 | not as part of training one of these |
06:37 | prediction systems but actually uh in |
06:41 | the course of figuring out how to |
06:42 | generate text or figuring out how to |
06:44 | complete one of these documents |
06:47 | um and so what this talk is going to be |
06:49 | about the sort of big question that I |
06:51 | want to ask today is basically what on |
06:53 | Earth is going on here uh what is the |
06:55 | relationship between this particular |
06:57 | phenomenon and learning as we're used to |
06:59 | thinking about it in a machine learning |
07:01 | setting and what can we say about the |
07:03 | the sort of you know reliability of |
07:05 | this process or the limits of |
07:06 | this process |
07:08 | um I want to note also before I go on |
07:10 | here that uh everything that I'm going |
07:11 | to be talking about today was led by my |
07:13 | student Ekin Akyürek with a big collaboration |
07:15 | from some folks at Google and Stanford |
07:17 | as well uh Dale Schuurmans and Denny Zhou |
07:19 | at Google and Tengyu Ma who's joint |
07:21 | between Google and Stanford |
07:23 | and the sort of broad uh outline of this |
07:25 | talk is to First figure out uh or you |
07:28 | know discuss just what this in context |
07:29 | learning phenomenon is uh and what |
07:32 | possible hypotheses we might build about |
07:34 | what's going on when models are doing |
07:36 | this text generation process |
07:39 | um and then we're going to ask with the |
07:41 | kinds of models that we have today |
07:43 | algorithmically what kinds of things |
07:44 | might they be able to do what kinds of |
07:46 | things might they not be able to do when |
07:48 | presented with these sorts of in-context |
07:50 | learning problems |
07:51 | um and this is all going to be kind of |
07:52 | in principle arguments just from the |
07:54 | structure of uh language models as they |
07:57 | exist today uh you know without looking |
07:59 | at any real models and then we'll |
08:01 | actually go out and train some models |
08:02 | and try to figure out what's what's |
08:03 | really going on under the hood and then |
08:06 | talk about sort of the implications of |
08:07 | this uh more more broadly for research |
08:10 | on NLP research on language models |
08:13 | so starting with this first question uh |
08:17 | what exactly is this uh in context |
08:20 | learning phenomenon and how might we be |
08:22 | able to describe this behavior |
08:24 | um and so to talk about this right let's |
08:26 | start by being a little bit more precise |
08:28 | about what these language models are |
08:29 | what something like uh you know ChatGPT |
08:32 | or whatever is |
08:33 | um and fundamentally what one of these |
08:34 | models is uh is just a next word |
08:37 | prediction system right like the thing |
08:39 | that sits on your phone as you're |
08:41 | writing a text message and tries to |
08:42 | generate what word you're going to say |
08:43 | next |
08:44 | um and so what that means in practice is |
08:46 | that we're modeling a distribution over |
08:48 | strings typically natural language |
08:50 | strings but now in practice kind of |
08:51 | anything that you can write in Unicode |
08:53 | um and that that distribution is |
08:55 | represented uh via just the sort of |
08:58 | normal chain rule distribution one token |
09:00 | at a time with some kind of parametric |
09:02 | model that's just trying to predict |
09:03 | every word given all of the words that |
09:05 | came before right so concretely right |
09:07 | thinking about what one of these in |
09:09 | context learning examples looks like uh |
09:12 | we're going to try to model a joint |
09:13 | distribution over for example input |
09:17 | output pairs with some separator tokens |
09:18 | and some input markers and whatever by |
09:21 | just asking what's the probability of |
09:24 | the first input and then the probability |
09:25 | of some separator symbol given that |
09:27 | first input and then the probability of |
09:29 | the first output and so on and so forth |
09:31 | all right and as you might expect uh the |
09:34 | very beginning of this generation |
09:36 | procedure the sort of first step in this |
09:37 | in this decomposition is going to be |
09:39 | super high entropy the model doesn't |
09:40 | know what it's supposed to be doing at |
09:42 | all if it has actually learned |
09:43 | the task that we're asking it to |
09:45 | perform by the end when we're asking it |
09:48 | to predict the X's it'll know the |
09:49 | distribution of x's and when we're |
09:50 | asking it to predict the Y's it'll you |
09:52 | know know sort of precisely the |
09:54 | distribution of y's given X's |
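Written out, the chain-rule factorization being described is the standard one; the notation here is mine, with w_1, ..., w_T denoting the token sequence that interleaves inputs, separators, and outputs:

```latex
p(x_1, \mathrm{sep}, y_1, x_2, \mathrm{sep}, y_2, \dots)
  \;=\; \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1}).
```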
09:56 | um so how do we actually model this |
10:00 | distribution uh in practice with you |
10:04 | know sort of in in the kinds of language |
10:06 | models that we're using today |
10:07 | um and pretty much every language model |
10:10 | that is in in widespread use that you'll |
10:13 | encounter that people are working with |
10:14 | in in the research Community right now |
10:16 | um is a neural network and it's a neural |
10:18 | network of a very specific kind called a |
10:21 | Transformer |
10:22 | um there's a lot of low-level details |
10:23 | about these Transformer models that I'm |
10:25 | not going to talk about at all right now |
10:27 | but that but sorry but there are some |
10:29 | high level details that are important |
10:32 | for uh for sort of understanding how |
10:34 | these models work and so I want to talk |
10:35 | through those things |
10:36 | um at a high level when I'm given an |
10:39 | input like this one or that contains |
10:41 | goat misspelled Arrow goat spelled |
10:43 | correctly comma snake misspelled Arrow |
10:47 | um what we're going to try to do uh I |
10:49 | guess some of the alignments are off |
10:50 | here but hopefully the correspondence is |
10:52 | right the first thing that we're going |
10:53 | to do is we're going to take each of |
10:55 | these natural language uh tokens right |
10:58 | the word goat or the word or you know |
11:01 | the the symbol Arrow or whatever |
11:03 | um and we're going to assign them what's |
11:04 | called an embedding we're gonna you know |
11:06 | map each of these words using some fixed |
11:08 | dictionary that isn't self learned to |
11:11 | some high dimensional parameter vector |
11:13 | and so if you've heard of word |
11:14 | embeddings right every you know what |
11:15 | we're giving this model is input uh is |
11:18 | an embedding of each of these words in |
11:19 | sequence |
11:21 | um and the first thing that we're going |
11:22 | to do once we've represented each of |
11:24 | these words in the input uh as some high |
11:27 | dimensional Vector |
11:29 | um is that we're going to perform what's |
11:30 | called attention so every word uh is |
11:34 | going to sort of pick up its or you know |
11:35 | at every position in this input we're |
11:37 | going to pick up the embedding that |
11:39 | we've computed so far |
11:41 | um and we're going to take |
11:42 | a linear combination of all of the word |
11:45 | embeddings that are in the sentence uh |
11:47 | up until this word that we've predicted |
11:49 | right now where the sort of Weights in |
11:51 | that linear embedding are themselves |
11:53 | some learned function of the input so |
11:54 | effectively you know when we're looking |
11:56 | at this word goat it's going to decide |
11:59 | how strongly you know and the the reason |
12:02 | this uh is called an attention mechanism |
12:04 | think of it as deciding how strongly it |
12:06 | wants to look at all of the other words |
12:08 | uh in the input up until this point uh |
12:10 | pool them according to sort of how |
12:12 | strongly it's looking at each of them |
12:13 | and then build some new representation |
12:15 | in context of this word goat that |
12:18 | incorporates all of this other |
12:19 | information from all of the other words |
12:20 | that have come before um and we can do a |
12:22 | little extra processing on top of this |
12:24 | so we don't like lose the identity of |
12:26 | the the word that we had as input but |
12:27 | you know at a high level think of this |
12:29 | the first thing that a Transformer does |
12:31 | this attention mechanism uh is just |
12:33 | taking some weighted combination of all |
12:35 | of the other representations that we've |
12:36 | built up to this point |
12:38 | um and we're going to do this in |
12:39 | parallel for every token in the input |
12:42 | right so we're going to compute a |
12:43 | representation of goat in this way and |
12:45 | go gets to look at all three tokens to |
12:47 | the left we're going to compute a |
12:48 | representation of for example this last |
12:50 | Arrow character and that last Arrow |
12:51 | character gets to look at the entire |
12:53 | sentence that we've seen so far and so |
12:55 | we're going to get as a result of this |
12:56 | entire process a sequence of |
12:59 | these attention vectors that are derived |
13:02 | from our original input |
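A minimal sketch of the causal attention step just described, assuming numpy and a single head; real models add multiple heads, output projections, residual connections, and layer normalization:

```python
import numpy as np

def causal_self_attention(E, Wq, Wk, Wv):
    """E is a (T, d) array of per-token embeddings; Wq, Wk, Wv are learned (d, d) maps.
    Each position takes a weighted combination of the representations at or before it."""
    T, d = E.shape
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(d)                        # how strongly position i attends to j
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # mask out positions to the right
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over earlier positions
    return weights @ V                                   # (T, d) pooled representations
```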
13:07 | once we've done this uh we're just going |
13:11 | to apply some other uh learned |
13:13 | non-linear transformation to each of |
13:15 | these new vectors that we've built up uh |
13:17 | this now doesn't look at any other piece |
13:19 | of the input at all it's happening again |
13:20 | sort of locally in parallel at every |
13:22 | step of the input |
13:24 | um you know and works basically the way |
13:27 | a normal feed forward neural network |
13:28 | works if you've seen those uh but we're |
13:30 | doing that now for every word |
13:31 | simultaneously and these two steps |
13:35 | together uh this attention step followed |
13:37 | by this multi-layer perceptron this sort |
13:40 | of generic non-linear transformation |
13:41 | we're going to refer to as a Transformer |
13:44 | layer and we can actually take these |
13:47 | Transformer layers and we can stack a |
13:48 | bunch of them on top of each other in |
13:50 | sequence so we can do this once and then |
13:52 | we can do it a second time where now |
13:54 | when we do that attention step we're |
13:56 | attending not to the original word |
13:57 | embeddings but to all of the your you |
14:00 | know all of the representations that we |
14:01 | built at the last step of this procedure |
14:03 | uh and you know in practice for big |
14:05 | models you're doing this tens of times |
14:07 | or hundreds of times uh to eventually |
14:10 | build up a representation of the entire |
14:12 | input sequence that you have and what |
14:14 | we're going to get here at the end uh |
14:16 | again is a sort of sequence of uh of the |
14:18 | outputs of those final uh MLP steps uh |
14:23 | one for every token in this input |
14:25 | sentence and the last thing that we're |
14:27 | going to do is we're going to take the |
14:28 | very last hidden representation the last |
14:30 | output from the last of these |
14:31 | Transformer layers and we're just going |
14:33 | to try to decode it into a distribution |
14:36 | over possible next tokens using |
14:39 | basically something that looks like a |
14:40 | log linear model right so again every |
14:42 | word in my vocabulary just like at the |
14:44 | embedding layer is associated with |
14:46 | some vector and we're just going to sort |
14:49 | of compute dot products between those |
14:50 | vectors and this last hidden |
14:51 | representation and then exponentiate and |
14:53 | normalize to get a distribution over |
14:55 | y's and so what we hope right in the |
14:58 | context of these kinds of in context |
15:00 | learning problems is that this |
15:02 | distribution that we get over the end |
15:03 | places a lot of Mass on the correct |
15:06 | spelling of snake and not a lot of Mass |
15:08 | on on all of the other words in our |
15:09 | vocabulary |
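And a sketch of the final readout step described above: dot the last hidden state against every vocabulary vector, exponentiate, and normalize (again assuming numpy; this is the "log-linear" readout the speaker mentions):

```python
import numpy as np

def next_token_distribution(h_last, output_embeddings):
    """h_last: (d,) final hidden state; output_embeddings: (V, d), one row per vocab item.
    Returns a probability distribution over the V possible next tokens."""
    logits = output_embeddings @ h_last
    logits -= logits.max()                 # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```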
15:12 | good so how uh having defined this model |
15:15 | architecture do we train them uh and you |
15:19 | know in the most basic sense the |
15:20 | training is super easy you go out and |
15:22 | you scrape as much text Data as you can |
15:24 | get your hands on uh you crawl the |
15:25 | entire internet you pay a bunch of |
15:27 | people to sit in a room and write you |
15:28 | know sort of nicely structured documents |
15:30 | for you |
15:31 | um and then you just do maximum |
15:32 | likelihood estimation with this model on |
15:34 | that gigantic uh gigantic data set |
15:37 | um a lot of the challenges here are |
15:39 | really engineering challenges uh more |
15:41 | than uh than kind of modeling challenges |
15:42 | and a lot of the reason that you know |
15:44 | sort of open AI uh is able to train |
15:46 | these models and people in Academia are |
15:48 | not uh has to do with just like the cost |
15:51 | of uh computational resources that you |
15:53 | need to really do this at a scale that |
15:55 | makes it effective |
15:57 | um one thing that I'll say that's not |
15:58 | going to be part of this talk at all is |
16:00 | that in modern most modern language |
16:03 | models especially the ones that sort of |
16:04 | look and talk like chat Bots there's |
16:06 | another step that goes on after this |
16:08 | process of sort of learning uh via |
16:11 | something that looks like reinforcement |
16:12 | learning from human feedback |
16:14 | that's not really going to be relevant |
16:16 | for what we're talking about today the |
16:17 | relationship between that and uh some of |
16:19 | the sort of capabilities that I was |
16:21 | talking about at the beginning is still |
16:22 | a little unclear |
16:24 | um |
16:25 | but is part of kind of the modern recipe |
16:28 | okay so this is the basic uh Transformer |
16:31 | architecture |
16:32 | um what happens when we actually do this |
16:34 | when we you know fit this uh maximum |
16:37 | likelihood objective uh to the entire |
16:39 | internet |
16:40 | um and what happens is you get a model |
16:41 | that is very very good at a very large |
16:44 | number of different things right so here |
16:45 | are some examples uh from a recent paper |
16:48 | showing uh good performance on the bar |
16:50 | exam although apparently these bar |
16:52 | numbers are actually really sketchy and |
16:53 | this is you know maybe not something to |
16:55 | take seriously |
16:56 | um but pretty good performance if not |
16:58 | like top human level performance across |
17:00 | a wide variety of uh things requiring |
17:03 | both a little bit of reasoning and uh |
17:06 | quite a lot of knowledge right the LSAT |
17:08 | the SAT the GRE uh both the math |
17:10 | sections and the verbal sections |
17:13 | um and you know this has been in the |
17:14 | news I assume many people have seen this |
17:16 | before |
17:17 | um and I think you know it was |
17:19 | surprising surprising is maybe not even |
17:21 | a strong enough word but uh you know to |
17:24 | many of us in the field both that uh |
17:28 | just doing this at a large enough scale |
17:30 | would get you here |
17:32 | um and that we were going to be able to |
17:33 | do this kind of as quickly as possible |
17:35 | and that the large enough scale turns |
17:37 | out to be you know only all of the text |
17:39 | that Humanity has ever written and not |
17:41 | something much more than that |
17:43 | okay good so among the things that |
17:46 | happens when you train a model in this |
17:49 | way like we were talking about before |
17:51 | um is this in context learning |
17:53 | phenomenon right so here now we can talk |
17:55 | a little bit more precisely about what |
17:56 | ICL uh is or what it would take to be a |
17:59 | good in context learner and it's |
18:01 | something like being able to take a |
18:03 | sequence of inputs that looks like this |
18:05 | um and you know have as your output |
18:08 | distribution something that places a lot |
18:10 | of probability uh on uh in this context |
18:13 | the French word for cheese and you know |
18:15 | maybe other sort of plausible variations |
18:17 | on it |
18:18 | um similarly if we want to do not |
18:20 | machine translation but sentiment |
18:21 | analysis we want to be able to plug in |
18:23 | some uh examples of sentences associated |
18:27 | with with sentiment uh assign them |
18:29 | natural language labels uh and and be |
18:31 | able to predict the appropriate ones I |
18:32 | guess there's some text getting cut off |
18:33 | here but uh but in the right way |
18:36 | um and finally uh one of the kind of |
18:38 | surprising things that happens is that |
18:40 | you can do this not just with |
18:43 | um you know sort of well-defined |
18:46 | problems that you expect to see |
18:47 | occurring many times in the training |
18:48 | data like the sentiment analysis problem |
18:50 | or the machine translation problem uh |
18:52 | but with weird made up problems that |
18:54 | surely nobody in the world has ever |
18:55 | asked one of these models to solve |
18:57 | before uh so you know this thing on the |
18:59 | slide this particular combination of |
19:00 | symbols and meanings |
19:02 | um is something that Ekin made up the |
19:03 | first time he gave a version of this |
19:05 | talk uh and uh models are at least the |
19:08 | you know particular language model that |
19:09 | he uh asked this question to was able to |
19:12 | produce the right output here uh the |
19:15 | first time |
19:16 | so coming back to the original question |
19:18 | at the very beginning of the talk |
19:20 | um what is it that's actually going on |
19:22 | here under the hood in virtue of which |
19:25 | these models are able to solve problems |
19:26 | like this |
19:28 | um and I think you know the initial |
19:29 | reaction from most people in the |
19:32 | language processing and machine learning |
19:34 | communities uh was that learning was |
19:36 | probably not the right way to describe |
19:38 | what was actually going on in these |
19:40 | models and maybe better to think of it |
19:42 | as something like uh identification and |
19:45 | here's a quote from Reynolds and |
19:48 | McDonald that I think captures this sort |
19:50 | of hypothesis about what's going on here |
19:52 | pretty nicely |
19:53 | um few-shot prompting is not really |
19:54 | learning of skills but just locating the |
19:56 | skills that the model already has this |
19:59 | is most obvious for translation where |
20:00 | you can't learn a new language from five |
20:02 | examples and you know I think the |
20:05 | translation case is a clear example of |
20:07 | uh of where this is the right way of |
20:09 | thinking about these models in fact you |
20:10 | can't learn French from uh |
20:13 | five examples like skipping ahead here |
20:17 | you know certainly if there's an English |
20:19 | word that you've never seen before and |
20:20 | you can produce the right French output |
20:21 | some of the knowledge about the |
20:23 | relationship between English and French |
20:24 | does not reside in the sort of in |
20:26 | context training set here but has to |
20:28 | come from the model's background |
20:29 | knowledge |
20:30 | but when we think about these more kind |
20:34 | of algebraic puzzle solving problems um |
20:37 | it's a little less clear that this is |
20:38 | the right way of thinking about things |
20:40 | uh right I think we can be pretty |
20:42 | confident that at least then maybe not |
20:45 | now this particular |
20:47 | example didn't appear in the training |
20:48 | data for any of these models you know |
20:50 | maybe other things with a similar flavor |
20:52 | a similar grammar induction structure |
20:54 | um but uh but not exactly this one and |
20:58 | so competing with that hypothesis that |
21:01 | we had on the previous slide is another |
21:03 | hypothesis that this in-context learning |
21:06 | thing is quote unquote real learning |
21:08 | right that it's capable of sort of |
21:09 | taking a training set that fully |
21:11 | determines the behavior that you want to |
21:12 | get out and using that kind of training |
21:15 | set to index within some reasonably |
21:17 | large hypothesis class the actual |
21:19 | function that you want this model to |
21:21 | implement even if that's a function that |
21:22 | it never saw executed anywhere at |
21:25 | training time |
21:27 | and so the thing that we set out to do |
21:29 | in this project was basically to answer |
21:31 | that question to figure out |
21:33 | um whether it is possible even in |
21:35 | principle uh for something like one of |
21:38 | these Transformer models uh to make a |
21:40 | version of this hypothesis true to get |
21:42 | them to do real learning and then to |
21:43 | actually figure out what's going on in |
21:45 | Real Models Yeah question |
21:56 | um that is a great question I am |
21:59 | uh not sure we could try it |
22:02 | yeah I mean and like this is I I so I'm |
22:06 | not going to be able to like produce off |
22:07 | of my top of my head though like full |
22:08 | exploration that we did here but no I |
22:10 | mean that's a good point and you know uh |
22:12 | there are also examples of cases where |
22:14 | providing this kind of extraneous |
22:15 | information actually confuses the model |
22:17 | and |
22:18 | um uh you know I'm sure it's also not |
22:20 | the case that uh you can get 100% |
22:22 | accuracy on things in this General class |
22:28 | um |
22:29 | other questions before we go on |
22:32 | Okay cool so yeah the the first question |
22:34 | that we're gonna ask here then is |
22:37 | whether it's possible for these |
22:38 | Transformer type models uh to do real |
22:41 | learning whatever that means uh even in |
22:43 | principle so to sort of take this |
22:45 | hypothesis that we had before |
22:47 | um and figure out whether there's some |
22:49 | class of uh functions and some class of |
22:51 | models for which this is true |
22:54 | um and we're going to do this by looking |
22:55 | actually at those uh linear models that |
22:59 | we looked at at the very beginning of |
23:01 | the talk uh just because this is a case |
23:03 | where we know exactly what the kind of |
23:05 | space of algorithmic solutions looks |
23:06 | like it's very easy to generate data and |
23:09 | it's a simple enough problem that you |
23:10 | don't have to go out and use something |
23:12 | trained on the entire internet you can |
23:13 | actually do all of the learning yourself |
23:16 | um one other sort of important caveat to |
23:18 | make here uh is that you know in some |
23:21 | sense this question of uh can a |
23:24 | Transformer Implement function X where |
23:26 | even function X is you know training |
23:27 | some little machine learning model on |
23:28 | the inside |
23:30 | um was already answered decades ago we |
23:32 | know that neural networks are general |
23:34 | purpose function approximators when made |
23:37 | sufficiently large and trained on enough |
23:38 | data so really the question here is |
23:41 | whether uh you can do but but the |
23:43 | constructions that uh you use to do that |
23:46 | kind of universal approximation are kind |
23:48 | of ridiculous they blow up uh |
23:50 | exponentially in the the size of the |
23:51 | input and the complexity of the function |
23:52 | so really what we're trying to figure |
23:54 | out here |
23:56 | um is whether you can do this not just |
23:58 | at all but with models that are the kind |
24:00 | of size and shape of the ones that we're |
24:02 | actually using uh today and of the size |
24:05 | and shape that are uh showing all those |
24:07 | real world results that I was showing |
24:09 | you before |
24:09 | um citation is cut off here but it is to |
24:12 | paper by Garg et al. at Stanford who |
24:15 | around the same time we were doing this |
24:16 | uh asked a similar set of Behavioral |
24:19 | questions about just like what kinds of |
24:20 | functions extensionally are you able to |
24:22 | learn in this framework okay so uh the |
24:25 | way we're going to go about looking at |
24:26 | this right rather than taking our |
24:28 | translation problems from before and |
24:30 | asking models to solve translation |
24:31 | problems by via this word embedding step |
24:34 | or whatever we're just going to generate |
24:36 | some input output pairs we're going to |
24:37 | sample uh you know a random weight |
24:38 | Vector here I guess I'm showing this as |
24:40 | one dimensional but these are going to |
24:42 | be higher dimensional problems later on |
24:44 | uh and we're just going to train the |
24:45 | model right in exactly whoops uh in |
24:48 | exactly the same way and I guess for now |
24:49 | we're not even training models we're |
24:51 | just asking how we can parameterize them |
24:52 | to begin with so can we uh parameterize |
24:55 | some Transformer in such a way that you |
24:58 | know we give this zero as input it |
24:59 | predicts a zero as output |
25:01 | um one thing to note here you know we |
25:04 | can also and in all in the experiments |
25:06 | that I'm going to show later on we are |
25:07 | going to train these models both to |
25:10 | predict y's given x's and to |
25:13 | predict X's uh given the whole history |
25:16 | of interactions that have happened |
25:17 | before uh here we're not going to worry |
25:19 | about actually modeling the input |
25:21 | distribution at all we're just going to |
25:22 | worry about modeling that conditional |
25:24 | distribution |
25:25 | um and I think it's actually an open |
25:26 | question how much this matters for uh |
25:28 | for real training things |
25:30 | um good so you know |
25:32 | not to belabor this point too much uh |
25:34 | you know we've had linear regression and |
25:36 | solutions to linear regression problems |
25:37 | since the 1800s uh we know that there |
25:40 | are lots of algorithms that you might |
25:41 | pack into one of these Transformers that |
25:43 | will solve them |
25:44 | um and you know you can do this by just |
25:46 | sort of directly inverting this |
25:47 | relevant Matrix you can do this by |
25:50 | gradient descent on uh some kind of |
25:52 | least squares objective or something |
25:54 | else that looks like that and so the |
25:56 | question is whether we can take any of |
25:58 | these algorithms and pack them into one |
26:00 | of these models like this |
26:03 | um |
26:04 | and you know it turns out that you can |
26:06 | do this uh |
26:08 | the details are sort of mechanical and |
26:10 | I'm just going to give a sort of high |
26:11 | level flavor for one of these but it |
26:14 | turns out that you can show uh both that |
26:17 | if you're iteratively trying to fit these |
26:19 | linear models via SGD or via batch |
26:21 | gradient descent you can do that you can |
26:24 | do that using a relatively shallow model |
26:26 | uh and you know using a model that has |
26:29 | depth that scales proportionally to the |
26:31 | number of steps of SGD that you want to |
26:33 | do here |
26:34 | um using you know sort of a reasonable |
26:36 | number of these attention mechanisms a |
26:38 | reasonable number of these layers and |
26:40 | second you can do this not just via SGD |
26:43 | but just sort of by directly |
26:44 | constructing uh that final predictor via |
26:47 | a sequence of rank 1 updates to |
26:49 | um uh that uh (X^T X)^(-1) matrix |
26:54 | um you know again there's a lot of sort |
26:57 | of low-level mechanical stuff but just |
26:58 | to give a high level like Flavor of how |
27:01 | this works |
27:02 | um we're going to define a little |
27:03 | calculus of operations that we can |
27:05 | Implement inside a transformer for sort |
27:07 | of moving data around and applying |
27:09 | linear transformations to it and then |
27:11 | once you have these things you can just |
27:12 | chain them together uh and Implement uh |
27:15 | really any algorithm uh that you choose |
27:17 | that bottoms out in these operations |
27:19 | right uh so what are the things that we |
27:21 | need to do for example paradigmatically |
27:24 | to implement uh gradient descent or even |
27:26 | just the first step of gradient descent |
27:28 | on this objective |
27:29 | um well one thing that we're going to |
27:30 | need to be able to do if uh most of the |
27:33 | processing is going to get done by these |
27:35 | multi-layer perceptron units these feed |
27:37 | forward layers that we were talking |
27:38 | about before is just to consolidate our |
27:41 | data right so we need to get our x's and |
27:43 | our y's into the same kind of part of |
27:45 | the representation space so that the |
27:46 | model can work on them uh we're going to |
27:48 | call this a move operation and it turns |
27:50 | out that you can implement this move |
27:51 | operation uh pretty simply using uh that |
27:54 | attention mechanism that we saw before |
27:56 | they can just sort of pick up a piece of |
27:57 | the Hidden State uh from one input and |
28:00 | move it to the next time step so the |
28:01 | first thing that we're going to do for |
28:03 | SGD here for example is just accumulate |
28:05 | the X's uh into our y's |
28:07 | another thing that you need to do here |
28:09 | is take affine Transformations right |
28:13 | for example on this first step we have |
28:15 | some initial guess at the weight Vector |
28:16 | we need to dot that by our initial |
28:18 | feature Vector to figure out what our |
28:19 | error is and so we need some way of |
28:21 | doing affine Transformations uh and you |
28:24 | know affine Transformations are in some |
28:25 | sense like the only thing that these |
28:27 | multi-layer perceptron uh units can do |
28:29 | and so this is very very easy to |
28:31 | implement using that and MLP layer that |
28:34 | we were showing before |
28:35 | um and then the last thing and the thing |
28:36 | that turns out to be the sort of |
28:38 | fussiest piece of this |
28:39 | um is Computing dot products between |
28:42 | pieces of these feature vectors or |
28:44 | between feature vectors across time you |
28:46 | need this for example to scale your |
28:49 | input by your prediction error when |
28:50 | you're doing an SGD step |
28:52 | um and it turns out that you can also do |
28:54 | this uh by exploiting a bunch of very |
28:57 | low level details in the way these MLP |
29:01 | layers are defined in practice for sort |
29:03 | of real world Transformer models and |
29:05 | this is like I I don't know very inside |
29:07 | baseball |
29:08 | um probably not worth going into too |
29:10 | much details and if they would make |
29:11 | sense you've already read this paper but |
29:12 | basically you can get uh the |
29:15 | non-linearity in that MLP to pretty well |
29:18 | approximate uh element-wise |
29:20 | multiplication of a couple of these |
29:22 | vectors they're scaling by a scalar |
29:24 | um if you get things very close to zero |
29:25 | and do things in the right way |
29:27 | um and once you do this right once |
29:29 | you've sort of defined uh Transformer |
29:31 | chunk that does this move operation a |
29:33 | Transformer chunk uh that does this |
29:35 | affine operation and Transformer chunk |
29:37 | that does this dot operation |
29:39 | um then all you need to do to show that |
29:40 | a Transformer can Implement SGD is write |
29:43 | the program in terms of these operations |
29:44 | uh that actually uh does so right so |
29:48 | here for example is the one that does |
29:49 | SGD you can write a corresponding one uh |
29:52 | for doing the sort of sequence of rank |
29:54 | one updates with the the Sherman |
29:55 | Morrison formula |
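For reference, here are the two target algorithms written out in ordinary code rather than as transformer weights: one SGD step on the squared error, and the Sherman-Morrison rank-1 update used to build the exact least-squares solution incrementally. Everything below (numpy, the learning rate, the tiny initial ridge term) is an illustrative choice, not the construction from the paper itself:

```python
import numpy as np

def sgd_step(w, x, y, lr=0.01):
    """One stochastic gradient step on 0.5 * (x.w - y)^2 for a single example."""
    err = x @ w - y                       # prediction error
    return w - lr * err * x               # scale the input by the error and update

def sherman_morrison_update(A_inv, x):
    """Update (X^T X)^(-1) after appending a new row x to X, as a rank-1 correction."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

def incremental_ols(xs, ys, eps=1e-6):
    """Build the least-squares predictor one example at a time."""
    d = xs.shape[1]
    A_inv = np.eye(d) / eps               # tiny ridge term so the matrix starts invertible
    b = np.zeros(d)
    for x, y in zip(xs, ys):
        A_inv = sherman_morrison_update(A_inv, x)
        b += y * x
    return A_inv @ b                      # approximately (X^T X)^(-1) X^T y
```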
29:57 | um and that's all you need to do to |
29:59 | generate these kind of in-principle uh |
30:01 | demonstrations that Transformers can do |
30:03 | real in context learning |
30:06 | um one other nice kind of detail here is |
30:09 | that all of those three things that I |
30:10 | was showing before you can actually |
30:11 | consolidate into a big generic kind of |
30:13 | like uh read transform and write |
30:16 | operation uh that you can you can |
30:19 | implement it in a sort of generic way |
30:21 | um and there has been some follow-up |
30:23 | work since we put this out sort of |
30:24 | developing just generalized versions of |
30:26 | that like uh generic Transformer read |
30:28 | attend write operator uh to make in some |
30:31 | sense nicer programming languages for |
30:34 | describing what goes on or what what |
30:35 | kinds of functions you can Implement |
30:37 | with these Transformers uh and coming up |
30:39 | with tweaks to this architecture |
30:42 | good and you know so uh maybe |
30:44 | unsurprisingly but it's nice to see that |
30:46 | in fact uh it is possible to |
30:48 | parameterize these models uh so that |
30:50 | they Implement real algorithms that |
30:52 | we've written down ahead of time uh and |
30:54 | thus it's possible at least in theory |
30:55 | that these Transformers that we're |
30:57 | seeing out in the real world for at |
30:58 | least some subset of these uh in context |
31:00 | learning problems uh are doing real ICL |
31:03 | um and so you know of course the natural |
31:05 | question now is what is happening in |
31:09 | practice when we actually train these |
31:11 | models on the distribution that we |
31:13 | assumed we were going to get in the |
31:14 | previous part of this talk uh do they |
31:17 | actually converge to the kinds of |
31:18 | solutions that we were expecting to find |
31:20 | here um and so we can ask this in a |
31:22 | couple of different ways uh one is you |
31:25 | know just do you get the same kinds of |
31:27 | generalizations that would be predicted |
31:28 | if you were implementing some version of |
31:31 | real learning as we've described it |
31:33 | before and second what's the |
31:35 | relationship between a model's ability |
31:37 | to discover the solution and uh lower |
31:40 | level details about the capacity of the |
31:42 | model how big it is and other |
31:44 | implementational things |
31:46 | um so we're going to try to answer both |
31:47 | of these things and we're going to do it |
31:49 | right by now just taking this basic |
31:51 | learning setup that we assumed for |
31:53 | constructing some in context Learners |
31:55 | before and actually now just training |
31:57 | models on this data right so we're going |
31:59 | to sample |
32:00 | um some weight vectors we're going to |
32:02 | sample some input vectors we're going to |
32:04 | generate some y's using the weight |
32:06 | vectors and the input vectors that we |
32:07 | sampled and in particular right we're |
32:09 | going to construct a sequence or a data |
32:12 | set of sequences where each of these |
32:14 | sequences uh was generated by a single |
32:17 | one of these weight vectors right so uh |
32:19 | you know here I have a bunch of y's that |
32:21 | were generated from some particular W |
32:24 | sampling my x's from from some normal |
32:27 | distribution that's going to stay the |
32:28 | same |
32:29 | um and I'm going to present the model |
32:31 | with the sequence and ask it to make a |
32:33 | prediction at every step of the sequence |
32:35 | and then when it gets to the end of the |
32:37 | sequence I'm going to sample a new |
32:38 | weight vector and I'm going to give it a |
32:39 | sequence of X Y pairs that were |
32:41 | generated by that new weight Vector |
32:44 | um and you know again we can do this |
32:46 | over and over again this data is fake |
32:48 | it's easy to generate we can generate |
32:49 | lots of it and you can train uh actually |
32:52 | smallish models as we're going to see |
32:53 | later on to do pretty well on this |
32:55 | objective |
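A sketch of that data-generation recipe; numpy, the dimension, and the sequence length are illustrative choices (the experiments described later use 8-dimensional problems):

```python
import numpy as np

def sample_icl_sequence(d=8, n_examples=40, noise_std=0.0, rng=None):
    """One training sequence: draw a fresh weight vector w, then (x, y) pairs with y = w.x."""
    rng = rng or np.random.default_rng()
    w = rng.normal(size=d)                    # a new w for every sequence
    X = rng.normal(size=(n_examples, d))      # x's from a fixed normal distribution
    y = X @ w + noise_std * rng.normal(size=n_examples)
    return X, y, w

# The transformer is trained to predict each y_t from the interleaved prefix
# x_1, y_1, ..., x_{t-1}, y_{t-1}, x_t, with a new w drawn for every sequence.
```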
32:57 | so what happens when you do this |
33:01 | um and so the first thing that we're |
33:02 | going to look at here is just the |
33:04 | quality of the predictions that the |
33:06 | model is making |
33:07 | um so notice here that when we're |
33:08 | actually |
33:10 | um oh the axis labels got chopped off |
33:12 | but uh uh we'll we'll explain these in a |
33:14 | minute |
33:15 | um when we are |
33:17 | uh |
33:20 | uh sampling from the model right at no |
33:22 | point are we asking it to explicitly |
33:24 | exhibit uh an inferred parameter Vector |
33:26 | we're just asking it to predict uh you |
33:28 | know our y's from our X's |
33:30 | um and for all of the plots that I'm |
33:32 | going to show now we're going to look at |
33:33 | eight dimensional problems so |
33:34 | unsurprisingly right with fewer than |
33:36 | eight examples in the input the model |
33:38 | can't possibly know exactly what uh |
33:40 | noiseless linear function is trying to |
33:41 | learn and so you get some high error |
33:43 | rate that declines uh gradually over |
33:45 | time |
33:46 | um so this is just the you know sort of |
33:49 | uh squared error for the predictions |
33:52 | made by the model all of the other lines |
33:54 | that I'm going to show right now are not |
33:56 | agreement with ground truth but |
33:57 | agreement with other uh prediction |
34:00 | algorithms that we can fit to these |
34:01 | data sets that we're |
34:04 | handing to our Transformer model to do |
34:06 | ICL right uh so we might oh yeah |
34:08 | question |
34:13 | sorry |
34:16 | yeah |
34:18 | you just run just one epoch yeah so it |
34:20 | just sees you know X Y X Y X Y in |
34:22 | sequence without repetition yeah |
34:26 | other questions about the setup |
34:28 | okay |
34:29 | um good so what we're going to ask now |
34:32 | is you know not just what's the fit of |
34:34 | this model to the sort of ground truth |
34:36 | labeling function but to other learning |
34:38 | algorithms that we might train on the |
34:40 | same data uh we might hypothesize and in |
34:43 | fact people in the NLP Community have |
34:44 | hypothesized that the right way to think |
34:46 | about ICL is that it's doing something |
34:48 | like nearest neighbors so we can fit |
34:49 | some nearest neighbors models and those |
34:51 | don't actually seem to do a very good |
34:53 | job at all of describing uh the kinds of |
34:56 | predictions that this Transformer makes |
34:58 | you can do SGD and you know various |
35:02 | versions of of SGD and gradient descent |
35:05 | where you look at all the examples in a |
35:06 | batch or you look at them one at a time |
35:08 | um all of these things early on seem to |
35:11 | agree relatively well with the model's |
35:12 | predictions uh and you know sort of |
35:15 | diverge around that critical example and |
35:18 | then start to fit again better later on |
35:21 | um you can ask about you know more sort |
35:23 | of Old School uh I guess not old school |
35:25 | but but standard regression algorithms |
35:27 | here we're looking at uh regularized |
35:31 | you know Ridge regression here uh this |
35:33 | is a noiseless problem so you shouldn't |
35:35 | need this in principle nevertheless this |
35:37 | gives us a really pretty good fit to our |
35:38 | data but maybe unsurprisingly right if |
35:41 | you just do ordinary least squares uh taking |
35:43 | the min-norm solution in this uh less |
35:45 | than eight regime uh this is almost a |
35:47 | perfect fit to the predictions that |
35:49 | these models are making so they really |
35:50 | do seem to be uh behaving uh now without |
35:54 | saying about the sort of anything about |
35:55 | the computations that support them uh or |
35:58 | support that but behaving like |
36:00 | um uh like these OLS type models um and |
36:03 | this you know maybe should not surprise |
36:04 | us right if this is really our data |
36:06 | generation process we know that the Min |
36:09 | Bayes risk uh predictions are going to |
36:11 | come from a weight Vector that looks |
36:12 | like this and so if we could really |
36:14 | drive this |
36:16 | loss function all the way down to zero |
36:18 | in some sense what we would have to do |
36:20 | would be to make predictions that look |
36:22 | like this in this noiseless regime |
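The reference predictors being compared against can all be written in a few lines. A sketch, with illustrative hyperparameters, of the kind of baselines fit to the same in-context examples:

```python
import numpy as np

def ols_min_norm(X, y):
    """Ordinary least squares; lstsq returns the minimum-norm solution when
    there are fewer examples than dimensions (the under-determined regime)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def batch_gd(X, y, steps=1, lr=0.01):
    """A few steps of batch gradient descent on the squared error, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

# Agreement with the transformer is measured by comparing x_new @ w_hat from each
# baseline with the transformer's own prediction at the same point in the prompt.
```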
36:25 | um and you know we can sort of stress |
36:27 | test this a little bit by changing the |
36:28 | data generation process such that the |
36:30 | minrisk predictor looks a little bit |
36:32 | different uh we can make it for example |
36:34 | we can add some noise to our y's uh so |
36:36 | that now the right thing is to behave uh |
36:39 | as though you're regularized a little |
36:40 | bit right and we know in particular for |
36:42 | uh particular uh particular distribution |
36:46 | of weight vectors uh particular |
36:47 | distribution of noise being added to |
36:50 | these y's uh that there's a nice closed |
36:52 | form for uh the grand truth predictor |
36:54 | here in terms of these two things and so |
36:56 | you know you can do uh something that |
36:58 | looks like Ridge regression uh with a |
37:00 | parameter that's determined by these two |
37:02 | uh |
37:03 | noises |
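The closed form being referred to is the standard Bayesian linear-regression result: if the weight vectors are drawn from N(0, tau^2 I) and the y's carry added noise N(0, sigma^2), the minimum-Bayes-risk predictor is ridge regression with the regularizer set by those two scales (the symbols here are mine):

```latex
\hat{w}_{\mathrm{ridge}} \;=\; \Big(X^\top X + \tfrac{\sigma^2}{\tau^2} I\Big)^{-1} X^\top y
```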
37:05 | um and when you do this and you ask |
37:06 | these questions about what's the sort of |
37:08 | best fit to the model uh you again see |
37:10 | exactly the pattern that you want to see |
37:11 | here right that when there's no noise at |
37:12 | all uh the fit to OLS is better than |
37:15 | everything else as you start to add |
37:17 | noise either to the data generation |
37:19 | process or sorry either to your x's uh |
37:22 | or sorry either to your W's or to your |
37:25 | uh y's given W's |
37:27 | um you see this nice pattern where uh |
37:29 | the predictor that that best fits the |
37:31 | predictions that you actually get out of |
37:33 | this Transformer are exactly the ones |
37:34 | that you would expect if you were |
37:37 | uh you know so this is just saying that |
37:39 | this sort of finding before is robust |
37:41 | that basically these models at least at |
37:43 | the scale that we're training them do |
37:44 | actually give you exactly the predictor |
37:46 | that you want for these kinds of linear |
37:49 | regression problems and this agrees with |
37:51 | uh that uh Garg et al. paper that I was |
37:54 | talking about before that does similar |
37:55 | kinds of experiments also in the |
37:57 | presence of uh you know sort of sparse |
37:59 | X's sparse W's things like that |
38:02 | um a more interesting question that we |
38:04 | can ask here |
38:06 | um is whether this always happens or |
38:08 | whether in the sort of uh presence of |
38:13 | stronger computational constraints than |
38:15 | we have assumed up to this point uh |
38:17 | models are still able to sort of |
38:19 | perfectly fit these distributions or |
38:20 | whether they do something different |
38:22 | um and the cool thing that happens here |
38:23 | is that you actually get different fits |
38:25 | to different algorithms as a function of |
38:27 | model size so as we make you know for |
38:29 | the sort of very shallowest models if we |
38:31 | only give them one layer uh they are not |
38:34 | perfectly described but best described |
38:36 | as doing something that looks like a |
38:39 | single step of I think I've covered this |
38:41 | up on the legend but of actually batch |
38:42 | gradient descent on uh the input as you |
38:46 | make these models bigger uh they look a |
38:48 | little bit less like they're doing |
38:49 | gradient descent and a little bit more |
38:51 | like they're doing uh proper least |
38:53 | squares uh and you know as you make them |
38:55 | uh big enough like we've been doing |
38:57 | before uh you converge to that OLS |
38:59 | solution uh you can also look at this as |
39:01 | a function of uh the hidden size of the |
39:04 | model the size of these embedding |
39:05 | vectors that we're passing up and here |
39:07 | you see less clear but also a similar |
39:09 | kind of trend uh where for very very |
39:11 | small hidden sizes |
39:13 | um uh you have a slightly better fit but |
39:16 | not a great fit from these SGD type |
39:18 | predictors uh and at bigger sizes you |
39:20 | have a better fit from uh from the right |
39:22 | ones |
39:23 | um and so we can think of there being a |
39:24 | couple kind of uh phase phases or |
39:28 | regimes in uh model parameter space uh |
39:31 | or sorry in in model architecture space |
39:33 | that describe the kinds of algorithmic |
39:35 | solutions that these models find to |
39:38 | um to this regression problem |
39:40 | um and we're going to look uh later on |
39:42 | specifically at uh at this sort of SGD |
39:44 | regime uh and try to figure out better |
39:46 | what's going on there |
39:48 | the last question that we are going to |
39:53 | ask about these train models before we |
39:55 | do that |
39:56 | um is just whether we can figure out |
39:57 | anything at all about what's going on |
39:59 | under the hood uh you know so far all of |
40:02 | the characterization of real models that |
40:04 | we've done uh has been uh sort of |
40:07 | extensional right in terms of their |
40:09 | functional form and not in terms of |
40:10 | their uh internal computations |
40:13 | um but it's reasonable to ask given that |
40:15 | we have these sort of constructions |
40:16 | lying around for what intermediate |
40:17 | quantities you might need to compute in |
40:19 | order to solve these problems whether we |
40:21 | can see any evidence that trained models |
40:23 | are actually Computing the relevant |
40:25 | intermediate quantities |
40:26 | um and so to do this we're going to do |
40:27 | what's called I guess now in the the ml |
40:30 | literature A probing experiment we're |
40:31 | going to take our original trained model |
40:33 | we're going to freeze its parameters and |
40:36 | we're going to fit some teeny little |
40:37 | model uh either uh just like a single |
40:41 | linear readout or a little tiny little |
40:43 | multi-layer perceptron and we're going |
40:45 | to train this probe model to try to |
40:47 | predict what the real uh |
40:50 | optimal W hat was uh for the problem |
40:54 | that's being presented in the input from |
40:56 | the internal states of the Transformer |
40:58 | model so basically can I just look at |
41:01 | one of these hidden States and recover |
41:02 | with some predictor that's not powerful |
41:04 | enough to like solve the linear |
41:06 | regression problem on its own uh the |
41:08 | predictor that we think the model is |
41:10 | using uh to actually make predictions |
41:12 | um and so you can do this for different |
41:14 | kinds of intermediate quantities that |
41:16 | that you might want to look for here the |
41:18 | natural one is just this weight Vector |
41:20 | right we said before we're never |
41:21 | actually showing the model any weight |
41:23 | vectors we're never asking it to |
41:24 | generate a weight Vector but if you try |
41:26 | to probe for this weight Vector in its |
41:28 | intermediate States uh you find that you |
41:29 | can do that and you can actually do it |
41:31 | pretty well with linear readout right |
41:34 | near the end of the network which I |
41:36 | think is exactly what we would have |
41:37 | expected from the little construction |
41:38 | that we gave before |
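A sketch of what such a probe looks like, assuming numpy: a frozen model's hidden states are regressed onto the OLS weight vectors it is hypothesized to encode, and the probe's error is compared against the control task (details like train/test splitting are omitted here):

```python
import numpy as np

def fit_linear_probe(hidden_states, targets):
    """hidden_states: (N, d_hidden) frozen activations; targets: (N, d) the w_hat
    vectors the model is hypothesized to encode. Returns a linear readout matrix."""
    W_probe, *_ = np.linalg.lstsq(hidden_states, targets, rcond=None)
    return W_probe                                  # (d_hidden, d)

def probe_mse(hidden_states, targets, W_probe):
    """Mean squared error of the probe's reconstruction of the weight vectors."""
    return np.mean((hidden_states @ W_probe - targets) ** 2)
```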
41:40 | um as some sanity checks here right you |
41:42 | can try doing this on a Model that is |
41:45 | trained to perform some task that |
41:46 | requires you to look at all of the |
41:47 | inputs and outputs but is not linear |
41:50 | regression and for those what we're |
41:53 | calling control tasks |
41:55 | um no matter where you look in the |
41:57 | network you don't acquire the ability to |
42:00 | recover W as accurately as we could here |
42:02 | so some evidence that you know in the |
42:04 | course of making the predictions that we |
42:06 | were looking at before |
42:08 | um our weight vector actually gets |
42:10 | encoded by the models that are |
42:13 | making these predictions which is a cool |
42:14 | thing to find |
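The probing recipe itself is simple enough to sketch. In the snippet below, `hidden_states` and `w_targets` are random placeholders for quantities that would really be collected from the frozen transformer and from solving each prompt's regression problem exactly; only the probe-fitting and layer-by-layer evaluation logic is meant literally, and the ridge-regularized linear readout is one reasonable probe choice rather than necessarily the exact one used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical sizes: probing prompts, transformer layers, hidden size, regression dimension
num_prompts, num_layers, h, d = 2000, 12, 64, 8

# placeholders for the real data:
#   hidden_states[i, l] = layer-l hidden state for prompt i (e.g. at the query position)
#   w_targets[i]        = the optimal W hat for prompt i, computed directly from its x's and y's
hidden_states = rng.normal(size=(num_prompts, num_layers, h))
w_targets = rng.normal(size=(num_prompts, d))

def fit_linear_probe(H, W, lam=1e-3):
    """Ridge-regularized linear readout mapping hidden states H to probe targets W."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ W)

split = num_prompts // 2
for layer in range(num_layers):
    H_train, H_test = hidden_states[:split, layer], hidden_states[split:, layer]
    W_train, W_test = w_targets[:split], w_targets[split:]
    A = fit_linear_probe(H_train, W_train)
    resid = W_test - H_test @ A
    r2 = 1.0 - resid.var() / W_test.var()
    print(f"layer {layer:2d}: probe R^2 = {r2:.3f}")
```

Running the same readout on states from a model trained on one of the control tasks mentioned above gives the baseline that the weight-vector decoding is compared against.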
42:15 | um we can ask about other relevant |
42:16 | intermediate quantities right just like |
42:18 | the product of our X matrix and our |
42:20 | y matrix and if we do this the story is |
42:22 | a little muddier |
42:23 | um it's definitely not linearly encoded |
42:26 | maybe it's non-linearly encoded and the |
42:28 | evidence that it's non-linearly encoded |
42:30 | is the sort of gap between this control |
42:32 | task and the real task which is not as |
42:34 | big as the one we were looking at before |
42:36 | um but to the extent that there's |
42:37 | anything going on |
42:38 | um it's going on much earlier in the |
42:40 | network the model sort of computes |
42:42 | this around layer 7 or layer 8 and |
42:44 | then hangs on to it |
42:46 | um and you know again this is a little |
42:48 | bit more speculative I wouldn't take |
42:49 | this as uh like dispositive of models |
42:53 | computing this quantity but it is what |
42:55 | we would expect if it were implementing |
42:57 | one of these algorithms that we looked |
42:58 | at before because this is a sort of |
43:00 | First Step that you need to do in order |
43:01 | to compute the W later on |
43:05 | um good so to come back to all of these |
43:08 | questions uh that we were sort of asking |
43:11 | at the beginning of this talk uh right |
43:13 | what is in context learning and the sort |
43:14 | of main hypotheses that we want to |
43:16 | discriminate here are whether it's task |
43:18 | identification or quote unquote real |
43:20 | learning um and in the context of a very |
43:22 | very very simple real learning problem |
43:24 | uh we've shown that you know it's |
43:26 | possible at least in principle that |
43:27 | models are really doing it that you |
43:29 | can get smallish Transformers uh to |
43:32 | implement real learning algorithms and |
43:34 | we've presented both sort of Behavioral |
43:35 | evidence and at least preliminary |
43:38 | um uh representational evidence that |
43:41 | this is actually what's being |
43:42 | implemented by these models at least at |
43:44 | the scale that we're looking at |
43:45 | so in I guess the little bit of time |
43:47 | that I have left before I switch over to |
43:49 | questions |
43:50 | um one thing that's been really fun this |
43:51 | was obviously a question that was like |
43:52 | very much on people's mind uh I guess a |
43:57 | year ago last summer when we started |
43:58 | doing this project and there's been a |
44:00 | ton of work that's come out in this |
44:02 | space both kind of concurrently with |
44:04 | what we were doing here uh and |
44:05 | since this paper came out |
44:07 | um so you know some natural questions |
44:09 | that you might have at the end of |
44:11 | everything that I've been showing here |
44:12 | are whether these constructions we've |
44:15 | given are actually the right ones or |
44:16 | whether especially given that you know |
44:18 | we were getting pretty good results from |
44:20 | like two layer Transformers four layer |
44:22 | Transformers whether we can do these |
44:23 | things more efficiently whether we can |
44:25 | say anything uh you know sort of |
44:27 | theoretically about the conditions under |
44:28 | which we're actually going to recover |
44:30 | the real learning solution |
44:32 | um and how this relates to data |
44:35 | distributions both in these kinds of |
44:37 | synthetic models and in Real Models um |
44:39 | and what's nice is that we're starting |
44:40 | to get answers actually to all of these |
44:42 | questions so shortly after |
44:45 | um uh we did this uh there was a paper |
44:48 | actually also from another group at |
44:49 | Google that I guess had not been talking |
44:51 | to our Google collaborators uh showing a |
44:53 | very similar thing showing that you |
44:54 | could get standard Transformers uh to |
44:57 | implement in this case uh fitting these |
45:00 | linear regression problems via a |
45:03 | single step of gradient descent uh and |
45:06 | they do it in a very different way they |
45:07 | use the attention mechanism actually to |
45:09 | compute all of the dot products that you |
45:10 | need to compute and this allows them to |
45:12 | do it in a single layer with a single |
45:15 | attention head |
45:16 | um and so if you think back to those |
45:18 | kinds of experimental results we had |
45:19 | before right showing that in the single |
45:22 | layer regime we're looking more like a |
45:24 | one step of gradient descent model than |
45:26 | anything else here is now some evidence |
45:28 | or here is an example of a way in which |
45:31 | you can solve this problem by |
45:34 | parameterizing a Transformer that would |
45:36 | generate exactly that behavior for you |
45:38 | um there's one kind of important caveat |
45:39 | here which is that this is not actually |
45:41 | using the kind of real models that |
45:42 | people train in practice but something |
45:44 | with a slightly simpler and slightly |
45:46 | less expressive attention mechanism and |
45:49 | this might account for the little gap |
45:51 | that we were seeing between our SGD |
45:53 | predictors and the real model |
45:55 | behavior |
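The flavor of that construction can be checked numerically in a few lines. The sketch below uses illustrative dimensions and learning rate rather than the paper's exact parameterization: one step of gradient descent from w = 0 on the in-context squared loss predicts ŷ = η Σᵢ yᵢ (xᵢᵀ x_query), which is exactly what a single linear (softmax-free) self-attention head computes when its query is x_query, its keys are the xᵢ, and its values are η·yᵢ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 32            # input dimension, number of in-context examples
eta = 0.01              # gradient-descent step size (illustrative)

# in-context linear regression problem: y_i = w*^T x_i
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_query = rng.normal(size=d)

# one explicit step of gradient descent from w = 0 on 1/2 * sum_i (w^T x_i - y_i)^2;
# the gradient at w = 0 is -X^T y, so the updated weights are eta * X^T y
w_one_step = eta * X.T @ y
pred_gd = w_one_step @ x_query

# a single linear self-attention head: query = x_query, keys = x_i, values = eta * y_i
scores = X @ x_query                 # q^T k_i for every in-context example
pred_attn = scores @ (eta * y)       # sum_i (q^T k_i) * v_i

assert np.allclose(pred_gd, pred_attn)   # identical up to floating-point error
print(pred_gd, pred_attn)
```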
45:56 | um even cooler very recently some folks |
45:58 | at Berkeley I think Spencer is now at UC |
46:01 | Davis showed that not only can you |
46:03 | do this but you are under certain |
46:04 | conditions guaranteed to converge to |
46:08 | exactly this linear self-attention one |
46:10 | step of gradient descent solution so uh |
46:13 | you know we can say now a little bit |
46:15 | more precisely uh the conditions under |
46:17 | which in context learning really is real |
46:19 | learning and you can actually uh |
46:22 | guarantee this up front using this |
46:23 | linear self-attention model |
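As a toy empirical illustration of the kind of statement being made (not the paper's setting or its proof), one can train a single matrix-parameterized linear attention head by stochastic gradient descent on fresh in-context regression tasks and watch it drift toward a scaled identity, i.e. exactly the one-step-of-gradient-descent predictor with a learned step size. Everything below (dimensions, learning rate, number of steps) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 16
A = rng.normal(scale=0.01, size=(d, d))   # trainable key-query matrix of the linear attention head
lr = 1e-3

for step in range(20000):
    # a fresh in-context regression task every step
    w_star = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_star
    x_q = rng.normal(size=d)
    y_q = x_q @ w_star

    pred = x_q @ A @ (X.T @ y) / n                           # linear self-attention prediction
    grad = 2.0 * (pred - y_q) * np.outer(x_q, X.T @ y) / n   # gradient of the squared error w.r.t. A
    A -= lr * grad

print(np.round(A, 2))   # roughly c * identity: one step of GD with a learned step size c / n
```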
46:25 | um finally all of these interesting |
46:27 | questions about data sets |
46:30 | um so one and I think this is a very |
46:32 | recent paper but cool thing that |
46:34 | came out right |
46:35 | um you can imagine that if you only ever |
46:38 | trained your model |
46:41 | on outputs from a single weight vector |
46:44 | uh the right thing to do kind of no |
46:46 | matter what is to not do any in context |
46:47 | learning just memorize that weight |
46:49 | vector and use that weight vector to |
46:50 | predict whatever y's you're going to |
46:52 | see at test time |
46:54 | um and so there's a sort of question of |
46:55 | just how much diversity in that |
46:57 | training set you really need to see |
46:59 | before you switch over to being a |
47:03 | real learner as opposed to something |
47:04 | that's memorized a fixed family of weight |
47:06 | vectors |
47:07 | um and so now some empirical evidence |
47:08 | that you do get exactly that kind of |
47:10 | again phase transition as a function of |
47:13 | the diversity of the data set that for |
47:15 | a small number of |
47:17 | W's you memorize all the W's and for a |
47:19 | sufficiently large even finite number |
47:22 | of W's that you get to see over and over |
47:24 | again during training time eventually |
47:26 | you learn to do in context |
47:29 | learning instead I think this is |
47:31 | especially interesting because it sort |
47:32 | of goes against that |
47:34 | um you know min-Bayes-risk story that I |
47:35 | was talking about before when you're in |
47:37 | the regime of having a finite number of |
47:39 | uh W's that you've ever seen at training |
47:41 | time uh probably the right thing to do |
47:43 | is eventually you know at least sort of |
47:45 | in a Bayesian sense |
47:46 | um to just memorize that large finite |
47:48 | set of W's models don't do that and at |
47:51 | some point they learn the solution |
47:52 | that generalizes instead |
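The data-generation knob being varied in that experiment is easy to sketch. In the snippet below the sizes and the sequence encoding are made up for illustration and are not the actual setup from the paper: each training prompt is an in-context regression problem whose weight vector is either drawn from a small fixed pool, which rewards memorizing those W's, or sampled fresh every time, which forces a general-purpose in-context learner.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8, 16            # regression dimension, examples per prompt
num_tasks = 4           # size of the fixed pool of weight vectors ("task diversity")

# fixed pool of weight vectors seen over and over during training
w_pool = rng.normal(size=(num_tasks, d))

def make_prompt(diverse: bool):
    """Build one training prompt (x_1, y_1, ..., x_n, y_n) for the sequence model."""
    if diverse:
        w = rng.normal(size=d)                   # a fresh w for every prompt
    else:
        w = w_pool[rng.integers(num_tasks)]      # reuse one of the num_tasks fixed w's
    X = rng.normal(size=(n, d))
    y = X @ w
    # interleave x's and y's into one token-like sequence, padding scalar y's to dimension d
    y_tokens = np.concatenate([y[:, None], np.zeros((n, d - 1))], axis=1)
    return np.stack([X, y_tokens], axis=1).reshape(2 * n, d)

low_diversity_batch = [make_prompt(diverse=False) for _ in range(32)]
high_diversity_batch = [make_prompt(diverse=True) for _ in range(32)]
```

Sweeping `num_tasks` and retraining the sequence model on each setting is, roughly, the experiment that exhibits the memorization-to-in-context-learning transition described above.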
47:55 | um and finally we can start to ask these |
47:57 | kinds of questions about Real Models |
47:58 | thinking back to the very beginning of |
48:00 | the talk right we have this kind of back |
48:01 | and forth between is this just task |
48:03 | identification uh is this uh real |
48:06 | learning uh and now some empirical |
48:08 | evidence that in fact uh it's a |
48:10 | mixture of both and that you can by |
48:11 | changing the label distribution or uh |
48:14 | the kinds of instructions that you |
48:15 | provide up front uh actually induce |
48:18 | models to behave more like task |
48:19 | identifiers or more like uh in context |
48:22 | Learners uh again just by sort of |
48:24 | manipulating the inputs |
48:26 | um finally oh man everything moved |
48:28 | around on the slide but one thing that |
48:29 | we started to look at uh that's a sort |
48:31 | of next direction that I'm super excited |
48:32 | about |
48:33 | um is generalizing to more interesting |
48:36 | kinds of prediction problems right so we |
48:38 | can ask the model now to produce not just |
48:40 | a single categorical label for these things |
48:42 | but more structured kinds of outputs |
48:44 | that maybe start to look a little bit |
48:45 | more uh like the kind of real machine |
48:47 | translation examples and and other text |
48:49 | generation examples uh that we saw at |
48:51 | the beginning of this uh of this talk |
48:53 | um uh what we're looking at specifically |
48:55 | is in context learning of uh finite |
48:58 | automata and other sorts of formal |
48:59 | languages and here it turns out that at |
49:02 | least empirically right these models are |
49:03 | also very very good at doing this uh in |
49:05 | context you can get close to perfect |
49:07 | accuracy |
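To give a concrete picture of what an in-context formal-language problem can look like, here is a minimal sketch; the random DFA, the accept/reject framing, and the string encoding are illustrative choices rather than the actual experimental setup. The point is just that the learner never sees the automaton itself, only labeled strings, and has to label a fresh query string.

```python
import random

random.seed(0)

# a small random DFA over the alphabet {a, b}
num_states, alphabet = 4, "ab"
transition = {(s, c): random.randrange(num_states)
              for s in range(num_states) for c in alphabet}
accepting = {s for s in range(num_states) if random.random() < 0.5}

def accepts(string: str) -> bool:
    state = 0
    for c in string:
        state = transition[(state, c)]
    return state in accepting

def random_string(max_len: int = 6) -> str:
    return "".join(random.choice(alphabet) for _ in range(random.randint(1, max_len)))

# build an in-context prompt: labeled example strings, then an unlabeled query
examples = [(s, accepts(s)) for s in (random_string() for _ in range(8))]
query = random_string()
prompt = "\n".join(f"{s} -> {'accept' if label else 'reject'}" for s, label in examples)
prompt += f"\n{query} ->"
print(prompt)   # the model is asked to continue with "accept" or "reject"
```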
49:08 | um and interestingly this seems to be a |
49:10 | task that separates uh these |
49:13 | Transformers the sort of current |
49:14 | architecture from a lot of the other uh |
49:18 | work that has been done in recent years |
49:19 | uh trying to propose some new kinds |
49:23 | of models that are easier to |
49:24 | train uh have lower computational |
49:26 | budgets things like that |
49:28 | um and so |
49:30 | ICL right this in context learning |
49:33 | thing seems to be not only a sort of |
49:35 | surprising property of these |
49:37 | models but also not something that you |
49:39 | necessarily get for free at scale in any |
49:41 | sufficiently large neural sequence |
49:43 | predictor uh there is something special |
49:45 | about Transformers and so the question |
49:46 | is what is that thing uh and |
49:49 | ultimately if we can answer that |
49:50 | question uh we will hopefully know what |
49:53 | we need to do to figure out what the |
49:54 | next generation of model architectures |
49:56 | looks like |
49:57 | um with that I will wrap up uh you know |
49:59 | as always most of the credit goes to |
50:01 | Ekin and the rest equally |
50:03 | distributed among our collaborators this |
50:05 | was a super fun collaboration |
50:06 | um and happy to answer any questions |