Transcript of YouTube Video: Stanford CS25: V1 I Transformer Circuits, Induction Heads, In-Context Learning

The following is an AI-generated summary and article based on the transcript of the video "Stanford CS25: V1 I Transformer Circuits, Induction Heads, In-Context Learning". Because of the limitations of AI, please verify the correctness of the content before relying on it.

Video Transcript
00:05

thank you all for having me it's

00:06

exciting to be here uh one of my

00:08

favorite things is talking about what is

00:11

going on inside neural networks or at

00:13

least what we what we're trying to

00:14

figure out is going on inside neural

00:15

networks so it's it's always fun to chat

00:17

about that

00:19

um

00:20

oh gosh i have to figure out how to how

00:22

to do things okay can i

00:24

what i i want okay there we go now now

00:27

we are advancing slides that seems wrong

00:30

um so i think interpretability means

00:32

lots of different things to different

00:34

people um it's a very a very broad term

00:36

and and people mean all sorts of

00:38

different things by it

00:39

um and so i wanted to talk just briefly

00:41

about uh the kind of interpretability

00:43

that i i spend my time thinking about um

00:45

which is what i'd call mechanistic

00:47

interpretability so

00:49

um most of my work actually has not been

00:51

on language models or on rnn's or

00:54

transformers but um on understanding

00:56

vision convnets and and trying to

00:58

understand how do the parameters in

01:00

those models actually map to algorithms

01:04

so you can like think of the parameters

01:06

of a neural network as being like a

01:07

compiled computer program and and the

01:10

neurons are kind of like variables or

01:11

registers and somehow there's there

01:14

there are these these complex computer

01:16

programs that are are embedded in those

01:18

weights and we'd like to turn them back

01:19

in to computer programs that that humans

01:21

can understand it's a kind of kind of

01:23

reverse engineering problem

01:25

um

01:26

and so this is this is kind of a

01:28

fun example that we found where there

01:29

was a car neuron and you could actually

01:31

see that um you know that we have the

01:33

car neuron and it's constructed from

01:35

like a wheel neuron

01:37

and it looks for in the case of the

01:39

wheel neuron it's looking for for the

01:40

wheels on the bottom and those are

01:42

positive weights and it doesn't want to

01:43

see them on top so there's negative

01:45

weights there and there's also a window

01:46

neuron it's looking for the windows on

01:48

the top and and not on the bottom and so

01:50

what we're actually seeing there right

01:51

is it's an algorithm it's an algorithm

01:54

that goes and turns um you know it's

01:57

it's just it's you know saying you know

01:58

well cars is has wheels on the bottom

02:00

and windows on the top and chrome in the

02:02

middle um and that's that's actually

02:03

like just the the strongest neurons for

02:05

that and so we're actually seeing a

02:07

meaningful algorithm and that's that's

02:08

not an exception that's that's sort of

02:10

the the general story that if you're

02:12

willing to go and look at neural neural

02:14

network weights and you're willing to

02:15

invest a lot of energy and trying to

02:16

first engineer them there's there's

02:18

meaningful algorithms written in the

02:20

weights waiting for you to find them

02:22

um and there's a bunch of reasons i

02:24

think that's an interesting thing to

02:25

think about one is you know just no one

02:27

knows how to go and do the things that

02:29

neural networks can do like no one knows

02:30

how to write a computer program that can

02:32

accurately classify imagenet let alone

02:33

you know the language modeling tasks

02:35

that we're doing no one knows how to

02:36

like directly write a computer program

02:38

that can do the things that GPT-3 does

02:40

and yet somehow gradient descent is able

02:41

to go and discover a way to do this and

02:43

i want to know what's going on i want to

02:45

know you know what it has discovered

02:48

that it can do in in these systems

02:51

there's another reason why i think this

02:53

is important which is uh is safety so

02:55

you know if we if we want to go and use

02:57

these systems in in places where they

02:59

have big effect on the world and

03:01

i think a question we need to ask

03:02

ourselves is you know what what happens

03:05

when these models have have

03:07

unanticipated failure modes failure

03:08

modes we didn't know to go and test for

03:10

or to look for to check for

03:12

how can we how can we discover those

03:13

things especially if they're they're

03:14

really pathological failure modes so the

03:16

models in some sense deliberately doing

03:17

something that we don't want well the

03:20

only way that i really see that we we

03:21

can do that is if we can get to a point

03:23

where we really understand what's going

03:24

on inside these systems

03:26

um so that's another reason that i'm

03:28

interested in this

03:30

now uh actually doing interpretability on

03:32

language models and transformers it's

03:34

new to me i um before this year i spent

03:36

like eight years working on trying

03:38

reverse engineer convnets uh and

03:40

vision models um and so the ideas in

03:42

this talk um are are new things that

03:45

i've been thinking about with my

03:46

collaborators um and we're still

03:48

probably a month or two out maybe maybe

03:50

longer from publishing them um and this

03:52

is also the first public talk that i've

03:53

given on it so uh you know the things

03:55

i'm going to talk about um they made

03:57

they're i think honestly still a little

03:58

bit confused for me um and definitely

04:00

are going to be confused in my

04:01

articulation of them so if i say things

04:03

that are confusing um you know please

04:05

feel free to ask me questions there

04:06

might be some points for me to go

04:07

quickly because there's a lot of content

04:09

um but definitely at the end i will be

04:10

available for a while to chat about the

04:12

stuff um

04:14

and uh yeah also i apologize um if uh if

04:18

i'm unfamiliar with zoom and make make

04:20

mistakes um but

04:22

uh yeah so um with that said uh let's

04:25

dive in

04:26

um so i wanted to start with a mystery

04:31

um before we go and try to actually dig

04:34

into you know what's going on inside

04:35

these models um i wanted to motivate it

04:38

by a really strange piece of discover of

04:41

behavior that we discovered and and

04:42

wanted to understand

04:45

um

04:46

uh and by the way i should say all this

04:48

work is um uh you know is done with my

04:50

my colleagues at Anthropic and especially

04:52

my colleagues catherine and nelson

04:54

okay so on to the mystery

04:56

um i think probably the the most

04:58

interesting and most exciting thing

05:00

about um about transformers is their

05:03

ability to do in-context learning or

05:06

sometimes people call it meta-learning

05:08

um you know the GPT-3 paper uh goes and

05:10

and describes things as uh you know uh

05:13

language models are few shot learners

05:14

like there's lots of impressive things

05:15

about GPT-3 but they choose to focus on

05:17

that and you know now everyone's talking

05:18

about prompt engineering um and um

05:21

Andrej Karpathy was was joking about how

05:23

you know software 3.0 was designing the

05:25

prompt and so the ability of language

05:28

models of these these large transformers

05:29

to respond to their context and learn

05:32

from their context and change their

05:33

behavior and response to their context

05:35

and you know really seems like probably

05:37

the most surprising and striking and

05:38

remarkable thing about them

05:41

um

05:42

and

05:43

uh some of my my colleagues previously

05:45

published a paper that has a trick in it

05:47

that i i really love which is so we're

05:49

all used to looking at learning curves

05:51

you train your model and you you know as

05:52

your model trains the loss goes down

05:56

sometimes it's a little bit

05:57

discontinuous but it goes down

06:00

another thing that you can do is you can

06:02

go and take a fully trained model and

06:03

you can go and ask you know as we go

06:05

through the context you know as we go we

06:07

predict the first token and then the

06:08

second token and the third token we get

06:10

better at predicting each token because

06:12

we have more information to go and

06:13

predict it on so you know the first the

06:15

first token the the loss should be

06:17

the the entropy of the unigrams and then

06:19

the next token should be the entropy of

06:20

the bigrams and it falls

06:22

but it keeps falling

06:24

and it keeps getting better

06:26

and

06:26

in in some sense that's our that's the

06:28

model's ability to go and predict to go

06:31

and do in-context learning the ability

06:34

to go and predict um you know to be

06:36

better at predicting later tokens than

06:37

you are predicting early tokens that is

06:39

that is in some sense a mathematical

06:40

definition of what it means to be good

06:42

at this magical in-context learning or

06:44

meta-learning that these models can do

06:46

and so that's kind of cool because that

06:48

gives us a a way to go and look at

06:50

whether models are good at in-context

06:52

learning

06:54

yeah if i could just ask the question

06:55

like a clarification question

06:57

please when you say learning there are

06:59

no actual parameters being updated

07:02

that is the remarkable thing about

07:03

in-context learning right so yeah indeed

07:05

we traditionally think about neural

07:06

networks as learning over the course of

07:08

training by going and modifying their

07:10

parameters but somehow models appear to

07:12

also be able to learn in some sense um

07:14

if you give them a couple examples in

07:15

their context they can then go and do

07:17

that later in their context even though

07:18

no parameters changed and so it's it's

07:21

some kind of quite different different

07:22

notion of learning as you're as you're

07:24

gesturing that

07:26

uh

07:26

okay i think that's making more sense so

07:28

i mean could you also just describe in

07:31

context learning in this case as

07:32

conditioning as in like conditioning on

07:34

the first five tokens of a ten token

07:36

sentence

07:37

yeah i think the

07:39

reason that people sometimes think about

07:41

this as in context learning or meta

07:42

learning is that you can do things where

07:45

you like actually take a training set

07:46

and you embed the training set in your

07:48

context like if you just two or three

07:49

examples and then suddenly your model

07:51

can go and do do this task and so you

07:54

can do few-shot learning by embedding

07:55

things in the context yeah the formal

07:59

setup is that you're you're just

08:00

conditioning on on on this context and

08:02

it's just that somehow this this ability

08:05

like this thing like there's there's

08:06

some sense you know for a long time

08:08

people were

08:09

were

08:09

i mean i i guess really the history of

08:11

this is uh

08:13

we started to get good at neural

08:14

networks learning right um and we could

08:17

we could go and train language uh train

08:18

vision models and language models that

08:19

could do all these remarkable things but

08:20

then people started to be like well you

08:22

know these systems are they take so many

08:24

more examples than humans do to go and

08:26

learn how can we go and fix this and we

08:28

had all these ideas about meta-learning

08:29

develop where we wanted to go and and

08:32

train models

08:33

explicitly to be able to learn from a

08:34

few examples and people developed all

08:36

these complicated schemes and then the

08:37

like truly like absurd thing about about

08:39

transformer language models is without

08:41

any effort at all we get this for free

08:44

that you can go and just give them a

08:45

couple examples in their context and

08:47

they can learn in their context to go

08:48

and do new things um i think that was

08:51

like like that was in some sense the

08:52

like most striking thing about the GPT-3

08:54

paper

08:55

um

08:56

and so uh this this yeah this ability to

08:59

go and have the just conditioning on a

09:01

context go and give you you know new

09:03

abilities for free and and the ability

09:05

to generalize to new things is in some

09:07

sense the the most yeah and to me the

09:09

most striking and shocking thing about

09:11

about transformer language models

09:14

that makes sense i mean i guess

09:17

from my perspective

09:18

i'm trying to square like

09:21

the notion of learning in this case

09:23

with you know if you or i were given a

09:25

prompt of like one plus one equals two

09:27

two plus three equals five

09:29

as the sort of few shot set up and then

09:33

somebody else put you know like five

09:34

plus three equals and we had to fill it

09:36

out in that case i wouldn't say that

09:38

we've learned arithmetic because we

09:40

already sort of knew it but rather we're

09:42

just sort of conditioning on the prompt

09:44

to know what it is that we should then

09:46

generate right

09:47

uh but it seems to me like that's

09:50

yeah i think that's on the spectrum

09:52

though because you can you can also go

09:53

and give like completely nonsensical

09:55

problems where the model would never

09:57

have seen um see like mimic this

10:00

function and give a couple examples of

10:01

the function and the model's never seen

10:02

it before and i can go and do that later

10:04

in the context um and i think i think

10:06

what you did learn um in a lot of these

10:08

cases you might not have you might have

10:10

um

10:11

you might not have learned arithmetic

10:12

like you might have had some innate

10:13

faculty for arithmetic that you're using

10:15

but you might have learned oh okay right

10:16

now we're doing arithmetic problems

10:19

um

10:20

got it in the case this is i agree that

10:21

there's like an element of semantics

10:22

here um yeah you know this is helpful

10:25

though just to clarify exactly sort of

10:26

what the

10:27

yeah what you mean there

10:29

thank you of course

10:33

so something that's i think really

10:34

striking about all of this

10:36

um

10:37

is well okay so we we've talked about

10:38

how we can we can sort of look at the

10:40

learning curve and we can also look at

10:41

this in context learning curve but

10:43

really those are just two slices of a

10:45

two-dimensional uh space so like the the

10:49

in some sense the more fundamental thing

10:50

is how good are we at producing the nth

10:52

token at a given point in

10:53

training

10:54

and something that you'll notice if you

10:56

if you look at this and so when we when

10:58

we talk about the loss curve we're just

11:00

talking about if you average over this

11:01

dimension

11:02

if you if you like average like this and

11:04

and project on to the the training step

11:07

that's that's your loss curve um and um

11:10

if you the thing that we are calling the

11:11

incontext learning curve is just this

11:12

line um

11:15

uh yeah this this line uh down the end

11:18

here

11:20

um

11:21

and something that's that's kind of

11:22

striking is there's there's this

11:24

discontinuity in it um like there's this

11:26

point where where you know the model

11:29

seems to get radically better in a very

11:31

very short time step span at going and

11:33

predicting late tokens

11:35

so it's not that different in early time

11:37

steps but in late time steps suddenly

11:39

you get better

11:43

and a way that you can make this more

11:45

striking is you can you can take the

11:47

difference in in your ability to predict the

11:48

50th token and your ability to predict

11:51

the 500th token you can subtract from the

11:53

the the 500th token the 50th token loss

11:56

and what you see

11:58

um is that over the course of training

12:01

you know you're you're you're not very

12:02

good at this and you get a little bit

12:03

better and then suddenly

12:05

you have this cliff and then you never

12:07

get better than the difference between

12:09

these at least never gets better so the

12:10

model gets better at predicting things

12:12

but its ability to go and predict late

12:14

tokens over early tokens never gets

12:16

better

12:17

and so there's in the span of just a few

12:19

hundred steps in training the model has

12:21

gotten radically better at its ability

12:24

to go and and do this kind of in context

12:26

learning
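
To make the quantity concrete, here is a minimal sketch (not the speaker's code; the array shape and the toy numbers are assumptions for illustration) of the "in-context learning score" described above: the loss on the 500th token minus the loss on the 50th token, computed from per-position losses at one training checkpoint.

```python
import numpy as np

def in_context_learning_score(per_position_loss):
    """per_position_loss: array of shape (n_ctx,), the average loss (in nats)
    at each token position for one checkpoint. The score is the loss at the
    500th token minus the loss at the 50th token; more negative means the
    model is making better use of its context."""
    return float(per_position_loss[499] - per_position_loss[49])

# toy usage with made-up numbers for a 512-token context
losses = np.linspace(4.0, 3.0, 512)       # pretend later tokens are easier
print(in_context_learning_score(losses))  # roughly -0.88 for this toy input
```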

12:28

and so you might ask you know what's

12:30

going on at that point

12:31

um and this is just one model but um

12:33

well so first of all it's worth noting

12:35

this isn't a small a small change and so

12:38

um

12:39

that you can we don't think about this

12:41

very often but you know often we just

12:42

look at loss goes more like did the

12:43

model do better than another model or

12:44

worse than another model but um you can

12:46

you can think about this as in terms of

12:48

nats and that are are you know it's

12:50

just the information theoretic quantity

12:51

in that um and you can convert that into

12:54

bits and so like one one way you can

12:56

interpret this is it's it's something

12:58

roughly like you know the model 0.4 nats

13:01

is about 0.5 bits is about uh like every

13:03

other token the model gets to go and

13:04

sample twice um and pick the better one

13:07

it's actually it's even stronger than

13:08

that that's sort of an underestimate of

13:10

how big a deal going and getting better

13:12

by 0.4 nats so this is like a real

13:15

big difference in the model's ability to

13:16

go and predict late tokens
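
As a quick back-of-the-envelope check of that comparison (my arithmetic, in the units the speaker is using):

$$ 0.4\ \text{nats} \;=\; \frac{0.4}{\ln 2}\ \text{bits} \;\approx\; 0.58\ \text{bits}, $$

while getting to sample every other token twice and keep the better continuation is worth at most about one bit on half the tokens, i.e. roughly 0.5 bits per token on average, which is why the speaker calls that picture an underestimate of the improvement.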

13:20

um and we can visualize this in

13:21

different ways we can we can also go and

13:23

ask you know how much better are we

13:24

getting at going and predicting later

13:26

tokens and look at the derivative and

13:28

then we we can see very clearly that

13:29

there's there's some kind of

13:30

discontinuity in that derivative at this

13:32

point and we can take the second

13:34

derivative then and we can um with well

13:37

derivative with respect to training and

13:39

now we see that there's like there's

13:40

very very clearly this this line here so

13:43

something in just the span of a few

13:45

steps a few hundred steps is is causing

13:47

some big change and we have some kind of

13:50

phase change going on

13:51

um and this is true across model sizes

13:54

um

13:55

uh you can you can actually see it a

13:56

little bit in the loss curve and there's

13:58

this little bump here and that

13:59

corresponds to the point where you have

14:01

this you have this change we we actually

14:03

could have seen in the loss curve

14:04

earlier too it's it's this bump here

14:08

excuse me so so we have this phase

14:10

change going on and there's a i think a

14:12

really tempting theory to have which is

14:14

that somehow whatever you know there's

14:16

some this this change in the model's

14:18

output and its behaviors and it's in a

14:20

in a in in these sort of outward facing

14:22

properties corresponds presumably to

14:24

some kind of change in the algorithms

14:26

that are running inside the model so if

14:28

we observe this big phase change

14:29

especially in a very small window um in

14:32

in the model's behavior presumably

14:34

there's some change in the circuits

14:35

inside the model that is driving that

14:38

at least that's a you know a natural

14:39

hypothesis so

14:41

um if we want to ask that though we need

14:42

to go and be able to understand you know

14:44

what are the algorithms that's running

14:45

inside the model how can we turn the

14:47

parameters in the model back into those

14:48

algorithms so that's going to be our

14:49

goal

14:51

um now it's going to require

14:53

us to cover a lot of ground um in a

14:55

relatively short amount of time so i'm

14:56

going gonna go a little bit quickly

14:58

through the next section and i will

14:59

highlight sort of the the key takeaways

15:02

and then i will be very happy um to go

15:04

and uh you know explore any of this in

15:07

as much depth i'm free for another hour

15:09

after this call um and just happy to

15:11

talk in as much depth as people want

15:12

about the details of this

15:14

so

15:15

um it turns out this phase change

15:17

doesn't happen in a one layer attention

15:19

only transformer and it does happen in a

15:21

two-layer attention-only transformer so

15:23

if we could understand a one-layer

15:24

attention-only transformer and a two layer

15:26

attention-only transformer

15:28

that might give us a pretty big clue as

15:30

to what's going on

15:32

um

15:34

so we're attention only we're also going

15:35

to leave out layer norm and biases to

15:37

simplify things so you know you one way

15:39

you could describe a attention only

15:42

transformer

15:43

is we're going to embed our tokens

15:45

and then we're going to apply a bunch of

15:46

attention heads and add them into the

15:48

residual stream and then apply our

15:50

unembedding and that'll give us our

15:51

logits

15:53

and we can go and write that out as

15:54

equations if we want multiplied by an

15:55

embedding matrix

15:57

apply attention heads

15:59

and then compute the logits from the

16:00

unembedding
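
Here is a minimal numpy sketch of the model class just described (the sizes and random weights are my own toy choices, and normalization and biases are left out to match the simplified setup): embed the tokens, add every head's output into the residual stream, and unembed to get the logits.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n_vocab, d_model, d_head, n_heads, n_ctx = 1000, 64, 16, 4, 32
rng = np.random.default_rng(0)
W_E = rng.normal(size=(n_vocab, d_model)) * 0.02      # token embedding
W_U = rng.normal(size=(d_model, n_vocab)) * 0.02      # unembedding
heads = [dict(W_Q=rng.normal(size=(d_model, d_head)),
              W_K=rng.normal(size=(d_model, d_head)),
              W_V=rng.normal(size=(d_model, d_head)),
              W_O=rng.normal(size=(d_head, d_model)))
         for _ in range(n_heads)]

tokens = rng.integers(0, n_vocab, size=n_ctx)
x0 = W_E[tokens]                                      # residual stream after embedding
mask = np.tril(np.ones((n_ctx, n_ctx), dtype=bool))   # causal mask
head_sum = np.zeros_like(x0)
for h in heads:
    scores = (x0 @ h["W_Q"]) @ (x0 @ h["W_K"]).T / np.sqrt(d_head)
    A = softmax(np.where(mask, scores, -1e9))         # attention pattern, rows = destinations
    head_sum += A @ (x0 @ h["W_V"]) @ h["W_O"]        # what each head writes back
logits = (x0 + head_sum) @ W_U                        # unembed the final residual stream
```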

16:03

um

16:04

and the part here that's a little tricky

16:06

is understanding the attention heads and

16:08

this might be a somewhat conventional

16:10

way of describing attention and it

16:12

actually kind of obscures a lot of the

16:14

structure of attention heads and i think

16:15

that oftentimes it's we we make

16:18

attention heads more more complex than

16:19

they are we sort of hide the interesting

16:21

structure

16:22

so what is this saying let's say you

16:23

know for every token compute a value

16:25

back a value vector and then go and mix

16:28

the value vectors according to the

16:29

attention matrix and then project them

16:31

with the output matrix back into the

16:32

residual stream

16:34

um so there's there's another notation

16:37

which you could think of this as a as

16:39

using tensor products or using um using

16:42

uh

16:43

well i guess there's a few left and

16:45

right multiplying there's a few ways you

16:46

can interpret this but um

16:48

i'll just sort of try to explain what

16:50

this notation means um

16:53

so this means

16:54

for every you know x or our residual

16:56

stream we have a vector for every single

16:59

token

17:00

and this means go and multiply

17:02

independently the vector for each token

17:04

by wv so compute the value vector for

17:07

every token

17:09

this one on the other hand means notice

17:10

that it's now on the a is on the left

17:12

hand side it means go and go and

17:14

multiply

17:15

the

17:17

attention matrix or go and go into

17:19

linear combinations of the values value

17:20

vectors so don't don't change the value

17:22

vectors you know point wise but go and

17:24

mix them together according to the

17:25

attention pattern create a weighted sum

17:28

and then again independently for every

17:30

position go and apply the output matrix

17:33

and you can apply the distributive

17:34

property to this and it just reveals

17:36

that actually it didn't matter that you

17:37

did the attention sort of in the middle

17:39

you could have done the attention at the

17:40

beginning you could have done it at the

17:41

end um that's that's independent um and

17:44

the thing that actually matters is

17:45

there's this wvwo matrix that describes

17:48

what it's really saying is you know

17:50

wvw describes what information the

17:52

attention head reads from each position

17:54

and how it writes it to its destination

17:56

whereas a describes which tokens we read

17:59

from and write to

18:00

um and that's that's kind of getting

18:02

more the fundamental structure and

18:03

attention an attention head goes and

18:05

moves information from one position to

18:07

another and the process of of which

18:10

position gets moved from and to is

18:11

independent from what information gets

18:13

moved
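
In symbols, this is the rewrite the speaker is describing: with $A$ the attention pattern and $x$ holding one residual-stream vector per token,

$$ h(x) \;=\; \big(A \otimes W_O W_V\big)\cdot x , $$

where $A$ controls which positions information moves between, and $W_O W_V$ controls what is read from the source position and how it is written at the destination, so the two factors can be studied separately.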

18:17

and if you rewrite your transformer that

18:19

way

18:20

well first we can go and write uh the

18:22

sum of attention heads just as as in

18:25

this form

18:26

um and then we can uh go and write that

18:29

as the the entire layer by going and

18:31

adding an identity

18:34

and if we go and plug that all in to our

18:36

transformer and go and expand

18:40

um we we have to go and multiply

18:43

everything through we get this

18:45

interesting equation and so we get this

18:47

one term this corresponds to just the

18:49

path directly through the residual

18:50

stream

18:51

and it's going to want to store uh

18:53

bigram statistics it's just you know all

18:55

it gets is the previous token and tries

18:56

to predict the next token

18:58

and so it gets to try and predict uh try

19:00

to store bi-gram statistics and then for

19:02

every attention head we get this matrix

19:04

that says okay well for we have the

19:05

attention pattern so it looks that

19:07

describes which token looks at which

19:08

token and we have this matrix here which

19:10

describes how for every possible token

19:12

you could attend to

19:13

how it affects the logits and that's

19:16

just a table that you can look at it

19:17

just says you know for for this

19:18

attention head if it looks at this token

19:20

it's going to increase the probability

19:21

of these tokens in a one layer attention

19:23

only transformer that's all there is
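
Written out, the expansion being described for a one-layer attention-only transformer (in the same notation as above, ignoring positional terms) is

$$ T \;=\; \mathrm{Id}\otimes W_U W_E \;+\; \sum_{h} A^{h}\otimes \big(W_U\, W_O^{h} W_V^{h}\, W_E\big) , $$

where the first term is the direct path through the residual stream (bigram statistics), and each attention head contributes one term: its attention pattern $A^h$ (which token looks at which token) times a vocab-by-vocab table $W_U W_O^h W_V^h W_E$ (how attending to a token changes the logits).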

19:28

um yeah so this is just just the

19:30

interpretation i was describing

19:32

um

19:34

and another thing that's worth noting is

19:36

this

19:36

um according to this the attention only

19:38

transformer is linear if you fix the

19:41

attention pattern now of course it's the

19:43

attention pattern isn't fixed but

19:44

whenever you even have the opportunity

19:46

to go and make something linear linear

19:47

functions are really easy to understand

19:49

and so if you can fix a small number of

19:50

things and make something linear that's

19:51

actually

19:52

a lot of leverage

19:54

okay

19:56

um

19:57

and yeah we can talk about how the

19:59

attention pattern is computed as well

20:01

um you if you expand it out you'll get

20:03

an equation like this

20:05

and uh notice well i think i think it'll

20:08

be easier

20:09

okay

20:12

the i think the core story though to

20:14

take away from all of these is we have

20:16

these two matrices that actually look

20:17

kind of similar so

20:19

this one here

20:20

tells you if you attend to a token

20:22

how are the logits affected

20:24

and it's you can just think of it as a

20:25

giant matrix of for every possible token

20:28

input token how how are

20:30

the logits affected

20:31

by that token are they made more likely

20:33

or less likely

20:34

and we have this one which sort of says

20:36

how much does every token want to attend

20:38

to every other token

20:42

um one way that you can you can picture

20:44

this is

20:46

uh okay that's really there's really

20:48

three tokens involved when we're

20:49

thinking about an attention head we have

20:51

the

20:52

token that

20:54

we're going to move information to and

20:56

that's attending backwards

20:58

we have the source token that's going to

21:00

get attended to and we have the output

21:02

token whose logits are going to be

21:03

affected

21:04

and you can just trace through this so

21:06

you can ask what happens um how does the

21:09

the attending to this token affect the

21:11

output well first we embed the token

21:14

then we multiply by wv to get the value

21:17

vector the information gets moved by the

21:19

attention pattern

21:20

we multiply by wo to add it back into

21:22

the residual stream we get hit by the

21:24

unembedding and we affect the logits and

21:26

that's where that one matrix comes from

21:28

and we can also ask you know what

21:29

decides you know whether a token gets a

21:31

high score when we're when we're

21:33

computing the attention pattern and it

21:34

just says

21:35

you know embed embed the token

21:39

turn it into a query embed the other

21:40

token turn it into a key

21:43

and dot product to them and see you

21:44

that's where those those two matrices

21:46

come from
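
As a formula (the exact transposes depend on whether tokens are treated as rows or columns, so take this as a sketch of the structure rather than a precise convention), the attention score between a destination token and a source token is

$$ \text{score}(t_{\text{dst}}, t_{\text{src}}) \;=\; \big(W_E\, t_{\text{dst}}\big)^{\!\top} W_Q^{\top} W_K \,\big(W_E\, t_{\text{src}}\big) , $$

so the matrix governing attention is the vocab-by-vocab "QK circuit" $W_E^{\top} W_Q^{\top} W_K W_E$, while the matrix governing the effect on the logits is the "OV circuit" $W_U W_O W_V W_E$ traced through in the previous step.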

21:47

so i know that i'm going quite quickly

21:50

um

21:51

maybe i'll just briefly pause here and

21:54

if anyone wants to ask for

21:55

clarifications uh this would be a good

21:57

time and then we'll actually go and

21:59

reverse engineer and and say you know

22:01

everything that's going on in a

22:02

one-layer attention-only transformer is now in

22:03

the palm of our hands

22:05

it's a very toy model

22:07

you know no one actually uses one layer

22:09

attention only transformers but we'll

22:10

be able to understand the one layer

22:12

attention only transformer

22:16

so just to be clear so you're saying

22:17

that yes the the

22:19

query key circuit is learning the

22:21

attention weights

22:23

and like essentially it's responsive

22:24

running the sort of attention between

22:26

different uh tokens

22:27

yeah yeah so

22:29

so this this matrix when it yeah you

22:31

know all three of those parts are

22:33

learned but that's that's what expresses

22:35

whether

22:36

a attention pattern is yeah that's what

22:38

generates the attention patterns gets

22:40

run for every pair of tokens and you can

22:42

you can you can think of values in that

22:43

matrix as just being how much every

22:45

token wants to attend to every other

22:46

token if it was in the context and we're

22:49

we're ignoring positional embeddings here

22:50

so there's a little bit that we're sort

22:51

of aligning over there as well but sort

22:53

of in in a global sense how much does

22:55

every token want to attend every other

22:56

token right

22:57

and the other circuit like the output

22:59

value circuit is

23:00

using the attention that's calculated to

23:03

guess

23:05

like affect the final outputs it's sort

23:08

of saying if if the attention head

23:09

assume that the attention head attends

23:11

to some token so let's set aside the

23:12

question of how that gets computed just

23:14

assume that it hence to some token how

23:16

would it affect the outputs if it

23:17

attended to that token

23:19

and you can just you can just calculate

23:20

that um it's just a big table of values

23:23

that says you know for this token

23:24

it's going to make this token more

23:25

likely this token will make this token

23:27

less likely

23:29

right okay

23:31

and it's completely independent like

23:32

it's just two separate matrices they're

23:34

they're not you know the the formulas

23:37

that might make them seem entangled but

23:38

they're actually separate

23:40

all right so to me it seems like the

23:43

direct supervision is coming from the

23:44

output value circuit and the query key

23:46

circuit seems to be more like

23:47

unsupervised kind of thing because

23:48

there's no

23:50

i mean they're just i think in the sense

23:53

that every in in an yeah in a model like

23:56

every every neuron is in some sense you

23:57

know like

23:59

signals is is somehow downstream from

24:02

the ultimate the ultimate signal and so

24:04

you know the output value signal the

24:06

output value circuit is getting more

24:08

more direct is perhaps getting more

24:09

direct signal correct um but yeah

24:12

yes

24:16

we will be able to dig into this in lots

24:18

of detail in as much detail as you want

24:20

uh in a little bit so we can um maybe

24:23

i'll push forward and i think also

24:24

actually an example of how to use this

24:26

reverse engineer one layer model will

24:28

maybe make it a little bit more more

24:29

motivated

24:31

okay so

24:32

um just just to emphasize this there's

24:35

three different tokens that we can talk

24:37

about there's a token that gets attended

24:38

to

24:39

there's the token that does the

24:40

attention which are called the

24:42

destination and then there's the token

24:43

that gets affected that is the next

24:45

token whose probabilities are

24:46

affected

24:48

um

24:49

and so something we can do is notice

24:51

that the the only token that connects to

24:53

both of these is the token that gets

24:55

attended to

24:56

so these two are sort of they're they're

24:58

bridged

24:59

by their their interaction with the

25:01

source token so something that's kind of

25:02

natural is to

25:04

ask for a given source token you know

25:06

how does it interact with both of these

25:09

so let's let's take for instance the

25:11

token perfect

25:13

which tokens for one thing we can ask is

25:15

which tokens want to attend to perfect

25:19

well apparently the tokens that most

25:20

want to attend to perfect are are and

25:23

looks and is and provides

25:26

um so r is the most looks is the next

25:27

most and so on

25:29

and then when we attempt to perfect and

25:30

this is with one one single attention

25:32

head so you know it'd be different if we

25:33

did a different attention head

25:35

it wants to really increase the

25:36

probability of perfect and then to a

25:38

lesser extent super and absolute and

25:40

pure and we can ask you know what what

25:44

sequences of tokens are made more likely

25:47

by this

25:48

this particular

25:49

um set of you know this particular set

25:51

of things wanting to attend to each

25:53

other and becoming more likely well

25:55

things are the form

25:57

we have our tokens we attended back to

25:59

and we have some

26:00

some skip of some number of tokens they

26:02

don't have to be adjacent but then later

26:03

on we see the token r and it attends

26:05

back to perfect and increases the

26:07

probability of perfect

26:09

so you can you can think of these as

26:10

being like we're sort of creating

26:12

changing the probability of what we

26:13

might call might call skip trigrams

26:15

where we have you know we skip over a

26:17

bunch of tokens in the middle but we're

26:18

affecting the probability really of of

26:20

trigrams

26:22

so perfect ... are -> perfect and perfect ... looks ->

26:23

super

26:25

um we can look at another one so we have

26:26

the token large

26:28

um these tokens contains using specify

26:30

want to go and look back to it and it

26:32

increases probability of large and small

26:34

and the skip trigrams that are affected

26:36

are things like large ... using -> large and

26:39

large ... contains -> small

26:42

and things like this

26:44

um if we see the number two and we

26:46

increase the probability of other

26:47

numbers and we affect probable tokens or

26:51

skip trigrams like two one two

26:53

two

26:54

has three

26:56

um

26:57

now you're you're all in uh in a

27:00

technical field so you'll probably

27:01

recognize this one we have uh have

27:02

lambda and then we see backslash and

27:06

then we want to increase the probability

27:07

of lambda and sorted and lambda and

27:09

operator so it's all LaTeX

27:12

and it wants to and it's if it sees

27:14

lambda it thinks that you know maybe

27:15

next time i use a backslash i should go

27:17

and put in some latex

27:19

math symbol

27:21

um

27:22

also

27:22

same thing for html we see nbsp for

27:25

non-breaking space and then we see an

27:27

ampersand we want to go and make that

27:28

more likely

27:29

the takeaway from all this is that a one

27:31

layer attention-only transformer is totally

27:33

acting on these skip trigrams

27:36

um

27:37

every everything that it does i mean i

27:38

guess it also has this pathway by which

27:39

it affects bi-grams but mostly it's just

27:41

affecting these skip trigrams

27:43

um and there's lots of them it's just

27:44

like these giant tables of skip trigrams

27:46

that are made more or less likely
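
A small sketch of how one could read those tables off a single head (the toy vocabulary size, random weights and index choices are all my own assumptions; a real analysis would use the trained weights and the tokenizer): the QK matrix says which destination tokens want to attend to a given source token, the OV matrix says which output tokens get boosted when it is attended to, and together they give the skip trigrams "source ... destination -> output".

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, d_head = 50, 16, 4
W_E = rng.normal(size=(n_vocab, d_model))
W_U = rng.normal(size=(d_model, n_vocab))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: (destination token, source token) -> attention score
QK = (W_E @ W_Q) @ (W_E @ W_K).T          # (n_vocab, n_vocab)
# OV circuit: (source token, output token) -> change in logits when attended to
OV = (W_E @ W_V @ W_O) @ W_U              # (n_vocab, n_vocab)

src = 7                                            # pick some source token id
dst_that_attend = np.argsort(QK[:, src])[::-1][:5] # tokens that most want to look at it
out_promoted    = np.argsort(OV[src])[::-1][:5]    # tokens whose logits it most increases
print(dst_that_attend, out_promoted)               # skip trigrams: src ... dst -> out
```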

27:50

um

27:51

there's lots of other fun things that

27:52

does sometimes the tokenization will

27:53

split up a word in multiple ways so um

27:56

like we have indie

27:57

well that's that's not a good example we

27:59

have like the word pike and then we

28:01

we see the the token p and then we

28:03

predict ike

28:05

um when we predict spikes and stuff like

28:07

that um

28:08

or

28:09

these these ones are kind of fun maybe

28:10

they're actually worth talking about for

28:11

a second so we see

28:13

the token void

28:15

and then we see an l and maybe we

28:17

predict lloyd

28:18

um or r and we predict ralph

28:21

and c catherine

28:23

and but we'll see in a second then well

28:25

yeah we'll come back to that in a sec so

28:27

we increase the probability of things

28:28

like lloyd's lloyd and lloyd catherine

28:30

or pixmap

28:32

if anyone's worked with qt

28:34

um it's we see pics map and we increase

28:36

the probability of um p

28:39

xmap again but also

28:41

q

28:42

canvas um

28:45

yeah

28:47

but of course there's a problem with

28:48

this which is um

28:50

it doesn't get to pick which one of

28:51

these goes with which one

28:52

so if you want to go and make pixmap

28:55

pixmap

28:56

and pixmap q canvas more probable you

28:59

also have to go and make

29:01

pixmap p canvas

29:03

more probable

29:04

and if you want to make lloyd lloyd and

29:06

lloyd catherine

29:08

more probable you also have to make

29:09

lloyd cloyd and lloyd lathron

29:13

more probable

29:14

and so there's actually like bugs that

29:16

transformers have like weird at least

29:17

and you know and these these really tiny

29:19

one-layer attention only transformers

29:21

there there's these bugs that you know

29:22

they seem weird until you realize that

29:24

it's this giant table of skip trigrams

29:26

that's that's operating

29:28

um

29:28

and the the nature of that is that

29:30

you're going to be

29:31

um

29:33

uh yeah you it sort of forces you if you

29:35

want to go and do this to go in and also

29:37

make some weird predictions

29:40

chris

29:42

is there a reason why the source tokens

29:44

here have a space before the first

29:45

character

29:46

yes um that's just the i was giving

29:49

examples where the tokenization breaks

29:51

in a particular way and okay um because

29:54

spaces get included in the tokenization

29:57

um

29:58

when there's a space in front of

29:59

something and then there's an example

30:01

where the space isn't in front of it

30:02

they can get tokenized in different ways

30:04

got it cool thanks

30:06

great question

30:10

um

30:11

okay so some just to abstract away some

30:13

common patterns that we're seeing i

30:14

think um

30:16

one pretty common thing is what you

30:17

might describe as like d

30:19

a b so you're you go and you you see

30:22

some token and then you see another

30:23

token that might precede that token then

30:25

you're like ah probably the token that i

30:26

saw earlier is going to occur again

30:28

um or sometimes you you predict a

30:31

slightly different token so like maybe

30:33

maybe an example the first one is two

30:35

one two

30:36

but you could also do two

30:38

has three

30:39

and so three isn't the same as two but

30:41

it's kind of similar so that's that's

30:42

one thing another one is this this

30:44

example where you've a token that

30:45

something it's tokenized together one

30:46

time and that's split apart so you see

30:49

the token and then you see something

30:50

that might be the first part of the

30:51

token and then you predict the second

30:53

part

30:55

and

30:56

i think the thing that's really striking

30:58

about this is these are all in some ways

31:01

a really crude kind of in context

31:03

learning

31:04

and and

31:06

in particular these models get about 0.1

31:08

nats rather than about 0.4 nats

31:10

of in-context learning and they never go

31:12

through the phase change so they're

31:13

doing some kind of really crude in-

31:15

context learning and also they're

31:16

dedicating almost all their attention

31:18

heads to this kind of crude in-context

31:20

learning so they're not very good at it

31:21

but they're they're they're dedicating

31:23

their um their capacity to it

31:26

uh i'm noticing that it's 10:37 um

31:29

i i want to just check how long i can go

31:31

because i maybe i should like super

31:32

accelerate because this is

31:34

chris uh i think it's fine because like

31:36

students are also asking questions in

31:38

between such uh you should be good

31:40

okay so maybe my plan will be that i'll

31:42

talk until like 10:55 or 11 and then if

31:45

you want i can go and answer questions

31:48

for a while after after that

31:50

yeah it works

31:51

fantastic

31:53

so you can see this as a very crude kind

31:55

of in context learning like basically

31:56

what we're saying is it's sort of all

31:58

this flavor of okay well i saw this

31:59

token probably these other tokens the

32:01

same token or similar tokens are more

32:03

likely to go and occur later and look

32:05

this is an opportunity that sort of

32:06

looks like i can inject the token that i

32:08

saw earlier i'm going to inject it here

32:09

and say that it's more likely that's

32:10

like that's basically what it's doing

32:13

and it's dedicating almost all of its

32:14

capacity to that so you know these it's

32:16

sort of the opposite of what we thought

32:17

with rnn's in the past like used to be

32:19

that everyone was like oh you know rnn's

32:20

it's so hard to care about long distance

32:23

context you know maybe we need to go

32:25

and like use dams or something no if you

32:27

if you train a transformer it dedicates

32:29

and you give it a long a long enough

32:30

context it's dedicating almost all of

32:32

its capacity um to this type of stuff um

32:35

just kind of interesting

32:38

um there are some attention heads which are

32:40

more primarily positional um usually we

32:43

in the model that i've been training

32:44

that has two layer or it's only a one

32:46

layer model has twelve attention heads

32:48

and usually around two or three of those

32:49

will become these more positional sort

32:51

of shorter term things that do something

32:52

more like like local trigram statistics

32:55

and then everything else becomes these

32:56

skip trigrams

33:00

um yeah so uh some takeaways from this

33:03

uh yeah you can you can understand one

33:06

layer attention-only transformers in terms

33:07

of these ov and qk circuits um

33:10

transformers desperately want to do

33:12

in-context learning they desperately

33:14

desperately desperately want to go and

33:16

and look at these long distance contexts

33:18

and go and predict things there's just

33:19

so much so much entropy that they can go

33:21

and reuse out of that

33:23

the constraints of a one-layer

33:25

attention-only transformer force it to

33:26

make certain bugs but it wants to do the

33:27

right thing

33:28

um

33:29

and if you freeze the attention patterns

33:31

these models are linear

33:33

okay

33:34

um

33:35

a quick aside because so far this type

33:38

of work has required us to do a lot of

33:40

very manual inspection like we're

33:41

walking through these giant matrices but

33:43

there's a way that we can escape that we

33:44

don't have to use look at these giant

33:46

matrices if we don't want to

33:47

um we can use eigenvalues and

33:49

eigenvectors so recall that an

33:50

eigenvalue

33:52

and an eigenvector just means that if

33:54

you if you multiply that vector by the

33:56

matrix um it's equivalent to just

33:58

scaling

34:00

and

34:01

uh often

34:02

in my experience those haven't been very

34:03

useful for interpretability because

34:04

we're usually mapping between different

34:06

spaces but if you're mapping onto the

34:07

same space either values either vectors

34:09

are a beautiful way to think about this

34:11

um so we're going to draw them um

34:14

on a

34:15

a radial plot

34:17

um and we're going to have a log

34:19

uh radial scale because they're gonna

34:20

vary their magnitude's gonna vary in by

34:22

many orders of magnitude

34:25

um okay so we can just go and you know

34:27

our ov circuit maps from tokens to

34:29

tokens that's the same vector space and

34:30

the input and the output and we can ask

34:32

you know what does it mean if we see

34:34

eigenvalues of a particular kind well

34:36

positive eigenvalues and this is really

34:38

the most important part mean copying so

34:40

if you have a positive eigenvalue it

34:41

means that there's some set of of tokens

34:44

where if you if you see them you

34:46

increase their probability

34:47

and if you have a lot of positive

34:48

eigenvalues um you're doing a lot of

34:50

copying if you only have positive

34:52

eigenvalues everything you do is copying

34:54

um now imaginary eigenvalues mean that

34:56

you see a token and then you want to go

34:58

and increase the probability of

34:59

unrelated tokens and finally negative

35:01

eigenvalues are anti-copying they're

35:02

like if you see this token you make it

35:04

less probable in the future
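
Putting the same statement in symbols: the OV circuit $M = W_U W_O W_V W_E$ maps the token space to itself, so for an eigenpair

$$ M v \;=\; \lambda v , $$

$\lambda > 0$ means attending to tokens along $v$ pushes the logits of those same tokens up (copying), a large imaginary part of $\lambda$ means the probability mass is pushed toward unrelated tokens, and $\lambda < 0$ means the same tokens are pushed down (anti-copying).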

35:07

well that's really nice because now we

35:08

don't have to go and dig through these

35:09

giant matrices that are vocab size by

35:11

vocab size we can just look at the

35:13

eigenvalues

35:14

um and so these are the eigenvalues for

35:16

our one layer attention only transformer

35:18

and we can see that you know

35:21

for

35:22

many of these they're almost entirely

35:24

positive these events are are sort of

35:26

entirely positive these ones are almost

35:28

entirely positive and really these ones

35:30

are even almost entirely positive and

35:31

there's only two

35:33

that have a significant number of

35:34

imaginary and negative eigenvalues

35:37

um and so what this is telling us is

35:39

it's just in one picture we can see you

35:40

know okay they're really you know

35:44

10 out of 12 of these of these attention

35:46

heads are just doing copying they just

35:47

they just are doing this long distance

35:48

you know well i saw a token probably

35:50

it's going to occur again type stuff um

35:52

that's kind of cool we can we can

35:53

summarize it really quickly

35:56

okay

35:57

um

35:58

now the other thing that you can yeah so

36:00

this is this is for a second we're gonna

36:01

look at a two-layer model in a second

36:03

and we'll we'll see that also a lot of

36:04

its heads are doing this kind of copying

36:06

or stuff they have large positive

36:07

eigenvalues

36:10

um you can do a histogram like you know

36:12

one one thing that's cool is you can

36:13

just add up the uh the eigenvalues and

36:15

divide them by their absolute values and

36:16

you get a number between zero and one

36:18

which is like how copying-ish

36:19

the head is or between negative one

36:21

and one how copying-ish the head is you

36:22

can just do a histogram you can see oh

36:24

yeah almost all the heads are doing

36:26

doing lots of copying
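
A sketch of the summary statistic just described (a toy random matrix stands in for a real head's OV circuit; for a full-size model one would exploit the fact that the OV circuit is low rank rather than forming the whole vocab-by-vocab matrix): sum the eigenvalues and divide by the sum of their magnitudes to get a number between -1 and 1 measuring how much of what the head does is copying.

```python
import numpy as np

def copying_score(ov_matrix):
    """Sum of eigenvalues over sum of their magnitudes: close to +1 means the
    head is doing almost pure copying, close to -1 means anti-copying."""
    eig = np.linalg.eigvals(ov_matrix)
    return float(np.real(eig.sum()) / np.abs(eig).sum())

# toy stand-in for one head's OV circuit (W_U W_O W_V W_E in the real model)
rng = np.random.default_rng(0)
ov = rng.normal(size=(100, 100))
print(copying_score(ov))
```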

36:28

you know it's nice to be able to go and

36:29

summarize your model you know uh and i

36:31

think this is this is sort of like we've

36:32

gone for a very bottom-up way and we

36:35

didn't start with assumptions about what

36:36

model is doing we tried to understand

36:37

its structure and then we were able to

36:38

summarize it in useful ways and now

36:40

we're able to go and say something about

36:41

it

36:43

um now another thing you might ask is

36:45

what what do the the eigenvalues of the

36:46

qk circuit mean and in our example so

36:48

far they haven't been that they wouldn't

36:50

have been that interesting but in a

36:52

minute they will be and so i'll briefly

36:53

describe what they mean a positive

36:55

eigenvalue would mean you want to attend

36:56

to the same tokens

36:58

and an imaginary eigenvalue and this is

37:00

what you would mostly see in our models

37:01

we've seen so far means you want to go

37:03

in and attend to a unrelated or

37:05

different token

37:06

and a negative eigenvalue would mean you

37:08

want to avoid attending to the same

37:09

token

37:11

so that will be relevant in a second

37:14

um yes so those are going to mostly be

37:16

useful to think about in in multi-layer

37:18

attention-only transformers when we kind of

37:19

change the attention heads and so we can

37:21

ask you know well i'll get to that in a

37:23

second yeah so there's a table

37:24

summarizing that um unfortunately this

37:27

this approach completely breaks down

37:28

once you have mlp layers mlp layers you

37:30

know now you have have these

37:32

non-linearities and so you don't get

37:33

this property where your model is mostly

37:34

linear and you can you can just look at

37:36

a matrix but if you're working with only

37:37

attention only transformers this is a

37:38

very nice way to think about effects

37:40

okay so recall that one-layer

37:42

attention-only transformers don't undergo

37:44

this phase change that we talked about

37:45

in the beginning like right now we're on

37:46

a hunt we're trying to go and answer

37:48

this mystery of how what the hell is

37:50

going on in that phase change where

37:51

models suddenly get good at in context

37:52

learning um we want to answer that and

37:55

one layer attention only transformers

37:56

don't undergo that phase change but two

37:58

layer attention-only transformers do so

38:00

we'd like to know what's different about

38:01

two layer attention only transformers

38:06

um

38:07

okay well so in our in our previous when

38:09

we're dealing with one layer attention

38:11

transformers we're able to go and

38:12

rewrite them in this

38:13

this form and it gave us a lot of

38:15

ability to go and understand the model

38:17

because we could go and say well you

38:18

know this is bi-grams and then each one

38:20

of these is looking somewhere and we

38:22

have this matrix that describes how it

38:23

affects things and

38:26

and yeah so that gave us a lot of a lot

38:28

of ability to think about this thing

38:30

these things and we we can also just

38:32

write in this factored form where we

38:33

have the embedding and then we have the

38:34

attention heads and then we have the

38:35

unembedding

38:37

okay well

38:39

um

38:40

oh and for simplicity we often go and

38:42

write wov for wo wv because they always

38:45

come together it's always the case like

38:47

it's it's in some sense an illusion that

38:48

w o and wv are different matrices

38:50

they're just one low rank matrix they're

38:51

never they're they're always used

38:53

together and similarly w q and w k it's

38:55

sort of an illusion that they're they're

38:56

different matrices um they're they're

38:58

always just used together and and keys

39:00

and queries are just sort of they're

39:01

just an artifact of this of these low

39:03

rank matrices

39:05

so in any case it's useful to go and

39:06

write those together
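
In symbols, the combined matrices being introduced here are

$$ W_{OV}^{h} \;=\; W_O^{h} W_V^{h}, \qquad W_{QK}^{h} \;=\; \big(W_Q^{h}\big)^{\!\top} W_K^{h}, $$

both of rank at most the head dimension; keys, queries and values never appear on their own in the function the model computes, only through these two products.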

39:07

um okay great so um a two-layer

39:09

attention only transformer what we do is

39:11

we we go through the embedding matrix

39:14

then we go through the layer one

39:16

attention heads then we go through the

39:17

layer two attention heads

39:19

and then we go through the unembedding

39:20

and for the the attention is we always

39:22

have this identity as well which

39:24

corresponds just going down the residual

39:25

stream so we can

39:27

uh go down the residual stream or we can

39:28

go through an attention head

39:31

next step we can also go down the

39:32

residual stream or we can go through an

39:33

attention head

39:37

um and there's this useful identity uh

39:40

the mixed product identity that um any

39:42

tensor product or or other ways of

39:44

interpreting this um obey which is that

39:47

if you have an attention head

39:49

um and we have say you know we have the

39:50

weights and the attention pattern and

39:52

the wov matrix and the attention pattern

39:54

the attention patterns multiply together

39:56

and the ov circuits multiply together

39:58

um and they behave nicely okay great so
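
The identity being used, written out for two heads composed across layers (same notation as before):

$$ \big(A^{h_2}\otimes W_{OV}^{h_2}\big)\cdot\big(A^{h_1}\otimes W_{OV}^{h_1}\big) \;=\; \big(A^{h_2}A^{h_1}\big)\otimes\big(W_{OV}^{h_2}\, W_{OV}^{h_1}\big) , $$

so the attention patterns multiply together on one side and the OV matrices multiply together on the other.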

40:02

and we can just expand out that equation

40:04

we can just take that big product we had

40:05

at the beginning we just expanded out

40:07

and we get three different kinds of

40:08

terms so one thing we do is we get this

40:10

this path that just goes directly

40:12

through the residual stream where we

40:13

embed and unembed and that's going to

40:14

want to represent some bigram statistics

40:18

um then we get things that look like

40:21

the attention head terms that we had

40:23

previously

40:26

and finally

40:28

we get these terms that correspond to

40:31

going through two attention heads

40:36

and

40:37

now it's worth noting that these terms

40:39

are not actually the same as they're

40:41

because the attention head uh the

40:42

attention patterns in the next in the

40:43

second layer can be computed from the

40:45

outputs of the first layer there those

40:47

are also going to be more expressive but

40:48

at a high level you can think of there

40:50

as being these three different kinds of

40:51

terms and we sometimes call these terms

40:53

virtual attention heads because they they

40:55

don't exist in the sense that they aren't

40:56

sort of explicitly represented in the

40:57

model but um they in fact you know they

41:00

have an attention pattern they have an ov

41:02

circuit they're sort of in almost all

41:04

functional ways like a tiny little

41:05

attention head and there's exponentially

41:07

many of them
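
For reference, the expansion being described, with the caveat that the layer-two attention patterns are themselves computed from the layer-one outputs, so the terms are not independent:

$$ T \;=\; \mathrm{Id}\otimes W_U W_E \;+\; \sum_{h} A^{h}\otimes\big(W_U W_{OV}^{h} W_E\big) \;+\; \sum_{h_2\in L_2}\sum_{h_1\in L_1} \big(A^{h_2}A^{h_1}\big)\otimes\big(W_U W_{OV}^{h_2} W_{OV}^{h_1} W_E\big) , $$

the direct path (bigrams), the individual attention heads of both layers, and the "virtual attention heads" formed by composing a layer-one head with a layer-two head.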

41:08

um

41:09

it turns out they're not going to be

41:10

that important in this model but in

41:11

other models they can be important

41:14

um right so one one thing that i said it

41:16

allows us to think about attention in a

41:17

really principled way we don't have to

41:19

go and think about um

41:22

you know i think there's like people

41:23

people look at attention patterns all

41:25

the time and i think a concern you could

41:27

have is well you know

41:28

there's multiple attention patterns like

41:30

you know the information that's been

41:31

moved by one attention head it might have

41:32

been moved there by another attention

41:33

head and not originally there it might

41:35

still be moved somewhere else um but in

41:38

fact this gives us a way to avoid all

41:39

those concerns and just think about

41:40

things in a single principled way

41:43

um okay in any case um an important

41:45

question to ask is how important are

41:47

these different terms well we could

41:49

study all of them how important are they

41:51

um

41:52

and it turns out um you can just

41:54

there's an algorithm you can use where

41:55

you knock out attention

41:57

knock out these terms and you go and you

41:59

ask how important are they
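
The talk doesn't spell out the knock-out procedure, but the bookkeeping looks roughly like this sketch: treat the expansion as a sum of terms in logit space, zero out one term, and see how much the loss moves. Everything below is a random toy stand-in; a real analysis would compute each term from the model's weights and measure loss on real text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d_vocab = 6, 10

# Random stand-ins for a few terms of the expansion, each mapping positions
# to logits; in a real analysis these would come from the model's weights.
terms = {
    "direct_path": rng.normal(scale=0.5, size=(n_pos, d_vocab)),
    "head_1_0": rng.normal(scale=0.5, size=(n_pos, d_vocab)),
    "head_2_3": rng.normal(scale=0.5, size=(n_pos, d_vocab)),
    "virtual_2_3_then_1_0": rng.normal(scale=0.1, size=(n_pos, d_vocab)),
}
targets = rng.integers(0, d_vocab, size=n_pos)

def loss(term_values):
    logits = sum(term_values)               # the terms add up in logit space
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(n_pos), targets].mean()

base = loss(terms.values())
for name in terms:
    ablated = loss(v for k, v in terms.items() if k != name)
    print(f"knock out {name:>22}: loss {base:.3f} -> {ablated:.3f}")
```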

42:00

um and it turns out that by far the

42:03

most important thing is these individual

42:05

attention head terms in this model far

42:07

by far the most important thing the

42:08

virtual attention heads basically

42:10

don't matter that much

42:12

they only have an effect of 0.3 nats

42:14

compared to the above ones and the

42:15

bigrams are still pretty useful so if we

42:17

want to try to understand this model we

42:19

should probably go and focus our

42:20

attention on you know the virtual

42:21

attention heads are not going to be the

42:23

best place to go and focus

42:26

our attention especially since there's

42:27

there's a lot of them there's 124 of

42:29

them for 0.3 nats so it's very little that

42:31

you would gain for studying each one

42:33

of those terms

42:34

so the thing that we probably want to do

42:36

we know that these are bigram statistics

42:38

so what we really want to do is we want

42:39

to understand

42:40

the

42:41

the individual attention head terms

42:46

um this is the algorithm i'm going to

42:48

skip over it for time

42:49

we can ignore that term because it's

42:50

small

42:51

um

42:52

and it turns out also that the layer two

42:54

attention heads are doing way more than

42:56

layer one attention heads so that's not

42:58

that surprising like the layer two

43:00

attention heads are more expressive because they

43:01

can use the layer one attention heads to

43:02

construct their attention patterns

43:05

okay so uh if we could just go and

43:07

understand the layer two attention heads

43:08

we probably understand a lot of what's

43:10

going on in this model

43:12

um

43:14

and the trick is that the attention

43:15

heads are now constructed from the

43:17

previous layer rather than just from the

43:18

tokens so this is still the same but the

43:20

attention head the attention pattern is

43:22

more more complex and if you write it

43:24

out you get this complex equation that

43:26

says you know you embed the tokens then

43:28

you're going to shuffle things around

43:30

using the attention heads for the keys

43:31

then you multiply by W_QK then you

43:33

shuffle things around again for

43:34

the queries and then you go and multiply

43:36

by the embedding again because they were

43:38

embedded and then you get back to the

43:39

tokens um
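
Glossing over positional terms, scaling, and the softmax, that equation can be sketched in the earlier notation: the keys and queries are formed from the residual stream after layer one, which is itself a function of the tokens. This is a reconstruction of the idea, not the exact slide:

```latex
% Residual stream after layer 1, starting from one-hot tokens t:
x \;=\; \Big(\mathrm{Id} + \sum_{h \,\in\, \text{layer 1}} A^{h} \otimes W_{OV}^{h}\Big)\, W_E\, t
% Pre-softmax score of a layer-2 head between query position i and key position j:
s_{ij} \;\approx\; x_i^{\top}\, W_{QK}\, x_j ,
\qquad W_{QK} \;=\; W_Q^{\top} W_K
```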

43:42

uh

43:43

but let's actually look at them so

43:46

uh one thing to remember is that when

43:47

we see positive eigenvalues in the ov

43:49

circuit we're doing copying so one thing

43:51

we can say is well 7 out of 12 and in

43:53

fact the ones with the largest

43:54

eigenvalues um are doing copying so we

43:57

still have a lot of attention heads that are

43:58

doing copying
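
One way to make the "positive eigenvalues means copying" test concrete is the sketch below, with random placeholder weights standing in for a real checkpoint's W_U, W_O, W_V, W_E; the summary statistic (signed eigenvalue mass) is one plausible choice, not necessarily the exact one used here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model, d_head = 50, 16, 4

# Random placeholders; a real check would load these from the model.
W_E = rng.normal(size=(d_model, d_vocab))
W_U = rng.normal(size=(d_vocab, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

# Full OV circuit acting on tokens: vocab -> vocab.
ov_circuit = W_U @ W_O @ W_V @ W_E

eig = np.linalg.eigvals(ov_circuit)
# Signed eigenvalue mass: near +1 if a token mostly raises its own logit (copying).
copying_score = eig.real.sum() / np.abs(eig).sum()
print(f"copying score: {copying_score:+.2f}")
```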

44:02

um

44:04

and yeah the qk circuit so one one thing

44:06

you could do is you could try to

44:07

understand things in terms of this more

44:08

complex qk equation you could also just

44:10

try to understand what the attention

44:11

patterns are doing empirically so let's

44:13

look at one of these copying ones

44:15

um i've given it the first paragraph of

44:17

harry potter and we can just look at

44:19

where it attends

44:23

and something really

44:24

interesting happens so almost all the

44:25

time

44:26

we just attend back to

44:28

the first token we have this this

44:30

special token at the beginning of the

44:31

sequence

44:32

and we usually think of that as just

44:33

being um a null attention operation it's

44:35

a way for it to not do anything in fact

44:37

if you if you look the value vector is

44:38

basically zero it's just not copying any

44:40

information from that

44:42

um

44:44

but when whenever we see repeated text

44:46

something interesting happens so when we

44:47

get to mr

44:49

it tries to attend and it's a little bit

44:51

weak then we get to d

44:54

and it attends to 'urs'

44:56

that's interesting

44:57

and then we get to 'urs'

45:00

and it attends to 'ley'

45:03

um and so it's not attending to

45:06

the same token it's attending to the

45:08

same token

45:10

shifted one forward

45:12

well that's really interesting and

45:14

there's actually a lot of attention heads

45:15

that are doing this so here we have one

45:17

where now we hit the 'pot' in potters and we

45:19

attend to 'ters' maybe that's the same

45:21

attention head i don't remember when i was

45:22

constructing this example

45:24

um it turns out this is a super common

45:25

thing so you you go and you you look for

45:28

the previous example you shift one

45:29

forward and you're like okay well last

45:30

time i saw this this is what happened

45:32

probably the same thing's gonna happen

45:36

um and we can we can go and look at the

45:39

effect that the attention head has on

45:41

the logits most of the time it's not

45:42

affecting things but in these cases it's

45:44

able to go and predict when it's doing

45:46

this thing of going and looking

45:47

forward to go and predict the next token

45:51

um so we call this an induction head an

45:52

induction head looks for the previous

45:54

copy looks forward and says ah probably

45:56

the same thing that happened last time

45:57

it's gonna happen you can think of this

45:59

as being a nearest neighbors thing it's like

46:00

an in-context nearest neighbors

46:02

algorithm it's going and searching

46:04

through your context finding similar

46:05

things and then predicting that's what's

46:07

gonna happen next
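
Stripped of the linear algebra, the behavioral rule is simple enough to write down directly; this toy sketch is just "find where the current token occurred before and predict whatever followed it", which is the in-context nearest-neighbor idea, not the model's actual computation:

```python
def induction_predict(tokens):
    """Toy 'in-context nearest neighbor': find the most recent earlier
    occurrence of the current (last) token and predict the token that
    followed it; return None if the token hasn't been seen before."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards over the prefix
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# "Mr D urs ley ... Mr D" -> predicts "urs"
print(induction_predict(["Mr", "D", "urs", "ley", "was", "Mr", "D"]))
```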

46:10

um

46:11

the way that these actually work is uh i

46:14

mean there's actually two ways but in a

46:17

model that uses rotary attention or

46:18

something like this you only only have

46:19

one

46:20

and

46:22

you shift your key first you have

46:25

an earlier attention head that shifts your

46:26

key forward once you you like take the

46:28

value of the previous token and you

46:30

embed it in your present token

46:32

and then you have your query and your

46:33

key go and look

46:35

at

46:36

uh

46:37

yeah try to go and match so you look for

46:38

the same thing

46:40

um and then you go and you predict that

46:42

whatever you saw is going to be the next

46:44

token so that's the high-level algorithm
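
Plugging that previous-token head into the key side of the layer-two score sketched earlier gives, very roughly (dropping every other path through the residual stream), the matching term:

```latex
s_{ij} \;\approx\; \cdots \;+\;
  \big(W_E\, t_i\big)^{\top}\, W_{QK}^{\mathrm{ind}}\,
  \big(W_{OV}^{\mathrm{prev}}\, W_E\, t_{j-1}\big)
  \;+\; \cdots
```

That term is large when the current token t_i matches t_{j-1}, the token just before position j, and the induction head's OV circuit then copies t_j, i.e. whatever followed the earlier match.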

46:45

um sometimes you can do clever

46:48

things where actually it'll care about

46:49

multiple earlier tokens and it'll look

46:51

for like short phrases and so on so

46:52

induction heads can really vary in in

46:54

how much of the previous context

46:56

they care about or what aspects of the

46:57

previous context they care about but

46:59

this general trick of looking for the

47:00

same thing shift forward predict that is

47:03

what induction heads do

47:06

um lots of examples of this

47:08

and the cool thing is you can now

47:10

you can use the qk eigenvalues to

47:12

characterize this you can say well you

47:14

know we we're looking for the same thing

47:16

shifted by one but looking for the same

47:17

thing if you expand through the

47:18

attention heads in the right way that'll

47:19

work out

47:20

um and we're copying and so an induction

47:22

head is one which has both positive ov

47:25

eigenvalues and also positive qk

47:27

eigenvalues
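
Using the same signed-eigenvalue statistic as in the copying sketch above, flagging induction-head candidates might look like this; the circuits passed in are assumed to be the full token-space ov circuit and the qk circuit expanded through the previous-token head, and the 0.5 threshold is arbitrary:

```python
import numpy as np

def eig_positivity(circuit: np.ndarray) -> float:
    """Signed eigenvalue mass, as in the copying check above: near +1 means
    the circuit mostly maps each token back onto itself (copy / match)."""
    eig = np.linalg.eigvals(circuit)
    return float(eig.real.sum() / np.abs(eig).sum())

def looks_like_induction_head(ov_circuit: np.ndarray,
                              qk_circuit: np.ndarray,
                              threshold: float = 0.5) -> bool:
    # Positive OV eigenvalues: the head copies what it attends to.
    # Positive QK eigenvalues (through the shift): it attends to matches
    # of the current token, one position forward.
    return (eig_positivity(ov_circuit) > threshold
            and eig_positivity(qk_circuit) > threshold)
```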

47:32

um and so you can just put that on a

47:33

plot and you have your induction heads

47:36

in the corner see

47:37

here are your ov eigenvalues and your qk eigenvalues

47:40

i think actually ov is this axis qk is

47:42

this one, the axis doesn't matter um and in

47:44

the corner you have your eigen

47:46

values

47:47

or your um your induction heads

47:51

um yeah and so this seems to be uh well

47:54

okay we now have a natural hypothesis

47:55

the hypothesis is that the

47:57

phase change we're seeing the phase

47:59

change is the discovery of these

48:00

induction heads that would be um the

48:02

hypothesis uh and these are way more

48:05

effective than regular you know than

48:06

this first algorithm we had which was

48:08

just sort of blindly copy things

48:09

wherever it could be plausible now we

48:11

can go and like actually recognize

48:13

patterns and look at what happened and

48:14

predict that similar things are going to

48:15

happen again that's a way better

48:16

algorithm

48:19

um

48:21

yeah so there's other attention heads

48:22

that are doing more local things i'm

48:23

gonna go and skip over that and return

48:25

to our mystery because i am running out

48:26

of time i have five more minutes okay so

48:28

what what is going on with this in

48:30

context learning well now we have a

48:31

hypothesis let's check it um so we think

48:34

it might be induction heads

48:36

um

48:37

and there's a few reasons we believe

48:39

this so one thing is going to be that

48:40

inductive uh induction heads

48:43

well okay i'll just go over to the other

48:45

so one thing you can do is you can just

48:47

ablate the attention heads

48:49

and it turns out um you can color, here we

48:51

have attention heads colored by how much

48:53

they are an induction head
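
The talk doesn't give the metric behind that coloring; one rough, purely illustrative score is to run the model on a random token block repeated twice and measure how much a head attends from each token in the second copy to the token right after its match in the first copy:

```python
import numpy as np

def induction_score(attn_pattern: np.ndarray, seq_len: int) -> float:
    """attn_pattern: (2*seq_len, 2*seq_len) attention weights for one head on
    a random token block of length seq_len repeated twice.  For each position
    in the second copy, an induction head should put its attention on the
    token right after the matching position in the first copy."""
    hits = [attn_pattern[i, i - seq_len + 1] for i in range(seq_len, 2 * seq_len)]
    return float(np.mean(hits))
```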

48:55

and this is the start of the bump this

48:57

is the end of the bump here

48:59

and we can see, first of all, that

49:01

induction heads are forming like

49:02

previously we didn't have induction

49:04

heads here now they're just starting to

49:05

form here and then we have really

49:07

intense induction heads here and here

49:11

and the attention heads where if you

49:12

ablate them and you get

49:15

a, uh,

49:16

you get a hit — well, not to the

49:18

loss, but to this meta

49:20

learning score, or

49:22

in-context learning score, the

49:23

difference between the 500th token and

49:25

the 50th token
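
Concretely, that score could be computed from losses averaged by token position, something like the sketch below; the positions 50 and 500 follow the talk, and `per_token_loss` is a hypothetical array of dataset-averaged losses indexed by position:

```python
import numpy as np

def in_context_learning_score(per_token_loss: np.ndarray) -> float:
    """Loss at the 500th token position minus loss at the 50th, with
    per_token_loss averaged over many sequences.  More negative means the
    model gets more out of its context late in the sequence than early."""
    return float(per_token_loss[500] - per_token_loss[50])
```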

49:26

and that's all explained by induction

49:29

heads

49:30

now we actually have one induction head

49:32

that doesn't contribute to it actually

49:33

it does the opposite so that's kind of

49:34

interesting uh maybe it's doing

49:36

something shorter shorter distance um

49:38

and there's also this interesting thing

49:39

where like they all rush to be induction

49:41

heads and then they discover

49:43

only a few win out in the end so

49:44

there's some interesting dynamics going

49:46

on there but it really seems like in

49:47

these small models

49:49

all of in context learning is explained

49:51

by these induction heads

49:53

um okay

49:55

what about large models well in large

49:57

models it's going to be harder to go and

49:58

ask this but one thing you can do is you

49:59

can ask okay you know

50:01

we can look at our uh our induction or

50:04

our in-context learning score over time

50:06

we get this sharp phase change oh look

50:08

induction heads form at exactly the same

50:10

point in time

50:12

so that's only correlational evidence

50:14

but it's pretty suggestive correlational

50:16

evidence even especially given that we

50:17

have an obvious you know like the

50:19

obvious effect that induction heads

50:20

should have is this um i guess it

50:22

could be that there's other mechanisms

50:23

being discovered at the same time in

50:25

large models but it has to be in a very

50:26

small window

50:28

so

50:29

really suggests that the thing that's driving

50:31

that change in in-context learning is the induction heads

50:34

um okay so

50:35

uh

50:36

obviously induction heads can go and

50:38

copy text but a question you might ask

50:40

is you know can they can they do

50:42

translation like there's all these

50:43

amazing things that models can do that

50:45

it's not obvious you know in context

50:47

learning um or this sort of copying

50:49

mechanism could do so i just want to

50:51

very quickly

50:52

um

50:53

look at a few fun examples

50:56

so here we have

50:58

um

50:59

an attention pattern

51:01

oh i guess i need to open lexus scopes

51:08

hmm let me try doing that again

51:10

sorry i should have thought this through

51:12

a bit more before this talk

51:14

um

51:15

chris could you zoom in a little please

51:17

yeah yeah thank you

51:20

um

51:38

okay my french isn't that great

51:41

but

51:42

um my name is christopher i'm from

51:43

canada

51:45

um what we can do here is we can look at

51:47

where this attention head attends as we

51:49

go and we do this and

51:51

um it'll become especially clear on the

51:53

second sentence so here we're on the

51:55

period

51:56

and we attend to 'je'

51:58

now we're on

51:59

um and 'je' is 'i' in french okay now we're

52:02

on the 'i' and we attend to 'suis'

52:05

uh now we're on the 'am' and we attend

52:08

to 'du' which is 'from' and then on 'from' to

52:10

canada

52:12

and so we're doing a cross-lingual

52:14

induction head which we can use for

52:16

translation um and indeed if you look at

52:19

examples this is this is where it seems

52:21

to you know it seems to be a major

52:22

driving force in the model's ability to

52:24

go um and correctly do translation

52:28

another fun example is

52:30

um i think maybe maybe the most

52:33

impressive thing about in context

52:34

learning to me has been the model's

52:36

ability to go and learn arbitrary

52:37

functions like you can just show the

52:38

model a function it can start mimicking

52:40

that function well okay

52:43

yes yeah so do these induction heads

52:45

only do kind of a look ahead copy or

52:48

like can they also do some sort of like

52:50

a complex

52:52

uh structure recognition

52:54

yeah yeah so they can they can both use

52:56

a larger previous context and

52:59

they can copy more abstract things so

53:01

like the translation one is showing you

53:02

that they can copy rather than the

53:03

literal token a translated version so

53:05

it's what i call a soft induction head

53:07

um and yeah you can you can have them

53:10

copy similar words you can have them

53:11

look at longer contexts it can look for

53:13

more structural things um

53:15

the way that we usually characterize

53:16

them is is whether in in large models

53:19

just whether they empirically behave

53:20

like an induction head so the

53:22

boundary, the definition, gets a little bit

53:23

blurry when you try to encompass these

53:25

more abstract versions, it's sort of a blurry

53:26

boundary um but yeah there seem to be a

53:28

lot of attention heads that are doing

53:30

sort of more and more abstract versions

53:33

and yeah my my favorite version is this

53:35

one that i'm about to show you which is

53:37

um let's isolate a single one of

53:40

these which can do pattern recognition

53:42

so it can learn functions in the context

53:44

and learn how to do it so i've just made

53:46

up a nonsense function here um

53:49

we're going to encode one binary

53:50

variable with a choice of whether to do

53:53

a color or a month as the first word

53:56

then

53:57

we're gonna so we have green or june

53:58

here

53:59

um let's zoom in more

54:02

so we have color or month

54:05

and animal or fruit and then we have to

54:07

map it to either true or false

54:09

so that's our goal and it's going to be

54:10

an xor so we have a binary variable

54:12

represented in this way we do an xor

54:14

i'm

54:15

pretty confident this was never in the

54:17

training set because i just made it up

54:18

and it seems like a nonsense problem
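
To make the setup concrete, here is one way such prompts could be generated; the word lists, the prompt format, and the convention that "month followed by animal" maps to true are assumptions chosen to match the "april dog → true" example in the talk, not the exact demo:

```python
import random

COLORS, MONTHS = ["green", "red", "blue"], ["june", "april", "october"]
ANIMALS, FRUITS = ["dog", "cat", "horse"], ["apple", "mango", "pear"]

def make_example(rng: random.Random) -> str:
    use_month = rng.random() < 0.5          # binary variable 1: month vs color
    use_animal = rng.random() < 0.5         # binary variable 2: animal vs fruit
    first = rng.choice(MONTHS if use_month else COLORS)
    second = rng.choice(ANIMALS if use_animal else FRUITS)
    # XOR-style rule, up to relabeling: assume (month, animal) -> true.
    label = "true" if use_month == use_animal else "false"
    return f"{first} {second}: {label}"

rng = random.Random(0)
prompt = "\n".join(make_example(rng) for _ in range(8)) + "\napril dog:"
print(prompt)   # an in-context learner should continue with " true"
```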

54:20

um okay so then we can go and ask you

54:22

know can the model go and do that well

54:24

it can and it uses induction heads to do

54:26

it and what we can do is we can look at

54:28

the so we look at a colon where it's

54:30

going to go and try and predict the next

54:31

word

54:32

and for instance here

54:34

um we have april dog so it's a month and

54:37

then an animal and it should be true

54:39

and what it does is it looks for a

54:40

previous case where there was

54:42

an animal

54:43

a month and then an animal especially

54:45

one where the month was the same and

54:47

goes and looks and says that it's true

54:49

and so the model can go and learn learn

54:51

a function a completely arbitrary

54:53

function

54:54

by going and doing this kind of pattern

54:55

recognition

54:57

induction head

54:59

and this to me made it a lot more

55:00

plausible that these models actually

55:04

can do

55:06

can do in context learning like the

55:07

generality of all these amazing things

55:09

we see these large language models do um

55:11

uh can be explained by induction heads we

55:14

don't know that it could be that there's

55:15

other things going on um it's very

55:16

possible that there's lots of other

55:17

things going on um but it seems seems a

55:20

lot more plausible to me than it did

55:21

when when we started

55:24

i'm conscious that i am actually over

55:26

time i'm just going to quickly go through

55:27

these last few slides yeah so i think

55:29

thinking that this is like an in context

55:30

nearest neighbors i think is a really

55:32

useful way to think about this um other

55:34

things could absolutely be contributing

55:36

um this might explain why uh

55:38

transformers do in-context learning uh

55:41

over long contexts better than uh lstms

55:44

an lstm can't do this because it's

55:46

not linear in the amount of compute it

55:47

needs it's like quadratic or n log n if

55:49

it was really clever

55:50

um so for an lstm it's

55:52

impossible to do this transformers um do

55:54

do this and actually they diverge at the

55:56

same point but if you if you look well

55:59

i can go into this in more detail after

56:01

if you want um

56:03

there's a really nice paper by marcus

56:04

hutter that tries to predict and

56:06

explain why we observe scaling laws in

56:08

models it's worth noting that the

56:09

arguments in this paper go exactly

56:11

through to uh this example this theory

56:14

in fact they sort of work better for the

56:16

case of thinking about this in

56:18

context learning with with essentially a

56:19

nearest neighbors algorithm um than

56:21

they do in in the regular case so

56:24

um yeah uh i'm happy to answer questions

56:26

i can go into as much detail as people

56:28

want about any of this and i can also if

56:31

you send me an email send you more

56:32

information about all this um and

56:34

uh yeah and you know again this this

56:36

work is not yet published and you don't

56:38

have to keep it secret but um you know

56:40

just if you could be thoughtful about

56:41

the fact that um it's unpublished work

56:43

and probably a month or two away from

56:44

coming out um i'd be really grateful for

56:46

that uh thank you so much for your time

56:49

yeah thanks a lot chris this was a great

56:50

talk

56:53

um so i'll just open up like some

56:54

general questions and then we can do

56:56

like a round of questions from the

56:57

students

56:58

so i was very excited to know like so

57:00

what is the

57:01

like the line of work that you're

57:02

currently working on is it like

57:03

extending this uh so what do you think

57:05

is like the next things you try to do to

57:08

make it more interpretable what are the

57:09

next yeah

57:10

i mean i want to just reverse engineer

57:12

language models i want to figure out the

57:14

entirety of what's going on in these

57:15

language models

57:16

um

57:17

and uh

57:20

you know like

57:21

one thing that we totally don't

57:22

understand is

57:24

mlp layers um i mean we understand

57:27

something about them um but we we don't

57:29

really understand mlp layers very well

57:31

uh there's a lot of stuff going on in

57:32

large models that we don't understand i

57:34

want to know how models do arithmetic um

57:36

i want to know um another thing that i'm

57:38

very interested in is what's going on

57:39

when you have multiple speakers the

57:41

model can clearly represent like it has

57:42

it has like a basic theory of mind

57:44

multiple speakers in a dialogue i want to

57:45

understand what's going on with that um

57:47

but honestly there's just so much we

57:49

don't understand um it's really it's

57:51

sort of hard to answer the question

57:52

because there's just so much to to

57:54

figure out um and we have a lot of

57:56

different threads of research and doing

57:58

this but um yeah and

58:00

uh the interpretability team at anthropic is

58:02

just sort of

58:03

has a bunch of threads trying to go and

58:05

figure out what's going on inside these

58:06

models and sort of a similar flavor to

58:08

this of just trying to figure out how do

58:10

the parameters actually encode

58:11

algorithms and can we reverse engineer

58:13

those into into meaningful computer

58:16

programs that we can we can understand

58:18

well

58:19

uh another question that is like so you

58:20

were talking about like how

58:22

like the transformers are trying to do

58:23

meta-learning inherently so it's like

58:26

and you already spent a lot of time

58:27

talking about like uh the induction heads

58:29

and like that was like interesting but

58:30

like can you formalize the sort of meta

58:32

learning algorithm they might be

58:33

learning is it like possible to say like

58:35

oh maybe this is a sort of like uh like

58:37

internal algorithm that's going that's

58:38

making them like good meta-learners or

58:40

something like that

58:41

i don't know i mean i think i think so i

58:43

think that there's roughly two

58:44

algorithms one is this algorithm we saw

58:46

in the one layer model and we see it in

58:48

other models too especially early on

58:49

which is just you know try to copy you

58:51

know you saw a word probably a similar

58:53

word is going to happen uh later um look

58:55

for places that it might fit in and

58:57

increase the probability so that's

58:58

that's one thing that we see and the

59:00

other thing we see is uh induction heads

59:02

which you can just summarize as as in

59:04

context nearest neighbors basically um

59:06

and it seems you know possibly there's

59:08

other things but it seems like those two

59:10

algorithms um and you know the specific

59:12

instantiations that we are looking at uh

59:14

seem to be what's driving in context

59:16

learning that would be my present theory

59:18

yeah sounds very interesting

59:20

um yeah okay um so let's open like a

59:23

round of first two questions so yeah

59:25

feel free to go ahead

59:26

for those questions
