The following is an AI-generated summary and article based on a transcript of the video "Jacob Andreas | What Learning Algorithm is In-Context Learning?". Because of the limitations of AI transcription and summarization, please verify the accuracy of the content for yourself.
00:00 | hi everyone uh I'm Jacob I work on uh |
---|---|
00:03 | mostly natural language processing at |
00:04 | MIT |
00:05 | um today I'm going to be talking about |
00:07 | something a little different which is uh |
00:09 | the worst possible way to fit uh or to |
00:11 | estimate the parameters of a linear |
00:12 | model and what this can teach us about |
00:14 | uh sort of modern NLP systems and |
00:17 | language generation systems |
00:19 | um here we go okay so to start off right |
00:21 | everybody's favorite introductory |
00:23 | estimation problem linear regression we |
00:25 | have some x's we have some y's a new X |
00:27 | comes along and we want to be able to |
00:29 | predict uh the corresponding y from this |
00:31 | x under whatever the data data |
00:33 | generating process is |
00:35 | um right and so as every undergraduate |
00:37 | now knows uh the way to do this is you |
00:39 | take all of your x's and you put them |
00:41 | into a matrix you take all your y's you |
00:42 | put them into a matrix and then you open |
00:45 | up ChatGPT and you paste your x's |
00:46 | and your y's into ChatGPT and you ask it |
00:49 | to give you a solution to this problem |
00:51 | um I actually I wasn't going to do this |
00:53 | for real but I did this when I was |
00:54 | putting these slides together and it |
00:55 | produces this very long very |
00:57 | authoritative sounding answer there's a |
00:59 | bunch of you know sort of intermediate |
01:00 | computations and math here and if you |
01:03 | take the actual predictor that it gives |
01:04 | you at the end and you drop it back into |
01:06 | this problem |
01:08 | um it looks a little bit like this so |
01:11 | not necessarily a reliable way to uh to |
01:14 | estimate linear regression models right |
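For contrast with the ChatGPT experiment just described, here is a minimal sketch of the classical way to fit the same model. The use of numpy and the specific shapes are illustrative assumptions, not something prescribed in the talk:

```python
import numpy as np

# Minimal sketch: fit a linear model to (x, y) pairs the classical way,
# instead of pasting them into a chat model. Shapes are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))           # 20 examples, 3 features
w_true = rng.normal(size=3)
y = X @ w_true                         # noiseless targets

# Ordinary least squares: w_hat = argmin_w ||Xw - y||^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

x_new = rng.normal(size=3)
print(x_new @ w_hat)                   # prediction for a new x
```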
01:16 | now and you know more generally I guess |
01:19 | this is obviously a kind of ridiculous |
01:21 | thing to do we have already uh very very |
01:24 | old uh very very good algorithms for |
01:27 | solving these kinds of problems and so |
01:28 | it doesn't seem like the sort of thing |
01:30 | uh that we might necessarily uh need to |
01:34 | ask a language model to do or or where |
01:36 | it would be an interesting question to |
01:37 | ask whether language models text |
01:40 | generation systems uh like the kind that |
01:42 | we have today are able to do it at all |
01:46 | um nevertheless uh the kinds of things |
01:49 | that we can do with these big open-ended |
01:51 | text generation systems uh are becoming |
01:55 | you know sort of like more and more |
01:57 | General uh these models themselves are |
01:59 | becoming more and more capable right so |
02:01 | here's an example from what's now you |
02:03 | know an oldish paper 2020 uh showing |
02:05 | that these models can do real sort of |
02:07 | open-ended uh text generation so here's |
02:09 | a particular language model and we'll |
02:11 | talk more specifically about what these |
02:12 | language models are in a minute uh but |
02:14 | that is being prompted right it's asked |
02:16 | to predict what text should come next |
02:18 | after uh this text that we have in light |
02:21 | gray up at the top of the slide so we |
02:23 | give it the title of uh sort of |
02:25 | hypothetical news article we give it the |
02:26 | subtitle of a hypothetical news article |
02:29 | and then we ask it to generate the text |
02:30 | of the article itself and it gives you |
02:32 | this long paragraph right that uh are |
02:35 | you know both sort of compatible with |
02:37 | the title describing the sort of thing |
02:38 | uh that you would expect to be in an |
02:40 | article with that title written in the |
02:42 | style of a news article uh and more or |
02:45 | less internally coherent um it's also |
02:47 | worth noting that uh this article is or |
02:50 | the sort of factual content of this |
02:51 | article is wrong uh that it's describing |
02:54 | you know as prompted a sort of real |
02:55 | split that happened in the Methodist |
02:57 | Church but the identities of the |
02:58 | individual sort of groups that came out |
03:00 | of the Schism uh while plausible |
03:02 | sounding are totally made up |
03:05 | um now we can do more with these text |
03:08 | generation systems than just generate |
03:10 | long blocks of text |
03:12 | um one of the other surprising |
03:14 | capabilities uh that we've seen as |
03:17 | people have started to play around with |
03:18 | these things is that they can do |
03:20 | instruction following right so you can |
03:22 | take again a sort of generic uh next |
03:25 | word prediction system uh you can say |
03:28 | give it as input the text explained the |
03:30 | moon landing to a six-year-old in a few |
03:31 | sentences |
03:33 | um and a good enough text predictor will |
03:35 | actually respond by following this |
03:37 | instruction and by generating some text |
03:38 | that specifically explains the moon |
03:40 | landing or answers a question why do |
03:42 | birds migrate south for the winter |
03:44 | um and so on and so forth |
03:46 | um |
03:47 | and maybe the last sort of surprising |
03:50 | capability that we've seen in these |
03:52 | models as they have scaled up and as |
03:56 | people have started to play around with |
03:57 | them more |
03:58 | um is a phenomenon that is coming to |
04:01 | be referred to as in context learning |
04:03 | and so what's happening here is that |
04:05 | rather than just giving this model uh |
04:08 | the sort of beginning of a document that |
04:09 | we want it to complete and rather than |
04:11 | giving it an explicit textual |
04:12 | description of the task that we wanted |
04:14 | to perform we just give it some examples |
04:17 | of that task being performed on other |
04:19 | kinds of inputs so here we're asking it |
04:21 | to do some sort of grammatical error |
04:23 | correction task and we do this by |
04:26 | placing you know as the input to this |
04:28 | model the string poor English input I |
04:30 | eated the purple berries good English |
04:32 | output I ate the purple berries poor |
04:35 | English input thank you for picking me |
04:36 | as your designer I'd appreciate it but |
04:38 | good English output and we fill in all |
04:40 | of these pieces like this so again all |
04:42 | of this text that we're showing in light |
04:43 | gray on the slide is uh text that we as |
04:46 | a human user of this system are |
04:48 | providing and notice that the sort of |
04:50 | structure of this input right is that we |
04:52 | have some human written bad examples |
04:54 | good examples some human written inputs |
04:56 | and outputs |
04:57 | um uh We've Ended here with an input and |
05:01 | what the model is going to generate next |
05:02 | is a desired output and in this case |
05:04 | what we get out of the model is in fact |
05:07 | a corrected version of the last sentence |
05:09 | that we asked for and so we can specify |
05:11 | tasks that we want these language |
05:14 | generation systems these next word |
05:16 | prediction systems to perform uh again |
05:18 | not sort of in in the form of natural |
05:20 | language instructions but in the form of |
05:22 | tiny tiny data sets for these new tasks |
05:24 | that we're asking these models to |
05:26 | generate and this turns out uh you know |
05:29 | and this was something that was not even |
05:31 | necessarily planned out ahead of time |
05:32 | but discovered in these models uh around |
05:34 | 2020 uh and it's something that you can |
05:36 | do for a wide variety of tasks uh you |
05:39 | can do it even for simple arithmetic |
05:41 | kinds of things you can do it for |
05:43 | spelling correction you can do it for |
05:44 | translation you can do it for that |
05:46 | grammatical error correction and we'll |
05:47 | see a couple more examples later on um |
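Concretely, an in-context learning prompt like the grammar-correction one above is just a string in which the "training set" precedes the query. A rough sketch follows; the helper and template below are illustrative, not the format any particular model requires:

```python
def make_icl_prompt(examples, query):
    """Assemble a few-shot prompt; the model's continuation is read as the answer."""
    parts = []
    for bad, good in examples:
        parts.append(f"poor English input: {bad}\ngood English output: {good}")
    parts.append(f"poor English input: {query}\ngood English output:")
    return "\n".join(parts)

prompt = make_icl_prompt(
    examples=[("I eated the purple berries.", "I ate the purple berries.")],
    query="Thank you for picking me as your designer. I'd appreciate it.",
)
# `prompt` is handed to the language model as its input; whatever it generates
# next is taken to be the corrected sentence.
```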
05:52 | um and I think you know the important |
05:53 | thing to note here is that in some cases |
05:55 | this is uh |
05:58 | more effective right than providing a |
06:01 | precise natural language description of |
06:03 | the task that you want the model to |
06:04 | perform that they actually get better at |
06:06 | doing these kinds of things when you |
06:08 | give them uh explicit examples in the |
06:11 | input uh and just ask them to generate |
06:13 | more examples or generate you know the |
06:15 | output pieces of those examples uh from |
06:17 | that same distribution um and so this |
06:19 | you know like I said before this |
06:21 | phenomenon is coming to be referred to |
06:22 | as in context learning and what this is |
06:25 | meant to evoke is that there is a kind |
06:27 | of learning that looks like machine |
06:30 | learning as we're used to thinking of it |
06:32 | or parameter estimation as we're used to |
06:33 | thinking of it but that is happening now |
06:35 | not as part of training one of these |
06:37 | prediction systems but actually uh in |
06:41 | the course of figuring out how to |
06:42 | generate text or figuring out how to |
06:44 | complete one of these documents |
06:47 | um and so what this talk is going to be |
06:49 | about the sort of big question that I |
06:51 | want to ask today is basically what on |
06:53 | Earth is going on here uh what is the |
06:55 | relationship between this particular |
06:57 | phenomenon and learning as we're used to |
06:59 | thinking about it in a machine learning |
07:01 | setting and what can we say about the |
07:03 | the sort of you know reliability of |
07:05 | this process or the limits of |
07:06 | this process |
07:08 | um I want to note also before I go on |
07:10 | here that uh everything that I'm going |
07:11 | to be talking about today was led by my |
07:13 | student Ekin Akyürek with a big collaboration |
07:15 | from some folks at Google and Stanford |
07:17 | as well uh Dale Schuurmans and Denny Zhou |
07:19 | at Google and Tengyu Ma who's joint |
07:21 | between Google and Stanford |
07:23 | and the sort of broad uh outline of this |
07:25 | talk is to First figure out uh or you |
07:28 | know discuss just what this in context |
07:29 | learning phenomenon is uh and what |
07:32 | possible hypotheses we might build about |
07:34 | what's going on when models are doing |
07:36 | this text generation process |
07:39 | um and then we're going to ask with the |
07:41 | kinds of models that we have today |
07:43 | algorithmically what kinds of things |
07:44 | might they be able to do what kinds of |
07:46 | things might they not be able to do when |
07:48 | presented with these sorts of in-context |
07:50 | learning problems |
07:51 | um and this is all going to be kind of |
07:52 | in principle arguments just from the |
07:54 | structure of uh language models as they |
07:57 | exist today uh you know without looking |
07:59 | at any real models and then we'll |
08:01 | actually go out and train some models |
08:02 | and try to figure out what's what's |
08:03 | really going on under the hood and then |
08:06 | talk about sort of the implications of |
08:07 | this uh more more broadly for research |
08:10 | on NLP research on language models |
08:13 | so starting with this first question uh |
08:17 | what exactly is this uh in context |
08:20 | learning phenomenon and how might we be |
08:22 | able to describe this behavior |
08:24 | um and so to talk about this right let's |
08:26 | start by being a little bit more precise |
08:28 | about what these language models are |
08:29 | what something like uh you know ChatGPT |
08:32 | or whatever is |
08:33 | um and fundamentally what one of these |
08:34 | models is uh is just a next word |
08:37 | prediction system right like the thing |
08:39 | that sits on your phone as you're |
08:41 | writing a text message and tries to |
08:42 | generate what word you're going to say |
08:43 | next |
08:44 | um and so what that means in practice is |
08:46 | that we're modeling a distribution over |
08:48 | strings typically natural language |
08:50 | strings but now in practice kind of |
08:51 | anything that you can write in Unicode |
08:53 | um and that that distribution is |
08:55 | represented uh via just the sort of |
08:58 | normal chain rule distribution one token |
09:00 | at a time with some kind of parametric |
09:02 | model that's just trying to predict |
09:03 | every word given all of the words that |
09:05 | came before right so concretely right |
09:07 | thinking about what one of these in |
09:09 | context learning examples looks like uh |
09:12 | we're going to try to model a joint |
09:13 | distribution over for example input |
09:17 | output pairs with some separator tokens |
09:18 | and some input markers and whatever by |
09:21 | just asking what's the probability of |
09:24 | the first input and then the probability |
09:25 | of some separator symbol given that |
09:27 | first input and then the probability of |
09:29 | the first output and so on and so forth |
09:31 | all right and as you might expect uh the |
09:34 | very beginning of this generation |
09:36 | procedure the sort of first step in this |
09:37 | in this decomposition is going to be |
09:39 | super high entropy the model doesn't |
09:40 | know what it's supposed to be doing at |
09:42 | all if it has actually learned |
09:43 | the task that we're asking it to |
09:45 | perform by the end when we're asking it |
09:48 | to predict the X's it'll know the |
09:49 | distribution of x's and when we're |
09:50 | asking it to predict the Y's it'll you |
09:52 | know know sort of precisely the |
09:54 | distribution of y's given X's |
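Written out, the chain-rule factorization being described is the standard one; the notation here is mine, with w_1, ..., w_T denoting the token sequence that interleaves inputs, separators, and outputs:

```latex
p(x_1, \mathrm{sep}, y_1, x_2, \mathrm{sep}, y_2, \dots)
  \;=\; \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1}).
```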
09:56 | um so how do we actually model this |
10:00 | distribution uh in practice with you |
10:04 | know sort of in in the kinds of language |
10:06 | models that we're using today |
10:07 | um and pretty much every language model |
10:10 | that is in in widespread use that you'll |
10:13 | encounter that people are working with |
10:14 | in in the research Community right now |
10:16 | um is a neural network and it's a neural |
10:18 | network of a very specific kind called a |
10:21 | Transformer |
10:22 | um there's a lot of low-level details |
10:23 | about these Transformer models that I'm |
10:25 | not going to talk about at all right now |
10:27 | but that but sorry but there are some |
10:29 | high level details that are important |
10:32 | for uh for sort of understanding how |
10:34 | these models work and so I want to talk |
10:35 | through those things |
10:36 | um at a high level when I'm given an |
10:39 | input like this one or that contains |
10:41 | goat misspelled Arrow goat spelled |
10:43 | correctly comma snake misspelled Arrow |
10:47 | um what we're going to try to do uh I |
10:49 | guess some of the alignments are off |
10:50 | here but hopefully the correspondence is |
10:52 | right the first thing that we're going |
10:53 | to do is we're going to take each of |
10:55 | these natural language uh tokens right |
10:58 | the word goat or the word or you know |
11:01 | the the symbol Arrow or whatever |
11:03 | um and we're going to assign them what's |
11:04 | called an embedding we're gonna you know |
11:06 | map each of these words using some fixed |
11:08 | dictionary that isn't self learned to |
11:11 | some high dimensional parameter vector |
11:13 | and so if you've heard of word |
11:14 | embeddings right every you know what |
11:15 | we're giving this model is input uh is |
11:18 | an embedding of each of these words in |
11:19 | sequence |
11:21 | um and the first thing that we're going |
11:22 | to do once we've represented each of |
11:24 | these words in the input uh as some high |
11:27 | dimensional Vector |
11:29 | um is that we're going to perform what's |
11:30 | called attention so every word uh is |
11:34 | going to sort of pick up its or you know |
11:35 | at every position in this input we're |
11:37 | going to pick up the embedding that |
11:39 | we've computed so far |
11:41 | um and we're going to take |
11:42 | a linear combination of all of the word |
11:45 | embeddings that are in the sentence uh |
11:47 | up until this word that we've predicted |
11:49 | right now where the sort of Weights in |
11:51 | that linear embedding are themselves |
11:53 | some learned function of the input so |
11:54 | effectively you know when we're looking |
11:56 | at this word goat it's going to decide |
11:59 | how strongly you know and the the reason |
12:02 | this uh is called an attention mechanism |
12:04 | think of it as deciding how strongly it |
12:06 | wants to look at all of the other words |
12:08 | uh in the input up until this point uh |
12:10 | pool them according to sort of how |
12:12 | strongly it's looking at each of them |
12:13 | and then build some new representation |
12:15 | in context of this word goat that |
12:18 | incorporates all of this other |
12:19 | information from all of the other words |
12:20 | that have come before um and we can do a |
12:22 | little extra processing on top of this |
12:24 | so we don't like lose the identity of |
12:26 | the the word that we had as input but |
12:27 | you know at a high level think of this |
12:29 | the first thing that a Transformer does |
12:31 | this attention mechanism uh is just |
12:33 | taking some weighted combination of all |
12:35 | of the other representations that we've |
12:36 | built up to this point |
12:38 | um and we're going to do this in |
12:39 | parallel for every token in the input |
12:42 | right so we're going to compute a |
12:43 | representation of goat in this way and |
12:45 | go gets to look at all three tokens to |
12:47 | the left we're going to compute a |
12:48 | representation of for example this last |
12:50 | Arrow character and that last Arrow |
12:51 | character gets to look at the entire |
12:53 | sentence that we've seen so far and so |
12:55 | we're going to get as a result of this |
12:56 | entire process a sequence of |
12:59 | these attention vectors that are derived |
13:02 | from our original input |
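A minimal sketch of the causal attention step just described, assuming numpy and a single head; real models add multiple heads, output projections, residual connections, and layer normalization:

```python
import numpy as np

def causal_self_attention(E, Wq, Wk, Wv):
    """E is a (T, d) array of per-token embeddings; Wq, Wk, Wv are learned (d, d) maps.
    Each position takes a weighted combination of the representations at or before it."""
    T, d = E.shape
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(d)                        # how strongly position i attends to j
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # mask out positions to the right
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over earlier positions
    return weights @ V                                   # (T, d) pooled representations
```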
13:07 | once we've done this uh we're just going |
13:11 | to apply some other uh learned |
13:13 | non-linear transformation to each of |
13:15 | these new vectors that we've built up uh |
13:17 | this now doesn't look at any other piece |
13:19 | of the input at all it's happening again |
13:20 | sort of locally in parallel at every |
13:22 | step of the input |
13:24 | um you know and works basically the way |
13:27 | a normal feed forward neural network |
13:28 | works if you've seen those uh but we're |
13:30 | doing that now for every word |
13:31 | simultaneously and these two steps |
13:35 | together uh this attention step followed |
13:37 | by this multi-layer perceptron this sort |
13:40 | of generic non-linear transformation |
13:41 | we're going to refer to as a Transformer |
13:44 | layer and we can actually take these |
13:47 | Transformer layers and we can stack a |
13:48 | bunch of them on top of each other in |
13:50 | sequence so we can do this once and then |
13:52 | we can do it a second time where now |
13:54 | when we do that attention step we're |
13:56 | attending not to the original word |
13:57 | embeddings but to all of the your you |
14:00 | know all of the representations that we |
14:01 | built at the last step of this procedure |
14:03 | uh and you know in practice for big |
14:05 | models you're doing this tens of times |
14:07 | or hundreds of times uh to eventually |
14:10 | build up a representation of the entire |
14:12 | input sequence that you have and what |
14:14 | we're going to get here at the end uh |
14:16 | again is a sort of sequence of uh of the |
14:18 | outputs of those final uh MLP steps uh |
14:23 | one for every token in this input |
14:25 | sentence and the last thing that we're |
14:27 | going to do is we're going to take the |
14:28 | very last hidden representation the last |
14:30 | output from the last of these |
14:31 | Transformer layers and we're just going |
14:33 | to try to decode it into a distribution |
14:36 | over possible next tokens using |
14:39 | basically something that looks like a |
14:40 | log linear model right so again every |
14:42 | word in my vocabulary just like at the |
14:44 | embedding layer is associated with |
14:46 | some vector and we're just going to sort |
14:49 | of compute dot products between those |
14:50 | vectors and this last hidden |
14:51 | representation and then exponentiate and |
14:53 | normalize to get a distribution over |
14:55 | y's and so what we hope right in the |
14:58 | context of these kinds of in context |
15:00 | learning problems is that this |
15:02 | distribution that we get over the end |
15:03 | places a lot of Mass on the correct |
15:06 | spelling of snake and not a lot of Mass |
15:08 | on on all of the other words in our |
15:09 | vocabulary |
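And a sketch of the final readout step described above: dot the last hidden state against every vocabulary vector, exponentiate, and normalize (again assuming numpy; this is the "log-linear" readout the speaker mentions):

```python
import numpy as np

def next_token_distribution(h_last, output_embeddings):
    """h_last: (d,) final hidden state; output_embeddings: (V, d), one row per vocab item.
    Returns a probability distribution over the V possible next tokens."""
    logits = output_embeddings @ h_last
    logits -= logits.max()                 # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```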
15:12 | good so how uh having defined this model |
15:15 | architecture do we train them uh and you |
15:19 | know in the most basic sense the |
15:20 | training is super easy you go out and |
15:22 | you scrape as much text Data as you can |
15:24 | get your hands on uh you crawl the |
15:25 | entire internet you pay a bunch of |
15:27 | people to sit in a room and write you |
15:28 | know sort of nicely structured documents |
15:30 | for you |
15:31 | um and then you just do maximum |
15:32 | likelihood estimation with this model on |
15:34 | that gigantic uh gigantic data set |
15:37 | um a lot of the challenges here are |
15:39 | really engineering challenges uh more |
15:41 | than uh than kind of modeling challenges |
15:42 | and a lot of the reason that you know |
15:44 | sort of open AI uh is able to train |
15:46 | these models and people in Academia are |
15:48 | not uh has to do with just like the cost |
15:51 | of uh computational resources that you |
15:53 | need to really do this at a scale that |
15:55 | makes it effective |
15:57 | um one thing that I'll say that's not |
15:58 | going to be part of this talk at all is |
16:00 | that in modern most modern language |
16:03 | models especially the ones that sort of |
16:04 | look and talk like chat Bots there's |
16:06 | another step that goes on after this |
16:08 | process of sort of learning uh via |
16:11 | something that looks like reinforcement |
16:12 | learning from human feedback |
16:14 | that's not really going to be relevant |
16:16 | for what we're talking about today the |
16:17 | relationship between that and uh some of |
16:19 | the sort of capabilities that I was |
16:21 | talking about at the beginning is still |
16:22 | a little unclear |
16:24 | um |
16:25 | but is part of kind of the modern recipe |
16:28 | okay so this is the basic uh Transformer |
16:31 | architecture |
16:32 | um what happens when we actually do this |
16:34 | when we you know fit this uh maximum |
16:37 | likelihood objective uh to the entire |
16:39 | internet |
16:40 | um and what happens is you get a model |
16:41 | that is very very good at a very large |
16:44 | number of different things right so here |
16:45 | are some examples uh from a recent paper |
16:48 | showing uh good performance on the bar |
16:50 | exam although apparently these bar |
16:52 | numbers are actually really sketchy and |
16:53 | this is you know maybe not something to |
16:55 | take seriously |
16:56 | um but pretty good performance if not |
16:58 | like top human level performance across |
17:00 | a wide variety of uh things requiring |
17:03 | both a little bit of reasoning and uh |
17:06 | quite a lot of knowledge right the LSAT |
17:08 | the SAT the GRE uh both the math |
17:10 | sections and the verbal sections |
17:13 | um and you know this has been in the |
17:14 | news I assume many people have seen this |
17:16 | before |
17:17 | um and I think you know it was |
17:19 | surprising surprising is maybe not even |
17:21 | a strong enough word but uh you know to |
17:24 | many of us in the field both that uh |
17:28 | just doing this at a large enough scale |
17:30 | would get you here |
17:32 | um and that we were going to be able to |
17:33 | do this kind of as quickly as possible |
17:35 | and that the large enough scale turns |
17:37 | out to be you know only all of the text |
17:39 | that Humanity has ever written and not |
17:41 | something much more than that |
17:43 | okay good so among the things that |
17:46 | happens when you train a model in this |
17:49 | way like we were talking about before |
17:51 | um is this in context learning |
17:53 | phenomenon right so here now we can talk |
17:55 | a little bit more precisely about what |
17:56 | ICL uh is or what it would take to be a |
17:59 | good in context learner and it's |
18:01 | something like being able to take a |
18:03 | sequence of inputs that looks like this |
18:05 | um and you know have as your output |
18:08 | distribution something that places a lot |
18:10 | of probability uh on uh in this context |
18:13 | the French word for cheese and you know |
18:15 | maybe other sort of plausible variations |
18:17 | on it |
18:18 | um similarly if we want to do not |
18:20 | machine translation but sentiment |
18:21 | analysis we want to be able to plug in |
18:23 | some uh examples of sentences associated |
18:27 | with with sentiment uh assign them |
18:29 | natural language labels uh and and be |
18:31 | able to predict the appropriate ones I |
18:32 | guess there's some text getting cut off |
18:33 | here but uh but in the right way |
18:36 | um and finally uh one of the kind of |
18:38 | surprising things that happens is that |
18:40 | you can do this not just with |
18:43 | um you know sort of well-defined |
18:46 | problems that you expect to see |
18:47 | occurring many times in the training |
18:48 | data like the sentiment analysis problem |
18:50 | or the machine translation problem uh |
18:52 | but with weird made up problems that |
18:54 | surely nobody in the world has ever |
18:55 | asked one of these models to solve |
18:57 | before uh so you know this thing on the |
18:59 | slide this particular combination of |
19:00 | symbols and meanings |
19:02 | um is something that Ekin made up the |
19:03 | first time he gave a version of this |
19:05 | talk uh and uh models are at least the |
19:08 | you know particular language model that |
19:09 | he uh asked this question to was able to |
19:12 | produce the right output here uh the |
19:15 | first time |
19:16 | so coming back to the original question |
19:18 | at the very beginning of the talk |
19:20 | um what is it that's actually going on |
19:22 | here under the hood in virtue of which |
19:25 | these models are able to solve problems |
19:26 | like this |
19:28 | um and I think you know the initial |
19:29 | reaction from most people in the |
19:32 | language processing and machine learning |
19:34 | communities uh was that learning was |
19:36 | probably not the right way to describe |
19:38 | what was actually going on in these |
19:40 | models and maybe better to think of it |
19:42 | as something like uh identification and |
19:45 | here's a quote from Reynolds and |
19:48 | McDonald that I think captures this sort |
19:50 | of hypothesis about what's going on here |
19:52 | pretty nicely |
19:53 | um few-shot prompting is not really |
19:54 | learning of skills but just locating the |
19:56 | skills that the model already has this |
19:59 | is most obvious for translation where |
20:00 | you can't learn a new language from five |
20:02 | examples and you know I think the |
20:05 | translation case is a clear example of |
20:07 | uh of where this is the right way of |
20:09 | thinking about these models in fact you |
20:10 | can't learn French from uh |
20:13 | five examples like skipping ahead here |
20:17 | you know certainly if there's an English |
20:19 | word that you've never seen before and |
20:20 | you can produce the right French output |
20:21 | some of the knowledge about the |
20:23 | relationship between English and French |
20:24 | does not reside in the sort of in |
20:26 | context training set here but has to |
20:28 | come from the model's background |
20:29 | knowledge |
20:30 | but when we think about these more kind |
20:34 | of algebraic puzzle solving problems um |
20:37 | it's a little less clear that this is |
20:38 | the right way of thinking about things |
20:40 | uh right I think we can be pretty |
20:42 | confident that at least then maybe not |
20:45 | now this particular |
20:47 | example didn't appear in the training |
20:48 | data for any of these models you know |
20:50 | maybe other things with a similar flavor |
20:52 | a similar grammar induction structure |
20:54 | um but uh but not exactly this one and |
20:58 | so competing with that hypothesis that |
21:01 | we had on the previous slide is another |
21:03 | hypothesis that this in-context learning |
21:06 | thing is quote unquote real learning |
21:08 | right that it's capable of sort of |
21:09 | taking a training set that fully |
21:11 | determines the behavior that you want to |
21:12 | get out and using that kind of training |
21:15 | set to index within some reasonably |
21:17 | large hypothesis class the actual |
21:19 | function that you want this model to |
21:21 | implement even if that's a function that |
21:22 | it never saw executed anywhere at |
21:25 | training time |
21:27 | and so the thing that we set out to do |
21:29 | in this project was basically to answer |
21:31 | that question to figure out |
21:33 | um whether it is possible even in |
21:35 | principle uh for something like one of |
21:38 | these Transformer models uh to make a |
21:40 | version of this hypothesis true to get |
21:42 | them to do real learning and then to |
21:43 | actually figure out what's going on in |
21:45 | Real Models Yeah question |
21:56 | um that is a great question I am |
21:59 | uh not sure we could try it |
22:02 | yeah I mean and like this is I I so I'm |
22:06 | not going to be able to like produce off |
22:07 | of my top of my head though like full |
22:08 | exploration that we did here but no I |
22:10 | mean that's a good point and you know uh |
22:12 | there are also examples of cases where |
22:14 | providing this kind of extraneous |
22:15 | information actually confuses the model |
22:17 | and |
22:18 | um uh you know I'm sure it's also not |
22:20 | the case that uh you can get 100% |
22:22 | accuracy on things in this General class |
22:28 | um |
22:29 | other questions before we go on |
22:32 | Okay cool so yeah the the first question |
22:34 | that we're gonna ask here then is |
22:37 | whether it's possible for these |
22:38 | Transformer type models uh to do real |
22:41 | learning whatever that means uh even in |
22:43 | principle so to sort of take this |
22:45 | hypothesis that we had before |
22:47 | um and figure out whether there's some |
22:49 | class of uh functions and some class of |
22:51 | models for which this is true |
22:54 | um and we're going to do this by looking |
22:55 | actually at those uh linear models that |
22:59 | we looked at at the very beginning of |
23:01 | the talk uh just because this is a case |
23:03 | where we know exactly what the kind of |
23:05 | space of algorithmic solutions looks |
23:06 | like it's very easy to generate data and |
23:09 | it's a simple enough problem that you |
23:10 | don't have to go out and use something |
23:12 | trained on the entire internet you can |
23:13 | actually do all of the learning yourself |
23:16 | um one other sort of important caveat to |
23:18 | make here uh is that you know in some |
23:21 | sense this question of uh can a |
23:24 | Transformer Implement function X where |
23:26 | even function X is you know training |
23:27 | some little machine learning model on |
23:28 | the inside |
23:30 | um was already answered decades ago we |
23:32 | know that neural networks are general |
23:34 | purpose function approximators when made |
23:37 | sufficiently large and trained on enough |
23:38 | data so really the question here is |
23:41 | whether uh you can do but but the |
23:43 | constructions that uh you use to do that |
23:46 | kind of universal approximation are kind |
23:48 | of ridiculous they blow up uh |
23:50 | exponentially in the the size of the |
23:51 | input and the complexity of the function |
23:52 | so really what we're trying to figure |
23:54 | out here |
23:56 | um is whether you can do this not just |
23:58 | at all but with models that are the kind |
24:00 | of size and shape of the ones that we're |
24:02 | actually using uh today and of the size |
24:05 | and shape that are uh showing all those |
24:07 | real world results that I was showing |
24:09 | you before |
24:09 | um citation is cut off here but it is to |
24:12 | paper by Garg et al. at Stanford who |
24:15 | around the same time we were doing this |
24:16 | uh asked a similar set of Behavioral |
24:19 | questions about just like what kinds of |
24:20 | functions extensionally are you able to |
24:22 | learn in this framework okay so uh the |
24:25 | way we're going to go about looking at |
24:26 | this right rather than taking our |
24:28 | translation problems from before and |
24:30 | asking models to solve translation |
24:31 | problems by via this word embedding step |
24:34 | or whatever we're just going to generate |
24:36 | some input output pairs we're going to |
24:37 | sample uh you know a random weight |
24:38 | Vector here I guess I'm showing this as |
24:40 | one dimensional but these are going to |
24:42 | be higher dimensional problems later on |
24:44 | uh and we're just going to train the |
24:45 | model right in exactly whoops uh in |
24:48 | exactly the same way and I guess for now |
24:49 | we're not even training models we're |
24:51 | just asking how we can parameterize them |
24:52 | to begin with so can we uh parameterize |
24:55 | some Transformer in such a way that you |
24:58 | know we give this zero as input it |
24:59 | predicts a zero as output |
25:01 | um one thing to note here you know we |
25:04 | can also and in all in the experiments |
25:06 | that I'm going to show later on we are |
25:07 | going to train these models both to |
25:10 | predict y's given x's and to |
25:13 | predict X's uh given the whole history |
25:16 | of interactions that have happened |
25:17 | before uh here we're not going to worry |
25:19 | about actually modeling the input |
25:21 | distribution at all we're just going to |
25:22 | worry about modeling that conditional |
25:24 | distribution |
25:25 | um and I think it's actually an open |
25:26 | question how much this matters for uh |
25:28 | for real training things |
25:30 | um good so you know |
25:32 | not to belabor this point too much uh |
25:34 | you know we've had linear regression and |
25:36 | solutions to linear regression problems |
25:37 | since the 1800s uh we know that there |
25:40 | are lots of algorithms that you might |
25:41 | pack into one of these Transformers that |
25:43 | will solve them |
25:44 | um and you know you can do this by just |
25:46 | sort of directly inverting this |
25:47 | relevant Matrix you can do this by |
25:50 | gradient descent on uh some kind of |
25:52 | least squares objective or something |
25:54 | else that looks like that and so the |
25:56 | question is whether we can take any of |
25:58 | these algorithms and pack them into one |
26:00 | of these models like this |
26:03 | um |
26:04 | and you know it turns out that you can |
26:06 | do this uh |
26:08 | the details are sort of mechanical and |
26:10 | I'm just going to give a sort of high |
26:11 | level flavor for one of these but it |
26:14 | turns out that you can show uh both that |
26:17 | if you're iteratively trying to fit these |
26:19 | linear models via SGD or via batch |
26:21 | gradient descent you can do that you can |
26:24 | do that using a relatively shallow model |
26:26 | uh and you know using a model that has |
26:29 | depth that scales proportionally to the |
26:31 | number of steps of SGD that you want to |
26:33 | do here |
26:34 | um using you know sort of a reasonable |
26:36 | number of these attention mechanisms a |
26:38 | reasonable number of these layers and |
26:40 | second you can do this not just via SGD |
26:43 | but just sort of by directly |
26:44 | constructing uh that final predictor via |
26:47 | a sequence of rank 1 updates to |
26:49 | um uh that uh (X^T X)^(-1) matrix |
26:54 | um you know again there's a lot of sort |
26:57 | of low-level mechanical stuff but just |
26:58 | to give a high level like Flavor of how |
27:01 | this works |
27:02 | um we're going to define a little |
27:03 | calculus of operations that we can |
27:05 | Implement inside a transformer for sort |
27:07 | of moving data around and applying |
27:09 | linear transformations to it and then |
27:11 | once you have these things you can just |
27:12 | chain them together uh and Implement uh |
27:15 | really any algorithm uh that you choose |
27:17 | that bottoms out in these operations |
27:19 | right uh so what are the things that we |
27:21 | need to do for example paradigmatically |
27:24 | to implement uh gradient descent or even |
27:26 | just the first step of gradient descent |
27:28 | on this objective |
27:29 | um well one thing that we're going to |
27:30 | need to be able to do if uh most of the |
27:33 | processing is going to get done by these |
27:35 | multi-layer perceptron units these feed |
27:37 | forward layers that we were talking |
27:38 | about before is just to consolidate our |
27:41 | data right so we need to get our x's and |
27:43 | our y's into the same kind of part of |
27:45 | the representation space so that the |
27:46 | model can work on them uh we're going to |
27:48 | call this a move operation and it turns |
27:50 | out that you can implement this move |
27:51 | operation uh pretty simply using uh that |
27:54 | attention mechanism that we saw before |
27:56 | they can just sort of pick up a piece of |
27:57 | the Hidden State uh from one input and |
28:00 | move it to the next time step so the |
28:01 | first thing that we're going to do for |
28:03 | SGD here for example is just accumulate |
28:05 | the X's uh into our y's |
28:07 | another thing that you need to do here |
28:09 | is take affine Transformations right |
28:13 | for example on this first step we have |
28:15 | some initial guess at the weight Vector |
28:16 | we need to dot that by our initial |
28:18 | feature Vector to figure out what our |
28:19 | error is and so we need some way of |
28:21 | doing affine Transformations uh and you |
28:24 | know affine Transformations are in some |
28:25 | sense like the only thing that these |
28:27 | multi-layer perceptron uh units can do |
28:29 | and so this is very very easy to |
28:31 | implement using that and MLP layer that |
28:34 | we were showing before |
28:35 | um and then the last thing and the thing |
28:36 | that turns out to be the sort of |
28:38 | fussiest piece of this |
28:39 | um is Computing dot products between |
28:42 | pieces of these feature vectors or |
28:44 | between feature vectors across time you |
28:46 | need this for example to scale your |
28:49 | input by your prediction error when |
28:50 | you're doing an SGD step |
28:52 | um and it turns out that you can also do |
28:54 | this uh by exploiting a bunch of very |
28:57 | low level details in the way these MLP |
29:01 | layers are defined in practice for sort |
29:03 | of real world Transformer models and |
29:05 | this is like I I don't know very inside |
29:07 | baseball |
29:08 | um probably not worth going into too |
29:10 | much details and if they would make |
29:11 | sense you've already read this paper but |
29:12 | basically you can get uh the |
29:15 | non-linearity in that MLP to pretty well |
29:18 | approximate uh element-wise |
29:20 | multiplication of a couple of these |
29:22 | vectors they're scaling by a scalar |
29:24 | um if you get things very close to zero |
29:25 | and do things in the right way |
29:27 | um and once you do this right once |
29:29 | you've sort of defined uh Transformer |
29:31 | chunk that does this move operation a |
29:33 | Transformer chunk uh that does this |
29:35 | affine operation and Transformer chunk |
29:37 | that does this dot operation |
29:39 | um then all you need to do to show that |
29:40 | a Transformer can Implement SGD is write |
29:43 | the program in terms of these operations |
29:44 | uh that actually uh does so right so |
29:48 | here for example is the one that does |
29:49 | SGD you can write a corresponding one uh |
29:52 | for doing the sort of sequence of rank |
29:54 | one updates with the the Sherman |
29:55 | Morrison formula |
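For reference, here are the two target algorithms written out in ordinary code rather than as transformer weights: one SGD step on the squared error, and the Sherman-Morrison rank-1 update used to build the exact least-squares solution incrementally. Everything below (numpy, the learning rate, the tiny initial ridge term) is an illustrative choice, not the construction from the paper itself:

```python
import numpy as np

def sgd_step(w, x, y, lr=0.01):
    """One stochastic gradient step on 0.5 * (x.w - y)^2 for a single example."""
    err = x @ w - y                       # prediction error
    return w - lr * err * x               # scale the input by the error and update

def sherman_morrison_update(A_inv, x):
    """Update (X^T X)^(-1) after appending a new row x to X, as a rank-1 correction."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

def incremental_ols(xs, ys, eps=1e-6):
    """Build the least-squares predictor one example at a time."""
    d = xs.shape[1]
    A_inv = np.eye(d) / eps               # tiny ridge term so the matrix starts invertible
    b = np.zeros(d)
    for x, y in zip(xs, ys):
        A_inv = sherman_morrison_update(A_inv, x)
        b += y * x
    return A_inv @ b                      # approximately (X^T X)^(-1) X^T y
```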
29:57 | um and that's all you need to do to |
29:59 | generate these kind of in-principle uh |
30:01 | demonstrations that Transformers can do |
30:03 | real in context learning |
30:06 | um one other nice kind of detail here is |
30:09 | that all of those three things that I |
30:10 | was showing before you can actually |
30:11 | consolidate into a big generic kind of |
30:13 | like uh read transform and write |
30:16 | operation uh that you can you can |
30:19 | implement it in a sort of generic way |
30:21 | um and there has been some follow-up |
30:23 | work since we put this out sort of |
30:24 | developing just generalized versions of |
30:26 | that like uh generic Transformer read |
30:28 | attend write operator uh to make in some |
30:31 | sense nicer programming languages for |
30:34 | describing what goes on or what what |
30:35 | kinds of functions you can Implement |
30:37 | with these Transformers uh and coming up |
30:39 | with tweaks to this architecture |
30:42 | good and you know so uh maybe |
30:44 | unsurprisingly but it's nice to see that |
30:46 | in fact uh it is possible to |
30:48 | parameterize these models uh so that |
30:50 | they Implement real algorithms that |
30:52 | we've written down ahead of time uh and |
30:54 | thus it's possible at least in theory |
30:55 | that these Transformers that we're |
30:57 | seeing out in the real world for at |
30:58 | least some subset of these uh in context |
31:00 | learning problems uh are doing real ICL |
31:03 | um and so you know of course the natural |
31:05 | question now is what is happening in |
31:09 | practice when we actually train these |
31:11 | models on the distribution that we |
31:13 | assumed we were going to get in the |
31:14 | previous part of this talk uh do they |
31:17 | actually converge to the kinds of |
31:18 | solutions that we were expecting to find |
31:20 | here um and so we can ask this in a |
31:22 | couple of different ways uh one is you |
31:25 | know just do you get the same kinds of |
31:27 | generalizations that would be predicted |
31:28 | if you were implementing some version of |
31:31 | real learning as we've described it |
31:33 | before and second what's the |
31:35 | relationship between a model's ability |
31:37 | to discover the solution and uh lower |
31:40 | level details about the capacity of the |
31:42 | model how big it is and other |
31:44 | implementational things |
31:46 | um so we're going to try to answer both |
31:47 | of these things and we're going to do it |
31:49 | right by now just taking this basic |
31:51 | learning setup that we assumed for |
31:53 | constructing some in context Learners |
31:55 | before and actually now just training |
31:57 | models on this data right so we're going |
31:59 | to sample |
32:00 | um some weight vectors we're going to |
32:02 | sample some input vectors we're going to |
32:04 | generate some y's using the weight |
32:06 | vectors and the input vectors that we |
32:07 | sampled and in particular right we're |
32:09 | going to construct a sequence or a data |
32:12 | set of sequences where each of these |
32:14 | sequences uh was generated by a single |
32:17 | one of these weight vectors right so uh |
32:19 | you know here I have a bunch of y's that |
32:21 | were generated from some particular W |
32:24 | sampling my x's from from some normal |
32:27 | distribution that's going to stay the |
32:28 | same |
32:29 | um and I'm going to present the model |
32:31 | with the sequence and ask it to make a |
32:33 | prediction at every step of the sequence |
32:35 | and then when it gets to the end of the |
32:37 | sequence I'm going to sample a new |
32:38 | weight vector and I'm going to give it a |
32:39 | sequence of X Y pairs that were |
32:41 | generated by that new weight Vector |
32:44 | um and you know again we can do this |
32:46 | over and over again this data is fake |
32:48 | it's easy to generate we can generate |
32:49 | lots of it and you can train uh actually |
32:52 | smallish models as we're going to see |
32:53 | later on to do pretty well on this |
32:55 | objective |
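A sketch of that data-generation recipe; numpy, the dimension, and the sequence length are illustrative choices (the experiments described later use 8-dimensional problems):

```python
import numpy as np

def sample_icl_sequence(d=8, n_examples=40, noise_std=0.0, rng=None):
    """One training sequence: draw a fresh weight vector w, then (x, y) pairs with y = w.x."""
    rng = rng or np.random.default_rng()
    w = rng.normal(size=d)                    # a new w for every sequence
    X = rng.normal(size=(n_examples, d))      # x's from a fixed normal distribution
    y = X @ w + noise_std * rng.normal(size=n_examples)
    return X, y, w

# The transformer is trained to predict each y_t from the interleaved prefix
# x_1, y_1, ..., x_{t-1}, y_{t-1}, x_t, with a new w drawn for every sequence.
```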
32:57 | so what happens when you do this |
33:01 | um and so the first thing that we're |
33:02 | going to look at here is just the |
33:04 | quality of the predictions that the |
33:06 | model is making |
33:07 | um so notice here that when we're |
33:08 | actually |
33:10 | um oh the axis labels got chopped off |
33:12 | but uh uh we'll we'll explain these in a |
33:14 | minute |
33:15 | um when we are |
33:17 | uh |
33:20 | uh sampling from the model right at no |
33:22 | point are we asking it to explicitly |
33:24 | exhibit uh an inferred parameter Vector |
33:26 | we're just asking it to predict uh you |
33:28 | know our y's from our X's |
33:30 | um and for all of the plots that I'm |
33:32 | going to show now we're going to look at |
33:33 | eight dimensional problems so |
33:34 | unsurprisingly right with fewer than |
33:36 | eight examples in the input the model |
33:38 | can't possibly know exactly what uh |
33:40 | noiseless linear function is trying to |
33:41 | learn and so you get some high error |
33:43 | rate that declines uh gradually over |
33:45 | time |
33:46 | um so this is just the you know sort of |
33:49 | uh squared error for the predictions |
33:52 | made by the model all of the other lines |
33:54 | that I'm going to show right now are not |
33:56 | agreement with ground truth but |
33:57 | agreement with other uh prediction |
34:00 | algorithms that we can fit to these |
34:01 | data sets that we're |
34:04 | handing to our Transformer model to do |
34:06 | ICL right uh so we might oh yeah |
34:08 | question |
34:13 | sorry |
34:16 | yeah |
34:18 | you just run just one epoch yeah so it |
34:20 | just sees you know X Y X Y X Y in |
34:22 | sequence without repetition yeah |
34:26 | other questions about the setup |
34:28 | okay |
34:29 | um good so what we're going to ask now |
34:32 | is you know not just what's the fit of |
34:34 | this model to the sort of ground truth |
34:36 | labeling function but to other learning |
34:38 | algorithms that we might train on the |
34:40 | same data uh we might hypothesize and in |
34:43 | fact people in the NLP Community have |
34:44 | hypothesized that the right way to think |
34:46 | about ICL is that it's doing something |
34:48 | like nearest neighbors so we can fit |
34:49 | some nearest neighbors models and those |
34:51 | don't actually seem to do a very good |
34:53 | job at all of describing uh the kinds of |
34:56 | predictions that this Transformer makes |
34:58 | you can do SGD and you know various |
35:02 | versions of of SGD and gradient descent |
35:05 | where you look at all the examples in a |
35:06 | batch or you look at them one at a time |
35:08 | um all of these things early on seem to |
35:11 | agree relatively well with the model's |
35:12 | predictions uh and you know sort of |
35:15 | diverge around that critical example and |
35:18 | then start to fit again better later on |
35:21 | um you can ask about you know more sort |
35:23 | of Old School uh I guess not old school |
35:25 | but but standard regression algorithms |
35:27 | here we're looking at uh regularized |
35:31 | you know Ridge regression here uh this |
35:33 | is a noiseless problem so you shouldn't |
35:35 | need this in principle nevertheless this |
35:37 | gives us a really pretty good fit to our |
35:38 | data but maybe unsurprisingly right if |
35:41 | you just do ordinary least squares uh taking |
35:43 | the min-norm solution in this uh less |
35:45 | than eight regime uh this is almost a |
35:47 | perfect fit to the predictions that |
35:49 | these models are making so they really |
35:50 | do seem to be uh behaving uh now without |
35:54 | saying about the sort of anything about |
35:55 | the computations that support them uh or |
35:58 | support that but behaving like |
36:00 | um uh like these OLS type models um and |
36:03 | this you know maybe should not surprise |
36:04 | us right if this is really our data |
36:06 | generation process we know that the Min |
36:09 | Bayes risk uh predictions are going to |
36:11 | come from a weight Vector that looks |
36:12 | like this and so if we could really |
36:14 | drive this |
36:16 | loss function all the way down to zero |
36:18 | in some sense what we would have to do |
36:20 | would be to make predictions that look |
36:22 | like this in this noiseless regime |
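The reference predictors being compared against can all be written in a few lines. A sketch, with illustrative hyperparameters, of the kind of baselines fit to the same in-context examples:

```python
import numpy as np

def ols_min_norm(X, y):
    """Ordinary least squares; lstsq returns the minimum-norm solution when
    there are fewer examples than dimensions (the under-determined regime)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def batch_gd(X, y, steps=1, lr=0.01):
    """A few steps of batch gradient descent on the squared error, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

# Agreement with the transformer is measured by comparing x_new @ w_hat from each
# baseline with the transformer's own prediction at the same point in the prompt.
```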
36:25 | um and you know we can sort of stress |
36:27 | test this a little bit by changing the |
36:28 | data generation process such that the |
36:30 | minrisk predictor looks a little bit |
36:32 | different uh we can make it for example |
36:34 | we can add some noise to our y's uh so |
36:36 | that now the right thing is to behave uh |
36:39 | as though you're regularized a little |
36:40 | bit right and we know in particular for |
36:42 | uh particular uh particular distribution |
36:46 | of weight vectors uh particular |
36:47 | distribution of noise being added to |
36:50 | these y's uh that there's a nice closed |
36:52 | form for uh the grand truth predictor |
36:54 | here in terms of these two things and so |
36:56 | you know you can do uh something that |
36:58 | looks like Ridge regression uh with a |
37:00 | parameter that's determined by these two |
37:02 | uh |
37:03 | noises |
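The closed form being referred to is the standard Bayesian linear-regression result: if the weight vectors are drawn from N(0, tau^2 I) and the y's carry added noise N(0, sigma^2), the minimum-Bayes-risk predictor is ridge regression with the regularizer set by those two scales (the symbols here are mine):

```latex
\hat{w}_{\mathrm{ridge}} \;=\; \Big(X^\top X + \tfrac{\sigma^2}{\tau^2} I\Big)^{-1} X^\top y
```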
37:05 | um and when you do this and you ask |
37:06 | these questions about what's the sort of |
37:08 | best fit to the model uh you again see |
37:10 | exactly the pattern that you want to see |
37:11 | here right that when there's no noise at |
37:12 | all uh the fit to OLS is better than |
37:15 | everything else as you start to add |
37:17 | noise either to the data generation |
37:19 | process or sorry either to your x's uh |
37:22 | or sorry either to your W's or to your |
37:25 | uh y's given W's |
37:27 | um you see this nice pattern where uh |
37:29 | the predictor that that best fits the |
37:31 | predictions that you actually get out of |
37:33 | this Transformer are exactly the ones |
37:34 | that you would expect if you were |
37:37 | uh you know so this is just saying that |
37:39 | this sort of finding before is robust |
37:41 | that basically these models at least at |
37:43 | the scale that we're training them do |
37:44 | actually give you exactly the predictor |
37:46 | that you want for these kinds of linear |
37:49 | regression problems and this agrees with |
37:51 | uh that uh Garg et al. paper that I was |
37:54 | talking about before that does similar |
37:55 | kinds of experiments also in the |
37:57 | presence of uh you know sort of sparse |
37:59 | X's sparse W's things like that |
38:02 | um a more interesting question that we |
38:04 | can ask here |
38:06 | um is whether this always happens or |
38:08 | whether in the sort of uh presence of |
38:13 | stronger computational constraints than |
38:15 | we have assumed up to this point uh |
38:17 | models are still able to sort of |
38:19 | perfectly fit these distributions or |
38:20 | whether they do something different |
38:22 | um and the cool thing that happens here |
38:23 | is that you actually get different fits |
38:25 | to different algorithms as a function of |
38:27 | model size so as we make you know for |
38:29 | the sort of very shallowest models if we |
38:31 | only give them one layer uh they are not |
38:34 | perfectly described but best described |
38:36 | as doing something that looks like a |
38:39 | single step of I think I've covered this |
38:41 | up on the legend but of actually batch |
38:42 | gradient descent on uh the input as you |
38:46 | make these models bigger uh they look a |
38:48 | little bit less like they're doing |
38:49 | gradient descent and a little bit more |
38:51 | like they're doing uh proper least |
38:53 | squares uh and you know as you make them |
38:55 | uh big enough like we've been doing |
38:57 | before uh you converge to that OLS |
38:59 | solution uh you can also look at this as |
39:01 | a function of uh the hidden size of the |
39:04 | model the size of these embedding |
39:05 | vectors that we're passing up and here |
39:07 | you see less clear but also a similar |
39:09 | kind of trend uh where for very very |
39:11 | small hidden sizes |
39:13 | um uh you have a slightly better fit but |
39:16 | not a great fit from these SGD type |
39:18 | predictors uh and at bigger sizes you |
39:20 | have a better fit from uh from the right |
39:22 | ones |
39:23 | um and so we can think of there being a |
39:24 | couple kind of uh phase phases or |
39:28 | regimes in uh model parameter space uh |
39:31 | or sorry in in model architecture space |
39:33 | that describe the kinds of algorithmic |
39:35 | solutions that these models find to |
39:38 | um to this regression problem |
39:40 | um and we're going to look uh later on |
39:42 | specifically at uh at this sort of SGD |
39:44 | regime uh and try to figure out better |
39:46 | what's going on there |
39:48 | the last question that we are going to |
39:53 | ask about these train models before we |
39:55 | do that |
39:56 | um is just whether we can figure out |
39:57 | anything at all about what's going on |
39:59 | under the hood uh you know so far all of |
40:02 | the characterization of real models that |
40:04 | we've done uh has been uh sort of |
40:07 | extensional right in terms of their |
40:09 | functional form and not in terms of |
40:10 | their uh internal computations |
40:13 | um but it's reasonable to ask given that |
40:15 | we have these sort of constructions |
40:16 | lying around for what intermediate |
40:17 | quantities you might need to compute in |
40:19 | order to solve these problems whether we |
40:21 | can see any evidence that trained models |
40:23 | are actually Computing the relevant |
40:25 | intermediate quantities |
40:26 | um and so to do this we're going to do |
40:27 | what's called I guess now in the the ml |
40:30 | literature A probing experiment we're |
40:31 | going to take our original trained model |
40:33 | we're going to freeze its parameters and |
40:36 | we're going to fit some teeny little |
40:37 | model uh either uh just like a single |
40:41 | linear readout or a little tiny little |
40:43 | multi-layer perceptron and we're going |
40:45 | to train this probe model to try to |
40:47 | predict what the real uh |
40:50 | optimal W hat was uh for the problem |
40:54 | that's being presented in the input from |
40:56 | the internal states of the Transformer |
40:58 | model so basically can I just look at |
41:01 | one of these hidden States and recover |
41:02 | with some predictor that's not powerful |
41:04 | enough to like solve the linear |
41:06 | regression problem on its own uh the |
41:08 | predictor that we think the model is |
41:10 | using uh to actually make predictions |
41:12 | um and so you can do this for different |
41:14 | kinds of intermediate quantities that |
41:16 | that you might want to look for here the |
41:18 | natural one is just this weight Vector |
41:20 | right we said before we're never |
41:21 | actually showing the model any weight |
41:23 | vectors we're never asking it to |
41:24 | generate a weight Vector but if you try |
41:26 | to probe for this weight Vector in its |
41:28 | intermediate States uh you find that you |
41:29 | can do that and you can actually do it |
41:31 | pretty well with linear readout right |
41:34 | near the end of the network which I |
41:36 | think is exactly what we would have |
41:37 | expected from the little construction |
41:38 | that we gave before |
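A sketch of what such a probe looks like, assuming numpy: a frozen model's hidden states are regressed onto the OLS weight vectors it is hypothesized to encode, and the probe's error is compared against the control task (details like train/test splitting are omitted here):

```python
import numpy as np

def fit_linear_probe(hidden_states, targets):
    """hidden_states: (N, d_hidden) frozen activations; targets: (N, d) the w_hat
    vectors the model is hypothesized to encode. Returns a linear readout matrix."""
    W_probe, *_ = np.linalg.lstsq(hidden_states, targets, rcond=None)
    return W_probe                                  # (d_hidden, d)

def probe_mse(hidden_states, targets, W_probe):
    """Mean squared error of the probe's reconstruction of the weight vectors."""
    return np.mean((hidden_states @ W_probe - targets) ** 2)
```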
41:40 | um as some sanity checks here right you |
41:42 | can try doing this on a Model that is |
41:45 | trained to perform some task that |
41:46 | requires you to look at all of the |
41:47 | inputs and outputs but is not linear |
41:50 | regression and for those what we're |
41:53 | calling control tasks |
41:55 | um no matter where you look in the |
41:57 | network you don't acquire the ability to |
42:00 | recover W as accurately as we could here |
42:02 | so some evidence that you know in the |
42:04 | course of making the predictions that we |
42:06 | were looking at before |
42:08 | um our weight vector actually gets |
42:10 | encoded by the models that are |
42:13 | making these predictions which is a cool |
42:14 | thing to find |
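The probing recipe itself is simple enough to sketch. In the snippet below, `hidden_states` and `w_targets` are random placeholders for quantities that would really be collected from the frozen transformer and from solving each prompt's regression problem exactly; only the probe-fitting and layer-by-layer evaluation logic is meant literally, and the ridge-regularized linear readout is one reasonable probe choice rather than necessarily the exact one used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical sizes: probing prompts, transformer layers, hidden size, regression dimension
num_prompts, num_layers, h, d = 2000, 12, 64, 8

# placeholders for the real data:
#   hidden_states[i, l] = layer-l hidden state for prompt i (e.g. at the query position)
#   w_targets[i]        = the optimal W hat for prompt i, computed directly from its x's and y's
hidden_states = rng.normal(size=(num_prompts, num_layers, h))
w_targets = rng.normal(size=(num_prompts, d))

def fit_linear_probe(H, W, lam=1e-3):
    """Ridge-regularized linear readout mapping hidden states H to probe targets W."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ W)

split = num_prompts // 2
for layer in range(num_layers):
    H_train, H_test = hidden_states[:split, layer], hidden_states[split:, layer]
    W_train, W_test = w_targets[:split], w_targets[split:]
    A = fit_linear_probe(H_train, W_train)
    resid = W_test - H_test @ A
    r2 = 1.0 - resid.var() / W_test.var()
    print(f"layer {layer:2d}: probe R^2 = {r2:.3f}")
```

Running the same readout on states from a model trained on one of the control tasks mentioned above gives the baseline that the weight-vector decoding is compared against.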
42:15 | um we can ask about other relevant |
42:16 | intermediate quantities right just like |
42:18 | the product of our X matrix and our |
42:20 | y matrix and if we do this the story is |
42:22 | a little muddier |
42:23 | um it's definitely not linearly encoded |
42:26 | maybe it's non-linearly encoded and the |
42:28 | evidence that it's non-linearly encoded |
42:30 | is the sort of gap between this control |
42:32 | task and the real task which is not as |
42:34 | big as the one we were looking at before |
42:36 | um but to the extent that there's |
42:37 | anything going on |
42:38 | um it's going on much earlier in the |
42:40 | network the model sort of computes |
42:42 | this around layer 7 or layer 8 and |
42:44 | then hangs on to it |
42:46 | um and you know again this is a little |
42:48 | bit more speculative I wouldn't take |
42:49 | this as uh like dispositive of models |
42:53 | computing this quantity but it is what |
42:55 | we would expect if it were implementing |
42:57 | one of these algorithms that we looked |
42:58 | at before because this is a sort of |
43:00 | First Step that you need to do in order |
43:01 | to compute the W later on |
43:05 | um good so to come back to all of these |
43:08 | questions uh that we were sort of asking |
43:11 | at the beginning of this talk uh right |
43:13 | what is in context learning and the sort |
43:14 | of main hypotheses that we want to |
43:16 | discriminate here are whether it's task |
43:18 | identification or quote unquote real |
43:20 | learning um and in the context of a very |
43:22 | very very simple real learning problem |
43:24 | uh we've shown that you know it's |
43:26 | possible at least in principle that |
43:27 | models are really doing it that you |
43:29 | can get smallish Transformers uh to |
43:32 | implement real learning algorithms and |
43:34 | we've presented both sort of Behavioral |
43:35 | evidence and at least preliminary |
43:38 | um uh representational evidence that |
43:41 | this is actually what's being |
43:42 | implemented by these models at least at |
43:44 | the scale that we're looking at |
43:45 | so in I guess the little bit of time |
43:47 | that I have left before I switch over to |
43:49 | questions |
43:50 | um one thing that's been really fun this |
43:51 | was obviously a question that was like |
43:52 | very much on people's mind uh I guess a |
43:57 | year ago last summer when we started |
43:58 | doing this project and there's been a |
44:00 | ton of work that's come out in this |
44:02 | space both kind of concurrently with |
44:04 | what we were doing here uh and |
44:05 | since this paper came out |
44:07 | um so you know some natural questions |
44:09 | that you might have at the end of |
44:11 | everything that I've been showing here |
44:12 | are whether these constructions we've |
44:15 | given are actually the right ones or |
44:16 | whether especially given that you know |
44:18 | we were getting pretty good results from |
44:20 | like two layer Transformers four layer |
44:22 | Transformers whether we can do these |
44:23 | things more efficiently whether we can |
44:25 | say anything uh you know sort of |
44:27 | theoretically about the conditions under |
44:28 | which we're actually going to recover |
44:30 | the real learning solution |
44:32 | um and how this relates to data |
44:35 | distributions both in these kinds of |
44:37 | synthetic models and in Real Models um |
44:39 | and what's nice is that we're starting |
44:40 | to get answers actually to all of these |
44:42 | questions so shortly after |
44:45 | um uh we did this uh there was a paper |
44:48 | actually also from another group at |
44:49 | Google that I guess had not been talking |
44:51 | to our Google collaborators uh showing a |
44:53 | very similar thing showing that you |
44:54 | could get standard Transformers uh to |
44:57 | implement in this case uh fitting these |
45:00 | linear regression problems via a |
45:03 | single step of gradient descent uh and |
45:06 | they do it in a very different way they |
45:07 | use the attention mechanism actually to |
45:09 | compute all of the dot products that you |
45:10 | need to compute and this allows them to |
45:12 | do it in a single layer with a single |
45:15 | attention head |
45:16 | um and so if you think back to those |
45:18 | kinds of experimental results we had |
45:19 | before right showing that in the single |
45:22 | layer regime we're looking more like a |
45:24 | one step of gradient descent model than |
45:26 | anything else here is now some evidence |
45:28 | or here is an example of a way in which |
45:31 | you can solve this problem by |
45:34 | parameterizing a Transformer that would |
45:36 | generate exactly that behavior for you |
45:38 | um there's one kind of important caveat |
45:39 | here which is that this is not actually |
45:41 | using the kind of real models that |
45:42 | people train in practice but something |
45:44 | with a slightly simpler and slightly |
45:46 | less expressive attention mechanism and |
45:49 | this might account for the little gap |
45:51 | that we were seeing between our SGD |
45:53 | predictors and the real model |
45:55 | behavior |
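The flavor of that construction can be checked numerically in a few lines. The sketch below uses illustrative dimensions and learning rate rather than the paper's exact parameterization: one step of gradient descent from w = 0 on the in-context squared loss predicts ŷ = η Σᵢ yᵢ (xᵢᵀ x_query), which is exactly what a single linear (softmax-free) self-attention head computes when its query is x_query, its keys are the xᵢ, and its values are η·yᵢ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 32            # input dimension, number of in-context examples
eta = 0.01              # gradient-descent step size (illustrative)

# in-context linear regression problem: y_i = w*^T x_i
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_query = rng.normal(size=d)

# one explicit step of gradient descent from w = 0 on 1/2 * sum_i (w^T x_i - y_i)^2;
# the gradient at w = 0 is -X^T y, so the updated weights are eta * X^T y
w_one_step = eta * X.T @ y
pred_gd = w_one_step @ x_query

# a single linear self-attention head: query = x_query, keys = x_i, values = eta * y_i
scores = X @ x_query                 # q^T k_i for every in-context example
pred_attn = scores @ (eta * y)       # sum_i (q^T k_i) * v_i

assert np.allclose(pred_gd, pred_attn)   # identical up to floating-point error
print(pred_gd, pred_attn)
```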
45:56 | um even cooler very recently some folks |
45:58 | at Berkeley I think Spencer is now at UC |
46:01 | Davis showed that not only can you |
46:03 | do this but you are under certain |
46:04 | conditions guaranteed to converge to |
46:08 | exactly this linear self-attention one |
46:10 | step of gradient descent solution so uh |
46:13 | you know we can say now a little bit |
46:15 | more precisely uh the conditions under |
46:17 | which in context learning really is real |
46:19 | learning and you can actually uh |
46:22 | guarantee this up front using this |
46:23 | linear self-attention model |
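As a toy empirical illustration of the kind of statement being made (not the paper's setting or its proof), one can train a single matrix-parameterized linear attention head by stochastic gradient descent on fresh in-context regression tasks and watch it drift toward a scaled identity, i.e. exactly the one-step-of-gradient-descent predictor with a learned step size. Everything below (dimensions, learning rate, number of steps) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 16
A = rng.normal(scale=0.01, size=(d, d))   # trainable key-query matrix of the linear attention head
lr = 1e-3

for step in range(20000):
    # a fresh in-context regression task every step
    w_star = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_star
    x_q = rng.normal(size=d)
    y_q = x_q @ w_star

    pred = x_q @ A @ (X.T @ y) / n                           # linear self-attention prediction
    grad = 2.0 * (pred - y_q) * np.outer(x_q, X.T @ y) / n   # gradient of the squared error w.r.t. A
    A -= lr * grad

print(np.round(A, 2))   # roughly c * identity: one step of GD with a learned step size c / n
```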
46:25 | um finally all of these interesting |
46:27 | questions about data sets |
46:30 | um so one and I think this is a very |
46:32 | recent paper but cool thing that |
46:34 | came out right |
46:35 | um you can imagine that if you only ever |
46:38 | trained your model |
46:41 | on outputs from a single weight vector |
46:44 | uh the right thing to do kind of no |
46:46 | matter what is to not do any in context |
46:47 | learning just memorize that weight |
46:49 | vector and use that weight vector to |
46:50 | predict whatever y's you're going to |
46:52 | see at test time |
46:54 | um and so there's a sort of question of |
46:55 | just how much diversity in that |
46:57 | training set you really need to see |
46:59 | before you switch over to being a |
47:03 | real learner as opposed to something |
47:04 | that's memorized a fixed family of weight |
47:06 | vectors |
47:07 | um and so now some empirical evidence |
47:08 | that you do get exactly that kind of |
47:10 | again phase transition as a function of |
47:13 | the diversity of the data set that for |
47:15 | a small number of |
47:17 | W's you memorize all the W's and for a |
47:19 | sufficiently large even finite number |
47:22 | of W's that you get to see over and over |
47:24 | again during training time eventually |
47:26 | you learn to do in context |
47:29 | learning instead I think this is |
47:31 | especially interesting because it sort |
47:32 | of goes against that |
47:34 | um you know min-Bayes-risk story that I |
47:35 | was talking about before when you're in |
47:37 | the regime of having a finite number of |
47:39 | uh W's that you've ever seen at training |
47:41 | time uh probably the right thing to do |
47:43 | is eventually you know at least sort of |
47:45 | in a Bayesian sense |
47:46 | um to just memorize that large finite |
47:48 | set of W's models don't do that and at |
47:51 | some point they learn the solution |
47:52 | that generalizes instead |
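The data-generation knob being varied in that experiment is easy to sketch. In the snippet below the sizes and the sequence encoding are made up for illustration and are not the actual setup from the paper: each training prompt is an in-context regression problem whose weight vector is either drawn from a small fixed pool, which rewards memorizing those W's, or sampled fresh every time, which forces a general-purpose in-context learner.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8, 16            # regression dimension, examples per prompt
num_tasks = 4           # size of the fixed pool of weight vectors ("task diversity")

# fixed pool of weight vectors seen over and over during training
w_pool = rng.normal(size=(num_tasks, d))

def make_prompt(diverse: bool):
    """Build one training prompt (x_1, y_1, ..., x_n, y_n) for the sequence model."""
    if diverse:
        w = rng.normal(size=d)                   # a fresh w for every prompt
    else:
        w = w_pool[rng.integers(num_tasks)]      # reuse one of the num_tasks fixed w's
    X = rng.normal(size=(n, d))
    y = X @ w
    # interleave x's and y's into one token-like sequence, padding scalar y's to dimension d
    y_tokens = np.concatenate([y[:, None], np.zeros((n, d - 1))], axis=1)
    return np.stack([X, y_tokens], axis=1).reshape(2 * n, d)

low_diversity_batch = [make_prompt(diverse=False) for _ in range(32)]
high_diversity_batch = [make_prompt(diverse=True) for _ in range(32)]
```

Sweeping `num_tasks` and retraining the sequence model on each setting is, roughly, the experiment that exhibits the memorization-to-in-context-learning transition described above.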
47:55 | um and finally we can start to ask these |
47:57 | kinds of questions about Real Models |
47:58 | thinking back to the very beginning of |
48:00 | the talk right we have this kind of back |
48:01 | and forth between is this just task |
48:03 | identification uh is this uh real |
48:06 | learning uh and now some empirical |
48:08 | evidence that in fact uh it's a |
48:10 | mixture of both and that you can by |
48:11 | changing the label distribution or uh |
48:14 | the kinds of instructions that you |
48:15 | provide up front uh actually induce |
48:18 | models to behave more like task |
48:19 | identifiers or more like uh in context |
48:22 | Learners uh again just by sort of |
48:24 | manipulating the inputs |
48:26 | um finally oh man everything moved |
48:28 | around on the slide but one thing that |
48:29 | we started to look at uh that's a sort |
48:31 | of next direction that I'm super excited |
48:32 | about |
48:33 | um is generalizing to more interesting |
48:36 | kinds of prediction problems right so we |
48:38 | can ask the model now to produce not just |
48:40 | a single categorical label for these things |
48:42 | but more structured kinds of outputs |
48:44 | that maybe start to look a little bit |
48:45 | more uh like the kind of real machine |
48:47 | translation examples and and other text |
48:49 | generation examples uh that we saw at |
48:51 | the beginning of this uh of this talk |
48:53 | um uh what we're looking at specifically |
48:55 | is in context learning of uh finite |
48:58 | automata and other sorts of formal |
48:59 | languages and here it turns out that at |
49:02 | least empirically right these models are |
49:03 | also very very good at doing this uh in |
49:05 | context you can get close to perfect |
49:07 | accuracy |
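To give a concrete picture of what an in-context formal-language problem can look like, here is a minimal sketch; the random DFA, the accept/reject framing, and the string encoding are illustrative choices rather than the actual experimental setup. The point is just that the learner never sees the automaton itself, only labeled strings, and has to label a fresh query string.

```python
import random

random.seed(0)

# a small random DFA over the alphabet {a, b}
num_states, alphabet = 4, "ab"
transition = {(s, c): random.randrange(num_states)
              for s in range(num_states) for c in alphabet}
accepting = {s for s in range(num_states) if random.random() < 0.5}

def accepts(string: str) -> bool:
    state = 0
    for c in string:
        state = transition[(state, c)]
    return state in accepting

def random_string(max_len: int = 6) -> str:
    return "".join(random.choice(alphabet) for _ in range(random.randint(1, max_len)))

# build an in-context prompt: labeled example strings, then an unlabeled query
examples = [(s, accepts(s)) for s in (random_string() for _ in range(8))]
query = random_string()
prompt = "\n".join(f"{s} -> {'accept' if label else 'reject'}" for s, label in examples)
prompt += f"\n{query} ->"
print(prompt)   # the model is asked to continue with "accept" or "reject"
```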
49:08 | um and interestingly this seems to be a |
49:10 | task that separates uh these |
49:13 | Transformers the sort of current |
49:14 | architecture from a lot of the other uh |
49:18 | work that has been done in recent years |
49:19 | uh trying to propose some new kinds |
49:23 | of models that are easier to |
49:24 | train uh have lower computational |
49:26 | budgets things like that |
49:28 | um and so |
49:30 | ICL right this in context learning |
49:33 | thing seems to be not only a sort of |
49:35 | surprising property of these |
49:37 | models but also not something that you |
49:39 | necessarily get for free at scale in any |
49:41 | sufficiently large neural sequence |
49:43 | predictor uh there is something special |
49:45 | about Transformers and so the question |
49:46 | is what is that thing uh and |
49:49 | ultimately if we can answer that |
49:50 | question uh we will hopefully know what |
49:53 | we need to do to figure out what the |
49:54 | next generation of model architectures |
49:56 | looks like |
49:57 | um with that I will wrap up uh you know |
49:59 | as always most of the credit goes to |
50:01 | Ekin and the rest equally |
50:03 | distributed among our collaborators this |
50:05 | was a super fun collaboration |
50:06 | um and happy to answer any questions |