The following is an AI-generated summary and article based on a transcript of the video "Why AI doesn't speak every language". Due to the limitations of AI, please verify the accuracy of the content.
00:00 | Okay. |
---|---|
00:01 | I want you to check this out |
00:02 | because it represents a big challenge |
00:05 | for large language models like GPT-3... |
00:09 | and now GPT-4. |
00:11 | But it is not code. |
00:14 | It is a list that countries around the world |
00:17 | are grappling with. |
00:20 | Before we get into the problems |
00:22 | with large language models, let's review at a basic level... |
00:25 | how they work. |
00:27 | You've probably heard of ChatGPT. |
00:30 | It's not really a model |
00:31 | but an app that sits on top of a large language model. |
00:34 | In this case, a version of GPT. |
00:38 | One thing models like ChatGPT do |
00:40 | is natural language processing. |
00:43 | It's used in everything |
00:44 | from telephone customer service to auto completion. |
00:47 | GPT is like a used bookstore. |
00:50 | It's never left the room |
00:52 | and has only learned from the books that are in the store. |
00:56 | Essentially, these large language models... |
00:59 | they scan a lot of text and try to learn a language. |
01:03 | They can check their process by covering up the answers... |
01:06 | and then seeing if they got it right. |
01:08 | Then they can use that knowledge to recognize sentiment... |
01:12 | summarize, translate |
01:14 | and generate responses or recommendations |
01:17 | based on the analyzed data. |
01:19 | And yes, ChatGPT wrote that last line. |
01:22 | You've got to do that in videos like this. |
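The "covering up the answers and checking" idea described above is, at heart, self-supervised next-word prediction. As a rough illustration only — not how GPT actually works internally — here is a tiny Python sketch with a made-up three-sentence corpus: it counts which word tends to follow which, then hides the last word of a held-out sentence and checks its own guess.

```python
# Toy illustration (not how GPT actually works): learn next-word statistics
# from a tiny made-up corpus, then hide the last word of a held-out sentence
# and check whether the guess was right.
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count which word follows each word (a bigram "language model").
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

# "Cover up the answer and check": hide the last word, predict it, compare.
held_out = "the dog chased the cat"
*context, answer = held_out.split()
guess = predict_next(context[-1])
print(f"context: {' '.join(context)!r}  guess: {guess!r}  answer: {answer!r}")
print("correct" if guess == answer else "wrong -- the model needs more books")
```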
01:24 | This is an amazing ability |
01:26 | but that's because it's read a lot of stuff. |
01:28 | You can ask ChatGPT |
01:30 | to rephrase something in the style of Shakespeare |
01:32 | but that's because it's read all the Shakespeare |
01:34 | and that is where my waste of paper comes in. |
01:37 | Actually, I want to show you something. |
01:43 | Yeah? Good? |
01:44 | Okay. |
01:45 | So this is a printout of Common Crawl |
01:48 | from 2008 to the present. |
01:51 | Common Crawl basically means that they go over all the websites |
01:55 | and index them. |
01:56 | And on this list, they put every language |
01:58 | that they think that they've indexed. |
02:01 | Here, you notice right away all the English. |
02:04 | Every crawl is like more than 40% just English. |
02:08 | German: DEU. |
02:10 | See the indexes here... |
02:11 | you know, it's about 6% every time |
02:13 | which doesn't sound like a lot, but it's kind of a lot. |
02:18 | But look here, 2023. |
02:20 | FIN: Finnish. A lot of pages. |
02:24 | But it's just 0.4% of the entire scan. |
02:28 | This bookstore, it's got an inventory problem. |
02:31 | All of the focus is on only a very small set of languages. |
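For concreteness, here is a hedged Python sketch that hard-codes the rough per-crawl shares quoted above as sample data and prints them as a crude bar chart. The exact figures vary by crawl; a real analysis would read Common Crawl's published per-crawl language statistics instead of these approximations.

```python
# Hedged sketch: the rough per-crawl shares quoted in the video, hard-coded
# as sample data. A real analysis would read Common Crawl's published
# language statistics rather than these approximations.
crawl_language_share = {
    "ENG": 43.0,  # English: consistently more than 40% of each crawl
    "DEU": 6.0,   # German: about 6% every time
    "FIN": 0.4,   # Finnish: lots of pages, but a tiny share of the whole scan
}

for code, pct in sorted(crawl_language_share.items(), key=lambda kv: -kv[1]):
    bar = "#" * max(1, round(pct))  # at least one mark so tiny shares stay visible
    print(f"{code}: {pct:5.1f}%  {bar}")
```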
02:36 | There was a paper that stated that of the 7,000 languages spoken globally |
02:40 | about 20 of those languages |
02:42 | make up the bulk of NLP research. |
02:44 | Okay, so let's back up a bit. |
02:46 | This is Ruth-Ann Armstrong. |
02:48 | She's a researcher who I interviewed and she's doing something |
02:51 | that a lot of researchers are trying to do... |
02:54 | make new data sets. |
02:56 | Those 20 languages fall into a category called |
02:58 | high-resource languages |
03:00 | and the others fall into a category |
03:02 | called low-resource languages. |
03:03 | Those low-resource languages |
03:05 | don't show up on the Internet as text as much |
03:07 | which means they don't make it into language datasets. |
03:11 | They become unintelligible to the AI. |
03:14 | Imagine our used bookstore again. |
03:16 | It has a ton of Dan Brown books |
03:18 | or James Patterson or Anne Tyler. |
03:20 | This is like English and German and Chinese. |
03:23 | The high-resource languages. |
03:25 | Then there are the rare books. |
03:27 | These are the low-resource languages. |
03:30 | So, many models just don't know as much about them... |
03:33 | or don't have anything at all. |
03:35 | I’m someone from Jamaica. |
03:37 | The language primarily spoken in Jamaica is English... |
03:40 | but we also speak a Creole language called Jamaican patois. |
03:43 | Armstrong and her coauthors wanted to create a dataset |
03:46 | that can explain this widely spoken language |
03:49 | but they weren't trying to generate text like ChatGPT. |
03:52 | Instead, they wanted their model to understand it. |
03:55 | In this case, to do that, Armstrong went through |
03:57 | a bunch of examples of Jamaican patois and lined them up. |
04:00 | Two columns. |
04:01 | And she labeled whether the statements |
04:03 | entailed or agreed, contradicted, or were neutral. |
04:08 | You can try it in this one. |
04:09 | A has a fever. |
04:11 | B has a high temperature. |
04:14 | So it’s entailment. |
04:15 | They agree. |
04:16 | Try this one: Entailment or contradiction? |
04:20 | Contradiction. |
04:22 | One more. |
04:24 | Neutral. |
04:25 | The two statements don't really relate. |
04:27 | She did that for almost 650 examples. |
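The labeling scheme described above — pairs of sentences tagged as entailment, contradiction, or neutral — is the standard natural language inference (NLI) format. Below is a hedged sketch of what such examples might look like as a data structure; the field names and sentences are illustrative and not taken from Armstrong's actual dataset.

```python
# Hedged sketch of NLI-style examples; field names and sentences are
# illustrative only, not drawn from the real dataset described in the video.
nli_examples = [
    {
        "premise": "The patient has a fever.",
        "hypothesis": "The patient has a high temperature.",
        "label": "entailment",     # the premise supports the hypothesis
    },
    {
        "premise": "The shop is open all day.",
        "hypothesis": "The shop is closed today.",
        "label": "contradiction",  # the two statements cannot both be true
    },
    {
        "premise": "She walked to the market.",
        "hypothesis": "It rained this morning.",
        "label": "neutral",        # the two statements don't really relate
    },
]

# Models trained on pairs like these learn to *understand* a language
# (classify relationships between sentences) rather than to generate text.
for ex in nli_examples:
    print(f'{ex["label"]:<13} {ex["premise"]!r}  /  {ex["hypothesis"]!r}')
```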
04:31 | You can probably see that this was a ton of work. |
04:34 | And Jamaican patois is not on my big list of |
04:37 | Common Crawl languages. |
04:40 | I also talked to some Catalan researchers... |
04:43 | who are trying to evaluate how well these big language models |
04:46 | do on stuff like Catalan. |
04:49 | It is the most spoken language in this autonomous community of Spain. |
04:53 | In GPT-3, the percentage of English words is 92%. |
04:59 | For German, it's 1.4% of the words. |
05:03 | Spanish appears at 0.7%. |
05:07 | And finally, Catalan. |
05:09 | The amount of Catalan words in the whole training set is 0.01%. |
05:15 | And it still performs very, very well. |
05:17 | So the problem here is a little bit different, right? |
05:20 | They've got some Catalan in the dataset. |
05:23 | Common Crawl says Catalan is 0.2335% of their survey. |
05:28 | Not a lot, but some. |
05:30 | But the big company models like GPT-3, and presumably GPT-4 in the future, |
05:34 | have proven to do pretty well on little data. |
05:38 | For example, the research team got GPT-3 |
05:40 | to generate 3 Catalan sentences. |
05:43 | And then they mixed them up with real sentences. |
05:46 | Three native speakers then evaluated them. |
05:48 | So, that was our test. |
05:50 | And their results were very good for the machine. |
05:54 | But there is still a catch. |
05:55 | It performs reasonably well. |
05:57 | But it's worth it to build a language-specific model |
06:01 | that has been specifically trained and evaluated for that language. |
06:05 | So the problem here isn't performance |
06:07 | it’s transparency and it's the amount of data. |
06:10 | I mean, Common Crawl says that they indexed... |
06:13 | millions of examples of Catalan words. |
06:16 | But GPT-3 says that they only read |
06:19 | about 140 pages of Catalan. |
06:22 | Imagine like a novella. |
06:23 | It's a problem being dependent |
06:26 | on the performance or even the goodwill |
06:29 | of a few institutions or a few companies. |
06:32 | You can easily imagine a world where one of these companies |
06:35 | just cuts out Catalan. |
06:38 | The same way Catalan News complained Google was cutting out |
06:42 | Catalan links in searches. |
06:45 | Common Crawl is just a portion of what |
06:48 | GPT-3 was trained on. |
06:49 | We don't know the details about GPT-4. |
06:52 | And that means a lot of other stuff |
06:54 | went into this language model that we just don't know about. |
06:57 | Right now all these bookstores are actually run by |
07:00 | Meta or Microsoft or Baidu or OpenAI or Google. |
07:03 | They decide which books go in there |
07:06 | and don't tell anyone where they came from or who wrote them. |
07:09 | Some people are trying to build a library |
07:12 | next to the bookstore. |
07:15 | This is Paris, where the French have a supercomputer |
07:18 | that wasn't being used a lot. |
07:20 | It’s like almost down the road |
07:21 | and I was discussing with the people... |
07:23 | who built it and they're like, “Nobody uses this GPU.” |
07:26 | Basically, what can we do? |
07:28 | Thomas Wolf is a co-founder of Hugging Face |
07:30 | which is like a hub for AI research on the Internet |
07:34 | and they ended up working on BigScience's BLOOM... |
07:37 | a project to create an open-source multilingual model. |
07:41 | And the more we thought about it, |
07:42 | the more we thought it would be a lot better, in fact, if we trained it |
07:45 | in a lot of other languages, not just English, |
07:47 | and if we tried to involve many people. |
07:49 | And so it started from a small Hugging Face project |
07:52 | and became a very big collaboration, |
07:55 | where we tried to open this to everyone. |
07:57 | They basically went down the Wikipedia list |
07:59 | of most spoken languages and covered those. |
08:01 | But also added low-resource languages when possible. |
08:04 | So we have very, very low-resource languages there, |
08:07 | mostly African languages. |
08:09 | And so, to gather the data there, |
08:11 | what we decided was to partner |
08:13 | as much as possible with local communities |
08:15 | and ask them basically what they thought were good data |
08:18 | and how we could get it. |
08:20 | As importantly, we know where the data comes from |
08:22 | and how it was obtained. |
08:24 | That's the difference with open source. |
08:27 | You know the books in the library. |
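Because BLOOM is open source, anyone can download and inspect it. Here is a minimal sketch of loading one of the smaller published checkpoints (bigscience/bloom-560m) and generating a few tokens, assuming the Hugging Face transformers library and PyTorch are installed; the full-size BLOOM model is far too large to run this casually, and the Catalan prompt is just an example.

```python
# Hedged sketch: assumes `transformers` and PyTorch are installed.
# bigscience/bloom-560m is one of the smaller published BLOOM checkpoints;
# the full model is hundreds of gigabytes and needs dedicated hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A multilingual prompt -- the point of BLOOM is that it was trained on
# many languages, with documented data sources.
prompt = "La intel·ligència artificial"  # Catalan: "Artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```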
08:30 | All right, let me find English. |
08:32 | Okay, so let's be honest: as an English speaker... |
08:35 | I'm kind of the target audience |
08:37 | for these big companies and these big models. |
08:40 | English represents |
08:41 | more than 40% of the Common Crawl |
08:44 | but there are reasons for even the target audience |
08:46 | to want all languages to be well represented. |
08:51 | I am an English speaker but I have my Jamaican accent |
08:54 | and I remember that... |
08:56 | initially like when Siri came out, I had a harder time using it |
09:01 | because it couldn't understand my accent. |
09:04 | So expanding even the training dataset |
09:07 | for voice assistants to include... |
09:09 | more accents has been helpful. |
09:12 | So imagine what would happen |
09:13 | if we tried to expand another piece of that. |
09:16 | We're building technologies for more languages as well. |
09:19 | So if you want to have these models everywhere |
09:21 | you need to be able to trust them. |
09:23 | So if you trust Microsoft, that's fine. |
09:25 | But if you don't trust them... |
09:27 | yeah. |
09:28 | It's our language. |
09:30 | So we speak— we are Catalan speakers. |
09:33 | And it's not only an issue for a small language |
09:35 | or a moderately small language, |
09:37 | because you may have languages that have a... |
09:41 | a sizable amount of speakers in the real world |
09:44 | but that have very, very little digital footprint. |
09:46 | So they are bound to just... |
09:49 | disappear. |