Transcript of YouTube Video: Why AI doesn't speak every language


Video Transcript
00:00

Okay.

00:01

I want you to check this out

00:02

because it represents a big challenge

00:05

for large language models like GPT-3...

00:09

and now GPT-4.

00:11

But it is not code.

00:14

It is a list that countries around the world

00:17

are grappling with.

00:20

Before we get into the problems

00:22

with large language models, let's review at a basic level...

00:25

how they work.

00:27

You've probably heard of ChatGPT.

00:30

It's not really a model

00:31

but an app that sits on top of a large language model.

00:34

In this case, a version of GPT.

00:38

One thing models like ChatGPT do

00:40

is natural language processing.

00:43

It's used in everything

00:44

from telephone customer service to auto completion.

00:47

GPT is like a used bookstore.

00:50

It's never left the room

00:52

and only learned through the books that are at the store.

00:56

Essentially, these large language models...

00:59

they scan a lot of text and try to learn a language.

01:03

They can check their process by covering up the answers...

01:06

and then seeing if they got it right.
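That cover-up-and-check step is, roughly, what's known as masked language modeling. Here is a minimal Python sketch of the loop; the stand-in "model" is deliberately toy, and all the names are illustrative rather than from any real library:

```python
import random
from collections import Counter

def mask_and_check(sentence, predict):
    """Hide one word, ask a model to guess it, and score the guess.
    This is the self-check described above: the text itself supplies
    the answer key, so no human labeling is needed."""
    words = sentence.split()
    i = random.randrange(len(words))
    answer = words[i]
    masked = words[:i] + ["[MASK]"] + words[i + 1:]
    guess = predict(masked, i)
    return guess == answer

# A deliberately silly stand-in "model": it always guesses the word it
# has seen most often in its "bookstore". (A real LLM predicts from
# context with a neural network; this toy only shows the check loop.)
corpus = "the cat sat on the mat".split()
most_common = Counter(corpus).most_common(1)[0][0]

def toy_predict(masked_words, position):
    return most_common
```

Run at scale over billions of sentences, the score from checks like this is what tells the training process whether the model is getting better.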

01:08

Then they can use that knowledge to recognize sentiment...

01:12

summarize, translate

01:14

and generate responses or recommendations

01:17

based on the analyzed data.

01:19

And yes, ChatGPT wrote that last line.

01:22

You've got to do that in videos like this.

01:24

This is an amazing ability

01:26

but that's because it's read a lot of stuff.

01:28

You can ask ChatGPT

01:30

to rephrase something in the style of Shakespeare

01:32

but that's because it's read all the Shakespeare

01:34

and that is where my waste of paper comes in.

01:37

Actually, I want to show you something.

01:43

Yeah? Good?

01:44

Okay.

01:45

So this is a print out of Common Crawl

01:48

from 2008 to the present.

01:51

Common Crawl basically means that they go over all the websites

01:55

and index them.

01:56

And on this list, they put every language

01:58

that they think that they've indexed.

02:01

Here, you notice right away all the English.

02:04

Every crawl is like more than 40% just English.

02:08

German: DEU.

02:10

See the indexes here...

02:11

you know, it's about 6% every time

02:13

which doesn't sound like a lot, but it's kind of a lot.

02:18

But look here, 2023.

02:20

FIN: Finnish. Lot of pages.

02:24

But it's just 0.4% of the entire scan.
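The percentages being read off the printout are just per-language page counts divided by the crawl total. A small sketch of that arithmetic, using made-up counts chosen only so the shares land near the figures quoted in the video (these are not real Common Crawl numbers):

```python
# Hypothetical page counts for one crawl snapshot; chosen so the shares
# roughly match the video (English >40%, German ~6%, Finnish ~0.4%).
pages = {"eng": 4600, "deu": 600, "fin": 40, "other": 4760}

total = sum(pages.values())
shares = {lang: 100 * n / total for lang, n in pages.items()}

# Print languages from largest share to smallest.
for lang, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {pct:.2f}%")
```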

02:28

This bookstore, it's got an inventory problem.

02:31

All of the focus is on only a very small set of languages.

02:36

There was a paper that stated that of the 7,000 languages spoken globally

02:40

about 20 of those languages

02:42

make up the bulk of NLP research.

02:44

Okay, so let's back up a bit.

02:46

This is Ruth-Ann Armstrong.

02:48

She's a researcher who I interviewed and she's doing something

02:51

that a lot of researchers are trying to do...

02:54

make new data sets.

02:56

Those 20 languages fall into a category called

02:58

high-resource languages

03:00

and the others fall into a category

03:02

called low-resource languages.

03:03

Those low-resource languages

03:05

don't show up on the Internet as text as much

03:07

which means they don't make it into language datasets.

03:11

They become unintelligible to the AI.

03:14

Imagine our used bookstore again.

03:16

It has a ton of Dan Brown books

03:18

or James Patterson or Anne Tyler.

03:20

This is like English and German and Chinese.

03:23

The high-resource languages.

03:25

Then there are the rare books.

03:27

These are the low-resource languages.

03:30

So, many models just don't know as much about them...

03:33

or don't have anything on them at all.

03:35

I’m someone from Jamaica.

03:37

The language primarily spoken in Jamaica is English...

03:40

but we also speak a Creole language called Jamaican patois.

03:43

Armstrong and her coauthors wanted to create a dataset

03:46

that covers this widely spoken language

03:49

but they weren't trying to generate texts like ChatGPT.

03:52

Instead, they wanted their model to understand it.

03:55

In this case, to do that, Armstrong went through

03:57

a bunch of examples of Jamaican patois and lined them up.

04:00

Two columns.

04:01

And she labeled whether the statements

04:03

entailed or agreed, contradicted, or were neutral.

04:08

You can try it in this one.

04:09

A has a fever.

04:11

B has a high temperature.

04:14

So it’s entailment.

04:15

They agree.

04:16

Try this one: Entailment or contradiction?

04:20

Contradiction.

04:22

One more.

04:24

Neutral.

04:25

The two statements don't really relate.

04:27

She did that for almost 650 examples.
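The two-column, three-label structure Armstrong used is the standard shape of a natural language inference (NLI) dataset. A sketch of that structure in Python; the class and field names are illustrative, and the rows paraphrase the video's walkthrough rather than quoting the actual dataset:

```python
from dataclasses import dataclass

# The three relations described above.
LABELS = ("entailment", "contradiction", "neutral")

@dataclass
class NLIExample:
    premise: str      # first column: statement A
    hypothesis: str   # second column: statement B
    label: str        # how B relates to A

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")

# Rows paraphrasing the video's examples (English glosses only).
rows = [
    NLIExample("A has a fever.", "A has a high temperature.", "entailment"),
    NLIExample("The shop is open.", "The shop is closed.", "contradiction"),
    NLIExample("It rained this morning.", "The bus was late.", "neutral"),
]
```

Each hand-labeled row like these is one training or evaluation example, which is why building even a 650-example dataset by hand is so labor-intensive.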

04:31

You can probably see that this was a ton of work.

04:34

And Jamaican patois is not on my big list of

04:37

Common Crawl languages.

04:40

I also talked to some Catalan researchers...

04:43

who are trying to evaluate how well these big language models

04:46

do on stuff like Catalan.

04:49

It is the most spoken language in this autonomous community of Spain.

04:53

In GPT-3, the percentage of English words is 92%.

04:59

For German, it's 1.4% of the words.

05:03

Spanish appears in 0.7%.

05:07

And finally, Catalan.

05:09

The amount of Catalan words in the whole training set is 0.01%.

05:15

And it still performs very, very well.

05:17

So the problem here is a little bit different, right?

05:20

They've got some Catalan in the dataset.

05:23

Common Crawl says Catalan is 0.2335% of their survey.

05:28

Not a lot, but some.

05:30

The big company models like GPT-3, and presumably GPT-4 in the future,

05:34

have proven to do pretty well on little data.

05:38

For example, the research team got GPT-3

05:40

to generate 3 Catalan sentences.

05:43

And then they mixed them up with real sentences.

05:46

Three native speakers then evaluated them.

05:48

So, that was our test.

05:50

And their results were very good for the machine.
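The evaluation the researchers describe can be sketched as a simple blind test: shuffle machine-generated sentences in with real ones and score how often a judge tells them apart. The function below is a hypothetical sketch of that protocol, not the team's actual code:

```python
import random

def blind_eval(real, generated, judge, seed=0):
    """Mix generated sentences with real ones, shuffle them, and return
    the fraction the judge classifies correctly. A score near 0.5 means
    the generated text is hard to tell apart from the real thing."""
    items = [(s, "real") for s in real] + [(s, "generated") for s in generated]
    rng = random.Random(seed)  # fixed seed so the shuffle is repeatable
    rng.shuffle(items)
    hits = sum(1 for sentence, truth in items if judge(sentence) == truth)
    return hits / len(items)
```

A judge who just guesses "real" for everything scores 0.5 on a balanced mix, so only accuracy well above that shows the machine text is detectable.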

05:54

But there is still a catch.

05:55

It performs reasonably well.

05:57

But it's worth it to build a language-specific model

06:01

that has been specifically trained and evaluated for that language.

06:05

So the problem here isn't performance

06:07

it’s transparency and it's the amount of data.

06:10

I mean, Common Crawl says that they indexed...

06:13

millions of examples of Catalan words.

06:16

But GPT-3 says that they only read

06:19

about 140 pages of Catalan.

06:22

Imagine like a novella.

06:23

It's a problem being dependent

06:26

on the performance or even the goodwill

06:29

of a few institutions or a few companies.

06:32

You can easily imagine a world where one of these companies

06:35

just cuts out Catalan.

06:38

The same way Catalan News complained Google was cutting out

06:42

Catalan links in searches.

06:45

Common Crawl is just a percent of what

06:48

GPT-3 was trained on.

06:49

We don't know the details about GPT-4.

06:52

And that means a lot of other stuff

06:54

went into this language model that we just don't know about.

06:57

Right now all these bookstores are actually run by

07:00

Meta or Microsoft or Baidu or OpenAI or Google.

07:03

They decide which books go in there

07:06

and don't tell anyone where they came from or who wrote them.

07:09

Some people are trying to build a library

07:12

next to the bookstore.

07:15

This is Paris, where the French have a supercomputer

07:18

that wasn't being used a lot.

07:20

It’s like almost down the road

07:21

and I was discussing with the people...

07:23

who built it and they're like, “Nobody uses this GPU.”

07:26

Basically, what can we do?

07:28

Thomas Wolf is a co-founder of Hugging Face

07:30

which is like a hub for AI research on the Internet

07:34

and they ended up working on BigScience's BLOOM...

07:37

a project to create an open-source multilingual model.

07:41

And the more we thought about it

07:42

we thought it's also a lot better, in fact, that we trained it

07:45

in a lot of other languages, not just English.

07:47

And if we try to involve many people

07:49

and so it started from a small Hugging Face project

07:52

to become a very big collaboration.

07:55

Where we tried to open this to everyone.

07:57

They basically went down the Wikipedia list

07:59

of most spoken languages and covered those.

08:01

But also added low-resource languages when possible.

08:04

So we have very, very low-resource languages there.

08:07

Mostly in African languages.

08:09

And so here to gather the data there

08:11

what we decided was to partner

08:13

as much as possible with local communities

08:15

and ask them basically what they thought were good data

08:18

and how we could get it.

08:20

As importantly, we know where the data comes from

08:22

and how it was obtained.

08:24

That's the difference in open-source.

08:27

You know the books in the library.

08:30

All right, let me find English.

08:32

Okay, so let's be honest, as an English speaker...

08:35

I'm kind of the target audience

08:37

for these big companies in these big models.

08:40

English represents

08:41

more than 40% of the Common Crawl

08:44

but there are reasons for even the target audience

08:46

to want all languages to be well represented.

08:51

I am an English speaker but I have my Jamaican accent

08:54

and I remember that...

08:56

initially like when Siri came out, I had a harder time using it

09:01

because it couldn't understand my accent.

09:04

So expanding even the training dataset

09:07

for voice assistants to include...

09:09

more accents has been helpful.

09:12

So imagine what would happen

09:13

if we tried to expand another piece of that.

09:16

We're building technologies for more languages as well.

09:19

So if you want to have this model everywhere

09:21

you need to be able to trust them.

09:23

So if you trust Microsoft, that's fine.

09:25

But if you don't trust them...

09:27

yeah.

09:28

It's our language.

09:30

So we speak— we are Catalan speakers.

09:33

And it's not only a question of a small language,

09:35

or of a moderately small language,

09:37

because you may have languages that have a...

09:41

a sizable number of speakers in the real world

09:44

but that have very, very little digital footprint.

09:46

So they are bound to just...

09:49

disappear.