Transcript of YouTube Video: Why AI doesn't speak every language


Video Transcript
00:00

Okay.

00:01

I want you to check this out

00:02

because it represents a big challenge

00:05

for large language models like GPT-3...

00:09

and now GPT-4.

00:11

But it is not code.

00:14

It is a list that countries around the world

00:17

are grappling with.

00:20

Before we get into the problems

00:22

with large language models, let's review at a basic level...

00:25

how they work.

00:27

You've probably heard of ChatGPT.

00:30

It's not really a model

00:31

but an app that sits on top of a large language model.

00:34

In this case, a version of GPT.

00:38

One thing models like ChatGPT do

00:40

is natural language processing.

00:43

It's used in everything

00:44

from telephone customer service to auto completion.

00:47

GPT is like a used bookstore.

00:50

It's never left the room

00:52

and only learned through the books that are at the store.

00:56

Essentially, these large language models...

00:59

they scan a lot of text and try to learn a language.

01:03

They can check their process by covering up the answers...

01:06

and then seeing if they got it right.
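That cover-up-and-check step is, roughly, what's known as masked language modeling. Here is a minimal Python sketch of the loop; the stand-in "model" is deliberately toy, and all the names are illustrative rather than from any real library:

```python
import random
from collections import Counter

def mask_and_check(sentence, predict):
    """Hide one word, ask a model to guess it, and score the guess.
    This is the self-check described above: the text itself supplies
    the answer key, so no human labeling is needed."""
    words = sentence.split()
    i = random.randrange(len(words))
    answer = words[i]
    masked = words[:i] + ["[MASK]"] + words[i + 1:]
    guess = predict(masked, i)
    return guess == answer

# A deliberately silly stand-in "model": it always guesses the word it
# has seen most often in its "bookstore". (A real LLM predicts from
# context with a neural network; this toy only shows the check loop.)
corpus = "the cat sat on the mat".split()
most_common = Counter(corpus).most_common(1)[0][0]

def toy_predict(masked_words, position):
    return most_common
```

Run at scale over billions of sentences, the score from checks like this is what tells the training process whether the model is getting better.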

01:08

Then they can use that knowledge to recognize sentiment...

01:12

summarize, translate

01:14

and generate responses or recommendations

01:17

based on the analyzed data.

01:19

And yes, ChatGPT wrote that last line.

01:22

You've got to do that in videos like this.

01:24

This is an amazing ability

01:26

but that's because it's read a lot of stuff.

01:28

You can ask ChatGPT

01:30

to rephrase something in the style of Shakespeare

01:32

but that's because it's read all the Shakespeare

01:34

and that is where my waste of paper comes in.

01:37

Actually, I want to show you something.

01:43

Yeah? Good?

01:44

Okay.

01:45

So this is a print out of Common Crawl

01:48

from 2008 to the present.

01:51

Common Crawl basically means that they go over all the websites

01:55

and index them.

01:56

And on this list, they put every language

01:58

that they think that they've indexed.

02:01

Here, you notice right away all the English.

02:04

Every crawl is like more than 40% just English.

02:08

German: DEU.

02:10

See the indexes here...

02:11

you know, it's about 6% every time

02:13

which doesn't sound like a lot, but it's kind of a lot.

02:18

But look here, 2023.

02:20

FIN: Finnish. Lot of pages.

02:24

But it's just 0.4% of the entire scan.
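The percentages being read off the printout are just per-language page counts divided by the crawl total. A small sketch of that arithmetic, using made-up counts chosen only so the shares land near the figures quoted in the video (these are not real Common Crawl numbers):

```python
# Hypothetical page counts for one crawl snapshot; chosen so the shares
# roughly match the video (English >40%, German ~6%, Finnish ~0.4%).
pages = {"eng": 4600, "deu": 600, "fin": 40, "other": 4760}

total = sum(pages.values())
shares = {lang: 100 * n / total for lang, n in pages.items()}

# Print languages from largest share to smallest.
for lang, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {pct:.2f}%")
```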

02:28

This bookstore, it's got an inventory problem.

02:31

All of the focus is on only a very small set of languages.

02:36

There was a paper that stated that of the 7,000 languages spoken globally

02:40

about 20 of those languages

02:42

make up the bulk of NLP research.

02:44

Okay, so let's back up a bit.

02:46

This is Ruth-Ann Armstrong.

02:48

She's a researcher who I interviewed and she's doing something

02:51

that a lot of researchers are trying to do...

02:54

make new data sets.

02:56

Those 20 languages fall into a category called

02:58

high-resource languages

03:00

and the others fall into a category

03:02

called low-resource languages.

03:03

Those low-resource languages

03:05

don't show up on the Internet as text as much

03:07

which means they don't make it into language datasets.

03:11

They become unintelligible to the AI.

03:14

Imagine our used bookstore again.

03:16

It has a ton of Dan Brown books

03:18

or James Patterson or Anne Tyler.

03:20

This is like English and German and Chinese.

03:23

The high-resource languages.

03:25

Then there are the rare books.

03:27

These are the low-resource languages.

03:30

So, many models just don't know as much about them...

03:33

or don't have anything on them at all.

03:35

I’m someone from Jamaica.

03:37

The language primarily spoken in Jamaica is English...

03:40

but we also speak a Creole language called Jamaican patois.

03:43

Armstrong and her coauthors wanted to create a dataset

03:46

that covers this widely spoken language

03:49

but they weren't trying to generate texts like ChatGPT.

03:52

Instead, they wanted their model to understand it.

03:55

In this case, to do that, Armstrong went through

03:57

a bunch of examples of Jamaican patois and lined them up.

04:00

Two columns.

04:01

And she labeled whether the statements

04:03

entailed or agreed, contradicted, or were neutral.

04:08

You can try it in this one.

04:09

A has a fever.

04:11

B has a high temperature.

04:14

So it’s entailment.

04:15

They agree.

04:16

Try this one: Entailment or contradiction?

04:20

Contradiction.

04:22

One more.

04:24

Neutral.

04:25

The two statements don't really relate.

04:27

She did that for almost 650 examples.
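The two-column, three-label structure Armstrong used is the standard shape of a natural language inference (NLI) dataset. A sketch of that structure in Python; the class and field names are illustrative, and the rows paraphrase the video's walkthrough rather than quoting the actual dataset:

```python
from dataclasses import dataclass

# The three relations described above.
LABELS = ("entailment", "contradiction", "neutral")

@dataclass
class NLIExample:
    premise: str      # first column: statement A
    hypothesis: str   # second column: statement B
    label: str        # how B relates to A

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")

# Rows paraphrasing the video's examples (English glosses only).
rows = [
    NLIExample("A has a fever.", "A has a high temperature.", "entailment"),
    NLIExample("The shop is open.", "The shop is closed.", "contradiction"),
    NLIExample("It rained this morning.", "The bus was late.", "neutral"),
]
```

Each hand-labeled row like these is one training or evaluation example, which is why building even a 650-example dataset by hand is so labor-intensive.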

04:31

You can probably see that this was a ton of work.

04:34

And Jamaican patois is not on my big list of

04:37

Common Crawl languages.

04:40

I also talked to some Catalan researchers...

04:43

who are trying to evaluate how well these big language models

04:46

do on stuff like Catalan.

04:49

It is the most spoken language in this autonomous community of Spain.

04:53

In GPT-3, the percentage of English words is 92%.

04:59

For German, it's 1.4% of the words.

05:03

Spanish appears in 0.7%.

05:07

And finally, Catalan.

05:09

The amount of Catalan words in the whole training set is 0.01%.

05:15

And it still performs very, very well.

05:17

So the problem here is a little bit different, right?

05:20

They've got some Catalan in the dataset.

05:23

Common Crawl says Catalan is 0.2335% of their survey.

05:28

Not a lot, but some.

05:30

The big company models like GPT-3, and presumably GPT-4 in the future,

05:34

have proven to do pretty well on little data.

05:38

For example, the research team got GPT-3

05:40

to generate 3 Catalan sentences.

05:43

And then they mixed them up with real sentences.

05:46

Three native speakers then evaluated them.

05:48

So, that was our test.

05:50

And their results were very good for the machine.
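The evaluation the researchers describe can be sketched as a simple blind test: shuffle machine-generated sentences in with real ones and score how often a judge tells them apart. The function below is a hypothetical sketch of that protocol, not the team's actual code:

```python
import random

def blind_eval(real, generated, judge, seed=0):
    """Mix generated sentences with real ones, shuffle them, and return
    the fraction the judge classifies correctly. A score near 0.5 means
    the generated text is hard to tell apart from the real thing."""
    items = [(s, "real") for s in real] + [(s, "generated") for s in generated]
    rng = random.Random(seed)  # fixed seed so the shuffle is repeatable
    rng.shuffle(items)
    hits = sum(1 for sentence, truth in items if judge(sentence) == truth)
    return hits / len(items)
```

A judge who just guesses "real" for everything scores 0.5 on a balanced mix, so only accuracy well above that shows the machine text is detectable.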

05:54

But there is still a catch.

05:55

It performs reasonably well.

05:57

But it's worth it to build a language-specific model

06:01

that has been specifically trained and evaluated for that language.

06:05

So the problem here isn't performance

06:07

it’s transparency and it's the amount of data.

06:10

I mean, Common Crawl says that they indexed...

06:13

millions of examples of Catalan words.

06:16

But GPT-3 says that they only read

06:19

about 140 pages of Catalan.

06:22

Imagine like a novella.

06:23

It's a problem being dependent

06:26

on the performance or even the goodwill

06:29

of a few institutions or a few companies.

06:32

You can easily imagine a world where one of these companies

06:35

just cuts out Catalan.

06:38

The same way Catalan News complained Google was cutting out

06:42

Catalan links in searches.

06:45

Common Crawl is just a percent of what

06:48

GPT-3 was trained on.

06:49

We don't know the details about GPT-4.

06:52

And that means a lot of other stuff

06:54

went into this language model that we just don't know about.

06:57

Right now all these bookstores are actually run by

07:00

Meta or Microsoft or Baidu or OpenAI or Google.

07:03

They decide which books go in there

07:06

and don't tell anyone where they came from or who wrote them.

07:09

Some people are trying to build a library

07:12

next to the bookstore.

07:15

This is Paris, where the French have a supercomputer

07:18

that wasn't being used a lot.

07:20

It’s like almost down the road

07:21

and I was discussing with the people...

07:23

who built it and they're like, “Nobody uses this GPU.”

07:26

Basically, what can we do?

07:28

Thomas Wolf is a co-founder of Hugging Face

07:30

which is like a hub for AI research on the Internet

07:34

and they ended up working on BigScience's BLOOM...

07:37

a project to create an open-source multilingual model.

07:41

And the more we thought about it

07:42

we thought it's also a lot better, in fact, that we trained it

07:45

in a lot of other languages, not just English.

07:47

And if we try to involve many people

07:49

and so it started from a small Hugging Face project

07:52

to become a very big collaboration.

07:55

Where we tried to open this to everyone.

07:57

They basically went down the Wikipedia list

07:59

of most spoken languages and covered those.

08:01

But also added low-resource languages when possible.

08:04

So we have very, very low-resource languages there.

08:07

Mostly in African languages.

08:09

And so here to gather the data there

08:11

what we decided was to partner

08:13

as much as possible with local communities

08:15

and ask them basically what they thought were good data

08:18

and how we could get it.

08:20

As importantly, we know where the data comes from

08:22

and how it was obtained.

08:24

That's the difference in open-source.

08:27

You know the books in the library.

08:30

All right, let me find English.

08:32

Okay, so let's be honest, as an English speaker...

08:35

I'm kind of the target audience

08:37

for these big companies in these big models.

08:40

English represents

08:41

more than 40% of the Common Crawl

08:44

but there are reasons for even the target audience

08:46

to want all languages to be well represented.

08:51

I am an English speaker but I have my Jamaican accent

08:54

and I remember that...

08:56

initially like when Siri came out, I had a harder time using it

09:01

because it couldn't understand my accent.

09:04

So expanding even the training dataset

09:07

for voice assistants to include...

09:09

more accents has been helpful.

09:12

So imagine what would happen

09:13

if we tried to expand another piece of that.

09:16

We're building technologies for more languages as well.

09:19

So if you want to have this model everywhere

09:21

you need to be able to trust them.

09:23

So if you trust Microsoft, that's fine.

09:25

But if you don't trust them...

09:27

yeah.

09:28

It's our language.

09:30

So we speak— we are Catalan speakers.

09:33

And it's not only a question of a small language,

09:35

or of a moderately small language,

09:37

because you may have languages that have a...

09:41

a sizable number of speakers in the real world

09:44

but that have very, very little digital footprint.

09:46

So they are bound to just...

09:49

disappear.