Transcript

Transcript of YouTube Video: Why AI Needs a “Nutrition Label” | Kasia Chmielinski | TED

00:04

Now, I haven't met most of you or really any of you,

00:07

but I feel a really good vibe in the room.

00:09

(Laughter)

00:10

And so I think I'd like to treat you all to a meal.

00:13

What do you think?

00:14

Yes? Great, so many new friends.

00:17

So we're going to go to this cafe,

00:19

they serve sandwiches.

00:20

And the sandwiches are really delicious.

00:22

But I have to tell you that sometimes they make people really, really sick.

00:27

(Laughter)

00:29

And we don't know why.

00:30

Because the cafe won't tell us how they make the sandwich,

00:33

they won't tell us about the ingredients.

00:35

And then the authorities have no way to fix the problem.

00:38

But the offer still stands.

00:39

So who wants to get a sandwich?

00:41

(Laughter)

00:42

Some brave souls, we can talk after.

00:45

But for the rest of you, I understand.

00:47

You don't have enough information

00:48

to make good choices about your safety

00:50

or even fix the issue.

00:52

Now, before I further the anxiety here, I'm not actually trying to make you sick,

00:56

but this is an analogy to how we're currently making algorithmic systems,

00:59

also known as artificial intelligence or AI.

01:04

Now, for those who haven't thought about the relationship

01:06

between AI and sandwiches, don't worry about it,

01:09

I'm here for you, I'm going to explain.

01:11

You see, AI systems, they provide benefit to society.

01:15

They feed us,

01:16

but they're also inconsistently making us sick.

01:20

And we don't have access to the ingredients that go into the AI.

01:25

And so we can't actually address the issues.

01:28

We also can't stop eating AI

01:30

like we can just stop eating a shady sandwich

01:32

because it's everywhere,

01:33

and we often don't even know that we're encountering a system

01:36

that's algorithmically based.

01:38

So today, I'm going to tell you about some of the AI trends that I see.

01:42

I'm going to draw on my experience building these systems

01:44

over the last two decades to tell you about the tools

01:47

that I and others have built to look into these AI ingredients.

01:51

And finally, I'm going to leave you with three principles

01:54

that I think will give us a healthier relationship

01:56

to the companies that build artificial intelligence.

02:00

I'm going to start with the question, how did we get here?

02:03

AI is not new.

02:06

We have been living alongside AI for two decades.

02:10

Every time that you apply for something online,

02:12

you open a bank account or you go through passport control,

02:16

you're encountering an algorithmic system.

02:19

We've also been living with the negative repercussions of AI for 20 years,

02:23

and this is how it makes us sick.

02:25

These systems get deployed on broad populations,

02:28

and then certain subsets end up getting negatively disparately impacted,

02:33

usually on the basis of race or gender or other characteristics.

02:37

We need to be able to understand the ingredients to these systems

02:40

so that we can address the issues.

02:43

So what are the ingredients to an AI system?

02:46

Well, data fuels the AI.

02:49

The AI is going to look like the data that you gave it.

02:52

So for example,

02:54

if I want to make a risk-assessment system for diabetes,

02:58

my training data set might be adults in a certain region.

03:02

And so I'll build that system,

03:04

it'll work really well for those adults in that region.

03:07

But it does not work for adults in other regions

03:09

or maybe at all for children.

03:10

So you can imagine if we deploy this for all those populations,

03:13

there are going to be a lot of people who are harmed.

03:16

We need to be able to understand the quality of the data before we use it.
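
As a purely illustrative sketch of the point above, the following hedged example uses synthetic, made-up numbers (not anything from the talk or any real medical dataset) to show how a simple risk rule fit only on adults from one population can look accurate there and then fail badly on children:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration only: risk relates to age one way in the
# training population (adults in one region) and differently in children.
def make_group(n, age_low, age_high, slope, intercept):
    age = rng.uniform(age_low, age_high, n)
    risk = (slope * age + intercept + rng.normal(0, 5, n)) > 50
    return age, risk.astype(int)

train_age, train_risk = make_group(1000, 40, 70, 1.0, 0)    # adults, one region
child_age, child_risk = make_group(1000, 5, 15, -0.5, 60)   # children elsewhere

# "Model": the single age threshold that best separates risk in the training data.
thresholds = np.linspace(0, 80, 200)
accuracies = [np.mean((train_age > t) == train_risk) for t in thresholds]
best_t = thresholds[int(np.argmax(accuracies))]

print("accuracy on the training population:", np.mean((train_age > best_t) == train_risk))
print("accuracy on children:               ", np.mean((child_age > best_t) == child_risk))
```

A record of who is and is not represented in the training data would flag exactly this mismatch before such a system is deployed, which is the motivation for the labels described later in the talk.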

03:22

But I'm sorry to tell you that we currently live

03:24

in what I call the Wild West of data.

03:26

It's really hard to assess quality of data before you use it.

03:31

There are no global standards for data quality assessment,

03:34

and there are very few data regulations around how you can use data

03:37

and what types of data you can use.

03:40

This is kind of like in the food safety realm.

03:43

If we couldn't understand where the ingredients were sourced,

03:46

we'd also have no idea whether they were safe for us to consume.

03:50

We also tend to stitch data together,

03:52

and every time we stitch this data together,

03:55

which we might find on the internet and scrape, we might generate it,

03:58

or we could source it,

03:59

we lose information about the quality of the data.

04:03

And the folks who are building the models

04:05

are not the ones that found the data.

04:07

So there's further information that's lost.

04:10

Now, I've been asking myself a lot of questions

04:12

about how can we understand the data quality before we use it.

04:16

And this emerges from two decades of building these kinds of systems.

04:21

The way I was trained to build systems is similar to how people do it today.

04:25

You build for the middle of the distribution.

04:27

That's your normal user.

04:29

So for me, a lot of my training data sets

04:31

would include information about people from the Western world who speak English,

04:35

who have certain normative characteristics.

04:37

And it took me an embarrassingly long amount of time

04:40

to realize that I was not my own user.

04:43

So I identify as non-binary, as mixed race,

04:46

I wear a hearing aid

04:47

and I just wasn't represented in the data sets that I was using.

04:51

And so I was building systems that literally didn't work for me.

04:55

And for example, I once built a system that repeatedly told me

04:58

that I was a white Eastern-European lady.

05:02

This did a real number on my identity.

05:05

(Laughter)

05:06

But perhaps even more worrying,

05:08

this was a system to be deployed in health care,

05:11

where your background can determine things like risk scores for diseases.

05:17

And so I started to wonder,

05:19

can I build tools and work with others to do this

05:22

so that I can look inside of a dataset before I use it?

05:25

In 2018, I was part of a fellowship at Harvard and MIT,

05:29

and I, with some colleagues, decided to try to address this problem.

05:33

And so we launched the Data Nutrition Project,

05:36

which is a research group and also a nonprofit

05:39

that builds nutrition labels for datasets.

05:43

So similar to food nutrition labels,

05:46

the idea here is that you can look inside of a data set before you use it.

05:49

You can understand the ingredients,

05:51

see whether it's healthy for the things that you want to do.

05:54

Now this is a cartoonified version of the label.

05:56

The top part tells you about the completion of the label itself.

06:01

And underneath that you have information about the data,

06:03

the description, the keywords, the tags,

06:05

and importantly, on the right hand side,

06:07

how you should and should not use the data.

06:10

If you could scroll on this cartoon,

06:12

you would see information about risks and mitigation strategies

06:15

across a number of vectors.
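
To make the structure just described more concrete, here is a minimal, purely illustrative sketch of such a label as a Python dictionary. The field names follow the description in the talk and are assumptions for illustration, not the Data Nutrition Project's actual schema:

```python
# Illustrative sketch only: field names are assumptions based on the talk,
# not the Data Nutrition Project's real label format.
example_label = {
    "label_completion": "80%",  # top section: how complete the label itself is
    "description": "Adult diabetes screening records from one region",
    "keywords": ["health", "diabetes", "adults"],
    "tags": ["tabular", "clinical"],
    "intended_uses": ["risk-assessment research on the sampled population"],
    "non_intended_uses": ["deployment on children or on other regions"],
    "risks_and_mitigations": [
        {"risk": "under-representation of some groups",
         "mitigation": "audit coverage before training"},
    ],
}

for field, value in example_label.items():
    print(f"{field}: {value}")
```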

06:17

And we launched this with two audiences in mind.

06:20

The first audience are folks who are building AI.

06:24

So they’re choosing datasets.

06:25

We want to help them make a better choice.

06:27

The second audience are folks who are building datasets.

06:31

And it turns out

06:32

that when you tell someone they have to put a label on something,

06:35

they think about the ingredients beforehand.

06:38

The analogy here might be,

06:39

if I want to make a sandwich and say that it’s gluten-free,

06:42

I have to think about all the components as I make the sandwich,

06:45

the bread and the ingredients, the sauces.

06:47

I can't just put it on a sandwich and put it in front of you

06:50

and tell you it's gluten-free.

06:52

We're really proud of the work that we've done.

06:54

We launched this as a design and then a prototype

06:57

and ultimately a tool for others to make their own labels.

07:01

And we've worked with experts at places like Microsoft Research,

07:04

the United Nations and professors globally

07:07

to integrate the label and the methodology

07:09

into their workflows and into their curricula.

07:13

But we know it only goes so far.

07:15

And that's because it's actually really hard to get a label

07:17

on every single dataset.

07:20

And this comes down to the question

07:22

of why would you put a label on a dataset to begin with?

07:25

Well, the first reason is not rocket science.

07:27

It's that you have to.

07:29

And this is, quite frankly, why food nutrition labels exist.

07:32

It's because if they didn't put them on the boxes, it would be illegal.

07:36

However, we don't really have AI regulation.

07:39

We don't have much regulation around the use of data.

07:42

Now there is some on the horizon.

07:44

For example, the EU AI Act just passed this week.

07:48

And although there are no requirements around making the training data available,

07:53

they do have provisions for creating transparency labeling

07:57

like the dataset nutrition label, data sheets, data statements.

08:01

There are many in the space.

08:02

We think this is a really good first step.

08:05

The second reason that you might have a label on a dataset

08:08

is because it is a best practice or a cultural norm.

08:13

The example here might be how we're starting to see

08:15

more and more food packaging and menus at restaurants

08:19

include information about whether there's gluten.

08:22

This is not required by law,

08:24

although if you do say it, it had better be true.

08:27

And the reason that people are adding this to their menus

08:29

and their food packaging

08:31

is because there's an increased awareness of the sensitivity

08:33

and kind of the seriousness of that kind of an allergy or condition.

08:39

So we're also seeing some movement in this area.

08:42

Folks who are building datasets are starting to put nutrition labels,

08:45

data sheets on their datasets.

08:47

And people who are using data are starting to request the information.

08:50

This is really heartening.

08:52

And you might say, "Kasia, why are you up here?

08:54

Everything seems to be going well, seems to be getting better."

08:57

In some ways it is.

08:58

But I'm also here to tell you that our relationship to data

09:01

is getting worse.

09:03

Now the last few years have seen a supercharged interest

09:07

in gathering datasets.

09:09

Companies are scraping the web.

09:11

They're transcribing millions of hours of YouTube videos into text.

09:15

By some estimates, they'll run out of information on the internet by 2026.

09:20

They're even considering buying publishing houses

09:23

so they can get access to printed text and books.

09:27

So why are they gathering this information?

09:30

Well, they need more and more information

09:32

to train a new technique called generative AI.

09:35

I want to tell you about the size of these datasets.

09:38

If you look at GPT-3, which is a model that launched in 2020,

09:41

the training dataset included 300 billion words, or parts of words.

09:47

Now for context, the English language contains less than a million words.

09:52

Just three years later, DBRX was launched,

09:55

which was trained on eight trillion words.

09:58

So 300 billion to eight trillion in three years.

10:01

And the datasets are getting bigger.

10:04

Now with each successive model launch,

10:06

the datasets are actually less and less transparent.

10:09

And even if we have access to the information,

10:12

it's so big, it's so hard to look inside without any kind of transparency tooling.

10:18

And the generative AI itself is also causing some worries.

10:23

And you've probably encountered this technique through ChatGPT.

10:26

I don't need to know what you do on the internet,

10:29

that's between you and the internet,

10:30

but you probably know, just like I do,

10:32

how easy it is to create information using ChatGPT

10:35

and other generative AI technologies

10:36

and to put that out onto the web.

10:38

And so we're looking at a situation

10:40

in which we're going to encounter lots of information

10:43

that's algorithmically generated but we won't know it

10:45

and we won't know whether it's true.

10:47

And this increases the scale of the potential risks and harms from AI.

10:51

Not only that, I'm sorry,

10:53

but the models themselves are getting controlled

10:56

by a smaller and smaller number of private actors in US tech firms.

11:00

So these are the models that were launched last year, in 2023.

11:04

And you can see most of them are pink, meaning they came out of industry.

11:08

And if you look at this over time, more and more are coming out of industry

11:11

and fewer and fewer are coming out of all the other sectors combined,

11:14

including academia and government,

11:16

where technology is often launched in a way

11:18

that's easier to scrutinize.

11:20

So if we go back to our cafe analogy,

11:22

this is like you have a small number of private actors

11:25

who own all the ingredients,

11:27

they make all the sandwiches globally,

11:30

and there's not a lot of regulation.

11:33

And so at this point you're probably scared

11:35

and maybe feeling a little uncomfortable.

11:37

Which is ironic because a few minutes ago, I was going to get you all sandwiches

11:40

and you said yes.

11:42

This is why you should not accept food from strangers.

11:44

But I wouldn't be up here if I weren't also optimistic.

11:47

And that's because I think we have momentum

11:49

behind the regulation and the culture changes.

11:52

Especially if we align ourselves with three basic principles

11:55

about how corporations should engage with data.

11:58

The first principle is that companies that gather data should tell us

12:02

what they're gathering.

12:04

This would allow us to ask questions like, is it copyrighted material?

12:08

Is that information private?

12:09

Could you please stop?

12:11

It also opens up the data to scientific inquiry.

12:15

The second principle is that companies that are gathering our data should tell us

12:19

what they're going to do with it before they do anything with it.

12:23

And by requiring that companies tell us their plan,

12:26

this means that they have to have a plan,

12:28

which would be a great first step.

12:31

It also probably would lead to the minimization of data capture,

12:35

because they wouldn't be able to capture data

12:37

if they didn't already know what they were going to do with it.

12:40

And finally, principle three,

12:41

companies that build AI should tell us about the data

12:44

that they use to train the AI.

12:47

And this is where dataset nutrition labels

12:49

and other transparency labeling comes into play.

12:52

You know, in the case where the data itself won't be made available,

12:56

which is most of the time, probably,

12:58

the labeling is critical for us to be able to investigate the ingredients

13:02

and start to find solutions.

13:05

So I want to leave you with the good news,

13:07

and that is that the Data Nutrition Project and other projects

13:10

are just a small part of a global movement

13:14

towards AI accountability.

13:16

The Dataset Nutrition Label and other projects are just a first step.

13:21

Regulation's on the horizon,

13:23

the cultural norms are shifting,

13:25

especially if we align with these three basic principles

13:28

that companies should tell us what they're gathering,

13:30

tell us what they're going to do with it before they do anything with it,

13:34

and that companies that are building AI

13:36

should explain the data that they're using to build the system.

13:40

We need to hold these organizations accountable

13:42

for the AI that they're building

13:44

by asking them, just like we do with the food industry,

13:47

what's inside and how did you make it?

13:50

Only then can we mitigate the issues before they occur,

13:53

as opposed to after they occur.

13:55

And in doing so, create an integrated algorithmic internet

13:59

that is healthier for everyone.

14:02

Thank you.

14:03

(Applause)