The following is a summary and article generated by AI from a transcript of the video "Why AI Needs a “Nutrition Label” | Kasia Chmielinski | TED". Due to the limitations of AI, please verify the accuracy of the content.
Time | Transcript |
---|---|
00:04 | Now, I haven't met most of you or really any of you, |
00:07 | but I feel a really good vibe in the room. |
00:09 | (Laughter) |
00:10 | And so I think I'd like to treat you all to a meal. |
00:13 | What do you think? |
00:14 | Yes? Great, so many new friends. |
00:17 | So we're going to go to this cafe, |
00:19 | they serve sandwiches. |
00:20 | And the sandwiches are really delicious. |
00:22 | But I have to tell you that sometimes they make people really, really sick. |
00:27 | (Laughter) |
00:29 | And we don't know why. |
00:30 | Because the cafe won't tell us how they make the sandwich, |
00:33 | they won't tell us about the ingredients. |
00:35 | And then the authorities have no way to fix the problem. |
00:38 | But the offer still stands. |
00:39 | So who wants to get a sandwich? |
00:41 | (Laughter) |
00:42 | Some brave souls, we can talk after. |
00:45 | But for the rest of you, I understand. |
00:47 | You don't have enough information |
00:48 | to make good choices about your safety |
00:50 | or even fix the issue. |
00:52 | Now, before I further the anxiety here, I'm not actually trying to make you sick, |
00:56 | but this is an analogy to how we're currently making algorithmic systems, |
00:59 | also known as artificial intelligence or AI. |
01:04 | Now, for those who haven't thought about the relationship |
01:06 | between AI and sandwiches, don't worry about it, |
01:09 | I'm here for you, I'm going to explain. |
01:11 | You see, AI systems, they provide benefit to society. |
01:15 | They feed us, |
01:16 | but they're also inconsistently making us sick. |
01:20 | And we don't have access to the ingredients that go into the AI. |
01:25 | And so we can't actually address the issues. |
01:28 | We also can't stop eating AI |
01:30 | like we can just stop eating a shady sandwich |
01:32 | because it's everywhere, |
01:33 | and we often don't even know that we're encountering a system |
01:36 | that's algorithmically based. |
01:38 | So today, I'm going to tell you about some of the AI trends that I see. |
01:42 | I'm going to draw on my experience building these systems |
01:44 | over the last two decades to tell you about the tools |
01:47 | that I and others have built to look into these AI ingredients. |
01:51 | And finally, I'm going to leave you with three principles |
01:54 | that I think will give us a healthier relationship |
01:56 | to the companies that build artificial intelligence. |
02:00 | I'm going to start with the question, how did we get here? |
02:03 | AI is not new. |
02:06 | We have been living alongside AI for two decades. |
02:10 | Every time that you apply for something online, |
02:12 | you open a bank account or you go through passport control, |
02:16 | you're encountering an algorithmic system. |
02:19 | We've also been living with the negative repercussions of AI for 20 years, |
02:23 | and this is how it makes us sick. |
02:25 | These systems get deployed on broad populations, |
02:28 | and then certain subsets end up getting negatively disparately impacted, |
02:33 | usually on the basis of race or gender or other characteristics. |
02:37 | We need to be able to understand the ingredients to these systems |
02:40 | so that we can address the issues. |
02:43 | So what are the ingredients to an AI system? |
02:46 | Well, data fuels the AI. |
02:49 | The AI is going to look like the data that you gave it. |
02:52 | So for example, |
02:54 | if I want to make a risk-assessment system for diabetes, |
02:58 | my training data set might be adults in a certain region. |
03:02 | And so I'll build that system, |
03:04 | it'll work really well for those adults in that region. |
03:07 | But it does not work for adults in other regions |
03:09 | or maybe at all for children. |
03:10 | So you can imagine if we deploy this for all those populations, |
03:13 | there are going to be a lot of people who are harmed. |
03:16 | We need to be able to understand the quality of the data before we use it. |
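To make the diabetes example concrete, here is a minimal sketch using purely synthetic data; it is not the speaker's actual system, and the cohorts, features, and coefficients are invented for illustration. It shows how a model fit only on one adult population can degrade on a population where the relationship between age and risk is assumed to differ.

```python
# Minimal sketch with synthetic data (not the speaker's actual system):
# a risk model fit only on one hypothetical adult cohort can degrade on a
# population where the age-to-risk relationship is assumed to differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_cohort(n, mean_age, age_sd, age_coef):
    """Hypothetical cohort: age and BMI drive a synthetic risk label."""
    age = rng.normal(mean_age, age_sd, n)
    bmi = rng.normal(27, 4, n)
    logits = age_coef * (age - 40) + 0.15 * (bmi - 27)
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    return np.column_stack([age, bmi]), y

# Train only on adults from one region ("the middle of the distribution").
X_train, y_train = make_cohort(5000, mean_age=50, age_sd=8, age_coef=0.08)
model = LogisticRegression().fit(X_train, y_train)

# Evaluate on a similar adult cohort vs. a cohort of children, where the
# age-to-risk relationship is assumed (for this sketch) to be reversed.
X_adult, y_adult = make_cohort(2000, mean_age=50, age_sd=8, age_coef=0.08)
X_child, y_child = make_cohort(2000, mean_age=12, age_sd=4, age_coef=-0.08)

print("AUC, adults like the training data:",
      roc_auc_score(y_adult, model.predict_proba(X_adult)[:, 1]))
print("AUC, children (shifted population):",
      roc_auc_score(y_child, model.predict_proba(X_child)[:, 1]))
```

The point is not the specific numbers, which come from made-up data, but that nothing in the trained model warns you it is being applied to a population it never saw.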
03:22 | But I'm sorry to tell you that we currently live |
03:24 | in what I call the Wild West of data. |
03:26 | It's really hard to assess quality of data before you use it. |
03:31 | There are no global standards for data quality assessment, |
03:34 | and there are very few data regulations around how you can use data |
03:37 | and what types of data you can use. |
03:40 | This is kind of like in the food safety realm. |
03:43 | If we couldn't understand where the ingredients were sourced, |
03:46 | we would also have no idea whether they were safe for us to consume. |
03:50 | We also tend to stitch data together, |
03:52 | and every time we stitch this data together, |
03:55 | data that we might find on the internet, scrape, or generate, |
03:58 | or that we could source elsewhere, |
03:59 | we lose information about the quality of the data. |
04:03 | And the folks who are building the models |
04:05 | are not the ones that found the data. |
04:07 | So there's further information that's lost. |
04:10 | Now, I've been asking myself a lot of questions |
04:12 | about how can we understand the data quality before we use it. |
04:16 | And this emerges from two decades of building these kinds of systems. |
04:21 | The way I was trained to build systems is similar to how people do it today. |
04:25 | You build for the middle of the distribution. |
04:27 | That's your normal user. |
04:29 | So for me, a lot of my training data sets |
04:31 | would include information about people from the Western world who speak English, |
04:35 | who have certain normative characteristics. |
04:37 | And it took me an embarrassingly long amount of time |
04:40 | to realize that I was not my own user. |
04:43 | So I identify as non-binary, as mixed race, |
04:46 | I wear a hearing aid |
04:47 | and I just wasn't represented in the data sets that I was using. |
04:51 | And so I was building systems that literally didn't work for me. |
04:55 | And for example, I once built a system that repeatedly told me |
04:58 | that I was a white Eastern-European lady. |
05:02 | This did a real number on my identity. |
05:05 | (Laughter) |
05:06 | But perhaps even more worrying, |
05:08 | this was a system to be deployed in health care, |
05:11 | where your background can determine things like risk scores for diseases. |
05:17 | And so I started to wonder, |
05:19 | can I build tools and work with others to do this |
05:22 | so that I can look inside of a dataset before I use it? |
05:25 | In 2018, I was part of a fellowship at Harvard and MIT, |
05:29 | and I, with some colleagues, decided to try to address this problem. |
05:33 | And so we launched the Data Nutrition Project, |
05:36 | which is a research group and also a nonprofit |
05:39 | that builds nutrition labels for datasets. |
05:43 | So similar to food nutrition labels, |
05:46 | the idea here is that you can look inside of a data set before you use it. |
05:49 | You can understand the ingredients, |
05:51 | see whether it's healthy for the things that you want to do. |
05:54 | Now this is a cartoonified version of the label. |
05:56 | The top part tells you about the completeness of the label itself. |
06:01 | And underneath that you have information about the data, |
06:03 | the description, the keywords, the tags, |
06:05 | and importantly, on the right hand side, |
06:07 | how you should and should not use the data. |
06:10 | If you could scroll on this cartoon, |
06:12 | you would see information about risks and mitigation strategies |
06:15 | across a number of vectors. |
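As a rough illustration only, a machine-readable version of such a label might look something like the record below. The field names are hypothetical and do not reproduce the Data Nutrition Project's actual schema; they simply mirror the parts described above: completeness, description, keywords and tags, intended and non-recommended uses, and risks with mitigations.

```python
# Hypothetical sketch of a dataset "nutrition label" as a simple record.
# Field names are illustrative only, not the Data Nutrition Project's schema.
dataset_label = {
    "label_completeness": 0.85,  # how much of the label itself has been filled in
    "description": "Adult diabetes-risk survey data from one region (synthetic example)",
    "keywords": ["health", "survey", "tabular"],
    "tags": {"collection_method": "survey", "license": "CC-BY-4.0"},
    "intended_uses": ["risk-score research on comparable adult populations"],
    "non_recommended_uses": ["deployment on children or other regions without validation"],
    "risks_and_mitigations": [
        {
            "risk": "under-representation of some demographic groups",
            "mitigation": "report subgroup coverage and evaluate per subgroup",
        },
    ],
}
```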
06:17 | And we launched this with two audiences in mind. |
06:20 | The first audience are folks who are building AI. |
06:24 | So they’re choosing datasets. |
06:25 | We want to help them make a better choice. |
06:27 | The second audience are folks who are building datasets. |
06:31 | And it turns out |
06:32 | that when you tell someone they have to put a label on something, |
06:35 | they think about the ingredients beforehand. |
06:38 | The analogy here might be, |
06:39 | if I want to make a sandwich and say that it’s gluten-free, |
06:42 | I have to think about all the components as I make the sandwich, |
06:45 | the bread and the ingredients, the sauces. |
06:47 | I can't just put a label on a sandwich and put it in front of you |
06:50 | and tell you it's gluten-free. |
06:52 | We're really proud of the work that we've done. |
06:54 | We launched this as a design and then a prototype |
06:57 | and ultimately a tool for others to make their own labels. |
07:01 | And we've worked with experts at places like Microsoft Research, |
07:04 | the United Nations and professors globally |
07:07 | to integrate the label and the methodology |
07:09 | into their work flows and into their curricula. |
07:13 | But we know it only goes so far. |
07:15 | And that's because it's actually really hard to get a label |
07:17 | on every single dataset. |
07:20 | And this comes down to the question |
07:22 | of why would you put a label on a dataset to begin with? |
07:25 | Well, the first reason is not rocket science. |
07:27 | It's that you have to. |
07:29 | And this is, quite frankly, why food nutrition labels exist. |
07:32 | It's because if they didn't put them on the boxes, it would be illegal. |
07:36 | However, we don't really have AI regulation. |
07:39 | We don't have much regulation around the use of data. |
07:42 | Now there is some on the horizon. |
07:44 | For example, the EU AI Act just passed this week. |
07:48 | And although there are no requirements around making the training data available, |
07:53 | they do have provisions for creating transparency labeling |
07:57 | like the dataset nutrition label, data sheets, data statements. |
08:01 | There are many in the space. |
08:02 | We think this is a really good first step. |
08:05 | The second reason that you might have a label on a dataset |
08:08 | is because it is a best practice or a cultural norm. |
08:13 | The example here might be how we're starting to see |
08:15 | more and more food packaging and menus at restaurants |
08:19 | include information about whether there's gluten. |
08:22 | This is not required by law, |
08:24 | although if you do say it, it had better be true. |
08:27 | And the reason that people are adding this to their menus |
08:29 | and their food packaging |
08:31 | is because there's an increased awareness of the sensitivity |
08:33 | and kind of the seriousness of that kind of an allergy or condition. |
08:39 | So we're also seeing some movement in this area. |
08:42 | Folks who are building datasets are starting to put nutrition labels, |
08:45 | data sheets on their datasets. |
08:47 | And people who are using data are starting to request the information. |
08:50 | This is really heartening. |
08:52 | And you might say, "Kasia, why are you up here? |
08:54 | Everything seems to be going well, seems to be getting better." |
08:57 | In some ways it is. |
08:58 | But I'm also here to tell you that our relationship to data |
09:01 | is getting worse. |
09:03 | Now the last few years have seen a supercharged interest |
09:07 | in gathering datasets. |
09:09 | Companies are scraping the web. |
09:11 | They're transcribing millions of hours of YouTube videos into text. |
09:15 | By some estimates, they'll run out of information on the internet by 2026. |
09:20 | They're even considering buying publishing houses |
09:23 | so they can get access to printed text and books. |
09:27 | So why are they gathering this information? |
09:30 | Well, they need more and more information |
09:32 | to train a new technique called generative AI. |
09:35 | I want to tell you about the size of these datasets. |
09:38 | If you look at GPT-3, which is a model that launched in 2020, |
09:41 | the training dataset included 300 billion words, or parts of words. |
09:47 | Now for context, the English language contains less than a million words. |
09:52 | Just three years later, DBRX was launched, |
09:55 | which was trained on eight trillion words. |
09:58 | So 300 billion to eight trillion in three years. |
10:01 | And the datasets are getting bigger. |
10:04 | Now with each successive model launch, |
10:06 | the datasets are actually less and less transparent. |
10:09 | And even if we have access to the information, |
10:12 | it's so big, it's so hard to look inside without any kind of transparency tooling. |
10:18 | And the generative AI itself is also causing some worries. |
10:23 | And you've probably encountered this technique through ChatGPT. |
10:26 | I don't need to know what you do on the internet, |
10:29 | that's between you and the internet, |
10:30 | but you probably know, just like I do, |
10:32 | how easy it is to create information using ChatGPT |
10:35 | and other generative AI technologies |
10:36 | and to put that out onto the web. |
10:38 | And so we're looking at a situation |
10:40 | in which we're going to encounter lots of information |
10:43 | that's algorithmically generated but we won't know it |
10:45 | and we won't know whether it's true. |
10:47 | And this increases the scale of the potential risks and harms from AI. |
10:51 | Not only that, I'm sorry, |
10:53 | but the models themselves are getting controlled |
10:56 | by a smaller and smaller number of private actors in US tech firms. |
11:00 | So this is the models that were launched last year, in 2023. |
11:04 | And you can see most of them are pink, meaning they came out of industry. |
11:08 | And if you look at this over time, more and more are coming out of industry |
11:11 | and fewer and fewer are coming out of all the other sectors combined, |
11:14 | including academia and government, |
11:16 | where technology is often launched in a way |
11:18 | that's easier to scrutinize. |
11:20 | So if we go back to our cafe analogy, |
11:22 | this is like you have a small number of private actors |
11:25 | who own all the ingredients, |
11:27 | they make all the sandwiches globally, |
11:30 | and there's not a lot of regulation. |
11:33 | And so at this point you're probably scared |
11:35 | and maybe feeling a little uncomfortable. |
11:37 | Which is ironic because a few minutes ago, I was going to get you all sandwiches |
11:40 | and you said yes. |
11:42 | This is why you should not accept food from strangers. |
11:44 | But I wouldn't be up here if I weren't also optimistic. |
11:47 | And that's because I think we have momentum |
11:49 | behind the regulation and the culture changes. |
11:52 | Especially if we align ourselves with three basic principles |
11:55 | about how corporations should engage with data. |
11:58 | The first principle is that companies that gather data should tell us |
12:02 | what they're gathering. |
12:04 | This would allow us to ask questions like, is it copyrighted material? |
12:08 | Is that information private? |
12:09 | Could you please stop? |
12:11 | It also opens up the data to scientific inquiry. |
12:15 | The second principle is that companies that are gathering our data should tell us |
12:19 | what they're going to do with it before they do anything with it. |
12:23 | And by requiring that companies tell us their plan, |
12:26 | this means that they have to have a plan, |
12:28 | which would be a great first step. |
12:31 | It also probably would lead to the minimization of data capture, |
12:35 | because they wouldn't be able to capture data |
12:37 | if they didn't already know what they were going to do with it. |
12:40 | And finally, principle three, |
12:41 | companies that build AI should tell us about the data |
12:44 | that they use to train the AI. |
12:47 | And this is where dataset nutrition labels |
12:49 | and other transparency labeling comes into play. |
12:52 | You know, in the case where the data itself won't be made available, |
12:56 | which is most of the time, probably, |
12:58 | the labeling is critical for us to be able to investigate the ingredients |
13:02 | and start to find solutions. |
13:05 | So I want to leave you with the good news, |
13:07 | and that is that the Data Nutrition Project and other projects |
13:10 | are just a small part of a global movement |
13:14 | towards AI accountability. |
13:16 | The Dataset Nutrition Label and other projects are just a first step. |
13:21 | Regulation's on the horizon, |
13:23 | the cultural norms are shifting, |
13:25 | especially if we align with these three basic principles |
13:28 | that companies should tell us what they're gathering, |
13:30 | tell us what they're going to do with it before they do anything with it, |
13:34 | and that companies that are building AI |
13:36 | should explain the data that they're using to build the system. |
13:40 | We need to hold these organizations accountable |
13:42 | for the AI that they're building |
13:44 | by asking them, just like we do with the food industry, |
13:47 | what's inside and how did you make it? |
13:50 | Only then can we mitigate the issues before they occur, |
13:53 | as opposed to after they occur. |
13:55 | And in doing so, create an integrated algorithmic internet |
13:59 | that is healthier for everyone. |
14:02 | Thank you. |
14:03 | (Applause) |