The following is an AI-generated summary and article based on a transcript of the video "With Spatial Intelligence, AI Will Understand the Real World | Fei-Fei Li | TED". Because of the limitations of AI, please verify the accuracy of the content for yourself.
00:04 | Let me show you something.
00:06 | To be precise, |
00:07 | I'm going to show you nothing. |
00:10 | This was the world 540 million years ago. |
00:15 | Pure, endless darkness. |
00:18 | It wasn't dark due to a lack of light. |
00:22 | It was dark because of a lack of sight. |
00:27 | Although sunshine did filter 1,000 meters |
00:32 | beneath the surface of the ocean,
00:35 | and light permeated from hydrothermal vents to the seafloor,
00:40 | brimming with life, |
00:42 | there was not a single eye to be found in these ancient waters. |
00:47 | No retinas, no corneas, no lenses. |
00:52 | So all this light, all this life went unseen. |
00:57 | There was a time that the very idea of seeing didn't exist. |
01:03 | It had simply never been done before.
01:06 | Until it was. |
01:09 | So for reasons we're only beginning to understand, |
01:12 | trilobites, the first organisms that could sense light, emerged. |
01:18 | They're the first inhabitants of this reality that we take for granted. |
01:24 | First to discover that there is something other than oneself. |
01:28 | A world of many selves. |
01:32 | The ability to see is thought to have ushered in the Cambrian explosion,
01:37 | a period in which a huge variety of animal species |
01:43 | entered the fossil record.
01:43 | What began as a passive experience, |
01:46 | the simple act of letting light in, |
01:50 | soon became far more active. |
01:53 | The nervous system began to evolve. |
01:56 | Sight turning to insight. |
02:00 | Seeing became understanding. |
02:03 | Understanding led to actions. |
02:05 | And all these gave rise to intelligence. |
02:10 | Today, we're no longer satisfied with just nature's gift of visual intelligence. |
02:17 | Curiosity urges us to create machines to see just as intelligently as we can, |
02:23 | if not better. |
02:25 | Nine years ago, on this stage, |
02:27 | I delivered an early progress report on computer vision, |
02:32 | a subfield of artificial intelligence. |
02:35 | Three powerful forces converged for the first time. |
02:39 | A family of algorithms called neural networks.
02:43 | Fast, specialized hardware called graphics processing units,
02:48 | or GPUs. |
02:49 | And big data. |
02:51 | Like the 15 million images that my lab spent years curating called ImageNet. |
02:57 | Together, they ushered in the age of modern AI. |
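To make that recipe concrete, here is a minimal sketch of how the three ingredients combine: a neural network, GPU acceleration, and a large labeled image dataset. It is illustrative only; the real ImageNet data is replaced by torchvision's FakeData so the script runs anywhere, and the model choice and hyperparameters are arbitrary assumptions rather than anything used by the speaker's lab.

```python
# A minimal sketch of the recipe: a neural network (ResNet), GPU acceleration,
# and a large labeled image dataset. ImageNet itself is replaced by
# torchvision's FakeData so the script runs anywhere.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"  # the GPU ingredient

# Stand-in for the big-data ingredient (ImageNet has ~15 million curated images).
data = datasets.FakeData(size=512, image_size=(3, 224, 224), num_classes=10,
                         transform=transforms.ToTensor())
loader = DataLoader(data, batch_size=32, shuffle=True)

# The algorithm ingredient: a convolutional neural network classifier.
model = models.resnet18(num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:  # one pass over the (fake) labeled images
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
print("final batch loss:", loss.item())
```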
03:02 | We've come a long way. |
03:04 | Back then, just putting labels on images was a big breakthrough. |
03:09 | But the speed and accuracy of these algorithms improved rapidly.
03:14 | The annual ImageNet challenge, led by my lab, |
03:18 | gauged the performance of this progress. |
03:21 | And on this plot, you're seeing the annual improvement |
03:24 | and milestone models. |
03:27 | We went a step further |
03:29 | and created algorithms that can segment objects |
03:34 | or predict the dynamic relationships among them |
03:37 | in these works done by my students and collaborators. |
03:41 | And there's more. |
03:43 | Recall last time I showed you the first computer-vision algorithm |
03:47 | that can describe a photo in human natural language. |
03:52 | That was work done with my brilliant former student, Andrej Karpathy. |
03:57 | At that time, I pushed my luck and said, |
03:59 | "Andrej, can we make computers to do the reverse?" |
04:02 | And Andrej said, "Ha ha, that's impossible." |
04:05 | Well, as you can see from this post, |
04:07 | recently the impossible has become possible. |
04:11 | That's thanks to a family of diffusion models |
04:15 | that powers today's generative AI algorithms,
04:18 | which can take human-prompted sentences |
04:22 | and turn them into photos and videos |
04:25 | of something that's entirely new. |
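As a rough illustration of text-to-image generation with diffusion models, the sketch below uses the open-source diffusers library and a public Stable Diffusion checkpoint. It is not Walt or Sora, just a widely available example of the same family of techniques the talk describes.

```python
# A minimal text-to-image sketch using the open-source diffusers library and a
# public Stable Diffusion checkpoint; illustrative only, not the Walt or Sora
# models referenced in the talk.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt is the "human-prompted sentence"; the pipeline iteratively
# denoises random latents into an image that matches it.
image = pipe("a cat surfing on an ocean wave at sunset").images[0]
image.save("generated_cat.png")
```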
04:28 | Many of you have seen the recent impressive results of Sora by OpenAI. |
04:34 | But even without the enormous number of GPUs, |
04:37 | my student and our collaborators |
04:40 | have developed a generative video model called Walt |
04:44 | months before Sora. |
04:47 | And you're seeing some of these results. |
04:50 | There is room for improvement. |
04:53 | I mean, look at that cat's eye |
04:55 | and the way it goes under the wave without ever getting wet. |
04:59 | What a cat-astrophe. |
05:01 | (Laughter) |
05:04 | And if past is prologue, |
05:07 | we will learn from these mistakes and create a future we imagine. |
05:11 | And in this future, |
05:13 | we want AI to do everything it can for us, |
05:17 | or to help us. |
05:19 | For years I have been saying |
05:22 | that taking a picture is not the same as seeing and understanding. |
05:26 | Today, I would like to add to that. |
05:30 | Simply seeing is not enough. |
05:33 | Seeing is for doing and learning. |
05:36 | When we act upon this world in 3D space and time, |
05:41 | we learn, and we learn to see and do better. |
05:46 | Nature has created this virtuous cycle of seeing and doing |
05:50 | powered by “spatial intelligence.” |
05:54 | To illustrate to you what your spatial intelligence is doing constantly, |
05:58 | look at this picture. |
05:59 | Raise your hand if you feel like you want to do something. |
06:02 | (Laughter) |
06:04 | In that split second,
06:06 | your brain looked at the geometry of this glass, |
06:09 | its place in 3D space, |
06:12 | its relationship with the table, the cat |
06:15 | and everything else. |
06:16 | And you can predict what's going to happen next. |
06:20 | The urge to act is innate to all beings with spatial intelligence, |
06:27 | which links perception with action. |
06:30 | And if we want to advance AI beyond its current capabilities, |
06:36 | we want more than AI that can see and talk. |
06:39 | We want AI that can do. |
06:42 | Indeed, we're making exciting progress. |
06:46 | The recent milestones in spatial intelligence |
06:50 | are teaching computers to see, learn, do
06:54 | and learn to see and do better. |
06:57 | This is not easy. |
06:59 | It took nature millions of years to evolve spatial intelligence, |
07:04 | which depends on the eye taking in light,
07:07 | projecting 2D images onto the retina,
07:09 | and the brain translating these data into 3D information.
07:14 | Only recently, a group of researchers from Google
07:17 | was able to develop an algorithm that takes a bunch of photos
07:22 | and translates them into 3D space,
07:26 | like the examples we're showing here. |
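For readers who want to see the 2D-to-3D step in code, here is a minimal numpy sketch of the forward direction: projecting 3D points onto a 2D image plane with a simple pinhole camera model. Photo-to-3D methods like the one mentioned above effectively invert this mapping by combining many views; the focal length and points are made-up values, not anything from the cited work.

```python
# A minimal pinhole-camera sketch: projecting 3D points to 2D pixels. Methods
# that turn a set of photos into 3D space effectively invert this mapping by
# combining many views. Focal length and points below are made-up values.
import numpy as np

def project(points_3d, focal=500.0, cx=320.0, cy=240.0):
    """Project Nx3 camera-frame points onto the image plane (Nx2 pixels)."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = focal * x / z + cx   # perspective division: farther points shrink
    v = focal * y / z + cy
    return np.stack([u, v], axis=1)

points = np.array([[0.0, 0.0, 2.0],    # straight ahead, 2 m away
                   [0.5, -0.2, 4.0]])  # to the right, farther away
print(project(points))
```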
07:29 | My student and our collaborators have taken a step further |
07:33 | and created an algorithm that takes a single input image
07:38 | and turns it into a 3D shape.
07:40 | Here are more examples. |
07:43 | Recall, we talked about computer programs that can take a human sentence |
07:49 | and turn it into videos. |
07:51 | A group of researchers at the University of Michigan
07:55 | has figured out a way to translate such a sentence
07:59 | into a 3D room layout, as shown here.
08:03 | And my colleagues at Stanford and their students |
08:06 | have developed an algorithm that takes one image |
08:10 | and generates an infinite variety of plausible spaces
08:14 | for viewers to explore. |
08:17 | These are prototypes of the first budding signs of a future possibility. |
08:23 | One in which the human race can take our entire world |
08:29 | and translate it into digital forms |
08:32 | and model the richness and nuances. |
08:35 | What nature did to us implicitly in our individual minds, |
08:40 | spatial intelligence technology can hope to do |
08:44 | for our collective consciousness. |
08:47 | As the progress of spatial intelligence accelerates, |
08:51 | a new era in this virtuous cycle is taking place in front of our eyes. |
08:56 | This back and forth is catalyzing robotic learning, |
09:01 | a key component for any embodied intelligence system |
09:06 | that needs to understand and interact with the 3D world. |
09:12 | A decade ago, |
09:14 | ImageNet from my lab |
09:16 | provided a database of millions of high-quality photos
09:20 | to help train computers to see. |
09:23 | Today, we're doing the same with behaviors and actions |
09:28 | to train computers and robots how to act in the 3D world. |
09:34 | But instead of collecting static images, |
09:37 | we develop simulation environments powered by 3D spatial models |
09:43 | so that the computers can have infinite varieties of possibilities |
09:48 | to learn to act. |
09:50 | And you're just seeing a small number of examples |
09:55 | to teach our robots |
09:57 | in a project led by my lab called Behavior. |
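To give a feel for what "learning to act in simulation" looks like, here is a toy stand-in, not the Behavior benchmark: a tiny environment in which an agent must move toward a goal position, showing the reset/step interaction loop that simulated robots repeat endlessly before acting in the real world. The environment, policy, and reward are all invented for illustration.

```python
# A toy stand-in for a 3D simulation environment (not the Behavior benchmark):
# the agent must move toward a goal position. It shows the reset/step loop in
# which simulated robots gather endless practice before acting in the world.
import numpy as np

class ToyReachEnv:
    def reset(self):
        self.pos = np.zeros(3)
        self.goal = np.random.uniform(-1.0, 1.0, size=3)  # random 3D goal
        return np.concatenate([self.pos, self.goal])

    def step(self, action):
        self.pos = self.pos + np.clip(action, -0.1, 0.1)   # bounded motion
        dist = np.linalg.norm(self.goal - self.pos)
        reward = -dist                                      # closer is better
        done = dist < 0.05
        return np.concatenate([self.pos, self.goal]), reward, done

env = ToyReachEnv()
obs = env.reset()
for t in range(200):
    action = 0.1 * (obs[3:] - obs[:3])      # naive policy: step toward goal
    obs, reward, done = env.step(action)
    if done:
        print(f"reached goal in {t + 1} steps")
        break
```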
10:00 | We’re also making exciting progress in robotic language intelligence. |
10:06 | Using large language model-based input, |
10:09 | my students and our collaborators are among the first teams |
10:13 | that can show a robotic arm performing a variety of tasks |
10:19 | based on verbal instructions, |
10:21 | like opening this drawer or unplugging a charged phone. |
10:26 | Or making sandwiches, using bread, lettuce, tomatoes |
10:31 | and even placing a napkin for the user.
10:34 | Typically I would like a little more for my sandwich, |
10:37 | but this is a good start. |
10:39 | (Laughter) |
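The sketch below is a heavily simplified illustration of the general idea, not the lab's actual system: a language model maps a verbal instruction to a plan over a fixed library of robot skills. The ask_language_model function and the skill functions are hypothetical placeholders standing in for a real LLM call and a real arm controller.

```python
# A heavily simplified sketch of the idea, not the lab's actual system: a
# language model turns a verbal instruction into a plan over a fixed library
# of robot skills. ask_language_model and the skill functions are hypothetical
# placeholders for a real LLM call and a real arm controller.
from typing import Callable, Dict, List

SKILLS: Dict[str, Callable[[], None]] = {
    "open_drawer":   lambda: print("[arm] opening the drawer"),
    "unplug_phone":  lambda: print("[arm] unplugging the phone"),
    "place_bread":   lambda: print("[arm] placing bread"),
    "place_lettuce": lambda: print("[arm] placing lettuce"),
    "place_tomato":  lambda: print("[arm] placing tomato"),
    "place_napkin":  lambda: print("[arm] placing a napkin"),
}

def ask_language_model(instruction: str, skills: List[str]) -> List[str]:
    """Hypothetical LLM call: map an instruction to a sequence of skill names.
    Faked here with a canned answer for the sandwich example."""
    return ["place_bread", "place_lettuce", "place_tomato",
            "place_bread", "place_napkin"]

def execute(instruction: str) -> None:
    plan = ask_language_model(instruction, list(SKILLS))
    for step in plan:
        SKILLS[step]()          # dispatch each planned step to the arm

execute("Make me a sandwich and set out a napkin.")
```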
10:40 | In that primordial ocean, in our ancient times, |
10:46 | the ability to see and perceive one's environment |
10:50 | kicked off the Cambrian explosion of interactions with other life forms. |
10:55 | Today, that light is reaching the digital minds. |
10:59 | Spatial intelligence is allowing machines |
11:03 | to interact not only with one another, |
11:06 | but with humans, and with 3D worlds, |
11:09 | real or virtual. |
11:12 | And as that future is taking shape, |
11:14 | it will have a profound impact on many lives.
11:18 | Let's take health care as an example. |
11:21 | For the past decade, |
11:23 | my lab has been taking some of the first steps |
11:26 | in applying AI to tackle challenges that impact patient outcomes
11:32 | and medical staff burnout. |
11:34 | Together with our collaborators from Stanford School of Medicine |
11:38 | and partnering hospitals, |
11:40 | we're piloting smart sensors |
11:43 | that can detect clinicians going into patient rooms |
11:46 | without properly washing their hands. |
11:49 | Or keep track of surgical instruments. |
11:53 | Or alert care teams when a patient is at physical risk, |
11:57 | such as falling. |
11:59 | We consider these techniques a form of ambient intelligence, |
12:04 | like extra pairs of eyes that do make a difference. |
12:08 | But I would like more interactive help for our patients, clinicians |
12:14 | and caretakers, who also desperately need an extra pair of hands.
12:19 | Imagine an autonomous robot transporting medical supplies |
12:24 | while caretakers focus on our patients |
12:27 | or augmented reality, guiding surgeons to do safer, faster |
12:32 | and less invasive operations. |
12:35 | Or imagine patients with severe paralysis controlling robots with their thoughts. |
12:42 | That's right, brainwaves, to perform everyday tasks |
12:46 | that you and I take for granted. |
12:49 | You're seeing a glimpse of that future in this pilot study from my lab recently. |
12:55 | In this video, the robotic arm is cooking a Japanese sukiyaki meal |
13:00 | controlled only by brain electrical signals,
13:05 | non-invasively collected through an EEG cap. |
13:10 | (Applause) |
13:13 | Thank you. |
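For a rough sense of what non-invasive EEG control involves, here is a toy sketch, not the lab's pilot system: it band-pass filters a synthetic EEG channel, measures its power, and maps that feature to a discrete robot command with a single threshold. The sampling rate, frequency band, and threshold are assumed values, and a real system would use a trained classifier over many channels.

```python
# A toy sketch of the general idea behind non-invasive EEG control, not the
# lab's pilot system: band-pass a (synthetic) EEG channel, measure its power,
# and map it to a discrete robot command with a simple threshold.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250                                   # EEG sampling rate in Hz (assumed)
t = np.arange(0, 2.0, 1.0 / fs)
# Synthetic signal: background noise plus a 10 Hz (alpha-band) rhythm.
eeg = 0.5 * np.random.randn(t.size) + 1.5 * np.sin(2 * np.pi * 10 * t)

# Band-pass filter around the alpha band (8-12 Hz).
b, a = butter(4, [8, 12], btype="bandpass", fs=fs)
alpha = filtfilt(b, a, eeg)
alpha_power = np.mean(alpha ** 2)

# Map the decoded feature to a command; a real system would use a trained
# classifier over many channels rather than one hand-picked threshold.
command = "move_arm" if alpha_power > 0.5 else "hold_still"
print(f"alpha power = {alpha_power:.2f} -> command: {command}")
```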
13:16 | The emergence of vision half a billion years ago |
13:19 | turned a world of darkness upside down. |
13:23 | It set off the most profound evolutionary process: |
13:27 | the development of intelligence in the animal world. |
13:31 | AI's breathtaking progress in the last decade is just as astounding. |
13:37 | But I believe the full potential of this digital Cambrian explosion |
13:42 | won't be fully realized until we power our computers and robots |
13:49 | with spatial intelligence, |
13:51 | just like what nature did to all of us. |
13:55 | It’s an exciting time to teach our digital companion |
13:59 | to learn to reason |
14:00 | and to interact with this beautiful 3D space we call home, |
14:05 | and also create many more new worlds that we can all explore. |
14:11 | To realize this future won't be easy. |
14:14 | It requires all of us to take thoughtful steps |
14:18 | and develop technologies that always put humans in the center. |
14:23 | But if we do this right, |
14:26 | the computers and robots powered by spatial intelligence |
14:29 | will not only be useful tools |
14:32 | but also trusted partners |
14:34 | to enhance and augment our productivity and humanity |
14:39 | while respecting our individual dignity |
14:42 | and lifting our collective prosperity. |
14:45 | What excites me the most
14:49 | is a future in which AI grows more perceptive,
14:54 | insightful and spatially aware, |
14:57 | and joins us on our quest
15:00 | to always pursue a better way to make a better world. |
15:05 | Thank you. |
15:06 | (Applause) |