Transcript of YouTube Video: With Spatial Intelligence, AI Will Understand the Real World | Fei-Fei Li | TED

00:04

Let me show you something.

00:06

To be precise,

00:07

I'm going to show you nothing.

00:10

This was the world 540 million years ago.

00:15

Pure, endless darkness.

00:18

It wasn't dark due to a lack of light.

00:22

It was dark because of a lack of sight.

00:27

Although sunshine did filter 1,000 meters

00:32

beneath the surface of the ocean,

00:35

and light permeated from hydrothermal vents to the seafloor,

00:40

brimming with life,

00:42

there was not a single eye to be found in these ancient waters.

00:47

No retinas, no corneas, no lenses.

00:52

So all this light, all this life went unseen.

00:57

There was a time that the very idea of seeing didn't exist.

01:03

It had simply never been done before.

01:06

Until it was.

01:09

So for reasons we're only beginning to understand,

01:12

trilobites, the first organisms that could sense light, emerged.

01:18

They're the first inhabitants of this reality that we take for granted.

01:24

First to discover that there is something other than oneself.

01:28

A world of many selves.

01:32

The ability to see is thought to have ushered in the Cambrian explosion,

01:37

a period in which a huge variety of animal species

01:41

entered the fossil record.

01:43

What began as a passive experience,

01:46

the simple act of letting light in,

01:50

soon became far more active.

01:53

The nervous system began to evolve.

01:56

Sight turning to insight.

02:00

Seeing became understanding.

02:03

Understanding led to actions.

02:05

And all these gave rise to intelligence.

02:10

Today, we're no longer satisfied with just nature's gift of visual intelligence.

02:17

Curiosity urges us to create machines to see just as intelligently as we can,

02:23

if not better.

02:25

Nine years ago, on this stage,

02:27

I delivered an early progress report on computer vision,

02:32

a subfield of artificial intelligence.

02:35

Three powerful forces converged for the first time.

02:39

A family of algorithms called neural networks.

02:43

Fast, specialized hardware called graphics processing units,

02:48

or GPUs.

02:49

And big data.

02:51

Like the 15 million images that my lab spent years curating called ImageNet.

02:57

Together, they ushered in the age of modern AI.

03:02

We've come a long way.

03:04

Back then, just putting labels on images was a big breakthrough.

03:09

But the speed and accuracy of these algorithms just improved rapidly.

03:14

The annual ImageNet challenge, led by my lab,

03:18

measured this progress.

03:21

And on this plot, you're seeing the annual improvement

03:24

and milestone models.

03:27

We went a step further

03:29

and created algorithms that can segment objects

03:34

or predict the dynamic relationships among them

03:37

in these works done by my students and collaborators.

03:41

And there's more.

03:43

Recall last time I showed you the first computer-vision algorithm

03:47

that could describe a photo in human natural language.

03:52

That was work done with my brilliant former student, Andrej Karpathy.

03:57

At that time, I pushed my luck and said,

03:59

"Andrej, can we make computers to do the reverse?"

04:02

And Andrej said, "Ha ha, that's impossible."

04:05

Well, as you can see from this post,

04:07

recently the impossible has become possible.

04:11

That's thanks to a family of diffusion models

04:15

that powers today's generative AI algorithms,

04:18

which can take human-prompted sentences

04:22

and turn them into photos and videos

04:25

of something that's entirely new.

04:28

Many of you have seen the recent impressive results of Sora by OpenAI.

04:34

But even without the enormous number of GPUs,

04:37

my student and our collaborators

04:40

had developed a generative video model called Walt

04:44

months before Sora.

04:47

And you're seeing some of these results.

04:50

There is room for improvement.

04:53

I mean, look at that cat's eye

04:55

and the way it goes under the wave without ever getting wet.

04:59

What a cat-astrophe.

05:01

(Laughter)

05:04

And if past is prologue,

05:07

we will learn from these mistakes and create a future we imagine.

05:11

And in this future,

05:13

we want AI to do everything it can for us,

05:17

or to help us.

05:19

For years I have been saying

05:22

that taking a picture is not the same as seeing and understanding.

05:26

Today, I would like to add to that.

05:30

Simply seeing is not enough.

05:33

Seeing is for doing and learning.

05:36

When we act upon this world in 3D space and time,

05:41

we learn, and we learn to see and do better.

05:46

Nature has created this virtuous cycle of seeing and doing

05:50

powered by “spatial intelligence.”

05:54

To illustrate to you what your spatial intelligence is doing constantly,

05:58

look at this picture.

05:59

Raise your hand if you feel like you want to do something.

06:02

(Laughter)

06:04

In the last split second,

06:06

your brain looked at the geometry of this glass,

06:09

its place in 3D space,

06:12

its relationship with the table, the cat

06:15

and everything else.

06:16

And you can predict what's going to happen next.

06:20

The urge to act is innate to all beings with spatial intelligence,

06:27

which links perception with action.

06:30

And if we want to advance AI beyond its current capabilities,

06:36

we want more than AI that can see and talk.

06:39

We want AI that can do.

06:42

Indeed, we're making exciting progress.

06:46

The recent milestones in spatial intelligence

06:50

are teaching computers to see, learn, do

06:54

and learn to see and do better.

06:57

This is not easy.

06:59

It took nature millions of years to evolve spatial intelligence,

07:04

which depends on the eye taking in light,

07:07

projecting 2D images onto the retina,

07:09

and the brain translating these data into 3D information.

07:14

Only recently, a group of researchers from Google

07:17

was able to develop an algorithm that takes a bunch of photos

07:22

and translates them into 3D space,

07:26

like the examples we're showing here.

07:29

My student and our collaborators have taken a step further

07:33

and created an algorithm that takes one input image

07:38

and turns that into a 3D shape.

07:40

Here are more examples.

07:43

Recall, we talked about computer programs that can take a human sentence

07:49

and turn it into videos.

07:51

A group of researchers at the University of Michigan

07:55

have figured out a way to translate a line of text

07:59

into a 3D room layout, as shown here.

08:03

And my colleagues at Stanford and their students

08:06

have developed an algorithm that takes one image

08:10

and generates infinitely plausible spaces

08:14

for viewers to explore.

08:17

These prototypes are the first budding signs of a future possibility.

08:23

One in which the human race can take our entire world

08:29

and translate it into digital forms

08:32

and model the richness and nuances.

08:35

What nature did to us implicitly in our individual minds,

08:40

spatial intelligence technology can hope to do

08:44

for our collective consciousness.

08:47

As the progress of spatial intelligence accelerates,

08:51

a new era in this virtuous cycle is taking place in front of our eyes.

08:56

This back and forth is catalyzing robotic learning,

09:01

a key component for any embodied intelligence system

09:06

that needs to understand and interact with the 3D world.

09:12

A decade ago,

09:14

ImageNet from my lab

09:16

provided a database of millions of high-quality photos

09:20

to help train computers to see.

09:23

Today, we're doing the same with behaviors and actions

09:28

to train computers and robots how to act in the 3D world.

09:34

But instead of collecting static images,

09:37

we develop simulation environments powered by 3D spatial models

09:43

so that the computers can have infinite varieties of possibilities

09:48

to learn to act.

09:50

And you're just seeing a small number of examples

09:55

to teach our robots

09:57

in a project led by my lab called Behavior.

10:00

We’re also making exciting progress in robotic language intelligence.

10:06

Using large language model-based input,

10:09

my students and our collaborators are among the first teams

10:13

that can show a robotic arm performing a variety of tasks

10:19

based on verbal instructions,

10:21

like opening this drawer or unplugging a charged phone.

10:26

Or making sandwiches, using bread, lettuce, tomatoes

10:31

and even placing a napkin for the user.

10:34

Typically I would like a little more for my sandwich,

10:37

but this is a good start.

10:39

(Laughter)

10:40

In that primordial ocean, in our ancient times,

10:46

the ability to see and perceive one's environment

10:50

kicked off the Cambrian explosion of interactions with other life forms.

10:55

Today, that light is reaching the digital minds.

10:59

Spatial intelligence is allowing machines

11:03

to interact not only with one another,

11:06

but with humans, and with 3D worlds,

11:09

real or virtual.

11:12

And as that future is taking shape,

11:14

it will have a profound impact on many lives.

11:18

Let's take health care as an example.

11:21

For the past decade,

11:23

my lab has been taking some of the first steps

11:26

in applying AI to tackle challenges that impact patient outcomes

11:32

and medical staff burnout.

11:34

Together with our collaborators from Stanford School of Medicine

11:38

and partnering hospitals,

11:40

we're piloting smart sensors

11:43

that can detect clinicians going into patient rooms

11:46

without properly washing their hands.

11:49

Or keep track of surgical instruments.

11:53

Or alert care teams when a patient is at physical risk,

11:57

such as falling.

11:59

We consider these techniques a form of ambient intelligence,

12:04

like extra pairs of eyes that do make a difference.

12:08

But I would like more interactive help for our patients, clinicians

12:14

and caretakers, who desperately also need an extra pair of hands.

12:19

Imagine an autonomous robot transporting medical supplies

12:24

while caretakers focus on our patients

12:27

or augmented reality, guiding surgeons to do safer, faster

12:32

and less invasive operations.

12:35

Or imagine patients with severe paralysis controlling robots with their thoughts.

12:42

That's right, brainwaves, to perform everyday tasks

12:46

that you and I take for granted.

12:49

You're seeing a glimpse of that future in this pilot study from my lab recently.

12:55

In this video, the robotic arm is cooking a Japanese sukiyaki meal

13:00

controlled only by brain electrical signals,

13:05

non-invasively collected through an EEG cap.

13:10

(Applause)

13:13

Thank you.

13:16

The emergence of vision half a billion years ago

13:19

turned a world of darkness upside down.

13:23

It set off the most profound evolutionary process:

13:27

the development of intelligence in the animal world.

13:31

AI's breathtaking progress in the last decade is just as astounding.

13:37

But I believe the full potential of this digital Cambrian explosion

13:42

won't be fully realized until we power our computers and robots

13:49

with spatial intelligence,

13:51

just like what nature did to all of us.

13:55

It’s an exciting time to teach our digital companions

13:59

to learn to reason

14:00

and to interact with this beautiful 3D space we call home,

14:05

and also create many more new worlds that we can all explore.

14:11

To realize this future won't be easy.

14:14

It requires all of us to take thoughtful steps

14:18

and develop technologies that always put humans in the center.

14:23

But if we do this right,

14:26

the computers and robots powered by spatial intelligence

14:29

will not only be useful tools

14:32

but also trusted partners

14:34

to enhance and augment our productivity and humanity

14:39

while respecting our individual dignity

14:42

and lifting our collective prosperity.

14:45

What excites me the most

14:49

is a future in which AI grows more perceptive,

14:54

insightful and spatially aware,

14:57

and joins us on our quest

15:00

to always pursue a better way to make a better world.

15:05

Thank you.

15:06

(Applause)