VR’s Grand Challenge: Michael Abrash on the Future of Human Interaction
Oculus Blog
Posted by Michael Abrash
July 24, 2017

Oculus Chief Scientist Michael Abrash recently presented at the third Global Grand Challenges Summit, sponsored by the US National Academy of Engineering, the UK Royal Academy of Engineering, and the Chinese Academy of Engineering, in Washington, DC. Today, we’re excited to share his full, unedited talk with the VR community.

I’m delighted to be here today to talk about the grand challenge of virtual reality—and it truly is a grand challenge, in terms of time scale, breadth and depth of technology, and potential impact on the way we live.

VR dates all the way back to Ivan Sutherland’s Sword of Damocles in 1968, and yet almost half a century later, we’ve only just started down the long road to what VR will ultimately be. And it’s when we discuss what VR will ultimately be that it becomes clear how great its potential impact is. The place to start is with the nature of human experience.

The reality we experience is constructed in our minds, based on a great many assumptions built into our genes and learned during our lifetimes, together with the very sparse data that comes in through our senses.

All Reality Is Virtual.
That’s a strong statement, and it’s not obvious if you haven’t thought about it before, so I’ll say it again—the reality we experience is a construct in our minds, based on highly incomplete data. It generally matches the real world well, which isn’t surprising, evolutionarily speaking, but it’s not a literal reflection of reality—it’s just an inference of the most probable state of the world, given what we know at any one time.

Let’s look at a few cases that illustrate that our perception of reality is actually just a best guess.

See the white tile under the desk, and the black tiles just outside the desk?

Now let’s mask off everything else.

They’re all exactly the same shade of grey.

But if one of them is in shadow and is that shade of grey, it must be white, and if another is in bright light and is that shade of grey, it must be black. Intensity is an inference based on context; your visual system does the inference for you automatically, and what you actually see is white and black rather than grey.

Here’s another example. Take a moment and figure out which of the two tabletops is wider, measured as a 2D shape on the screen, and which is longer, assuming you rotated them to line up.


They’re exactly the same size. Like intensity, size is an inference based on context.

Now let’s look at a few entertaining cases of higher-level inferences that don’t match reality.

Clearly what we’re seeing can’t be right.

Several cues on the window imply a perspective that doesn’t exist, so your visual system concludes that the window must be spinning backward for half of a full rotation. In order for that to be correct, the straw has to rotate right through the window, so that’s what you see, even though it’s not only not happening, it’s actually impossible.

Let’s look at another.

Again, this can’t be right.

Once more, our perceptual system is making a very reasonable assumption—in this case that objects, especially faces, tend to be convex.

Here’s the really fun part—try to keep the face from turning convex again.

Some people can do it, but it’s hard to keep it from snapping back, even though you know the real shape.

Finally, we come to what I consider to be the single most convincing illustration that the reality we experience is nothing more than a best guess. First, play this video:

Obviously, she’s saying “Bar, bar, bar.” Now let’s watch another clip:

Here, we clearly hear her saying “Far, far, far.” The interesting thing, though, is that she isn’t saying “Far”—the video shows her saying “Far,” but the audio track is of her saying “Bar,” exactly as she does in the first video. I’ll repeat that, because it’s hard to believe—she’s saying “Bar,” not “Far.” And yet we hear her saying “Far,” because the visuals imply that sound.

That may be a little confusing, or it may feel like a trick, so let’s look at this a different way. Once again the soundtrack will say “Bar,” but this time there will be a split screen, with a face saying “Far” on one side and a face saying “Bar” on the other. As this plays, move your eyes from one side to the other, and observe how what you hear changes. Again—move your eyes back and forth between the faces, and see what you hear.

In my opinion, it’s impossible to experience the McGurk effect and not believe that the reality you experience is an inference, not a literal reflection of the real world. When you heard “Far,” a Fourier transform would have revealed no trace of that sound in this room—the sound “Far” never hit your eardrums, but it was the most probable sound given both the visual and auditory evidence, so you heard it.

This is the critical point for VR: The reality we experience is whatever our minds infer it to be based on perceptual inputs, regardless of the source. So if VR can provide the right perceptual inputs, we can have whatever experiences we want, and those experiences will feel real—they’ll be genuine experiences.

I came to understand this the first time I stood at the edge of a virtual drop, felt my knees lock, and had an overpowering urge to back up. My conscious mind knew I was nowhere near a real drop, but nonetheless, my personal reality was that I was at risk of falling. Now extend that to teleporting around the world, using virtual objects, and interacting with anyone, anywhere, and the potential power of VR starts to become obvious.

One example that I personally want, and that I think resonates with many people, is a virtual workspace, with completely configurable virtual displays, holograms, and the ability to switch between workspaces instantly. Other people could teleport in to talk and work with me, and I could teleport into their workspaces. I’d be much more productive, and work would be more fun—just like when I first got a personal computer.

In fact, there’s a direct analogy with the personal computer here. More than 40 years ago, JCR Licklider’s vision and Xerox PARC’s work to create the personal computer—especially that of the Computer Science Lab under the late Bob Taylor—led to the computing devices we all use today. That was the first great leap in human-oriented computing.

I believe that VR will be the second great leap. Instead of interacting with the digital world through flat screens, we’ll be able to live in the digital space whenever we want.

What It’s Going to Take to Get There
So that’s my vision of how VR can change the world, but a great many enormously challenging technical advances will be needed in order to get there. Let’s look at what it’s going to take for VR to continue down the path to being a key part of how we work, play, and connect with one another.

Since VR is about driving the perceptual system, the place to start is with the senses: vision, audio, haptics, smell, taste, and the vestibular sense. In my opinion, VR isn’t going to implement the last three in the foreseeable future, but vision, audio, and haptics work to varying extents today, and there are potential paths forward for all three.

For vision, we need to increase the field of view to the full human range, increase resolution and clarity to the retinal limit, increase dynamic range to real-world levels, and implement proper depth of focus.

Audio needs proper spatialization (your sense of where sound is coming from), full spatial propagation (how sound moves around a virtual space), and synthesis (generating sounds from modeling of physical motions and collisions).

Haptics is particularly challenging. The haptics that would matter most would be for the hands, which are our primary means of interaction with the world, and which rely on haptics for their feedback loop. All we can do right now is produce crude vibrations and forms of resistance. Someday, perhaps there will be some sort of glove or exoskeleton that can let us interact naturally with virtual objects, but that’s a true research problem.

In addition to getting virtual information into the perceptual system, VR also needs machine perception: the ability to sense, reconstruct, and understand the real world. That will let us move around safely and bring real-world objects like desks, keyboards, and furniture into the virtual world, potentially reskinning them. It would be even more valuable to bring real humans into the virtual world. That would enable true telepresence, where you could meet, work, play games, and basically do anything with people anywhere in the world.

I believe this will be the most important factor in making VR more widespread, because people are the most interesting thing in the world to other people. Unfortunately, we're also highly sensitive to the nuances of other people, so enabling virtual humans is one of the hardest parts of VR.

Finally, VR is the most comprehensively perceptual technology ever developed, so underlying everything is the puzzle of human perception. The key to VR isn’t the technology that’s developed, but how that technology interacts with the perceptual system to create experiences.

Taken together, the broad areas in which VR needs to advance form an enormous research space, one that covers all of human perception and half a dozen areas of sensing and reconstruction. Exploring that space will require world-class research in areas ranging from computational optics to material science to sensor technology and much more. It will also require a great deal of multidisciplinary work, because it’s the intersection of multiple technologies that makes VR work.

As one example, consider the virtual workspace I outlined above. You’d obviously need to be able to use your hands as dexterous manipulators in order for it to be as productive as the real world, and that’s certainly a difficult problem by itself. But let’s imagine we somehow solved it. Then we run into the problem that VR headsets have fixed lenses focused at two meters, which has the effect of making everything within one meter blurry and uncomfortable to look at for extended periods, and one meter is everything within arm’s length.

In short, until we solve depth of focus, we can’t get full value out of enabling hands. Similarly, we’d want proper spatialization of sounds that originate within arm’s length, and that’s another unsolved problem. We’d want high enough resolution so that virtual monitors were as sharp as real ones, we’d want to be able to sense and reconstruct our desk, keyboard, mouse, and chair, and we’d want to be able to do virtual humans—before long, you realize that pretty much every research area I’ve mentioned is required in order to build a system that can deliver the right experience.

Let’s quickly take a look at three of the many challenges of VR, starting with displays.

In Focus: VR Displays
The display system in a VR headset today is essentially just a screen and a magnifying glass. When you look through the lens, what you’re seeing is a single magnified image at a single focal distance.

The question is: Where do we place that fixed focus?

On the right, we’ve placed the VR focus near infinity—out the window. So the right side and the left side, virtual and real reality, look pretty similar. There are some differences, but let’s ignore them for now.

The big difference is that, unlike the real world on the left, when you look at something up close in VR on the right, the nearest plant, which should be sharp, is now blurry, and so is everything else, because the screen is focused far away and your eye is focused up close.

So we need a better way of focusing headsets.

I don’t have time to go through all the science, but I can at least visually take you through a few of the potential solutions that have been proposed over the last few decades.

Consider a simple 3D game scene.

In optometric units, this would extend from four diopters—25 centimeters at the front of the car—to zero diopters—optical infinity.

Again, to make it clear, today’s VR headsets have a single focus at, say, half a diopter.

Obviously things very close will be very far from the focal plane, and therefore blurry.
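To make the units concrete, here is a minimal Python sketch of the arithmetic behind those numbers. It assumes only the standard definition of a diopter as the reciprocal of distance in meters; the function names and the scene labels are illustrative, not from any headset SDK.

```python
import math

def diopters(distance_m: float) -> float:
    """Convert a viewing distance in meters to diopters (D = 1 / distance)."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return 1.0 / distance_m  # 1 / inf evaluates to 0.0, i.e. optical infinity

def defocus(object_m: float, focus_m: float) -> float:
    """Absolute focus error, in diopters, between an object and the focal plane."""
    return abs(diopters(object_m) - diopters(focus_m))

HEADSET_FOCUS_M = 2.0  # a fixed focus of half a diopter, as in the talk

# Distances taken from the talk; the labels are only for printing.
scene = {
    "front of the car": 0.25,
    "arm's length": 1.0,
    "headset focal plane": 2.0,
    "optical infinity": math.inf,
}
for name, distance_m in scene.items():
    power = diopters(distance_m)
    error = defocus(distance_m, HEADSET_FOCUS_M)
    print(f"{name}: {power:.2f} D, defocus {error:.2f} D")
```

Measured this way, the front of the car sits several diopters from the focal plane while optical infinity is only half a diopter away, which is why near objects suffer most.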

One idea, and a lot of people have proposed this, is to just have more than one plane of focus, displayed simultaneously or in rapid succession.

Perceptual scientists will tell you you don’t want those layers to be too far apart, or things will get blurry between them, so you’re going to have trouble creating enough planes to get everything in a four-diopter range in focus.
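As a rough illustration of why the plane count grows quickly, the sketch below spreads fixed planes evenly over a diopter range given a maximum allowed spacing. The 0.6-diopter spacing is purely an illustrative assumption (the talk does not give a figure), and `plane_layout` is a hypothetical helper, not part of any real system.

```python
import math

def plane_layout(near_d: float, far_d: float, max_spacing_d: float) -> list[float]:
    """Evenly spaced focal planes (in diopters) covering [far_d, near_d].

    Each plane sits at the center of an equal-width band, so no point in the
    range is more than half a band away from its nearest plane.
    """
    span = near_d - far_d
    count = max(1, math.ceil(span / max_spacing_d))
    band = span / count
    return [far_d + (i + 0.5) * band for i in range(count)]

ASSUMED_SPACING_D = 0.6  # illustrative assumption, not a figure from the talk

# Cover the four-diopter scene: optical infinity (0 D) to 25 cm (4 D).
planes = plane_layout(near_d=4.0, far_d=0.0, max_spacing_d=ASSUMED_SPACING_D)
print(len(planes), [round(p, 2) for p in planes])
```

Tightening the assumed spacing increases the count proportionally, and every extra plane has to be displayed simultaneously or in rapid succession.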

That’s fine—the next idea would be to adapt those planes.

Researchers at Ricoh recently attempted this, and they showed that, yes, you could move those planes around if you have the right adaptive optics, but things between the planes will get blurry. So Nathan Matsuda, Alex Fix, and Doug Lanman at Oculus Research looked at this background and said, “Well, rather than having many more planes, let’s look at making each plane more capable. Let’s get rid of some planes, and bend the others.”

So if we use even more complex adaptive optics, now we can have these bendy surfaces so that one or just a few of them can touch every object in the scene.

First let’s look at a simulation.

First the distant background is in focus, and then it’s the foreground that's in focus.

We can look back and forth, and things come into correct focus, with correct defocus blur away from the focal plane.

So it seems like this idea has merit.

Of course, simulation always works, so we’ve also built a headset-like test rig.

These are actual images recorded with a camera. With today’s spatial light modulators, when you put a real camera into the prototype, the contrast is reduced. The team is working on improving that, but it does work.

First a far object is in focus, and then it’s a near object.

Near, far—we can focus wherever we want without eye tracking.

And if you now compare the modern fixed-focus display on the left to the adaptive-focus display on the right, you can see that this is a potentially exciting new way to bring things into focus in VR.

Problem Pupils
The second area I’m going to look at is eye tracking. Eye tracking is a key VR technology, especially as a foundation for many types of computational optics.

The state of the art in eye tracking is based on tracking pupils and glints off the cornea.

That shows how it looks when pupil tracking works well, but pupils can vary wildly.

Pupils also change size and can even change shape—not necessarily in tandem.

Glint tracking helps compensate for the limitations of pupil tracking, but eyelids can still cause problems. On top of that, illuminators and cameras have to fit into a headset and be positioned so that tracking works across the full range of eye motion, for deep eye sockets, flat faces, and bulging eyes alike, and is 100% reliable in all those cases.

Also, the eye isn’t actually a rigid organ:

It’s a little subtle, so watch it again, paying attention to the shape of the pupil as the eye stops moving.

The real problem is that the current state of the art in eye tracking tries to infer where photons are landing on the retina based on the pupil position and glints off the cornea. The right solution is to track features directly on the retina, and the really right solution is to look at the image that lands on the retina—but doing that across the full range of eye motion in a head-mounted display will require the development of an entirely new type of eye tracking technology.
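For readers unfamiliar with the pupil-and-glint approach, here is a deliberately simplified Python sketch of the inference step it relies on: a calibration maps the vector from the glint to the pupil center onto a gaze angle. The per-axis linear model, the function names, and the synthetic calibration data are all assumptions made for illustration; real trackers use richer models and are not described in the talk.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        raise ValueError("calibration points must vary along each axis")
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx

def pupil_glint_vector(pupil, glint):
    """Pupil center minus glint position, both in image pixels."""
    return (pupil[0] - glint[0], pupil[1] - glint[1])

def calibrate(samples):
    """samples: (pupil_px, glint_px, known_gaze_deg) triples from a calibration routine."""
    vecs = [pupil_glint_vector(p, g) for p, g, _ in samples]
    gazes = [gaze for _, _, gaze in samples]
    fit_x = fit_line([v[0] for v in vecs], [g[0] for g in gazes])
    fit_y = fit_line([v[1] for v in vecs], [g[1] for g in gazes])
    return fit_x, fit_y

def estimate_gaze(model, pupil, glint):
    """Infer gaze angles from image features; the retina is never observed."""
    (ax, bx), (ay, by) = model
    vx, vy = pupil_glint_vector(pupil, glint)
    return (ax * vx + bx, ay * vy + by)

# Synthetic calibration data (made up): glint fixed at (100, 100) pixels,
# gaze generated as x = 2 * vx + 1 and y = 3 * vy - 2 degrees.
offsets = [(-5, -5), (0, 0), (5, 5), (5, -5), (-5, 5)]
samples = [((100 + vx, 100 + vy), (100, 100), (2 * vx + 1, 3 * vy - 2))
           for vx, vy in offsets]
model = calibrate(samples)
gaze = estimate_gaze(model, pupil=(103, 98), glint=(100, 100))
print(gaze)
```

Even in this toy form, the estimate is only as good as the assumed mapping from surface features to gaze, which is the indirection the paragraph above identifies as the real problem.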

Real Humans in Virtual Space
The third area I’m going to briefly look at is virtual humans—representing real humans in virtual space. As I said, I believe this will be the single biggest reason for the widespread adoption of VR.

Creating compelling virtual humans will require the integration of at least four separate tracking technologies, each of which is immature today. We’ve already talked about the first, eye tracking, so let's look at hand tracking.

Here’s what perfect hand tracking looks like:

Unfortunately, hands have about 25 degrees of freedom and lots of self-occlusion. Right now, retroreflector-covered gloves and lots of cameras are needed to get to this level of tracking quality.

Faces are the most expressive part of the body, with a great deal of subtle flexibility, and are perhaps the greatest of all human tracking problems.

This is an illustration of roughly where real-time headset-based face tracking is—progress is being made, but there’s still a long way to go.

Good real-time skeletal body tracking is now becoming possible, although a lot of work is still needed to make it truly robust. Literal rather than skeletal tracking using consumer-friendly cameras remains solidly in the realm of challenging research.

The underlying technology for virtual humans offers plenty of interesting research questions, but the really interesting question is, “What is it that makes an avatar a uniquely convincing person?”

The answer to that lies in the realm of perceptual science and the psychology of social interactions, and the place to start on that is by gathering lots of data. Yaser Sheikh has done this with his Panopticon at Carnegie Mellon; let’s look at an example of the sort of research that’s enabled.

This is work done by Tomas Simon, and it’s very cool—but it took two hours to process each second of this video, so it’s a long way from real-time.

New Frontiers
These are just some of the challenges VR faces today. It will take many years to fully solve each of them. And of course there are many other challenges, like haptic interaction—to say nothing of full-body haptics—as well as smell and, someday, the vestibular sense and taste. In short, VR is a truly vast space waiting to be explored, and much more research attention needs to be focused on it. Without a doubt, decades of innovation lie ahead.

VR is a grand challenge in the purest sense. Obviously, it’s enormously difficult, requiring research and development across dozens of technologies, but that’s only half of the story. VR is the culmination of more than 70 years of the computer revolution and centuries of development of information technology. We’re finally capable of building an interface that lets us interact with the digital world using a significant fraction of the full bandwidth and biological processing that we’re evolved for.

VR has the potential to vastly expand the range of human experiences—and if it’s successful, it will surely be one of the most important technologies of our time.