New AI tech to bring human-like understanding of our 3D world
Read how AIML's leading research is bridging the 3D/2D domain gap.
Humans move effortlessly around our rich and detailed three-dimensional world without a second thought. But, like most mammals, our eyes actually sense the world two-dimensionally; it's our brains that take those 2D images and interpret them into a 3D understanding of reality.
Even without the stereo visual input from our two eyes, we're experts at looking at a flat 2D image and instantly "lifting" it back to its 3D origins; we do it every time we watch TV or look at photos on our phone.
But computers and robots have a much harder time doing this "lifting". It's a problem that AI researchers are working hard to fix.
Making computers able to understand 3D space from only 2D input is considered such an important capability, with diverse applications ranging from mobile phones to driverless vehicles, that Professor Simon Lucey, director of the Australian Institute for Machine Learning (AIML), has received a $435,000 grant from the Australian Research Council to build a geometric reasoning system that can exhibit human-like performance.

Contrary to popular belief, humans can't actually sense the world in 3D. Our brains interpret the stereoscopic 2D input from our eyes into a 3D understanding of the world. Photo: iStock / Mark Kuiken.
"When cameras try to sense the world, like humans do, what's coming into the robot is still just 2D. It's missing that component that we have in our brains that can lift it out to 3D; that's what we're trying to give it," Lucey says.
If 3D understanding from normal cameras is so difficult, why not instead equip computer vision systems with proper 3D sensors like LiDAR, a sensing method that uses lasers? It's not that easy. Building and improving hardware technology is slow and expensive, and often out of reach for the many smaller tech startups seeking to commercialise AI research.
"You could take ten years and billions of dollars and it would still be very, very risky to generate… but when you're doing something in software, you can deploy it straight away, and you can continually update and make it better," Lucey explains.
AIML researchers are among the world's leaders in computer vision, a field of AI that enables computers to obtain meaningful information from digital images and video footage.
Building computer vision systems that can understand the real world typically requires vast troves of labelled training data, an approach called supervised machine learning. That means millions of images, each labelled "dog", "strawberry" or "President Obama"; or thousands of hours of driving footage where coloured boxes are drawn to mark each pedestrian, stop sign and traffic light. If you've ever had a website ask you to "click all the squares with bicycles" to prove you're really human, you've helped train a supervised machine learning model.
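The core idea of supervised learning can be sketched in a few lines: a model is shown examples paired with human-provided labels, and learns to predict those labels for new inputs. The "dog"/"strawberry" feature vectors and nearest-centroid rule below are purely illustrative toys, not a real vision model.

```python
# A minimal sketch of supervised learning (illustrative only):
# the model knows nothing except what the human labels tell it.
training_data = {
    "dog":        [[0.9, 0.1], [0.8, 0.2]],   # made-up feature vectors
    "strawberry": [[0.1, 0.9], [0.2, 0.8]],
}

def centroid(points):
    # Average the labelled examples for one class.
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

centroids = {label: centroid(pts) for label, pts in training_data.items()}

def classify(features):
    # Predict the label whose training centroid is closest.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist(centroids[lbl], features))

print(classify([0.85, 0.15]))  # -> dog
```

Real systems replace the hand-made features and centroid rule with deep neural networks, but the supervision principle is the same: the labels drive the learning.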
AI researchers are now taking these vast collections of labelled 2D training data and working out how to apply them so AI systems can develop a 3D geometric understanding similar to our own.
"How can I take 2D supervision that humans can easily provide," asks Professor Lucey, "and, using some elegant math, allow it to act as 3D supervision for modern AI systems?"
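One standard piece of math used across the field for exactly this (sketched here as an assumption, not necessarily the paper's formulation) is a reprojection loss: project the model's 3D guess back into the image and penalise its distance from the 2D point a human annotated. The camera intrinsics and numbers below are made up for illustration.

```python
import numpy as np

# Illustrative pinhole camera intrinsics (focal length, principal point).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def reproject(X):
    # Project a 3D point into 2D pixel coordinates.
    x = K @ X
    return x[:2] / x[2]  # perspective divide

X_guess = np.array([0.1, -0.2, 3.0])   # model's 3D estimate of a joint
label_2d = np.array([340.0, 205.0])    # the 2D point a human could click

# 2D annotation acting as 3D supervision: the squared reprojection error.
loss = np.sum((reproject(X_guess) - label_2d) ** 2)
```

Minimising this loss nudges the 3D estimate until its 2D shadow matches the human label, so cheap 2D clicks end up constraining 3D geometry.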
One application of this kind of computer vision is 3D motion capture, where earlier advances brought us Gollum in The Lord of the Rings films. It's still a popular technique, widely used in film visual effects, video game production and even medicine and sports science. But even today it requires a number of expensive and finely calibrated cameras, and sometimes people still need to wear special reflective dots on their bodies and perform in front of a green screen, and that's a problem.
"People want the data they're collecting to be realistic… they don't want a white background. They don't want a green screen. They would love to be out in the field, or in areas that are highly unconstrained. And the sheer cost of this limits the application of technology at the moment," says Professor Lucey. "You can only apply it to problems where companies are willing to invest millions to build these things."

Traditional motion capture systems require as many as 40 to 60 finely calibrated cameras to accurately track a person's movement in 3D space. New developments in computer vision AI could reduce that to just two or three. Photo: iStock / Anake Seenadee
But in a 2021 project that saw Professor Lucey collaborate with researchers from Apple and Carnegie Mellon University[1], the team demonstrated a new AI method for 3D motion capture that could make the technology far more accessible and affordable.
"The work we've done on this paper has tried to ask the question: how few cameras could we get away with if we were willing to use AI to do this 3D lifting trick?"
The team used something called a neural prior: a mathematical way of giving an AI system an initial set of beliefs, expressed as a probability distribution, before any real data is provided.
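As a toy illustration of the idea (a hand-written stand-in, not the learned neural prior from the paper), a prior can be as simple as a probability distribution that favours smooth motion: before seeing any footage, jerky trajectories are scored as less likely than steady ones.

```python
import numpy as np

def smoothness_log_prior(traj, sigma=0.1):
    # Second differences approximate acceleration; a Gaussian log-density
    # (up to a constant) penalises jerky motion. Illustrative only.
    accel = np.diff(traj, n=2, axis=0)
    return -np.sum(accel ** 2) / (2 * sigma ** 2)

smooth = np.linspace(0.0, 1.0, 10)                              # steady motion
jerky = smooth + np.random.default_rng(0).normal(0, 0.2, 10)    # noisy motion

# The prior assigns higher belief to the smooth trajectory.
print(smoothness_log_prior(smooth) > smoothness_log_prior(jerky))  # True
```

When real 2D observations arrive, they are combined with this initial belief, so the system can rule out physically implausible 3D reconstructions even from very little data.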
As a result, the new method can perform 3D motion capture from normal video footage (no green screens or special reflective dots required) using only two or three uncalibrated camera views. It delivers 3D reconstruction accuracy similar to what would otherwise require as many as 40-60 cameras using earlier methods.
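The classical version of lifting from multiple views, textbook linear triangulation with known (calibrated) cameras, hints at why a couple of viewpoints can pin down a 3D point; the paper's advance is achieving similar lifting when the cameras are uncalibrated. The camera matrices below are illustrative, not from the paper.

```python
import numpy as np

# Two simple pinhole cameras observing the same 3D point (illustrative).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # shifted along x

X = np.array([0.5, 0.2, 4.0, 1.0])  # ground-truth 3D point (homogeneous)

def project(P, X):
    # Project a homogeneous 3D point to 2D pixel coordinates.
    x = P @ X
    return x[:2] / x[2]

def triangulate(P1, P2, x1, x2):
    # Linear triangulation (DLT): each view contributes two linear
    # constraints on X; the solution is the null vector of A.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    return Xh / Xh[3]  # normalise the homogeneous coordinate

x1, x2 = project(P1, X), project(P2, X)
X_hat = triangulate(P1, P2, x1, x2)
print(np.allclose(X_hat, X, atol=1e-6))  # True: 3D point recovered
```

With calibrated cameras this recovery is exact; the hard part the AI method tackles is keeping it accurate when the camera positions and settings are not precisely known.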
Professor Lucey highlights the importance of AI research that focuses on finding efficiencies and significant cost breakthroughs as a way of bringing technology to those who'd otherwise not have been able to afford it.
"It's democratic AI. You could be a small startup and you could use this, whereas with other methods you'd need to be very well resourced financially," he said.
The potential applications are broad, and not just related to capturing humans in motion: everything from mobile phone filters and autonomous vehicles to wildlife conservation, improved robots and even space satellites.
[1] The paper was presented at the 2021 International Conference on 3D Vision, 1 December 2021.