We all have some idea of the power of artificial intelligence to analyze 2D images. The most obvious examples are the facial recognition software and augmented reality applications we use to add everything from sunglasses to dreadlocks to our faces for our own amusement.
While seemingly impressive, this technology is greatly limited because it cannot understand images in a three-dimensional context. So, while these AI and ML systems can distinguish between the parts of our faces, they cannot judge the depth of the objects they see.
This limitation has enormous implications. To understand its importance, just think of how much your life would be affected if you could only use one eye, something that greatly impairs your ability to judge depth.
In the field of AI-assisted filmmaking, this limitation restricts the level of analysis that conventional systems can undertake. Current systems, for example, cannot analyze how the use of depth contributes to a scene or how different camera perspectives change its impact.
All that is now about to change thanks to a new technique in computer vision that is set to enhance AI’s three-dimensional understanding of two-dimensional images.
Virtual Correspondence and 3D Vision
‘Virtual correspondence’ is a new technique developed by a group of researchers at MIT. In a recent article entitled ‘Seeing the whole from some of the parts’, MIT News broke the story that a group of the institute’s AI engineers had developed a “method of 3D reconstruction that works even with images taken from extremely different views that do not show the same features.”
https://news.mit.edu/2022/seeing-whole-from-some-parts-0617
Traditional methods have, up until now, relied on the ‘Structure from Motion’ approach, which requires two images that share some of the same features (e.g., a house or a car) in order to work. Such approaches rely on the same ‘triangulation’ method that we humans use to judge distance.
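To make that requirement concrete, here is a minimal sketch of classical two-view triangulation in Python. It is only illustrative: the camera matrices and the toy 3D point are made-up stand-ins, whereas in a real Structure from Motion pipeline the projection matrices come from camera calibration and the matched pixel coordinates from features detected in both overlapping images.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Recover a 3D point from its projections in two overlapping views.

    P1, P2 : (3, 4) camera projection matrices.
    x1, x2 : (2,)  pixel coordinates of the SAME feature in each image.
    """
    # Direct Linear Transform: each view contributes two linear
    # constraints on the homogeneous 3D point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Two toy pinhole cameras one metre apart, both looking down the z-axis.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

# Project a known 3D point into both views, then recover it.
X_true = np.array([0.5, 0.2, 4.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
print(triangulate_point(P1, P2, x1, x2))  # ~ [0.5, 0.2, 4.0]
```

Note that the method only works because the same point is visible in both images; remove that overlap and there is nothing to triangulate, which is exactly the gap virtual correspondence addresses.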
Wei-Chiu Ma, a Ph.D. student in MIT’s Department of Electrical Engineering and Computer Science (EECS), was one of the lead researchers behind this breakthrough. Ma explains the idea behind this giant leap forward: “We want to incorporate human knowledge and reasoning into our existing 3D algorithms”.
The reasoning behind this occurred to Ma while he was staring at his hands one day. Even though he could not see his fingernails from a particular angle, he knew they were there. He reasoned that if an AI system were given this kind of understanding of objects, the two images would no longer need to show the same features: a point in the first image, such as one side of a house, could be matched with the unseen, opposite side of the house in the second image by reasoning ‘through’ the object.
Ma notes that “the advantage here is that you don’t need overlapping images to proceed,” continuing, “by looking through the object and coming out the other end, this technique provides points in common to work with that weren’t initially available.”
The system does require prior knowledge, in this case the width of the house, but this can be acquired as part of the system’s learning process. While Ma acknowledges that the technology has a long way to go before it can be applied to commercial solutions such as AI-assisted moviemaking, it holds immense promise for these industries.
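To get a feel for the idea, here is a deliberately simplified toy in Python. It is not the MIT team’s implementation: the hard-coded house width stands in for the learned priors the real method uses, and the scene geometry is invented. What it shows is the flavor of the reasoning: prior knowledge lets the system predict where a hidden point sits in 3D, giving two cameras that share no visible features a common point to reconstruct from.

```python
import numpy as np

# Prior knowledge about the object. In the real method such priors are
# learned from data; the hard-coded value here is purely illustrative.
HOUSE_WIDTH = 10.0  # metres

def virtual_back_corner(front_corner_3d, facing_dir):
    """Predict the hidden back corner of a house from the visible front one.

    front_corner_3d : (3,) 3D position of the corner camera A can see.
    facing_dir      : (3,) unit vector from the house's front face
                      toward its back face.
    """
    # "Look through" the object: step across its known width.
    return front_corner_3d + HOUSE_WIDTH * np.asarray(facing_dir)

# Camera A sees only the front of the house; camera B, on the opposite
# side, sees only the back. Their images share no visible features.
front_corner = np.array([2.0, 0.0, 15.0])  # visible in camera A's frame
facing = np.array([0.0, 0.0, 1.0])         # house extends away from camera A

# The predicted point is a "virtual correspondence": camera A infers it
# through the house, while camera B observes it directly, giving the
# two non-overlapping views a common point for 3D reconstruction.
print(virtual_back_corner(front_corner, facing))  # [ 2.  0. 25.]
```

Fed into a triangulation routine like the one sketched earlier, such virtual points play the role that shared visible features play in classical Structure from Motion.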
AI-Assisted Moviemaking
The MIT News article gives a great example of how this new artificial intelligence technique could be used in AI-assisted filmmaking.
It refers to the scene in ‘Good Will Hunting’ (1997) in which the audience is shown Matt Damon and Robin Williams from behind, sitting on a bench overlooking a pond in Boston’s Public Garden. The next shot jumps to the opposite side and shows a frontal view of the pair against an entirely different background. While audiences immediately understand the change, current AI systems simply cannot make the association and so cannot analyze the scene effectively.
With the above example in mind, it is clear how important virtual correspondence will be in helping AI-assisted moviemaking companies develop the next generation of tools for their platforms.
The ability to understand the ‘3D world’ of any given scene will not only help a system understand films as a whole but will also enable the first editing tools that can accurately show filmmakers the effects of their editing decisions on their audience.
Let’s continue with the example of the Good Will Hunting bench scene mentioned above to get a better picture of these implications. Firstly, as we already noted, this technique would allow an AI system to analyze the scene as a whole, which would let it build an accurate genre recipe for filmmakers.
The whole picture and the effect it generates are often very different from individual cuts or parts of a scene. While current systems try to overcome this by focusing more on script analysis, future systems will be able to infer a great deal more from the scenes themselves and from how they are edited together.
So, for example, had director Gus Van Sant not made that cut from one camera angle to the other, how would the audience have reacted? Likewise, what would have happened if he had made the cut earlier in the scene? Or later?
Future AI-assisted filmmaking tools that build on the virtual correspondence technique will be able to answer all of these questions, marking a giant leap for AI-assisted filmmaking.