It seems that AI reaches a new milestone almost every day. The reason is simple: artificial intelligence and machine learning are still in their infancy and so are progressing at a rapid rate.

However, to borrow from George Orwell's famous line, some milestones are more important than others. In an article entitled “Artificial intelligence system learns concepts shared across video, audio, and text”, MIT News broke what is certainly one of the most important AI news stories of the last 12 months.

MIT News revealed that a team of AI researchers had devised a solution that allows an AI system to identify and label the components of the action taking place in a clip, all without any human intervention.

Before we get to the huge potential ramifications of this for fields such as AI-assisted moviemaking and social media, we will dive in a little deeper to explore exactly what this advancement is all about.

Multimedia AI Understanding

All animals, and indeed most plants, are able to “process” data from more than one source or type. For example, a cat reacts to sensory data such as sound, vision, pressure from its whiskers and feet, and heat to make decisions that will benefit it in some way, e.g. hunting for food or keeping safe.

A huge limitation of current AI systems is that they lack the ability to deal with multiple types of data at once, e.g. video, audio, and text.

To achieve this, the system needs to be engineered and trained to understand all these different types of data and, very importantly, to label them in a way that allows it to collate them efficiently and provide accurate results. Adding to the difficulty, the separate results from the different media sources are often interdependent, so deriving an accurate understanding of their meaning requires understanding each one and how they relate to each other.

What do I mean by this? Well, just consider the scene in Stanley Kubrick’s ‘A Clockwork Orange’ where the main character beats his victim while singing the song ‘Singin’ in the Rain’. The very deliberate motive for using this song, normally associated with the innocent Gene Kelly classic of the same name, was to create a distinct sense of unease in the audience as they viewed the violence, something which works to great effect.

Without understanding the significance of this juxtaposition, i.e. the mood created by combining these two data elements, an AI system would not be able to understand the meaning of the scene or its effect on an audience.

Alexander Liu, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL), worked with a team of collaborators to devise a solution to this problem. Their technique “learns to represent data in a way that captures concepts which are shared between visual and audio modalities.” – Source MIT News

In effect, this allows the AI system to understand and cross-reference insights from multiple media sources.
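To make the idea of a shared representation a little more concrete, here is a minimal sketch in Python. It is not the MIT team's model. It only illustrates one common way to picture the concept: clips and audio are each turned into a list of numbers (an "embedding") in the same space, so that items describing the same concept end up close together. The file names, the audio labels, and every number below are invented for illustration; a real system would learn these values from data.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings for three video clips.
video_clips = {
    "person_juggling.mp4": [0.9, 0.1, 0.0],
    "dog_barking.mp4": [0.1, 0.9, 0.1],
    "rain_on_window.mp4": [0.0, 0.2, 0.9],
}

# Hypothetical embeddings for two audio inputs, placed in the same space.
audio_queries = {
    "spoken: 'a man is juggling'": [0.8, 0.2, 0.1],
    "sound: barking": [0.2, 0.8, 0.0],
}


def best_match(query_vector, candidates):
    """Return the name of the candidate whose vector is closest to the query."""
    return max(
        candidates,
        key=lambda name: cosine_similarity(query_vector, candidates[name]),
    )


for label, vector in audio_queries.items():
    print(label, "->", best_match(vector, video_clips))
```

Because both kinds of media live in one space, matching an audio input to a video clip reduces to finding the nearest vector. The hard part, which this sketch skips entirely, is learning embeddings that line up across media without human labels.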

In a nutshell, this development promises to give AI greatly enhanced ‘perception’ of the tasks it is set to help solve.

Another Giant Leap for AI-Assisted Moviemaking

As is always the case with any new advancement in AI, a huge number of industries are set to benefit from this new development, and AI-assisted moviemaking is no exception.

Current AI-assisted moviemaking systems are limited in their ability to simultaneously analyze and collate data from different media sources. While they are increasingly effective at analyzing and understanding single-source unstructured data, they still lack the ability to interpret and label multi-source unstructured data. Once AI-assisted moviemaking companies develop an approach to implement this technique, the potential will be enormous.

This approach is the Holy Grail for AI film analytics companies since it will finally allow their systems to comprehensively analyze movie scenes. Systems will be able to analyze multiple media elements in one go and then collate the data to extract deeper insights, allowing them to accurately understand filmmaking techniques such as the one used in the scene from ‘A Clockwork Orange’ mentioned above.

With this advancement in understanding, not only will the current tools offered by these AI-assisted moviemaking companies become far more powerful and accurate, but these platforms will be able to offer a huge range of new tools too. Examples include suggested soundtrack options, mise en scène tools such as lighting or camera angle suggestions, and actor performance feedback.

Perhaps the most powerful advancement this technique will help to bring about is the ability of future AI systems to analyze a new film with a better sense of it as an original work of art and of how audiences, whose tastes keep changing, will react to it. Current systems rely solely on past data about how similar films have performed both at the box office and in audience reviews, and so are always a reflection of the past in some way.