Google researchers have proposed a new AI system, VideoBERT, that can predict what will happen next in videos. Humans habitually try to guess what will happen next in movies or web series, but it is quite difficult for machines to make the same predictions and resolve the suspense.
In a blog post, the Google researchers wrote that VideoBERT’s goal is to discover high-level audio and visual semantic features corresponding to events and actions unfolding over time.
Speech tends to be temporally aligned with the visual signals, and can be extracted by using off-the-shelf automatic speech recognition (ASR) systems, and thus provides a natural source of self-supervision.
Google research scientists Chen Sun and Cordelia Schmid
VideoBERT builds on Google's BERT (Bidirectional Encoder Representations from Transformers), a natural language AI system designed to model relationships among sentences.
Specifically, the researchers combined image frames with the sentence outputs of a speech recognition system. The frames were converted into visual tokens, each covering 1.5 seconds of video, based on feature similarities, and these were concatenated with word tokens. VideoBERT is then tasked with filling in the missing tokens in the resulting visual-text sentences, and its predictions are the output.
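The tokenization and masking steps described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' implementation: it assumes each 1.5-second segment has already been reduced to a feature vector, assigns each segment to its nearest centroid to obtain a visual token, joins the visual tokens to the word tokens, and hides some tokens for the model to fill in. The function names, the `[SEP]` and `[MASK]` placeholders, the `vis_N` token format and the toy centroids are all assumptions made for this example.

```python
import random

MASK = "[MASK]"  # placeholder for a hidden token (assumed name)
SEP = "[SEP]"    # hypothetical separator between word and visual tokens


def nearest_centroid(feature, centroids):
    """Return the index of the centroid closest to `feature` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(feature, centroids[i]))


def visual_tokens(segment_features, centroids):
    """Map each segment's feature vector to a discrete token such as 'vis_2'."""
    return [f"vis_{nearest_centroid(f, centroids)}" for f in segment_features]


def build_sequence(words, vis_tokens):
    """Concatenate word tokens and visual tokens into one visual-text sentence."""
    return list(words) + [SEP] + list(vis_tokens)


def mask_tokens(sequence, mask_prob, rng):
    """Hide each non-separator token with probability `mask_prob`.

    Returns the masked sequence and a dict {position: original token}
    that a model would be trained to predict.
    """
    masked, targets = [], {}
    for i, tok in enumerate(sequence):
        if tok != SEP and rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


# Toy data: three 2-D centroids and one feature vector per video segment.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
features = [[0.1, -0.2], [0.9, 1.2], [4.8, 5.1]]
words = ["stir", "the", "batter"]

sequence = build_sequence(words, visual_tokens(features, centroids))
masked, targets = mask_tokens(sequence, 0.3, random.Random(0))
print(sequence)
print(masked)
```

In a real system the centroids would come from clustering features extracted by a video network, and a Transformer would be trained to recover the entries in `targets`; neither part is shown here.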
Over one million instructional videos across categories like gardening, cooking and vehicle repair were used to train the AI model.
Our results demonstrate the power of the BERT model for learning visual-linguistic and visual representations from unlabeled videos. We find that our models are not only useful for zero-shot action classification and recipe generation, but the learned temporal representations also transfer well to various downstream tasks, such as action anticipation.
Chen Sun and Cordelia Schmid
VideoBERT successfully predicted that a bowl of flour and cocoa powder may become a brownie or cupcake after baking in an oven, and it generated sets of instructions, such as recipes, from videos. The researchers further explained the concept of Contrastive Bidirectional Transformers (CBT), which skips the tokenization step.