Google researchers address the challenge of achieving a comprehensive understanding of diverse video content by introducing a novel video encoder model, VideoPrism. Existing video understanding models have struggled with tasks that demand complex scene understanding and motion-centric reasoning, and have shown inconsistent performance across different benchmarks. The researchers aimed to develop a general-purpose video encoder that can effectively handle a wide range of video understanding tasks with minimal adaptation.
Existing video understanding models have made significant progress but still fall short. Some models leverage text associated with videos for learning, while others focus solely on video signals, which limits the effective capture of both appearance and motion cues. VideoPrism proposes an approach that integrates both video and text modalities during pretraining. It introduces a two-stage pretraining framework that combines contrastive learning with masked video modeling. This method enables the model to learn semantic representations from both video-text pairs and video-only data.
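To make the first stage concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style contrastive loss that aligns video and text embeddings, the kind of objective stage one relies on. The function name, the temperature value, and the array shapes are illustrative assumptions, not VideoPrism's actual implementation.

```python
import numpy as np

def symmetric_infonce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over matched video-text pairs.

    video_emb, text_emb: (batch, dim) arrays where row i of each is a matched pair.
    Hypothetical sketch -- not the actual VideoPrism training code.
    """
    # L2-normalize so the dot product becomes cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (batch, batch) similarity matrix

    def cross_entropy(l):
        # softmax cross-entropy with targets on the diagonal (matched pairs)
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the video->text and text->video directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

As a sanity check, perfectly aligned pairs should yield a lower loss than deliberately mismatched ones.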
VideoPrism's architecture is based on the Vision Transformer (ViT), with modifications for space-time factorization. During pretraining, the model first aligns video and text embeddings through contrastive learning and then continues training on video-only data using masked video modeling. This two-stage approach is augmented with global-local distillation and token shuffling techniques to improve model performance. Extensive evaluations across various video understanding tasks demonstrate that VideoPrism achieves state-of-the-art performance on 30 out of 33 benchmarks, showcasing its strong generalizability and effectiveness in capturing both appearance and motion cues.
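The second stage hinges on masking most video tokens and training the model to reason about the hidden ones. Below is a minimal, hypothetical sketch of such a token-masking step; the function name, the 80% mask ratio, and the return format are assumptions for illustration and may differ from VideoPrism's actual masking strategy.

```python
import numpy as np

def mask_video_tokens(num_tokens, mask_ratio=0.8, rng=None):
    """Split token positions into visible and masked sets for masked video modeling.

    Illustrative sketch only -- the real VideoPrism masking scheme may differ.
    Returns (visible_idx, masked_idx), two disjoint sorted arrays of positions.
    """
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(num_tokens)          # random ordering of token positions
    num_masked = int(num_tokens * mask_ratio)   # how many tokens the model must predict
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return visible_idx, masked_idx
```

The encoder would then see only the visible tokens, while the training objective is computed on the masked positions.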
In summary, Google researchers address the challenge of building a foundational video model with their state-of-the-art model VideoPrism for comprehensive video understanding. The proposed method combines contrastive learning with masked video modeling in a two-stage pretraining framework, resulting in a model that excels across a wide range of video understanding tasks.
Check out the Paper. All credit for this research goes to the researchers of this project.
Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in software and data science applications, and is always reading about developments in different fields of AI and ML.