For nearly a decade, the primary interface for vision AI has been bounding boxes. This enabled many applications to detect objects and take action. However, to extract deeper insights and perform more actionable analysis like “detect pickpocketing attempts in crosswalks,” ML engineers need to compose multiple models (face detectors, action detectors, object trackers) and write application-specific logic to find what they are looking for. This workflow is error-prone and time-consuming: it requires significant expertise in systems, databases, and AI, especially as models evolve and become more complex. We envision a world where a user simply types a natural language query and gets reliable results and dataset insights without needing to worry about which models were used, how they were composed, or even what hardware they ran on.
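To make the pain concrete, here is a rough sketch, with hypothetical model stubs and thresholds of our own, of the glue code an engineer might write today to approximate “detect pickpocketing attempts in crosswalks” by composing separate detectors:

```python
# A sketch of today's hand-rolled pipeline. The three model functions are
# placeholders for real components (object detector + tracker, action
# classifier, region test); the application-specific glue logic is the point.
from dataclasses import dataclass

@dataclass
class Box:
    track_id: int
    label: str            # e.g. "person", "bag"
    x: float
    y: float
    w: float
    h: float

def detect_and_track(frame) -> list[Box]:
    return []             # placeholder: run an object detector + tracker

def classify_action(frame, box: Box) -> str:
    return "walking"      # placeholder: run an action-recognition model on the crop

def in_crosswalk(box: Box, crosswalk_polygon) -> bool:
    return True           # placeholder: geometric test against a labeled region

def find_pickpocket_attempts(frames, crosswalk_polygon, min_frames: int = 5):
    """Flag person tracks that perform a 'reaching' action inside the
    crosswalk for at least `min_frames` frames (an arbitrary heuristic)."""
    suspicious_counts: dict[int, int] = {}
    for frame in frames:
        for box in detect_and_track(frame):
            if box.label != "person" or not in_crosswalk(box, crosswalk_polygon):
                continue
            if classify_action(frame, box) == "reach_into_bag":
                suspicious_counts[box.track_id] = suspicious_counts.get(box.track_id, 0) + 1
    return [tid for tid, n in suspicious_counts.items() if n >= min_frames]
```

Every label string, threshold, and ordering decision in this sketch is bespoke and breaks as soon as the models or the question change, which is exactly the brittleness a natural language interface should hide.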
Video Foundation Models (VFMs) give us an opportunity to revolutionize how we interact with videos. They will replace the complex and fragile pipelines that computer vision applications depend on today because they can capture the intricate details that are vital for fields like healthcare, surveillance, and defense. In the figure, instead of a hard-coded pipeline, a user should simply be able to prompt a VFM with “find suspicious activity” and have the model identify an attempted pickpocket. Just as ChatGPT is trained on a massive corpus of text, building video foundation models requires massive video datasets. We are already seeing traction from industry: LAION released video2dataset, which enables organizing millions of videos into training data, and Google has built YouTube-8M and VideoCC.
Building end-to-end, robust systems around video foundation models is challenging, especially at petabyte scale. Video comes in varying resolutions, frame rates, and codecs. Some videos have audio, others do not. Some sequences of frames are more meaningful than others. This richness is precisely what makes video so useful, but it is also a hurdle to our goal of deploying powerful VFMs. Existing solutions focus strictly on image and bounding-box labeling, while others have only just started to use foundation models. Either way, this is still far from training models that can reason about “suspicious activity.”
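As a small illustration of that heterogeneity, the sketch below, assuming ffprobe from FFmpeg is installed and a hypothetical videos/ directory, reports each file’s codec, resolution, frame rate, and whether it has an audio stream; a real system has to normalize all of this before a VFM ever sees a frame.

```python
# Summarize the heterogeneous properties of a folder of videos.
# Assumes ffprobe (part of FFmpeg) is on PATH; the directory is illustrative.
import json
import subprocess
from pathlib import Path

def probe(path: Path) -> dict:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    streams = json.loads(out)["streams"]
    video = next(s for s in streams if s["codec_type"] == "video")
    return {
        "file": path.name,
        "codec": video["codec_name"],
        "resolution": f'{video["width"]}x{video["height"]}',
        "fps": video.get("avg_frame_rate"),       # e.g. "30000/1001"
        "has_audio": any(s["codec_type"] == "audio" for s in streams),
    }

if __name__ == "__main__":
    for path in sorted(Path("videos/").glob("*")):
        print(probe(path))
```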
We have taken the first steps by building infrastructure for a general-purpose, interactive video DBMS. Our research provides a structured-data querying experience for videos by (1) making query execution performant and cost-effective, (2) automating query optimization, (3) improving ML model usage and expressivity, and (4) enabling natural language querying of videos. We have shown exciting results (VLDB’23, CIDR’22, arXiv) and, in a head-to-head comparison with a human, analyzed a day of movie footage in less than a minute.
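As a purely hypothetical sketch of the experience we are aiming for (the class name, methods, and dataset location below are illustrative stand-ins, not our system’s actual API), a query could look like this:

```python
# Hypothetical interface sketch: the user states intent once, and the video
# DBMS is responsible for choosing models, composing them, and optimizing
# execution. None of these names come from our actual system.
from dataclasses import dataclass

@dataclass
class Clip:
    camera_id: str
    start_s: float
    end_s: float

class VideoDB:
    def __init__(self, source: str):
        self.source = source              # e.g. a bucket or folder of videos

    def query(self, prompt: str) -> list[Clip]:
        # Model selection, composition, and cost-based optimization would
        # happen here, hidden from the user.
        return []                         # placeholder result

db = VideoDB("s3://city-cams/2024-06/")   # hypothetical dataset location
for clip in db.query("find attempted pickpockets in crosswalks"):
    print(clip.camera_id, clip.start_s, clip.end_s)
```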
To help us figure out how to build this infrastructure, we want to learn:
How do you manage and process video datasets today? What is straightforward, and what is a pain?
What use-cases would you be interested in using video foundation models for?
Is natural language the right interface to interact with one’s videos (versus a language like SQL)?
If you are thinking about these questions as well, please reach out to calebwin@stanford.edu!