Introduction to CLIFS: Contrastive Language-Image Forensic Search
CLIFS, or Contrastive Language-Image Forensic Search, is a proof-of-concept project for free-text search within videos: given a textual query, it finds the video frames whose content best matches the query. This capability is powered by OpenAI's CLIP (Contrastive Language-Image Pre-Training) model, which maps images and text descriptions into a shared feature space so that matching pairs can be linked in either direction. Here's a closer look at how CLIFS works and how it can be used.
How CLIFS Works
The Role of the CLIP Model
The core of CLIFS is the CLIP model. Each frame of a video is passed through the CLIP image encoder, which turns the visual content into a feature vector that can be compared with text. The CLIP text encoder processes the search query in the same way to produce its feature representation. The system then computes the similarity between each frame's features and the query's features, and the closest matches are returned to the user, provided they exceed a predefined similarity threshold.
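As a rough sketch of this encode-and-compare step, the snippet below uses OpenAI's clip package directly. The model variant ("ViT-B/32"), the frame file name, the example query, and the 0.3 threshold are illustrative assumptions, not values taken from the CLIFS code.

    # Minimal sketch of encoding one frame and one query with CLIP and
    # comparing them; names and threshold are illustrative assumptions.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Encode one (hypothetical) extracted video frame with the image encoder.
    frame = preprocess(Image.open("frame_0001.jpg")).unsqueeze(0).to(device)
    # Encode the free-text search query with the text encoder.
    query = clip.tokenize(["a truck with the text 'odwalla'"]).to(device)

    with torch.no_grad():
        frame_features = model.encode_image(frame)   # shape: (1, 512) for ViT-B/32
        query_features = model.encode_text(query)    # shape: (1, 512)

    # Cosine similarity between the normalized feature vectors.
    frame_features = frame_features / frame_features.norm(dim=-1, keepdim=True)
    query_features = query_features / query_features.norm(dim=-1, keepdim=True)
    similarity = (frame_features @ query_features.T).item()

    threshold = 0.3  # assumed cut-off; the real threshold may differ
    if similarity > threshold:
        print(f"match: similarity {similarity:.3f}")

In CLIFS itself, the frame features are computed up front when a video is indexed, so only the query needs to be encoded at search time.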
User Interaction and Backend Support
For ease of use, CLIFS includes a simple web server built on the Django framework. The server acts as the interface between the user and the backend search engine, making its capabilities available from an ordinary web browser.
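As a hypothetical illustration of how such a server could hand a query to the backend, the sketch below defines a minimal Django view; the view, URL pattern, and search_frames helper are invented for this example and are not CLIFS's actual code.

    # Hypothetical sketch of a Django view that forwards a text query to the
    # search backend. `search_frames` stands in for the real ranking call.
    from django.http import JsonResponse
    from django.urls import path


    def search_frames(query: str, threshold: float = 0.3) -> list:
        """Placeholder for the backend call that ranks frames against the query."""
        return []  # in a real deployment this would query the search engine


    def search_view(request):
        query = request.GET.get("q", "").strip()
        if not query:
            return JsonResponse({"error": "empty query"}, status=400)
        return JsonResponse({"query": query, "results": search_frames(query)})


    urlpatterns = [
        path("search/", search_view),
    ]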
Practical Examples
To demonstrate the model's prowess, consider these search queries made against a two-minute video from the UrbanTracker Dataset, specifically the Sherbrooke video:
- A truck with the text "odwalla": The system accurately identifies and displays the frame containing a truck with the specified text.
- A white BMW car: A frame showing a white BMW car is retrieved successfully.
- A truck with the text "JCN": The system locates the frame with a truck displaying the text "JCN".
- A bicyclist with a blue shirt: A frame featuring a bicyclist in a blue shirt is found.
- A blue SMART car: The corresponding frame with a blue SMART car is displayed.
These examples, among others, illustrate CLIFS's capabilities, including the model's ability to read text in frames, effectively performing a form of optical character recognition (OCR).
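To make the ranking behind these results concrete, the sketch below scores one of the example queries against a set of precomputed frame features and keeps the top matches above a threshold; the random feature tensor, frame count, and 0.3 cut-off are placeholders rather than values from CLIFS.

    # Sketch of ranking precomputed frame features against one example query;
    # the feature tensor, frame count, and threshold are made up for illustration.
    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    # Stand-in for frame features produced earlier by the image encoder
    # (one row per sampled frame, already L2-normalized). Random here for brevity.
    num_frames = 3600  # e.g. a two-minute video at 30 frames per second
    frame_features = torch.nn.functional.normalize(
        torch.randn(num_frames, 512, device=device), dim=-1
    )

    # Encode one of the example queries with the text encoder.
    query = clip.tokenize(["a white BMW car"]).to(device)
    with torch.no_grad():
        query_features = model.encode_text(query).float()
    query_features = torch.nn.functional.normalize(query_features, dim=-1)

    # Cosine similarity of every frame to the query; keep the top matches
    # that clear an (assumed) threshold.
    scores = (frame_features @ query_features.T).squeeze(1)
    threshold = 0.3
    top_scores, top_idx = scores.topk(5)
    for frame_id, score in zip(top_idx.tolist(), top_scores.tolist()):
        if score > threshold:
            print(f"frame {frame_id}: similarity {score:.3f}")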
Setting Up CLIFS
Here's a step-by-step guide for setting up and using the CLIFS system:
- Setup Script: Run the setup.sh script, which creates the necessary folders and offers to download a sample video for testing:
    ./setup.sh
- Add Your Videos: Place any video files you wish to index in the data/input directory.
- Build and Launch: Build and start the search engine and web server with Docker Compose:
    docker-compose build && docker-compose up
  If you have an NVIDIA GPU and your environment supports GPU access from Docker, use the GPU-enabled Compose file instead:
    docker-compose build && docker-compose -f docker-compose-gpu.yml up
- Start Searching: Once the system has encoded the features of the files in the data/input directory, visit 127.0.0.1:8000 in your web browser to search with free-text queries. (A conceptual sketch of this encoding step follows the list.)
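The actual indexing code runs inside the CLIFS containers, but the sketch below shows the general shape of that encoding step: decode a video from data/input with OpenCV, sample a subset of frames, and pass them through the CLIP image encoder. The file name, sampling rate, and model variant are assumptions for illustration.

    # Conceptual sketch (not CLIFS's actual indexing code) of encoding frames
    # from a video under data/input so they can later be searched.
    import cv2
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    video_path = "data/input/sherbrooke.avi"  # hypothetical file name
    sample_every = 30  # roughly one frame per second for a 30 fps video

    capture = cv2.VideoCapture(video_path)
    features = []
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % sample_every == 0:
            # OpenCV yields BGR arrays; convert to RGB before handing them to CLIP.
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            image = preprocess(Image.fromarray(rgb)).unsqueeze(0).to(device)
            with torch.no_grad():
                features.append(model.encode_image(image).squeeze(0))
        frame_index += 1
    capture.release()

    frame_features = torch.stack(features)  # one row of features per sampled frame
    print(f"encoded {frame_features.shape[0]} frames")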
In summary, CLIFS provides a powerful and user-friendly solution for searching video content using text, bridging the gap between complex AI models and practical, everyday applications.