Asking questions is a natural way to search. When you want to know the definition of Document in Jina, you naturally ask, "What is the Document in Jina?". The expected answer can be found either in Jina's docs or in the introduction videos on Jina's YouTube channel. Thanks to the latest advances in NLP, AI models can automatically find these answers in the content.
The goal of this tutorial is to build a Question-Answering (QA) system for video content. Although most existing QA models only work on text, most videos contain speech, which carries rich information about their content and can be converted to text via speech-to-text (STT) recognition. Therefore, videos with speech naturally lend themselves to question answering over text.
In this tutorial, we will show you how to find and extract content from videos that answers a query question. Instead of just returning related videos and having the user skim through each one, a QA model can tell the user at which second to start watching to find the answer to their question.
To convert speech information from the videos into text, we can rely on STT algorithms. Fortunately, for most videos on YouTube, you can download the subtitles that are generated automatically via STT. In this example, we assume the video files already have subtitles embedded. By loading these subtitles, we can get the text of the speech together with the beginning and ending timestamps.
You can use youtube-dl to download YouTube videos with embedded subtitles:

```bash
youtube-dl --write-auto-sub --embed-subs --recode-video mkv -o zvXkQkqd2I8 https://www.youtube.com/watch\?v\=zvXkQkqd2I8
```
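As a rough illustration of how subtitles can be loaded, the sketch below reads a WebVTT file with webvtt-py and prints each caption with its start and end timestamps. The file name `zvXkQkqd2I8.en.vtt` is an assumption about what youtube-dl writes alongside the video; adjust it to your actual output, and note that in this tutorial the extraction is done for you by VideoLoader.

```python
# A minimal sketch of reading a WebVTT subtitle file with webvtt-py.
# The file name below is an assumption; youtube-dl typically writes the
# auto-generated subtitles next to the video, e.g. `zvXkQkqd2I8.en.vtt`.
import webvtt

for caption in webvtt.read('zvXkQkqd2I8.en.vtt'):
    # Each caption carries its text plus beginning and ending timestamps.
    print(caption.start, caption.end, caption.text)
```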
Subtitles generated with STT are not 100% accurate, so you usually need to post-process them. For example, the toy data uses an introduction video of Jina, and in the auto-generated subtitles "Jina" is misspelled as "gena", "gina", and so on. Worse still, most of the sentences are broken up and there is no punctuation.
With the subtitles of the videos, we further need a QA model. The input to the QA model usually has two parts: the question and the context. The context denotes the candidate texts that contain the answers. In our case, the context corresponds to the subtitles from which the answers are extracted.
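To make the question/context split concrete, here is a minimal sketch using the Hugging Face transformers question-answering pipeline. The default model and the example context are assumptions for illustration only, not the exact setup used by the Executors in this tutorial.

```python
# A minimal sketch of extractive QA: the model receives a question and a
# context (here: a snippet of subtitle text) and returns an answer span.
from transformers import pipeline

qa = pipeline('question-answering')  # downloads a default extractive QA model

result = qa(
    question='What is the Document in Jina?',
    context='A Document is the basic data type that Jina operates with.',
)
print(result['answer'], result['score'])
```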
To save computational cost, we want the context to be as short as possible. To generate such contexts, one can retrieve candidate texts using either traditional sparse vectors or dense vectors. In this example, we use the dense vectors that ship together with the QA model.
With traditional methods, retrieval can also be done using BM25, TF-IDF, etc.
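For comparison, a sparse retrieval baseline could look roughly like the sketch below, which ranks subtitle snippets against a question with TF-IDF and cosine similarity. The snippets are made-up examples, and this is only an illustration of the idea, not part of the actual Flow.

```python
# A rough sketch of sparse retrieval: rank subtitle snippets by TF-IDF
# similarity to the question and keep the best one as the QA context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

subtitles = [
    'a Document is the basic data type in Jina',
    'you can install Jina via pip',
    'Flows orchestrate Executors into a pipeline',
]
question = 'What is the Document in Jina?'

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(subtitles)
q_vec = vectorizer.transform([question])

scores = cosine_similarity(q_vec, doc_vecs)[0]
best = scores.argmax()
print(subtitles[best])  # the shortest useful context to feed into the QA model
```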
We use VideoLoader to extract subtitles from the videos. It uses ffmpeg to extract the subtitles and then generates chunks from them using webvtt-py. The subtitles are stored in the chunks together with other meta information in the tags, including the timestamps and the video URI. Each extracted subtitle chunk has the following attributes (illustrated in the sketch after the table):
| Attribute | Description |
| --- | --- |
| `text` | Text of the subtitle |
| `location` | Index of the subtitle in the video, starting from 0 |
| `modality` | Always set to `text` |
| `tags['beg_in_seconds']` | Beginning of the subtitle in seconds |
| `tags['end_in_seconds']` | End of the subtitle in seconds |
| `tags['video_uri']` | URI of the video |
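The sketch below shows roughly how these attributes could be read from the chunks of an indexed Document. The exact Document structure depends on the VideoLoader configuration, so treat this as an assumption-based illustration rather than the tutorial's actual code.

```python
# A rough sketch of inspecting the subtitle chunks produced by VideoLoader.
# `doc` is assumed to be a Document whose chunks hold one subtitle each.
from jina import Document


def print_subtitle_chunks(doc: Document):
    for chunk in doc.chunks:
        print(
            chunk.text,                    # text of the subtitle
            chunk.location,                # index of the subtitle in the video
            chunk.tags['beg_in_seconds'],  # beginning of the subtitle in seconds
            chunk.tags['end_in_seconds'],  # end of the subtitle in seconds
            chunk.tags['video_uri'],       # URI of the source video
        )
```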
DPR (Dense Passage Retrieval) is a set of tools and models for open-domain Q&A tasks.
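As a rough sketch of what DPR does, the snippet below encodes a context passage and a question into dense vectors using the Hugging Face DPR models that also appear in the YAML configuration further down. In the tutorial itself this work is handled by the DPRTextEncoder Executor; the example text is an assumption for illustration.

```python
# A minimal sketch of DPR: questions and passages are encoded by two
# separate models into dense vectors that live in the same space.
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_tok = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
ctx_enc = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
q_tok = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
q_enc = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')

ctx_emb = ctx_enc(**ctx_tok('a Document is the basic data type in Jina', return_tensors='pt')).pooler_output
q_emb = q_enc(**q_tok('What is the Document in Jina?', return_tensors='pt')).pooler_output

# Relevance is measured by the dot product between the two embeddings.
print((q_emb @ ctx_emb.T).item())
```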
For the indexer, we choose SimpleIndexer for demonstration purposes. It stores both the vectors and the meta information together. You can find more information about it on Jina Hub.
Because the indexing and querying Flows have only one shared Executor, we create separate Flows for each task.
The index request contains Documents that have the path information of the video files stored in their uri attribute.
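A minimal indexing sketch might look like the following. The file names `index.yml` and `toy_data/zvXkQkqd2I8.mkv` are assumptions based on this tutorial's setup and may differ in the actual example repository.

```python
# A rough sketch of the indexing step: each Document carries the path to a
# video file in its `uri` attribute and is sent to the index Flow.
from jina import Document, DocumentArray, Flow

index_docs = DocumentArray([Document(uri='toy_data/zvXkQkqd2I8.mkv')])

f = Flow.load_config('index.yml')  # the Flow configured by the YAML shown further down
with f:
    f.post(on='/index', inputs=index_docs)
```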
There are three Executors in the index Flow: VideoLoader, DPRTextEncoder, and SimpleIndexer.
There are four Executors in the query Flow:
The overall structure of the query Flow is as follows:
You might note that DPRTextEncoder is used in both the index and query Flows:
In these two cases, we need different models to encode different attributes of the Documents. To achieve this, we use different initialization settings for DPRTextEncoder by overriding its with arguments in the YAML files, passing the new arguments via uses_with. You can find more information in Jina's docs.
```yaml
# index.yml
...
- name: encoder
  uses: jinahub://DPRTextEncoder/
  uses_with:
    pretrained_model_name_or_path: 'facebook/dpr-ctx_encoder-single-nq-base'
    encoder_type: 'context'
    traversal_paths:
      - 'c'
...
```
```yaml
# query.yml
...
- name: encoder
  uses: jinahub://DPRTextEncoder/
  uses_with:
    pretrained_model_name_or_path: 'facebook/dpr-question_encoder-single-nq-base'
    encoder_type: 'question'
    batch_size: 1
...
```
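Putting it together, querying could look roughly like the sketch below: the question is sent as a text Document, and each returned match is assumed to carry the subtitle text and the timestamp tags listed earlier. The exact response structure depends on the example repository and your Jina version, so treat the field access as an assumption.

```python
# A rough sketch of the query step: send a question as a text Document and
# read back the matched subtitles together with their timestamps.
from jina import Document, Flow


def print_matches(resp):
    # Each match is assumed to carry the subtitle text and its timestamps in tags.
    for match in resp.docs[0].matches:
        print(match.text, match.tags['beg_in_seconds'], match.tags['video_uri'])


f = Flow.load_config('query.yml')
with f:
    f.post(on='/search', inputs=Document(text='What is the Document in Jina?'), on_done=print_matches)
```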
You can find the code at example-video-qa.
Most of the Executors used in this tutorial, including VideoLoader, DPRTextEncoder, and SimpleIndexer, are available on Jina Hub.
In this example, we rely on subtitles embedded in the video. For videos without subtitles, we need to build Executors using STT models to extract speech information. If the video contains other sounds, you can resort to VADSpeechSegmenter for separating speech beforehand.
Another direction to extend this example is to consider other textual information in the videos. While subtitles contain rich information about a video, they do not capture all of its text. Many videos have text embedded in their frames; in such cases, we need to rely on OCR models to extract that text from the video frames.
Overall, searching in-video content is a complex task and Jina makes it a lot easier.
We’d appreciate any feedback about your experience with the learning bootcamp. Please check it out and let us know what you think.