BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
Salesforce Research
ICML 2023
Presenter: Hao-Ting Li
Link: https://aquastripe.github.io/slides/2024/BLIP-2/
Task: Vision-Language Models
Most state-of-the-art vision-language models incur a high computation cost during pre-training, due to end-to-end training using large-scale models and datasets.
Vision-language research sits at the intersection between vision and language, therefore it is naturally expected that vision-language models can harvest from the readily available unimodal models from the vision and natural language communities.
In this paper, we propose a generic and compute-efficient VLP method by bootstrapping from off-the-shelf pre-trained vision models and language models.
Pre-trained vision models offer high-quality visual representation.
Pre-trained language models, in particular large language models (LLMs), offer strong language generation and zero-shot transfer abilities.
To reduce computation cost and counteract the issue of catastrophic forgetting, the unimodal pre-trained models remain frozen during pre-training.
Cross-Modal Alignment
In order to leverage pre-trained unimodal models for VLP, it is key to facilitate cross-modal alignment.
However, since LLMs have not seen images during their unimodal pre-training, freezing them makes vision-language alignment in particular challenging.
In this regard, existing methods (e.g. Frozen (Tsimpoukelli et al., 2021), Flamingo (Alayrac et al., 2022)) resort to an image-to-text generation loss, which we show is insufficient to bridge the modality gap.
To achieve effective vision-language alignment with frozen unimodal models, we propose a Querying Transformer (Q-Former) pre-trained with a new two-stage pre-training strategy.
In the first pre-training stage, we perform vision-language representation learning which enforces the Q-Former to learn visual representation most relevant to the text.
In the second pre-training stage, we perform vision-to-language generative learning by connecting the output of the Q-Former to a frozen LLM, and train the Q-Former such that its output visual representation can be interpreted by the LLM.
The Key Advantages of BLIP-2 (1/2)
BLIP-2 effectively leverages both frozen pre-trained image models and language models.
We bridge the modality gap using a Q-Former pre-trained in two-stages: representation learning stage and generative learning stage.
BLIP-2 achieves state-of-the-art performance on various vision-language tasks including visual question answering, image captioning, and image-text retrieval.
Powered by LLMs (e.g. OPT (Zhang et al., 2022), FlanT5 (Chung et al., 2022)), BLIP-2 can be prompted to perform zero-shot image-to-text generation that follows natural language instructions, which enables emerging capabilities such as visual knowledge reasoning, visual conversation, etc. (see Figure 4 for examples)
The Key Advantages of BLIP-2 (2/2)
Due to the use of frozen unimodal models and a lightweight Q-Former, BLIP-2 is more compute-efficient than existing state-of-the-art methods.
For example, BLIP-2 outperforms Flamingo (Alayrac et al., 2022) by 8.7% on zero-shot VQAv2, while using 54× fewer trainable parameters.
Furthermore, our results show that BLIP-2 is a generic method that can harvest more advanced unimodal models for better VLP performance.
End-to-end Vision-Language Pre-training
Modular Vision-Language Pre-training
End-to-end Vision-Language Pre-training (1/2)
Vision-language pre-training aims to learn multimodal foundation models with improved performance on various vision-and-language tasks.
Depending on the downstream task, different model architectures have been proposed:
the dual-encoder architecture (Radford et al., 2021; Jia et al., 2021)
the fusion-encoder architecture (Tan & Bansal, 2019; Li et al., 2021)
the encoder-decoder architecture (Cho et al., 2021; Wang et al., 2021b; Chen et al., 2022b)
the unified transformer architecture (Li et al., 2022; Wang et al., 2022b)
End-to-end Vision-Language Pre-training (2/2)
Various pre-training objectives have also been proposed over the years, and have progressively converged to a few time-tested ones:
image-text contrastive learning (Radford et al., 2021; Yao et al., 2022; Li et al., 2021; 2022)
image-text matching (Li et al., 2021; 2022; Wang et al., 2021a)
(masked) language modeling (Li et al., 2021; 2022; Yu et al., 2022; Wang et al., 2022b)
Most VLP methods perform end-to-end pre-training using large-scale image-text pair datasets.
As the model size keeps increasing, the pre-training can incur an extremely high computation cost.
Moreover, it is inflexible for end-to-end pre-trained models to leverage readily-available unimodal pre-trained models, such as LLMs (Brown et al., 2020; Zhang et al., 2022; Chung et al., 2022).
Modular Vision-Language Pre-training (1/2)
More similar to us are methods that leverage off-the-shelf pre-trained models and keep them frozen during VLP.
Some methods freeze the image encoder
the early work which adopts a frozen object detector to extract visual features (Chen et al., 2020; Li et al., 2020; Zhang et al., 2021)
LiT (Zhai et al., 2022) which uses a frozen pre-trained image encoder for CLIP (Radford et al., 2021) pre-training
Some methods freeze the language model to use the knowledge from LLMs for vision-to-language generation tasks (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Chen et al., 2022a; Tiong et al., 2022; Guo et al., 2022).
Modular Vision-Language Pre-training (2/2)
The key challenge in using a frozen LLM is to align visual features to the text space.
Frozen (Tsimpoukelli et al., 2021) finetunes an image encoder whose outputs are directly used as soft prompts for the LLM.
Flamingo (Alayrac et al., 2022) inserts new cross-attention layers into the LLM to inject visual features, and pre-trains the new layers on billions of image-text pairs.
Both methods adopt the language modeling loss, where the language model generates texts conditioned on the image.
Different from existing methods, BLIP-2 can effectively and efficiently leverage both frozen image encoders and frozen LLMs for various vision-language tasks, achieving stronger performance at a lower computation cost.
Method
Model Architecture
Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder
Bootstrap Vision-to-Language Generative Learning from a Frozen LLM
Model Pre-training
Depending on the pre-training task, we apply different self-attention masks to control query-text interaction.
We initialize Q-Former with the pre-trained weights of $\text{BERT}_\text{base}$ (Devlin et al., 2019), whereas the cross-attention layers are randomly initialized.
In total, Q-Former contains 188M parameters.
Note that the queries are considered as model parameters.
32 queries
dimension of each query: 768
The size of queries is much smaller than the size of frozen image features (e.g. 257×1024 for ViT-L/14).
This bottleneck architecture works together with our pre-training objectives to force the queries to extract visual information that is most relevant to the text.
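A minimal shape sketch of this query bottleneck (illustrative PyTorch, not the released implementation; the single cross-attention layer and all names are simplifications):

```python
# Minimal sketch (not the official implementation) of the Q-Former bottleneck:
# 32 learnable query vectors cross-attend into the frozen image features,
# compressing 257x1024 ViT-L/14 features into 32x768 query outputs.
import torch
import torch.nn as nn

class QueryBottleneck(nn.Module):
    def __init__(self, num_queries=32, hidden=768, image_dim=1024):
        super().__init__()
        # The queries are trained model parameters (randomly initialized here).
        self.queries = nn.Parameter(torch.randn(num_queries, hidden) * 0.02)
        # One cross-attention layer as a stand-in; the real Q-Former inserts
        # cross-attention into every other BERT-base block.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=12,
                                                kdim=image_dim, vdim=image_dim,
                                                batch_first=True)

    def forward(self, image_feats):          # image_feats: (B, 257, 1024), frozen
        q = self.queries.expand(image_feats.size(0), -1, -1)   # (B, 32, 768)
        out, _ = self.cross_attn(q, image_feats, image_feats)  # (B, 32, 768)
        return out

frozen_feats = torch.randn(2, 257, 1024)      # e.g. frozen ViT-L/14 output
print(QueryBottleneck()(frozen_feats).shape)  # torch.Size([2, 32, 768])
```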
Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder
We aim to train the Q-Former such that the queries can learn to extract visual representation that is most informative of the text.
Optimization
Inspired by BLIP (Li et al., 2022), we jointly optimize three pre-training objectives that share the same input format and model parameters.
Image-text contrastive learning (ITC)
Image-grounded text generation (ITG)
Image-text matching (ITM)
Each objective employs a different attention masking strategy between queries and text to control their interaction (see Figure 2).
Image-Text Contrastive Learning (ITC)
Image-Text Contrastive Learning (ITC) learns to align image representation and text representation such that their mutual information is maximized.
We align the output query representation $Z$ from the image transformer with the text representation $t$ from the text transformer, where $t$ is the output embedding of the [CLS] token.
Since $Z$ contains multiple output embeddings (one from each query), we first compute the pairwise similarity between each query output and $t$, and then select the highest one as the image-text similarity (see the sketch below).
To avoid information leak, we employ a unimodal self-attention mask, where the queries and text are not allowed to see each other.
Due to the use of a frozen image encoder, we can fit more samples per GPU compared to end-to-end methods. Therefore, we use in-batch negatives instead of the momentum queue in BLIP.
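A hedged sketch of the ITC similarity and loss, assuming projected query outputs $Z$ of shape (B, 32, D) and text [CLS] embeddings $t$ of shape (B, D); the fixed temperature and the function name are illustrative, not the released code:

```python
# Sketch of the ITC objective: the image-text similarity is the maximum over
# the 32 query outputs of the similarity with the text [CLS] embedding;
# in-batch negatives provide the contrastive targets.
import torch
import torch.nn.functional as F

def itc_loss(Z, t, temperature=0.07):
    """Z: (B, 32, D) projected query outputs, t: (B, D) projected [CLS] embeddings."""
    Z = F.normalize(Z, dim=-1)
    t = F.normalize(t, dim=-1)
    # Similarity of every query of every image with every text: (B_img, B_txt, 32)
    sim = torch.einsum("iqd,td->itq", Z, t)
    sim_i2t = sim.max(dim=-1).values / temperature          # (B_img, B_txt)
    sim_t2i = sim_i2t.t()                                    # (B_txt, B_img)
    targets = torch.arange(Z.size(0), device=Z.device)      # matched pairs on the diagonal
    return (F.cross_entropy(sim_i2t, targets) + F.cross_entropy(sim_t2i, targets)) / 2

loss = itc_loss(torch.randn(4, 32, 256), torch.randn(4, 256))
```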
Image-grounded Text Generation (ITG)
Image-grounded Text Generation (ITG) loss trains the Q-Former to generate texts, given input images as the condition.
Since the architecture of Q-Former does not allow direct interactions between the frozen image encoder and the text tokens, the information required for generating the text must be first extracted by the queries, and then passed to the text tokens via self-attention layers.
Therefore, the queries are forced to extract visual features that capture all the information about the text.
We employ a multimodal causal self-attention mask to control query-text interaction, similar to the one used in UniLM (Dong et al., 2019). The queries can attend to each other but not the text tokens. Each text token can attend to all queries and its previous text tokens.
We also replace the [CLS] token with a new [DEC] token as the first text token to signal the decoding task.
Image-Text Matching (ITM)
Image-Text Matching (ITM) aims to learn fine-grained alignment between image and text representation.
It is a binary classification task where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched).
We use a bi-directional self-attention mask where all queries and texts can attend to each other. The output query embeddings $Z$ thus capture multimodal information.
We feed each output query embedding into a two-class linear classifier to obtain a logit, and average the logits across all queries as the output matching score.
We adopt the hard negative mining strategy from Li et al. (2021; 2022) to create informative negative pairs.
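An illustrative sketch of the ITM head described above (per-query two-class logits averaged into a matching score); shapes and names are assumptions, not the official implementation:

```python
# Sketch of the ITM head: each of the 32 multimodal query embeddings goes
# through a two-class linear classifier, and the logits are averaged across
# queries to give the image-text matching score.
import torch
import torch.nn as nn

itm_head = nn.Linear(768, 2)                 # two classes: matched / unmatched

def itm_logits(Z):                           # Z: (B, 32, 768) query outputs
    per_query = itm_head(Z)                  # (B, 32, 2)
    return per_query.mean(dim=1)             # (B, 2) averaged matching logits

labels = torch.tensor([1, 0])                # 1 = matched, 0 = unmatched (hard negative)
loss = nn.functional.cross_entropy(itm_logits(torch.randn(2, 32, 768)), labels)
```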
Bootstrap Vision-to-Language Generative Learning from a Frozen LLM (1/3)
Bootstrap Vision-to-Language Generative Learning from a Frozen LLM (2/3)
We use a fully-connected (FC) layer to linearly project the output query embeddings $Z$ into the same dimension as the text embedding of the LLM. The projected query embeddings are then prepended to the input text embeddings.
They function as soft visual prompts that condition the LLM on visual representation extracted by the Q-Former.
Since the Q-Former has been pre-trained to extract language-informative visual representation, it effectively functions as an information bottleneck that feeds the most useful information to the LLM while removing irrelevant visual information.
This reduces the burden of the LLM to learn vision-language alignment, thus mitigating the catastrophic forgetting problem.
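A minimal sketch of this connection, assuming an OPT-2.7B-sized hidden dimension; shapes and variable names are illustrative:

```python
# Sketch of connecting Q-Former outputs to the frozen LLM: a fully-connected
# layer projects the 32 query embeddings to the LLM's hidden size, and the
# result is prepended to the text embeddings as soft visual prompts.
import torch
import torch.nn as nn

llm_hidden = 2560                                  # e.g. OPT-2.7B hidden size (assumed)
proj = nn.Linear(768, llm_hidden)                  # the only new layer in stage 2

Z = torch.randn(2, 32, 768)                        # Q-Former output queries
text_embeds = torch.randn(2, 16, llm_hidden)       # embedded text tokens from the frozen LLM

visual_prompts = proj(Z)                           # (2, 32, llm_hidden)
inputs_embeds = torch.cat([visual_prompts, text_embeds], dim=1)  # (2, 48, llm_hidden)
# inputs_embeds would be passed to the frozen LLM (e.g. via `inputs_embeds=` in
# Hugging Face models); only `proj` and the Q-Former receive gradients.
```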
Bootstrap Vision-to-Language Generative Learning from a Frozen LLM (3/3)
For decoder-based LLMs, we pre-train with the language modeling loss, where the frozen LLM is tasked to generate the text conditioned on the visual representation from Q-Former.
For encoder-decoder-based LLMs, we pre-train with the prefix language modeling loss, where we split a text into two parts (see the split sketch below).
The prefix text is concatenated with the visual representation as input to the LLM’s encoder.
The suffix text is used as the generation target for the LLM’s decoder.
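A toy illustration of the prefix/suffix split; the split point below is arbitrary and only meant to show which piece goes where:

```python
# Prefix LM setup for encoder-decoder LLMs (e.g. FlanT5): the prefix goes to
# the encoder together with the visual prompts, the suffix is the decoder's
# generation target.
caption = "a small cat sitting on a wooden chair in the garden"
tokens = caption.split()
split = len(tokens) // 2
prefix_text = " ".join(tokens[:split])    # encoder input (after the visual prompts)
suffix_text = " ".join(tokens[split:])    # decoder target
```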
Pre-training Data (1/2)
Pre-training data: 129M images
COCO (Lin et al., 2014)
Visual Genome (Krishna et al., 2017)
CC3M (Sharma et al., 2018)
CC12M (Changpinyo et al., 2021)
SBU (Ordonez et al., 2011)
115M images from the LAION400M dataset (Schuhmann et al., 2021)
Pre-training Data (2/2)
Pre-training data: synthetic captions
We adopt the CapFilt method (Li et al., 2022) to create synthetic captions for the web images.
Specifically, we generate 10 captions using the $\text{BLIP}_\text{large}$ captioning model, and rank the synthetic captions along with the original web caption based on the image-text similarity produced by a CLIP ViT-L/14 model.
We keep the top two captions per image as training data and randomly sample one at each pre-training step (a small sketch follows).
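A hedged sketch of this caption selection step; the helper and its inputs (precomputed CLIP embeddings) are assumptions for illustration:

```python
# CapFilt-style selection: 10 synthetic captions plus the original web caption
# are ranked by CLIP image-text similarity, the top two are kept, and one of
# them is sampled at each pre-training step.
import random
import torch

def select_captions(image_emb, caption_embs, captions, keep=2):
    """image_emb: (D,) CLIP image embedding; caption_embs: (N, D) CLIP text
    embeddings for the web caption plus the synthetic captions."""
    sims = torch.nn.functional.cosine_similarity(caption_embs, image_emb.unsqueeze(0))
    top = sims.topk(keep).indices.tolist()
    return [captions[i] for i in top]

kept = select_captions(torch.randn(512), torch.randn(11, 512),
                       [f"caption {i}" for i in range(11)])
training_caption = random.choice(kept)   # sampled anew at every step
```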
Pre-trained Image Encoder and LLM
For the frozen image encoder, we explore two state-of-the-art pre-trained vision transformer models:
ViT-L/14 from CLIP (Radford et al., 2021)
ViT-g/14 from EVA-CLIP (Fang et al., 2022).
We remove the last layer of the ViT and use the second-to-last layer's output features, which leads to slightly better performance.
For the frozen language model, we explore
the unsupervised-trained OPT model family (Zhang et al., 2022) for decoder-based LLMs
the instruction-trained FlanT5 model family (Chung et al., 2022) for encoder-decoder-based LLMs.
Pre-Training Settings
Training steps
We pre-train for 250k steps in the first stage and 80k steps in the second stage.
Batch size
We use a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and a batch size of 1920/1520 for OPT/FlanT5 in the second stage.
Model precision
During pre-training, we convert the frozen ViTs' and LLMs' parameters into FP16, except for FlanT5 where we use BFloat16. We found no performance degradation compared to using 32-bit models.
Due to the use of frozen models, our pre-training is more computationally friendly than existing large-scale VLP methods.
For example, using a single 16-A100(40G) machine, our largest model with ViT-g and FlanT5-XXL requires less than 6 days for the first stage and less than 3 days for the second stage.
Pre-Training Settings
The same set of pre-training hyper-parameters is used for all models.
Optimizer
We use the AdamW (Loshchilov & Hutter, 2017) optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and a weight decay of $0.05$.
We use a cosine learning rate decay with a peak learning rate of 1e-4 and a linear warmup of 2k steps.
The minimum learning rate at the second stage is 5e-5.
Image size and augmentation
We use images of size 224×224, augmented with random resized cropping and horizontal flipping (see the sketch below).
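A sketch of these settings in PyTorch; the scheduler wiring and the first-stage minimum learning rate are assumptions, only the stated hyper-parameters come from the slides:

```python
# Optimizer, learning-rate schedule, and image augmentation as stated above.
import torch
from torchvision import transforms

params = [torch.nn.Parameter(torch.randn(10))]          # stand-in for Q-Former params
optimizer = torch.optim.AdamW(params, lr=1e-4, betas=(0.9, 0.98), weight_decay=0.05)

warmup_steps, total_steps = 2_000, 250_000              # first-stage step count
min_lr = 5e-5                                           # stated for the second stage; assumed here
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3,
                                           total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                    T_max=total_steps - warmup_steps,
                                                    eta_min=min_lr)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                  milestones=[warmup_steps])

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```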
Experiments
Overview of BLIP-2 results on various zero-shot vision-language tasks
Instructed Zero-shot Image-to-Text Generation
Image Captioning
Visual Question Answering
Image-Text Retrieval
Overview of BLIP-2 Results on Various Zero-Shot Vision-Language Tasks
(Comparison of the number of trainable parameters and zero-shot performance across models)
Instructed Zero-shot Image-to-Text Generation (1/5)
Figure 4-1
Instructed Zero-shot Image-to-Text Generation (2/5)
Figure 4-2
Instructed Zero-shot Image-to-Text Generation (3/5)
Figure 4-3
Instructed Zero-shot Image-to-Text Generation (4/5)
Figure 4-4
Instructed Zero-shot Image-to-Text Generation (5/5)
visual knowledge reasoning
visual commonsense reasoning
visual conversation
personalized image-to-text generation
Zero-Shot Visual Question Answering: Settings
Prompt
OPT models: Question: {} Answer:
FlanT5: Question: {} Short answer:
During generation, we use beam search with a beam width of 5.
We also set the length-penalty to -1, which encourages shorter answers that align better with human annotation (see the decoding sketch below).
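A hedged, Hugging Face-style sketch of this decoding configuration; the checkpoint and prompt are illustrative, and the injection of the projected visual prompts is omitted:

```python
# Zero-shot VQA decoding: beam search with width 5 and length_penalty = -1
# to favour short answers.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

prompt = "Question: what is the man holding? Short answer:"
inputs = tokenizer(prompt, return_tensors="pt")
# In BLIP-2, the projected query embeddings would also be fed to the LLM;
# only the text prompt is shown here.
outputs = model.generate(**inputs, num_beams=5, length_penalty=-1.0, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```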
Zero-Shot Visual Question Answering: Results
Effect of Vision-Language Representation Learning
Image Captioning
Training: use the prompt "a photo of" as the initial input to the LLM and train with the language modeling loss.
We keep the LLM frozen during finetuning, and update the parameters of the Q-Former together with the image encoder.
Visual Question Answering: Settings
Given annotated VQA data, we finetune the parameters of the Q-Former and the image encoder while keeping the LLM frozen.
We finetune with the open-ended answer generation loss, where the LLM receives Q-Former's output and the question as input, and is asked to generate the answer.
In order to extract image features that are more relevant to the question, we additionally condition Q-Former on the question.
Specifically, the question tokens are given as input to the Q-Former and interact with the queries via the self-attention layers, which can guide the Q-Former’s cross-attention layers to focus on more informative image regions.
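A conceptual sketch of this conditioning (shapes and names are assumptions, not the released code):

```python
# For VQA finetuning, the question tokens are fed to the Q-Former together
# with the queries, so self-attention lets the question guide which image
# regions the queries cross-attend to.
import torch

queries = torch.randn(1, 32, 768)          # learnable query embeddings
question_embeds = torch.randn(1, 12, 768)  # embedded question tokens

# Joint sequence seen by the Q-Former's self-attention layers:
qformer_input = torch.cat([queries, question_embeds], dim=1)   # (1, 44, 768)
# Only the first 32 positions (the queries) also cross-attend into the frozen
# image features; their outputs are then projected and handed to the frozen LLM.
```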
Visual Question Answering: Model Architecture
Visual Question Answering: Results
Image-Text Retrieval: Settings
Since image-text retrieval does not involve language generation, we directly finetune the first-stage-pretrained model w/o LLM.
Specifically, we finetune the image encoder together with Q-Former on COCO using the same objectives (i.e. ITC, ITM, and ITG) as pre-training.
We then evaluate the model for both image-to-text retrieval and text-to-image retrieval on COCO and Flickr30K (Plummer et al., 2015) datasets.
During inference, we follow Li et al. (2021; 2022): we first select k = 128 candidates based on the image-text feature similarity, and then re-rank them based on pairwise ITM scores (see the sketch below).
We experiment with both ViT-L and ViT-g as the image encoder.
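A sketch of this two-stage retrieval inference; `itm_score_fn` is a hypothetical helper standing in for a forward pass through the ITM head:

```python
# Retrieval inference: select the k = 128 candidates with the highest ITC
# feature similarity, then re-rank them with the pairwise ITM matching score.
import torch

def retrieve_texts(sim_i2t_row, itm_score_fn, k=128):
    """sim_i2t_row: (N_texts,) ITC similarities of one image to all texts;
    itm_score_fn(text_idx) -> ITM match score (assumed helper)."""
    candidates = sim_i2t_row.topk(k).indices
    itm_scores = torch.tensor([itm_score_fn(int(i)) for i in candidates])
    order = itm_scores.argsort(descending=True)
    return candidates[order]                 # text indices, best match first

ranked = retrieve_texts(torch.randn(5000), itm_score_fn=lambda i: torch.rand(1).item())
```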
Image-Text Retrieval: Results
Image-Text Retrieval: The Impact of ITG Loss
This result supports our intuition in designing the representation learning objectives: the ITG loss enforces the queries to extract visual features most relevant to the text, thus improving vision-language alignment.
Limitation (1/3)
Recent LLMs can perform in-context learning given few-shot examples.
However, in our experiments with BLIP-2, we do not observe improved VQA performance when providing the LLM with in-context VQA examples.
We attribute the lack of in-context learning capability to our pre-training dataset, which only contains a single image-text pair per sample.
The LLMs cannot learn from it the correlation among multiple image-text pairs in a single sequence.
The same observation is also reported in the Flamingo paper, which uses a closed-source interleaved image-and-text dataset (M3W) with multiple image-text pairs per sequence.
We aim to create a similar dataset in future work.
Limitation (2/3)
BLIP-2's image-to-text generation can produce unsatisfactory results for various reasons:
inaccurate knowledge from the LLM
activating the incorrect reasoning path
not having up-to-date information about new image content
Furthermore, due to the use of frozen models, BLIP-2 inherits the risks of LLMs:
outputting offensive language
propagating social bias
leaking private information
Remediation approaches include using instructions to guide the model's generation or training on a filtered dataset with harmful content removed.
Limitation (3/3)
Conclusion
We propose BLIP-2, a generic and compute-efficient method for vision-language pre-training that leverages frozen pretrained image encoders and LLMs.
BLIP-2 achieves state-of-the-art performance on various vision-language tasks while having a small number of trainable parameters during pre-training.
BLIP-2 also demonstrates emerging capabilities in zero-shot instructed image-to-text generation.
We consider BLIP-2 as an important step towards building a multimodal conversational AI agent.
Summary
BLIP-2
Two-stage training
V+L representation learning
Image-Text Contrastive Learning
Image-grounded Text Generation
Image-Text Matching
V+L generative learning
Q-Former
BERT
Learnable query embeddings
interact with frozen image features through cross-attention layers
interact with the text through the same self-attention layers
- Q-Former:
- a lightweight Transformer
- a set of learnable query vectors to extract visual features from the frozen image encoder
- an information bottleneck between the frozen image encoder and the frozen LLM
- Learnable queries:
- interact with frozen image features through cross-attention layers (inserted in every other transformer block)
- interact with the text through the same self-attention layers
Suppose there are two query tokens and two text tokens, ordered top-to-bottom and left-to-right in the sequence. After they are fed into the Q-Former, the masking behaves as follows (see the sketch after this list):
- Bi-directional Self-Attention Mask: all query tokens and text tokens attend to each other.
- Multi-modal Causal Self-Attention Mask: all query tokens attend to each other; each text token can see all the query tokens (since they represent the image) as well as the text tokens that appear before it (the order matters because the task is text generation).
- Uni-modal Self-Attention Mask: query tokens can see all query tokens and text tokens can see all text tokens, but query tokens and text tokens cannot see each other.
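A small sketch that builds the three masks for this 2-query / 2-text example (True means the row token may attend to the column token):

```python
# Build the three self-attention masks for a sequence [q0, q1, t0, t1].
import torch

n_q, n_t = 2, 2
n = n_q + n_t

# Bi-directional (ITM): everything attends to everything.
bi_mask = torch.ones(n, n, dtype=torch.bool)

# Multi-modal causal (ITG): queries attend only to queries; each text token
# attends to all queries, to itself, and to earlier text tokens.
causal_mask = torch.zeros(n, n, dtype=torch.bool)
causal_mask[:n_q, :n_q] = True                              # queries -> queries
causal_mask[n_q:, :n_q] = True                              # text -> queries
causal_mask[n_q:, n_q:] = torch.tril(torch.ones(n_t, n_t)).bool()

# Uni-modal (ITC): queries see queries, text sees text, no cross access.
uni_mask = torch.zeros(n, n, dtype=torch.bool)
uni_mask[:n_q, :n_q] = True
uni_mask[n_q:, n_q:] = True

print(causal_mask.int())
```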
BFloat16 (brain floating point) was developed by Google Brain. It is used to reduce storage requirements and increase the computation speed of machine learning algorithms.
- On OK-VQA, BLIP-2 comes second to Flamingo80B. We hypothesize that this is because OK-VQA focuses more on open-world knowledge than on visual understanding.
- Hence a model with more parameters stores more knowledge.
- A stronger image encoder or a stronger LLM both lead to better results.