License: CC BY 4.0
arXiv:2301.12597v3 [cs.CV] 15 Jun 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
Salesforce Research
Abstract
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
Code: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
1 Introduction
Vision-language pre-training (VLP) research has witnessed a rapid advancement in the past few years, where pre-trained models with increasingly larger scale have been developed to continuously push the state-of-the-art on various downstream tasks (Radford et al., 2021; Li et al., 2021, 2022; Wang et al., 2022a; Alayrac et al., 2022; Wang et al., 2022b). However, most state-of-the-art vision-language models incur a high computation cost during pre-training, due to end-to-end training using large-scale models and datasets.
Figure 1: Overview of BLIP-2's framework. We pre-train a lightweight Querying Transformer following a two-stage strategy to bridge the modality gap. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen LLM, which enables zero-shot instructed image-to-text generation (see Figure 4 for more examples).
Vision-language research sits at the intersection between vision and language; it is therefore natural to expect that vision-language models can harvest from the readily-available unimodal models from the vision and natural language communities. In this paper, we propose a generic and compute-efficient VLP method by bootstrapping from off-the-shelf pre-trained vision models and language models. Pre-trained vision models offer high-quality visual representation. Pre-trained language models, in particular large language models (LLMs), offer strong language generation and zero-shot transfer abilities. To reduce computation cost and counteract the issue of catastrophic forgetting, the unimodal pre-trained models remain frozen during the pre-training.
In order to leverage pre-trained unimodal models for VLP, it is key to facilitate cross-modal alignment. However, since LLMs have not seen images during their unimodal pre-training, freezing them makes vision-language alignment particularly challenging. In this regard, existing methods (e.g. Frozen (Tsimpoukelli et al., 2021), Flamingo (Alayrac et al., 2022)) resort to an image-to-text generation loss, which we show is insufficient to bridge the modality gap.
To achieve effective vision-language alignment with frozen unimodal models, we propose a Querying Transformer (Q-Former) pre-trained with a new two-stage pre-training strategy. As shown in Figure 1, Q-Former is a lightweight transformer which employs a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM, where it feeds the most useful visual features for the LLM to output the desired text. In the first pre-training stage, we perform vision-language representation learning, which enforces the Q-Former to learn visual representation most relevant to the text. In the second pre-training stage, we perform vision-to-language generative learning by connecting the output of the Q-Former to a frozen LLM, and train the Q-Former such that its output visual representation can be interpreted by the LLM.
We name our VLP framework BLIP-2: Bootstrapping Language-Image Pre-training with frozen unimodal models. The key advantages of BLIP-2 include:

- BLIP-2 effectively leverages both frozen pre-trained image models and language models. We bridge the modality gap using a Q-Former pre-trained in two stages: a representation learning stage and a generative learning stage. BLIP-2 achieves state-of-the-art performance on various vision-language tasks including visual question answering, image captioning, and image-text retrieval.
- Powered by LLMs (e.g. OPT (Zhang et al., 2022), FlanT5 (Chung et al., 2022)), BLIP-2 can be prompted to perform zero-shot image-to-text generation that follows natural language instructions, which enables emerging capabilities such as visual knowledge reasoning, visual conversation, etc. (see Figure 4 for examples).
- Due to the use of frozen unimodal models and a lightweight Q-Former, BLIP-2 is more compute-efficient than existing state-of-the-art methods. For example, BLIP-2 outperforms Flamingo (Alayrac et al., 2022) by 8.7% on zero-shot VQAv2, while using 54× fewer trainable parameters. Furthermore, our results show that BLIP-2 is a generic method that can harvest more advanced unimodal models for better VLP performance.
2 Related Work
2.1 End-to-end Vision-Language Pre-training
Vision-language pre-training aims to learn multimodal foundation models with improved performance on various vision-and-language tasks. Depending on the downstream task, different model architectures have been proposed, including the dual-encoder architecture (Radford et al., 2021; Jia et al., 2021), the fusion-encoder architecture (Tan & Bansal, 2019; Li et al., 2021), the encoder-decoder architecture (Cho et al., 2021; Wang et al., 2021b; Chen et al., 2022b), and more recently, the unified transformer architecture (Li et al., 2022; Wang et al., 2022b). Various pre-training objectives have also been proposed over the years, and have progressively converged to a few time-tested ones: image-text contrastive learning (Radford et al., 2021; Yao et al., 2022; Li et al., 2021, 2022), image-text matching (Li et al., 2021, 2022; Wang et al., 2021a), and (masked) language modeling (Li et al., 2021, 2022; Yu et al., 2022; Wang et al., 2022b).
Most VLP methods perform end-to-end pre-training using large-scale image-text pair datasets. As the model size keeps increasing, the pre-training can incur an extremely high computation cost. Moreover, it is inflexible for end-to-end pre-trained models to leverage readily-available unimodal pre-trained models, such as LLMs (Brown et al., 2020; Zhang et al., 2022; Chung et al., 2022).
2.2 Modular Vision-Language Pre-training
More similar to us are methods that leverage off-the-shelf pre-trained models and keep them frozen during VLP. Some methods freeze the image encoder, including the early work which adopts a frozen object detector to extract visual features (Chen et al., 2020; Li et al., 2020; Zhang et al., 2021), and the recent LiT (Zhai et al., 2022) which uses a frozen pre-trained image encoder for CLIP (Radford et al., 2021) pre-training. Some methods freeze the language model to use the knowledge from LLMs for vision-to-language generation tasks (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Chen et al., 2022a; Mañas et al., 2023; Tiong et al., 2022; Guo et al., 2022). The key challenge in using a frozen LLM is to align visual features to the text space. To achieve this, Frozen (Tsimpoukelli et al., 2021) finetunes an image encoder whose outputs are directly used as soft prompts for the LLM. Flamingo (Alayrac et al., 2022) inserts new cross-attention layers into the LLM to inject visual features, and pre-trains the new layers on billions of image-text pairs. Both methods adopt the language modeling loss, where the language model generates texts conditioned on the image.
Different from existing methods, BLIP-2 can effectively and efficiently leverage both frozen image encoders and frozen LLMs for various vision-language tasks, achieving stronger performance at a lower computation cost.
3 Method
Figure 2: (Left) Model architecture of Q-Former and BLIP-2's first-stage vision-language representation learning objectives. We jointly optimize three objectives which enforce the queries (a set of learnable embeddings) to extract visual representation most relevant to the text. (Right) The self-attention masking strategy for each objective to control query-text interaction.
We propose BLIP-2, a new vision-language pre-training method that bootstraps from frozen pre-trained unimodal models. In order to bridge the modality gap, we propose a Querying Transformer (Q-Former) pre-trained in two stages: (1) vision-language representation learning stage with a frozen image encoder and (2) vision-to-language generative learning stage with a frozen LLM. This section first introduces the model architecture of Q-Former, and then delineates the two-stage pre-training procedures.
3.1 Model Architecture
We propose Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM. It extracts a fixed number of output features from the image encoder, independent of input image resolution. As shown in Figure 2, Q-Former consists of two transformer submodules that share the same self-attention layers: (1) an image transformer that interacts with the frozen image encoder for visual feature extraction, and (2) a text transformer that can function as both a text encoder and a text decoder. We create a set number of learnable query embeddings as input to the image transformer. The queries interact with each other through self-attention layers, and interact with frozen image features through cross-attention layers (inserted every other transformer block). The queries can additionally interact with the text through the same self-attention layers. Depending on the pre-training task, we apply different self-attention masks to control query-text interaction. We initialize Q-Former with the pre-trained weights of BERT_base (Devlin et al., 2019), whereas the cross-attention layers are randomly initialized. In total, Q-Former contains 188M parameters. Note that the queries are considered as model parameters.
In our experiments, we use 32 queries, where each query has a dimension of 768 (the same as the hidden dimension of the Q-Former). We use Z to denote the output query representation. The size of Z (32 × 768) is much smaller than the size of the frozen image features (e.g. 257 × 1024 for ViT-L/14). This bottleneck architecture, together with our pre-training objectives, forces the queries to extract the visual information that is most relevant to the text.
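A minimal PyTorch sketch of the image-transformer path just described may help make the query bottleneck concrete. It is illustrative only: layer count, residual structure, and the placement of cross-attention in every block (rather than every other block) are simplifications, not the LAVIS implementation.

```python
# Illustrative sketch of the Q-Former's image-transformer path (not the real model).
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    def __init__(self, num_queries=32, dim=768, img_dim=1024, num_layers=2):
        super().__init__()
        # 32 learnable query embeddings, treated as model parameters.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, 12, batch_first=True) for _ in range(num_layers)])
        # Cross-attention to the frozen image features (every other block in the real model).
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, 12, kdim=img_dim, vdim=img_dim, batch_first=True)
             for _ in range(num_layers)])
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_layers)])

    def forward(self, image_feats):  # image_feats: (B, 257, 1024) from a frozen ViT-L/14
        q = self.queries.expand(image_feats.size(0), -1, -1)
        for sa, ca, ffn in zip(self.self_attn, self.cross_attn, self.ffn):
            q = q + sa(q, q, q, need_weights=False)[0]                       # queries attend to each other
            q = q + ca(q, image_feats, image_feats, need_weights=False)[0]   # queries extract visual info
            q = q + ffn(q)
        return q  # Z: (B, 32, 768), the bottlenecked visual representation

Z = QFormerSketch()(torch.randn(2, 257, 1024))  # stand-in for frozen image encoder output
print(Z.shape)  # torch.Size([2, 32, 768])
```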
3.2 Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder
In the representation learning stage, we connect Q-Former to a frozen image encoder and perform pre-training using image-text pairs. We aim to train the Q-Former such that the queries can learn to extract visual representation that is most informative of the text. Inspired by BLIP (Li et al., 2022), we jointly optimize three pre-training objectives that share the same input format and model parameters. Each objective employs a different attention masking strategy between queries and text to control their interaction (see Figure 2).
Image-Text Contrastive Learning (ITC) learns to align image representation and text representation such that their mutual information is maximized. It does so by contrasting the image-text similarity of a positive pair against those of negative pairs. We align the output query representation Z from the image transformer with the text representation t from the text transformer, where t is the output embedding of the [CLS] token. Since Z contains multiple output embeddings (one from each query), we first compute the pairwise similarity between each query output and t, and then select the highest one as the image-text similarity. To avoid information leakage, we employ a unimodal self-attention mask, where the queries and text are not allowed to see each other. Due to the use of a frozen image encoder, we can fit more samples per GPU compared to end-to-end methods. Therefore, we use in-batch negatives instead of the momentum queue in BLIP.
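The following is a hedged sketch of the ITC similarity just described: compare every query embedding against the text [CLS] embedding, keep the maximum, and contrast against in-batch negatives. The 256-dimensional projection and the temperature value are assumptions for illustration.

```python
# Sketch of the max-over-queries ITC loss with in-batch negatives (illustrative).
import torch
import torch.nn.functional as F

def itc_loss(Z, t, temperature=0.07):
    # Z: (B, 32, 256) projected query embeddings; t: (B, 256) projected [CLS] embeddings
    Z = F.normalize(Z, dim=-1)
    t = F.normalize(t, dim=-1)
    # sim[i, j] = max over queries of <query_k(image_i), text_j>
    sim = torch.einsum("iqd,jd->ijq", Z, t).max(dim=-1).values / temperature  # (B, B)
    targets = torch.arange(Z.size(0), device=Z.device)
    # symmetric image-to-text and text-to-image contrastive terms
    return (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

loss = itc_loss(torch.randn(4, 32, 256), torch.randn(4, 256))
print(loss.item())
```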
Image-grounded Text Generation (ITG) loss trains the Q-Former to generate texts, given input images as the condition. Since the architecture of Q-Former does not allow direct interactions between the frozen image encoder and the text tokens, the information required for generating the text must first be extracted by the queries and then passed to the text tokens via the self-attention layers. Therefore, the queries are forced to extract visual features that capture all the information about the text. We employ a multimodal causal self-attention mask to control query-text interaction, similar to the one used in UniLM (Dong et al., 2019). The queries can attend to each other but not to the text tokens. Each text token can attend to all queries and to its previous text tokens. We also replace the [CLS] token with a new [DEC] token as the first text token to signal the decoding task.
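An illustrative construction of this multimodal causal mask is sketched below; the [queries | text] layout and the boolean "True = attention allowed" convention are assumptions for clarity, not the library's mask format.

```python
# Build a UniLM-style mask: queries see only queries; text sees all queries and earlier text.
import torch

def multimodal_causal_mask(num_queries: int, num_text: int) -> torch.Tensor:
    n = num_queries + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)     # True = attention allowed
    mask[:num_queries, :num_queries] = True         # queries <-> queries
    mask[num_queries:, :num_queries] = True         # text tokens -> all queries
    causal = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
    mask[num_queries:, num_queries:] = causal       # text tokens -> previous text tokens
    return mask

print(multimodal_causal_mask(2, 3).int())
```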
Image-Text Matching (ITM) aims to learn fine-grained alignment between image and text representation. It is a binary classification task where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched). We use a bi-directional self-attention mask where all queries and texts can attend to each other. The output query embeddings Z thus capture multimodal information. We feed each output query embedding into a two-class linear classifier to obtain a logit, and average the logits across all queries as the output matching score. We adopt the hard negative mining strategy from Li et al. (2021, 2022) to create informative negative pairs.
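A short sketch of the ITM head described above, with illustrative names: a two-class linear classifier applied to each output query embedding, with logits averaged across queries to form the matching score.

```python
# ITM head sketch: per-query two-class logits, averaged over the 32 queries.
import torch
import torch.nn as nn

itm_head = nn.Linear(768, 2)                 # shared two-class classifier
Z_multimodal = torch.randn(4, 32, 768)       # query outputs after bi-directional attention with text
logits = itm_head(Z_multimodal).mean(dim=1)  # (B, 2): average logits across queries
match_prob = logits.softmax(dim=-1)[:, 1]    # probability that the pair is matched
print(match_prob)
```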
Figure 3: BLIP-2's second-stage vision-to-language generative pre-training, which bootstraps from frozen large language models (LLMs). (Top) Bootstrapping a decoder-based LLM (e.g. OPT). (Bottom) Bootstrapping an encoder-decoder-based LLM (e.g. FlanT5). The fully-connected layer adapts from the output dimension of the Q-Former to the input dimension of the chosen LLM.
3.3 Bootstrap Vision-to-Language Generative Learning from a Frozen LLM
In the generative pre-training stage, we connect Q-Former (with the frozen image encoder attached) to a frozen LLM to harvest the LLM's generative language capability. As shown in Figure 3, we use a fully-connected (FC) layer to linearly project the output query embeddings Z into the same dimension as the text embeddings of the LLM. The projected query embeddings are then prepended to the input text embeddings. They function as soft visual prompts that condition the LLM on visual representation extracted by the Q-Former. Since the Q-Former has been pre-trained to extract language-informative visual representation, it effectively functions as an information bottleneck that feeds the most useful information to the LLM while removing irrelevant visual information. This reduces the burden on the LLM to learn vision-language alignment, thus mitigating the catastrophic forgetting problem.
We experiment with two types of LLMs: decoder-based LLMs and encoder-decoder-based LLMs. For decoder-based LLMs, we pre-train with the language modeling loss, where the frozen LLM is tasked to generate the text conditioned on the visual representation from Q-Former. For encoder-decoder-based LLMs, we pre-train with the prefix language modeling loss, where we split a text into two parts. The prefix text is concatenated with the visual representation as input to the LLM's encoder. The suffix text is used as the generation target for the LLM's decoder.
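A minimal sketch of the second-stage hookup, under the assumption of a decoder-only LLM such as OPT-2.7B (hidden size 2560): a single fully-connected layer projects the 32 query embeddings to the LLM's embedding width, and the result is prepended to the text embeddings as soft visual prompts. Tensor shapes are illustrative.

```python
# Soft visual prompts: project Q-Former output and prepend it to the text embeddings.
import torch
import torch.nn as nn

llm_dim = 2560                        # e.g. OPT-2.7B hidden size (assumption)
proj = nn.Linear(768, llm_dim)        # the FC layer trained in stage 2 alongside the Q-Former

Z = torch.randn(4, 32, 768)           # Q-Former output (frozen image encoder upstream)
text_embeds = torch.randn(4, 20, llm_dim)  # embedded target caption tokens

visual_prompts = proj(Z)                                      # (4, 32, 2560)
llm_inputs = torch.cat([visual_prompts, text_embeds], dim=1)  # soft prompts come first
# For a decoder-only LLM, the language-modeling loss is applied to the text positions only;
# the 32 prompt positions are ignored (e.g. label -100 in common frameworks).
print(llm_inputs.shape)  # torch.Size([4, 52, 2560])
```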
3.4 Model Pre-training
Pre-training data. We use the same pre-training dataset as BLIP, with 129M images in total, including COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and 115M images from the LAION400M dataset (Schuhmann et al., 2021). We adopt the CapFilt method (Li et al., 2022) to create synthetic captions for the web images. Specifically, we generate 10 captions using the BLIP_large captioning model, and rank the synthetic captions along with the original web caption based on the image-text similarity produced by a CLIP ViT-L/14 model. We keep the top two captions per image as training data and randomly sample one at each pre-training step.
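The caption bootstrapping step can be summarized with a hedged sketch: rank the original web caption together with the synthetic captions by an image-text similarity score and keep the top two, then sample one of them at each step. `clip_similarity` is a stand-in for scoring with CLIP ViT-L/14, not a real API call.

```python
# Illustrative CapFilt-style selection of training captions.
import random

def select_training_captions(image, web_caption, synthetic_captions, clip_similarity, k=2):
    candidates = [web_caption] + synthetic_captions          # 1 web caption + 10 generated ones
    ranked = sorted(candidates, key=lambda c: clip_similarity(image, c), reverse=True)
    return ranked[:k]                                        # top-two captions kept as training data

def sample_caption(kept_captions):
    # one of the two kept captions is randomly sampled at each pre-training step
    return random.choice(kept_captions)
```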
Pre-trained image encoder and LLM. For the frozen image encoder, we explore two state-of-the-art pre-trained vision transformer models: (1) ViT-L/14 from CLIP (Radford et al., 2021) and (2) ViT-g/14 from EVA-CLIP (Fang et al., 2022). We remove the last layer of the ViT and use the second-to-last layer's output features, which leads to slightly better performance. For the frozen language model, we explore the unsupervised-trained OPT model family (Zhang et al., 2022) for decoder-based LLMs, and the instruction-trained FlanT5 model family (Chung et al., 2022) for encoder-decoder-based LLMs.
Figure 4: Selected examples of instructed zero-shot image-to-text generation using a BLIP-2 model with ViT-g and FlanT5-XXL, which shows a wide range of capabilities including visual conversation, visual knowledge reasoning, visual commonsense reasoning, storytelling, personalized image-to-text generation, etc.
Pre-training settings. We pre-train for 250k steps in the first stage and 80k steps in the second stage. We use a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and a batch size of 1920/1520 for OPT/FlanT5 in the second stage. During pre-training, we convert the frozen ViTs' and LLMs' parameters into FP16, except for FlanT5, where we use BFloat16. We found no performance degradation compared to using 32-bit models. Due to the use of frozen models, our pre-training is more computationally friendly than existing large-scale VLP methods. For example, using a single 16-A100(40G) machine, our largest model with ViT-g and FlanT5-XXL requires less than 6 days for the first stage and less than 3 days for the second stage.
The same set of pre-training hyper-parameters is used for all models. We use the AdamW (Loshchilov & Hutter, 2017) optimizer with β1 = 0.9, β2 = 0.98, and a weight decay of 0.05. We use a cosine learning rate decay with a peak learning rate of 1e-4 and a linear warmup of 2k steps. The minimum learning rate at the second stage is 5e-5. We use images of size 224 × 224, augmented with random resized cropping and horizontal flipping.
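A sketch of this schedule is given below, assuming the second-stage settings (80k total steps, 5e-5 floor); the dummy parameter merely stands in for the Q-Former weights.

```python
# AdamW with betas (0.9, 0.98), weight decay 0.05; 2k linear warmup then cosine decay
# from the 1e-4 peak to the 5e-5 minimum (second-stage setting).
import math
import torch

params = [torch.nn.Parameter(torch.zeros(10))]   # stand-in for trainable Q-Former parameters
peak_lr, min_lr, warmup, total = 1e-4, 5e-5, 2_000, 80_000

optimizer = torch.optim.AdamW(params, lr=peak_lr, betas=(0.9, 0.98), weight_decay=0.05)

def lr_lambda(step):
    if step < warmup:
        return step / warmup                                  # linear warmup
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))         # decays from 1 to 0
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr   # scale relative to the peak lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```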
4 Experiment
| Models | #Trainable Params | Open-sourced? | VQAv2 test-dev (VQA acc.) | NoCaps val (CIDEr) | NoCaps val (SPICE) | Flickr test (TR@1) | Flickr test (IR@1) |
|---|---|---|---|---|---|---|---|
| BLIP (Li et al., 2022) | 583M | ✓ | - | 113.2 | 14.8 | 96.7 | 86.7 |
| SimVLM (Wang et al., 2021b) | 1.4B | ✗ | - | 112.2 | - | - | - |
| BEIT-3 (Wang et al., 2022b) | 1.9B | ✓ | - | - | - | 94.9 | 81.5 |
| Flamingo (Alayrac et al., 2022) | 10.2B | ✗ | 56.3 | - | - | - | - |
| BLIP-2 | 188M | ✓ | 65.0 | 121.6 | 15.8 | 97.6 | 89.7 |

Table 1: Overview of BLIP-2 results on various zero-shot vision-language tasks. Compared with previous state-of-the-art models, BLIP-2 achieves the highest zero-shot performance while requiring the least number of trainable parameters during vision-language pre-training.
| Models | #Trainable Params | #Total Params | VQAv2 val | VQAv2 test-dev | OK-VQA test | GQA test-dev |
|---|---|---|---|---|---|---|
| VL-T5_no-vqa | 224M | 269M | 13.5 | - | 5.8 | 6.3 |
| FewVLM (Jin et al., 2022) | 740M | 785M | 47.7 | - | 16.5 | 29.3 |
| Frozen (Tsimpoukelli et al., 2021) | 40M | 7.1B | 29.6 | - | 5.9 | - |
| VLKD (Dai et al., 2022) | 406M | 832M | 42.6 | 44.5 | 13.3 | - |
| Flamingo3B (Alayrac et al., 2022) | 1.4B | 3.2B | - | 49.2 | 41.2 | - |
| Flamingo9B (Alayrac et al., 2022) | 1.8B | 9.3B | - | 51.8 | 44.7 | - |
| Flamingo80B (Alayrac et al., 2022) | 10.2B | 80B | - | 56.3 | 50.6 | - |
| BLIP-2 ViT-L OPT-2.7B | 104M | 3.1B | 50.1 | 49.7 | 30.2 | 33.9 |
| BLIP-2 ViT-g OPT-2.7B | 107M | 3.8B | 53.5 | 52.3 | 31.7 | 34.6 |
| BLIP-2 ViT-g OPT-6.7B | 108M | 7.8B | 54.3 | 52.6 | 36.4 | 36.4 |
| BLIP-2 ViT-L FlanT5-XL | 103M | 3.4B | 62.6 | 62.3 | 39.4 | 44.4 |
| BLIP-2 ViT-g FlanT5-XL | 107M | 4.1B | 63.1 | 63.0 | 40.7 | 44.2 |
| BLIP-2 ViT-g FlanT5-XXL | 108M | 12.1B | 65.2 | 65.0 | 45.9 | 44.7 |

Table 2: Comparison with state-of-the-art methods on zero-shot visual question answering.
Table 1 provides an overview of the performance of BLIP-2 on various zero-shot vision-language tasks. Compared to previous state-of-the-art models, BLIP-2 achieves improved performance while requiring substantially fewer trainable parameters during vision-language pre-training.
4.1 Instructed Zero-shot Image-to-Text Generation
BLIP-2 effectively enables an LLM to understand images while preserving its capability to follow text prompts, which allows us to control image-to-text generation with instructions. We simply append the text prompt after the visual prompt as input to the LLM. Figure 4 shows examples demonstrating a wide range of zero-shot image-to-text capabilities including visual knowledge reasoning, visual commonsense reasoning, visual conversation, personalized image-to-text generation, etc.
Zero-shot VQA. We perform quantitative evaluation on the zero-shot visual question answering task. For OPT models, we use the prompt "Question: {} Answer:". For FlanT5 models, we use the prompt "Question: {} Short answer:". During generation, we use beam search with a beam width of 5. We also set the length penalty to -1, which encourages shorter answers that align better with human annotation.
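A hedged usage sketch of this protocol follows, using the later Hugging Face port of BLIP-2; the checkpoint name and processor/model classes are assumptions about that port rather than the paper's own code (which lives in LAVIS). The beam width of 5 and length penalty of -1 mirror the settings above.

```python
# Zero-shot VQA with a BLIP-2 OPT checkpoint (illustrative; requires a GPU and the weights).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")                           # placeholder image path
prompt = "Question: what is the person holding? Answer:"    # OPT-style VQA prompt

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, num_beams=5, length_penalty=-1, max_new_tokens=10)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```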
As shown in Table 2, BLIP-2 achieves state-of-the-art results on the VQAv2 (Goyal et al., 2017) and GQA (Hudson & Manning, 2019) datasets. It outperforms Flamingo80B by 8.7% on VQAv2, despite having 54x fewer trainable parameters. On the OK-VQA (Marino et al., 2019) dataset, BLIP-2 comes second to Flamingo80B. We hypothesize that this is because OK-VQA focuses more on open-world knowledge than on visual understanding, and the 70B Chinchilla (Hoffmann et al., 2022) language model used by Flamingo80B possesses more knowledge than the 11B FlanT5-XXL.
We make a promising observation from TableĀ 2 : a stronger image encoder or a stronger LLM both lead to better performance. This observation is supported by several facts: (1) ViT-g outperforms ViT-L for both OPT and FlanT5. (2) Within the same LLM family, larger models outperform smaller ones. (3) FlanT5, an instruction-tuned LLM, outperforms the unsupervised-trained OPT on VQA. This observation validates BLIP-2 as a generic vision-language pre-training method that can efficiently harvest the rapid advances in vision and natural language communities.
Effect of Vision-Language Representation Learning.
Figure 5: Effect of vision-language representation learning on vision-to-language generative learning. Without representation learning, the Q-Former fails to bridge the modality gap, leading to significantly lower performance on zero-shot VQA.
The first-stage representation learning pre-trains the Q-Former to learn visual features relevant to the text, which reduces the burden on the LLM to learn vision-language alignment. Without the representation learning stage, Q-Former relies solely on vision-to-language generative learning to bridge the modality gap, which is similar to the Perceiver Resampler in Flamingo. Figure 5 shows the effect of representation learning on generative learning. Without representation learning, both types of LLMs give substantially lower performance on zero-shot VQA. In particular, OPT suffers from catastrophic forgetting, where performance drastically degrades as training proceeds.
| Models | #Trainable Params | NoCaps in-domain C | S | NoCaps near-domain C | S | NoCaps out-domain C | S | NoCaps overall C | S | COCO B@4 | COCO C |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OSCAR (Li et al., 2020) | 345M | - | - | - | - | - | - | 80.9 | 11.3 | 37.4 | 127.8 |
| VinVL (Zhang et al., 2021) | 345M | 103.1 | 14.2 | 96.1 | 13.8 | 88.3 | 12.1 | 95.5 | 13.5 | 38.2 | 129.3 |
| BLIP (Li et al., 2022) | 446M | 114.9 | 15.2 | 112.1 | 14.9 | 115.3 | 14.4 | 113.2 | 14.8 | 40.4 | 136.7 |
| OFA (Wang et al., 2022a) | 930M | - | - | - | - | - | - | - | - | 43.9 | 145.3 |
| Flamingo (Alayrac et al., 2022) | 10.6B | - | - | - | - | - | - | - | - | - | 138.1 |
| SimVLM (Wang et al., 2021b) | ~1.4B | 113.7 | - | 110.9 | - | 115.2 | - | 112.2 | - | 40.6 | 143.3 |
| BLIP-2 ViT-g OPT-2.7B | 1.1B | 123.0 | 15.8 | 117.8 | 15.4 | 123.4 | 15.1 | 119.7 | 15.4 | 43.7 | 145.8 |
| BLIP-2 ViT-g OPT-6.7B | 1.1B | 123.7 | 15.8 | 119.2 | 15.3 | 124.4 | 14.8 | 121.0 | 15.3 | 43.5 | 145.2 |
| BLIP-2 ViT-g FlanT5-XL | 1.1B | 123.7 | 16.3 | 120.2 | 15.9 | 124.8 | 15.1 | 121.6 | 15.8 | 42.4 | 144.5 |

Table 3: Comparison with state-of-the-art image captioning methods on NoCaps (zero-shot, validation set) and COCO Caption (fine-tuned, Karpathy test). All methods optimize the cross-entropy loss during finetuning. C: CIDEr, S: SPICE, B@4: BLEU@4.
| Models | #Trainable Params | VQAv2 test-dev | VQAv2 test-std |
|---|---|---|---|
| *Open-ended generation models* | | | |
| ALBEF (Li et al., 2021) | 314M | 75.84 | 76.04 |
| BLIP (Li et al., 2022) | 385M | 78.25 | 78.32 |
| OFA (Wang et al., 2022a) | 930M | 82.00 | 82.00 |
| Flamingo80B (Alayrac et al., 2022) | 10.6B | 82.00 | 82.10 |
| BLIP-2 ViT-g FlanT5-XL | 1.2B | 81.55 | 81.66 |
| BLIP-2 ViT-g OPT-2.7B | 1.2B | 81.59 | 81.74 |
| BLIP-2 ViT-g OPT-6.7B | 1.2B | 82.19 | 82.30 |
| *Closed-ended classification models* | | | |
| VinVL (Zhang et al., 2021) | 345M | 76.52 | 76.60 |
| SimVLM (Wang et al., 2021b) | ~1.4B | 80.03 | 80.34 |
| CoCa (Yu et al., 2022) | 2.1B | 82.30 | 82.30 |
| BEIT-3 (Wang et al., 2022b) | 1.9B | 84.19 | 84.03 |

Table 4: Comparison with state-of-the-art models fine-tuned for visual question answering.
4.2 Image Captioning
We finetune BLIP-2 models for the image captioning task, which asks the model to generate a text description of the image's visual content. We use the prompt "a photo of" as an initial input to the LLM and train the model to generate the caption with the language modeling loss. We keep the LLM frozen during finetuning, and update the parameters of the Q-Former together with the image encoder. We experiment with ViT-g and various LLMs. Detailed hyperparameters can be found in the appendix. We perform finetuning on COCO, and evaluate on both the COCO test set and zero-shot transfer to the NoCaps (Agrawal et al., 2019) validation set.
The results are shown in Table 3. BLIP-2 achieves state-of-the-art performance with a significant improvement on NoCaps over existing methods, demonstrating strong generalization ability to out-domain images.
4.3 Visual Question Answering
Given annotated VQA data, we finetune the parameters of the Q-Former and the image encoder while keeping the LLM frozen. We finetune with the open-ended answer generation loss, where the LLM receives the Q-Former's output and the question as input, and is asked to generate the answer. In order to extract image features that are more relevant to the question, we additionally condition the Q-Former on the question. Specifically, the question tokens are given as input to the Q-Former and interact with the queries via the self-attention layers, which can guide the Q-Former's cross-attention layers to focus on more informative image regions.
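A small, illustrative sketch of this question conditioning (shapes and sequence layout are assumptions): question token embeddings are concatenated with the queries so that shared self-attention lets the question guide which image regions the cross-attention focuses on.

```python
# Question-conditioned Q-Former input for VQA finetuning (illustrative).
import torch

queries = torch.randn(1, 32, 768)           # learnable queries (batch of 1)
question_tokens = torch.randn(1, 12, 768)   # embedded question tokens
qformer_input = torch.cat([queries, question_tokens], dim=1)  # (1, 44, 768)
# Self-attention runs over this joint sequence; only the first 32 outputs (the queries)
# are projected and passed to the frozen LLM as visual prompts.
print(qformer_input.shape)
```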
Following BLIP, our VQA data includes the training and validation splits from VQAv2, as well as training samples from Visual Genome. Table 4 demonstrates the state-of-the-art results of BLIP-2 among open-ended generation models.
4.4 Image-Text Retrieval
| Model | #Trainable Params | Flickr30K I→T R@1 | R@5 | R@10 | Flickr30K T→I R@1 | R@5 | R@10 | COCO I→T R@1 | R@5 | R@10 | COCO T→I R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Dual-encoder models* | | | | | | | | | | | | | |
| CLIP (Radford et al., 2021) | 428M | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | - | - | - | - | - | - |
| ALIGN (Jia et al., 2021) | 820M | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 |
| FILIP (Yao et al., 2022) | 417M | 89.8 | 99.2 | 99.8 | 75.0 | 93.4 | 96.3 | 78.9 | 94.4 | 97.4 | 61.2 | 84.3 | 90.6 |
| Florence (Yuan et al., 2021) | 893M | 90.9 | 99.1 | - | 76.7 | 93.6 | - | 81.8 | 95.2 | - | 63.2 | 85.7 | - |
| BEIT-3 (Wang et al., 2022b) | 1.9B | 94.9 | 99.9 | 100.0 | 81.5 | 95.6 | 97.8 | 84.8 | 96.5 | 98.3 | 67.2 | 87.7 | 92.8 |
| *Fusion-encoder models* | | | | | | | | | | | | | |
| UNITER (Chen et al., 2020) | 303M | 83.6 | 95.7 | 97.7 | 68.7 | 89.2 | 93.9 | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 |
| OSCAR (Li et al., 2020) | 345M | - | - | - | - | - | - | 70.0 | 91.1 | 95.5 | 54.0 | 80.8 | 88.5 |
| VinVL (Zhang et al., 2021) | 345M | - | - | - | - | - | - | 75.4 | 92.9 | 96.2 | 58.8 | 83.5 | 90.3 |
| *Dual encoder + fusion encoder reranking* | | | | | | | | | | | | | |
| ALBEF (Li et al., 2021) | 233M | 94.1 | 99.5 | 99.7 | 82.8 | 96.3 | 98.1 | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 |
| BLIP (Li et al., 2022) | 446M | 96.7 | 100.0 | 100.0 | 86.7 | 97.3 | 98.7 | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 |
| BLIP-2 ViT-L | 474M | 96.9 | 100.0 | 100.0 | 88.6 | 97.6 | 98.9 | 83.5 | 96.0 | 98.0 | 66.3 | 86.5 | 91.8 |
| BLIP-2 ViT-g | 1.2B | 97.6 | 100.0 | 100.0 | 89.7 | 98.1 | 98.9 | 85.4 | 97.0 | 98.5 | 68.3 | 87.7 | 92.6 |

Table 5: Comparison with state-of-the-art image-text retrieval methods, finetuned on COCO (5K test set) and zero-shot transferred to Flickr30K (1K test set). I→T: image-to-text retrieval; T→I: text-to-image retrieval.
Since image-text retrieval does not involve language generation, we directly finetune the first-stage-pretrained model without the LLM. Specifically, we finetune the image encoder together with the Q-Former on COCO using the same objectives (i.e. ITC, ITM, and ITG) as pre-training. We then evaluate the model for both image-to-text retrieval and text-to-image retrieval on the COCO and Flickr30K (Plummer et al., 2015) datasets. During inference, we follow Li et al. (2021, 2022): we first select k = 128 candidates based on the image-text feature similarity, and then re-rank them based on their pairwise ITM scores. We experiment with both ViT-L and ViT-g as the image encoder. Detailed hyperparameters can be found in the appendix.
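The two-stage inference can be sketched as below; `itm_score` is a stand-in for the Q-Former's ITM head returning a scalar tensor per image-text pair, and the function names are illustrative.

```python
# Retrieval inference sketch: top-k=128 candidates by ITC similarity, reranked by ITM scores.
import torch

def retrieve_texts(sim_matrix, itm_score, k=128):
    # sim_matrix: (num_images, num_texts) image-text feature similarities
    topk = sim_matrix.topk(k, dim=1).indices                   # (num_images, k) candidate text indices
    ranked = []
    for img_idx, cand in enumerate(topk):
        scores = torch.stack([itm_score(img_idx, int(t)) for t in cand])
        ranked.append(cand[scores.argsort(descending=True)])   # rerank candidates by ITM score
    return torch.stack(ranked)                                  # (num_images, k) reranked text indices
```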
The results are shown in Table 5. BLIP-2 achieves state-of-the-art performance with a significant improvement over existing methods on zero-shot image-text retrieval.
| COCO finetuning objectives | I→T R@1 | I→T R@5 | T→I R@1 | T→I R@5 |
|---|---|---|---|---|
| ITC + ITM | 84.5 | 96.2 | 67.2 | 87.1 |
| ITC + ITM + ITG | 85.4 | 97.0 | 68.3 | 87.7 |

Table 6: The image-grounded text generation (ITG) loss improves image-text retrieval performance by enforcing the queries to extract language-relevant visual features.
The ITC and ITM losses are essential for image-text retrieval as they directly learn image-text similarity. In Table 6, we show that the ITG (image-grounded text generation) loss is also beneficial for image-text retrieval. This result supports our intuition in designing the representation learning objectives: the ITG loss enforces the queries to extract visual features most relevant to the text, thus improving vision-language alignment.
5 Limitation
Recent LLMs can perform in-context learning given few-shot examples. However, in our experiments with BLIP-2 we do not observe improved VQA performance when providing the LLM with in-context VQA examples. We attribute the lack of in-context learning capability to our pre-training dataset, which contains only a single image-text pair per sample; the LLMs cannot learn from it the correlation among multiple image-text pairs in a single sequence. The same observation is also reported in the Flamingo paper, which uses a closed-source interleaved image-and-text dataset (M3W) with multiple image-text pairs per sequence. We aim to create a similar dataset in future work.
BLIP-2's image-to-text generation can produce unsatisfactory results for various reasons, including inaccurate knowledge from the LLM, activating an incorrect reasoning path, or not having up-to-date information about new image content (see Figure 7). Furthermore, due to the use of frozen models, BLIP-2 inherits the risks of LLMs, such as outputting offensive language, propagating social bias, or leaking private information. Remediation approaches include using instructions to guide the model's generation or training on a filtered dataset with harmful content removed.
6 Conclusion
We propose BLIP-2, a generic and compute-efficient method for vision-language pre-training that leverages frozen pre-trained image encoders and LLMs. BLIP-2 achieves state-of-the-art performance on various vision-language tasks while having a small number of trainable parameters during pre-training. BLIP-2 also demonstrates emerging capabilities in zero-shot instructed image-to-text generation. We consider BLIP-2 an important step towards building a multimodal conversational AI agent.
References

Agrawal, H., Anderson, P., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., and Lee, S. nocaps: novel object captioning at scale. In ICCV, pp. 8947-8956, 2019.

Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., and Simonyan, K. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In NeurIPS, 2020.

Changpinyo, S., Sharma, P., Ding, N., and Soricut, R. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.

Chen, J., Guo, H., Yi, K., Li, B., and Elhoseiny, M. VisualGPT: Data-efficient adaptation of pretrained language models for image captioning. In CVPR, pp. 18009-18019, 2022a.

Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A. J., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A., Bradbury, J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B. K., Riquelme, C., Steiner, A., Angelova, A., Zhai, X., Houlsby, N., and Soricut, R. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022b.

Chen, Y., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. UNITER: universal image-text representation learning. In ECCV, volume 12375, pp. 104-120, 2020.

Cho, J., Lei, J., Tan, H., and Bansal, M. Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779, 2021.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Narang, S., Mishra, G., Yu, A., Zhao, V. Y., Huang, Y., Dai, A. M., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Dai, W., Hou, L., Shang, L., Jiang, X., Liu, Q., and Fung, P. Enabling multimodal generation on CLIP via vision-language knowledge distillation. In ACL Findings, pp. 2383-2395, 2022.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171-4186, 2019.

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H. Unified language model pre-training for natural language understanding and generation. In NeurIPS, pp. 13042-13054, 2019.

Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. EVA: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636, 2022.

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pp. 6325-6334, 2017.

Guo, J., Li, J., Li, D., Tiong, A. M. H., Li, B., Tao, D., and Hoi, S. C. H. From images to textual prompts: Zero-shot VQA with frozen large language models. In CVPR, 2022.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pp. 6700-6709, 2019.

Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021.

Jin, W., Cheng, Y., Shen, Y., Chen, W., and Ren, X. A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. In ACL, pp. 2763-2775, 2022.

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L., Shamma, D. A., Bernstein, M. S., and Fei-Fei, L. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32-73, 2017.

Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S., Xiong, C., and Hoi, S. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021.

Li, J., Li, D., Xiong, C., and Hoi, S. C. H. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pp. 12888-12900, 2022.

Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., and Gao, J. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, pp. 121-137, 2020.

Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: common objects in context. In ECCV, volume 8693, pp. 740-755, 2014.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Mañas, O., Rodríguez, P., Ahmadi, S., Nematzadeh, A., Goyal, Y., and Agrawal, A. MAPL: parameter-efficient adaptation of unimodal pre-trained models for vision-language few-shot prompting. In EACL, 2023.

Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.

Ordonez, V., Kulkarni, G., and Berg, T. L. Im2Text: Describing images using 1 million captioned photographs. In NIPS, pp. 1143-1151, 2011.

Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pp. 2641-2649, 2015.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556-2565, 2018.

Tan, H. and Bansal, M. LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, pp. 5099-5110, 2019.

Tiong, A. M. H., Li, J., Li, B., Savarese, S., and Hoi, S. C. H. Plug-and-play VQA: zero-shot VQA by conjoining large pretrained models with zero training. In EMNLP Findings, 2022.

Tsimpoukelli, M., Menick, J., Cabi, S., Eslami, S. M. A., Vinyals, O., and Hill, F. Multimodal few-shot learning with frozen language models. In NeurIPS, pp. 200-212, 2021.

Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, pp. 23318-23340, 2022a.

Wang, W., Bao, H., Dong, L., and Wei, F. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021a.

Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., and Wei, F. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022b.

Wang, Z., Yu, J., Yu, A. W., Dai, Z., Tsvetkov, Y., and Cao, Y. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021b.

Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., Liang, X., Li, Z., Jiang, X., and Xu, C. FILIP: fine-grained interactive language-image pre-training. In ICLR, 2022.

Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.

Yuan, L., Chen, D., Chen, Y., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., and Zhang, P. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.

Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., and Beyer, L. LiT: Zero-shot transfer with locked-image text tuning. In CVPR, pp. 18102-18112, 2022.

Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. VinVL: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529, 2021.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M. T., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. OPT: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
| Hyperparameter | FlanT5-XL | OPT-2.7B | OPT-6.7B |
|---|---|---|---|
| Fine-tuning epochs | 5 | 5 | 5 |
| Warmup steps | 1000 | 1000 | 1000 |
| Learning rate | 1e-5 | 1e-5 | 1e-5 |
| Batch size | 256 | 256 | 256 |
| AdamW β | (0.9, 0.999) | (0.9, 0.999) | (0.9, 0.999) |
| Weight decay | 0.05 | 0.05 | 0.05 |
| Drop path | 0 | 0 | 0 |
| Image resolution | 364 | 364 | 364 |
| Prompt | "a photo of" | "a photo of" | "a photo of" |
| Inference beam size | 5 | 5 | 5 |
| Layer-wise learning rate decay for ViT | 1 | 1 | 0.95 |

Table 7: Hyperparameters for fine-tuning BLIP-2 with ViT-g on COCO captioning.
| Hyperparameter | FlanT5-XL | OPT-2.7B | OPT-6.7B |
|---|---|---|---|
| Fine-tuning epochs | 5 | 5 | 5 |
| Warmup steps | 1000 | 1000 | 1000 |
| Learning rate | 1e-5 | 1e-5 | 1e-5 |
| Batch size | 128 | 128 | 128 |
| AdamW β | (0.9, 0.999) | (0.9, 0.999) | (0.9, 0.999) |
| Weight decay | 0.05 | 0.05 | 0.05 |
| Drop path | 0 | 0 | 0 |
| Image resolution | 490 | 490 | 490 |
| Prompt | "Question: {} Answer:" | "Question: {} Answer:" | "Question: {} Answer:" |
| Inference beam size | 5 | 5 | 5 |
| Layer-wise learning rate decay for ViT | 0.95 | 0.95 | 0.9 |

Table 8: Hyperparameters for fine-tuning BLIP-2 with ViT-g on VQA.
| Hyperparameter | ViT-L/14 | ViT-g/14 |
|---|---|---|
| Fine-tuning epochs | 5 | 5 |
| Warmup steps | 1000 | 1000 |
| Learning rate | 5e-6 | 1e-5 |
| Batch size | 224 | 224 |
| AdamW β | (0.9, 0.98) | (0.9, 0.999) |
| Weight decay | 0.05 | 0.05 |
| Drop path | 0 | 0 |
| Image resolution | 364 | 364 |
| Layer-wise learning rate decay for ViT | 1 | 0.95 |

Table 9: Hyperparameters for fine-tuning BLIP-2 on COCO image-text retrieval.
Figure 6: Incorrect output examples for instructed zero-shot image-to-text generation using a BLIP-2 model with ViT-g and FlanT5-XXL.
Figure 7: Model architecture for VQA finetuning, where the LLM receives the Q-Former's output and the question as input, then predicts answers. We also provide the question as a condition to Q-Former, such that the extracted image features are more relevant to the question.