A Survey of Vision-Language Multimodal Pre-training Models
Submitted: 2023-02-20
DOI:
Keywords: multimodal pre-training; Vision-Language (VL) pre-training; representation learning
Funding: National Key R&D Program of China (2018YFB1404103)
Name | Affiliation
Zhu Ruolin | School of Information and Communication Engineering, Communication University of China
Lan Shanzhen | School of Information and Communication Engineering, Communication University of China
Zhu Zixing | School of Information and Communication Engineering, Communication University of China
Abstract (Chinese):

In recent years, multimodal pre-training has flourished on vision-language tasks. Extensive research has shown that pre-training representations of multiple modalities improves performance on vision-language downstream tasks. Multimodal representation pre-training adopts self-supervised learning paradigms, including contrastive learning and masked self-supervision, and trains on large-scale image-text correlation data; by learning priors both within and across modalities, the model acquires general and highly transferable visual representation abilities. In the post-BERT era, mainstream models in the vision-language multimodal field mostly adopt Transformer-based network architectures. This paper reviews Transformer-based work in the vision-language multimodal field; traces the development of mainstream multimodal learning methods and analyzes the strengths and limitations of different approaches; summarizes the various supervision signals used in multimodal pre-training and their roles; outlines the mainstream large-scale image-text datasets currently in use; and finally briefly introduces several related cross-modal pre-training downstream tasks.

Abstract (English):

Multimodal pre-training has attracted increasing interest in vision-language tasks. Recent comprehensive studies have demonstrated that multimodal representation pre-training benefits Vision-Language (VL) downstream tasks. Multimodal pre-training relies on large-scale training data and self-supervised learning. This paper reviews significant Transformer-based research on VL pre-training that emerged after BERT. Firstly, the application background and significance of multimodal pre-training are expounded. Secondly, the paper introduces the development of mainstream multimodal networks and analyzes the advantages and disadvantages of each method. Then, it explains the cost functions used in multi-task pre-training. Next, it surveys the large-scale image-text datasets mentioned in recent studies. Finally, for each VL downstream task, the paper describes the task objectives, datasets, and training methods.
