A survey on Vision-Language multimodality pre-training
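Submission date: 2023/2/20
Chinese keywords: multimodal pre-training; vision-language pre-training; representation learning
English keywords: multimodal pre-training; Vision-Language (VL) training; representation learning
Funding: National Key Research and Development Program of China (2018YFB1404103)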
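Name, Affiliation
Zhu Ruolin (朱若琳), School of Information and Communication Engineering, Communication University of China
Lan Shanzhen (蓝善祯), School of Information and Communication Engineering, Communication University of China
Zhu Zixing (朱紫星), School of Information and Communication Engineering, Communication University of China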



Multimodal pre-training has attracted growing interest for vision-language tasks. Recent studies have demonstrated that pre-trained multimodal representations benefit Vision-Language downstream tasks. Multimodal pre-training relies on large-scale training data and self-supervised learning. This paper reviews significant transformer-based research on Vision-Language (VL) pre-training published after BERT. First, the application background and significance of multimodal pre-training are described. Second, the paper introduces the development of mainstream multimodal networks and analyzes the advantages and disadvantages of each method. Third, it explains the cost functions used in multi-task pre-training. Next, it describes the large-scale image-text datasets used in recent studies. Finally, for different VL downstream tasks, the paper describes the task objectives, datasets, and training methods.
