Journal of Communication University of China (Science and Technology)

A multimodal object detection method based on text and images

Submission date:

2023/6/20

DOI:

Keywords:

multimodal; object detection; deep learning

Funding:

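Name	Affiliation
Yun Jiaojiao (员娇娇)	Faculty of Information Technology, Beijing University of Technology
Hu Yongli (胡永利)	Faculty of Information Technology, Beijing University of Technology
Yin Baocai (尹宝才)	Faculty of Information Technology, Beijing University of Technology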


Abstract:

In recent years, large volumes of multimodal data (images, text, video, audio, etc.) have emerged on the Internet. Because data from different modalities are complementary, using them jointly for tasks such as classification, detection, and segmentation has become a research hotspot in computer vision. Object detection, an important direction within this area, has been studied in increasing depth. Traditional object detection algorithms use only single-modality image data to classify and localize objects, and do not consider how text affects detection performance. This paper focuses on multimodal object detection based on text and images. First, the traditional Faster R-CNN algorithm is used to extract features of candidate objects in the image, while a Bi-GRU is used to extract text features. Second, an effective co-attention model is designed to promote the fusion of the two modalities, text and image. Experimental results on the large-scale object detection dataset MS COCO show that the detection accuracy of the proposed method is higher than that of object detection algorithms that use only image information, which demonstrates the effectiveness of the method and of its text-image fusion.
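The abstract names a co-attention model but does not specify its design, so the following is only a minimal sketch of one common form of co-attention (scaled dot-product affinity followed by softmax in both directions), not the authors' model. The function names `softmax` and `co_attention` are hypothetical, and the toy lists stand in for region features that would come from Faster R-CNN and word features that would come from a Bi-GRU; both are assumed to share the same dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(regions, words):
    """Illustrative co-attention (an assumption, not the paper's design).

    regions: n region feature vectors of dimension d (stand-in for Faster R-CNN output)
    words:   m word feature vectors of dimension d (stand-in for Bi-GRU output)
    Returns (fused_regions, fused_words); each fused vector has dimension 2*d.
    """
    d = len(regions[0])
    scale = math.sqrt(d)
    # affinity[i][j]: scaled dot product between region i and word j
    affinity = [
        [sum(r * w for r, w in zip(region, word)) / scale for word in words]
        for region in regions
    ]

    # Each region attends over all words and is concatenated with its text context.
    fused_regions = []
    for i, region in enumerate(regions):
        weights = softmax(affinity[i])
        context = [sum(weights[j] * words[j][k] for j in range(len(words)))
                   for k in range(d)]
        fused_regions.append(list(region) + context)

    # Each word attends over all regions and is concatenated with its image context.
    fused_words = []
    for j, word in enumerate(words):
        weights = softmax([affinity[i][j] for i in range(len(regions))])
        context = [sum(weights[i] * regions[i][k] for i in range(len(regions)))
                   for k in range(d)]
        fused_words.append(list(word) + context)

    return fused_regions, fused_words


# Toy inputs: three candidate regions and two words, feature dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
fused_regions, fused_words = co_attention(regions, words)
```

In a detector, the fused region vectors would feed the classification and box-regression heads in place of the image-only features; concatenation is used here only because it keeps the original features visible, and a learned projection or gating would be an equally plausible choice.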

References: