English abstract:
In recent years, large amounts of multimodal data (images, text, video, audio, etc.) have emerged on the Internet. Because the modalities are complementary, using such data for classification, detection, and segmentation has become a research hotspot in computer vision. Object detection, an important research direction in computer vision, has received increasing attention. Traditional object detection algorithms use only single-modality image data to classify and localize objects, and do not consider the effect of text on detection performance. This paper focuses on object detection based on both text and images. First, the Faster R-CNN algorithm is used to extract features of candidate objects in the image, and a Bi-GRU is used to extract text features. Second, an effective co-attention mechanism is designed to promote interaction between text and images. Experimental results on MS COCO show that the detection accuracy of this method is higher than that of object detection algorithms that use only image information, and that text and image information are fused effectively.
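
The abstract does not specify how the co-attention step is designed, so the following is only a minimal sketch of one common form, a dot-product co-attention between image-region features and word features. It is an assumption for illustration, not the method of this paper. The function names (`co_attention`, `weighted_sum`) are hypothetical, and the small hand-written vectors stand in for Faster R-CNN region features and Bi-GRU word features.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def co_attention(regions, words):
    """Hypothetical dot-product co-attention (an assumption, not the paper's design).

    regions: list of region feature vectors (stand-in for Faster R-CNN output)
    words:   list of word feature vectors (stand-in for Bi-GRU output)
    Both must share the same feature dimension.
    """
    # affinity[i][j] scores how well region i matches word j
    affinity = [[dot(r, w) for w in words] for r in regions]

    # Each region attends over all words to get a text context vector
    region_ctx = [weighted_sum(softmax(row), words) for row in affinity]

    # Each word attends over all regions to get a visual context vector
    columns = [list(col) for col in zip(*affinity)]
    word_ctx = [weighted_sum(softmax(col), regions) for col in columns]

    # Fuse by concatenating each feature with its cross-modal context
    fused_regions = [r + c for r, c in zip(regions, region_ctx)]
    fused_words = [w + c for w, c in zip(words, word_ctx)]
    return fused_regions, fused_words


# Toy inputs: 3 regions and 2 words, all with feature dimension 2
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
fused_regions, fused_words = co_attention(regions, words)
```

In a detector along these lines, the fused region features would feed the classification and box-regression heads in place of the image-only features. Real systems typically project both modalities into a shared space with learned weights before computing the affinity, which this sketch omits.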