Sparse Transformer-based algorithm for long-short temporal association action recognition
Submission date: 2023/12/20
Keywords: deep learning; action recognition; sparse Transformer; R3D-18
Authors:
Liao Jianwen, School of Information and Communication Engineering, Communication University of China
Yang Yingyun, School of Information and Communication Engineering, Communication University of China
Lu Yue, School of Information and Communication Engineering, Communication University of China
Abstract:

Mainstream video action recognition algorithms often exploit temporal information insufficiently, whereas the Transformer is better suited to handling long sequences and global dependencies. This paper combines 3D convolutional neural networks (3D CNN) with the Transformer and proposes a sparse Transformer-based long-short temporal association action recognition algorithm that models the global temporal information of a video. The algorithm extracts per-clip features with a pre-trained video model, embeds a video feature clustering module to reduce potential noise in the input features, and applies a Transformer long-short temporal association module based on sparse self-attention: a sparse mask matrix is introduced to mask the similarity matrix, suppressing the smaller attention weights, selectively retaining important long-short temporal information, and sharpening the model's focus on global contextual information. Extensive experiments on the UCF101 and HMDB51 datasets verify the effectiveness of the proposed algorithm, which achieves higher accuracy than comparable state-of-the-art methods while keeping the parameter count and computational complexity small.
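
The abstract describes the sparse self-attention masking only at a high level. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, assuming clip-level features of shape (batch, num_clips, dim) and a top-k rule for choosing which similarity scores survive; the paper's actual sparsification criterion and module structure are not given here, so every name and parameter in the sketch is an assumption.

import torch
import torch.nn.functional as F

def sparse_masked_attention(q, k, v, top_k=8):
    # Hypothetical sketch: scaled dot-product attention whose similarity
    # matrix is masked so that only the top-k scores per query survive,
    # suppressing the smaller attention weights as the abstract describes.
    # q, k, v: (batch, num_clips, dim) clip-level feature sequences,
    # e.g. produced by a pre-trained R3D-18 backbone.
    dim = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / dim ** 0.5
    top_k = min(top_k, scores.size(-1))
    kth = scores.topk(top_k, dim=-1).values[..., -1:]   # k-th largest score per row
    mask = scores >= kth                                # sparse mask matrix
    scores = scores.masked_fill(~mask, float('-inf'))   # suppress small weights
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

Keeping only the k largest scores per row concentrates the softmax mass on the most relevant clips; a magnitude threshold on the scores would be an equally plausible sparsification rule.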
