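A New Special Video Detection Algorithm Based on 3D Convolutional Co-occurrence Histograms of Oriented Gradients and Multiple-Instance Learning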

English title: A New Special Video Detection Algorithm Based on 3D Convolution CoHOG and MIL
Authors: SONG Wei (宋伟); REN Dong (任栋); YU Jing (于京); QI Zhen-Guo (齐振国)
Affiliations: School of Information Engineering, Minzu University of China; School of Electronic Information Engineering, Beijing Jiaotong University
Keywords: video content detection; histogram of oriented gradients; multiple-instance learning; convolution; pooling; extreme learning machine
Journal: Chinese Journal of Computers (计算机学报)
Journal code: JSJX
Publication date: 2018-07-16 15:20
Publisher: Chinese Journal of Computers
Year: 2019
Volume: v.42; No.433
Issue: 01
Funding: National Natural Science Foundation of China (61503424, 61331013); China Scholarship Council; Minzu University of China "First-Class University, First-Class Discipline" program ("Image Engineering"); Minzu University of China Young Teachers' Research Capability Improvement Program
Language: Chinese
Article ID: JSJX201901011
Page count: 15
Pages: 151-165
CN: 11-1826/TP

Abstract

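Existing video content detection algorithms based on histogram of oriented gradients (HOG) information focus on extracting features from two-dimensional video frames and ignore the correlation of video content along the time dimension. Extracting the latent co-occurrence relationships between local gradients can improve detection accuracy to some extent, and pooling adjacent features can effectively reduce the information lost during feature dimensionality reduction. On this basis, the proposed method uses the structural information between video frames to build a three-dimensional co-occurrence histogram of oriented gradients through convolution, then pools adjacent features to reduce the dimension of the descriptor effectively. This addresses both the loss of recognition accuracy caused by ignoring inter-frame information and the difficulty of training on high-dimensional features. Video features are mapped to the instances and bags of multiple-instance learning, which makes it straightforward to detect videos of different lengths. In tests on the public Hockey and Movie datasets, the detection accuracy of the algorithm is 3% higher than the best existing algorithm on Hockey and 0.5% higher on Movie, which verifies the effectiveness of the new feature and algorithm.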
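The mapping of videos to bags and instances can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `video_to_bag` and `bag_label`, the clip length, the stride, and the score threshold are hypothetical, and the bag rule shown is the standard multiple-instance assumption (a bag is positive if at least one instance is positive), which may differ from the classifiers the authors trained.

```python
import numpy as np


def video_to_bag(frames: np.ndarray, clip_len: int = 8, stride: int = 4) -> np.ndarray:
    """Split a (T, H, W) grayscale video into overlapping clips.

    Each clip is one instance; the stack of clips is the bag for the video.
    Videos of different lengths simply yield bags with different instance counts.
    """
    t = frames.shape[0]
    if t < clip_len:
        raise ValueError("video is shorter than one clip")
    starts = range(0, t - clip_len + 1, stride)
    return np.stack([frames[s:s + clip_len] for s in starts])


def bag_label(instance_scores: np.ndarray, threshold: float = 0.5) -> int:
    """Standard MIL assumption: the bag is positive if any instance is positive."""
    return int(np.max(instance_scores) >= threshold)


# Two videos of different lengths produce bags of different sizes.
short_video = np.zeros((20, 16, 16))
long_video = np.zeros((40, 16, 16))
short_bag = video_to_bag(short_video)
long_bag = video_to_bag(long_video)
```

Because the label is attached to the bag rather than to each clip, the number of instances per video does not need to be fixed, which is what lets one model handle videos of any length.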
Existing video content detection algorithms based on histograms of oriented gradients extract features from individual two-dimensional video frames and ignore the correlation between frames along the time dimension. The frames of a video form an inseparable whole. Only the full sequence of consecutive frames expresses the true and complete semantics, so information extracted from key frames alone is inaccurate. This inter-frame correlation carries semantic information and is important for video content detection. In addition, the latent co-occurrence relationships between local gradient orientation features help improve detection accuracy, and pooling adjacent features reduces the dimension of high-dimensional features while avoiding the loss of hidden motion information.

We construct the 3D Conv-CoHOG feature by exploiting the structural information hidden across video frames along the time dimension, extending the two-dimensional CoHOG feature to three dimensions. Pooling over neighboring features then reduces the feature dimension effectively. The algorithm addresses two problems: the loss of recognition accuracy caused by neglecting inter-frame information, and the high computational complexity caused by high-dimensional features. Video features are mapped to the instances and bags of multiple-instance learning, which makes it simple to handle videos of different lengths.

In this article, we first introduce the research field and the importance of detecting violent video content. We then summarize previous work and divide existing algorithms into three categories: algorithms based on multi-modal audio and video features fused with color features, algorithms based on the fusion of different action features, and algorithms based on neural networks and unsupervised feature extraction. The core of the article is the description of the algorithm structure. We introduce the concept of HOG features and their extraction process, compare the extraction of HOG, CoHOG, and Conv-CoHOG as well as that of HOG and HOG3D, and propose a new special video content detection algorithm, 3D convolution CoHOG, extended from Conv-CoHOG. We compare the new feature with the earlier features in terms of computational dimension, feature dimension, and the relationship between adjacent features. Section 3.2 presents the framework of the new algorithm. Sections 3.3 to 3.7 describe the construction of the feature extraction unit, the quantization of three-dimensional gradients, the extraction of Co-HOG3D, the extraction of Conv-CoHOG3D, and the training of the multiple-instance learning model. Section 4.1 describes the two datasets used in the experiments, and Section 4.2 gives the parameter settings and evaluation criteria, followed by an analysis of the experimental results. For training we use three classifiers, each with several implementations. For testing we compare the results obtained with different features, analyze the reasons for the differences, and evaluate the effectiveness of the new feature. The proposed algorithm achieves the highest detection accuracy on the Hockey and Movie datasets, 3% higher than the best existing algorithm on Hockey and 0.5% higher on Movie, which demonstrates its effectiveness for special video detection.
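The three feature steps named in the abstract (quantizing three-dimensional gradients, counting co-occurring orientations, and pooling adjacent features) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the azimuth/elevation binning, the single non-negative offset, the max pooling over groups of adjacent feature vectors, and all function names and parameters are hypothetical choices made here.

```python
import numpy as np


def quantize_gradients_3d(clip: np.ndarray, n_az: int = 8, n_el: int = 4) -> np.ndarray:
    """Assign each voxel of a (T, H, W) clip an orientation label in [0, n_az * n_el)."""
    gt, gy, gx = np.gradient(clip.astype(float))      # derivatives along t, y, x
    azimuth = np.arctan2(gy, gx)                      # in [-pi, pi]
    elevation = np.arctan2(gt, np.hypot(gx, gy))      # in [-pi/2, pi/2]
    az_bin = np.floor((azimuth + np.pi) / (2 * np.pi) * n_az).astype(np.int64)
    el_bin = np.floor((elevation + np.pi / 2) / np.pi * n_el).astype(np.int64)
    az_bin = np.clip(az_bin, 0, n_az - 1)             # fold the upper edge into the last bin
    el_bin = np.clip(el_bin, 0, n_el - 1)
    return el_bin * n_az + az_bin


def cooccurrence_histogram(labels: np.ndarray, offset: tuple, n_labels: int) -> np.ndarray:
    """Count label pairs (labels[p], labels[p + offset]) for a non-negative (dt, dy, dx)."""
    dt, dy, dx = offset
    t, h, w = labels.shape
    first = labels[:t - dt, :h - dy, :w - dx]
    second = labels[dt:, dy:, dx:]
    pair_index = first.ravel() * n_labels + second.ravel()
    hist = np.bincount(pair_index, minlength=n_labels * n_labels)
    return hist.reshape(n_labels, n_labels)


def pool_adjacent(features: np.ndarray, group: int = 2) -> np.ndarray:
    """Max-pool every `group` consecutive feature vectors of an (N, D) array."""
    n = (features.shape[0] // group) * group
    return features[:n].reshape(-1, group, features.shape[1]).max(axis=1)


rng = np.random.default_rng(0)
clip = rng.random((6, 10, 12))
labels = quantize_gradients_3d(clip)
hist = cooccurrence_histogram(labels, offset=(1, 0, 1), n_labels=8 * 4)
pooled = pool_adjacent(np.arange(12).reshape(6, 2), group=2)
```

An offset with a non-zero `dt` pairs orientations from different frames, which is the kind of inter-frame relationship a purely two-dimensional co-occurrence histogram cannot capture. Pooling here halves the number of feature vectors while keeping the strongest response in each group.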

