Research on the Analysis and Design of Linguistic Steganography
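Abstract
Traditional cryptography applies various transformations to turn intelligible plaintext into unintelligible ciphertext, providing a powerful means of protecting secret information. However, cryptography has inherent shortcomings: while scrambling the plaintext, it also reveals that the message is secret and important and exposes its sender and receiver, which easily draws the attention of third parties and invites attack. In addition, the governments of some countries have restricted the use of cryptography. Information hiding overcomes these shortcomings and sidesteps such restrictions. It conceals the very existence of the secret information, greatly improving the security of information transmission and storage.
     Text information hiding is information hiding that uses text as the cover medium. It embeds secret information in text by exploiting redundancy in format, encoding, structure, syntax and semantics. Text plays a very important role on the Internet and its volume is enormous; together with the fact that text is more intuitive to process, this makes text an attractive carrier for information hiding.
     Linguistic steganography, a subclass of text information hiding, hides information in the redundancy of the natural language content of a text, with secure communication as its main purpose. Compared with other text information hiding methods it is more secure and more robust, and has therefore attracted more attention from researchers. Linguistic steganography is the product of combining information hiding with natural language processing.
     Linguistic steganalysis means analyzing suspicious text to discover the use of linguistic steganography and to intercept, destroy and recover the hidden secret information. It is the reverse study of linguistic steganography and can be divided into two stages, detection analysis and recovery analysis. Research on linguistic steganalysis is of great significance for safeguarding national security and social stability.
     On the one hand, this dissertation surveys in depth the embedding principles of common linguistic steganography methods, studies how they can be analyzed, proposes corresponding analysis methods and verifies them experimentally. On the other hand, building on this understanding and analysis of existing methods, it studies and designs more secure linguistic steganography.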
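     Accordingly, the research covers the following:
     1) Detection analysis of common linguistic steganography: studying the idea of blind detection analysis, designing efficient and practical detection algorithms, implementing the corresponding detection tools and carrying out the related experimental validation.
     2) Recovery analysis of common linguistic steganography: defining and modeling the recovery analysis problem in a reasonable way, proposing an approach to recovery analysis and designing the corresponding algorithms.
     3) Designing a text information hiding analysis system that, from the point of view of its users, performs blind or approximately blind detection analysis of text information hiding, including the various kinds of linguistic steganography.
     4) Drawing on the design experience of existing linguistic steganography, proposing more secure linguistic steganography, analyzing its security theoretically and validating it experimentally.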
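     For these research topics, the dissertation uses natural language processing techniques as its basic tools, applies probabilistic and statistical analysis, and draws on ideas from machine learning and artificial intelligence. It also uses theoretical analysis and experimental validation to justify some of the proposed algorithms. On this basis, the dissertation achieves several innovative results, chiefly the following:
     1) For text segments of different sizes, three blind detection algorithms for syntax-based linguistic steganography are designed, based respectively on the statistical characteristics of correlations between words, on the statistical characteristics of word positions, and on detection entropy. The corresponding theoretical analysis and experimental validation are carried out, with good detection results. The recovery analysis problem for sentence-template-based linguistic steganography is modeled, and a feasible recovery analysis algorithm is designed.
     2) An analysis framework for semantics-based linguistic steganography is constructed. A detection algorithm for synonym-substitution-based linguistic steganography is designed and supported by theoretical analysis and experimental validation, with good detection results. A recovery analysis algorithm for synonym-substitution-based linguistic steganography is also designed and analyzed theoretically.
     3) An approach is proposed that achieves blind or approximately blind detection analysis by constructing a suitable text information hiding analysis system. The properties such a system should have are given: extensibility, adaptability, the ability to take feedback and the ability to learn. Based on this idea, a concrete system design is proposed, and it is discussed how the design meets the requirements of blind detection analysis.
     4) The security requirements of linguistic steganography are discussed. Based on these requirements, two secure linguistic steganography methods are proposed, one based on double text segments and one based on contextual synonym substitution, and the security of each is analyzed and evaluated.
     Results 1) and 2) improve the accuracy of detection analysis for linguistic steganography, enrich the available detection techniques, and discuss the recovery analysis problem for linguistic steganography for the first time. Result 3) proposes the idea of achieving blind or approximately blind detection analysis by constructing a suitable system, together with a concrete system design. Result 4) presents two new and more secure linguistic steganography methods.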
Traditional cryptography turns a plaintext message into incomprehensible ciphertext by applying various transformations, providing a powerful means of protecting secret information. However, cryptography has inherent shortcomings. While scrambling the message, it also exposes important information, such as the fact that the message is secret and important and the identities of the sender and receiver, which may attract the attention of third parties and invite attack. Moreover, some governments have restricted the use of cryptography. Information hiding overcomes these shortcomings and avoids such restrictions. It conceals the very existence of secret information, which greatly enhances the security of information storage and transmission.
     Text information hiding is information hiding that uses text as its carrier. It hides secret information in the text by exploiting redundancy in format, encoding, structure, syntax and semantics. Text plays a very important role on the Internet. Because the volume of text data is very large and text processing is more intuitive, using text for information hiding is an attractive idea.
     Linguistic steganography is a subclass of text information hiding. It uses the redundancy of the natural language content of a text to hide secret information, with secure communication as its main purpose. Compared with other kinds of text information hiding, it offers higher security and better robustness, and thus attracts more attention from researchers. Linguistic steganography combines information hiding with natural language processing.
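     As a rough illustration of how redundancy in word choice can carry hidden bits, the following Python sketch implements a toy synonym-substitution scheme. The synonym table, the function names `embed` and `extract`, and the one-bit-per-word encoding are assumptions made for this example; they are not the methods proposed in the dissertation.

```python
# Toy synonym-substitution steganography: each synonym pair carries one bit.
# The synonym table is hypothetical and chosen only for illustration.
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

# Look-up tables: word -> bit it encodes, and word -> the pair it belongs to.
WORD_TO_BIT = {}
WORD_TO_PAIR = {}
for pair in SYNONYM_PAIRS:
    for bit, word in enumerate(pair):
        WORD_TO_BIT[word] = bit
        WORD_TO_PAIR[word] = pair


def embed(cover_text, bits):
    """Rewrite substitutable words so that, read in order, they spell `bits`."""
    out, i = [], 0
    for word in cover_text.split():
        if word in WORD_TO_PAIR and i < len(bits):
            # Pick the member of the pair whose index equals the next bit.
            out.append(WORD_TO_PAIR[word][bits[i]])
            i += 1
        else:
            out.append(word)
    if i < len(bits):
        raise ValueError("cover text has too few substitutable words")
    return " ".join(out)


def extract(stego_text, n_bits):
    """Read the first `n_bits` hidden bits back from the substitutable words."""
    bits = [WORD_TO_BIT[w] for w in stego_text.split() if w in WORD_TO_BIT]
    return bits[:n_bits]


if __name__ == "__main__":
    stego = embed("the big dog made a quick start", [1, 0, 1])
    print(stego)
    print(extract(stego, 3))
```

     The stego text remains ordinary, readable language, which is why this family of methods conceals the existence of the message rather than merely its content. A realistic scheme would also need a shared key and a synonym choice that respects context.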
     Linguistic steganalysis refers to analyzing suspicious text to discover the use of linguistic steganography and to intercept, destroy or recover the hidden secret information. It is the inverse of linguistic steganography and consists of two stages: detection analysis and recovery analysis. Research on linguistic steganalysis is of great significance for national security and social stability.
     In this dissertation, on the one hand, we thoroughly investigate the principles of common linguistic steganography methods, study the corresponding steganalysis problems, propose steganalysis methods and carry out the related experiments. On the other hand, building on this understanding and analysis of existing methods, we design more secure linguistic steganography methods.
     Accordingly, this dissertation focuses on the following research tasks.
     1) Detection analysis of common linguistic steganography methods: studying the idea of blind detection for linguistic steganography, designing efficient and practical detection methods, implementing the corresponding tools and carrying out the related experiments.
     2) Recovery analysis of common linguistic steganography methods: defining and modeling the recovery analysis problem in a reasonable way, studying approaches to recovery analysis and designing the corresponding recovery methods.
     3) Designing a system for text information hiding analysis that, from the point of view of its users, performs blind or nearly blind detection analysis of text information hiding methods, including all kinds of linguistic steganography.
     4) Proposing more secure linguistic steganography methods by learning from existing ones, then analyzing their security theoretically and validating it experimentally.
     To address these research tasks, this dissertation takes a variety of natural language processing techniques as its basic tools, uses probabilistic and statistical analysis as its main method, and draws on ideas from machine learning and artificial intelligence. In addition, theoretical analysis and experimental validation are used to justify some of the proposed algorithms. On this basis, the dissertation makes several innovative contributions, including the following.
     1) For text segments of different sizes, three blind detection methods for syntax-based linguistic steganography are designed, based respectively on the statistical characteristics of correlations between words, on word position distributions, and on detection entropy. The corresponding theoretical analysis and experimental validation are carried out, and the methods achieve good detection results. In addition, the recovery analysis problem for sentence-template-based linguistic steganography is modeled and a feasible recovery analysis method is designed.
     2) A steganalysis framework for semantics-based linguistic steganography is presented. A detection method for synonym-substitution-based linguistic steganography is designed, and theoretical analysis and experiments show that it achieves good detection results. Furthermore, a recovery method for synonym-substitution-based linguistic steganography is proposed and analyzed theoretically.
     3) An approach is proposed that realizes blind or nearly blind detection analysis of text information hiding by constructing a suitable system. The properties such a system should have are extensibility, adaptability, and the capacity for feedback and learning. Based on this idea, a concrete system design is presented, and it is discussed how the design satisfies the requirements of blind detection analysis.
     4) The security requirements of linguistic steganography are discussed. According to these requirements, two secure linguistic steganography methods are presented, one based on double text segments and one based on contextual synonym substitution. The security of each method is analyzed and evaluated.
     Of these contributions, points 1) and 2) improve the accuracy of linguistic steganography detection, enrich the available detection techniques and discuss the recovery analysis problem for linguistic steganography for the first time. Point 3) proposes realizing blind or nearly blind detection of text information hiding by constructing a suitable system and presents a concrete system design. Point 4) puts forward two new and more secure linguistic steganography methods.
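     To make the idea of statistical detection concrete, the sketch below computes one simple feature for text suspected of synonym substitution: the average Shannon entropy of word choice within each synonym set. The intuition, stated here as an assumption, is that natural writing tends to favor one synonym, whereas embedding random bits pushes the choices toward a uniform split. The function name `choice_entropy` and the synonym sets are hypothetical, and this is not the detection-entropy algorithm of the dissertation, whose definition is not given in the abstract.

```python
import math
from collections import Counter


def choice_entropy(text, synonym_sets):
    """Average Shannon entropy (in bits) of the synonym chosen within each set.

    Sets that never occur in `text` are skipped; returns 0.0 if none occur.
    """
    # Map each word to the index of the synonym set it belongs to.
    word_to_set = {w: i for i, s in enumerate(synonym_sets) for w in s}
    counts = [Counter() for _ in synonym_sets]
    for word in text.lower().split():
        if word in word_to_set:
            counts[word_to_set[word]][word] += 1

    entropies = []
    for counter in counts:
        total = sum(counter.values())
        if total == 0:
            continue
        entropies.append(
            -sum((n / total) * math.log2(n / total) for n in counter.values())
        )
    return sum(entropies) / len(entropies) if entropies else 0.0


if __name__ == "__main__":
    sets = [("big", "large")]  # hypothetical synonym set
    print(choice_entropy("big big big big large", sets))  # skewed usage
    print(choice_entropy("big large big large", sets))    # evenly split usage
```

     A detector built on such a feature would compare the value against a threshold learned from known cover and stego texts. A single feature of this kind is unlikely to be reliable on short texts, where few substitutable words occur.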
