Application of a Combined Rough Set and Support Vector Machine Method to the Discretization of Continuous Attributes
Abstract
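Rough sets and support vector machines (SVMs) are both data mining methods proposed for extracting fixed patterns from data. Rough set theory is well suited to massive data sets. The support vector machine is a classification method built on statistical learning theory; its structural risk minimization principle and kernel function theory avoid shortcomings of traditional methods such as the "curse of dimensionality" and "overfitting".
     This thesis combines rough sets and support vector machines to exploit the strengths of both, proposing a method that first preprocesses the data with rough sets and then classifies it precisely with a support vector machine.
     The thesis first introduces the basic theory of rough sets and support vector machines, briefly reviewing the lower approximation, upper approximation, and decision rules of rough sets as well as the structural risk minimization principle and kernel functions of support vector machines, and analyzes the advantages and limitations of the two methods in data mining.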
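For orientation, the lower and upper approximations reviewed in the thesis can be written in the standard Pawlak form. The abstract gives no notation, so the symbols here are assumptions: $U$ is the universe of samples, $R$ the indiscernibility relation induced by the condition attributes, $[x]_R$ the equivalence class of $x$, and $X \subseteq U$ a decision class.

```latex
\underline{R}X = \{\, x \in U : [x]_R \subseteq X \,\}, \qquad
\overline{R}X = \{\, x \in U : [x]_R \cap X \neq \emptyset \,\}, \qquad
BN_R(X) = \overline{R}X \setminus \underline{R}X
```

Samples in $\underline{R}X$ certainly belong to $X$, while samples in the boundary region $BN_R(X)$ cannot be assigned with certainty from the condition attributes alone.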
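     Next, to address the problems that earlier methods for discretizing continuous attributes produce complex classification rules and lose a large amount of information, a discretization method based on the rough set lower approximation is proposed. The method can preprocess massive data: it extracts the samples that rough set theory assigns to a class with certainty, removes likely noise from the sample data, and yields a partial set of decision rules. It does not destroy the indiscernibility relation of the original data set, and the resulting classification rules are concise.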
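The general idea of separating certainly classified samples from the rest can be sketched in Python as follows. This is a minimal illustration of the lower approximation on an already discretized decision table, not the thesis's algorithm: the function name `lower_approximation_split`, the toy data, and the way conflicting samples are simply set aside are all assumptions, since the abstract does not say how cut points are chosen or how noise is identified.

```python
from collections import defaultdict


def lower_approximation_split(rows, labels):
    """Split sample indices into certain and boundary sets.

    rows   : list of tuples of discretized condition-attribute values
    labels : list of decision-class labels, same length as rows
    """
    # Group sample indices into equivalence classes of the indiscernibility
    # relation: samples with identical condition values are indistinguishable.
    classes = defaultdict(list)
    for idx, row in enumerate(rows):
        classes[tuple(row)].append(idx)

    certain, boundary, rules = [], [], {}
    for values, members in classes.items():
        decisions = {labels[i] for i in members}
        if len(decisions) == 1:
            # The whole equivalence class lies inside one decision class, so it
            # belongs to that class's lower approximation and yields a rule.
            certain.extend(members)
            rules[values] = decisions.pop()
        else:
            # Conflicting decisions: the class falls in a boundary region.
            boundary.extend(members)
    return sorted(certain), sorted(boundary), rules


# Hypothetical toy decision table: two discretized attributes, binary decision.
rows = [(0, 1), (0, 1), (1, 0), (1, 1), (1, 1)]
labels = ["a", "a", "b", "a", "b"]
certain, boundary, rules = lower_approximation_split(rows, labels)
print(certain)   # [0, 1, 2]
print(boundary)  # [3, 4]
print(rules)     # {(0, 1): 'a', (1, 0): 'b'}
```

In this sketch each consistent equivalence class produces one decision rule, and the samples with conflicting decisions are the ones left for further treatment.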
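     Then, exploiting the fact that a support vector machine depends only on its support vectors, together with its strength in precise classification, the data preprocessed by rough sets is classified precisely with a support vector machine.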
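The dependence on support vectors can be seen from the standard SVM decision function. The notation is an assumption, as the abstract states none: $(x_i, y_i)$ are the training samples with $y_i \in \{-1, +1\}$, $\alpha_i \ge 0$ the Lagrange multipliers, $K$ the kernel function, and $b$ the bias.

```latex
f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \right)
```

Only the support vectors have $\alpha_i > 0$, so every other training sample drops out of the sum. This is why shrinking the training set in preprocessing need not change the classifier, provided the samples that would become support vectors are retained.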
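     Finally, simulation experiments show that the method shortens training time while retaining the classification information the support vector machine needs, removes noise from the sample data, improves classification accuracy, and overcomes the application bottleneck of the SVM algorithm.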
Rough sets and the support vector machine (SVM) are both data mining methods that aim to extract fixed patterns from data. Rough set theory is suitable for massive data and is simple and easy to use. SVM is a classification method proposed within statistical learning theory; its structural risk minimization principle and kernel function theory avoid shortcomings of traditional methods such as the "curse of dimensionality" and "overfitting".
     By combining rough sets and SVM, this thesis proposes a method that first uses rough sets for preprocessing and then uses a support vector machine for precise classification.
     The thesis first introduces the elementary theory of rough sets and support vector machines, briefly reviews the lower approximation, upper approximation, and decision rules of rough sets as well as structural risk minimization and the kernel function theory of support vector machines, and analyzes the advantages and limitations of the two methods in data mining.
     Then, because earlier discretization methods can lose a large amount of information and yield classification rules that are complex and hard to understand, Chapter 5 proposes a method based on the lower approximation of rough sets. The method can preprocess massive data, identify the samples that definitely belong to a category according to the lower approximation, and delete possible noise data, finally producing a set of decision rules. It does not destroy the indiscernibility relation of the original data, and the resulting classification rules are concise.
     After that, exploiting the fact that a support vector machine's precise classification depends only on the support vectors, the thesis presents an SVM classification method based on rough set lower approximation theory and applies it to continuous attributes.
     Finally, experimental results show that the method preserves the information the SVM needs, improves prediction accuracy, and reduces the training time of the support vector machine.
