Research on Parallelization of Collaborative Filtering Algorithm Based on Spark (基于Spark的协同过滤算法并行化研究)


English title: Research on Parallelization of Collaborative Filtering Algorithm Based on Spark
Authors: LU Jun-yao (陆俊尧); LI Ling-juan (李玲娟)
Affiliation: School of Computer Science, Nanjing University of Posts and Telecommunications (南京邮电大学计算机学院)
Keywords: collaborative filtering; Spark platform; parallelization; item-based
Journal (Chinese title code): WJFZ
Journal (English title): Computer Technology and Development
Publication date: 2018-09-21 10:29
Publisher: Computer Technology and Development (计算机技术与发展)
Year: 2019
Issue: v.29; No.261
Funding: National Natural Science Foundation of China (61302158, 61571238)
Language: Chinese
Page: WJFZ201901018
Page count: 5
CN: 01
ISSN: 61-1450/TP
Classification number: 91-95

Abstract

Collaborative filtering algorithms are widely used in recommender systems. However, as data volumes grow explosively, so does the computation that collaborative filtering requires, and traditional centralized single-machine computing can no longer meet the real-time and scalability requirements of recommender systems. Drawing on the strengths of Spark, a mainstream big data platform, in iterative and in-memory computing, this paper designs a parallelization scheme for the item-based collaborative filtering algorithm on Spark. The scheme exploits the parallel computation model of RDDs: RDD operators are designed to parallelize both the item-to-item similarity computation and the rating computation, while the RDD caching mechanism and Spark broadcast variables are used to cache and distribute important computing resources, which speeds up computation. The performance of the parallel item-based collaborative filtering algorithm on Spark was tested on the public MovieLens dataset. The results show that it performs well in both accuracy and timeliness.
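The two stages named in the abstract, item similarity computation and rating computation, can be illustrated with a small single-machine sketch in plain Python. It is not the authors' implementation. The abstract does not give the similarity measure or the RDD operator design, so cosine similarity, the function names `item_similarities` and `predict`, and the toy `RATINGS` data are all assumptions made for illustration. Comments note which Spark-style operation each step would roughly correspond to.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy ratings as (user, item, rating); not data from the paper.
RATINGS = [
    ("u1", "A", 5.0), ("u1", "B", 3.0), ("u1", "C", 4.0),
    ("u2", "A", 4.0), ("u2", "B", 2.0),
    ("u3", "B", 5.0), ("u3", "C", 1.0),
]

def item_similarities(ratings):
    """Cosine similarity between items (assumed measure), keyed by (item, item)."""
    # Group ratings per user (roughly a groupByKey on the user id).
    by_user = defaultdict(list)
    for user, item, r in ratings:
        by_user[user].append((item, r))

    # Emit per-user item pairs and sum their products
    # (roughly a flatMap followed by reduceByKey).
    dot = defaultdict(float)
    norm_sq = defaultdict(float)
    for items in by_user.values():
        for item, r in items:
            norm_sq[item] += r * r
        for (i, ri), (j, rj) in combinations(sorted(items), 2):
            dot[(i, j)] += ri * rj

    # Normalize each dot product (roughly a map over the pair totals).
    sims = {}
    for (i, j), d in dot.items():
        s = d / (sqrt(norm_sq[i]) * sqrt(norm_sq[j]))
        sims[(i, j)] = s
        sims[(j, i)] = s  # store both directions for easy lookup
    return sims

def predict(user, target, ratings, sims):
    """Similarity-weighted average of the user's ratings on other items."""
    # In a Spark job, `sims` is the kind of shared lookup table that could be
    # distributed to workers as a broadcast variable.
    num = den = 0.0
    for u, item, r in ratings:
        if u == user and item != target:
            s = sims.get((target, item), 0.0)
            num += s * r
            den += abs(s)
    return num / den if den else None

sims = item_similarities(RATINGS)
print(predict("u2", "C", RATINGS, sims))
```

Here `predict("u2", "C", ...)` estimates the rating of user `u2` for item `C`, which `u2` has not rated, from that user's ratings of `A` and `B`. A user with no ratings yields `None`.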

