Application of Intelligent Algorithms in News Portal Application Systems (Graduation Thesis)

2020-02-16 22:21:18

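Summary

With the development of Internet technology, the network has become ever more embedded in daily life. The immediacy and reach of online platforms let news spread more effectively and make it convenient for people to read anytime and anywhere. Because news comes in a great many varieties, how to select the news a user is interested in has become a research focus for major news portals at home and abroad and for the academic community.

Common recommendation techniques today include content-based recommendation, collaborative-filtering recommendation, and hybrid recommendation. These can all produce recommendations, but each is limited by its own algorithm. Content-based recommendation works from features extracted from text, so it cannot uncover a user's latent interests. Collaborative filtering does extend to latent interests, but it performs very poorly in the initial stage, when there are few or no users. This thesis therefore studies content-based recommendation combined with the strengths of collaborative filtering, covering the following four aspects:

(1) Data preprocessing. A web crawler collects data from news websites, and model training for natural-language feature processing is studied on this real dataset. The dataset first undergoes text feature processing that segments sentences into words. A word-vector model is then trained without supervision using the word2vec training method, which solves the problem of converting natural language into word vectors.

(2) Information extraction. After text feature processing segments sentences into words, the TF-IDF (term frequency-inverse document frequency) algorithm extracts keywords, which are taken to represent the core content of the text. The number of keywords to select is also tested, and the experimental results are analyzed.

(3) News priority and interest decay. Among the characteristics of news, timeliness is chosen as the most important one to study, and news priority is set according to how recent an item is. Because a user's interest decays over time, this thesis also studies that problem, using time gradients to approximate the decay trend of an exponential function and thereby simulate interest decay. The time range is tested, and the experimental results are analyzed.

(4) Word-list-based recommendation. Drawing on the study of text features and user interest, information is extracted from the content a user browses to obtain words that represent the user's interests. Then, starting from these interest words and using the trained model and the word2vec tool, the interest words are expanded through similarity computation in an unsupervised way. This method improves on content-based recommendation by making up for its inability to uncover latent interests. The thesis attempts to fuse the interest mechanisms of content-based recommendation and collaborative filtering, and analyzes the resulting news recommendations.

Keywords: news content, word list, extraction, content, recommendation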
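Item (2) above describes extracting keywords with TF-IDF. The following is a minimal sketch of that idea, not the thesis's implementation. It assumes the documents are already segmented into word lists, uses the plain `tf * log(N / df)` weighting without smoothing, and the function name `tfidf_keywords` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Unsmoothed TF-IDF (an assumption): relative frequency times log(N / df).
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically so output is stable.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Hypothetical, already-segmented sample documents.
docs = [
    ["stock", "market", "rally", "stock", "news"],
    ["football", "match", "news", "goal"],
    ["market", "policy", "news", "bank"],
]
print(tfidf_keywords(docs, top_k=2))
```

The word "news" occurs in every sample document, so its score is zero and it is never picked as a keyword. The `top_k` parameter corresponds to the keyword count that the thesis reports testing.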

Abstract

With the development of Internet technology, the network has become a growing part of daily life. The immediacy and popularity of network platforms enable news to spread more effectively and make reading anytime, anywhere far more convenient. Given the wide variety of news, how to select the news that interests a user has become a focus of research for major news portals and academic circles at home and abroad.

The most common recommendation techniques today are content-based recommendation, collaborative-filtering recommendation, and hybrid recommendation. Although these techniques can produce recommendations, each has limitations that stem from its own algorithm. Content-based recommendation relies on text extraction, so it cannot mine users' potential interests. Collaborative filtering does expand users' potential interests, but it performs very poorly in the initial stage, when there are few or no users. This paper therefore studies content-based recommendation combined with the advantages of collaborative filtering, covering the following four aspects:

(1) Data preprocessing. A web crawler collects data from news websites, and model training for natural-language feature processing is studied on this real dataset. Text feature processing is first applied to the dataset to turn sentences into words. A word-vector model is then generated by unsupervised training with the word2vec method, which solves the problem of converting natural language into word vectors.

(2) Information extraction. Text feature processing turns the sentences in the dataset into words, keywords are extracted with the TF-IDF (term frequency-inverse document frequency) algorithm, and the keywords are taken as the core content of the text. The paper also tests how many keywords to select and analyzes the experimental results.

(3) News priority and interest attenuation. Of the characteristics of news, this paper selects timeliness as the most important one to study and sets the priority of news according to its age. It also considers the problem that users' interest attenuates over time: following time gradients, it approximates the attenuation trend of an exponential function in order to simulate interest attenuation. The time range is tested and the experimental results are analyzed.

(4) Recommendation based on vocabulary. Through research on text features and user interest, this paper extracts information from the content a user browses and uses it to represent the words that interest the user. Then, taking these interest words as a base, it uses the trained model and the word2vec tool to expand the interest words through similarity calculation and unsupervised learning. This approach improves content-based recommendation by compensating for its inability to mine users' potential interests. The paper attempts to integrate the interest mechanisms of content-based recommendation and collaborative filtering, and analyzes the news recommendation results.

Keywords: news content, vocabulary, extraction, content, recommendation
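Item (3) above describes simulating interest decay with an exponential trend. The following is a rough sketch of one way to do that, assuming a continuous half-life form. The thesis speaks of time gradients, so its actual scheme may be stepwise. The names `decay_weight` and `interest_scores` and the 7-day half-life are hypothetical.

```python
import math


def decay_weight(age_days, half_life_days=7.0):
    """Weight in (0, 1] that halves every half_life_days (assumed decay form)."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return math.exp(-math.log(2) * age_days / half_life_days)


def interest_scores(events, half_life_days=7.0):
    """Sum decayed weights per keyword; events are (keyword, age_days) pairs."""
    scores = {}
    for keyword, age_days in events:
        weight = decay_weight(age_days, half_life_days)
        scores[keyword] = scores.get(keyword, 0.0) + weight
    return scores


# Hypothetical browsing history: keyword and how many days ago it was read.
history = [("football", 30), ("football", 28), ("stock", 1)]
scores = interest_scores(history)
# One recent read outweighs two reads from about a month ago.
print(sorted(scores, key=scores.get, reverse=True))
```

With these weights, a keyword read yesterday counts for more than one read several times a month ago, which is the behaviour the time-range experiments would tune through the half-life.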

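Summary

Abstract

Chapter 1 Introduction

1.1 Current State at Home and Abroad

1.2 Research Objectives

1.3 Research Significance

1.4 Main Work and Technical Approach

Chapter 2 Data Preprocessing

2.1 Data Construction

2.1.1 News Corpus

2.1.2 Model Training

Chapter 3 Recommendation Based on News Content

3.1 Information Extraction Based on News Content

3.2 Word-List-Based News Recommendation

3.2.1 Priority Ranking Based on the Time Principle

3.2.2 Building the User Interest Word List

3.2.3 Word-List-Based Recommendation

Chapter 4 Experiments and Analysis

4.1 Data Environment

4.2 Evaluation Metrics

4.3 Experimental Results and Analysis

4.3.1 News Corpus Experimental Results and Analysis

Chapter 5 Conclusion

5.1 Thesis Summary

5.2 Future Outlook

Acknowledgments

References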

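Chapter 1 Introduction

1.1 Current State at Home and Abroad

In China, news is currently published mainly on WeChat, Weibo, and news websites. WeChat is a free mobile app launched by Tencent and used for social networking and work, so news published there reaches a large user base. Weibo is a social platform for sharing broadcast-style instant messages; it likewise has a large user base and suits the spread of news. Finally there are news websites, which are generally split into separate sites by news type. There are also comprehensive portals that cover almost every type of news and classify and recommend it, such as Sina, NetEase, Sohu, and Tencent. Popular news websites abroad include Yahoo, Google, The New York Times, and The Huffington Post. This thesis studies the news on news websites, and news publishing is inseparable from recommendation technology.

What is recommendation technology? A simple example: a user who wants to buy a book on recommendation technology can enter that phrase as a query, and the search engine will return relevant books. When users have no specific goal, however, they can only look for interesting content through tags or interests, and faced with a very large amount of data they find it hard to decide what to choose. What is needed then is an automatic recommendation tool that collects the user's past information, infers the user's interests, and recommends accordingly. This is the job of personalized recommendation.

The earliest personalized recommendation technique was automated collaborative filtering, which dates back to 1994, when the Internet was just beginning to take off. Later developments can be grouped into three kinds of technique: content-based recommendation, collaborative-filtering recommendation, and hybrid recommendation. [1]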

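(I) Content-based recommendation

For news, the value of the content itself plays a decisive role in recommendation, so content-based recommendation is the technique best suited to personalized news recommendation. It stores users' news-browsing habits and the content of the news they read, builds a model of each user's interests, and recommends news to the user on that basis. [2]

Problems: the approach has its own limitations. Personalized recommendation usually suffers from the cold-start problem: for a new user with essentially no data, the system cannot learn the user's interests or information and therefore cannot begin recommending. Other techniques, such as hot-topic recommendation or expert systems, are then needed. In addition, because recommendations follow the content itself and lack diverse tags, their diversity is very limited. Users are likely to receive recommendations from only part of the available content, while news that matches their latent interests is not surfaced well by this technique. [3]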
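The two limitations described above can be seen in a small sketch. It assumes each article is already reduced to a keyword list and that the interest model is a simple keyword count. The names `build_profile`, `score`, and `recommend` and the sample data are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def build_profile(read_articles):
    """Count how often each keyword appears in the articles a user has read."""
    profile = Counter()
    for keywords in read_articles:
        profile.update(keywords)
    return profile


def score(profile, candidate_keywords):
    """Sum the profile counts of the distinct keywords found in a candidate."""
    return sum(profile[k] for k in set(candidate_keywords))


def recommend(profile, candidates, top_n=2):
    """Rank candidate titles by score, dropping those with no keyword overlap."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: (-score(profile, item[1]), item[0]),
    )
    return [title for title, kws in ranked if score(profile, kws) > 0][:top_n]


# Hypothetical candidate news items, each reduced to a keyword list.
candidates = {
    "Bank rates rise": ["bank", "rates"],
    "Market closes higher": ["market", "stock"],
    "Cup final tonight": ["football", "final"],
}

reader = build_profile([["stock", "market"], ["market", "bank"]])
print(recommend(reader, candidates))   # only finance items can match

new_user = build_profile([])
print(recommend(new_user, candidates))  # cold start: nothing to recommend
```

A reader of finance news is never shown the football item, because no keyword overlaps, even if that reader would in fact enjoy it. A new user with an empty profile gets an empty list, which is the cold-start case.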

(II) Collaborative-filtering recommendation
