Design and Application of a Python Crawler System for Blog Public Opinion Analysis (Graduation Thesis)
Abstract
Over the past few decades, with the rapid development of the web, vast amounts of data have been published online, turning the Internet into an enormous unstructured database and a carrier of massive information. Retrieving this data effectively and organizing it for presentation therefore holds great application potential. Although search engines can retrieve information for users, the results returned by these general-purpose search engines [2-3] are often mixed with large amounts of information that users do not need at all. For users with different backgrounds and from different domains, such clearly limited search engines simply cannot meet their needs.
To obtain information about an online public event quickly, this thesis collects the blog posts and comments related to the event in order to understand online public opinion about it, with the aim of monitoring online opinion and maintaining order in the online environment. Accordingly, a Python-based crawler is designed around the characteristics of blog pages. By simulating login to a blog site, it crawls blog text, comments, and related content in real time. To cope with partial site updates, the crawler performs incremental crawling, and it supports distributed crawling for massive data. The crawled content is then stored in a local database.
First, this thesis describes the general workflow of a web crawler, introduces several of its key technologies, and adopts the widely used Scrapy as the crawler framework.
Second, the system is built on the Python-based Scrapy framework, and the crawler module uses the BeautifulSoup library for HTML parsing. It is also pointed out that Scrapy's original URL-deduplication method consumes too much memory when crawling websites at a large scale, so the RFPDupeFilter method is adopted to address this problem. In addition, based on practical experience, effective and feasible measures are proposed to keep the crawler from being banned.
Third, the Python-based Scrapy framework serves as the system's web-scraping module. It handles network communication with the Twisted asynchronous networking library, has a clear architecture, and offers a variety of middleware interfaces, so it meets the system's requirements. Once the required content has been extracted, it must be stored: an empty table is first created in the database, the appropriate module is imported and a connection to the database is established, and the required data is finally classified and stored in a MySQL database.
Finally, the system is highly targeted at its requirements and can quickly crawl the needed resources from web pages. Its code is also simple, clear, and easy to embed in other development work. Even users unfamiliar with programming can use the system to obtain blog content from the web quickly and accurately, so that the information and patterns in blog posts and their comments can be analyzed.
Key words: web scraping; blog; Python; URL
Contents
Abstract
Chapter 1 Introduction
1.1 Background
1.2 Research Significance
1.3 Current State of the Python Language
1.4 Main Work of This Thesis
1.5 Thesis Structure
Chapter 2 Working Principles of Web Crawlers and Related Techniques
2.1 Web Crawling Workflow
2.2 Web Crawlers
2.2.1 Depth-First Search
2.2.2 Breadth-First Search
2.2.3 Partial PageRank Strategy
2.2.4 URL Deduplication
2.3 Chapter Summary
Chapter 3 Application of the Scrapy Framework in Web Crawlers
3.1 Overview of the Scrapy Framework
3.1.1 Installing Scrapy
3.1.2 The Scrapy Framework
3.1.3 The Scrapy Workflow
3.2 Using Scrapy
3.3 Common Problems When Using Scrapy
3.4 Chapter Summary
Chapter 4 Implementation of the Web Crawler
4.1 Overall System Design
4.2 Web Scraping Module
4.3 Database Design Module
4.4 Chapter Summary
Conclusion
References
Acknowledgements
Chapter 1 Introduction
1.1 Background
Over the past few decades, with the rapid development of the web, vast amounts of data have been published online, turning the Internet into an enormous unstructured database and a carrier of massive information. Retrieving this data effectively and organizing it for presentation therefore holds great application potential. Although search engines [1] can retrieve information for users, the results returned by these general-purpose search engines are often mixed with large amounts of information that users do not need at all. For users with different backgrounds and from different domains, such clearly limited search engines simply cannot meet their needs.
If there were a tool that could crawl web resources in a targeted way, the problems above could be solved well. A new technique that provides exactly this capability, the focused crawler, has therefore gradually attracted attention. A focused crawler is a program that, following a predefined goal, selectively visits web pages and related links on the World Wide Web and automatically downloads those pages, obtaining the required information along the way. Unlike a general-purpose web crawler (general purpose web crawler), a focused crawler does not pursue broad coverage: it targets only pages related to a specific topic and serves its users on that basis. To this end, this thesis presents a Python-based web crawler that can collect the blog posts and comments related to a given online public event, as illustrated by the sketch below.
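To make the idea of a focused crawler concrete, the following is a minimal sketch built on Scrapy, the framework adopted in later chapters. The start URL, the CSS selectors, and the keyword list are hypothetical placeholders; a real blog site would also require the login simulation, deduplication, and anti-ban measures discussed in Chapters 3 and 4.

import scrapy
from scrapy.crawler import CrawlerProcess

# Hypothetical topic terms for the public event being tracked.
TOPIC_KEYWORDS = ["event", "incident"]

class FocusedBlogSpider(scrapy.Spider):
    name = "focused_blog"
    start_urls = ["https://blog.example.com/"]  # placeholder URL

    def parse(self, response):
        # Extract the post body and comments (selectors are assumptions
        # and would be tailored to the target blog's HTML).
        text = " ".join(response.css("div.post-body ::text").getall())
        if any(kw in text for kw in TOPIC_KEYWORDS):
            yield {
                "url": response.url,
                "text": text,
                "comments": response.css("div.comment ::text").getall(),
            }
        # The "focused" part: follow only links whose anchor text
        # mentions the topic, instead of crawling the whole site.
        for link in response.css("a"):
            anchor = " ".join(link.css("::text").getall())
            if any(kw in anchor for kw in TOPIC_KEYWORDS):
                yield response.follow(link, callback=self.parse)

if __name__ == "__main__":
    process = CrawlerProcess(settings={"ROBOTSTXT_OBEY": True})
    process.crawl(FocusedBlogSpider)
    process.start()

In this sketch the topic filter is a simple keyword match over page text and anchor text; Chapters 2 and 3 discuss the crawl-ordering strategies and the Scrapy facilities that a full implementation would use instead.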