Design and Implementation of a Python-Based Data Crawling and Analysis Program (Graduation Thesis)

2021-11-10 23:41:37

Total thesis length: 24,201 characters

Abstract

As the core module of a search engine, crawler technology has attracted considerable attention amid the rapid development of information technology. Web crawlers offer highly customizable crawling rules and precise results, and are widely used in many fields. To develop a recruitment-information crawling system, this thesis takes "Python job postings on the 51job website" as the crawling target. By analyzing the page structure of the recruitment website and carrying out a requirements analysis of the system's functional modules, it designs a Python-based system for crawling and analyzing recruitment information.

First, a requirements analysis of the system is carried out and each functional module is designed in a structured manner. The system uses tkinter, Python's graphical interface framework, to build the interactive interface; the requests and BeautifulSoup libraries to extract data; the jieba (结巴分词) library to segment text into words; a MySQL database for storage, connected through PyMySQL; and the wordcloud library together with Matplotlib for data visualization. Finally, the system's functions are tested, and based on the test results the existing problems are analyzed and feasible solutions are proposed.

The test results show that the system extracts recruitment information effectively and performs data analysis and word segmentation. The crawler system makes obtaining recruitment information more accurate and efficient, and visualizing the analysis results saves users time and effort to some extent.

Keywords: web crawler; jieba word segmentation; MySQL database; word cloud; Matplotlib

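Contents

Abstract

Chapter 1 Introduction

1.1 Research Background

1.2 Research Status at Home and Abroad

1.2.1 Research Status of Web Crawlers at Home and Abroad

1.2.2 Research Status of Chinese Word Segmentation at Home and Abroad

1.3 Research Purpose and Significance

Chapter 2 Overview of Related Technologies

2.1 Web Crawlers

2.2 The jieba Chinese Word Segmentation Library

2.3 PyMySQL Database Connection Technology

Chapter 3 System Requirements Analysis and Design

3.1 System Requirements Analysis

3.2 Overall System Design

3.2.1 Introduction to the Crawling Target

3.2.2 Overall System Design

3.3 Design of the Main Functional Modules

3.3.1 Human-Computer Interaction Module

3.3.2 Data Extraction Module

3.3.3 Data Analysis Module

3.3.4 Result Display Module

3.4 Database Design

Chapter 4 System Implementation

4.1 Development and Runtime Environment

4.2 Implementation of the Human-Computer Interaction Module

4.3 Implementation of the Data Extraction Module

4.3.1 Web Page Crawling and Automatic URL Construction

4.3.2 Implementation of Asynchronous Crawling

4.4 Implementation of the Data Analysis Module

4.4.1 jieba Word Segmentation

4.4.2 Job Title Deduplication

4.4.3 Unifying Salary Units

4.5 Implementation of the Result Visualization Module

Chapter 5 System Testing and Result Analysis

5.1 Test Environment

5.2 Test Content and Result Analysis

Chapter 6 Summary and Outlook

6.1 Summary

6.2 Outlook

References

Appendix

Acknowledgments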

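Chapter 1 Introduction

1.1 Research Background

Today, in the twenty-first century, as online information grows ever more abundant, the information on the Internet covers almost every aspect of social life^[1].

How to obtain useful data efficiently and accurately from such a vast volume of information has therefore become an essential skill for the new generation. These driving factors gave rise to search engines and have also promoted their development. A search engine is a retrieval technology in which developers, according to users' needs, apply specific algorithms and strategies to extract data from the web and present it to users. The emergence of search engines has made it more convenient for people to obtain information through the Internet, and has also made more data from more websites known to the public.

As network technology continues to develop, the limitations of traditional general-purpose search engines are becoming increasingly apparent: they are time-consuming and their search efficiency is low^[2], so they struggle to meet the needs of most users. Traditional search engines have the following drawbacks, for example: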
