Implementation of a Jsoup-Based Web Crawler and Data Visualization

2023-03-16 09:34:39

Total thesis length: 17,087 characters

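摘要 (Chinese Abstract)

The Internet is a vast database that stores enormous amounts of data and serves as a huge repository of human civilization. This data is unstructured, however, and much valuable information lies hidden in it, hard to index and hard to use. Retrieving useful data from the Internet and presenting it after analysis and organization therefore has great application potential.

Search engines are the entry point through which most people obtain information online, and they make it easy to find what one needs on the Internet. As general-purpose retrieval tools, though, they have an obvious weakness: they return a large amount of redundant information.

When a user's needs become more specific, the highly redundant results returned by a search engine have to be filtered and processed by hand, which is extremely inefficient. To address this problem, and building on a study of existing high-quality open-source web crawlers, this thesis designs and implements a web crawler for specific requirements in Java, and uses JavaScript to visualize the collected information on the web for the user.

Keywords: web crawler; Java; JavaScript; data visualization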

Web Crawler & Data Visualization

Abstract

The Internet is a huge store of data and a vast library of human civilization. However, this data is unstructured, and a great deal of valuable information is hidden in it, where it is difficult to index and use. Retrieving useful data from the Internet and presenting it after analysis and organization therefore has broad application prospects.

Search engines, the gateway through which most people obtain information from the Internet, help people find the information they need. As general-purpose retrieval tools, however, they have an obvious shortcoming: they return a large amount of redundant information.

When a user's needs are refined further, the highly redundant results returned by a search engine must be screened and processed manually, which is very inefficient. To solve this problem, and based on a study of existing open-source web crawlers, this thesis uses Java to design and implement a web crawler for specific requirements, and uses JavaScript to visualize the collected information on the web and present it to the user.

Keywords: Web Crawler; Java; JavaScript; Data Visualization

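摘要 (Chinese Abstract)

Abstract

Chapter 1 Introduction to Web Crawlers

1.1 Overview

1.2 Basic Crawler Terminology

1.2.1 Seed Pages

1.2.2 Processing Queue

1.2.3 Parser

1.3 Basic Operation of a Crawler

1.4 Issues in Crawler Design

1.4.1 Which Pages the Crawler Should Download

1.4.2 How to Avoid Having the IP Blocked by the Target Site

Chapter 2 Writing a Crawler

2.1 Downloading a Page

2.2 HttpClient

2.2.1 Introducing HttpClient

2.2.2 HttpClient Requests and Responses

2.2.3 Downloading a Page with HttpClient

2.3 Regular Expressions

2.3.1 What Is a Regular Expression

2.3.2 Regular Expression Syntax

2.3.3 Regular Expressions in Java

2.3.4 Parsing Text with Regular Expressions

2.4 Jsoup

2.4.1 Introducing Jsoup

2.4.2 Parsing HTML Documents with Jsoup

Chapter 3 Crawler Optimization

3.1 Multithreading

3.1.1 What Is Multithreading

3.1.2 Why Use Multithreading

3.1.3 Multithreading in Java

3.1.4 Problems with Multithreading

3.1.5 Thread Locks

3.2 Data Storage

3.2.1 Databases

3.2.2 JDBC

3.2.3 Data Storage

3.3 URL Deduplication

3.3.1 Why Deduplication Is Necessary

3.3.2 Message Digests (MD5)

3.3.3 URL Deduplication by Comparing MD5 Digests

3.3.4 Calling the MD5 Algorithm in Java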

Chapter 4 Data Visualization

4.1 Flash

4.1.1 Introduction to Flash

4.1.2 ActionScript 3.0

4.1.3 Drawing a Bar Chart with ActionScript 3

4.1.4 Controlling a Map with ActionScript 3

4.2 D3 (Data-Driven Documents)

4.2.1 JavaScript

4.2.2 D3 (Data-Driven Documents)

4.2.3 SVG (Scalable Vector Graphics)

4.2.4 Drawing a Simple Bar Chart with D3

4.2.5 Adding Animation

4.2.6 Drawing a Simple Pie Chart with D3

4.2.7 Drawing a Map with D3

Chapter 5 Conclusion

5.1 Summary

5.2 Future Work

Acknowledgements

References

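Chapter 1 Introduction to Web Crawlers

1.1 Overview

For any search engine, one of the most important tasks is collecting web pages from the World Wide Web, and that work depends on web crawlers.

A web crawler, also known as a "web spider" or "robot", behaves much as its name suggests. Like a spider on the Internet, it crawls from one link to the next across web pages, fetching page after page and collecting the information on each, until every page has been collected. Once the data we need has been stored in a database, the crawler's work is done.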
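The crawl loop described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the thesis's implementation: the class name `CrawlSketch`, the method `crawl`, and the in-memory `web` map (page URL to outgoing links) are all hypothetical stand-ins. A real crawler would download each page over HTTP and extract its links with a parser such as Jsoup, and would write the collected data to a database rather than to a list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlSketch {

    // Visits every page reachable from the seed, breadth-first, and returns
    // the URLs in the order they were fetched.
    // ASSUMPTION: 'web' is a hypothetical in-memory stand-in for the network.
    static List<String> crawl(String seed, Map<String, List<String>> web) {
        Queue<String> queue = new ArrayDeque<>();   // pages waiting to be fetched
        Set<String> seen = new HashSet<>();         // URLs already queued
        List<String> collected = new ArrayList<>(); // stand-in for the database

        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            // "Fetch" the page: here this is only a map lookup.
            List<String> links = web.get(url);
            if (links == null) {
                continue; // unknown page, comparable to a failed download
            }
            collected.add(url); // a real crawler would store page data here
            for (String link : links) {
                if (seen.add(link)) { // add() returns false for duplicates
                    queue.add(link);
                }
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "/index", List.of("/a", "/b"),
            "/a", List.of("/b", "/index"),
            "/b", List.of("/c"),
            "/c", List.of());
        System.out.println(crawl("/index", web)); // prints [/index, /a, /b, /c]
    }
}
```

The `seen` set is what lets the loop terminate on pages that link back to each other: each URL enters the queue at most once, so the crawl ends when every reachable page has been visited.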
