A hands-on guide to web scraping and text mining for both beginners and experienced users of R.

Introduces the fundamental concepts behind the architecture of the web and of databases, covering HTTP, HTML, XML, JSON, and SQL.
Provides basic techniques to query web documents and data sets (XPath and regular expressions).
Presents an extensive set of exercises to guide the reader through each technique.
Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
Case studies are featured throughout along with examples for each technique presented.
R code and solutions to exercises featured in the book are provided on a supporting website.

Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis are the authors of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.

Preface

1 Introduction

1.1 Case study: World Heritage Sites in Danger

1.2 Some remarks on web data quality

1.3 Technologies for disseminating, extracting, and storing web data

1.3.1 Technologies for disseminating content on the Web

1.3.2 Technologies for information extraction from web documents

1.3.3 Technologies for data storage

1.4 Structure of the book

Part One A Primer on Web and Data Technologies

2 HTML

2.1 Browser presentation and source code

2.2 Syntax rules

2.2.1 Tags, elements, and attributes

2.2.2 Tree structure

2.2.3 Comments

2.2.4 Reserved and special characters

2.2.5 Document type definition

2.2.6 Spaces and line breaks

2.3 Tags and attributes

2.3.1 The anchor tag <a>

2.3.2 The metadata tag <meta>

2.3.3 The external reference tag <link>

2.3.4 Emphasizing tags <b>, <i>, <strong>

2.3.5 The paragraphs tag <p>

2.3.6 Heading tags <h1>, <h2>, <h3>, …

2.3.7 Listing content with <ul>, <ol>, and <dl>

2.3.8 The organizational tags <div> and <span>

2.3.9 The <form> tag and its companions

2.3.10 The foreign script tag <script>

2.3.11 Table tags <table>, <tr>, <td>, and <th>

2.4 Parsing

2.4.1 What is parsing?

2.4.2 Discarding nodes

2.4.3 Extracting information in the building process

Summary

Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining

ISBN-13: 9781118834817

ISBN-10: 111883481X
