Automated Data Collection with R
A Practical Guide to Web Scraping and Text Mining

Simon Munzert
Department of Politics and Public Administration, University of Konstanz, Germany

Christian Rubba
Department of Political Science, University of Zurich and National Center of Competence in Research, Switzerland

Peter Meißner
Department of Politics and Public Administration, University of Konstanz, Germany

Dominic Nyhuis
Department of Political Science, University of Mannheim, Germany

WILEY

This edition first published 2015
© 2015 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners.
The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Munzert, Simon.
Automated data collection with R: a practical guide to web scraping and text mining / Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis.
pages cm
Summary: "This book provides a unified framework of web scraping and information extraction from text data with R for the social sciences" - Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-118-83481-7 (hardback)
1. Data mining. 2. Automatic data collection systems. 3. Social sciences-Research-Data processing. 4. R (Computer program language) I. Title.
QA76.9.D343M865 2014
006.3′12-dc23
2014032266

A catalogue record for this book is available from the British Library.

ISBN: 9781118834817

Set in 10/12pt Times by Aptara Inc., New Delhi, India

1 2015

To my parents, for their unending support. Also, to Stefanie. - Simon
To my parents, for their love and encouragement. - Christian
To Kristin, Buddy, and Paul for love, regular walks, and a final deadline. - Peter
To my family. - Dominic

Contents

Preface

1 Introduction
1.1 Case study: World Heritage Sites in Danger
1.2 Some remarks on web data quality
1.3 Technologies for disseminating, extracting, and storing web data
1.3.1 Technologies for disseminating content on the Web
1.3.2 Technologies for information extraction from web documents
1.3.3 Technologies for data storage
1.4 Structure of the book

Part One: A Primer on Web and Data Technologies

2 HTML
2.1 Browser presentation and source code
2.2 Syntax rules
2.2.1 Tags, elements, and attributes
2.2.2 Tree structure
2.2.3 Comments
2.2.4 Reserved and special characters
2.2.5 Document type definition
2.2.6 Spaces and line breaks
2.3 Tags and attributes
2.3.1 The anchor tag <a>
2.3.3 The external reference tag <link>
2.3.4 Emphasizing tags <b>, <i>, <strong>
2.3.5 The paragraphs tag <p>
2.3.6 Heading tags <h1>, <h2>, <h3>, ...
2.3.7 Listing content with <ul>, <ol>, and <dl>
2.3.8 The organizational tags <div> and <span>
2.3.9 The <form> tag and its companions
2.3.10 The foreign script tag <script>