Gareth James · Daniela Witten · Trevor Hastie · Robert Tibshirani

An Introduction to Statistical Learning
with Applications in R

Springer

Gareth James
Department of Information and Operations Management
University of Southern California
Los Angeles, CA, USA

Daniela Witten
Department of Biostatistics
University of Washington
Seattle, WA, USA

Trevor Hastie
Department of Statistics
Stanford University
Stanford, CA, USA

Robert Tibshirani
Department of Statistics
Stanford University
Stanford, CA, USA

ISSN 1431-875X
ISBN 978-1-4614-7137-0
ISBN 978-1-4614-7138-7 (eBook)
DOI 10.1007/978-1-4614-7138-7
Springer New York Heidelberg Dordrecht London

Library of Congress Control Number: 2013936251

© Springer Science+Business Media New York 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center.
Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

To our parents:
Alison and Michael James
Chiara Nappi and Edward Witten
Valerie and Patrick Hastie
Vera and Sami Tibshirani

and to our families:
Michael, Daniel, and Catherine
Samantha, Timothy, and Lynda
Charlie, Ryan, Julie, and Cheryl

Preface

Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines.

With the explosion of "Big Data" problems, statistical learning has become a very hot field in many scientific areas as well as marketing, finance, and other business disciplines. People with statistical learning skills are in high demand.

One of the first books in this area, The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, and Friedman), was published in 2001, with a second edition in 2009. ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style.
But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details. We have created labs illustrating how to implement each of the statistical learning methods using the popular statistical software package R. These labs provide the reader with valuable hands-on experience.

This book is appropriate for advanced undergraduates or master's students in statistics or related quantitative fields or for individuals in other disciplines who wish to use statistical learning tools to analyze their data. It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G'Sell, Courtney Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, and Xin Lu Tan.

It's tough to make predictions, especially about the future.
-Yogi Berra

Los Angeles, USA    Gareth James
Seattle, USA        Daniela Witten
Palo Alto, USA      Trevor Hastie
Palo Alto, USA      Robert Tibshirani

Contents

Preface
1 Introduction
2 Statistical Learning
  2.1 What Is Statistical Learning?
    2.1.1 Why Estimate f?
    2.1.2 How Do We Estimate f?
    2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
    2.1.4 Supervised Versus Unsupervised Learning
    2.1.5 Regression Versus Classification Problems
  2.2 Assessing Model Accuracy
    2.2.1 Measuring the Quality of Fit
    2.2.2 The Bias-Variance Trade-Off
    2.2.3 The Classification Setting
  2.3 Lab: Introduction to R
    2.3.1 Basic Commands
    2.3.2 Graphics
    2.3.3 Indexing Data
    2.3.4 Loading Data
    2.3.5 Additional Graphical and Numerical Summaries
  2.4 Exercises