My purpose in writing this book is to introduce the mathematically sophisticated reader to a large number of topics and techniques in the field variously known as machine learning, statistical learning, or predictive modeling. I believe that a deeper understanding of the subject as a whole will be obtained from reflection on an intuitive understanding of many techniques rather than a very detailed understanding of only one or two, and the book is structured accordingly. I have omitted many details while focusing on what I think shows “what is really going on.” For details, the reader will be directed to the relevant literature, or to the exercises, which form an integral part of the text.

No work this small on a subject this large can be self-contained. Some undergraduate-level calculus, linear algebra, and probability is assumed without reference, as are a few basic ideas from statistics. All of the techniques discussed here can, I hope, be implemented using this book and a mid-level programming language (such as C),¹ and explicit implementation of many techniques using R is presented in the last chapter.

The reader may detect a coverage bias in favor of classification over regression. This is deliberate. The existing literature on the theory and practice of linear regression and many of its variants is so strong that it does not need any contribution from me. Classification, I believe, is not yet so well documented. In keeping with what has been important in my experience, loss functions are completely general and predictive modeling is stressed more than explanatory modeling.

The intended audience for these notes has an extremely diverse background in probability, ranging from one introductory undergraduate course to extensive graduate work and published research.² In seeking a probability notation which will create the least confusion for all concerned, I arrived at the non-standard use of P(x) for both the probability of an event x and a probability mass or density function, with respect to some measure which is never stated, evaluated at a point x. My hope, which I believe has been borne out in practice, is that anyone with sufficient knowledge to find this notation confusing will have sufficient knowledge to work through that confusion.

¹ There is one exception: the convex programming needed to implement a support vector machine is omitted.