Bioinformatics Data Skills
Vince Buffalo

Beijing · Boston · Farnham · Sebastopol · Tokyo
O'Reilly

Bioinformatics Data Skills
by Vince Buffalo

Copyright © 2015 Vince Buffalo. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Courtney Nash and Amy Jollymore
Production Editor: Nicole Shelby
Copyeditor: Jasmine Kwityn
Proofreader: Kim Cofer
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

June 2015: First Edition

Revision History for the First Edition
2015-06-30: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449367374 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Bioinformatics Data Skills, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-36737-4
[LSI]

To my (rather large) family for their continued support: Mom, Dad, Anne, Lisa, Lauren, Violet, and Dalilah; the Buffalos, the Kihns, and the Lambs.

And my earliest mentors for inspiring me to be who I am today: Randy Iverson and Duncan Temple Lang.

Table of Contents

Preface

Part I. Ideology: Data Skills for Robust and Reproducible Bioinformatics

1. How to Learn Bioinformatics
  Why Bioinformatics? Biology's Growing Data
  Learning Data Skills to Learn Bioinformatics
  New Challenges for Reproducible and Robust Research
  Reproducible Research
  Robust Research and the Golden Rule of Bioinformatics
  Adopting Robust and Reproducible Practices Will Make Your Life Easier, Too
  Recommendations for Robust Research
  Pay Attention to Experimental Design
  Write Code for Humans, Write Data for Computers
  Let Your Computer Do the Work for You
  Make Assertions and Be Loud, in Code and in Your Methods
  Test Code, or Better Yet, Let Code Test Code
  Use Existing Libraries Whenever Possible
  Treat Data as Read-Only
  Spend Time Developing Frequently Used Scripts into Tools
  Let Data Prove That It's High Quality
  Recommendations for Reproducible Research
  Release Your Code and Data
  Document Everything
  Make Figures and Statistics the Results of Scripts
  Use Code as Documentation
  Continually Improving Your Bioinformatics Data Skills

Part II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project

2. Setting Up and Managing a Bioinformatics Project
  Project Directories and Directory Structures
  Project Documentation
  Use Directories to Divide Up Your Project into Subprojects
  Organizing Data to Automate File Processing Tasks
  Markdown for Project Notebooks
  Markdown Formatting Basics
  Using Pandoc to Render Markdown to HTML

3. Remedial Unix
  Why Do We Use Unix in Bioinformatics? Modularity and the Unix Philosophy
  Working with Streams and Redirection
  Redirecting Standard Out to a File
  Redirecting Standard Error
  Using Standard Input Redirection
  The Almighty Unix Pipe: Speed and Beauty in One
  Pipes in Action: Creating Simple Programs with Grep and Pipes
  Combining Pipes and Redirection
  Even More Redirection: A tee in Your Pipe
  Managing and Interacting with Processes
  Background Processes
  Killing Processes
  Exit Status: How to Programmatically Tell Whether Your Command Worked
  Command Substitution

4. Working with Remote Machines
  Connecting to Remote Machines with SSH
  Quick Authentication with SSH Keys
  Maintaining Long-Running Jobs with nohup and tmux
  nohup
  Working with Remote Machines Through Tmux
  Installing and Configuring Tmux
  Creating, Detaching, and Attaching Tmux Sessions
  Working with Tmux Windows

5. Git for Scientists
  Why Git Is Necessary in Bioinformatics Projects
  Git Allows You to Keep Snapshots of Your Project
  Git Helps You Keep Track of Important Changes to Code
  Git Helps Keep Software Organized and Available After People Leave
  Installing Git
  Basic Git: Creating Repositories, Tracking Files, and Staging and Committing Changes
  Git Setup: Telling Git Who You Are
  git init and git clone: Creating Repositories
  Tracking Files in Git: git add and git status Part I
  Staging Files in Git: git add and git status Part II
  git commit: Taking a Snapshot of Your Project
  Seeing File Differences: git diff
  Seeing Your Commit History: git log
  Moving and Removing Files: git mv and git rm
  Telling Git What to Ignore: .gitignore
  Undoing a Stage: git reset
  Collaborating with Git: Git Remotes, git push, and git pull
  Creating a Shared Central Repository with GitHub
  Authenticating with Git Remotes
  Connecting with Git Remotes: git remote
  Pushing Commits to a Remote Repository with git push
  Pulling Commits from a Remote Repository with git pull
  Working with Your Collaborators: Pushing and Pulling
  Merge Conflicts
  More GitHub Workflows: Forking and Pull Requests
  Using Git to Make Life Easier: Working with Past Commits
  Getting Files from the Past: git checkout
  Stashing Your Changes: git stash
  More git diff: Comparing Commits and Files
  Undoing and Editing Commits: git commit --amend
  Working with Branches
  Creating and Working with Branches: git branch and git checkout
  Merging Branches: git merge
  Branches and Remotes
  Continuing Your Git Education

6. Bioinformatics Data
  Retrieving Bioinformatics Data
  Downloading Data with wget and curl
  Rsync and Secure Copy (scp)
  Data Integrity
  SHA and MD5 Checksums
  Looking at Differences Between Data
  Compressing Data and Working with Compressed Data
  gzip
  Working with Gzipped Compressed Files
  Case Study: Reproducibly Downloading Data

Part III. Practice: Bioinformatics Data Skills

7. Unix Data Tools
  Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls
  When to Use the Unix Pipeline Approach and How to Use It Safely
  Inspecting and Manipulating Text Data with Unix Tools
  Inspecting Data with Head and Tail
  Plain-Text Data Summary Information with wc, ls, and awk
  Working with Column Data with cut and Columns
  Formatting Tabular Data with column
  The All-Powerful Grep
  Decoding Plain-Text Data: hexdump
  Sorting Plain-Text Data with Sort
  Finding Unique Values in Uniq
  Join
  Text Processing with Awk
  Bioawk: An Awk for Biological Formats
  Stream Editing with Sed
  Advanced Shell Tricks
  Subshells
  Named Pipes and Process Substitution
  The Unix Philosophy Revisited

8. A Rapid Introduction to the R Language
  Getting Started with R and RStudio
  R Language Basics
  Simple Calculations in R, Calling Functions, and Getting Help in R
  Variables and Assignment
  Vectors, Vectorization, and Indexing
  Working with and Visualizing Data in R
  Loading Data into R