Bioinformatics Data Skills
Reproducible and Robust Research with Open Source Tools

Learn the data skills necessary for turning large sequencing datasets into reproducible and robust biological findings. With this practical guide, you'll learn how to use freely available open source tools to extract

Vince Buffalo

O'Reilly
Beijing · Boston · Farnham · Sebastopol · Tokyo

Copyright © 2015 Vince Buffalo. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Courtney Nash and Amy Jollymore
Production Editor: Nicole Shelby
Copyeditor: Jasmine Kwityn
Proofreader: Kim Cofer
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

June 2015: First Edition

Revision History for the First Edition
2015-06-30: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449367374 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Bioinformatics Data Skills, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-36737-4

To my (rather large) family for their continued support: Mom, Dad, Anne, Lisa, Lauren, Violet, and Dalilah; the Buffalos, the Kihns, and the Lambs.

And my earliest mentors for inspiring me to be who I am today: Randy Iverson and Duncan Temple Lang.

Table of Contents

Preface

Part I. Ideology: Data Skills for Robust and Reproducible Bioinformatics

1. How to Learn Bioinformatics
    Why Bioinformatics? Biology's Growing Data
    Learning Data Skills to Learn Bioinformatics
    New Challenges for Reproducible and Robust Research
    Reproducible Research
    Robust Research and the Golden Rule of Bioinformatics
    Adopting Robust and Reproducible Practices Will Make Your Life Easier, Too
    Recommendations for Robust Research
        Pay Attention to Experimental Design
        Write Code for Humans, Write Data for Computers
        Let Your Computer Do the Work For You
        Make Assertions and Be Loud, in Code and in Your Methods
        Test Code, or Better Yet, Let Code Test Code
        Use Existing Libraries Whenever Possible
        Treat Data as Read-Only
        Spend Time Developing Frequently Used Scripts into Tools
        Let Data Prove That It's High Quality
    Recommendations for Reproducible Research
        Release Your Code and Data
        Document Everything
        Make Figures and Statistics the Results of Scripts
        Use Code as Documentation
    Continually Improving Your Bioinformatics Data Skills

Part II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project

2. Setting Up and Managing a Bioinformatics Project
    Project Directories and Directory Structures
    Project Documentation
    Use Directories to Divide Up Your Project into Subprojects
    Organizing Data to Automate File Processing Tasks
    Markdown for Project Notebooks
        Markdown Formatting Basics
        Using Pandoc to Render Markdown to HTML

3. Remedial Unix Shell
    Why Do We Use Unix in Bioinformatics? Modularity and the Unix Philosophy
    Working with Streams and Redirection
        Redirecting Standard Out to a File
        Redirecting Standard Error
        Using Standard Input Redirection
    The Almighty Unix Pipe: Speed and Beauty in One
        Pipes in Action: Creating Simple Programs with Grep and Pipes
        Combining Pipes and Redirection
        Even More Redirection: A tee in Your Pipe
    Managing and Interacting with Processes
        Background Processes
        Killing Processes
        Exit Status: How to Programmatically Tell Whether Your Command Worked
    Command Substitution

4. Working with Remote Machines
    Connecting to Remote Machines with SSH
    Quick Authentication with SSH Keys
    Maintaining Long-Running Jobs with nohup and tmux
        nohup
    Working with Remote Machines Through Tmux
        Installing and Configuring Tmux
        Creating, Detaching, and Attaching Tmux Sessions
        Working with Tmux Windows

5. Git for Scientists
    Why Git Is Necessary in Bioinformatics Projects
        Git Allows You to Keep Snapshots of Your Project
        Git Helps You Keep Track of Important Changes to Code
        Git Helps Keep Software Organized and Available After People Leave
    Installing Git
    Basic Git: Creating Repositories, Tracking Files, and Staging and Committing Changes
        Git Setup: Telling Git Who You Are
        git init and git clone: Creating Repositories
        Tracking Files in Git: git add and git status Part I
        Staging Files in Git: git add and git status Part II
        git commit: Taking a Snapshot of Your Project
        Seeing File Differences: git diff
        Seeing Your Commit History: git log
        Moving and Removing Files: git mv and git rm
        Telling Git What to Ignore: .gitignore
        Undoing a Stage: git reset
    Collaborating with Git: Git Remotes, git push, and git pull
        Creating a Shared Central Repository with GitHub
        Authenticating with Git Remotes
        Connecting with Git Remotes: git remote
        Pushing Commits to a Remote Repository with git push
        Pulling Commits from a Remote Repository with git pull
        Working with Your Collaborators: Pushing and Pulling
        Merge Conflicts
        More GitHub Workflows: Forking and Pull Requests
    Using Git to Make Life Easier: Working with Past Commits
        Getting Files from the Past: git checkout
        Stashing Your Changes: git stash
        More git diff: Comparing Commits and Files
        Undoing and Editing Commits: git commit --amend
    Working with Branches
        Creating and Working with Branches: git branch and git checkout
        Merging Branches: git merge
        Branches and Remotes
    Continuing Your Git Education

6. Bioinformatics Data
    Retrieving Bioinformatics Data
        Downloading Data with wget and curl
        Rsync and Secure Copy (scp)
    Data Integrity
        SHA and MD5 Checksums
        Looking at Differences Between Data
    Compressing Data and Working with Compressed Data
        gzip
        Working with Gzipped Compressed Files
    Case Study: Reproducibly Downloading Data

Part III. Practice: Bioinformatics Data Skills

7. Unix Data Tools
    Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls
    When to Use the Unix Pipeline Approach and How to Use It Safely
    Inspecting and Manipulating Text Data with Unix Tools
        Inspecting Data with Head and Tail
        Plain-Text Data Summary Information with wc, ls, and awk
        Working with Column Data with cut and Columns
        Formatting Tabular Data with column
        The All-Powerful Grep
        Decoding Plain-Text Data: hexdump
        Sorting Plain-Text Data with Sort
        Finding Unique Values in Uniq
        Join
        Text Processing with Awk
        Bioawk: An Awk for Biological Formats
        Stream Editing with Sed
    Advanced Shell Tricks
        Subshells
        Named Pipes and Process Substitution
    The Unix Philosophy Revisited

8. A Rapid Introduction to the R Language
    Getting Started with R and RStudio
    R Language Basics
        Simple Calculations in R, Calling Functions, and Getting Help in R
        Variables and Assignment
        Vectors, Vectorization, and Indexing
    Working with and Visualizing Data in R
        Loading Data into R