MCB517A - Tools for Computational Biology

Lecture 7: Introduction to Sequencing Data Analysis

October 17, 2019

Course GitHub for Lecture 7

Now that we have a basic grasp of concepts surrounding data management, manipulation, and visualization, we’re ready to start focusing on some of the more specialized data encountered in computational biology research. Sequencing of nucleic acids is almost ubiquitous in biological research. In this lecture, we will introduce some common resources for depositing and retrieving sequence data generated by consortium efforts and independent laboratories. We will introduce concepts and practical steps of querying, inspecting, and visualizing sequence data. Then, we will cover the types of genomic variation and common tools used to predict these from sequencing data.

This lecture focuses on concepts surrounding genome sequence data and their associated workflows. Examples of code provided in this lecture include Unix command line tools. These are included for your future reference and will not be included in the homework for this section. We’ll cover the Unix command line later in the semester, at which point you may find it useful to refer back to some of the material presented here.

Learning objectives

Identify common databases and file formats used for sequence data
Describe the steps involved in processing and analyzing sequence data to predict different types of genomic variants
Recognize common tools (databases and software) used to assess variation in genomic data

Class materials

Outline of content from the Lecture 7 slides:

Sequence data
- Databases and online resources for sequence data
- Learn the common sequence data file formats
Tools for sequencing data
- Tools to query, inspect, and visualize an aligned sequence file
- Learn the contents of sequence data files
- Learn to generate sequencing metrics and to process sequence data
- Learn about Python and R libraries/packages to read sequence data
Genome variant analysis
- Types of genomic variation
- Tools to predict genomic variations
- Learn the common file formats for variation data
- Databases and online resources for human variation data
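
As a rough illustration of the variant types in the outline above, the sketch below classifies a VCF record by comparing the lengths of its REF and ALT alleles. It is plain Python with no external packages. The function names (`classify_variant`, `parse_vcf_line`) and the example record are hypothetical, and the rules are a deliberate simplification: symbolic alleles such as `<DEL>`, breakends, and larger structural variants are not handled.

```python
def classify_variant(ref, alt):
    """Label a single REF/ALT pair by allele length (simplified rules)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"          # single-nucleotide variant
    if len(ref) == len(alt):
        return "MNV"          # multi-nucleotide substitution of equal length
    if len(ref) < len(alt):
        return "insertion"    # ALT allele is longer than REF
    return "deletion"         # ALT allele is shorter than REF


def parse_vcf_line(line):
    """Split one tab-delimited VCF data line and classify each ALT allele.

    Uses only the first five fixed VCF columns: CHROM, POS, ID, REF, ALT.
    Multiple ALT alleles are comma-separated and reported one per tuple.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    return [
        (chrom, int(pos), ref, allele, classify_variant(ref, allele))
        for allele in alt.split(",")
    ]


if __name__ == "__main__":
    # Hypothetical record, made up for illustration only.
    example = "chr1\t12345\t.\tA\tG,AT\t50\tPASS\t."
    for record in parse_vcf_line(example):
        print(record)
```

Dedicated variant callers and annotation tools apply far more careful rules than this; the point here is only that the REF and ALT columns of a VCF file already carry enough information to tell the simplest variant classes apart.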

For your reference, data and examples shown in this lecture are available here, and visualization uses the Integrative Genomics Viewer (IGV). It is not necessary to download these for this lecture, although we’ll use them in the next class.

Reminders

Homework 2 (data manipulation and visualization in R/tidyverse) is currently available and is due Tuesday, October 22 at noon. You should have received an email containing an invitation to create your repository using GitHub Classroom. Contact Kate (khertwec at fredhutch.org) with any questions or concerns.
The next class session (Lecture 8) will include analysis of genomic data in R. To prepare for this session, please download all data files in this Dropbox folder and follow the instructions in this script to install the required packages. You may also find it useful to install the Integrative Genomics Viewer (IGV) for visualization of genomic data.

Lecture 8: Genomic data in R

October 22, 2019

Course GitHub for Lecture 8

This lecture will unite the last lecture’s content on genomic analysis with our previous coding in R. The packages we’ll use this week are from Bioconductor, a collection of software specifically designed for genomic analysis in R.

Learning objectives

Use Bioconductor packages to work with genomic data in R
Load, inspect, and query genomic data (BED/SEG, BAM, VCF files)
Identify and annotate genomic variants

Class materials

Lecture 8 slides
If you have not done so already, update your local copy of the class repository from GitHub. You should have a directory (lecture08) containing the following three R Markdown tutorials:
- Lecture8_GenomicData.Rmd: store genomic data as objects, assess genomic ranges, and import and assess variant (VCF) data
- Lecture8_Annotations.Rmd: load gene annotations and apply them to genomic data
- Lecture8_Rsamtools.Rmd: compute “pile-up” statistics at genomic loci to identify genomic variants
Please download all data files found in this folder and add them to your lecture08 directory. The files should have the following filenames:
- BRCA.genome_wide_snp_6_broad_Level_3_scna.seg
- BRCA_IDC_cfDNA.bam
- BRCA_IDC_cfDNA.bam.bai
- GIAB_highconf_v.3.3.2.vcf.gz (if this file was automatically uncompressed on your computer, resulting in a file named GIAB_highconf_v.3.3.2.vcf, look in your Trash folder to find the original file ending in gz)
- GIAB_highconf_v.3.3.2.vcf.gz.tbi
You should run this script in RStudio to ensure all Bioconductor packages are installed.
Your homework will also require use of the Integrative Genomics Viewer (IGV).
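
The class tutorials use Bioconductor packages in R for this work. Purely as a language-neutral illustration of what "assessing genomic ranges" means for a segment (SEG) file, here is a minimal sketch in plain Python. It assumes a tab-delimited file with a header row and six columns in the order Sample, Chromosome, Start, End, Num_Probes, Segment_Mean, and integer coordinates; check this against the actual file. The names `Segment`, `parse_seg`, and `overlapping`, and the example rows, are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    sample: str
    chrom: str
    start: int
    end: int
    num_probes: int
    segment_mean: float


def parse_seg(text):
    """Parse tab-delimited SEG text (header row first) into Segment records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    segments = []
    for line in lines[1:]:  # skip the header row
        sample, chrom, start, end, probes, mean = line.split("\t")
        segments.append(
            Segment(sample, chrom, int(start), int(end), int(probes), float(mean))
        )
    return segments


def overlapping(segments, chrom, start, end):
    """Return segments on `chrom` that overlap the closed interval [start, end]."""
    return [
        s for s in segments
        if s.chrom == chrom and s.start <= end and s.end >= start
    ]


# Made-up rows for illustration; these are not taken from the course data.
EXAMPLE_SEG = (
    "Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean\n"
    "sampleA\t1\t1000\t5000\t10\t0.12\n"
    "sampleA\t1\t5001\t9000\t8\t-0.85\n"
    "sampleA\t2\t2000\t7000\t15\t0.40\n"
)

if __name__ == "__main__":
    segs = parse_seg(EXAMPLE_SEG)
    for seg in overlapping(segs, "1", 4500, 6000):
        print(seg)
```

Two intervals overlap when each one starts before the other ends, which is the single comparison in `overlapping`. The linear scan is fine for a small example; range-aware data structures do the same query much faster on genome-scale data.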

Reminders

Homework 2 was due at noon on Tuesday, October 22 in GitHub Classroom.
Homework 3 (genomic data in R) is available through GitHub Classroom and is due Tuesday, October 29 at noon. You should receive an email containing an invitation to create your repository using GitHub Classroom. Contact Kate (khertwec at fredhutch.org) with any questions or concerns.
The next class session will cover the Unix command line. Please ensure you have the software for accessing the Unix command line installed. We will be using compute clusters available through Fred Hutch, which will require use of your HutchNetID. If you have not yet received a notification of HutchNetID creation, please contact Kate.