Difference between revisions of "Genome"

From 太極
Jump to navigation Jump to search
(One intermediate revision by the same user not shown)
Line 505: Line 505:
[https://www.biostars.org/p/96176/ featureCounts vs HTSeq-count]
[https://www.biostars.org/p/96176/ featureCounts vs HTSeq-count]
[http://bioinformatics.cvr.ac.uk/blog/featurecounts-or-htseq-count/ featureCounts or htseq-count?]
== Inference ==
== Inference ==
Line 970: Line 972:
= PDX/Xenograft =
= PDX/Xenograft =
* https://en.wikipedia.org/wiki/Patient_derived_xenograft
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-1172 Are special read alignment strategies necessary and cost-effective when handling sequencing reads from patient-derived tumor xenografts?] by Tso et al, BMC Genomics, 2014.
* [https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-1172 Are special read alignment strategies necessary and cost-effective when handling sequencing reads from patient-derived tumor xenografts?] by Tso et al, BMC Genomics, 2014.
* [http://www.arrayserver.com/wiki/index.php?title=Align_Ion_Torrent_reads Map xenograft reads] by [http://www.omicsoft.com/array-studio/ Array Suite]
* [http://www.arrayserver.com/wiki/index.php?title=Align_Ion_Torrent_reads Map xenograft reads] by [http://www.omicsoft.com/array-studio/ Array Suite]

Revision as of 10:00, 8 November 2019


See also Bioconductor > BiocViews > Visualization. Search 'genom' as the keyword.


nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB

Simulated DNA-Seq

The following shows 3 simulated DNA-Seq data; the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read and the bottom has 8 snps per read.

File:Igv dna simul.png

Whole genome


File:Igv prjeb1486 wgs.png

Whole exome

  • (Left) GSE48215, UCSC hg19. It is seen there is a good coverage on all exons.
  • (Right) 1 of 3 whole exome data from SRP066363, UCSC hg19.

File:Igv gse48215.png File:Igv srp066363.png


  • (Left) Anders2013, Drosophila_melanogaster/Ensembl/BDGP5. It is seen there are no coverages on some exons.
  • (Right) GSE46876, UCSC/hg19.

File:Igv anders2013 rna.png File:Igv gse46876 rna.png

Tell DNA or RNA

  • DNA: no matter it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
  • RNA: focusing on expression so the coverage changes a lot. The base name still A,C,G,T (not A,C,G,U).


GIVE: Genomic Interactive Visualization Engine

Build your own genome browser


Heat map plotting by genome coordinate.


NOISeq package

Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.


R interface to genome browsers and their annotation tracks

  • Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.


A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).


See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.

cBioPortal and TCGA

The cBioPortal for Cancer Genomics provides visualization, analysis and download of large-scale cancer genomics data sets.


Tutorial: retrieve full TCGA datasets from cBioportal with R


Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.


SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.

Copy Number

Copy number work flow using Bioconductor

Detect copy number variation (CNV) from the whole exome sequencing

Whole exome sequencing != whole genome sequencing

Consensus CDS/CCDS

DBS segmentation algorithm

DBS: a fast and informative segmentation algorithm for DNA copy number analysis


An accurate and powerful method for copy number variation detection



See NGS.

R and Bioconductor packages




Bioinstaller: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on CRAN.

Some workflows

RNA-Seq workflow

Gene-level exploratory analysis and differential expression. A non stranded-specific and paired-end rna-seq experiment was used for the tutorial.

       STAR       Samtools         Rsamtools
fastq -----> sam ----------> bam  ----------> bamfiles  -|
                                                          \  GenomicAlignments       DESeq2 
                                                           --------------------> se --------> dds
      GenomicFeatures         GenomicFeatures             /        (SummarizedExperiment) (DESeqDataSet)
  gtf ----------------> txdb ---------------> genes -----|

Sequence analysis

library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)

Accessing Annotation Data

Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.

library(org.Hs.eg.db)  # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens)  # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt")     # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
Object type example package name contents
OrgDb org.Hs.eg.db gene based information for Homo sapiens
TxDb TxDb.Hsapiens.UCSC.hg19.knownGene transcriptome ranges for Homo sapiens
OrganismDb Homo.sapiens composite information for Homo sapiens
BSgenome BSgenome.Hsapiens.UCSC.hg19 genome sequence for Homo sapiens

RNA-Seq Data Analysis using R/Bioconductor


GenomicDataCommons package


  1. The TCGA data such as TCGA-LUAD are not part of clinical trials (described here).
  2. Each patient has 4 categories data and the 'case_id' is common to them:
    • demographic: gender, race, year_of_birth, year_of_death
    • diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
    • exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
    • main: disease_type, primary_site
  3. The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.





  • Differential expression analyses for RNA-sequencing and microarray studies
  • Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages - Rsubread' and limma - to perform a complete RNA-seq analysis, including Subread'Bold text read mapping, featureCounts read summarization, voom normalization and limma differential expresssion analysis.
  • Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.


Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.


Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.


The Rsamtools package provides an interface to BAM files.

The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts

This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.


IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.

The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.

If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector the bin number for each bins counting on the y-axis.


The example is obtained from ?IRanges::flank.

ir3 <- IRanges(c(2,5,1), c(3,7,3))
# IRanges of length 3
#     start end width
# [1]     2   3     2
# [2]     5   7     3
# [3]     1   3     3

flank(ir3, 2)
#     start end width
# [1]     0   1     2
# [2]     3   4     2
# [3]    -1   0     2
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
#                                     == ==

flank(ir3, 2, start=FALSE)
#     start end width
# [1]     4   5     2
# [2]     8   9     2
# [3]     4   5     2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
#                                        == == 

flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
#     start end width
# [1]     4   5     2
# [2]     3   4     2
# [3]     4   5     2
# Combine the ideas of the previous 2 cases.

flank(ir3, c(2, -2, 2))
#     start end width
# [1]     0   1     2
# [2]     5   6     2
# [3]    -1   0     2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
#                                   == ==

flank(ir3, -2, start=F)
#     start end width
# [1]     2   3     2
# [2]     6   7     2
# [3]     2   3     2
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
#                                       == ==

flank(ir3, 2, both = TRUE)
#     start end width
# [1]     0   3     4
# [2]     3   6     4
# [3]    -1   2     4
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
# (From the manual) If both = TRUE, extends the flanking region width positions into the range. 
#        The resulting range thus straddles the end point, with width positions on either side.
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
#                                             ==
#                                       == == == ==

flank(ir3, 2, start=FALSE, both=TRUE)
#     start end width
# [1]     2   5     4
# [2]     6   9     4
# [3]     2   5     4
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5]
#                                          ==
#                                       == == == ==

Both IRanges and GenomicRanges packages provide the flank function.

Flanking region is also a common term in High-throughput sequencing. The IGV user guide also has some option related to flanking.

  • General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
  • Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.



GenomicRanges depends on IRanges package. See the dependency diagram below.

GenomicFeatues ------- GenomicRanges -+- IRanges -- BioGenomics
                         |            +
                   +-----+            +- GenomeInfoDb
                   |                      |
GenomicAlignments  +--- Rsamtools --+-----+
                                    +--- Biostrings

The package defines some classes

  • GRanges
  • GRangesList
  • GAlignments
  • SummarizedExperiment: it has the following slots - expData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), expData(), mcols(), ... The mcols() method is defined in the S4Vectors package.

(As of Jan 6, 2015) The introduction in GenomicRanges vignette mentions the GAlignments object created from a 'BAM' file discarding some information such as SEQ field, QNAME field, QUAL, MAPQ and any other information that is not needed in its document. This means that multi-reads don't receive any special treatment. Also pair-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.


Counting reads with summarizeOverlaps vignette


fls <- list.files(system.file("extdata", package="GenomicAlignments"),
    recursive=TRUE, pattern="*bam$", full=TRUE)

features <- GRanges(
    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 
        7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 
        300, 500, 500)), "-",
    group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))

# GRanges object with 11 ranges and 1 metadata column:
#       seqnames       ranges strand   |    group_id
#          <Rle>    <IRanges>  <Rle>   | <character>
#   [1]    chr2L [1000, 1499]      -   |           A
#   [2]    chr2L [3000, 3499]      -   |           A
#   [3]    chr2L [4000, 4499]      -   |           A
#   [4]    chr2L [7000, 7599]      -   |           A
#   [5]    chr2R [2000, 2899]      -   |           B
#   ...      ...          ...    ... ...         ...
#   [7]    chr2R [3600, 3899]      -   |           B
#   [8]    chr2R [4000, 4899]      -   |           B
#   [9]    chr2R [7500, 7799]      -   |           B
#  [10]    chr3L [5000, 5499]      -   |           C
#  [11]    chr3L [5400, 5899]      -   |           C
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
# class: SummarizedExperiment 
# dim: 11 2 
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):

#       sm_treated1.bam sm_untreated1.bam
#  [1,]               0                 0
#  [2,]               0                 0
#  [3,]               0                 0
#  [4,]               0                 0
#  [5,]               5                 1
#  [6,]               5                 0
#  [7,]               2                 0
#  [8,]             376               104
#  [9,]               0                 0
# [10,]               0                 0
# [11,]               0                 0

Pasilla data. Note that the bam files are not clear where to find them. According to the message, we can download SAM files first and then convert them to BAM files by samtools (Not verify yet).

samtools view -h -o outputFile.bam inputFile.sam

A modified R code that works is

### code chunk number 11: gff (eval = FALSE)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)

### code chunk number 12: gff_parse (eval = FALSE)
idx <- mcols(gff0)$source == "protein_coding" & 
           mcols(gff0)$type == "exon" & 
           seqnames(gff0) == "4"
gff <- gff0[idx]
## adjust seqnames to match Bam files
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
chr4genes <- split(gff, mcols(gff)$gene_id)

### code chunk number 12: gff_parse (eval = FALSE)

# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
     recursive=TRUE, pattern="*bam$", full=TRUE)
path <- system.file("extdata", package="pasillaBamSubset")
bamlst <- BamFileList(fls)
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object

### code chunk number 15: pasilla_exoncountset (eval = FALSE)

expdata = MIAME(
              name="pasilla knockdown",
              lab="Genetics and Developmental Biology, University of 
                  Connecticut Health Center",
              contact="Dr. Brenton Graveley",
              title="modENCODE Drosophila pasilla RNA Binding Protein RNAi 
                  knockdown RNA-Seq Studies",
              abstract="RNA-seq of 3 biological replicates of from the Drosophila
                  melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs 
                  encoding pasilla, a mRNA binding protein and 4 biological replicates 
                  of the the untreated cell line.")

design <- data.frame(
              condition=c("untreated", "untreated"),
              type=rep("single-read", 2), stringsAsFactors=TRUE)
geneCDS <- newCountDataSet(

experimentData(geneCDS) <- expdata
sampleNames(geneCDS) = colnames(genehits)

### code chunk number 16: pasilla_genes (eval = FALSE)
chr4tx <- split(gff, mcols(gff)$transcript_id)
txhits <- summarizeOverlaps(chr4tx, bamlst)
txCDS <- newCountDataSet(assay(txhits), design) 
experimentData(txCDS) <- expdata

We can also check out ?summarizeOverlaps to find some fake examples.



See this post for about C version of the featureCounts program.

featureCounts vs HTSeq-count

featureCounts or htseq-count?


DESeq or edgeR


An R package for gene and isoform differential expression analysis of RNA-seq data



Probe region expression estimation for RNA-seq data for improved microarray comparability


Inference of differential exon usage in RNA-Seq


A non-parametric approach for detecting differential expression and splicing from RNA-Seq data

voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data


Pathway analysis

fgsea: Fast Gene Set Enrichment Analysis

GSEABenchmarkeR: Reproducible GSEA Benchmarking

Towards a gold standard for benchmarking gene set enrichment analysis


GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display





SEQprocess: a modularized and customizable pipeline framework for NGS processing in R package

pasilla and pasillaBamSubset Data

pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.

pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).


Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.


The ReportingTools software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.


More or less an educational package. It has 2 c and c++ source code. It is used in Advanced R programming and package development.


Bioinformatics paper

CRAN packages


Sample Size Calculation for RNA-Seq Experimental Design


Shiny app


Provides an interface to functions of the 'SAMtools' C-Library by Heng Li


The packge contains functionality for import and managing of downloaded genome annotation Data from Ensembl genome browser (European Bioinformatics Institute) and from UCSC genome browser (University of California, Santa Cruz) and annotation routines for genomic positions and splice site positions.


Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO and is capable to read, modify and write PLINK .PED-format pedigree files.


Simple TCGA Data Access for Integrated Statistical Analysis in R

TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.



The getTCGA() function allows to download various kind of data:

  • gene expression which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
  • miRNA expression which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
  • mutation data (data.type="Mutation")
  • methylation expression (data.type="Methylation")
  • copy number changes (data.type="CNA_SNP")



http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0989-6 Data from TCGA ws used

Visualize multi-dimentional cancer genomics data including of patient information, gene expressions, DNA methylations, DNA copy number variations, and SNP/mutations in matrix layout or network layout.


The GetGeneList() function is useful to download Genomic Features (including gene features/symbols) from NCBI (ftp://ftp.ncbi.nih.gov/genomes/MapView/).

> library(Map2NCBI)
> GeneList = GetGeneList("Homo sapiens", build="ANNOTATION_RELEASE.107", savefiles=TRUE, destfile=path.expand("~/"))
  # choose [2], [n], and [1] to filter the build and feature information.
  # The destination folder will contain seq_gene.txt, seq_gene.md.gz and GeneList.txt files.
> str(GeneList)
'data.frame':	52157 obs. of  15 variables:
 $ tax_id       : chr  "9606" "9606" "9606" "9606" ...
 $ chromosome   : chr  "1" "1" "1" "1" ...
 $ chr_start    : num  11874 14362 17369 30366 34611 ...
 $ chr_stop     : num  14409 29370 17436 30503 36081 ...
 $ chr_orient   : chr  "+" "-" "-" "+" ...
 $ contig       : chr  "NT_077402.3" "NT_077402.3" "NT_077402.3" "NT_077402.3" ...
 $ ctg_start    : num  1874 4362 7369 20366 24611 ...
 $ ctg_stop     : num  4409 19370 7436 20503 26081 ...
 $ ctg_orient   : chr  "+" "-" "-" "+" ...
 $ feature_name : chr  "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2" ...
 $ feature_id   : chr  "GeneID:100287102" "GeneID:653635" "GeneID:102466751" "GeneID:100302278" ...
 $ feature_type : chr  "GENE" "GENE" "GENE" "GENE" ...
 $ group_label  : chr  "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" "GRCh38.p2-Primary" ...
 $ transcript   : chr  "Assembly" "Assembly" "Assembly" "Assembly" ...
 $ evidence_code: chr  "-" "-" "-" "-" ...
> GeneList$feature_name[grep("^NAP", GeneList$feature_name)]

TCseq: Time course sequencing data analysis



See the internal link at R-GEO.

GREIN: An interactive web platform for re-analyzing GEO RNA-seq data


GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data


Biometrical Journal



Genome Analysis section

BMC Bioinformatics









CCBR Exome Pipeliner


MOFA: Multi-Omics Factor Analysis


Essential guidelines for computational method benchmarking


Simulate RNA-Seq


Used by TopHat: discovering splice junctions with RNA-Seq

BEERS/Grant G.R. 2011

http://bioinformatics.oxfordjournals.org/content/27/18/2518.long#sec-2. The simulation method is called BEERS and it was used in the STAR software paper.

For the command line options of <reads_simulator.pl> and more details about the config files that are needed/prepared by BEERS, see this gist.

This can generate paired end data but they are in one FASTA file.

$ sudo apt-get install cpanminus
$ sudo cpanm Math::Random
$ wget http://cbil.upenn.edu/BEERS/beers.tar
$ tar -xvf beers.tar      # two perl files <make_config_files_for_subset_of_gene_ids.pl> and <reads_simulator.pl>
$ cd ~/Downloads/
$ mkdir beers_output  
$ mkdir beers_simulator_refseq && cd "$_"
$ wget http://itmat.rum.s3.amazonaws.com/simulator_config_refseq.tar.gz
$ tar xzvf simulator_config_refseq.tar.gz
$ ls -lth 
total 1.4G
-rw-r--r-- 1 brb brb  44M Sep 16  2010 simulator_config_featurequantifications_refseq
-rw-r--r-- 1 brb brb 7.7M Sep 15  2010 simulator_config_geneinfo_refseq
-rw-r--r-- 1 brb brb 106M Sep 15  2010 simulator_config_geneseq_refseq
-rw-r--r-- 1 brb brb 1.3G Sep 15  2010 simulator_config_intronseq_refseq
$ cd ~/Downloads/
$ perl reads_simulator.pl 100 testbeers \
   -configstem refseq \
   -customcfgdir ~/Downloads/beers_simulator_refseq \
   -outdir ~/Downloads/beers_output

$ ls -lh beers_output
total 3.9M
-rw-r--r-- 1 brb brb 1.8K Mar 16 15:25 simulated_reads2genes_testbeers.txt
-rw-r--r-- 1 brb brb 1.2M Mar 16 15:25 simulated_reads_indels_testbeers.txt
-rw-r--r-- 1 brb brb 1.6K Mar 16 15:25 simulated_reads_junctions-crossed_testbeers.txt
-rw-r--r-- 1 brb brb 2.7M Mar 16 15:25 simulated_reads_substitutions_testbeers.txt
-rw-r--r-- 1 brb brb 6.3K Mar 16 15:25 simulated_reads_testbeers.bed
-rw-r--r-- 1 brb brb  31K Mar 16 15:25 simulated_reads_testbeers.cig
-rw-r--r-- 1 brb brb  22K Mar 16 15:25 simulated_reads_testbeers.fa
-rw-r--r-- 1 brb brb  584 Mar 16 15:25 simulated_reads_testbeers.log

$ wc -l simulated_reads2genes_testbeers.txt
102 simulated_reads2genes_testbeers.txt
$ head -4 simulated_reads2genes_testbeers.txt
seq.1	GENE.5600
seq.2	GENE.35506
seq.3	GENE.506
seq.4	GENE.34922
$ tail -4 simulated_reads2genes_testbeers.txt
seq.97	GENE.4197
seq.98	GENE.8763
seq.99	GENE.19573
seq.100	GENE.18830
$ wc -l simulated_reads_indels_testbeers.txt
36131 simulated_reads_indels_testbeers.txt
$ head -2 simulated_reads_indels_testbeers.txt
chr1:6052304-6052531	25	1	G
chr2:73899436-73899622	141	3	ATA
$ tail -2 simulated_reads_indels_testbeers.txt
chr4:68619532-68621804	1298	-2	AA
chr21:32554738-32554962	174	1	T
$ wc -l simulated_reads_substitutions_testbeers.txt 
71678  simulated_reads_substitutions_testbeers.txt
$ head -2 simulated_reads_substitutions_testbeers.txt 
chr22:50902963-50903167	50903077	G->A
chr1:6052304-6052531	6052330	G->C
$ wc -l simulated_reads_junctions-crossed_testbeers.txt 
49   simulated_reads_junctions-crossed_testbeers.txt
$ head -2 simulated_reads_junctions-crossed_testbeers.txt 
seq.1a	chrX:49084601-49084713
seq.1b	chrX:49084909-49086682

$ cat beers_output/simulated_reads_testbeers.log
Simulator run: 'testbeers'
started: Thu Mar 16 15:25:39 EDT 2017
num reads: 100
readlength: 100
substitution frequency: 0.001
indel frequency: 0.0005
base error: 0.005
low quality tail length: 10
percent of tails that are low quality: 0
quality of low qulaity tails: 0.8
percent of alt splice forms: 0.2
number of alt splice forms per gene: 2
stem: refseq
sum of gene counts: 3,886,863,063
sum of intron counts = 1,304,815,198
sum of intron counts = 2,365,472,596
intron frequency: 0.355507598105262
padded intron frequency: 0.52453796437909
finished at Thu Mar 16 15:25:58 EDT 2017

$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ head simulated_reads_testbeers.fa

# Take a look at the true coordinates
$ head -4 simulated_reads_testbeers.bed # one-based coords and contains both endpoints of each span
chrX	49084529	49084601	+
chrX	49084713	49084739	+
chrX	49084863	49084909	+
chrX	49086682	49086734	+
$ head -4 simulated_reads_testbeers.cig # has a cigar string representation of the mapping coordinates, and a more human readable representation of the coordinates
$ wc -l simulated_reads_testbeers.fa
400 simulated_reads_testbeers.fa
$ wc -l simulated_reads_testbeers.bed
247 simulated_reads_testbeers.bed
$ wc -l simulated_reads_testbeers.cig
200 simulated_reads_testbeers.cig

Flux Sammeth 2010




A data-based simulation algorithm for rna-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source rna-seq dataset provided by the user.



The key function is simulateCounts, which takes a fitted DESeq2 data object as an input and returns a simulated data object (DESeq2 class) with the same sample size factors, total counts and dispersions for each gene as in real data, but without the effect of predictor variables.

Functions fdrTable, fdrBiCurve and empiricalFDR compare the DESeq2 results obtained for the real and simulated data, compute the empirical false discovery rate (the ratio of the number of differentially expressed genes detected in the simulated data and their number in the real data) and plot the results.



Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads.

Input: reference FASTA file (containing names and sequences of transcripts from which reads should be simulated) OR a GTF file denoting transcript structures, along with one FASTA file of the DNA sequence for each chromosome in the GTF file.

Output: FASTA files. Reads in the FASTA file will be labeled with the transcript from which they were simulated.

Too many dependencies. Got an error in installation.. It seems it has not considered splice junctions.


Simulate DNA-Seq




DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads


$ head simDNA_100bp_16del.fasta

Simulate Whole genome

Simulate whole exome

Variant simulator

sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs

Convert FASTA to FASTQ

It is interesting to note that the simulated/generated FASTA files can be used by alignment/mapping tools like BWA just like FASTQ files.

If we want to convert FASTA files to FASTQ files, use https://code.google.com/archive/p/fasta-to-fastq/. The quality score 'I' means 40 (the highest) by Sanger (range [0,40]). See https://en.wikipedia.org/wiki/FASTQ_format. The Wikipedia website also mentions FASTQ read simulation tools and a comparison of these tools.

$ cat test.fasta
$ perl ~/Downloads/fasta_to_fastq.pl test.fasta

Alternatively we can use just one line of code by awk

$ awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"; for(c=0;c<length($2);c++) printf "H"; printf "\n"}' \
   test.fasta > test.fq
$ cat test.fq

Change the 'H' to the quality score value that you need (Depending what phred score scale you are using).

Simulate genetic data

‘Simulating genetic data with R: an example with deleterious variants (and a pun)’


module load gossamer
xenome index -M 24 -T 16 -P idx \
  -H $HOME/igenomes/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta/genome.fa \
  -G $HOME/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa



DNA Seq Data


  • Go to SRA/Sequence Read Archiveand type the keywords 'Whole Genome Sequencing human'. An example of the procedures to search whole genome sequencing data from human samples:
    1. Enter 'Whole Genome Sequencing human' in ncbi/sra search sra objects at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj
    2. The webpage will return the result in terms of SRA experiments, SRA studies, Biosamples, GEO datasets. I pick SRA studies from Public Access.
    3. The result is sorted by the Accession number (does not take the first 3 letters like DRP into account). The Accession number has a format SRPxxxx. So I just go to the Last page (page 98)
    4. I pick the first one Accession:SRP066837 from this page. The page shows the Study type is whole genome sequence. http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP066837
    5. (Important trick) Click the number next to Run. It will show a summary (SRR #, library name, MBases, age, biomaterial provider, isolate and sex) about all samples. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP066837
    6. Download the raw data from any one of them (eg SRR2968056). For whole genome, the Strategy is WGS. For whole exome, the Strategy is called WXS.
  • Search the keywords 'nonsynonymous' and 'human' in PMC

Use SRAToolKit instead of wget to download

Don't use the wget command since it requires the specification of right http address.

Downloading SRA data using command line utilities

SRA2R - a package to import SRA data directly into R.

(Method 1) Use the fastq-dump command. For example, the following command (modified from the document will download the first 5 reads and save it to a file called <SRR390728.fastq> (NOT sra format) in the current directory.

/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump -X 5 SRR390728 -O .
# OR 
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --split-3 SRR390728 # no progress bar

This will download the files in FASTQ format.

(Method 2) If we need to downloading by wget or FTP (works for ‘SRR’, ‘ERR’, or ‘DRR’ series):

wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra

It will download the file in SRA format. In the case of SRR590795, the sra is 240M and fastq files are 615*2MB.

(Method 3) Download Ubuntu x86_64 tarball from http://downloads.asperasoft.com/en/downloads/8?list

[email protected] ~/Downloads $ tar xzvf aspera-connect-
[email protected] ~/Downloads $ ./aspera-connect-

Installing Aspera Connect

Deploying Aspera Connect (/home/brb/.aspera/connect) for the current user only.
Restart firefox manually to load the Aspera Connect plug-in

Install complete.

[email protected] ~/Downloads $ ~/.aspera/connect/bin/ascp -QT -l640M \
  -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \
  [email protected]:/sra/sra-instant/reads/ByRun/sra/SRR/SRR590/SRR590795/SRR590795.sra .
SRR590795.sra                                                                           100%  239MB  535Mb/s    00:06
Completed: 245535K bytes transferred in 7 seconds
 (272848K bits/sec), in 1 file.
[email protected] ~/Downloads $

Aspera is typically 10 times faster than FTP according to the website. For this case, wget takes 12s while ascp uses 7s.

Note that the URL on the website's is wrong. I got the correct URL from emailing to ncbi help. Google: ascp "[email protected]"

SRAdb package


First we install some required package for XML and RCurl.

sudo apt-get update
sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-openssl-dev

and then



Only the cancer types with expected cases > 10^5 in the US in 2015 are considered here. http://www.cancer.gov/types/common-cancers

SRA Explorer


SRP066363 - lung cancer

SRP015769 or SRP062882 - prostate cancer

SRP053134 - breast cancer

Look at the MBases value column. It determines the coverage for each run.

SRP050992 single cell RNA-Seq

Used in Design and computational analysis of single-cell RNA-sequencing experiments

Single cell RNA-Seq

Exploiting single-cell expression to characterize co-expression replicability

SRP040626 or SRP040540 - Colon and rectal cancer


OmicIDX on BigQuery


See the BWA section.

Whole Exome Seq

Whole Genome Seq


  1. http://www.ncbi.nlm.nih.gov/sra/?term=SRA059511
  2. http://www.ncbi.nlm.nih.gov/sra/SRX194938[accn] and click SRP004077
  3. http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004077 and click Runs from the RHS
  4. http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP004077 and click RunInfoTable

Note that (For this study, it has 2377 rows)

  • Column A (AssemblyName_s) eg GRCh37
  • Column I (library_name_s) eg
  • column N (header=Run_s) shows all SRR or ERR accession numbers.
  • Column P (Sample_Name)
  • Column Y (header=Assay_Type_s) shows WGS.
  • Column AB (LibraryLayout_s): PAIRED

Public Data

ISB Cancer Genomics Cloud (ISB-CGC)

https://isb-cgc.appspot.com/ Leveraging Google Cloud Platform for TCGA Analysis

The ISB Cancer Genomics Cloud (ISB-CGC) is democratizing access to NCI Cancer Data (TCGA, TARGET, CCLE) and coupling it with unprecedented computational power to allow researchers to explore and analyze this vast data-space.

ISB-CGC Web Application


Next-generation characterization of the Cancer Cell Line Encyclopedia 2019

It has 1000+ cell lines profiled with different -omics including DNA methylation, RNA splicing, as well as some proteomics (and lots more!).ß

NCI's Genomic Data Commons (GDC)/TCGA

The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Genome Characterization Initiative (CGCI).


Sharing data

Gene set analysis

Hypergeometric test

Next-generation sequencing data



Bioinformatics advice I wish I learned 10 years ago from NIH

High Performance

Cloud Computing

Merge different datasets (different genechips)



Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.

How to use UCSC Table Browser

File:Tablebrowser.png File:Tablebrowser2.png


  1. the UCSC browser will return the output on browser by default. Users need to use the browser to save the file with self-chosen file name.
  2. the output does not have a header
  3. The bed format is explained in https://genome.ucsc.edu/FAQ/FAQformat.html#format1

If I select "Whole Genome", I will get a file with 75,893 rows. If I choose "Coding Exons", I will get a file with 577,387 rows.

$ wc -l hg38Tables.bed 
75893 hg38Tables.bed
$ head -2 hg38Tables.bed 
chr1	67092175	67134971	NM_001276352	0	-	67093579	67127240	0	9	1429,70,145,68,113,158,92,86,42,	0,4076,11062,19401,23176,33576,34990,38966,42754,
chr1	201283451	201332993	NM_000299	0	+	201283702	201328836	0	15	453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920,	0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
$ tail -2 hg38Tables.bed 
chr22_KI270734v1_random	131493	137393	NM_005675	0	+	131645	136994	0	5	262,161,101,141,549,	0,342,3949,4665,5351,
chr22_KI270734v1_random	138078	161852	NM_016335	0	-	138479	161586	0	15	589,89,99,176,147,93,82,80,117,65,150,35,209,313,164,	0,664,4115,5535,6670,6925,8561,9545,10037,10335,12271,12908,18210,23235,23610,

$ wc -l hg38CodingExon.bed 
577387 hg38CodingExon.bed
$ head -2 hg38CodingExon.bed 
chr1	67093579	67093604	NM_001276352_cds_0_0_chr1_67093580_r	0	-
chr1	67096251	67096321	NM_001276352_cds_1_0_chr1_67096252_r	0	-
$ tail -2 hg38CodingExon.bed 
chr22_KI270734v1_random	156288	156497	NM_016335_cds_12_0_chr22_KI270734v1_random_156289_r	0	-
chr22_KI270734v1_random	161313	161586	NM_016335_cds_13_0_chr22_KI270734v1_random_161314_r	0	-

# Focus on one NCBI refseq (https://www.ncbi.nlm.nih.gov/nuccore/444741698)
$ grep NM_001276352 hg38Tables.bed 
chr1	67092175	67134971	NM_001276352	0	-	67093579	67127240	0	9	1429,70,145,68,113,158,92,86,42,	0,4076,11062,19401,23176,33576,34990,38966,42754,
$ grep NM_001276352 hg38CodingExon.bed
chr1	67093579	67093604	NM_001276352_cds_0_0_chr1_67093580_r	0	-
chr1	67096251	67096321	NM_001276352_cds_1_0_chr1_67096252_r	0	-
chr1	67103237	67103382	NM_001276352_cds_2_0_chr1_67103238_r	0	-
chr1	67111576	67111644	NM_001276352_cds_3_0_chr1_67111577_r	0	-
chr1	67115351	67115464	NM_001276352_cds_4_0_chr1_67115352_r	0	-
chr1	67125751	67125909	NM_001276352_cds_5_0_chr1_67125752_r	0	-
chr1	67127165	67127240	NM_001276352_cds_6_0_chr1_67127166_r	0	-

This can be compared to refGene(?) directly downloaded via http

$ wget -c -O hg38.refGene.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
--2018-10-09 15:44:43--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)...
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7457957 (7.1M) [application/x-gzip]
Saving to: ‘hg38.refGene.txt.gz’

hg38.refGene.txt.gz                100%[===============================================================>]   7.11M   901KB/s    in 10s

2018-10-09 15:44:54 (708 KB/s) - ‘hg38.refGene.txt.gz’ saved [7457957/7457957]

$ zcat hg38.refGene.txt.gz | wc -l
15:45PM /tmp$ zcat hg38.refGene.txt.gz | head -2
1072	NM_003288	chr20	+	63865227	63891545	63865365	63889945	7	63865227,63869295,63873667,63875815,63882718,63889189,63889849,	63865384,63869441,63873816,63875875,63882820,63889238,63891545,	0	TPD52L2	cmpl	cmpl	0,1,0,2,2,2,0,
1815	NR_110164	chr2	+	161244738	161249050	161249050	161249050	2	161244738,161246874,	161244895,161249050,	0	LINC01806	unk	unk	-1,-1,

$ zcat hg38.refGene.txt.gz | tail -2
1006	NM_130467	chrX	+	55220345	55224108	55220599	55224003	5	55220345,55221374,55221766,55222620,55223986,	55220651,55221463,55221875,55222746,55224108,	0	PAGE5	cmpl	cmpl	0,1,0,1,1,
637	NM_001364814	chrY	-	6865917	6874027	6866072	6872608	7	6865917,6868036,6868731,6868867,6870005,6872554,6873971,	6866078,6868462,6868776,6868909,6870053,6872620,6874027,	0	AMELY	cmpl	cmpl	0,0,0,0,0,0,-1,

Where to download reference genome

Which human reference genome to use?

http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use (11/13/2017)

RefSeq categories

See Table 1 of Chapter 18The Reference Sequence (RefSeq) Database.

Category Description
NC Complete genomic molecules
NG Incomplete genomic region
NP Protein
XM predicted mRNA model
XR predicted ncRNA model
XP predicted Protein model (eukaryotic sequences)
WP predicted Protein model (prokaryotic sequences)

UCSC version & NCBI release corresponding

Gene Annotation

How many DNA strands are there in humans?

How many base pairs in human

  • 3 billion base pairs. https://en.wikipedia.org/wiki/Human_genome
  • chromosome 22 has the smallest number of bps (~50 million).
  • chromosome 1 has the largest number of bps (245 million base pairs).
  • Illumina iGenome Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa file is 3.0GB (so is other genome.fa from human).

Gene, Transcript, Coding/Non-coding exon


Types of SNPs and number of SNPs in each chromosomes

NGS technology

DNA methylation

library(coloncancermeth) # 485512 x 26
data(coloncancermeth) # load meth (methylation data), pd (sample info ) and gr objects
table(pd$Status) # 9 normals, 17 cancers
normalIndex <- which(pd$Status=="normal")
cancerlIndex <- which(pd$Status=="cancer")

for(i in normalIndex){
### Add the cancer samples
for(i in cancerlIndex){

# finding regions of the genome that are different between cancer and normal samples
eb <- ebayes(fit)

# plot of the region surrounding the top hit
i <- which.min(eb$p.value[,2])
middle <- gr[i,]

plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
# http://www.ncbi.nlm.nih.gov/pubmed/22422453

# within each chromosome we usually have big gaps creating subgroups of regions to be analyzed
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log 10 method")

table(table(cl)) ##shows the number of regions with 1,2,3, ... points in themconsider two example regions...

Whole Genome Sequencing, Whole Exome Sequencing, Transcriptome (RNA) Sequencing

Sequence + Expression

Integrate RNA-Seq and DNA-Seq

Integrate/combine Omics

Gene expression

Expression level is the amount of RNA in cell that was transcribed from that gene. Slides from Alyssa Frazee.

Quantile normalization

  • When to use Quantile Normalization? and its R package quantro
  • normalize.quantiles() from preprocessCore package. Note for ties, the average is used in normalize.quantiles(), ((4.666667 + 5.666667) / 2) = 5.166667.
    #load package
    #the function expects a matrix
    #create a matrix using the same example
    mat <- matrix(c(5,2,3,4,4,1,4,2,3,4,6,8),
    #     [,1] [,2] [,3]
    #[1,]    5    4    3
    #[2,]    2    1    4
    #[3,]    3    4    6
    #[4,]    4    2    8
    #quantile normalisation
    #         [,1]     [,2]     [,3]
    #[1,] 5.666667 5.166667 2.000000
    #[2,] 2.000000 2.000000 3.000000
    #[3,] 3.000000 5.166667 4.666667
    #[4,] 4.666667 3.000000 5.666667

Merging two gene expression studies

pheno = pData(bladderEset)
edata = exprs(bladderEset)
batch = pheno$batch
modcombat = model.matrix(~1, data=pheno)
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, 
                      par.prior=TRUE, prior.plots=FALSE)
# This returns an expression matrix, with the same dimensions 
# as your original dataset.
# By default, it performs parametric empirical Bayesian adjustments. 
# If you would like to use nonparametric empirical Bayesian adjustments, 
# use the par.prior=FALSE option (this will take longer).

Fusion gene


Structural variation

LUMPY, DELLY, ForestSV, Pindel, breakdancer , SVDetect.

RNASeq + ChipSeq


Biowulf2 at NIH


Hash BAM and FASTQ files to verify data integrity. The C++ code is based on OpenSSL and seqan libraries.

Selected Papers







Staying current

Staying Current in Bioinformatics & Genomics: 2017 Edition


Common issues in algorithmic bioinformatics papers

What are some common issues I find when reviewing algorithmic bioinformatics conference papers?

Precision Medicine courses

Personalized medicine

Cancer and gene markers

  • Colorectal cancer patients without KRAS mutations have far better outcomes with EGFR treatment than those with KRAS mutations.
    • Two EGFR inhibitors, cetuximab and panitumumab are not recommended for the treatment of colorectal cancer in patients with KRAS mutations in codon 12 and 13.
  • Breast cancer.

The shocking truth about space travel

7 percent of DNA belonging to NASA astronaut Scott Kelly changed in the time he was aboard the International Space Station

bioSyntax: syntax highlighting for computational biology


Deep learning

Deep learning: new computational modelling techniques for genomics




RNA sequencing 101



strand-specific vs non-strand specific experiment

Understand this info is necessary when we want to use summarizeOverlaps() function (GenomicAlignments) or htseq-count python program to get count data.

This post mentioned to use infer_experiment.py script to check whether the rna-seq run is stranded or not.

The rna-seq experiment used in this tutorial is not stranded-specific.


  • FASTQ=FASTA + Qual. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

Phred quality score

q = -10log10(p) where p = error probability for the base.

q error probability base call accuracy
10 0.1 90%
13 0.05 95%
20 0.01 99%
30 0.001 99.9%
40 0.0001 99.99%
50 0.00001 99.999%


fasta/fa files can be used as reference genome in IGV. But we cannot load these files in order to view them.

Download sequence files

Compute the sequence length of a FASTA file


awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

head -2 file.fa | \
    awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}'  | \
    tail -1

FASTA <=> FASTQ conversion

According to this post,

  • FastA are text files containing multiple DNA* seqs each with some text, some part of the text might be a name.
  • FastQ files are like fasta, but they also have quality scores for each base of each seq, making them appropriate for reads from an Illumina machine (or other brands)

Convert FASTA to FASTQ without quality scores

Biostars. For example, the bioawk by lh3 (Heng Li) worked.

Convert FASTA to FASTQ with quality score file

See the links on the above post.

Convert FASTQ to FASTA using Seqtk

Use the Seqtk program; see this post.

The Seqtk program by lh3 can be used to sample reads from a fastq file including paired-end; see this post.

RPKM (Mortazavi et al. 2008) and cpm (counts per million)

Reads per Kilobase of Exon per Million of Mapped reads.


  • The more we sequence, the more reads we expect from each gene. This is the most relevant correction of this method.
  • Longer transcript are expected to generate more reads. The latter is only relevant for comparisons among different genes which we rarely perform!. As such, the DESeq2 only creates a size factor for each library and normalize the counts by dividing counts by a size factor (scalar) for each library. Note that: H0: mu1=mu2 is equivalent to H0: c*mu1=c*mu2 where c is gene length.


  1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
  2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
  3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.


RPKM = (10^9 * C)/(N * L), with 

C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = gene length in base-pairs for a gene

y <- matrix(rnbinom(20,size=1,mu=10),5,4)
     [,1] [,2] [,3] [,4]
[1,]    0    0    5    0
[2,]    6    2    7    3
[3,]    5   13    7    2
[4,]    3    3    9   11
[5,]    1    2    1   15

d <- DGEList(counts=y, lib.size=1001:1004)
# Note that lib.size is optional
# By default, lib.size = colSums(counts)
cpm(d) # counts per million
   Sample1   Sample2  Sample3   Sample4
1    0.000     0.000 4985.045     0.000
2 5994.006  1996.008 6979.063  2988.048
3 4995.005 12974.052 6979.063  1992.032
4 2997.003  2994.012 8973.081 10956.175
5  999.001  1996.008  997.009 14940.239
> cpm(d,log=TRUE)
    Sample1   Sample2  Sample3   Sample4
1  7.961463  7.961463 12.35309  7.961463
2 12.607393 11.132027 12.81875 11.659911
3 12.355838 13.690089 12.81875 11.129470
4 11.663897 11.662567 13.17022 13.451207
5 10.285119 11.132027 10.28282 13.890078

d$genes$Length <- c(1000,2000,500,1500,3000)
    Sample1   Sample2    Sample3  Sample4
1    0.0000     0.000  4985.0449    0.000
2 2997.0030   998.004  3489.5314 1494.024
3 9990.0100 25948.104 13958.1256 3984.064
4 1998.0020  1996.008  5982.0538 7304.117
5  333.0003   665.336   332.3363 4980.080

> cpm
function (x, ...)
<environment: namespace:edgeR>
> showMethods("cpm")

Function "cpm":
 <not an S4 generic function>
> cpm.default
function (x, lib.size = NULL, log = FALSE, prior.count = 0.25,
    x <- as.matrix(x)
    if (is.null(lib.size))
        lib.size <- colSums(x)
    if (log) {
        prior.count.scaled <- lib.size/mean(lib.size) * prior.count
        lib.size <- lib.size + 2 * prior.count.scaled
    lib.size <- 1e-06 * lib.size
    if (log)
        log2(t((t(x) + prior.count.scaled)/lib.size))
    else t(t(x)/lib.size)
<environment: namespace:edgeR>
> rpkm.default
function (x, gene.length, lib.size = NULL, log = FALSE, prior.count = 0.25,
    y <- cpm.default(x = x, lib.size = lib.size, log = log, prior.count = prior.count)
    gene.length.kb <- gene.length/1000
    if (log)
        y - log2(gene.length.kb)
    else y/gene.length.kb
<environment: namespace:edgeR>

Here for example the 1st sample and the 2nd gene, its rpkm value is calculated as

# step 1:
6/(1.0e-6 *1001) = 5994.006    # cpm, compute column-wise
# step 2:
5994.006/ (2000/1.0e3) = 2997.003 # rpkm, compute row-wise

# Another way
# step 1 (RPK) 
6/ (2000/1.0e3) = 3
# step 2 (RPKM)
3/ (1.0e-6 * 1001) = 2997.003


Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!

(another critic) Union Exon Based Approach


In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach.

It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach.

FPKM (Trapnell et al. 2010)

Fragment per Kilobase of exon per Million of Mapped fragments (Cufflinks). FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).


> dds <- makeExampleDESeqDataSet(m=4)
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1       0       1       1       6
gene2       2       0       0       3
gene3      18       9      19      12
gene4      12      25      13      13
gene5      22      26      10       8
gene6       6       5       8       6
> dds <- estimateSizeFactors(dds)
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1       0       1       1       6
gene2       2       0       0       3
gene3      18       9      19      12
gene4      12      25      13      13
gene5      22      26      10       8
gene6       6       5       8       6
> head(counts(dds, normalized=TRUE))
       sample1    sample2    sample3   sample4
gene1  0.00000  0.9654796  0.9858756  5.732657
gene2  1.96066  0.0000000  0.0000000  2.866328
gene3 17.64594  8.6893164 18.7316365 11.465314
gene4 11.76396 24.1369899 12.8163829 12.420756
gene5 21.56726 25.1024695  9.8587560  7.643542
gene6  5.88198  4.8273980  7.8870048  5.732657
P -- per
K -- kilobase (related to gene length)
M -- million (related to sequencing depth)

TMM (Robinson and Oshlack, 2010)

Trimmed Means of M values (EdgeR).

Sample size


~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
C = L N / G

where L=read length, N =number of reads and G=haploid genome length. So, if we take one lane of single read human sequence with v3 chemistry, we get C = (100 bp)*(189×10^6)/(3×10^9 bp) = 6.3. This tells us that each base in the genome will be sequenced between six and seven times on average.

# Assume the bam file is sorted by chromosome location
# took 40 min on 5.8G bam file. samtools depth has no threads option:(
# it is not right since it only account for regions that were covered with reads
samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'    # maybe 42

# The following is the right way! The result matches with Qualimap program.
samtools depth -a *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'  # maybe 8
# OR
LEN=`samtools view -H bamfile | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'`   # 3095693981
SUM=`samtools depth bamfile | awk '{sum+=$3} END { print "Sum = ", sum}'`   # 24473867730
echo $(( $LEN/$SUM ))

SAM/Sequence Alignment Format and BAM format specification

Single-end, pair-end, fragment, insert size

Germline vs Somatic mutation

Germline: inherit from parents. See the Wikipedia page.

Driver vs passenger mutation


Nonsynonymous mutation

It is related to the genetic code, Wikipedia. There are 20 amino acids though there are 64 codes.


isma: analysis of mutations detected by multiple pipelines

isma: an R package for the integrative analysis of mutations detected by multiple pipelines

Missense variants

aminoacid changing variants

Alternative and differential splicing

Best practices and appropriate workflows to analyse alternative and differential splicing

Allele vs Gene


  • A gene is a stretch of DNA or RNA that determines a certain trait.
  • Genes mutate and can take two or more alternative forms; an allele is one of these forms of a gene. For example, the gene for eye color has several variations (alleles) such as an allele for blue eye color or an allele for brown eyes.
  • An allele is found at a fixed spot on a chromosome?
  • Chromosomes occur in pairs so organisms have two alleles for each gene — one allele in each chromosome in the pair. Since each chromosome in the pair comes from a different parent, organisms inherit one allele from each parent for each gene. The two alleles inherited from parents may be same (homozygous) or different (heterozygotes).




Base quality, Mapping quality, Variant quality

Mapping quality (MAPQ) vs Alignment score (AS)

http://seqanswers.com/forums/showthread.php?t=66634 & SAM format specification

  • MAPQ (5th column): MAPping Quality. It equals −10 log10 Pr{mapping position is wrong} (defined by SAM documentation), rounded to the nearest integer. A value 255 indicates that the mapping quality is not available. MAPQ is a metric that tells you how confident you can be that the read comes from the reported position. So given 1000 reads, for example, read alignments with mapping quality being 30, one of them will be wrong in average (10^(30/-10)=.001). Another example, if MAPQ=70, then the probability mapping position is wrong is 10^(70/-10)=1e-7. We can use 'samtools view -q 30 input.bam' to keep reads with MAPQ at least 30. Users should refer to the alignment program for the 'MAPQ' value it uses.
  • AS (optional, 14th column in my case): Alignment score is a metric that tells you how similar the read is to the reference. AS increases with the number of matches and decreases with the number of mismatches and gaps (rewards and penalties for matches and mismatches depend on the scoring matrix you use)


  1. MAPQ scores produced by the aligners typically involves the alignment score and other information.
  2. You can have high AS and low MAPQ if the read aligns perfectly at multiple positions, and you can have low AS and high MAPQ if the read aligns with mismatches but still the reported position is still much more probable than any other.
  3. You probably want to filter for MAPQ, but "good" alignment may refer to AS if what you care is similarity between read and reference.
  4. MAPQ values are really useful but their implementation is a mess by Simon Andrews

Other software




MeV v4.8 (11/18/2011) allows annotation from Bioconductor

IPA from Ingenuity

Login: There are web started version https://analysis.ingenuity.com/pa and Java applet version https://analysis.ingenuity.com/pa/login/choice.jsp. We can double click the file <IpaApplication.jnlp> in my machine's download folder.


  • easily search the scientific literature/integrate diverse biological information.
  • build dynamic pathway models
  • quickly analyze experimental data/Functional discovery: assign function to genes
  • share research and collaborate. On the other hand, IPA is web based, so it takes time for running analyses. Once submitted analyses are done, an email will be sent to the user.

Start Here

Expression data -> New core analysis -> Functions/Diseases -> Network analysis
                                        Canonical pathways        |
                                              |                   |
Simple or advanced search --------------------+                   |
                                              |                   |
                                              v                   |
                                        My pathways, Lists <------+
Creating a custom pathway --------------------+



  • The input data file can be an Excel file with at least one gene ID and expression value at the end of columns (just what BRB-ArrayTools requires in general format importer).
  • The data to be uploaded (because IPA is web-based; the projects/analyses will not be saved locally) can be in different forms. See http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions. It uses the term Single/Multiple Observation. An Observation is a list of molecule identifiers and their corresponding expression values for a given experimental treatment. A dataset file may contain a single observation or multiple observations. A Single Observation dataset contains only one experimental condition (i.e. wild-type). A Multiple Observation dataset contains more than one experimental condition (i.e. a time course experiment, a dose response experiment, etc) and can be uploaded into IPA in a single file (e.g. Excel). A maximum of 20 observations in a single file may be uploaded into IPA.
  • The instruction http://ingenuity.force.com/ipa/articles/Feature_Description/Data-Upload-definitions shows what kind of gene identifier types IPA accepts.
  • In this prostate example data tutorial, the term 'fold change' was used to replace log2 gene expression. The tutorial also uses 1.5 as the fold change expression cutoff.
  • The gene table given on the analysis output contains columns 'Fold change', 'ID', 'Notes', 'Symbol' (with tooltip), 'Entrez Gene Name', 'Location', 'Types', 'Drugs'. See a screenshot below.



DAVID Bioinformatics Resource

It offers an integrated annotation combining gene ontology, pathways and protein annotations.

It can be used to identify the pathways associated with a set of genes; e.g. this paper.


GOTrapper: a tool to navigate through branches of gene ontology hierarchy


Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR).



Genome-wide association studies in R