Genome

From 太極
Jump to navigation Jump to search

Contents

Visualization

See also Bioconductor > BiocViews > Visualization. Search 'genom' as the keyword.

IGV

nano ~/binary/IGV_2.3.52/igv.sh # Change -Xmx2000m to -Xmx4000m in order to increase the memory to 4GB
~/binary/IGV_2.3.52/igv.sh

Simulated DNA-Seq

The following shows 3 simulated DNA-Seq data; the top has 8 insertions (purple '|') per read, the middle has 8 deletions (black '-') per read and the bottom has 8 snps per read.

File:Igv dna simul.png

Whole genome

PRJEB1486

File:Igv prjeb1486 wgs.png

Whole exome

  • (Left) GSE48215, UCSC hg19. It is seen there is a good coverage on all exons.
  • (Right) 1 of 3 whole exome data from SRP066363, UCSC hg19.

File:Igv gse48215.png File:Igv srp066363.png

RNA-Seq

  • (Left) Anders2013, Drosophila_melanogaster/Ensembl/BDGP5. It is seen there are no coverages on some exons.
  • (Right) GSE46876, UCSC/hg19.

File:Igv anders2013 rna.png File:Igv gse46876 rna.png

Tell DNA or RNA

  • DNA: no matter it is whole genome or whole exome, the coverage is more even. For whole exome, there is no splicing.
  • RNA: focusing on expression so the coverage changes a lot. The base name still A,C,G,T (not A,C,G,U).

RNA-seq DRaMA

https://hssgenomics.shinyapps.io/RNAseq_DRaMA/ from 2nd Annual Shiny Contest

Gviz

GIVE: Genomic Interactive Visualization Engine

Build your own genome browser

ChromHeatMap

Heat map plotting by genome coordinate.

ggbio

NOISeq package

Exploratory analysis (Sequencing depth, GC content bias, RNA composition) and differential expression for RNA-seq data.

rtracklayer

R interface to genome browsers and their annotation tracks

  • Retrieve annotation from GTF file and parse the file to a GRanges instance. See the 'Counting reads with summarizeOverlaps' vignette from GenomicAlignments package.

ssviz

A small RNA-seq visualizer and analysis toolkit. It includes a function to draw bar plot of counts per million in tag length with two datasets (control and treatment).

Sushi

See fig on p22 of Sushi vignette where genes with different strands are shown with different directions when plotGenes() was used. plotGenes() can be used to plot gene structures that are stored in bed format.

cBioPortal and TCGA

Qualimap

Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.

SeqMonk

SeqMonk is a program to enable the visualisation and analysis of mapped sequence data.

Copy Number

Copy number work flow using Bioconductor

Detect copy number variation (CNV) from the whole exome sequencing

Whole exome sequencing != whole genome sequencing

Consensus CDS/CCDS

DBS segmentation algorithm

DBS: a fast and informative segmentation algorithm for DNA copy number analysis

modSaRa2

An accurate and powerful method for copy number variation detection

NGS

File:CentralDogmaMolecular.png

See NGS.

R and Bioconductor packages

Resources

library(VariantAnnotation)
library(AnnotationHub)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(BSgenome.Hsapiens.UCSC.hg19)

Docker

Bioinstaller: A comprehensive R package to construct interactive and reproducible biological data analysis applications based on the R platform. Package on CRAN.

Some workflows

RNA-Seq workflow

Gene-level exploratory analysis and differential expression. A non stranded-specific and paired-end rna-seq experiment was used for the tutorial.

       STAR       Samtools         Rsamtools
fastq -----> sam ----------> bam  ----------> bamfiles  -|
                                                          \  GenomicAlignments       DESeq2 
                                                           --------------------> se --------> dds
      GenomicFeatures         GenomicFeatures             /        (SummarizedExperiment) (DESeqDataSet)
  gtf ----------------> txdb ---------------> genes -----|

RNA-Seq differential expression work flow using DESeq2

rnaseqGene

rnaseqGene - RNA-seq workflow: gene-level exploratory analysis and differential expression

tximport

CodeOcean - Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences (version: 1.17.5). Plan.

Sequence analysis

library(ShortRead) or library(Biostrings) (QA)
gtf + library(GenomicFeatures) or directly library(TxDb.Scerevisiae.UCSC.sacCer2.sgdGene) (gene information)
GenomicRanges::summarizeOverlaps or GenomicRanges::countOverlaps(count)
edgeR or DESeq2 (gene expression analysis)
library(org.Sc.sgd.db) or library(biomaRt)

Accessing Annotation Data

Use microarray probe, gene, pathway, gene ontology, homology and other annotations. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.

library(org.Hs.eg.db)  # Sample OrgDb Workflow
library("hgu95av2.db") # Sample ChipDb Workflow
library(TxDb.Hsapiens.UCSC.hg19.knownGene) # Sample TxDb Workflow
library(Homo.sapiens)  # Sample OrganismDb Workflow
library(AnnotationHub) # Sample AnnotationHub Workflow
library("biomaRt")     # Using biomaRt
library(BSgenome.Hsapiens.UCSC.hg19) # BSgenome packages
Object type example package name contents
OrgDb org.Hs.eg.db gene based information for Homo sapiens
TxDb TxDb.Hsapiens.UCSC.hg19.knownGene transcriptome ranges for Homo sapiens
OrganismDb Homo.sapiens composite information for Homo sapiens
BSgenome BSgenome.Hsapiens.UCSC.hg19 genome sequence for Homo sapiens
refGenome

RNA-Seq Data Analysis using R/Bioconductor

recount2

dbGap

dbgap2x: an R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)

eQTL

Statistics for Genomic Data Science (Coursera) and MatrixEQTL from CRAN

GenomicDataCommons package

Note:

  1. The TCGA data such as TCGA-LUAD are not part of clinical trials (described here).
  2. Each patient has 4 categories data and the 'case_id' is common to them:
    • demographic: gender, race, year_of_birth, year_of_death
    • diagnoses: tumor_stage, age_at_diagnosis, tumor_grade
    • exposures: cigarettes_per_day, alcohol_history, years_smoked, bmi, alcohol_intensity, weight, height
    • main: disease_type, primary_site
  3. The original download (clinical.tsv file) data contains a column 'treatment_or_therapy' but it has missing values for all patients.

Visualization

GenVisR

ComplexHeatmap

limma

  • Differential expression analyses for RNA-sequencing and microarray studies
  • Case Study using a Bioconductor R pipeline to analyze RNA-seq data (this is linked from limma package user guide). Here we illustrate how to use two Bioconductor packages - Rsubread' and limma - to perform a complete RNA-seq analysis, including Subread'Bold text read mapping, featureCounts read summarization, voom normalization and limma differential expresssion analysis.
  • Unbalanced data, non-normal data, Bartlett's test for equal variance across groups and SAM tests (assumes equal variances just like limma). See this post.

easyRNASeq

Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.

ShortRead

Base classes, functions, and methods for representation of high-throughput, short-read sequencing data.

Rsamtools

The Rsamtools package provides an interface to BAM files.

The main purpose of the Rsamtools package is to import BAM files into R. Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating R objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts

This package provides an interface to the 'samtools', 'bcftools', and 'tabix' utilities (see 'LICENCE') for manipulating SAM (Sequence Alignment / Map), FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files.

IRanges

IRanges is a fundamental package (see how many packages depend on it) to other packages like GenomicRanges, GenomicFeatures and GenomicAlignments. The package defines the IRanges class.

The plotRanges() function given in the 'An Introduction to IRanges' vignette shows how to draw an IRanges object.

If we want to make the same plot using the ggplot2 package, we can follow the example in this post. Note that disjointBins() returns a vector the bin number for each bins counting on the y-axis.

flank

The example is obtained from ?IRanges::flank.

ir3 <- IRanges(c(2,5,1), c(3,7,3))
# IRanges of length 3
#     start end width
# [1]     2   3     2
# [2]     5   7     3
# [3]     1   3     3

flank(ir3, 2)
#     start end width
# [1]     0   1     2
# [2]     3   4     2
# [3]    -1   0     2
# Note: by default flank(ir3, 2) = flank(ir3, 2, start = TRUE, both=FALSE)
# For example, [2,3] => [2,X] => (..., 0, 1, 2) => [0, 1]
#                                     == ==

flank(ir3, 2, start=FALSE)
#     start end width
# [1]     4   5     2
# [2]     8   9     2
# [3]     4   5     2
# For example, [2,3] => [X,3] => (..., 3, 4, 5) => [4,5]
#                                        == == 

flank(ir3, 2, start=c(FALSE, TRUE, FALSE))
#     start end width
# [1]     4   5     2
# [2]     3   4     2
# [3]     4   5     2
# Combine the ideas of the previous 2 cases.

flank(ir3, c(2, -2, 2))
#     start end width
# [1]     0   1     2
# [2]     5   6     2
# [3]    -1   0     2
# The original statement is the same as flank(ir3, c(2, -2, 2), start=T, both=F)
# For example, [5, 7] => [5, X] => ( 5, 6) => [5, 6]
#                                   == ==

flank(ir3, -2, start=F)
#     start end width
# [1]     2   3     2
# [2]     6   7     2
# [3]     2   3     2
# For example, [5, 7] => [X, 7] => (..., 6, 7) => [6, 7]
#                                       == ==

flank(ir3, 2, both = TRUE)
#     start end width
# [1]     0   3     4
# [2]     3   6     4
# [3]    -1   2     4
# The original statement is equivalent to flank(ir3, 2, start=T, both=T)
# (From the manual) If both = TRUE, extends the flanking region width positions into the range. 
#        The resulting range thus straddles the end point, with width positions on either side.
# For example, [2, 3] => [2, X] => (..., 0, 1, 2, 3) => [0, 3]
#                                             ==
#                                       == == == ==

flank(ir3, 2, start=FALSE, both=TRUE)
#     start end width
# [1]     2   5     4
# [2]     6   9     4
# [3]     2   5     4
# For example, [2, 3] => [X, 3] => (..., 2, 3, 4, 5) => [4, 5]
#                                          ==
#                                       == == == ==

Both IRanges and GenomicRanges packages provide the flank function.

Flanking region is also a common term in High-throughput sequencing. The IGV user guide also has some option related to flanking.

  • General tab: Feature flanking regions (base pairs). IGV adds the flank before and after a feature locus when you zoom to a feature, or when you view gene/loci lists in multiple panels.
  • Alignments tab: Splice junction track options. The minimum amount of nucleotide coverage required on both sides of a junction for a read to be associated with the junction. This affects the coverage of displayed junctions, and the display of junctions covered only by reads with small flanking regions.

Biostrings

GenomicRanges

GenomicRanges depends on IRanges package. See the dependency diagram below.

GenomicFeatues ------- GenomicRanges -+- IRanges -- BioGenomics
                         |            +
                   +-----+            +- GenomeInfoDb
                   |                      |
GenomicAlignments  +--- Rsamtools --+-----+
                                    +--- Biostrings

The package defines some classes

  • GRanges
  • GRangesList
  • GAlignments
  • SummarizedExperiment: it has the following slots - expData, rowData, colData, and assays. Accessors include assays(), assay(), colData(), expData(), mcols(), ... The mcols() method is defined in the S4Vectors package.

(As of Jan 6, 2015) The introduction in GenomicRanges vignette mentions the GAlignments object created from a 'BAM' file discarding some information such as SEQ field, QNAME field, QUAL, MAPQ and any other information that is not needed in its document. This means that multi-reads don't receive any special treatment. Also pair-end reads will be treated as single-end reads and the pairing information will be lost. This might change in the future.

GenomicAlignments

Counting reads with summarizeOverlaps vignette

library(GenomicAlignments)
library(DESeq)
library(edgeR)

fls <- list.files(system.file("extdata", package="GenomicAlignments"),
    recursive=TRUE, pattern="*bam$", full=TRUE)

features <- GRanges(
    seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 4000, 
        7500, 5000, 5400), width=c(rep(500, 3), 600, 900, 500, 300, 900, 
        300, 500, 500)), "-",
    group_id=c(rep("A", 4), rep("B", 5), rep("C", 2)))
features

# GRanges object with 11 ranges and 1 metadata column:
#       seqnames       ranges strand   |    group_id
#          <Rle>    <IRanges>  <Rle>   | <character>
#   [1]    chr2L [1000, 1499]      -   |           A
#   [2]    chr2L [3000, 3499]      -   |           A
#   [3]    chr2L [4000, 4499]      -   |           A
#   [4]    chr2L [7000, 7599]      -   |           A
#   [5]    chr2R [2000, 2899]      -   |           B
#   ...      ...          ...    ... ...         ...
#   [7]    chr2R [3600, 3899]      -   |           B
#   [8]    chr2R [4000, 4899]      -   |           B
#   [9]    chr2R [7500, 7799]      -   |           B
#  [10]    chr3L [5000, 5499]      -   |           C
#  [11]    chr3L [5400, 5899]      -   |           C
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
olap
# class: SummarizedExperiment 
# dim: 11 2 
# exptData(0):
# assays(1): counts
# rownames: NULL
# rowData metadata column names(1): group_id
# colnames(2): sm_treated1.bam sm_untreated1.bam
# colData names(0):

assays(olap)$counts
#       sm_treated1.bam sm_untreated1.bam
#  [1,]               0                 0
#  [2,]               0                 0
#  [3,]               0                 0
#  [4,]               0                 0
#  [5,]               5                 1
#  [6,]               5                 0
#  [7,]               2                 0
#  [8,]             376               104
#  [9,]               0                 0
# [10,]               0                 0
# [11,]               0                 0

Pasilla data. Note that the bam files are not clear where to find them. According to the message, we can download SAM files first and then convert them to BAM files by samtools (Not verify yet).

samtools view -h -o outputFile.bam inputFile.sam

A modified R code that works is

###################################################
### code chunk number 11: gff (eval = FALSE)
###################################################
library(rtracklayer)
fl <- paste0("ftp://ftp.ensembl.org/pub/release-62/",
             "gtf/drosophila_melanogaster/",
             "Drosophila_melanogaster.BDGP5.25.62.gtf.gz")
gffFile <- file.path(tempdir(), basename(fl))
download.file(fl, gffFile)
gff0 <- import(gffFile, asRangedData=FALSE)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
idx <- mcols(gff0)$source == "protein_coding" & 
           mcols(gff0)$type == "exon" & 
           seqnames(gff0) == "4"
gff <- gff0[idx]
## adjust seqnames to match Bam files
seqlevels(gff) <- paste("chr", seqlevels(gff), sep="")
chr4genes <- split(gff, mcols(gff)$gene_id)

###################################################
### code chunk number 12: gff_parse (eval = FALSE)
###################################################
library(GenomicAlignments)

# fls <- c("untreated1_chr4.bam", "untreated3_chr4.bam")
fls <- list.files(system.file("extdata", package="pasillaBamSubset"),
     recursive=TRUE, pattern="*bam$", full=TRUE)
path <- system.file("extdata", package="pasillaBamSubset")
bamlst <- BamFileList(fls)
genehits <- summarizeOverlaps(chr4genes, bamlst, mode="Union") # SummarizedExperiment object
assays(genehits)$counts

###################################################
### code chunk number 15: pasilla_exoncountset (eval = FALSE)
###################################################
library(DESeq)

expdata = MIAME(
              name="pasilla knockdown",
              lab="Genetics and Developmental Biology, University of 
                  Connecticut Health Center",
              contact="Dr. Brenton Graveley",
              title="modENCODE Drosophila pasilla RNA Binding Protein RNAi 
                  knockdown RNA-Seq Studies",
              pubMedIds="20921232",
              url="http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE18508",
              abstract="RNA-seq of 3 biological replicates of from the Drosophila
                  melanogaster S2-DRSC cells that have been RNAi depleted of mRNAs 
                  encoding pasilla, a mRNA binding protein and 4 biological replicates 
                  of the the untreated cell line.")

design <- data.frame(
              condition=c("untreated", "untreated"),
              replicate=c(1,1),
              type=rep("single-read", 2), stringsAsFactors=TRUE)
library(DESeq)
geneCDS <- newCountDataSet(
                  countData=assay(genehits),
                  conditions=design)

experimentData(geneCDS) <- expdata
sampleNames(geneCDS) = colnames(genehits)

###################################################
### code chunk number 16: pasilla_genes (eval = FALSE)
###################################################
chr4tx <- split(gff, mcols(gff)$transcript_id)
txhits <- summarizeOverlaps(chr4tx, bamlst)
txCDS <- newCountDataSet(assay(txhits), design) 
experimentData(txCDS) <- expdata

We can also check out ?summarizeOverlaps to find some fake examples.

tidybulk

chromoMap

Rsubread

See this post for about C version of the featureCounts program.

featureCounts vs HTSeq-count

featureCounts or htseq-count?

Inference

tximport

$ head -5 quant.sf 
Name	Length	EffectiveLength	TPM	NumReads
ENST00000456328.2	1657	1410.79	0.083908	2.46885
ENST00000450305.2	632	410.165	0	0
ENST00000488147.1	1351	1035.94	10.4174	225.073
ENST00000619216.1	68	24	0	0
ENST00000473358.1	712	453.766	0	0
  • Another real data from PDMR -> PDX. Select Genomic Analysis -> RNASeq and TPM(genes) column. Consider Patient ID=114348, Specimen ID=004-R, Sample ID=ATHY22, CTEP SDC Code=10038045,
$ head -2 114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results 
gene_id	transcript_id.s.	length	effective_length	expected_count	TPM	FPKM
5S_rRNA	uc021ofx.1,uc021olu.1,uc021ome.1,uc021oov.1,uc021opl.1,uc021osl.1,uc021ovd.1,uc021ovp.1,uc021pdq.1,uc021pdv.1,uc021pdw.1,uc021ped.1,uc021pff.1,uc021pft.1,uc021phz.1,uc021pmi.1,uc021pmw.1,uc021pnb.1,uc021poq.1,uc021pou.1,uc021ptv.1,uc021pwj.1,uc021pwo.1,uc021pxx.1,uc021qdg.1,uc021qeb.1,uc021qfe.1,uc021qmn.1,uc021qox.1,uc021qqc.1,uc021qqh.1,uc021qsf.1,uc021qvm.1,uc021qzz.1,uc021rcl.1,uc021rij.1,uc021rjt.1,uc021rli.1,uc021rlm.1,uc021rln.1,uc021rmi.1,uc021rmm.1,uc021rva.1,uc021rwx.1,uc021rym.1,uc021sjn.1,uc021snj.1,uc021stz.1,uc021tcu.1,uc021tcv.1,uc021tfq.1,uc021tjk.1,uc021tly.1,uc021tnx.1,uc021tov.1,uc021ubz.1,uc021ujp.1,uc021urx.1,uc021vfh.1,uc021vfo.1,uc021vfu.1,uc021vhy.1,uc021vib.1,uc021vih.1,uc021vjl.1,uc021vjp.1,uc021vju.1,uc021vnc.1,uc021vqb.1,uc021vqp.1,uc021vrv.1,uc021vrx.1,uc021vtp.1,uc021vvw.1,uc021vzi.1,uc021wah.1,uc021wds.1,uc021wek.1,uc021wfa.1,uc021whx.1,uc021wjp.1,uc021wog.1,uc021wpw.1,uc021wuf.1,uc021wuz.1,uc021wyq.1,uc021wzq.1,uc021wzu.1,uc021wzx.1,uc021xai.1,uc021xde.1,uc021xdi.1,uc021xeg.1,uc021xeo.1,uc021xev.1,uc021xez.1,uc021xfx.1,uc021xlz.1,uc021xqf.1,uc021xsq.1,uc021xug.1,uc021xwn.1,uc021xxd.1,uc021xxi.1,uc021xys.1,uc021yks.1,uc021ymh.1,uc021zdo.1,uc021zea.1,uc021zez.1,uc021zfd.1,uc021zft.1,uc021zgc.1,uc021zgn.1,uc022abq.1,uc022akm.1,uc022amt.1,uc022aor.1,uc022atj.1,uc022atr.1,uc022att.1,uc022avo.1,uc022axx.1,uc022bex.1,uc022bia.1,uc022bkv.1,uc022blh.1,uc022bll.1,uc022blx.1,uc022bmb.1,uc022bnt.1,uc022bvb.1,uc022byo.1,uc022bzg.1,uc022cdx.1,uc022cer.1,uc022cgh.1,uc031tkn.1	105.04	2.23	2039.99	49629.87	35353.97

$ R
x = read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.genes.results")
dim(x)
# [1] 28109     7
names(x)
# [1] "gene_id"          "transcript_id.s." "length"           "effective_length"
# [5] "expected_count"   "TPM"              "FPKM"            
x[1:3, -2]
#     gene_id length effective_length expected_count      TPM     FPKM
# 1   5S_rRNA 105.04             2.23        2039.99 49629.87 35353.97
# 2 5_8S_rRNA 161.00            21.19           0.00     0.00     0.00
# 3    6M1-18 473.00           302.74           0.00     0.00     0.00

y <- read.delim("114348_004-R_ATHY22_v2.0.1.4.0_RNASeq.RSEM.isoforms.results")
dim(y)
# [1] 78375     8
names(y)
# [1] "transcript_id"    "gene_id"          "length"           "effective_length"
# [5] "expected_count"   "TPM"              "FPKM"             "IsoPct" 
y[1:3, -1]
#   gene_id length effective_length expected_count TPM FPKM IsoPct
# 1 5S_rRNA    110             3.06              0   0    0      0
# 2 5S_rRNA    133             9.08              0   0    0      0
# 3 5S_rRNA     92             0.00              0   0    0      0

File:RSEM PDX.png

DESeq or edgeR

EBSeq

An R package for gene and isoform differential expression analysis of RNA-seq data

http://www.rna-seqblog.com/analysis-of-ebv-transcription-using-high-throughput-rna-sequencing/

prebs

Probe region expression estimation for RNA-seq data for improved microarray comparability

DEXSeq

Inference of differential exon usage in RNA-Seq

rSeqNP

A non-parametric approach for detecting differential expression and splicing from RNA-Seq data

voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data

http://www.biosoft.hacettepe.edu.tr/voomDDA/

Pathway analysis

GSOAP

GSOAP: a tool for visualization of gene set over-representation analysis

fgsea: Fast Gene Set Enrichment Analysis

GSEABenchmarkeR: Reproducible GSEA Benchmarking

Towards a gold standard for benchmarking gene set enrichment analysis

hypeR

GSEPD

GSEPD: a Bioconductor package for RNA-seq gene set enrichment and projection display

BAGSE

BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis 2020

Pipeline

GeneTEFlow

GeneTEFlow – A Nextflow-based pipeline for analysing gene and transposable elements expression from RNA-Seq data

pipeComp

pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools

SARTools

http://www.rna-seqblog.com/sartools-a-deseq2-and-edger-based-r-pipeline-for-comprehensive-differential-analysis-of-rna-seq-data/

SEQprocess

SEQprocess: a modularized and customizable pipeline framework for NGS processing in R package

pasilla and pasillaBamSubset Data

pasilla - Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011.

pasillaBamSubset - Subset of BAM files untreated1.bam (single-end reads) and untreated3.bam (paired-end reads) from "Pasilla" experiment (Pasilla knock-down by Brooks et al., Genome Research 2011).

BitSeq

Transcript expression inference and differential expression analysis for RNA-seq data. The homepage of Antti Honkela.

ReportingTools

The ReportingTools software package enables users to easily display reports of analysis results generated from sources such as microarray and sequencing data.

Figures can be included in a cell in output table. See Using ReportingTools in an Analysis of Microarray Data.

It is suggested by e.g. EnrichmentBrowser.

sequences

More or less an educational package. It has 2 c and c++ source code. It is used in Advanced R programming and package development.

QuasR

Bioinformatics paper

CRAN packages

ssizeRNA

Sample Size Calculation for RNA-Seq Experimental Design

RnaSeqSampleSize

Shiny app

rbamtools

Provides an interface to functions of the 'SAMtools' C-Library by Heng Li

refGenome

The packge contains functionality for import and managing of downloaded genome annotation Data from Ensembl genome browser (European Bioinformatics Institute) and from UCSC genome browser (University of California, Santa Cruz) and annotation routines for genomic positions and splice site positions.

WhopGenome

Provides very fast access to whole genome, population scale variation data from VCF files and sequence data from FASTA-formatted files. It also reads in alignments from FASTA, Phylip, MAF and other file formats. Provides easy-to-use interfaces to genome annotation from UCSC and Bioconductor and gene ontology data from AmiGO and is capable to read, modify and write PLINK .PED-format pedigree files.

TCGA2STAT

Simple TCGA Data Access for Integrated Statistical Analysis in R

TCGA2STAT depends on Bioconductor package CNTools which cannot be installed automatically.

source("https://bioconductor.org/biocLite.R")
biocLite("CNTools")

install.packages("TCGA2STAT")

The getTCGA() function allows to download various kind of data:

  • gene expression which includes mRNA-microarray gene expression data (data.type="mRNA_Array") & RNA-Seq gene expression data (data.type="RNASeq")
  • miRNA expression which includes miRNA-array data (data.type="miRNA_Array") & miRNA-Seq data (data.type="miRNASeq")
  • mutation data (data.type="Mutation")
  • methylation expression (data.type="Methylation")
  • copy number changes (data.type="CNA_SNP")

TCGAbiolinks

  • An example from Public Data Resources in Bioconductor workshop 2020. According to ?GDCquery, for the legacy data arguments project, data.category, platform and/or file.extension should be used.
    library(TCGAbiolinks)
    library(SummarizedExperiment)
    query <- GDCquery(project = "TCGA-ACC",
                               data.category = "Gene expression",
                               data.type = "Gene expression quantification",
                               platform = "Illumina HiSeq", 
                               file.type  = "normalized_results",
                               experimental.strategy = "RNA-Seq",
                               legacy = TRUE)
    
    gdcdir <- file.path("Waldron_PublicData", "GDCdata")
    GDCdownload(query, method = "api", files.per.chunk = 10,
                directory = gdcdir)  # 79 files
    ACCse <- GDCprepare(query, directory = gdcdir)
    ACCse
    class(ACCse)
    dim(assay(ACCse))  # 19947 x 79
    assay(ACCse)[1:3, 1:2] # symbol id
    length(unique(rownames(assay(ACCse))))   #  19672
    rowData(ACCse)[1:2, ]
    # DataFrame with 2 rows and 3 columns
    #          gene_id entrezgene ensembl_gene_id
    #      <character>  <integer>     <character>
    # A1BG        A1BG          1 ENSG00000121410
    # A2M          A2M          2 ENSG00000175899
    
  • HTSeq counts data example
    query2 <- GDCquery(project = "TCGA-ACC",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification",
                       workflow.type="HTSeq - Counts") 
    gdcdir2 <- file.path("Waldron_PublicData", "GDCdata2")
    GDCdownload(query2, method = "api", files.per.chunk = 10,
                directory = gdcdir2)  # 79 files
    ACCse2 <- GDCprepare(query2, directory = gdcdir2)
    ACCse2
    dim(assay(ACCse2))  # 56457 x 79
    assay(ACCse2)[1:3, 1:2]  # ensembl id
    rowData(ACCse2)[1:2, ]
    DataFrame with 2 rows and 3 columns
                    ensembl_gene_id external_gene_name original_ensembl_gene_id
                        <character>        <character>              <character>
    ENSG00000000003 ENSG00000000003             TSPAN6       ENSG00000000003.13
    ENSG00000000005 ENSG00000000005               TNMD        ENSG00000000005.5
    
  • Clinical data
    acc_clin <- GDCquery_clinic(project = "TCGA-ACC", type = "Clinical")
    dim(acc_clin)
    # [1] 92 71
    
  • TCGAanalyze_DEA(). Differentially Expression Analysis (DEA) Using edgeR Package.
    dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
    dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut =  0.25)
    samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
    samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
    dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT],
                          dataFilt[,samplesTP],"Normal", "Tumor")
    # 2nd example
    dataDEGs <- TCGAanalyze_DEA(mat1 = dataFiltLGG, mat2 = dataFiltGBM,
                               Cond1type = "LGG", Cond2type = "GBM",
                               fdr.cut = 0.01,  logFC.cut = 1,
                               method = "glmLRT")
    
  • Enrichment analysis
    ansEA <– TCGAanalyze_EAcomplete(TFname="DEA genes LGG Vs GBM", 
                                    RegulonList = rownames(dataDEGs))
    
    TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP),
                            GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC,
                            GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat,
                            nRGTab = rownames(dataDEGs),
                            nBar = 20)
    
  • mRNA Analysis Pipeline from GDC documentation.