Reproducible

From 太極

Common Workflow Language (CWL)

R

  • A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker, Slides, Talks & Video. The whole idea is implemented in the R package repro. The package creates an R project template, available in RStudio via New Project -> Create Example Repro Template. Note that the Makefile and Dockerfile can be inferred from the markdown.Rmd file. Note that this approach does not make use of the renv package, and it cannot handle Bioconductor packages. The workflow has four elements:
    • Git folder of source code for version control (R project)
    • Makefile. Make is a “recipe” language that describes how files depend on each other and how to resolve these dependencies.
    • Docker software environment (Containerization)
    • R Markdown (dynamic document generation)
    library(repro)
    automake() # Create '.repro/Dockerfile_packages',
               #        '.repro/Makefile_Rmds' & 'Dockerfile'
               # and open <Makefile>
    
    # Modify <Makefile> by following the console output
    
    rerun() # inspects the files of the project and suggests a way to
            # reproduce it. Just follow the console output
            # by opening a terminal and typing
    make docker && make -B DOCKER=TRUE
    
    # The above will generate the output html file in your browser
    

    In the end, according to the console output, it calls the following command, where 'reproproject' is the Docker image name in this example (the same as my project name, except that it is automatically converted to lowercase).

    docker run --rm --user 368262265 \
      -v "/Full_Path_To_Project":"/home/rstudio/" \
      reproproject Rscript \
      -e 'rmarkdown::render("/home/rstudio//markdown.Rmd", "all")' 
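
The Makefile element rests on a simple rule: a target is rebuilt only when it is missing or older than one of its dependencies, and `make -B` skips that check and rebuilds everything. The R sketch below imitates that rule with file modification times. It only illustrates the idea and is not how Make or the repro package is implemented; the function name `needs_rebuild` and the temporary files are hypothetical.

```r
# Hypothetical helper (not part of repro or Make): decide whether `target`
# must be rebuilt from its dependencies `deps`, based on modification times.
needs_rebuild <- function(target, deps, force = FALSE) {
  if (force) return(TRUE)                     # like 'make -B': always rebuild
  if (!file.exists(target)) return(TRUE)      # missing target: must build
  any(file.mtime(deps) > file.mtime(target))  # is any dependency newer?
}

# Temporary files standing in for markdown.Rmd and its HTML output
dep    <- tempfile(fileext = ".Rmd")
target <- tempfile(fileext = ".html")
writeLines("source", dep)

r_missing <- needs_rebuild(target, dep)       # target does not exist yet
writeLines("output", target)
Sys.setFileTime(dep, Sys.time() - 60)         # dependency older than target
r_fresh <- needs_rebuild(target, dep)
Sys.setFileTime(dep, Sys.time() + 60)         # dependency edited after the build
r_stale <- needs_rebuild(target, dep)
r_forced <- needs_rebuild(target, dep, force = TRUE)

print(c(missing = r_missing, fresh = r_fresh, stale = r_stale, forced = r_forced))
```

Only the "fresh" case reports that no rebuild is needed, which is why editing markdown.Rmd is enough to make a plain `make` re-render the document.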
    

Rmarkdown

Rmarkdown package

packrat and renv

R packages → packrat/renv

checkpoint

R → Reproducible Research

dockr package

'dockr': easy containerization for R

Docker & Singularity

Docker

targets package

targets: Democratizing Reproducible Analysis Pipelines, by Will Landau

Snakemake

Papers

High-throughput analysis suggests differences in journal false discovery rate by subject area and impact factor but not open access status

Share your code and data

Misc

  • 4 great free tools that can make your R work more efficient, reproducible and robust
  • digest: Create Compact Hash Digests of R Objects
  • memoise: Memoisation of Functions. Great for shiny applications, but you need to understand how it works in order to take advantage of it. I modified the example from Efficient R by moving the data out of the function. The cache takes effect on the 2nd call. I don't use the benchmark() function since it repeats the same operation each time, which favors memoise and masks some detail.
    library(ggplot2) # mpg 
    library(memoise) 
    plot_mpg2 <- function(mpgdf, row_to_remove) {
      mpgdf = mpgdf[-row_to_remove,]
      plot(mpgdf$cty, mpgdf$hwy)
      lines(lowess(mpgdf$cty, mpgdf$hwy), col=2)
    }
    m_plot_mpg2 = memoise(plot_mpg2)
    system.time(m_plot_mpg2(mpg, 12)) # 1st memoised call: runs the function and caches the result
    #   user  system elapsed
    #  0.019   0.003   0.025
    system.time(plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.018   0.003   0.024
    system.time(m_plot_mpg2(mpg, 12)) # 2nd memoised call: result comes from the cache
    #   user  system elapsed
    #  0.000   0.000   0.001
    system.time(plot_mpg2(mpg, 12))
    #   user  system elapsed
    #  0.032   0.008   0.047
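
To see why only the second memoised call is fast, it helps to look at what a memoiser does: it stores each result under a key derived from the arguments and returns the stored value when the same key comes up again. Below is a minimal base-R sketch of that idea. `simple_memoise`, `slow_square` and the `calls` counter are made up for illustration; this is not the memoise package's implementation, which is more careful about how it builds the key and where it keeps the cache.

```r
# Hypothetical, simplified memoiser: cache results in an environment,
# keyed by the deparsed argument list (fine for small, simple arguments only).
simple_memoise <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(...), envir = cache)   # first time: compute and store
    }
    get(key, envir = cache, inherits = FALSE)  # otherwise: reuse stored value
  }
}

calls <- 0                       # counts how often the real function runs
slow_square <- function(x) {
  calls <<- calls + 1
  x^2
}
m_square <- simple_memoise(slow_square)

res1 <- m_square(4)   # computed
res2 <- m_square(4)   # served from the cache, slow_square() is not called
res3 <- m_square(5)   # new argument, computed
print(calls)          # number of real evaluations
```

Three calls were made but the wrapped function ran only twice, because the repeated argument was answered from the cache.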
    
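Be careful when using it in simulation: a memoised function called with the same arguments returns the cached result, so the random numbers are not redrawn.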
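    f <- function(n=1e5) {
      a <- rnorm(n)
      a
    }
    system.time(f1 <- f())
    mf <- memoise::memoise(f)
    system.time(f2 <- mf())  # 1st call: draws new random numbers and caches them
    system.time(f3 <- mf())  # 2nd call: returns the cached draws, no new sampling
    all.equal(f2, f3) # TRUE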
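
One way to keep memoise usable in a simulation is to make the random seed an explicit argument, so that each seed gets its own cache entry while repeated calls with the same seed stay cached. The sketch below assumes the memoise package is installed; the function `sim` and its `seed` argument are hypothetical, and calling `set.seed()` inside the function resets the global random number stream as a side effect.

```r
library(memoise)

# Hypothetical simulation function: the seed is part of the arguments,
# so it is also part of the cache key.
sim <- function(seed, n = 1e5) {
  set.seed(seed)   # note: changes the global RNG state
  rnorm(n)
}
m_sim <- memoise(sim)

r1 <- m_sim(seed = 1)   # computed and cached
r2 <- m_sim(seed = 1)   # same seed: returned from the cache
r3 <- m_sim(seed = 2)   # different seed: a new draw

same_seed_equal <- identical(r1, r2)
diff_seed_equal <- identical(r1, r3)
print(c(same_seed = same_seed_equal, different_seed = diff_seed_equal))

forget(m_sim)           # clear the cache when fresh runs are wanted
```

If the draws must differ on every call, the simpler option is to leave the sampling function unmemoised and cache only the deterministic steps around it.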