Difference between revisions of "R"

From 太極
Jump to navigation Jump to search
(3 intermediate revisions by the same user not shown)
Line 396: Line 396:
 
* [https://blog.methodsconsultants.com/posts/introduction-to-interactive-graphics-in-r-with-plotly/ Introduction to Interactive Graphics in R with plotly]
 
* [https://blog.methodsconsultants.com/posts/introduction-to-interactive-graphics-in-r-with-plotly/ Introduction to Interactive Graphics in R with plotly]
 
* [https://plotly-r.com/controlling-tooltips.html#tooltip-text-ggplotly Tooltip]
 
* [https://plotly-r.com/controlling-tooltips.html#tooltip-text-ggplotly Tooltip]
 +
 +
=== highcharter ===
 +
https://cran.r-project.org/web/packages/highcharter/ Good for time series plot.
  
 
== Amazon ==
 
== Amazon ==
Line 1,207: Line 1,210:
 
* [https://ntguardian.wordpress.com/2019/02/04/organizing-r-research-projects-cpat-case-study/ Organizing R Research Projects: CPAT, A Case Study]
 
* [https://ntguardian.wordpress.com/2019/02/04/organizing-r-research-projects-cpat-case-study/ Organizing R Research Projects: CPAT, A Case Study]
 
* [https://www.tidyverse.org/articles/2017/12/workflow-vs-script/ Project-oriented workflow]. It suggests the [https://github.com/r-lib/here here] package. Don't use '''setwd()''' and '''rm(list = ls())'''.
 
* [https://www.tidyverse.org/articles/2017/12/workflow-vs-script/ Project-oriented workflow]. It suggests the [https://github.com/r-lib/here here] package. Don't use '''setwd()''' and '''rm(list = ls())'''.
 +
** [https://rstats.wtf/safe-paths.html Practice safe paths]. Use projects and the [https://cran.r-project.org/web/packages/here/index.html here] package.
 
** In RStudio, if we try to send a few lines of code and one of the line contains '''setwd()''', it will give a message: ''The working directory was changed to XXX inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.''
 
** In RStudio, if we try to send a few lines of code and one of the line contains '''setwd()''', it will give a message: ''The working directory was changed to XXX inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.''
 
** No update for the ''here'' package. Consider [https://github.com/r-lib/usethis usethis] package (Automate project and package setup).
 
** No update for the ''here'' package. Consider [https://github.com/r-lib/usethis usethis] package (Automate project and package setup).
Line 2,819: Line 2,823:
  
 
A copy of the Rconsole is saved in [https://gist.github.com/arraytools/ed16a486e19702ae94bde4212ad59ecb github].
 
A copy of the Rconsole is saved in [https://gist.github.com/arraytools/ed16a486e19702ae94bde4212ad59ecb github].
 +
 +
=== How R starts up ===
 +
https://rstats.wtf/r-startup.html
  
 
== Saving and loading history automatically: .Rprofile & local() ==
 
== Saving and loading history automatically: .Rprofile & local() ==
Line 6,071: Line 6,078:
 
</pre>
 
</pre>
 
* [https://learningstatisticswithr.com/book/ Learning statistics with R: A tutorial for psychology students and other beginners] by Danielle Navarro
 
* [https://learningstatisticswithr.com/book/ Learning statistics with R: A tutorial for psychology students and other beginners] by Danielle Navarro
 +
* [https://rstats.wtf/ What They Forgot to Teach You About R] Jennifer Bryan & Jim Hester
  
 
== Videos ==
 
== Videos ==

Revision as of 14:59, 8 November 2019

Contents

Install and upgrade R

Here

Online Editor

We can run R on web browsers without installing it on local machines (similar to [/ideone.com Ideone.com] for C++. It does not require an account either (cf RStudio).

rdrr.io

It can produce graphics too. The package I am testing (cobs) is available too.

rstudio.cloud

RDocumentation

The interactive engine is based on DataCamp Light

For example, tbl_df function from dplyr package.

The website DataCamp allows to run library() on the Script window. After that, we can use the packages on R Console.

Here is a list of (common) R packages that users can use on the web.

The packages on RDocumentation may be outdated. For example, the current stringr on CRAN is v1.2.0 (2/18/2017) but RDocumentation has v1.1.0 (8/19/2016).

Web Applications

See also CRAN Task View: Web Technologies and Services

Rmarkdown: create HTML5 web, slides and more

Rmarkdown

HTTP protocol

An HTTP server is conceptually simple:

  1. Open port 80 for listening
  2. When contact is made, gather a little information (get mainly - you can ignore the rest for now)
  3. Translate the request into a file request
  4. Open the file and spit it back at the client

It gets more difficult depending on how much of HTTP you want to support - POST is a little more complicated, scripts, handling multiple requests, etc.

Example in R

> co <- socketConnection(port=8080, server=TRUE, blocking=TRUE) 
> # Now open a web browser and type http://localhost:8080/index.html
> readLines(co,1)
[1] "GET /index.html HTTP/1.1"
> readLines(co,1)
[1] "Host: localhost:8080"
> readLines(co,1)
[1] "User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0"
> readLines(co,1)
[1] "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
> readLines(co,1)
[1] "Accept-Language: en-US,en;q=0.5"
> readLines(co,1)
[1] "Accept-Encoding: gzip, deflate"
> readLines(co,1)
[1] "Connection: keep-alive"
> readLines(co,1)
[1] ""

Example in C (Very simple http server written in C, 187 lines)

Create a simple hello world html page and save it as <index.html> in the current directory (/home/brb/Downloads/)

Launch the server program (assume we have done gcc http_server.c -o http_server)

$ ./http_server -p 50002
Server started at port no. 50002 with root directory as /home/brb/Downloads

Secondly open a browser and type http://localhost:50002/index.html. The server will respond

GET /index.html HTTP/1.1
Host: localhost:50002
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

file: /home/brb/Downloads/index.html
GET /favicon.ico HTTP/1.1
Host: localhost:50002
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

file: /home/brb/Downloads/favicon.ico
GET /favicon.ico HTTP/1.1
Host: localhost:50003
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

file: /home/brb/Downloads/favicon.ico

The browser will show the page from <index.html> in server.

The only bad thing is the code does not close the port. For example, if I have use Ctrl+C to close the program and try to re-launch with the same port, it will complain socket() or bind(): Address already in use.


Another Example in C (55 lines)

http://mwaidyanatha.blogspot.com/2011/05/writing-simple-web-server-in-c.html

The response is embedded in the C code.

If we test the server program by opening a browser and type "http://localhost:15000/", the server received the follwing 7 lines

GET / HTTP/1.1
Host: localhost:15000
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

If we include a non-executable file's name in the url, we will be able to download that file. Try "http://localhost:15000/client.c".

If we use telnet program to test, wee need to type anything we want

$ telnet localhost 15000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
ThisCanBeAnything        <=== This is what I typed in the client and it is also shown on server
HTTP/1.1 200 OK          <=== From here is what I got from server
Content-length: 37Content-Type: text/html

HTML_DATA_HERE_AS_YOU_MENTIONED_ABOVE <=== The html tags are not passed from server, interesting!
Connection closed by foreign host.
$ 

See also more examples under C page.

Others

shiny

See Shiny.

plumber: Turning your R code into a RESTful Web API

Docker

httpuv

http and WebSocket library.

See also the servr package which can start an HTTP server in R to serve static files, or dynamic documents that can be converted to HTML files (e.g., R Markdown) under a given directory.

RApache

gWidgetsWWW

Rook

See Rook.

sumo

Sumo is a fully-functional web application template that exposes an authenticated user's R session within java server pages. See the paper http://journal.r-project.org/archive/2012-1/RJournal_2012-1_Bergsma+Smith.pdf.

Stockplot

FastRWeb

http://cran.r-project.org/web/packages/FastRWeb/index.html

WebDriver

'WebDriver' Client for 'PhantomJS'

https://github.com/rstudio/webdriver

Rwui

CGHWithR and WebDevelopR

CGHwithR is still working with old version of R although it is removed from CRAN. Its successor is WebDevelopR. Its The vignette (year 2013) provides a review of several available methods.

manipulate from RStudio

This is not a web application. But the manipulate package can be used to create interactive plot within R(Studio) environment easily. Its source is available at here.

Mathematica also has manipulate function for plotting; see here.

RCloud

RCloud is an environment for collaboratively creating and sharing data analysis scripts. RCloud lets you mix analysis code in R, HTML5, Markdown, Python, and others. Much like Sage, iPython notebooks and Mathematica, RCloud provides a notebook interface that lets you easily record a session and annotate it with text, equations, and supporting images.

See also the Talk in UseR 2014.

cloudyr and flyio - Input Output Files in R from Cloud or Local

https://blog.socialcops.com/inside-sc/announcements/flyio-r-package-interact-data-cloud/ Announcing flyio, an R Package to Interact with Data in the Cloud]

Dropbox access

rdrop2 package

Web page scraping

http://www.slideshare.net/schamber/web-data-from-r#btnNext

xml2 package

rvest package depends on xml2.

rvest

Easy web scraping with R

On Ubuntu, we need to install two packages first!

sudo apt-get install libcurl4-openssl-dev # OR libcurl4-gnutls-dev

sudo apt-get install libxml2-dev

Animate

V8: Embedded JavaScript Engine for R

R⁶ — General (Attys) Distributions: V8, rvest, ggbeeswarm, hrbrthemes and tidyverse packages are used.

pubmed.mineR

Text mining of PubMed Abstracts (http://www.ncbi.nlm.nih.gov/pubmed). The algorithms are designed for two formats (text and XML) from PubMed.

R code for scraping the P-values from pubmed, calculating the Science-wise False Discovery Rate, et al (Jeff Leek)

These R packages import sports, weather, stock data and more

Diving Into Dynamic Website Content with splashr

https://rud.is/b/2017/02/09/diving-into-dynamic-website-content-with-splashr/

Send email

curl

https://www.rdocumentation.org/packages/curl/versions/4.2/topics/send_mail send_mail()]

blastula

How to Send Custom E-mails with R

mailR

Easiest. Require rJava package (not trivial to install, see rJava). mailR is an interface to Apache Commons Email to send emails from within R. See also send bulk email

Before we use the mailR package, we have followed here to have Allow less secure apps: 'ON' ; or you might get an error Error: EmailException (Java): Sending the email to the following server failed : smtp.gmail.com:465. Once we turn on this option, we may get an email for the notification of this change. Note that the recipient can be other than a gmail.

> send.mail(from = "[email protected]",
  to = c("[email protected]", "Recipient 2 <[email protected]>"),
  replyTo = c("Reply to someone else <[email protected]>")
  subject = "Subject of the email",
  body = "Body of the email",
  smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = "gmail_username", passwd = "password", ssl = TRUE),
  attach.files ="./myattachment.txt",
  authenticate = TRUE,
  send = TRUE)
[1] "Java-Object{[email protected]}"

MailR SMTP Setup (Gmail, Outlook, Yahoo) | STARTTLS

gmailr

More complicated. gmailr provides access the Google's gmail.com RESTful API. Vignette and an example on here. Note that it does not use a password; it uses a json file for oauth authentication downloaded from https://console.cloud.google.com/. See also https://github.com/jimhester/gmailr/issues/1.

library(gmailr)
gmail_auth('mysecret.json', scope = 'compose') 

test_email <- mime() %>%
  to("[email protected]") %>%
  from("[email protected]") %>%
  subject("This is a subject") %>%
  html_body("<html><body>I wish this was bold</body></html>")
send_message(test_email)

sendmailR

sendmailR provides a simple SMTP client. It is not clear how to use the package (i.e. where to enter the password).

emayili

emayili: Sending Email from R

GEO (Gene Expression Omnibus)

See this internal link.

Interactive html output

sendplot

RIGHT

The supported plot types include scatterplot, barplot, box plot, line plot and pie plot.

In addition to tooltip boxes, the package can create a table showing all information about selected nodes.

r2d3

r2d3 - R Interface to D3 Visualizations

d3Network

library(d3Network)

Source <- c("A", "A", "A", "A", "B", "B", "C", "C", "D") 
Target <- c("B", "C", "D", "J", "E", "F", "G", "H", "I") 
NetworkData <- data.frame(Source, Target) 

d3SimpleNetwork(NetworkData, height = 800, width = 1024, file="tmp.html")

htmlwidgets for R

Embed widgets in R Markdown documents and Shiny web applications.

networkD3

This is a port of Christopher Gandrud's d3Network package to the htmlwidgets framework.

scatterD3

scatterD3 is an HTML R widget for interactive scatter plots visualization. It is based on the htmlwidgets R package and on the d3.js javascript library.

dygraphs

rthreejs - Create interactive 3D scatter plots, network plots, and globes

Examples

d3heatmap

See R

svgPanZoom

This 'htmlwidget' provides pan and zoom interactivity to R graphics, including 'base', 'lattice', and 'ggplot2'. The interactivity is provided through the 'svg-pan-zoom.js' library.

DT: An R interface to the DataTables library

plotly

highcharter

https://cran.r-project.org/web/packages/highcharter/ Good for time series plot.

Amazon

Download product information and reviews from Amazon.com

sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-openssl-dev

and in R

install.packages("devtools")
install.packages("XML")
install.packages("pbapply")
install.packages("dplyr")
devtools::install_github("56north/Rmazon")
product_info <- Rmazon::get_product_info("1593273843")
reviews <- Rmazon::get_reviews("1593273843")
reviews[1,6] # only show partial characters from the 1st review
nchar(reviews[1,6])
as.character(reviews[1,6]) # show the complete text from the 1st review

reviews <- Rmazon::get_reviews("B07BNGJXGS")
# Fetching 30 reviews of 'BOOX Note Ereader,Android 6.0 32 GB 10.3" Dual Touch HD Display'
#   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s
reviews
# A tibble: 30 x 6
   reviewRating reviewDate reviewFormat Verified_Purcha… reviewHeadline
          <dbl> <chr>      <lgl>        <lgl>            <chr>         
 1            4 May 23, 2… NA           TRUE             Good for PDF …
 2            3 May 8, 20… NA           FALSE            The reading s…
 3            5 May 17, 2… NA           TRUE             E-reader and …
 4            3 May 24, 2… NA           TRUE             Good hardware…
 5            3 June 21, … NA           TRUE             Poor QC       
 6            5 August 5,… NA           TRUE             Excellent for…
 7            5 May 31, 2… NA           TRUE             Especially li…
 8            5 July 4, 2… NA           TRUE             Android 6 rea…
 9            4 July 15, … NA           TRUE             Remember the …
10            4 June 9, 2… NA           TRUE             Overall fanta…
# ... with 20 more rows, and 1 more variable: reviewText <chr>
reviews[1, 6] # 6-th column is the review text

gutenbergr

Edinbr: Text Mining with R

Twitter

Faces of #rstats Twitter

OCR

Wikipedia

WikipediR: R's MediaWiki API client library

Creating local repository for CRAN and Bioconductor

R repository

Parallel Computing

See R parallel.

Cloud Computing

Install R on Amazon EC2

http://randyzwitch.com/r-amazon-ec2/

Bioconductor on Amazon EC2

http://www.bioconductor.org/help/bioconductor-cloud-ami/

Big Data Analysis

bigmemory, biganalytics, bigtabulate

ff, ffbase

biglm

data.table

See data.table.

disk.frame

Split-apply-combine for Maximum Likelihood Estimation of a linear model

Reproducible Research

http://cran.r-project.org/web/views/ReproducibleResearch.html

Reproducible Environments

https://rviews.rstudio.com/2019/04/22/reproducible-environments/

Some lessons in R coding

  1. don't use rand() and srand() in c. The result is platform dependent. My experience is Ubuntu/Debian/CentOS give the same result but they are different from macOS and Windows. Use Rcpp package and R's random number generator instead.
  2. don't use list.files() directly. The result is platform dependent even different Linux OS. An extra sorting helps!

Useful R packages

Rcpp

http://cran.r-project.org/web/packages/Rcpp/index.html. See more here.

RInside : embed R in C++ code

Ubuntu

With RInside, R can be embedded in a graphical application. For example, $HOME/R/x86_64-pc-linux-gnu-library/3.0/RInside/examples/qt directory includes source code of a Qt application to show a kernel density plot with various options like kernel functions, bandwidth and an R command text box to generate the random data. See my demo on Youtube. I have tested this qtdensity example successfully using Qt 4.8.5.

  1. Follow the instruction cairoDevice to install required libraries for cairoDevice package and then cairoDevice itself.
  2. Install Qt. Check 'qmake' command becomes available by typing 'whereis qmake' or 'which qmake' in terminal.
  3. Open Qt Creator from Ubuntu start menu/Launcher. Open the project file $HOME/R/x86_64-pc-linux-gnu-library/3.0/RInside/examples/qt/qtdensity.pro in Qt Creator.
  4. Under Qt Creator, hit 'Ctrl + R' or the big green triangle button on the lower-left corner to build/run the project. If everything works well, you shall see the interactive program qtdensity appears on your desktop.

File:qtdensity.png

With RInside + Wt web toolkit installed, we can also create a web application. To demonstrate the example in examples/wt directory, we can do

cd ~/R/x86_64-pc-linux-gnu-library/3.0/RInside/examples/wt
make
sudo ./wtdensity --docroot . --http-address localhost --http-port 8080

Then we can go to the browser's address bar and type http://localhost:8080 to see how it works (a screenshot is in here).

Windows 7

To make RInside works on Windows OS, try the following

  1. Make sure R is installed under C:\ instead of C:\Program Files if we don't want to get an error like g++.exe: error: Files/R/R-3.0.1/library/RInside/include: No such file or directory.
  2. Install RTools
  3. Instal RInside package from source (the binary version will give an error )
  4. Create a DOS batch file containing necessary paths in PATH environment variable
@echo off
set PATH=C:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;%PATH%
set PATH=C:\R\R-3.0.1\bin\i386;%PATH%
set PKG_LIBS=`Rscript -e "Rcpp:::LdFlags()"`
set PKG_CPPFLAGS=`Rscript -e "Rcpp:::CxxFlags()"`
set R_HOME=C:\R\R-3.0.1
echo Setting environment for using R
cmd

In the Windows command prompt, run

cd C:\R\R-3.0.1\library\RInside\examples\standard
make -f Makefile.win

Now we can test by running any of executable files that make generates. For example, rinside_sample0.

rinside_sample0

As for the Qt application qdensity program, we need to make sure the same version of MinGW was used in building RInside/Rcpp and Qt. See some discussions in

So the Qt and Wt web tool applications on Windows may or may not be possible.

GUI

Qt and R

tkrplot

On Ubuntu, we need to install tk packages, such as by

sudo apt-get install tk-dev

reticulate - Interface to 'Python'

def add_three(x):
    z = x + 3
    return z
---
title: "R Notebook"
output: html_notebook
---

```{r}
library(reticulate)
py_discover_config()
x <- 5
source_python("test.py")
y <- add_three(x)
print(y)
```

Pass R variables to Python. Works
```{python}
a = 7
print(r.x)
```

Pass python variables to R. Works.
```{r}
py$a
py_run_string("y = 10"); py$y
```

Hadoop (eg ~100 terabytes)

See also HighPerformanceComputing

RHadoop

Snowdoop: an alternative to MapReduce algorithm

XML

On Ubuntu, we need to install libxml2-dev before we can install XML package.

sudo apt-get update
sudo apt-get install libxml2-dev

On CentOS,

yum -y install libxml2 libxml2-devel

XML

library(XML)

# Read and parse HTML file
doc.html = htmlTreeParse('http://apiolaza.net/babel.html', useInternal = TRUE)

# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))

# Replace all by spaces
doc.text = gsub('\n', ' ', doc.text)

# Join all the elements of the character vector into a single
# character string, separated by spaces
doc.text = paste(doc.text, collapse = ' ')

This post http://stackoverflow.com/questions/25315381/using-xpathsapply-to-scrape-xml-attributes-in-r can be used to monitor new releases from github.com.

> library(RCurl) # getURL()
> library(XML)   # htmlParse and xpathSApply
> xData <- getURL("https://github.com/alexdobin/STAR/releases")
> doc = htmlParse(xData)
> plain.text <- xpathSApply(doc, "//span[@class='css-truncate-target']", xmlValue)
  # I look at the source code and search 2.5.3a and find the tag as
  # 2.5.3a
> plain.text
 [1] "2.5.3a"      "2.5.2b"      "2.5.2a"      "2.5.1b"      "2.5.1a"     
 [6] "2.5.0c"      "2.5.0b"      "STAR_2.5.0a" "STAR_2.4.2a" "STAR_2.4.1d"
>
> # try bwa
> > xData <- getURL("https://github.com/lh3/bwa/releases")
> doc = htmlParse(xData)
> xpathSApply(doc, "//span[@class='css-truncate-target']", xmlValue)
[1] "v0.7.15" "v0.7.13"

> # try picard
> xData <- getURL("https://github.com/broadinstitute/picard/releases")
> doc = htmlParse(xData)
> xpathSApply(doc, "//span[@class='css-truncate-target']", xmlValue)
 [1] "2.9.1" "2.9.0" "2.8.3" "2.8.2" "2.8.1" "2.8.0" "2.7.2" "2.7.1" "2.7.0"
[10] "2.6.0"

This method can be used to monitor new tags/releases from some projects like Cura, BWA, Picard, STAR. But for some projects like sratools the class attribute in the span element ("css-truncate-target") can be different (such as "tag-name").

xmlview

RCurl

On Ubuntu, we need to install the packages (the first one is for XML package that RCurl suggests)

# Test on Ubuntu 14.04
sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-openssl-dev

Scrape google scholar results

https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googleScholarXScraper/googleScholarXScraper.R

No google ID is required

Seems not work

 Error in data.frame(footer = xpathLVApply(doc, xpath.base, "/font/span[@class='gs_fl']",  : 
  arguments imply differing number of rows: 2, 0 

devtools

devtools package depends on Curl. It actually depends on some system files. If we just need to install a package, consider the remotes package which was suggested by the BiocManager package.

# Ubuntu 14.04
sudo apt-get install libcurl4-openssl-dev

# Ubuntu 16.04
sudo apt-get install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev

Lazy-load database XXX is corrupt. internal error -3. It often happens when you use install_github to install a package that's currently loaded; try restarting R and running the app again.

httr

httr imports curl, jsonlite, mime, openssl and R6 packages.

When I tried to install httr package, I got an error and some message:

Configuration failed because openssl was not found. Try installing:
 * deb: libssl-dev (Debian, Ubuntu, etc)
 * rpm: openssl-devel (Fedora, CentOS, RHEL)
 * csw: libssl_dev (Solaris)
 * brew: openssl (Mac OSX)
If openssl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a openssl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘openssl’

It turns out after I run sudo apt-get install libssl-dev in the terminal (Debian), it would go smoothly with installing httr package. Nice httr!

Real example: see this post. Unfortunately I did not get a table result; I only get an html file (R 3.2.5, httr 1.1.0 on Ubuntu and Debian).

Since httr package was used in many other packages, take a look at how others use it. For example, aRxiv package.

curl

curl is independent of RCurl package.

library(curl)
h <- new_handle()
handle_setform(h,
  name="aaa", email="bbb"
)
req <- curl_fetch_memory("http://localhost/d/phpmyql3_scripts/ch02/form2.html", handle = h)
rawToChar(req$content)

rOpenSci packages

rOpenSci contains packages that allow access to data repositories through the R statistical programming environment

remotes

Download and install R packages stored in 'GitHub', 'BitBucket', or plain 'subversion' or 'git' repositories. This package is a lightweight replacement of the 'install_*' functions in 'devtools'. Also remotes does not require any extra OS level library (at least on Ubuntu 16.04).

Example:

# https://github.com/henrikbengtsson/matrixstats
remotes::install_github('HenrikBengtsson/[email protected]')

DirichletMultinomial

On Ubuntu, we do

sudo apt-get install libgsl0-dev

Create GUI

gWidgets

GenOrd: Generate ordinal and discrete variables with given correlation matrix and marginal distributions

here

rjson

http://heuristically.wordpress.com/2013/05/20/geolocate-ip-addresses-in-r/

RJSONIO

Accessing Bitcoin Data with R

http://blog.revolutionanalytics.com/2015/11/accessing-bitcoin-data-with-r.html

Plot IP on google map

The following example is modified from the first of above list.

require(RJSONIO) # fromJSON
require(RCurl)   # getURL

temp = getURL("https://gist.github.com/arraytools/6743826/raw/23c8b0bc4b8f0d1bfe1c2fad985ca2e091aeb916/ip.txt", 
                           ssl.verifypeer = FALSE)
ip <- read.table(textConnection(temp), as.is=TRUE)
names(ip) <- "IP"
nr = nrow(ip)
 
Lon <- as.numeric(rep(NA, nr))
Lat <- Lon
Coords <- data.frame(Lon, Lat)
 
ip2coordinates <- function(ip) {
  api <- "http://freegeoip.net/json/"
  get.ips <- getURL(paste(api, URLencode(ip), sep=""))
  # result <- ldply(fromJSON(get.ips), data.frame)
  result <- data.frame(fromJSON(get.ips))
  names(result)[1] <- "ip.address"
  return(result)
}

for (i in 1:nr){
  cat(i, "\n")
  try(
  Coords[i, 1:2] <- ip2coordinates(ip$IP[i])[c("longitude", "latitude")]
  )
}
 
# append to log-file:
logfile <- data.frame(ip, Lat = Coords$Lat, Long = Coords$Lon,
                                       LatLong = paste(round(Coords$Lat, 1), round(Coords$Lon, 1), sep = ":")) 
log_gmap <- logfile[!is.na(logfile$Lat), ]

require(googleVis) # gvisMap
gmap <- gvisMap(log_gmap, "LatLong",
                options = list(showTip = TRUE, enableScrollWheel = TRUE,
                               mapType = 'hybrid', useMapTypeControl = TRUE,
                               width = 1024, height = 800))
plot(gmap)

File:GoogleVis.png

The plot.gvis() method in googleVis packages also teaches the startDynamicHelp() function in the tools package, which was used to launch a http server. See Jeffrey Horner's note about deploying Rook App.

Map

leaflet

choroplethr

ggplot2

How to make maps with Census data in R

googleVis

See an example from RJSONIO above.

googleAuthR

Create R functions that interact with OAuth2 Google APIs easily, with auto-refresh and Shiny compatibility.

gtrendsR - Google Trends

quantmod

Maintaining a database of price files in R. It consists of 3 steps.

  1. Initial data downloading
  2. Update existing data
  3. Create a batch file

caret

Tool for connecting Excel with R

Read/Write Excel files package

  • http://www.milanor.net/blog/?p=779
  • flipAPI. One useful feature of DownloadXLSX, which is not supported by the readxl package, is that it can read Excel files directly from the URL.
  • xlsx: depends on Java
  • openxlsx: not depend on Java. Depend on zip application. On Windows, it seems to be OK without installing Rtools. But it can not read xls file; it works on xlsx file.
  • readxl: it does not depend on anything although it can only read but not write Excel files.
    • It is part of tidyverse package. The readxl website provides several articles for more examples.
    • readxl webinar.
    • One advantage of read_excel (as with read_csv in the readr package) is that the data imports into an easy to print object with three attributes a tbl_df, a tbl and a data.frame.
    • For writing to Excel formats, use writexl or openxlsx package.
  • writexl: zero dependency xlsx writer for R
library(readxl)
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE, 
    col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf, 
    guess_max = min(1000, n_max), progress = readxl_progress(), 
    .name_repair = "unique")
# Example
read_excel(path, range = cell_cols("c:cx"), col_types = "numeric")

For the Chromosome column, integer values becomes strings (but converted to double, so 5 becomes 5.000000) or NA (empty on sheets).

> head(read_excel("~/Downloads/BRCA.xls", 4)[ , -9], 3)
  UniqueID (Double-click) CloneID UGCluster
1                   HK1A1   21652 Hs.445981
2                   HK1A2   22012 Hs.119177
3                   HK1A4   22293 Hs.501376
                                                    Name Symbol EntrezID
1 Catenin (cadherin-associated protein), alpha 1, 102kDa CTNNA1     1495
2                              ADP-ribosylation factor 3   ARF3      377
3                          Uroporphyrinogen III synthase   UROS     7390
  Chromosome      Cytoband ChimericClusterIDs Filter
1   5.000000        5q31.2               <NA>      1
2  12.000000         12q13               <NA>      1
3       <NA> 10q25.2-q26.3               <NA>      1

The hidden worksheets become visible (Not sure what are those first rows mean in the output).

> excel_sheets("~/Downloads/BRCA.xls")
DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 01 00 00 00 9a 0c 00 00 1a 00 
DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 03 00 00 00 9b 0c 00 00 0a 00 
DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 02 00 00 00 9a 0c 00 00 06 00 
[1] "Experiment descriptors" "Filtered log ratio"     "Gene identifiers"      
[4] "Gene annotations"       "CollateInfo"            "GeneSubsets"           
[7] "GeneSubsetsTemp"       

The Chinese character works too.

> read_excel("~/Downloads/testChinese.xlsx", 1)
   中文 B C
1     a b c
2     1 2 3

To read all worksheets we need a convenient function

read_excel_allsheets <- function(filename) {
    sheets <- readxl::excel_sheets(filename)
    sheets <- sheets[-1] # Skip sheet 1
    x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X, col_types = "numeric"))
    names(x) <- sheets
    x
}
dcfile <- "table0.77_dC_biospear.xlsx"
dc <- read_excel_allsheets(dcfile)
# Each component (eg dc1) is a tibble.

readr

Compared to base equivalents like read.csv(), readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and it doesn’t munge the column names.

1.0.0 released.

Consider a text file where the table (6100 x 22) has duplicated row names and the (1,1) element is empty. The column names are all unique.

  • read.delim() will treat the first column as rownames but it does not allow duplicated row names. Even we use row.names=NULL, it still does not read correctly. It does give warnings (EOF within quoted string & number of items read is not a multiple of the number of columns). The dim is 5177 x 22.
  • readr::read_delim(Filename, "\t") will miss the last column. The dim is 6100 x 21.
  • data.table::fread(Filename, sep = "\t") will detect the number of column names is less than the number of columns. Added 1 extra default column name for the first column which is guessed to be row names or an index. The dim is 6100 x 22. (Winner!)

The readr::read_csv() function is as fast as data.table::fread() function. For files beyond 100MB in size fread() and read_csv() can be expected to be around 5 times faster than read.csv(). See 5.3 of Efficient R Programming book.

Note that data.table::fread() can read a selection of the columns.

Colors

library(plotrix)
sapply(rainbow(4), color.id)
sapply(RColorBrewer::brewer.pal(4, "Set1"), color.id)

Below is an example using the option scale_fill_brewer(palette = "Paired"). See the source code at gist. Note that only 'set1' and 'set3' palettes in qualitative scheme can support up to 12 classes.

According to the information from the colorbrew website, qualitative schemes do not imply magnitude differences between legend classes, and hues are used to create the primary visual differences between classes.

File:GgplotPalette.svg

ggplot2

See ggplot2

Data Manipulation & Tidyverse

See Tidyverse.

Data Science

See Data science page

microbenchmark & rbenchmark

jpeg

If we want to create the image on this wiki left hand side panel, we can use the jpeg package to read an existing plot and then edit and save it.

We can also use the jpeg package to import and manipulate a jpg image. See Fun with Heatmaps and Plotly.

Cairo

See White strips problem in png() or tiff().

cairoDevice

PS. Not sure the advantage of functions in this package compared to R's functions (eg. Cairo_svg() vs svg()).

For ubuntu OS, we need to install 2 libraries and 1 R package RGtk2.

sudo apt-get install libgtk2.0-dev libcairo2-dev

On Windows OS, we may got the error: unable to load shared object 'C:/Program Files/R/R-3.0.2/library/cairoDevice/libs/x64/cairoDevice.dll' . We need to follow the instruction in here.

igraph

creating directed networks with igraph

Identifying dependencies of R functions and scripts

https://stackoverflow.com/questions/8761857/identifying-dependencies-of-r-functions-and-scripts

library(mvbutils)
foodweb(where = "package:batr")

foodweb( find.funs("package:batr"), prune="survRiskPredict", lwd=2)

foodweb( find.funs("package:batr"), prune="classPredict", lwd=2)

iterators

Iterator is useful over for-loop if the data is already a collection. It can be used to iterate over a vector, data frame, matrix, file

Iterator can be combined to use with foreach package http://www.exegetic.biz/blog/2013/11/iterators-in-r/ has more elaboration.

Colors

colortools

Tools that allow users generate color schemes and palettes

colourpicker

A Colour Picker Tool for Shiny and for Selecting Colours in Plots

inlmisc

GetTolColors(). Lots of examples.

rex

Friendly Regular Expressions

formatR

The best strategy to avoid failure is to put comments in complete lines or after complete R expressions.

See also this discussion on stackoverflow talks about R code reformatting.

library(formatR)
tidy_source("Input.R", file = "output.R", width.cutoff=70)
tidy_source("clipboard") 
# default width is getOption("width") which is 127 in my case.

Some issues

  • Comments appearing at the beginning of a line within a long complete statement. This will break tidy_source().
cat("abcd",
    # This is my comment
    "defg")

will result in

> tidy_source("clipboard")
Error in base::parse(text = code, srcfile = NULL) : 
  3:1: unexpected string constant
2: invisible(".BeGiN_TiDy_IdEnTiFiEr_HaHaHa# This is my comment.HaHaHa_EnD_TiDy_IdEnTiFiEr")
3: "defg"
   ^
  • Comments appearing at the end of a line within a long complete statement won't break tidy_source() but tidy_source() cannot re-locate/tidy the comma sign.
cat("abcd"
    ,"defg"   # This is my comment
  ,"ghij")

will become

cat("abcd", "defg"  # This is my comment
, "ghij") 

Still bad!!

  • Comments appearing at the end of a line within a long complete statement breaks tidy_source() function. For example,
cat("

", "
", ifelse(codeSurv == 0,"

<a name='Genes'>Genes which are differentially expressed among classes:</a>

", #4/9/09 "

<a name='Genes'>Genes significantly associated with survival:</a>

"), file=ExternalFileName, sep="\n", append=T)

will result in

> tidy_source("clipboard", width.cutoff=70)
Error in base::parse(text = code, srcfile = NULL) : 
  3:129: unexpected SPECIAL
2: "
" , 3: ifelse ( codeSurv == 0 , "

<a name='Genes'>Genes which are differentially expressed among classes:</a>

" , %InLiNe_IdEnTiFiEr%
  • width.cutoff parameter is not always working. For example, there is no any change for the following snippet though I hope it will move the cat() to the next line.
if (codePF & !GlobalTest & !DoExactPermTest) cat(paste("Multivariate Permutations test was computed based on", 
    NumPermutations, "random permutations"), "<BR>", " ", file = ExternalFileName, 
    sep = "\n", append = T)
  • It merges lines though I don't always want to do that. For example
cat("abcd"
    ,"defg"  
  ,"ghij")

will become

cat("abcd", "defg", "ghij") 

Download papers

biorxivr

Search and Download Papers from the bioRxiv Preprint Server

aRxiv

Interface to the arXiv API

pdftools

aside: set it aside

An RStudio addin to run long R commands aside your current session.

Teaching

  • smovie: Some Movies to Illustrate Concepts in Statistics

Organize R research project

How to save (and load) datasets in R (.Rdata vs .Rds file)

How to save (and load) datasets in R: An overview

Text to speech

Text-to-Speech with the googleLanguageR package

Weather data

logR

https://github.com/jangorecki/logR

Progress bar

https://github.com/r-lib/progress#readme

Configurable Progress bars, they may include percentage, elapsed time, and/or the estimated completion time. They work in terminals, in 'Emacs' 'ESS', 'RStudio', 'Windows' 'Rgui' and the 'macOS'.

cron

cronR

Different ways of using R

Extending R by John M. Chambers (2016)

10 things R can do that might surprise you

https://simplystatistics.org/2019/03/13/10-things-r-can-do-that-might-surprise-you/

R call C/C++

Mainly talks about .C() and .Call().

Note that scalars and arrays must be passed using pointers. So if we want to access a function not exported from a package, we may need to modify the function to make the arguments as pointers.

NAMESPACE file & useDynLib

(From Writing R Extensions manual) Loading is most often done automatically based on the useDynLib() declaration in the NAMESPACE file, but may be done explicitly via a call to library.dynam(). This has the form

library.dynam("libname", package, lib.loc) 

library.dynam.unload()

gcc

Coping with varying `gcc` versions and capabilities in R packages

SEXP

Some examples from packages

  • sva package has one C code function

R call Fortran

Embedding R

An very simple example (do not return from shell) from Writing R Extensions manual

The command-line R front-end, R_HOME/bin/exec/R, is one such example. Its source code is in file <src/main/Rmain.c>.

This example can be run by

R_HOME/bin/R CMD R_HOME/bin/exec/R

Note:

  1. R_HOME/bin/exec/R is the R binary. However, it couldn't be launched directly unless R_HOME and LD_LIBRARY_PATH are set up. Again, this is explained in Writing R Extension manual.
  2. R_HOME/bin/R is a shell-script front-end where users can invoke it. It sets up the environment for the executable. It can be copied to /usr/local/bin/R. When we run R_HOME/bin/R, it actually runs R_HOME/bin/R CMD R_HOME/bin/exec/R (see line 259 of R_HOME/bin/R as in R 3.0.2) so we know the important role of R_HOME/bin/exec/R.

More examples of embedding can be found in tests/Embedding directory. Read <index.html> for more information about these test examples.

An example from Bioconductor workshop

Example: Create <embed.c> file

#include <Rembedded.h>
#include <Rdefines.h>

static void doSplinesExample();
int
main(int argc, char *argv[])
{
    Rf_initEmbeddedR(argc, argv);
    doSplinesExample();
    Rf_endEmbeddedR(0);
    return 0;
}
static void
doSplinesExample()
{
    SEXP e, result;
    int errorOccurred;

    // create and evaluate 'library(splines)'
    PROTECT(e = lang2(install("library"), mkString("splines")));
    R_tryEval(e, R_GlobalEnv, &errorOccurred);
    if (errorOccurred) {
        // handle error
    }
    UNPROTECT(1);

    // 'options(FALSE)' ...
    PROTECT(e = lang2(install("options"), ScalarLogical(0)));
    // ... modified to 'options(example.ask=FALSE)' (this is obscure)
    SET_TAG(CDR(e), install("example.ask"));
    R_tryEval(e, R_GlobalEnv, NULL);
    UNPROTECT(1);

    // 'example("ns")'
    PROTECT(e = lang2(install("example"), mkString("ns")));
    R_tryEval(e, R_GlobalEnv, &errorOccurred);
    UNPROTECT(1);
}

Then build the executable. Note that I don't need to create R_HOME variable.

cd 
tar xzvf 
cd R-3.0.1
./configure --enable-R-shlib
make
cd tests/Embedding
make
~/R-3.0.1/bin/R CMD ./Rtest

nano embed.c
# Using a single line will give an error and cannot not show the real problem.
# ../../bin/R CMD gcc -I../../include -L../../lib -lR embed.c
# A better way is to run compile and link separately
gcc -I../../include -c embed.c
gcc -o embed embed.o -L../../lib -lR -lRblas
../../bin/R CMD ./embed

Note that if we want to call the executable file ./embed directly, we shall set up R environment by specifying R_HOME variable and including the directories used in linking R in LD_LIBRARY_PATH. This is based on the inform provided by Writing R Extensions.

export R_HOME=/home/brb/Downloads/R-3.0.2
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/brb/Downloads/R-3.0.2/lib
./embed # No need to include R CMD in front.

Question: Create a data frame in C? Answer: Use data.frame() via an eval() call from C. Or see the code is stats/src/model.c, as part of model.frame.default. Or using Rcpp as here.

Reference http://bioconductor.org/help/course-materials/2012/Seattle-Oct-2012/AdvancedR.pdf

Create a Simple Socket Server in R

This example is coming from this paper.

Create an R function

simpleServer <- function(port=6543)
{
  sock <- socketConnection ( port=port , server=TRUE)
  on.exit(close( sock ))
  cat("\nWelcome to R!\nR>" ,file=sock )
  while(( line <- readLines ( sock , n=1)) != "quit")
  {
    cat(paste("socket >" , line , "\n"))
    out<- capture.output (try(eval(parse(text=line ))))
    writeLines ( out , con=sock )
    cat("\nR> " ,file =sock )
  }
}

Then run simpleServer(). Open another terminal and try to communicate with the server

$ telnet localhost 6543
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

Welcome to R!
R> summary(iris[, 3:5])
  Petal.Length    Petal.Width          Species  
 Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :4.350   Median :1.300   virginica :50  
 Mean   :3.758   Mean   :1.199                  
 3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :6.900   Max.   :2.500                  

R> quit
Connection closed by foreign host.

Rserve

Note the way of launching Rserve is like the way we launch C program when R was embedded in C. See Example from Bioconductor workshop.

See my Rserve page.

(Commercial) StatconnDcom

R.NET

rJava

Terminal

# jdk 7
sudo apt-get install openjdk-7-*
update-alternatives --config java
# oracle jdk 8
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
sudo apt-get -y install openjdk-8-jdk

and then run the following (thanks to http://stackoverflow.com/questions/12872699/error-unable-to-load-installed-packages-just-now) to fix an error: libjvm.so: cannot open shared object file: No such file or directory.

  • Create the file /etc/ld.so.conf.d/java.conf with the following entries:
/usr/lib/jvm/java-8-oracle/jre/lib/amd64
/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server
  • And then run sudo ldconfig

Now go back to R

install.packages("rJava")

Done!

If above does not work, a simple way is by (under Ubuntu) running

sudo apt-get install r-cran-rjava

which will create new package 'default-jre' (under /usr/lib/jvm) and 'default-jre-headless'.

RCaller

RApache

littler

Provides hash-bang (#!) capability for R

FAQs:

[email protected]:/# ls -l /usr/bin/{r,R*}
# R 3.5.2 docker container
-rwxr-xr-x 1 root root 82632 Jan 26 18:26 /usr/bin/r        # binary, can be used for 'shebang' lines, r --help
                                              # Example: r --verbose -e "date()"

-rwxr-xr-x 1 root root  8722 Dec 20 11:35 /usr/bin/R        # text, R --help
                                              # Example: R -q -e "date()"

-rwxr-xr-x 1 root root 14552 Dec 20 11:35 /usr/bin/Rscript  # binary, can be used for 'shebang' lines, Rscript --help
                                              # It won't show the startup message when it is used in the command line.
                                              # Example: Rscript -e "date()"

We can install littler using two ways.

  • install.packages("littler"). This will install the latest version but the binary 'r' program is only available under the package/bin directory (eg ~/R/x86_64-pc-linux-gnu-library/3.4/littler/bin/r). You need to create a soft link in order to access it globally.
  • sudo apt install littler. This will install 'r' globally; however, the installed version may be old.

After the installation, vignette contains several examples. The off-line vignette has a table of contents. Nice! The web version of examples does not have the TOC.

r was not meant to run interactively like R. See man r.

RInside: Embed R in C++

See RInside

(From RInside documentation) The RInside package makes it easier to embed R in your C++ applications. There is no code you would execute directly from the R environment. Rather, you write C++ programs that embed R which is illustrated by some the included examples.

The included examples are armadillo, eigen, mpi, qt, standard, threads and wt.

To run 'make' when we don't have a global R, we should modify the file <Makefile>. Also if we just want to create one executable file, we can do, for example, 'make rinside_sample1'.

To run any executable program, we need to specify LD_LIBRARY_PATH variable, something like

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/brb/Downloads/R-3.0.2/lib 

The real build process looks like (check <Makefile> for completeness)

g++ -I/home/brb/Downloads/R-3.0.2/include \
    -I/home/brb/Downloads/R-3.0.2/library/Rcpp/include \
    -I/home/brb/Downloads/R-3.0.2/library/RInside/include -g -O2 -Wall \
    -I/usr/local/include   \
    rinside_sample0.cpp  \
    -L/home/brb/Downloads/R-3.0.2/lib -lR  -lRblas -lRlapack \
    -L/home/brb/Downloads/R-3.0.2/library/Rcpp/lib -lRcpp \
    -Wl,-rpath,/home/brb/Downloads/R-3.0.2/library/Rcpp/lib \
    -L/home/brb/Downloads/R-3.0.2/library/RInside/lib -lRInside \
    -Wl,-rpath,/home/brb/Downloads/R-3.0.2/library/RInside/lib \
    -o rinside_sample0

Hello World example of embedding R in C++.

#include <RInside.h>                    // for the embedded R via RInside

int main(int argc, char *argv[]) {

    RInside R(argc, argv);              // create an embedded R instance 

    R["txt"] = "Hello, world!\n";	// assign a char* (string) to 'txt'

    R.parseEvalQ("cat(txt)");           // eval the init string, ignoring any returns

    exit(0);
}

The above can be compared to the Hello world example in Qt.

#include <QApplication.h>
#include <QPushButton.h>

int main( int argc, char **argv )
{
    QApplication app( argc, argv );

    QPushButton hello( "Hello world!", 0 );
    hello.resize( 100, 30 );

    app.setMainWidget( &hello );
    hello.show();

    return app.exec();
}

RFortran

RFortran is an open source project with the following aim:

To provide an easy to use Fortran software library that enables Fortran programs to transfer data and commands to and from R.

It works only on Windows platform with Microsoft Visual Studio installed:(

Call R from other languages

C

Using R from C/C++

Error: “not resolved from current namespace” error, when calling C routines from R

Solution: add getNativeSymbolInfo() around your C/Fortran symbols. Search Google:r dyn.load not resolved from current namespace

JRI

http://www.rforge.net/JRI/

ryp2

http://rpy.sourceforge.net/rpy2.html

Create a standalone Rmath library

R has many math and statistical functions. We can easily use these functions in our C/C++/Fortran. The definite guide of doing this is on Chapter 9 "The standalone Rmath library" of R-admin manual.

Here is my experience based on R 3.0.2 on Windows OS.

Create a static library <libRmath.a> and a dynamic library <Rmath.dll>

Suppose we have downloaded R source code and build R from its source. See Build_R_from_its_source. Then the following 2 lines will generate files <libRmath.a> and <Rmath.dll> under C:\R\R-3.0.2\src\nmath\standalone directory.

cd C:\R\R-3.0.2\src\nmath\standalone
make -f Makefile.win

Use Rmath library in our code

set CPLUS_INCLUDE_PATH=C:\R\R-3.0.2\src\include
set LIBRARY_PATH=C:\R\R-3.0.2\src\nmath\standalone
# It is not LD_LIBRARY_PATH in above.

# Created <RmathEx1.cpp> from the book "Statistical Computing in C++ and R" web site
# http://math.la.asu.edu/~eubank/CandR/ch4Code.cpp
# It is OK to save the cpp file under any directory.

# Force to link against the static library <libRmath.a>
g++ RmathEx1.cpp -lRmath -lm -o RmathEx1.exe
# OR
g++ RmathEx1.cpp -Wl,-Bstatic -lRmath -lm -o RmathEx1.exe

# Force to link against dynamic library <Rmath.dll>
g++ RmathEx1.cpp Rmath.dll -lm -o RmathEx1Dll.exe

Test the executable program. Note that the executable program RmathEx1.exe can be transferred to and run in another computer without R installed. Isn't it cool!

c:\R>RmathEx1
Enter a argument for the normal cdf:
1
Enter a argument for the chi-squared cdf:
1
Prob(Z <= 1) = 0.841345
Prob(Chi^2 <= 1)= 0.682689

Below is the cpp program <RmathEx1.cpp>.

//RmathEx1.cpp
#define MATHLIB_STANDALONE 
#include <iostream>
#include "Rmath.h"

using std::cout; using std::cin; using std::endl;

int main()
{
  double x1, x2;
  cout << "Enter a argument for the normal cdf:" << endl;
  cin >> x1;
  cout << "Enter a argument for the chi-squared cdf:" << endl;
  cin >> x2;

  cout << "Prob(Z <= " << x1 << ") = " << 
    pnorm(x1, 0, 1, 1, 0)  << endl;
  cout << "Prob(Chi^2 <= " << x2 << ")= " << 
    pchisq(x2, 1, 1, 0) << endl;
  return 0;
}

Calling R.dll directly

See Chapter 8.2.2 of R Extensions. This is related to embedding R under Windows. The file <R.dll> on Windows is like <libR.so> on Linux.

Create HTML report

ReportingTools (Jason Hackney) from Bioconductor.

htmlTable package

The htmlTable package is intended for generating tables using HTML formatting. This format is compatible with Markdown when used for HTML-output. The most basic table can easily be created by just passing a matrix or a data.frame to the htmlTable-function.

formattable

htmltab package

This package is NOT used to CREATE html report but EXTRACT html table.

ztable package

Makes zebra-striped tables (tables with alternating row colors) in LaTeX and HTML formats easily from a data.frame, matrix, lm, aov, anova, glm or coxph objects.

Create academic report

reports package in CRAN and in github repository. The youtube video gives an overview of the package.

Create pdf and epub files

# Idea:
#        knitr        pdflatex
#   rnw -------> tex ----------> pdf
library(knitr)
knit("example.rnw") # create example.tex file
  • A very simple example <002-minimal.Rnw> from yihui.name works fine on linux.
git clone https://github.com/yihui/knitr-examples.git
  • <knitr-minimal.Rnw>. I have no problem to create pdf file on Windows but still cannot generate pdf on Linux from tex file. Some people suggested to run sudo apt-get install texlive-fonts-recommended to install missing fonts. It works!

To see a real example, check out DESeq2 package (inst/doc subdirectory). In addition to DESeq2, I also need to install DESeq, BiocStyle, airway, vsn, gplots, and pasilla packages from Bioconductor. Note that, it is best to use sudo/admin account to install packages.

Or starts with markdown file. Download the example <001-minimal.Rmd> and remove the last line of getting png file from internet.

# Idea:
#        knitr        pandoc
#   rmd -------> md ----------> pdf

git clone https://github.com/yihui/knitr-examples.git
cd knitr-examples
R -e "library(knitr); knit('001-minimal.Rmd')"
pandoc 001-minimal.md -o 001-minimal.pdf # require pdflatex to be installed !!

To create an epub file (not success yet on Windows OS, missing figures on Linux OS)

# Idea:
#        knitr        pandoc
#   rnw -------> tex ----------> markdown or epub

library(knitr)
knit("DESeq2.Rnw") # create DESeq2.tex
system("pandoc  -f latex -t markdown -o DESeq2.md DESeq2.tex")
## Windows OS, epub cannot be built
pandoc:
Error:
"source" (line 41, column 7):
unexpected "k"
expecting "{document}"

## Linux OS, epub missing figures and R codes.
## First install texlive base and extra packages
## sudo apt-get install texlive-latex-base texlive-latex-extra
pandoc: Could not find media `figure/SchwederSpjotvoll-1', skipping...
pandoc: Could not find media `figure/sortedP-1', skipping...
pandoc: Could not find media `figure/figHeatmap2c-1', skipping...
pandoc: Could not find media `figure/figHeatmap2b-1', skipping...
pandoc: Could not find media `figure/figHeatmap2a-1', skipping...
pandoc: Could not find media `figure/plotCountsAdv-1', skipping...
pandoc: Could not find media `figure/plotCounts-1', skipping...
pandoc: Could not find media `figure/MA-1', skipping...
pandoc: Could not find media `figure/MANoPrior-1', skipping...

The problems are at least

  • figures need to be generated under the same directory as the source code
  • figures cannot be in the format of pdf (DESeq2 generates both pdf and png files format)
  • missing R codes

Convert tex to epub

kable() for tables

Create Tables In LaTeX, HTML, Markdown And ReStructuredText

Create Word report

knitr + pandoc

It is better to create rmd file in RStudio. Rstudio provides a template for rmd file and it also provides a quick reference to R markdown language.

# Idea:
#        knitr       pandoc
#   rmd -------> md --------> docx
library(knitr)
knit2html("example.rmd") #Create md and html files

and then

FILE <- "example"
system(paste0("pandoc -o ", FILE, ".docx ", FILE, ".md"))

Note. For example reason, if I play around the above 2 commands for several times, the knit2html() does not work well. However, if I click 'Knit HTML' button on the RStudio, it then works again.

Another way is

library(pander)
name = "demo"
knit(paste0(name, ".Rmd"), encoding = "utf-8")
Pandoc.brew(file = paste0(name, ".md"), output = paste0(-name, "docx"), convert = "docx")

Note that once we have used knitr command to create a md file, we can use pandoc shell command to convert it to different formats:

  • A pdf file: pandoc -s report.md -t latex -o report.pdf
  • A html file: pandoc -s report.md -o report.html (with the -c flag html files can be added easily)
  • Openoffice: pandoc report.md -o report.odt
  • Word docx: pandoc report.md -o report.docx

We can also create the epub file for reading on Kobo ereader. For example, download this file and save it as example.Rmd. I need to remove the line containing the link to http://i.imgur.com/RVNmr.jpg since it creates an error when I run pandoc (not sure if it is the pandoc version I have is too old). Now we just run these 2 lines to get the epub file. Amazing!

knit("example.Rmd")
pandoc("example.md", format="epub")

PS. If we don't remove the link, we will get an error message (pandoc 1.10.1 on Windows 7)

> pandoc("Rmd_to_Epub.md", format="epub")
executing pandoc   -f markdown -t epub -o Rmd_to_Epub.epub "Rmd_to_Epub.utf8md"
pandoc.exe: .\.\http://i.imgur.com/RVNmr.jpg: openBinaryFile: invalid argument (Invalid argument)
Error in (function (input, format, ext, cfg)  : conversion failed
In addition: Warning message:
running command 'pandoc   -f markdown -t epub -o Rmd_to_Epub.epub "Rmd_to_Epub.utf8md"' had status 1

pander

Try pandoc[1] with a minimal reproducible example, you might give a try to my "pander" package [2] too:

library(pander)
Pandoc.brew(system.file('examples/minimal.brew', package='pander'),
            output = tempfile(), convert = 'docx')

Where the content of the "minimal.brew" file is something you might have got used to with Sweave - although it's using "brew" syntax instead. See the examples of pander [3] for more details. Please note that pandoc should be installed first, which is pretty easy on Windows.

  1. http://johnmacfarlane.net/pandoc/
  2. http://rapporter.github.com/pander/
  3. http://rapporter.github.com/pander/#examples

R2wd

Use R2wd package. However, only 32-bit R is allowed and sometimes it can not produce all 'table's.

> library(R2wd)
> wdGet()
Loading required package: rcom
Loading required package: rscproxy
rcom requires a current version of statconnDCOM installed.
To install statconnDCOM type
     installstatconnDCOM()

This will download and install the current version of statconnDCOM

You will need a working Internet connection
because installation needs to download a file.
Error in if (wdapp[["Documents"]][["Count"]] == 0) wdapp[["Documents"]]$Add() : 
  argument is of length zero 

The solution is to launch 32-bit R instead of 64-bit R since statconnDCOM does not support 64-bit R.

Convert from pdf to word

The best rendering of advanced tables is done by converting from pdf to Word. See http://biostat.mc.vanderbilt.edu/wiki/Main/SweaveConvert

rtf

Use rtf package for Rich Text Format (RTF) Output.

xtable

Package xtable will produce html output.

print(xtable(X), type="html")

If you save the file and then open it with Word, you will get serviceable results. I've had better luck copying the output from xtable and pasting it into Excel.

ReporteRs

Microsoft Word, Microsoft Powerpoint and HTML documents generation from R. The source code is hosted on https://github.com/davidgohel/ReporteRs

A quick exploration

PDF manipulation

staplr

R Graphs Gallery

COM client or server

Client

RDCOMClient where excel.link depends on it.

Server

RDCOMServer

Use R under proxy

http://support.rstudio.org/help/kb/faq/configuring-r-to-use-an-http-proxy

RStudio

  • Github
  • Installing RStudio (1.0.44) on Ubuntu will not install Java even the source code contains 37.5% Java??
  • Preview

rstudio.cloud

https://rstudio.cloud/

Launch RStudio

Multiple versions of R

Create .Rproj file

If you have an existing package that doesn't have an .Rproj file, you can use devtools::use_rstudio("path/to/package") to add it.

With an RStudio project file, you can

  • Restore .RData into workspace at startup
  • Save workspace to .RData on exit
  • Always save history (even if no saving .RData)
  • etc

package search

https://github.com/RhoInc/CRANsearcher

Git

Visual Studio

R and Python support now built in to Visual Studio 2017

List files using regular expression

  • Extension
list.files(pattern = "\\.txt$")

where the dot (.) is a metacharacter. It is used to refer to any character.

  • Start with
list.files(pattern = "^Something")

Using Sys.glob()"' as

> Sys.glob("~/Downloads/*.txt")
[1] "/home/brb/Downloads/ip.txt"       "/home/brb/Downloads/valgrind.txt"

Hidden tool: rsync in Rtools

c:\Rtools\bin>rsync -avz "/cygdrive/c/users/limingc/Downloads/a.exe" "/cygdrive/c/users/limingc/Documents/"
sending incremental file list
a.exe

sent 323142 bytes  received 31 bytes  646346.00 bytes/sec
total size is 1198416  speedup is 3.71

c:\Rtools\bin>

And rsync works best when we need to sync folder.

c:\Rtools\bin>rsync -avz "/cygdrive/c/users/limingc/Downloads/binary" "/cygdrive/c/users/limingc/Documents/"
sending incremental file list
binary/
binary/Eula.txt
binary/cherrytree.lnk
binary/depends64.chm
binary/depends64.dll
binary/depends64.exe
binary/mtputty.exe
binary/procexp.chm
binary/procexp.exe
binary/pscp.exe
binary/putty.exe
binary/sqlite3.exe
binary/wget.exe

sent 4115294 bytes  received 244 bytes  1175868.00 bytes/sec
total size is 8036311  speedup is 1.95

c:\Rtools\bin>rm c:\users\limingc\Documents\binary\procexp.exe
cygwin warning:
  MS-DOS style path detected: c:\users\limingc\Documents\binary\procexp.exe
  Preferred POSIX equivalent is: /cygdrive/c/users/limingc/Documents/binary/procexp.exe
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames

c:\Rtools\bin>rsync -avz "/cygdrive/c/users/limingc/Downloads/binary" "/cygdrive/c/users/limingc/Documents/"
sending incremental file list
binary/
binary/procexp.exe

sent 1767277 bytes  received 35 bytes  3534624.00 bytes/sec
total size is 8036311  speedup is 4.55

c:\Rtools\bin>

Unforunately, if the destination is a network drive, I could get a permission denied (13) error. See also http://superuser.com/questions/69620/rsync-file-permissions-on-windows

Install rgdal package (geospatial Data) on ubuntu

Terminal

sudo apt-get install libgdal1-dev libproj-dev # https://stackoverflow.com/a/44389304
sudo apt-get install libgdal1i # Ubuntu 16.04 https://stackoverflow.com/a/12143411

R

install.packages("rgdal")

Install sf package

I got the following error even I have installed some libraries.

checking GDAL version >= 2.0.1... no
configure: error: sf is not compatible with GDAL versions below 2.0.1

Then I follow the instruction here

sudo apt remove libgdal-dev
sudo apt remove libproj-dev
sudo apt remove gdal-bin
sudo add-apt-repository ppa:ubuntugis/ubuntugis-stable

sudo apt update
sudo apt-cache policy libgdal-dev # Make sure a version >= 2.0 appears 
sudo apt install libgdal-dev

Set up Emacs on Windows

Edit the file C:\Program Files\GNU Emacs 23.2\site-lisp\site-start.el with something like

(setq-default inferior-R-program-name
              "c:/program files/r/r-2.15.2/bin/i386/rterm.exe")

Database

RSQLite

Creating a new database:

library(DBI)

mydb <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
dbDisconnect(mydb)
unlink("my-db.sqlite")

# temporary database
mydb <- dbConnect(RSQLite::SQLite(), "")
dbDisconnect(mydb)

Loading data:

mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "mtcars", mtcars)
dbWriteTable(mydb, "iris", iris)

dbListTables(mydb)

dbListFields(con, "mtcars")

dbReadTable(con, "mtcars")

Queries:

dbGetQuery(mydb, 'SELECT * FROM mtcars LIMIT 5')

dbGetQuery(mydb, 'SELECT * FROM iris WHERE "Sepal.Length" < 4.6')

dbGetQuery(mydb, 'SELECT * FROM iris WHERE "Sepal.Length" < :x', params = list(x = 4.6))

res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
dbFetch(res)

Batched queries:

dbClearResult(rs)
rs <- dbSendQuery(mydb, 'SELECT * FROM mtcars')
while (!dbHasCompleted(rs)) {
  df <- dbFetch(rs, n = 10)
  print(nrow(df))
}

dbClearResult(rs)

Multiple parameterised queries:

rs <- dbSendQuery(mydb, 'SELECT * FROM iris WHERE "Sepal.Length" = :x')
dbBind(rs, param = list(x = seq(4, 4.4, by = 0.1)))
nrow(dbFetch(rs))
#> [1] 4
dbClearResult(rs)

Statements:

dbExecute(mydb, 'DELETE FROM iris WHERE "Sepal.Length" < 4')
#> [1] 0
rs <- dbSendStatement(mydb, 'DELETE FROM iris WHERE "Sepal.Length" < :x')
dbBind(rs, param = list(x = 4.5))
dbGetRowsAffected(rs)
#> [1] 4
dbClearResult(rs)

sqldf

Manipulate R data frames using SQL. Depends on RSQLite. A use of gsub, reshape2 and sqldf with healthcare data

RPostgreSQL

RMySQL

MongoDB

odbc

RODBC

DBI

dbplyr

Create a new SQLite database:

surveys <- read.csv("data/surveys.csv")
plots <- read.csv("data/plots.csv")

my_db_file <- "portal-database.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)

copy_to(my_db, surveys)
copy_to(my_db, plots)
my_db

Connect to a database:

download.file(url = "https://ndownloader.figshare.com/files/2292171",
              destfile = "portal_mammals.sqlite", mode = "wb")

library(dbplyr)
library(dplyr)
mammals <- src_sqlite("portal_mammals.sqlite")

Querying the database with the SQL syntax:

tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))

Querying the database with the dplyr syntax:

surveys <- tbl(mammals, "surveys")
surveys %>%
    select(year, species_id, plot_id)
head(surveys, n = 10)

show_query(head(surveys, n = 10)) # show which SQL commands are actually sent to the database

Simple database queries:

surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

Laziness (instruct R to stop being lazy):

data_subset <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight) %>%
  collect()

Complex database queries:

plots <- tbl(mammals, "plots")
plots # # The plot_id column features in the plots table

surveys # The plot_id column also features in the surveys table

# Join databases method 1
plots %>%
  filter(plot_id == 1) %>%
  inner_join(surveys) %>%
  collect()

NoSQL

nodbi: the NoSQL Database Connector

Github

R source

https://github.com/wch/r-source/ Daily update, interesting, should be visited every day. Clicking 1000+ commits to look at daily changes.

If we are interested in a certain branch (say 3.2), look for R-3-2-branch.

R packages (only) source (metacran)

Bioconductor packages source

Announcement, https://github.com/Bioconductor-mirror

Send local repository to Github in R by using reports package

http://www.youtube.com/watch?v=WdOI_-aZV0Y

My collection

How to download

Clone ~ Download.

  • Command line
git clone https://gist.github.com/4484270.git

This will create a subdirectory called '4484270' with all cloned files there.

  • Within R
library(devtools)
source_gist("4484270")

or First download the json file from

https://api.github.com/users/MYUSERLOGIN/gists

and then

library(RJSONIO)
x <- fromJSON("~/Downloads/gists.json")
setwd("~/Downloads/")
gist.id <- lapply(x, "[[", "id")
lapply(gist.id, function(x){
  cmd <- paste0("git clone https://gist.github.com/", x, ".git")
  system(cmd)
})

Jekyll

An Easy Start with Jekyll, for R-Bloggers

Connect R with Arduino

Android App

Common plots tips

Grouped boxplots

Weather Time Line

The plot looks similar to a boxplot though it is not. See a screenshot on Android by Sam Ruston.

Horizontal bar plot

library(ggplot2)
dtf <- data.frame(x = c("ETB", "PMA", "PER", "KON", "TRA", 
                        "DDR", "BUM", "MAT", "HED", "EXP"),
                  y = c(.02, .11, -.01, -.03, -.03, .02, .1, -.01, -.02, 0.06))
ggplot(dtf, aes(x, y)) +
  geom_bar(stat = "identity", aes(fill = x), show.legend = FALSE) + 
  coord_flip() + xlab("") + ylab("Fold Change")   

File:Ggplot2bar.svg

Include bar values in a barplot

Use text().

Or use geom_text() if we are using the ggplot2 package. See an example here or this.

For stacked barplot, see this post.

Grouped barplots

library(ggplot2)

  1. mydata <- data.frame(OUTGRP, INGRP, value)

ggplot(mydata, aes(fill=INGRP, y=value, x=OUTGRP)) +

      geom_bar(position="dodge", stat="identity")
> 1 - 2*(1-pnorm(1))
[1] 0.6826895
> 1 - 2*(1-pnorm(1.96))
[1] 0.9500042

Math expression

# Expressions
plot(x,y, xlab = expression(hat(x)[t]),
     ylab = expression(phi^{rho + a}),
     main = "Pure Expressions")

# Expressions with Spacing
# '~' is to add space and '*' is to squish characters together
plot(1:10, xlab= expression(Delta * 'C'))
plot(x,y, xlab = expression(hat(x)[t] ~ z ~ w),
     ylab = expression(phi^{rho + a} * z * w),
     main = "Pure Expressions with Spacing")

# Expressions with Text
plot(x,y, 
     xlab = expression(paste("Text here ", hat(x), " here ", z^rho, " and here")), 
     ylab = expression(paste("Here is some text of ", phi^{rho})), 
     main = "Expressions with Text")

# Substituting Expressions
plot(x,y, 
     xlab = substitute(paste("Here is ", pi, " = ", p), list(p = py)), 
     ylab = substitute(paste("e is = ", e ), list(e = ee)), 
     main = "Substituted Expressions")

Impose a line to a scatter plot

  • abline + lsfit # least squares
plot(cars)
abline(lsfit(cars[, 1], cars[, 2]))
# OR
abline(lm(cars[,2] ~ cars[,1]))
  • abline + line # robust line fitting
plot(cars)
(z <- line(cars))
abline(coef(z), col = 'green')
  • lines
plot(cars)
fit <- lm(cars[,2] ~ cars[,1])
lines(cars[,1], fitted(fit), col="blue")
lines(stats::lowess(cars), col='red')

Rotating x axis labels for barplot

https://stackoverflow.com/questions/10286473/rotating-x-axis-labels-in-r-for-barplot

barplot(mytable,main="Car makes",ylab="Freqency",xlab="make",las=2)

Set R plots x axis to show at y=0

https://stackoverflow.com/questions/3422203/set-r-plots-x-axis-to-show-at-y-0

plot(1:10, rnorm(10), ylim=c(0,10), yaxs="i")

Different colors of axis labels in barplot

See Vary colors of axis labels in R based on another variable

Method 1: Append labels for the 2nd, 3rd, ... color gradually because 'col.axis' argument cannot accept more than one color.

tN <- table(Ni <- stats::rpois(100, lambda = 5))
r <- barplot(tN, col = rainbow(20))
axis(1, 1, LETTERS[1], col.axis="red", col="red")
axis(1, 2, LETTERS[2], col.axis="blue", col = "blue")

Method 2: text() which can accept multiple colors in 'col' parameter but we need to find out the (x, y) by ourselves.

barplot(tN, col = rainbow(20), axisnames = F)
text(4:6, par("usr")[3]-2 , LETTERS[4:6], col=c("black","red","blue"), xpd=TRUE)

Use text() to draw labels on X/Y-axis including rotation

par(mar = c(5, 6, 4, 5) + 0.1)
plot(..., xaxt = "n") # "n" suppresses plotting of the axis; need mtext() and axis() to supplement
text(x = barCenters, y = par("usr")[3] - 1, srt = 45,
     adj = 1, labels = myData$names, xpd = TRUE)

Vertically stacked plots with the same x axis

https://stackoverflow.com/questions/11794436/stacking-multiple-plots-vertically-with-the-same-x-axis-but-different-y-axes-in

Increase/decrease legend font size

https://stackoverflow.com/a/36842578

plot(rnorm(100))
op <- par(cex=2)
legend("topleft", legend = 1:4, col=1:4, pch=1)
par(op)

Superimpose a density plot or any curves

Use lines().

Example 1

plot(cars, main = "Stopping Distance versus Speed")
lines(stats::lowess(cars))

Example 2

require(survival)
n = 10000
beta1 = 2; beta2 = -1
lambdaT = 1 # baseline hazard
lambdaC = 2  # hazard of censoring
set.seed(1234)
x1 = rnorm(n,0)
x2 = rnorm(n,0)
# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1-beta2*x2)) 
C <- rweibull(n, shape=1, scale=lambdaC)   
time = pmin(T,C)  
status <- 1*(T <= C) 
status2 <- 1-status
plot(survfit(Surv(time, status2) ~ 1), 
     ylab="Survival probability",
     main = 'Exponential censoring time')
xseq <- seq(.1, max(time), length =100)
func <- function(x) 1-pweibull(x, shape = 1, scale = lambdaC)
lines(xseq, func(xseq), col = 'red') # survival function of Weibull

Custom scales

Using custom scales with the 'scales' package

Time series

Time series stock price plot

library(quantmod)
getSymbols("AAPL")
getSymbols("IBM") # similar to AAPL
getSymbols("CSCO") # much smaller than AAPL, IBM
getSymbols("DJI") # Dow Jones, huge 
chart_Series(Cl(AAPL), TA="add_TA(Cl(IBM), col='blue', on=1); add_TA(Cl(CSCO), col = 'green', on=1)", 
    col='orange', subset = '2017::2017-08')

tail(Cl(DJI))

Timeline plot

https://stackoverflow.com/questions/20695311/chronological-timeline-with-points-in-time-and-format-date

Circular plot

Word cloud

World map

Visualising SSH attacks with R (rworldmap and rgeolocate packages)

Diagram/flowchart/Directed acyclic diagrams (DAGs)

DiagrammeR

diagram

Functions for Visualising Simple Graphs (Networks), Plotting Flow Diagrams

DAGitty (browser-based and R package)

dagR

Venn Diagram

# systemPipeR package method
library(systemPipeR)
setlist <- list(A=sample(letters, 18), B=sample(letters, 16), C=sample(letters, 20), D=sample(letters, 22), E=sample(letters, 18)) 
OLlist <- overLapper(setlist[1:3], type="vennsets")
vennPlot(list(OLlist))                             

# R script source method
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") 
setlist <- list(A=sample(letters, 18), B=sample(letters, 16), C=sample(letters, 20), D=sample(letters, 22), E=sample(letters, 18)) 
# or (obtained by dput(setlist))
setlist <- structure(list(A = c("o", "h", "u", "p", "i", "s", "a", "w", 
"b", "z", "n", "c", "k", "j", "y", "m", "t", "q"), B = c("h", 
"r", "x", "y", "b", "t", "d", "o", "m", "q", "g", "v", "c", "u", 
"f", "z"), C = c("b", "e", "t", "u", "s", "j", "o", "k", "d", 
"l", "g", "i", "w", "n", "p", "a", "y", "x", "m", "z"), D = c("f", 
"g", "b", "k", "j", "m", "e", "q", "i", "d", "o", "l", "c", "t", 
"x", "r", "s", "u", "w", "a", "z", "n"), E = c("u", "w", "o", 
"k", "n", "h", "p", "z", "l", "m", "r", "d", "q", "s", "x", "b", 
"v", "t"), F = c("o", "j", "r", "c", "l", "l", "u", "b", "f", 
"d", "u", "m", "y", "t", "y", "s", "a", "g", "t", "m", "x", "m"
)), .Names = c("A", "B", "C", "D", "E", "F"))

OLlist <- overLapper(setlist[1:3], type="vennsets")
counts <- list(sapply(OLlist$Venn_List, length))  
vennPlot(counts=counts)                           

File:Vennplot.png

Bump chart/Metro map

https://dominikkoch.github.io/Bump-Chart/

Amazing plots

See Amazing plot.

Google Analytics

GAR package

http://www.analyticsforfun.com/2015/10/query-your-google-analytics-data-with.html

Linear Programming

http://www.r-bloggers.com/modeling-and-solving-linear-programming-with-r-free-book/

Linear Algebra

Amazon Alexa

R and Singularity

https://www.rstudio.com/rviews/2017/03/29/r-and-singularity/

Teach kids about R with Minecraft

http://blog.revolutionanalytics.com/2017/06/teach-kids-about-r-with-minecraft.html

Secure API keys

Securely store API keys in R scripts with the "secret" package

Hide a password: keyring package

Vision and image recognition

Turn pictures into coloring pages

https://gist.github.com/jeroen/53a5f721cf81de2acba82ea47d0b19d0

Numerical optimization

R packages

R packages

Tricks

Getting help

Better Coder/coding, best practices

E-notation

6.022E23 (or 6.022e23) is equivalent to 6.022×10^23

Getting user's home directory

See What are HOME and working directories?

# Windows
normalizePath("~")   # "C:\\Users\\brb\\Documents"
Sys.getenv("R_USER") # "C:/Users/brb/Documents"
Sys.getenv("HOME")   # "C:/Users/brb/Documents"

# Mac
normalizePath("~")   # [1] "/Users/brb"
Sys.getenv("R_USER") # [1] ""
Sys.getenv("HOME")   # "/Users/brb"

# Linux
normalizePath("~")   # [1] "/home/brb"
Sys.getenv("R_USER") # [1] ""
Sys.getenv("HOME")   # [1] "/home/brb"

Rprofile.site, Renviron.site (all platforms) and Rconsole (Windows only)

If we like to install R packages to a personal directory, follow this. Just add the line

R_LIBS_SITE=F:/R/library

to the file R_HOME/etc/x64/Renviron.site.

Note that on Windows OS, R/etc contains

$ ls -l /c/Progra~1/r/r-3.2.0/etc
total 142
-rw-r--r--    1   Administ     1043 Jun 20  2013 Rcmd_environ
-rw-r--r--    1   Administ     1924 Mar 17  2010 Rconsole
-rw-r--r--    1   Administ      943 Oct  3  2011 Rdevga
-rw-r--r--    1   Administ      589 May 20  2013 Rprofile.site
-rw-r--r--    1   Administ   251894 Jan 17  2015 curl-ca-bundle.crt
drwxr-xr-x    1   Administ        0 Jun  8 10:30 i386
-rw-r--r--    1   Administ     1160 Dec 31  2014 repositories
-rw-r--r--    1   Administ    30188 Mar 17  2010 rgb.txt
drwxr-xr-x    3   Administ        0 Jun  8 10:30 x64

$ ls /c/Progra~1/r/r-3.2.0/etc/i386
Makeconf

$ cat /c/Progra~1/r/r-3.2.0/etc/Rconsole
# Optional parameters for the console and the pager
# The system-wide copy is in R_HOME/etc.
# A user copy can be installed in `R_USER'.

## Style
# This can be `yes' (for MDI) or `no' (for SDI).
  MDI = yes
# MDI = no

# the next two are only relevant for MDI
toolbar = yes
statusbar = no

## Font.
# Please use only fixed width font.
# If font=FixedFont the system fixed font is used; in this case
# points and style are ignored. If font begins with "TT ", only
# True Type fonts are searched for.
font = TT Courier New
points = 10
style = normal # Style can be normal, bold, italic

# Dimensions (in characters) of the console.
rows = 25
columns = 80
# Dimensions (in characters) of the internal pager.
pgrows = 25
pgcolumns = 80
# should options(width=) be set to the console width?
setwidthonresize = yes

# memory limits for the console scrolling buffer, in chars and lines
# NB: bufbytes is in bytes for R < 2.7.0, chars thereafter.
bufbytes = 250000
buflines = 8000

# Initial position of the console (pixels, relative to the workspace for MDI)
# xconsole = 0
# yconsole = 0

# Dimension of MDI frame in pixels
# Format (w*h+xorg+yorg) or use -ve w and h for offsets from right bottom
# This will come up maximized if w==0
# MDIsize = 0*0+0+0
# MDIsize = 1000*800+100+0
# MDIsize = -50*-50+50+50  # 50 pixels space all round

# The internal pager can displays help in a single window
# or in multiple windows (one for each topic)
# pagerstyle can be set to `singlewindow' or `multiplewindows'
pagerstyle = multiplewindows

## Colours for console and pager(s)
# (see rwxxxx/etc/rgb.txt for the known colours).
background = White
normaltext = NavyBlue
usertext = Red
highlight = DarkRed

## Initial position of the graphics window
## (pixels, <0 values from opposite edge)
xgraphics = -25
ygraphics = 0

## Language for messages
language =

## Default setting for console buffering: 'yes' or 'no'
buffered = yes

and on Linux

[email protected]:~$ whereis R
R: /usr/bin/R /etc/R /usr/lib/R /usr/bin/X11/R /usr/local/lib/R /usr/share/R /usr/share/man/man1/R.1.gz

[email protected]:~$ ls /usr/lib/R
bin  COPYING  etc  lib  library  modules  site-library  SVN-REVISION

[email protected]:~$ ls /usr/lib/R/etc
javaconf  ldpaths  Makeconf  Renviron  Renviron.orig  Renviron.site  Renviron.ucf  repositories  Rprofile.site

[email protected]:~$ ls /usr/local/lib/R
site-library

and

[email protected]:~$ cat /usr/lib/R/etc/Rprofile.site
##                                              Emacs please make this -*- R -*-
## empty Rprofile.site for R on Debian
##
## Copyright (C) 2008 Dirk Eddelbuettel and GPL'ed
##
## see help(Startup) for documentation on ~/.Rprofile and Rprofile.site

# ## Example of .Rprofile
# options(width=65, digits=5)
# options(show.signif.stars=FALSE)
# setHook(packageEvent("grDevices", "onLoad"),
#         function(...) grDevices::ps.options(horizontal=FALSE))
# set.seed(1234)
# .First <- function() cat("\n   Welcome to R!\n\n")
# .Last <- function()  cat("\n   Goodbye!\n\n")

# ## Example of Rprofile.site
# local({
#  # add MASS to the default packages, set a CRAN mirror
#  old <- getOption("defaultPackages"); r <- getOption("repos")
#  r["CRAN"] <- "http://my.local.cran"
#  options(defaultPackages = c(old, "MASS"), repos = r)
#})
[email protected]:~$ cat /usr/lib/R/etc/Renviron.site
##                                              Emacs please make this -*- R -*-
## empty Renviron.site for R on Debian
##
## Copyright (C) 2008 Dirk Eddelbuettel and GPL'ed
##
## see help(Startup) for documentation on ~/.Renviron and Renviron.site

# ## Example ~/.Renviron on Unix
# R_LIBS=~/R/library
# PAGER=/usr/local/bin/less

# ## Example .Renviron on Windows
# R_LIBS=C:/R/library
# MY_TCLTK="c:/Program Files/Tcl/bin"

# ## Example of setting R_DEFAULT_PACKAGES (from R CMD check)
# R_DEFAULT_PACKAGES='utils,grDevices,graphics,stats'
# # this loads the packages in the order given, so they appear on
# # the search path in reverse order.
[email protected]:~$

What is the best place to save Rconsole on Windows platform

Put/create the file <Rconsole> under C:/Users/USERNAME/Documents folder so no matter how R was upgraded/downgraded, it always find my preference.

My preferred settings:

  • Font: Consolas (it will be shown as "TT Consolas" in Rconsole)
  • Size: 12
  • background: black
  • normaltext: white
  • usertext: GreenYellow or orange (close to RStudio's Cobalt theme) or sienna1 or SpringGreen or tan1 or yellow

and others (default options)

  • pagebg: white
  • pagetext: navy
  • highlight: DarkRed
  • dataeditbg: white
  • dataedittext: navy (View() function)
  • dataedituser: red
  • editorbg: white (edit() function)
  • editortext: black

A copy of the Rconsole is saved in github.

How R starts up

https://rstats.wtf/r-startup.html

Saving and loading history automatically: .Rprofile & local()

options(continue="  ") # default is "+ "
options(editor="nano") # default is "vi" on Linux
# options(htmlhelp=TRUE) 

local((r <- getOption("repos")
  r["CRAN"] <- "http://cran.rstudio.com"
  options(repos = r)))

.First <- function(){
 # library(Hmisc)
 cat("\nWelcome at", date(), "\n")
}

.Last <- function(){
 cat("\nGoodbye at ", date(), "\n")
}  
  • https://stackoverflow.com/questions/16734937/saving-and-loading-history-automatically
  • The history file will always be read from the $HOME directory and the history file will be overwritten by a new session. These two problems can be solved if we define R_HISTFILE system variable.
  • local() function can be used in .Rprofile file to set up the environment even no new variables will be created (change repository, install packages, load libraries, source R files, run system() function, file/directory I/O, etc)

Linux or Mac

In ~/.profile or ~/.bashrc I put:

export R_HISTFILE=~/.Rhistory

In ~/.Rprofile I put:

if (interactive()) {
  if (.Platform$OS.type == "unix")  .First <- function() try(utils::loadhistory("~/.Rhistory")) 
  .Last <- function() try(savehistory(file.path(Sys.getenv("HOME"), ".Rhistory")))
}

Windows

If you launch R by clicking its icon from Windows Desktop, the R starts in C:\User\$USER\Documents directory. So we can create a new file .Rprofile in this directory.

if (interactive()) {
  .Last <- function() try(savehistory(file.path(Sys.getenv("HOME"), ".Rhistory")))
}

R release versions

rversions: Query the main 'R' 'SVN' repository to find the released versions & dates.

Detect number of running R instances in Windows

C:\Program Files\R>tasklist /FI "IMAGENAME eq Rscript.exe"
INFO: No tasks are running which match the specified criteria.

C:\Program Files\R>tasklist /FI "IMAGENAME eq Rgui.exe"

Image Name                     PID Session Name        Session#    Mem Usage
============================================================================
Rgui.exe                      1096 Console                    1     44,712 K

C:\Program Files\R>tasklist /FI "IMAGENAME eq Rserve.exe"

Image Name                     PID Session Name        Session#    Mem Usage
============================================================================
Rserve.exe                    6108 Console                    1    381,796 K

In R, we can use

> system('tasklist /FI "IMAGENAME eq Rgui.exe" ', intern = TRUE)
[1] ""                                                                            
[2] "Image Name                     PID Session Name        Session#    Mem Usage"
[3] "============================================================================"
[4] "Rgui.exe                      1096 Console                    1     44,804 K"

> length(system('tasklist /FI "IMAGENAME eq Rgui.exe" ', intern = TRUE))-3

Editor

http://en.wikipedia.org/wiki/R_(programming_language)#Editors_and_IDEs

  • Emacs + ESS. The ESS is useful in the case I want to tidy R code (the tidy_source() function in the formatR package sometimes gives errors; eg when I tested it on an R file like <GetComparisonResults.R> from BRB-ArrayTools v4.4 stable).
  • Rstudio - editor/R terminal/R graphics/file browser/package manager. The new version (0.98) also provides a new feature for debugging step-by-step. See also RStudio Tricks
  • geany - I like the feature that it shows defined functions on the side panel even for R code. RStudio can also do this (see the bottom of the code panel).
  • Rgedit which includes a feature of splitting screen into two panes and run R in the bottom panel. See here.
  • Komodo IDE with browser preview http://www.youtube.com/watch?v=wv89OOw9roI at 4:06 and http://docs.activestate.com/komodo/4.4/editor.html

GUI for Data Analysis

Rcmdr

http://cran.r-project.org/web/packages/Rcmdr/index.html

Deducer

http://cran.r-project.org/web/packages/Deducer/index.html

jamovi

Scope

See

  • Assignments within functions in the An Introduction to R manual.
  • source() does not work like C's preprocessor where statements in header files will be literally inserted into the code. It does not work when you define a variable in a function but want to use it outside the function (even through source())
## foo.R ##
cat(ArrayTools, "\n")
## End of foo.R

# 1. Error
predict <- function() {
  ArrayTools <- "C:/Program Files" # or through load() function 
  source("foo.R")                  # or through a function call; foo()
}
predict()   # Object ArrayTools not found

# 2. OK. Make the variable global
predict <- function() {
  ArrayTools <<- "C:/Program Files'
  source("foo.R")
}
predict()  
ArrayTools

# 3. OK. Create a global variable
ArrayTools <- "C:/Program Files"
predict <- function() {
  source("foo.R")
}
predict()

Note that any ordinary assignments done within the function are local and temporary and are lost after exit from the function.

Example 1.

> ttt <- data.frame(type=letters[1:5], JpnTest=rep("999", 5), stringsAsFactors = F)
> ttt
  type JpnTest
1    a     999
2    b     999
3    c     999
4    d     999
5    e     999
> jpntest <- function() { ttt$JpnTest[1] ="N5"; print(ttt)}
> jpntest()
  type JpnTest
1    a      N5
2    b     999
3    c     999
4    d     999
5    e     999
> ttt
  type JpnTest
1    a     999
2    b     999
3    c     999
4    d     999
5    e     999

Example 2. How can we set global variables inside a function? The answer is to use the "<<-" operator or assign(, , envir = .GlobalEnv) function.

Other resource: Advanced R by Hadley Wickham.

Example 3. Writing functions in R, keeping scoping in mind

New environment

http://adv-r.had.co.nz/Environments.html

Run the same function on a bunch of R objects

mye = new.env()
load(<filename>, mye)
for(n in names(mye)) n = as_tibble(myen)

View all objects present in a package, ls()

https://stackoverflow.com/a/30392688. In the case of an R package created by Rcpp.package.skeleton("mypackage"), we will get

> devtools::load_all("mypackage")
> search()
 [1] ".GlobalEnv"        "devtools_shims"    "package:mypackage"
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "package:methods"
[10] "Autoloads"         "package:base"

> ls("package:mypackage")
[1] "_mypackage_rcpp_hello_world" "evalCpp"                     "library.dynam.unload"       
[4] "rcpp_hello_world"            "system.file"

Note that the first argument of ls() (or detach()) is used to specify the environment. It can be

  • an integer (the position in the ‘search’ list);
  • the character string name of an element in the search list;
  • an explicit ‘environment’ (including using ‘sys.frame’ to access the currently active function calls).

Speedup R code

Profiler

(Video) Understand Code Performance with the profiler

&& vs &

See https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/Logic.

The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector.

stopifnot(): function argument sanity check

stopifnot()

Vectorization

sapply vs vectorization

Speed test: sapply vs vectorization

lapply vs for loop

split() and sapply()

split() can be used to split a vector, columns or rows. See How to split a data frame?

  • Split divides the data in the vector or data frame x into the groups defined by f. The syntax is
split(x, f, drop = FALSE, …)
  • Split rows of a data frame/matrix
split(mtcars,mtcars$cyl)
  • Split columns of a data frame/matrix.
ma <- cbind(x = 1:10, y = (-4:5)^2, z = 11:20)
split(ma, cbind(rep(1,10), rep(2, 10), rep(1,10)))
# $`1`
#  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#
# $`2`
#  [1] 16  9  4  1  0  1  4  9 16 25
  • split() + sapply() to merge columns. See below Mean of duplicated columns for more detail.
  • split() + sapply() to split a vector. See nsFilter() function which can remove duplicated probesets/rows using unique Entrez Gene IDs (genefilter package). The source code of nsFilter() and findLargest().
tSsp = split.default(testStat, lls) 
# testStat is a vector of numerics including probeset IDs as names
# lls is a vector of entrez IDs (same length as testStat)
# tSSp is a list of the same length as unique elements of lls.

sapply(tSsp, function(x) names(which.max(x))) 
# return a vector of probset IDs of length of unique entrez IDs
And here is another example from the bigmemory vignette,
planeindices <- split(1:nrow(x), x[,'TailNum'])
planeStart <- sapply(planeindices,
                     function(i) birthmonth(x[i, c('Year','Month'),
                                            drop=FALSE]))

strsplit and sapply

> namedf <- c("John ABC", "Mary CDE", "Kat FGH")
> strsplit(namedf, " ")
1
[1] "John" "ABC" 

2
[1] "Mary" "CDE" 

3
[1] "Kat" "FGH"

> sapply(strsplit(namedf, " "), "[", 1)
[1] "John" "Mary" "Kat" 
> sapply(strsplit(namedf, " "), "[", 2)
[1] "ABC" "CDE" "FGH"

Mean of duplicated columns: rowMeans; compute Means by each row

x <- matrix(1:60, nr=10); x[1, 2:3] <- NA
colnames(x) <- c("A","A", "b", "b", "b", "c"); x
res <- sapply(split(1:ncol(x), colnames(x)), 
              function(i) rowMeans(x[, i, drop=F], na.rm = TRUE))
res

# vapply() is safter than sapply(). 
# The 3rd arg in vapply() is a template of the return value.
res2 <- vapply(split(1:ncol(x), colnames(x)), 
               function(i) rowMeans(x[, i, drop=F], na.rm = TRUE),
               rep(0, nrow(x)))
  • colSums, rowSums, colMeans, rowMeans (no group variable). These functions are equivalent to use of ‘apply’ with ‘FUN = mean’ or ‘FUN = sum’ with appropriate margins, but are a lot faster.
rowMeans(x, na.rm=T)
# [1] 31 27 28 29 30 31 32 33 34 35

apply(x, 1, mean, na.rm=T)
# [1] 31 27 28 29 30 31 32 33 34 35
  • matrixStats: Functions that Apply to Rows and Columns of Matrices (and to Vectors)

Mean of duplicated rows: colMeans and rowsum

  • colMeans(x, na.rm = FALSE, dims = 1)
x <- matrix(1:60, nr=10); x[1, 2:3] <- NA; x
rownames(x) <- c(rep("a", 2), rep("b", 3), rep("c", 4), "d")
res <- sapply(split(1:nrow(x), rownames(x)), 
              function(i) colMeans(x[i, , drop=F], na.rm = TRUE))
res <- t(res) # transpose is needed since sapply() will form the resulting matrix by columns
  • rowsum(x, group, reorder = TRUE, …)
x <- matrix(runif(100), ncol = 5) # 20 x 5
group <- sample(1:8, 20, TRUE)
(xsum <- rowsum(x, group)) # 8 x 5
> attach(mtcars)
dim(mtcars)
[1] 32 11
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> with(mtcars, table(cyl, vs))
   vs
cyl  0  1
  4  1 10
  6  3  4
  8 14  0
> aggdata <-aggregate(mtcars, by=list(cyl,vs),  FUN=mean, na.rm=TRUE)
> print(aggdata)
  Group.1 Group.2      mpg cyl   disp       hp     drat       wt     qsec vs
1       4       0 26.00000   4 120.30  91.0000 4.430000 2.140000 16.70000  0
2       6       0 20.56667   6 155.00 131.6667 3.806667 2.755000 16.32667  0
3       8       0 15.10000   8 353.10 209.2143 3.229286 3.999214 16.77214  0
4       4       1 26.73000   4 103.62  81.8000 4.035000 2.300300 19.38100  1
5       6       1 19.12500   6 204.55 115.2500 3.420000 3.388750 19.21500  1
         am     gear     carb
1 1.0000000 5.000000 2.000000
2 1.0000000 4.333333 4.666667
3 0.1428571 3.285714 3.500000
4 0.7000000 4.000000 1.500000
5 0.0000000 3.500000 2.500000
> detach(mtcars)

# Another example: select rows with a minimum value from a certain column (yval in this case)
> mydf <- read.table(header=T, text='
 id xval yval
 A 1  1
 A -2  2
 B 3  3
 B 4  4
 C 5  5
 ')
> x = mydf$xval
> y = mydf$yval
> aggregate(mydf[, c(2,3)], by=list(id=mydf$id), FUN=function(x) x[which.min(y)])
  id xval yval
1  A    1    1
2  B    3    3
3  C    5    5

Apply family

Vectorize, aggregate, apply, by, eapply, lapply, mapply, rapply, replicate, scale, sapply, split, tapply, and vapply.

The following list gives a hierarchical relationship among these functions.

  • apply(X, MARGIN, FUN, ...) – Apply a Functions Over Array Margins
  • lapply(X, FUN, ...) – Apply a Function over a List (including a data frame) or Vector X.
    • sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) – Apply a Function over a List or Vector
      • replicate(n, expr, simplify = "array")
    • mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) – Multivariate version of sapply
      • Vectorize(FUN, vectorize.args = arg.names, SIMPLIFY = TRUE, USE.NAMES = TRUE) - Vectorize a Scalar Function
      • Map(FUN, ...) A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
    • vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE) – similar to sapply, but has a pre-specified type of return value
    • rapply(object, f, classes = "ANY", deflt = NULL, how = c("unlist", "replace", "list"), ...) – A recursive version of lapply
  • tapply(V, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) – Apply a Function Over a "Ragged" Array. V is typically a vector where split() will be applied. INDEX is a list of one or more factors.
    • aggregate(D, by, FUN, ..., simplify = TRUE, drop = TRUE) - Apply a function to each columns of subset data frame split by factors. FUN (such as mean(), weighted.mean(), sum()) is a simple function applied to a vector. D is typically a data frame.
    • by(D, INDICES, FUN, ..., simplify = TRUE) - Apply a Function to each subset data frame split by factors. FUN (such as summary(), lm()) is applied to a data frame. D is typically a data frame.
  • eapply(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE) – Apply a Function over values in an environment

Difference between apply vs sapply vs lapply vs tapply?

  • apply - When you want to apply a function to the rows or columns or both of a matrix and output is a one-dimensional if only row or column is selected else it is a 2D-matrix
  • lapply - When you want to apply a function to each element of a list in turn and get a list back.
  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
  • tapply - When you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

Some short examples:

Apply vs for loop

Note that, apply's performance is not always better than a for loop. See

Progress bar

What is the cost of a progress bar in R?

The package 'pbapply' creates a text-mode progress bar - it works on any platforms. On Windows platform, check out this post. It uses winProgressBar() and setWinProgressBar() functions.

lapply and its friends Map(), Reduce(), Filter() from the base package for manipulating lists

  • Examples of using lapply() + split() on a data frame. See rollingyours.wordpress.com.
  • mapply() documentation. Use mapply() to merge lists.
  • Map() and Reduce() in functional programming
  • Map(), Reduce(), and Filter() from Advanced R by Hadley
    • If you have two or more lists (or data frames) that you need to process in parallel, use Map(). One good example is to compute the weighted.mean() function that requires two input objects. Map() is similar to mapply() function and is more concise than lapply(). Advanced R has a comment that Map() is better than mapply().
# Syntax: Map(f, ...)

xs <- replicate(5, runif(10), simplify = FALSE)
ws <- replicate(5, rpois(10, 5) + 1, simplify = FALSE)
Map(weighted.mean, xs, ws)

# instead of a more clumsy way
lapply(seq_along(xs), function(i) {
  weighted.mean(xsi, wsi)
})
# Syntax: Reduce(f, x, ...)

> m1 <- data.frame(id=letters[1:4], val=1:4)
> m2 <- data.frame(id=letters[2:6], val=2:6)
> merge(m1, m2, "id", all = T)
  id val.x val.y
1  a     1    NA
2  b     2     2
3  c     3     3
4  d     4     4
5  e    NA     5
6  f    NA     6
> m <- list(m1, m2)
> Reduce(function(x,y) merge(x,y, "id",all=T), m)
  id val.x val.y
1  a     1    NA
2  b     2     2
3  c     3     3
4  d     4     4
5  e    NA     5
6  f    NA     6

sapply & vapply

See parallel::parSapply() for a parallel version of sapply(1:n, function(x)). We can this technique to speed up this example.

rapply - recursive version of lapply

replicate

https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r

> replicate(5, rnorm(3))
           [,1]       [,2]       [,3]      [,4]        [,5]
[1,]  0.2509130 -0.3526600 -0.3170790  1.064816 -0.53708856
[2,]  0.5222548  1.5343319  0.6120194 -1.811913 -1.09352459
[3,] -1.9905533 -0.8902026 -0.5489822  1.308273  0.08773477

See parSapply() for a parallel version of replicate().

Vectorize

> rep(1:4, 4:1)
 [1] 1 1 1 1 2 2 2 3 3 4
> vrep <- Vectorize(rep.int)
> vrep(1:4, 4:1)
1
[1] 1 1 1 1

2
[1] 2 2 2

3
[1] 3 3

4
[1] 4
> rweibull(1, 1, c(1, 2)) # no error but not sure what it gives?
[1] 2.17123
> Vectorize("rweibull")(n=1, shape = 1, scale = c(1, 2)) 
[1] 1.6491761 0.9610109
myfunc <- function(a, b) a*b
myfunc(1, 2) # 2
myfunc(3, 5) # 15
myfunc(c(1,3), c(2,5)) # 2 15
Vectorize(myfunc)(c(1,3), c(2,5)) # 2 15

myfunc2 <- function(a, b) if (length(a) == 1) a * b else NA
myfunc2(1, 2) # 2 
myfunc2(3, 5) # 15
myfunc2(c(1,3), c(2,5)) # NA
Vectorize(myfunc2)(c(1, 3), c(2, 5)) # 2 15
Vectorize(myfunc2)(c(1, 3, 6), c(2, 5)) # 2 15 12
                                        # parameter will be re-used

plyr and dplyr packages

Practical Data Science for Stats - a PeerJ Collection

The Split-Apply-Combine Strategy for Data Analysis (plyr package) in J. Stat Software.

A quick introduction to plyr with a summary of apply functions in R and compare them with functions in plyr package.

  1. plyr has a common syntax -- easier to remember
  2. plyr requires less code since it takes care of the input and output format
  3. plyr can easily be run in parallel -- faster

Tutorials

Examples of using dplyr:

tibble

Tibbles are data frames, but slightly tweaked to work better in the tidyverse.

Tibble objects

  • it does not have row names (cf data frame),
  • it never changes the type of the inputs (e.g. it never converts strings to factors!),
  • it never changes the names of variables

Tibbles Vignette

> data(pew, package = "efficient")
> dim(pew) 
[1] 18 10
> class(pew) # tibble is also a data frame!!
[1] "tbl_df"     "tbl"        "data.frame"

> tidyr::gather(pew, key=Income, value = Count, -religion) # make wide tables long
# A tibble: 162 x 3
                                                       religion Income Count
                                                          <chr>  <chr> <int>
 1                                                     Agnostic  <$10k    27
 2                                                      Atheist  <$10k    12
 ...
> mean(tidyr::gather(pew, key=Income, value = Count, -religion)[, 3])
[1] NA
Warning message:
In mean.default(tidyr::gather(pew, key = Income, value = Count,  :
  argument is not numeric or logical: returning NA
> mean(tidyr::gather(pew, key=Income, value = Count, -religion)3)
[1] 181.6975

If we try to do a match on some column of a tibble object, we will get zero matches. The issue is we cannot use an index to get a tibble column.

Subsetting: to extract a column from a tibble object, use [[ or $ or dplyr::pull(). Select Data Frame Columns in R.

TibbleObject$VarName
# OR
TibbleObject"VarName"
# OR
pull(TibbleObject, VarName) # won't be a tibble object anymore

dplyr::select(TibbleObject, -c(VarName1, VarName2)) # still a tibble object
# OR
dplyr::select(TibbleObject, 2:5) # 

llply()

llply is equivalent to lapply except that it will preserve labels and can display a progress bar. This is handy if we want to do a crazy thing.

LLID2GOIDs <- lapply(rLLID, function(x) get("org.Hs.egGO")[[x]])

where rLLID is a list of entrez ID. For example,

get("org.Hs.egGO")[["6772"]]

returns a list of 49 GOs.

ddply()

http://lamages.blogspot.com/2012/06/transforming-subsets-of-data-in-r-with.html

ldply()

An R Script to Automatically download PubMed Citation Counts By Year of Publication

Using R's set.seed() to set seeds for use in C/C++ (including Rcpp)

http://rorynolan.rbind.io/2018/09/30/rcsetseed/

get_seed()

See the same blog

get_seed <- function() {
  sample.int(.Machine$integer.max, 1)
}

Note: .Machine$integer.max = 2147483647 = 2^31 - 1.

Random seeds

set.seed(as.numeric(Sys.time()))

.Machine and the largest integer, double

See ?.Machine.

                          Linux/Mac  32-bit Windows 64-bit Windows
double.eps              2.220446e-16   2.220446e-16   2.220446e-16
double.neg.eps          1.110223e-16   1.110223e-16   1.110223e-16
double.xmin            2.225074e-308  2.225074e-308  2.225074e-308
double.xmax            1.797693e+308  1.797693e+308  1.797693e+308
double.base             2.000000e+00   2.000000e+00   2.000000e+00
double.digits           5.300000e+01   5.300000e+01   5.300000e+01
double.rounding         5.000000e+00   5.000000e+00   5.000000e+00
double.guard            0.000000e+00   0.000000e+00   0.000000e+00
double.ulp.digits      -5.200000e+01  -5.200000e+01  -5.200000e+01
double.neg.ulp.digits  -5.300000e+01  -5.300000e+01  -5.300000e+01
double.exponent         1.100000e+01   1.100000e+01   1.100000e+01
double.min.exp         -1.022000e+03  -1.022000e+03  -1.022000e+03
double.max.exp          1.024000e+03   1.024000e+03   1.024000e+03
integer.max             2.147484e+09   2.147484e+09   2.147484e+09
sizeof.long             8.000000e+00   4.000000e+00   4.000000e+00
sizeof.longlong         8.000000e+00   8.000000e+00   8.000000e+00
sizeof.longdouble       1.600000e+01   1.200000e+01   1.600000e+01
sizeof.pointer          8.000000e+00   4.000000e+00   8.000000e+00

How to select a seed for simulation or randomization

set.seed() allow alphanumeric seeds

https://stackoverflow.com/a/10913336

set.seed(), for loop and saving random seeds

http://r.789695.n4.nabble.com/set-seed-and-for-loop-td3585857.html. This question is legitimate when we want to debug on a certain iteration.

set.seed(1001) 
data <- vector("list", 30) 
seeds <- vector("list", 30) 
for(i in 1:30) { 
  seeds[[i]] <- .Random.seed 
  data[[i]] <- runif(5) 
} 
 
.Random.seed <- seeds[[23]]  # restore 
data.23 <- runif(5) 
data.23 
data[[23]] 
  • Duncan Murdoch: This works in this example, but wouldn't work with all RNGs, because some of them save state outside of .Random.seed. See ?.Random.seed for details.
  • Uwe Ligges's comment: set.seed() actually generates a seed. See ?set.seed that points us to .Random.seed (and relevant references!) which contains the actual current seed.
  • Petr Savicky's comment is also useful in the situation when it is not difficult to re-generate the data.
  • Local randomness in R.

sample() inaccurate on very large populations, fixed in R 3.6.0

# R 3.5.3
set.seed(123)
m <- (2/5)*2^32
m > 2^31
# [1] FALSE
log10(m)
# [1] 9.23502
x <- sample(m, 1000000, replace = TRUE)
table(x %% 2)
#      0      1 
# 400070 599930 
# R 3.5.3
# docker run --net=host -it --rm r-base:3.5.3
> set.seed(1234)
> sample(5)
[1] 1 3 2 4 5

# R 3.6.0
# docker run --net=host -it --rm r-base:3.6.0
> set.seed(1234)
> sample(5)
[1] 4 5 2 3 1
> RNGkind(sample.kind = "Rounding")
Warning message:
In RNGkind(sample.kind = "Rounding") : non-uniform 'Rounding' sampler used
> set.seed(1234)
> sample(5)
[1] 1 3 2 4 5

Regular Expression

See here.

Read rrd file

file, connection

  • cat() and scan() (read data into a vector or list from the console or file)
  • read() and write()
  • read.table() and write.table()
out = file('tmp.txt', 'w')
writeLines("abcd", out)
writeLines("eeeeee", out)
close(out)
readLines('tmp.txt')
unlink('tmp.txt')
args(writeLines)
# function (text, con = stdout(), sep = "\n", useBytes = FALSE)

Clipboard (?connections), textConnection(), pipe()

  • On Windows, we can use readClipboard() and writeClipboard().
source("clipboard")
read.table("clipboard")
x <- read.delim(textConnection("<USE_KEYBOARD_TO_PASTE_FROM_CLIPBOARD>"))
# Or on Mac
x <- read.delim(pipe("pbpaste"))
# safely ignore the warning: incomplete final line found by readTableHeader on 'pbpaste'

An example is to copy data from this post. In this case we need to use read.table() instead of read.delim().

clip <- pipe("pbcopy", "w")
write.table(apply(x, 1, mean), file = clip, row.names=F, col.names=F)
close(clip)
# sudo apt-get install xclip
read.table(pipe("xclip -selection clipboard -o",open="r"))

read/manipulate binary data

  • x <- readBin(fn, raw(), file.info(fn)$size)
  • rawToChar(x[1:16])
  • See Biostrings C API

String Manipulation

format(): padding with zero

ngenes <- 10
genenames <- paste0("bm", gsub(" ", "0", format(1:ngenes))); genenames
#  [1] "bm01" "bm02" "bm03" "bm04" "bm05" "bm06" "bm07" "bm08" "bm09" "bm10"

stringr package

glue package

HTTPs connection

HTTPS connection becomes default in R 3.2.2. See

R 3.3.2 patched The internal methods of ‘download.file()’ and ‘url()’ now report if they are unable to follow the redirection of a ‘http://’ URL to a ‘https://’ URL (rather than failing silently)

setInternet2

There was a bug in ftp downloading in R 3.2.2 (r69053) Windows though it is fixed now in R 3.2 patch.

Read the discussion reported on 8/8/2015. The error only happened on ftp not http connection. The final solution is explained in this post. The following demonstrated the original problem.

url <- paste0("ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/",
              "GCF_000001405.13.assembly.txt")
f1 <- tempfile()
download.file(url, f1)

It seems the bug was fixed in R 3.2-branch. See 8/16/2015 patch r69089 where a new argument INTERNET_FLAG_PASSIVE was added to InternetOpenUrl() function of wininet library. This article and this post explain differences of active and passive FTP.

The following R command will show the exact svn revision for the R you are currently using.

R.Version()$"svn rev"

If setInternet2(T), then https protocol is supported in download.file().

When setInternet(T) is enabled by default, download.file() does not work for ftp protocol (this is used in getGEO() function of the GEOquery package). If I use setInternet(F), download.file() works again for ftp protocol.

The setInternet2() function is defined in R> src> library> utils > R > windows > sysutils.R.

R up to 3.2.2

setInternet2 <- function(use = TRUE) .Internal(useInternet2(use))

See also

  • <src/include/Internal.h> (declare do_setInternet2()),
  • <src/main/names.c> (show do_setInternet2() in C)
  • <src/main/internet.c> (define do_setInternet2() in C).

Note that: setInternet2(T) becomes default in R 3.2.2. To revert to the previous default use setInternet2(FALSE). See the <doc/NEWS.pdf> file. If we use setInternet2(F), then it solves the bug of getGEO() error. But it disables the https file download using the download.file() function. In R < 3.2.2, it is also possible to download from https by setIneternet2(T).

R 3.3.0

setInternet2 <- function(use = TRUE) {
    if(!is.na(use)) stop("use != NA is defunct")
    NA
}

Note that setInternet2.Rd says As from \R 3.3.0 it changes nothing, and only \code{use = NA} is accepted. Also NEWS.Rd says setInternet2() has no effect and will be removed in due course.

File operation

  • list.files()
  • file.info()
  • dir.create()
  • file.create()
  • file.copy()
  • file.exists()
  • basename() - remove the parent path
  • tools::file_path_sans_ext() - remove the file extension or the sub() function.

read/download/source a file from internet

Simple text file http

retail <- read.csv("http://robjhyndman.com/data/ausretail.csv",header=FALSE)

Zip file and url() function

con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

Here url() function is like file(), gzfile(), bzfile(), xzfile(), unz(), pipe(), fifo(), socketConnection(). They are used to create connections. By default, the connection is not opened (except for ‘socketConnection’), but may be opened by setting a non-empty value of argument ‘open’. See ?url.

Another example of using url() is

load(url("http:/www.example.com/example.RData"))

downloader package

This package provides a wrapper for the download.file function, making it possible to download files over https on Windows, Mac OS X, and other Unix-like platforms. The RCurl package provides this functionality (and much more) but can be difficult to install because it must be compiled with external dependencies. This package has no external dependencies, so it is much easier to install.

Google drive file based on https using RCurl package

require(RCurl)
myCsv <- getURL("https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AkuuKBh0jM2TdGppUFFxcEdoUklCQlJhM2kweGpoUUE&single=true&gid=0&output=csv")
read.csv(textConnection(myCsv))

Google sheet file using googlesheets package

Reading data from google sheets into R

Github files https using RCurl package

x = getURL("https://gist.github.com/arraytools/6671098/raw/c4cb0ca6fe78054da8dbe253a05f7046270d5693/GeneIDs.txt", 
            ssl.verifypeer = FALSE)
read.table(text=x)

summarytools: create summary tables for vectors and data frames

https://github.com/dcomtois/summarytools. R Package for quickly and neatly summarizing vectors and data frames.

skimr: A frictionless, pipeable approach to dealing with summary statistics

skimr for useful and tidy summary statistics

Create publication tables using tables package

See p13 for example at here

R's tables packages is the best solution. For example,

> library(tables)
> tabular( (Species + 1) ~ (n=1) + Format(digits=2)*
+          (Sepal.Length + Sepal.Width)*(mean + sd), data=iris )
                                                  
                Sepal.Length      Sepal.Width     
 Species    n   mean         sd   mean        sd  
 setosa      50 5.01         0.35 3.43        0.38
 versicolor  50 5.94         0.52 2.77        0.31
 virginica   50 6.59         0.64 2.97        0.32
 All        150 5.84         0.83 3.06        0.44
> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

and

# This example shows some of the less common options         
> Sex <- factor(sample(c("Male", "Female"), 100, rep=TRUE))
> Status <- factor(sample(c("low", "medium", "high"), 100, rep=TRUE))
> z <- rnorm(100)+5
> fmt <- function(x) {
  s <- format(x, digits=2)
  even <- ((1:length(s)) %% 2) == 0
  s[even] <- sprintf("(%s)", s[even])
  s
}
> tabular( Justify(c)*Heading()*z*Sex*Heading(Statistic)*Format(fmt())*(mean+sd) ~ Status )
                  Status              
 Sex    Statistic high   low    medium
 Female mean       4.88   4.96   5.17 
        sd        (1.20) (0.82) (1.35)
 Male   mean       4.45   4.31   5.05 
        sd        (1.01) (0.93) (0.75)

ClinReport: Statistical Reporting in Clinical Trials

https://cran.r-project.org/web/packages/ClinReport/index.html

Append figures to PDF files

How to append a plot to an existing pdf file. Hint: use the recordPlot() function.

Extracting tables from PDFs

  • extracting Tables from PDFs in R using Tabulizer. This needs the rJava package. Linux works fine. Some issue came out on my macOS 10.12 Sierra. Library not loaded: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/lib/server/libjvm.dylib. Referenced from: /Users/XXXXXXX/Library/R/3.5/library/rJava/libs/rJava.so.
  • pdftools - Text Extraction, Rendering and Converting of PDF Documents. pdf_text() and pdf_data() functions.
library(pdftools)
pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
txt <- pdf_text(pdf_file) # length = number of pages
# Suppose the table we are interested in is on page 1
cat(txt[1]) # Good but not in a data frame format

pdf_data(pdf_file)1  # data frame/tibble format

However, it seems it does not work on Table S6. Tabulizer package is better at this case.

sudo apt install poppler-utils
pdftotext -layout input.pdf output.txt
pdftotext -layout -f 3 -l 4 input.pdf output.txt # from page 3 to 4.

Print tables

tableone

https://cran.r-project.org/web/packages/tableone/

dplyr

https://stackoverflow.com/a/34587522. The output includes counts and proportions in a publication like fashion.

tables::tabular()

gmodels::CrossTable()

https://www.statmethods.net/stats/frequencies.html

base::prop.table(x, margin)

R> m <- matrix(1:4, 2)
R> prop.table(m, 1) # row percentage
          [,1]      [,2]
[1,] 0.2500000 0.7500000
[2,] 0.3333333 0.6666667
R> prop.table(m, 2) # column percentage
          [,1]      [,2]
[1,] 0.3333333 0.4285714
[2,] 0.6666667 0.5714286

stats::xtabs()

stats::ftable()

> ftable(Titanic, row.vars = 1:3)
                   Survived  No Yes
Class Sex    Age                   
1st   Male   Child            0   5
             Adult          118  57
      Female Child            0   1
             Adult            4 140
2nd   Male   Child            0  11
             Adult          154  14
      Female Child            0  13
             Adult           13  80
3rd   Male   Child           35  13
             Adult          387  75
      Female Child           17  14
             Adult           89  76
Crew  Male   Child            0   0
             Adult          670 192
      Female Child            0   0
             Adult            3  20
> ftable(Titanic, row.vars = 1:2, col.vars = "Survived")
             Survived  No Yes
Class Sex                    
1st   Male            118  62
      Female            4 141
2nd   Male            154  25
      Female           13  93
3rd   Male            422  88
      Female          106  90
Crew  Male            670 192
      Female            3  20
> ftable(Titanic, row.vars = 2:1, col.vars = "Survived")
             Survived  No Yes
Sex    Class                 
Male   1st            118  62
       2nd            154  25
       3rd            422  88
       Crew           670 192
Female 1st              4 141
       2nd             13  93
       3rd            106  90
       Crew             3  20
> str(Titanic)
 table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
 - attr(*, "dimnames")=List of 4
  ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
  ..$ Sex     : chr [1:2] "Male" "Female"
  ..$ Age     : chr [1:2] "Child" "Adult"
  ..$ Survived: chr [1:2] "No" "Yes"
> x <- ftable(mtcars[c("cyl", "vs", "am", "gear")])
> x
          gear  3  4  5
cyl vs am              
4   0  0        0  0  0
       1        0  0  1
    1  0        1  2  0
       1        0  6  1
6   0  0        0  0  0
       1        0  2  1
    1  0        2  2  0
       1        0  0  0
8   0  0       12  0  0
       1        0  0  2
    1  0        0  0  0
       1        0  0  0
> ftable(x, row.vars = c(2, 4))
        cyl  4     6     8   
        am   0  1  0  1  0  1
vs gear                      
0  3         0  0  0  0 12  0
   4         0  0  0  2  0  0
   5         0  1  0  1  0  2
1  3         1  0  2  0  0  0
   4         2  6  2  0  0  0
   5         0  1  0  0  0  0
> 
> ## Start with expressions, use table()'s "dnn" to change labels
> ftable(mtcars$cyl, mtcars$vs, mtcars$am, mtcars$gear, row.vars = c(2, 4),
         dnn = c("Cylinders", "V/S", "Transmission", "Gears"))

          Cylinders     4     6     8   
          Transmission  0  1  0  1  0  1
V/S Gears                               
0   3                   0  0  0  0 12  0
    4                   0  0  0  2  0  0
    5                   0  1  0  1  0  2
1   3                   1  0  2  0  0  0
    4                   2  6  2  0  0  0
    5                   0  1  0  0  0  0

addmargins. Puts Arbitrary Margins On Multidimensional Tables Or Arrays.

tracemem, data type, copy

How to avoid copying a long vector

Tell if the current R is running in 32-bit or 64-bit mode

8 * .Machine$sizeof.pointer

where sizeof.pointer returns the number of *bytes* in a C SEXP type and '8' means number of bits per byte.

32- and 64-bit

See R-admin.html.

  • For speed you may want to use a 32-bit build, but to handle large datasets a 64-bit build.
  • Even on 64-bit builds of R there are limits on the size of R objects, some of which stem from the use of 32-bit integers (especially in FORTRAN code). For example, the dimensionas of an array are limited to 2^31 -1.
  • Since R 2.15.0, it is possible to select '64-bit Files' from the standard installer even on a 32-bit version of Windows (2012/3/30).

Handling length 2^31 and more in R 3.0.0

From R News for 3.0.0 release:

There is a subtle change in behaviour for numeric index values 2^31 and larger. These never used to be legitimate and so were treated as NA, sometimes with a warning. They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.

In R 2.15.2, if I try to assign a vector of length 2^31, I will get an error

> x <- seq(1, 2^31)
Error in from:to : result would be too long a vector

However, for R 3.0.0 (tested on my 64-bit Ubuntu with 16GB RAM. The R was compiled by myself):

> system.time(x <- seq(1,2^31))
   user  system elapsed
  8.604  11.060 120.815
> length(x)
[1] 2147483648
> length(x)/2^20
[1] 2048
> gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells     183823     9.9     407500    21.8     350000    18.7
Vcells 2147764406 16386.2 2368247221 18068.3 2148247383 16389.9
>

Note:

  1. 2^31 length is about 2 Giga length. It takes about 16 GB (2^31*8/2^20 MB) memory.
  2. On Windows, it is almost impossible to work with 2^31 length of data if the memory is less than 16 GB because virtual disk on Windows does not work well. For example, when I tested on my 12 GB Windows 7, the whole Windows system freezes for several minutes before I force to power off the machine.
  3. My slide in http://goo.gl/g7sGX shows the screenshots of running the above command on my Ubuntu and RHEL machines. As you can see the linux is pretty good at handling large (> system RAM) data. That said, as long as your linux system is 64-bit, you can possibly work on large data without too much pain.
  4. For large dataset, it makes sense to use database or specially crafted packages like bigmemory or ff or bigstatsr.

NA in index

  • Question: what is seq(1, 3)[c(1, 2, NA)]?

Answer: It will reserve the element with NA in indexing and return the value NA for it.

  • Question: What is TRUE & NA?

Answer: NA

  • Question: What is FALSE & NA?

Answer: FALSE

  • Question: c("A", "B", NA) != "" ?

Answer: TRUE TRUE NA

  • Question: which(c("A", "B", NA) != "") ?

Answer: 1 2

  • Question: c(1, 2, NA) != "" & !is.na(c(1, 2, NA)) ?

Answer: TRUE TRUE FALSE

  • Question: c("A", "B", NA) != "" & !is.na(c("A", "B", NA)) ?

Answer: TRUE TRUE FALSE

Conclusion: In order to exclude empty or NA for numerical or character data type, we can use which() or a convenience function keep.complete(x) <- function(x) x != "" & !is.na(x). This will guarantee return logical values and not contain NAs.

Don't just use x != "" OR !is.na(x).

Constant and 'L'

Add 'L' after a constant. For example,

for(i in 1L:n) { }

if (max.lines > 0L) { }

label <- paste0(n-i+1L, ": ")

n <- length(x);  if(n == 0L) { }

Factor

labels argument

We can specify the factor levels and new labels using the factor() function.

sex <- factor(sex, levels = c("0", "1"), labels = c("Male", "Female"))
drug_treatment <- factor(drug_treatment, levels = c("Placebo", "Low dose", "High dose"))
health_status <- factor(health_status, levels = c("Healthy", "Alzheimer's"))

Create a factor from a continuous variable: cut()

facVar <- cut(contVar, c(-3, 0, 2, 4), labels = c("low", "medium", "high"))

Or a tibble object

library(tidyverse)
tibble(age_yrs = c(0, 4, 10, 15, 24, 55),
       age_cat = case_when(
          age_yrs < 2 ~ "baby",
          age_yrs < 13 ~ "kid",
          age_yrs < 20 ~ "teen",
          TRUE         ~ "adult")
)

Data frame

stringsAsFactors = FALSE

http://www.win-vector.com/blog/2018/03/r-tip-use-stringsasfactors-false/

check.names = FALSE

> data.frame("1a"=1:2, "2a"=1:2, check.names = FALSE)
  1a 2a
1  1  1
2  2  2
> data.frame("1a"=1:2, "2a"=1:2) # default
  X1a X2a
1   1   1
2   2   2

Print a data frame without rownames

# Method 1. 
rownames(df1) <- NULL

# Method 2. 
print(df1, row.names = FALSE)

Convert data frame factor columns to characters

Convert data.frame columns from factors to characters

# Method 1:
bob <- data.frame(lapply(bob, as.character), stringsAsFactors=FALSE)

# Method 2:
bob[] <- lapply(bob, as.character)

data.frame to vector

> a= matrix(1:6, 2,3)
> rownames(a) <- c("a", "b")
> colnames(a) <- c("x", "y", "z")
> a
  x y z
a 1 3 5
b 2 4 6
> unlist(data.frame(a))
x1 x2 y1 y2 z1 z2 
 1  2  3  4  5  6 

Using cbind() to merge vectors together?

It’s a common mistake to try and create a data frame by cbind()ing vectors together. This doesn’t work because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly. See Advanced R -> Data structures chapter.

merge

How to perform merges (joins) on two or more data frames with base R, tidyverse and data.table

is.matrix: data.frame is not necessarily a matrix

See ?matrix. is.matrix returns TRUE if x is a vector and has a "dim" attribute of length 2 and FALSE otherwise.

An example that is a data frame (is.data.frame() returns TRUE) but not a matrix (is.matrix() returns FALSE) is an object returned by

X <- data.frame(x=1:2, y=3:4)

The 'X' object is NOT a vector and it does NOT have the "dim" attribute. It has only 3 attributes: "names", "row.names" & "class". Note that dim() function works fine and returns correctly though there is not "dim" attribute.

Another example that is a data frame but not a matrix is the built-in object cars; see ?matrix. It is not a vector

Convert a data frame to a matrix: as.matrix() vs data.matrix()

If I have a data frame X which recorded the time of some files.

  • is.data.frame(X) shows TRUE but is.matrix(X) show FALSE
  • as.matrix(X) will keep the time mode. The returned object is not a data frame anymore.
  • data.matrix(X) will convert the time to numerical values. So use data.matrix() if the data is numeric. The returned object is not a data frame anymore.
# latex directory contains cache files from knitting an rmarkdown file
X <- list.files("latex/", full.names = T) %>% grep("RData", ., value=T) %>% file.info() %>%  `[`("mtime")
X %>% is.data.frame() # TRUE
X %>% is.matrix() # FALSE
X %>% as.matrix() %>% is.matrix() # TRUE
X %>% data.matrix() %>% is.matrix() # TRUE
X %>% as.matrix() %>% "["(1:2, ) # timestamps
X %>% data.matrix() %>% "["(1:2, ) # numeric

matrix vs data.frame

ip1 <- installed.packages()[,c(1,3:4)] # class(ip1) = 'matrix'
unique(ip1$Priority)
# Error in ip1$Priority : $ operator is invalid for atomic vectors
unique(ip1[, "Priority"])   # OK

ip2 <- as.data.frame(installed.packages()[,c(1,3:4)], stringsAsFactors = FALSE) # matrix -> data.frame
unique(ip2$Priority)     # OK

The length of a matrix and a data frame is different.

> length(matrix(1:6, 3, 2))
[1] 6
> length(data.frame(matrix(1:6, 3, 2)))
[1] 2
> x[1]
  X1
1  1
2  2
3  3
4  4
5  5
6  6
> x1
[1] 1 2 3 4 5 6

So the length of a data frame is the number of columns. When we use sapply() function on a data frame, it will apply to each column of the data frame.

Convert a matrix (not data frame) of characters to numeric

Just change the mode of the object

tmp <- cbind(a=c("0.12", "0.34"), b =c("0.567", "0.890")); tmp
     a     b
1 0.12 0.567
2 0.34 0.890
> is.data.frame(tmp) # FALSE
> is.matrix(tmp)     # TRUE
> sum(tmp)
Error in sum(tmp) : invalid 'type' (character) of argument
> mode(tmp)  # "character"

> mode(tmp) <- "numeric"
> sum(tmp)
[1] 1.917

Select Data Frame Columns in R

This is part of series of DATA MANIPULATION IN R from datanovia.com

  • pull(): Extract column values as a vector. The column of interest can be specified either by name or by index.
  • select(): Extract one or multiple columns as a data table. It can be also used to remove columns from the data frame.
  • select_if(): Select columns based on a particular condition. One can use this function to, for example, select columns if they are numeric.
  • Helper functions - starts_with(), ends_with(), contains(), matches(), one_of(): Select columns/variables based on their names

Danger of selecting rows from a data frame

> dim(cars)
[1] 50  2
> data.frame(a=cars[1,], b=cars[2, ])
  a.speed a.dist b.speed b.dist
1       4      2       4     10
> dim(data.frame(a=cars[1,], b=cars[2, ]))
[1] 1 4
> cars2 = as.matrix(cars)
> data.frame(a=cars2[1,], b=cars2[2, ])
      a  b
speed 4  4
dist  2 10

Creating data frame using structure() function

Creating data frame using structure() function in R

Objects from subsetting a row in a data frame vs matrix

My example:

str(trainData)
# 'data.frame':	503 obs. of  500 variables:
#  $ bm001: num  0.429 1 -0.5 1.415 -1.899 ...
#  $ bm002: num  0.0568 1 0.5 0.3556 -1.16 ...
# ...
trainData[1:3, 1:3]
#        bm001      bm002    bm003
# 1  0.4289449 0.05676296 1.657966
# 2  1.0000000 1.00000000 1.000000
# 3 -0.5000000 0.50000000 0.500000
o <- data.frame(time = trainData[1, ], status = trainData[2, ], treat = trainData[3, ], t(TData))
# Warning message:
# In data.frame(time = trainData[1, ], status = trainData[2, ], treat = trainData[3,  :
#   row names were found from a short variable and have been discarded

'trees' data from the 'datasets' package

trees[1:3,]
#   Girth Height Volume
# 1   8.3     70   10.3
# 2   8.6     65   10.3
# 3   8.8     63   10.2

# Wrong ways:
data.frame(trees[1,] , trees[2,])
#   Girth Height Volume Girth.1 Height.1 Volume.1
# 1   8.3     70   10.3     8.6       65     10.3
data.frame(time=trees[1,] , status=trees[2,])
#   time.Girth time.Height time.Volume status.Girth status.Height status.Volume
# 1        8.3          70        10.3          8.6            65          10.3
data.frame(time=as.vector(trees[1,]) , status=as.vector(trees[2,]))
#   time.Girth time.Height time.Volume status.Girth status.Height status.Volume
# 1        8.3          70        10.3          8.6            65          10.3
data.frame(time=c(trees[1,]) , status=c(trees[2,]))
# time.Girth time.Height time.Volume status.Girth status.Height status.Volume
# 1        8.3          70        10.3          8.6            65          10.3

# Right ways:
# method 1: dropping row names
data.frame(time=c(t(trees[1,])) , status=c(t(trees[2,]))) 
# OR
data.frame(time=as.numeric(trees[1,]) , status=as.numeric(trees[2,]))
#   time status
# 1  8.3    8.6
# 2 70.0   65.0
# 3 10.3   10.3
# method 2: keeping row names
data.frame(time=t(trees[1,]) , status=t(trees[2,]))
#          X1   X2
# Girth   8.3  8.6
# Height 70.0 65.0
# Volume 10.3 10.3
data.frame(time=unlist(trees[1,]) , status=unlist(trees[2,]))
#        time status
# Girth   8.3    8.6
# Height 70.0   65.0
# Volume 10.3   10.3

is.matrix(trees)
# [1] FALSE
trees2 <- as.matrix(trees)
data.frame(time=trees2[1,] , status=trees2[2,]) # row names are kept
#        time status
# Girth   8.3    8.6
# Height 70.0   65.0
# Volume 10.3   10.3

dim(trees[1,])
# [1] 1 3
dim(trees2[1, ])
# NULL
trees[1, ]  # notice the row name '1' on the left hand side
#   Girth Height Volume
# 1   8.3     70   10.3
trees2[1, ]
#  Girth Height Volume
#    8.3   70.0   10.3

matrix

NROW vs nrow

?nrow. Use NROW/NCOL instead of nrow/ncol to treat vectors as 1-column matrices.

matrix (column-major order) multiply a vector

> matrix(1:6, 3,2)
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> matrix(1:6, 3,2) * c(1,2,3)
     [,1] [,2]
[1,]    1    4
[2,]    4   10
[3,]    9   18
> matrix(1:6, 3,2) * c(1,2,3,4)
     [,1] [,2]
[1,]    1   16
[2,]    4    5
[3,]    9   12

Print a vector by suppressing names

Use unname.

format.pval

> args(format.pval)
function (pv, digits = max(1L, getOption("digits") - 2L), eps = .Machine$double.eps, 
    na.form = "NA", ...) 

> format.pval(c(stats::runif(5), pi^-100, NA))
[1] "0.19571" "0.46793" "0.71696" "0.93200" "0.74485" "< 2e-16" "NA"     
> format.pval(c(0.1, 0.0001, 1e-27))
[1] "1e-01"  "1e-04"  "<2e-16"

Customize R: options()

Change the default R repository

Change R repository

Edit global Rprofile file. On *NIX platforms, it's located in /usr/lib/R/library/base/R/Rprofile although local .Rprofile settings take precedence.

For example, I can specify the R mirror I like by creating a single line .Rprofile file under my home directory. Another good choice of repository is cloud.r-project.org.

local({
  r = getOption("repos")
  r["CRAN"] = "https://cran.rstudio.com/"
  options(repos = r)
})
options(continue = "  ")
message("Hi MC, loading ~/.Rprofile")
if (interactive()) {
  .Last <- function() try(savehistory("~/.Rhistory"))
}

Change the default web browser

When I run help.start() function in LXLE, it cannot find its default web browser (seamonkey).

> help.start()
If the browser launched by 'xdg-open' is already running, it is *not*
    restarted, and you must switch to its window.
Otherwise, be patient ...
> /usr/bin/xdg-open: 461: /usr/bin/xdg-open: x-www-browser: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: firefox: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: mozilla: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: epiphany: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: konqueror: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: chromium-browser: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: google-chrome: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: links2: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: links: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: lynx: not found
/usr/bin/xdg-open: 461: /usr/bin/xdg-open: w3m: not found
xdg-open: no method available for opening 'http://127.0.0.1:27919/doc/html/index.html'

The solution is to put

options(browser='seamonkey')

in the .Rprofile of your home directory. If the browser is not in the global PATH, we need to put the full path above.

For one-time only purpose, we can use the browser option in help.start() function:

> help.start(browser="seamonkey")
If the browser launched by 'seamonkey' is already running, it is *not*
    restarted, and you must switch to its window.
Otherwise, be patient ...

We can work made a change (or create the file) ~/.Renviron or etc/Renviron. See

Change the default editor

On my Linux and mac, the default editor is "vi". To change it to "nano",

options(editor = "nano")

Change prompt and remove '+' sign

See https://stackoverflow.com/a/1448823.

options(prompt="R> ", continue=" ")

digits

In R,

> options()$digits # Default
[1] 7
> 100000.07 + .04
[1] 100000.1
> options(digits = 16)
> 100000.07 + .04
[1] 100000.11

In Python,

>>> 100000.07 + .04
100000.11

Disable scientific notation in printing: options(scipen)

> numer = 29707; denom = 93874
> c(numer/denom, numer, denom) 
[1] 3.164561e-01 2.970700e+04 9.387400e+04

# Method 1. Without changing the global option
> format(c(numer/denom, numer, denom), scientific=FALSE)
[1] "    0.3164561" "29707.0000000" "93874.0000000"

# Method 2. Change the global option
> options(scipen=999)
> numer/denom
[1] 0.3164561
> c(numer/denom, numer, denom)
[1]     0.3164561 29707.0000000 93874.0000000
> c(4/5, numer, denom)
[1]     0.8 29707.0 93874.0

Suppress warnings

Use options(). If warn is negative all warnings are ignored. If warn is zero (the default) warnings are stored until the top--level function returns.

op <- options("warn")
options(warn = -1)
....
options(op)

# OR
warnLevel <- options()$warn
options(warn = -1)
...
options(warn = warnLevel)

Converts warnings into errors

options(warn=2)

sprintf

paste, paste0, sprintf

this post

Format number as fixed width, with leading zeros

# sprintf()
a <- seq(1,101,25)
sprintf("name_%03d", a)
[1] "name_001" "name_026" "name_051" "name_076" "name_101"

# formatC()
paste("name", formatC(a, width=3, flag="0"), sep="_")
[1] "name_001" "name_026" "name_051" "name_076" "name_101"

# gsub()
paste0("bm", gsub(" ", "0", format(5:15)))
# [1] "bm05" "bm06" "bm07" "bm08" "bm09" "bm10" "bm11" "bm12" "bm13" "bm14" "bm15"

Creating publication quality graphs in R

HDF5 : Hierarchical Data Format

HDF5 is an open binary file format for storing and managing large, complex datasets. The file format was developed by the HDF Group, and is widely used in scientific computing.

> h5ls(destination_file)
   group                           name       otype  dclass           dim
0      /                           data   H5I_GROUP                      
1  /data                     expression H5I_DATASET INTEGER 35238 x 65429
2      /                           info   H5I_GROUP                      
3  /info                         author H5I_DATASET  STRING             1
4  /info                        contact H5I_DATASET  STRING             1
5  /info                  creation-date H5I_DATASET  STRING             1
6  /info                            lab H5I_DATASET  STRING             1
7  /info                        version H5I_DATASET  STRING             1
8      /                           meta   H5I_GROUP                      
9  /meta           Sample_channel_count H5I_DATASET  STRING         65429
10 /meta     Sample_characteristics_ch1 H5I_DATASET  STRING         65429
11 /meta         Sample_contact_address H5I_DATASET  STRING         65429
12 /meta            Sample_contact_city H5I_DATASET  STRING         65429
13 /meta         Sample_contact_country H5I_DATASET  STRING         65429
14 /meta      Sample_contact_department H5I_DATASET  STRING         65429
15 /meta           Sample_contact_email H5I_DATASET  STRING         65429
16 /meta       Sample_contact_institute H5I_DATASET  STRING         65429
17 /meta      Sample_contact_laboratory H5I_DATASET  STRING         65429
18 /meta            Sample_contact_name H5I_DATASET  STRING         65429
19 /meta           Sample_contact_phone H5I_DATASET  STRING         65429
20 /meta Sample_contact_zip-postal_code H5I_DATASET  STRING         65429
21 /meta         Sample_data_processing H5I_DATASET  STRING         65429
22 /meta          Sample_data_row_count H5I_DATASET  STRING         65429
23 /meta             Sample_description H5I_DATASET  STRING         65429
24 /meta    Sample_extract_protocol_ch1 H5I_DATASET  STRING         65429
25 /meta           Sample_geo_accession H5I_DATASET  STRING         65429
26 /meta        Sample_instrument_model H5I_DATASET  STRING         65429
27 /meta        Sample_last_update_date H5I_DATASET  STRING         65429
28 /meta       Sample_library_selection H5I_DATASET  STRING         65429
29 /meta          Sample_library_source H5I_DATASET  STRING         65429
30 /meta        Sample_library_strategy H5I_DATASET  STRING         65429
31 /meta            Sample_molecule_ch1 H5I_DATASET  STRING         65429
32 /meta            Sample_organism_ch1 H5I_DATASET  STRING         65429
33 /meta             Sample_platform_id H5I_DATASET  STRING         65429
34 /meta                Sample_relation H5I_DATASET  STRING         65429
35 /meta               Sample_series_id H5I_DATASET  STRING         65429
36 /meta         Sample_source_name_ch1 H5I_DATASET  STRING         65429
37 /meta                  Sample_status H5I_DATASET  STRING         65429
38 /meta         Sample_submission_date H5I_DATASET  STRING         65429
39 /meta    Sample_supplementary_file_1 H5I_DATASET  STRING         65429
40 /meta    Sample_supplementary_file_2 H5I_DATASET  STRING         65429
41 /meta               Sample_taxid_ch1 H5I_DATASET  STRING         65429
42 /meta                   Sample_title H5I_DATASET  STRING         65429
43 /meta                    Sample_type H5I_DATASET  STRING         65429
44 /meta                          genes H5I_DATASET  STRING         35238

Formats for writing/saving and sharing data

Efficiently Saving and Sharing Data in R

Write unix format files on Windows and vice versa

https://stat.ethz.ch/pipermail/r-devel/2012-April/063931.html

with() and within() functions

within() is similar to with() except it is used to create new columns and merge them with the original data sets. See youtube video.

closePr <- with(mariokart, totalPr - shipPr)
head(closePr, 20)

mk <- within(mariokart, {
             closePr <- totalPr - shipPr
     })
head(mk) # new column closePr

mk <- mariokart
aggregate(. ~ wheels + cond, mk, mean)
# create mean according to each level of (wheels, cond)

aggregate(totalPr ~ wheels + cond, mk, mean)

tapply(mk$totalPr, mk[, c("wheels", "cond")], mean)

stem(): stem-and-leaf plot, bar chart on terminals

Graphical Parameters, Axes and Text, Combining Plots

statmethods.net

15 Questions All R Users Have About Plots

See http://blog.datacamp.com/15-questions-about-r-plots/. This is a tremendous post. It covers the built-in plot() function and ggplot() from ggplot2 package.

  1. How To Draw An Empty R Plot? plot.new()
  2. How To Set The Axis Labels And Title Of The R Plots?
  3. How To Add And Change The Spacing Of The Tick Marks Of Your R Plot? axis()
  4. How To Create Two Different X- or Y-axes? par(new=TRUE), axis(), mtext()
  5. How To Add Or Change The R Plot’s Legend? legend()
  6. How To Draw A Grid In Your R Plot? grid()
  7. How To Draw A Plot With A PNG As Background? rasterImage() from the png package
  8. How To Adjust The Size Of Points In An R Plot? cex argument
  9. How To Fit A Smooth Curve To Your R Data? loess() and lines()
  10. How To Add Error Bars In An R Plot? arrows()
  11. How To Save A Plot As An Image On Disc
  12. How To Plot Two R Plots Next To Each Other? par(mfrow), gridBase package, lattice package
  13. How To Plot Multiple Lines Or Points? plot(), lines()
  14. How To Fix The Aspect Ratio For Your R Plots? asp parameter
  15. What Is The Function Of hjust And vjust In ggplot2?

jitter function

Scatterplot with the "rug" function

require(stats)  # both 'density' and its default method
with(faithful, {
    plot(density(eruptions, bw = 0.15))
    rug(eruptions)
    rug(jitter(eruptions, amount = 0.01), side = 3, col = "light blue")
})

File:RugFunction.png

See also the stripchart() function which produces one dimensional scatter plots (or dot plots) of the given data.

Identify/Locate Points in a Scatter Plot

?identify

Draw a single plot with two different y-axes

Draw Color Palette

Default palette

palette() # black, red, green3, blue, cyan, magenta, yellow, gray

SVG

Embed svg in html

svglite

https://blog.rstudio.org/2016/11/14/svglite-1-2-0/

pdf -> svg

Using Inkscape. See this post.

read.table

clipboard

source("clipboard")
read.table("clipboard")

inline text

mydf <- read.table(header=T, text='
 cond yval
    A 2
    B 2.5
    C 1.6
')

http(s) connection

temp = getURL("https://gist.github.com/arraytools/6743826/raw/23c8b0bc4b8f0d1bfe1c2fad985ca2e091aeb916/ip.txt", 
                           ssl.verifypeer = FALSE)
ip <- read.table(textConnection(temp), as.is=TRUE)

read only specific columns

Use 'colClasses' option in read.table, read.delim, .... For example, the following example reads only the 3rd column of the text file and also changes its data type from a data frame to a vector. Note that we have include double quotes around NULL.

x <- read.table("var_annot.vcf", colClasses = c(rep("NULL", 2), "character", rep("NULL", 7)), 
                skip=62, header=T, stringsAsFactors = FALSE)[, 1]
# 
system.time(x <- read.delim("Methylation450k.txt", 
                colClasses = c("character", "numeric", rep("NULL", 188)), stringsAsFactors = FALSE))

To know the number of columns, we might want to read the first row first.

library(magrittr)
scan("var_annot.vcf", sep="\t", what="character", skip=62, nlines=1, quiet=TRUE) %>% length()

Another method is to use pipe(), cut or awk. See ways to read only selected columns from a file into R

Serialization

If we want to pass an R object to C (use recv() function), we can use writeBin() to output the stream size and then use serialize() function to output the stream to a file. See the post on R mailing list.

> a <- list(1,2,3)
> a_serial <- serialize(a, NULL)
> a_length <- length(a_serial)
> a_length
[1] 70
> writeBin(as.integer(a_length), connection, endian="big")
> serialize(a, connection)

In C++ process, I receive one int variable first to get the length, and then read <length> bytes from the connection.

socketConnection

See ?socketconnection.

Simple example

from the socketConnection's manual.

Open one R session

con1 <- socketConnection(port = 22131, server = TRUE) # wait until a connection from some client
writeLines(LETTERS, con1)
close(con1)

Open another R session (client)

con2 <- socketConnection(Sys.info()["nodename"], port = 22131)
# as non-blocking, may need to loop for input
readLines(con2)
while(isIncomplete(con2)) {
   Sys.sleep(1)
   z <- readLines(con2)
   if(length(z)) print(z)
}
close(con2)

Use nc in client

The client does not have to be the R. We can use telnet, nc, etc. See the post here. For example, on the client machine, we can issue

nc localhost 22131   [ENTER]

Then the client will wait and show anything written from the server machine. The connection from nc will be terminated once close(con1) is given.

If I use the command

nc -v -w 2 localhost -z 22130-22135

then the connection will be established for a short time which means the cursor on the server machine will be returned. If we issue the above nc command again on the client machine it will show the connection to the port 22131 is refused. PS. "-w" switch denotes the number of seconds of the timeout for connects and final net reads.

Some post I don't have a chance to read. http://digitheadslabnotebook.blogspot.com/2010/09/how-to-send-http-put-request-from-r.html

Use curl command in client

On the server,

con1 <- socketConnection(port = 8080, server = TRUE)

On the client,

curl --trace-ascii debugdump.txt http://localhost:8080/

Then go to the server,

while(nchar(x <- readLines(con1, 1)) > 0) cat(x, "\n")

close(con1) # return cursor in the client machine

Use telnet command in client

On the server,

con1 <- socketConnection(port = 8080, server = TRUE)

On the client,

sudo apt-get install telnet
telnet localhost 8080
abcdefg
hijklmn
qestst

Go to the server,

readLines(con1, 1)
readLines(con1, 1)
readLines(con1, 1)
close(con1) # return cursor in the client machine

Some tutorial about using telnet on http request. And this is a summary of using telnet.

Subsetting

Subset assignment of R Language Definition and Manipulation of functions.

The result of the command x[3:5] <- 13:15 is as if the following had been executed

`*tmp*` <- x
x <- "[<-"(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)

Avoid Coercing Indices To Doubles

1 or 1L

Careful on NA value

See the example below. dplyr::filter() seems better.

R> mydf = data.frame(a=1:3, b=c(NA,5,6))
R> mydf[mydf$b >5, ]
    a  b
NA NA NA
3   3  6
R> mydf[which(mydf$b >5), ]
  a b
3 3 6
R> mydf %>% dplyr::filter(b > 5)
  a b
1 3 6

modelling

update()

Extract all variable names in lm(), glm(), ...

all.vars(formula(Model)[-2])

as.formula(): use a string in formula in lm(), glm(), ...

? as.formula
xnam <- paste("x", 1:25, sep="")
fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+")))
outcome <- "mpg"
variables <- c("cyl", "disp", "hp", "carb")

# Method 1. The 'Call' portion of the model is reported as “formula = f” 
# our modeling effort, 
# fully parameterized!
f <- as.formula(
  paste(outcome, 
        paste(variables, collapse = " + "), 
        sep = " ~ "))
print(f)
# mpg ~ cyl + disp + hp + carb

model <- lm(f, data = mtcars)
print(model)

# Call:
#   lm(formula = f, data = mtcars)
# 
# Coefficients:
#   (Intercept)          cyl         disp           hp         carb  
#     34.021595    -1.048523    -0.026906     0.009349    -0.926863  

# Method 2. eval() + bquote() + ".()"
format(terms(model))  #  or model$terms
# [1] "mpg ~ cyl + disp + hp + carb"

# The new line of code
model <- eval(bquote(   lm(.(f), data = mtcars)   ))

print(model)
# Call:
#   lm(formula = mpg ~ cyl + disp + hp + carb, data = mtcars)
# 
# Coefficients:
#   (Intercept)          cyl         disp           hp         carb  
#     34.021595    -1.048523    -0.026906     0.009349    -0.926863  

# Note if we skip ".()" operator
> eval(bquote(   lm(f, data = mtcars)   ))

Call:
lm(formula = f, data = mtcars)

Coefficients:
(Intercept)          cyl         disp           hp         carb  
  34.021595    -1.048523    -0.026906     0.009349    -0.926863 

Aggregating results from linear model

https://stats.stackexchange.com/a/6862

S3 and S4 methods

To get the source code of S4 methods, we can use showMethod(), getMethod() and showMethod(). For example

library(qrqc)
showMethods("gcPlot")
getMethod("gcPlot", "FASTQSummary") # get an error
showMethods("gcPlot", "FASTQSummary") # good.
  • Debug a S4 function
> library(genefilter) # Bioconductor
> showMethods("nsFilter")
Function: nsFilter (package genefilter)
eset="ExpressionSet"
> debug(nsFilter, signature="ExpressionSet")
library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
ir

class(ir)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"

getClassDef(class(ir))
## Class "IRanges" [package "IRanges"]
## 
## Slots:
##                                                                       
## Name:            start           width           NAMES     elementType
## Class:         integer         integer characterORNULL       character
##                                       
## Name:  elementMetadata        metadata
## Class: DataTableORNULL            list
## 
## Extends: 
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
## 
## Known Subclasses: "NormalIRanges"

See what methods work on an object

see what methods work on an object, e.g. a GRanges object:

methods(class="GRanges")

Or if you have an object, x:

methods(class=class(x))

View S3 function definition: double colon '::' and triple colon ':::' operators

?":::"

  • pkg::name returns the value of the exported variable name in namespace pkg
  • pkg:::name returns the value of the internal variable name
base::"+"
stats:::coef.default

Read the source code (include Fortran/C, S3 and S4 methods)

mcols() and DataFrame() from Bioc S4Vectors package

  • mcols: Get or set the metadata columns.
  • colData: SummarizedExperiment instances from GenomicRanges
  • DataFrame: The DataFrame class extends the DataTable virtual class and supports the storage of any type of object (with length and [ methods) as columns.

For example, in Shrinkage of logarithmic fold changes vignette of the DESeq2paper package

> mcols(ddsNoPrior[genes, ])
DataFrame with 2 rows and 21 columns
   baseMean   baseVar   allZero dispGeneEst    dispFit dispersion  dispIter dispOutlier   dispMAP
  <numeric> <numeric> <logical>   <numeric>  <numeric>  <numeric> <numeric>   <logical> <numeric>
1  163.5750  8904.607     FALSE  0.06263141 0.03862798  0.0577712         7       FALSE 0.0577712
2  175.3883 59643.515     FALSE  2.25306109 0.03807917  2.2530611        12        TRUE 1.6011440
  Intercept strain_DBA.2J_vs_C57BL.6J SE_Intercept SE_strain_DBA.2J_vs_C57BL.6J WaldStatistic_Intercept
  <numeric>                 <numeric>    <numeric>                    <numeric>               <numeric>
1  6.210188                  1.735829    0.1229354                    0.1636645               50.515872
2  6.234880                  1.823173    0.6870629                    0.9481865                9.074686
  WaldStatistic_strain_DBA.2J_vs_C57BL.6J WaldPvalue_Intercept WaldPvalue_strain_DBA.2J_vs_C57BL.6J
                                <numeric>            <numeric>                            <numeric>
1                                10.60602         0.000000e+00                         2.793908e-26
2                                 1.92280         1.140054e-19                         5.450522e-02
   betaConv  betaIter  deviance  maxCooks
  <logical> <numeric> <numeric> <numeric>
1      TRUE         3  210.4045 0.2648753
2      TRUE         9  243.7455 0.3248949

findInterval()

Related functions are cuts() and split(). See also

order(), rank() and sort()

If we want to find the indices of the first 25 genes with the smallest p-values, we can use order(pval)[1:25].

> x = sample(10)
> x
 [1]  4  3 10  7  5  8  6  1  9  2
> order(x)
 [1]  8 10  2  1  5  7  4  6  9  3
> rank(x)
 [1]  4  3 10  7  5  8  6  1  9  2
> rank(10*x)
 [1]  4  3 10  7  5  8  6  1  9  2

> x[order(x)]
 [1]  1  2  3  4  5  6  7  8  9 10
> sort(x)
 [1]  1  2  3  4  5  6  7  8  9 10

OS-dependent results on sorting string vector

Gene symbol case.

# mac: 
order(c("DC-UbP", "DC2")) # c(1,2)

# linux: 
order(c("DC-UbP", "DC2")) # c(2,1)

Affymetric id case.

# mac:
order(c("202800_at", "2028_s_at")) # [1] 2 1
sort(c("202800_at", "2028_s_at")) # [1] "2028_s_at" "202800_at"

# linux
order(c("202800_at", "2028_s_at")) # [1] 1 2
sort(c("202800_at", "2028_s_at")) # [1] "202800_at" "2028_s_at"

It does not matter if we include factor() on the character vector.

The difference is related to locale. See

# both mac and linux
stringr::str_order(c("202800_at", "2028_s_at")) # [1] 2 1
stringr::str_order(c("DC-UbP", "DC2")) # [1] 1 2

# Or setting the locale to "C"
Sys.setlocale("LC_ALL", "C"); sort(c("DC-UbP", "DC2"))
# Or
Sys.setlocale("LC_COLLATE", "C"); sort(c("DC-UbP", "DC2"))
# But not
Sys.setlocale("LC_ALL", "en_US.UTF-8"); sort(c("DC-UbP", "DC2"))

unique()

It seems it does not sort. ?unique.

# mac & linux
R> unique(c("DC-UbP", "DC2"))
[1] "DC-UbP" "DC2"

do.call

do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.

Below are some examples from the help.

  • Usage
do.call(what, args, quote = FALSE, envir = parent.frame())
# what: either a function or a non-empty character string naming the function to be called.
# args: a list of arguments to the function call. The names attribute of args gives the argument names.
# quote: a logical value indicating whether to quote the arguments.
# envir: an environment within which to evaluate the call. This will be most useful
#        if what is a character string and the arguments are symbols or quoted expressions.
  • do.call() is similar to lapply() but not the same.
> do.call("complex", list(imag = 1:3))
[1] 0+1i 0+2i 0+3i
> lapply(list(imag = 1:3), complex)
$imag
[1] 0+0i
> complex(imag=1:3)
[1] 0+1i 0+2i 0+3i
> tmp <- expand.grid(letters[1:2], 1:3, c("+", "-"))
> length(tmp)
[1] 3
> tmp[1:4,]
  Var1 Var2 Var3
1    a    1    +
2    b    1    +
3    a    2    +
4    b    2    +
> c(tmp, sep = "")
$Var1
 [1] a b a b a b a b a b a b
Levels: a b

$Var2
 [1] 1 1 2 2 3 3 1 1 2 2 3 3

$Var3
 [1] + + + + + + - - - - - -
Levels: + -

$sep
[1] ""
> do.call("paste", c(tmp, sep = ""))
 [1] "a1+" "b1+" "a2+" "b2+" "a3+" "b3+" "a1-" "b1-" "a2-" "b2-" "a3-"
[12] "b3-"
  • environment and quote arguments.
> A <- 2
> f <- function(x) print(x^2)
> env <- new.env()
> assign("A", 10, envir = env)
> assign("f", f, envir = env)
> f <- function(x) print(x)
> f(A)   
[1] 2
> do.call("f", list(A))
[1] 2
> do.call("f", list(A), envir = env)  
[1] 4
> do.call(f, list(A), envir = env)   
[1] 2                       # Why?

> eval(call("f", A))                      
[1] 2
> eval(call("f", quote(A)))               
[1] 2
> eval(call("f", A), envir = env)         
[1] 4
> eval(call("f", quote(A)), envir = env)  
[1] 100
> foo <- function(a=1, b=2, ...) { 
         list(arg=do.call(c, as.list(match.call())[-1])) 
  }
> foo()
$arg
NULL
> foo(a=1)
$arg
a 
1 
> foo(a=1, b=2, c=3)
$arg
a b c 
1 2 3 

do.call vs mapply

  • do.call() is doing what mapply() does but do.call() uses a list instead of multiple arguments. So do.call() more close to base::Map() function.
> mapply(paste, tmp[1], tmp[2], tmp[3], sep = "")
      Var1 
 [1,] "a1+"
 [2,] "b1+"
 [3,] "a2+"
 [4,] "b2+"
 [5,] "a3+"
 [6,] "b3+"
 [7,] "a1-"
 [8,] "b1-"
 [9,] "a2-"
[10,] "b2-"
[11,] "a3-"
[12,] "b3-"
# It does not work if we do not explicitly specify the arguments in mapply()
> mapply(paste, tmp, sep = "")
      Var1 Var2 Var3
 [1,] "a"  "1"  "+" 
 [2,] "b"  "1"  "+" 
 [3,] "a"  "2"  "+" 
 [4,] "b"  "2"  "+" 
 [5,] "a"  "3"  "+" 
 [6,] "b"  "3"  "+" 
 [7,] "a"  "1"  "-" 
 [8,] "b"  "1"  "-" 
 [9,] "a"  "2"  "-" 
[10,] "b"  "2"  "-" 
[11,] "a"  "3"  "-" 
[12,] "b"  "3"  "-" 
set.seed(1)
mapply(rweibull, 1, c(1, 10), MoreArgs=list(n=1))
# [1] 1.326108 9.885284
set.seed(1)
x <- replicate(1000, mapply(rweibull, 1, c(1, 10), MoreArgs=list(n=1)))
dim(x) # [1]  2 1000
rowMeans(x)
# [1]  1.032209 10.104131
set.seed(1); Vectorize(rweibull)(n=1, shape=1, scale=c(1, 10))
# [1] 1.326108 9.885284
set.seed(1); x <- replicate(1000, Vectorize(rweibull)(n=1, shape=1, scale=c(1, 10)))

do.call vs lapply

https://stackoverflow.com/a/10801883

  • lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
  • do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it. It is widely used, for example, to assemble lists into simpler structures (often with rbind or cbind).
  • Map applies a function to the corresponding elements of given vectors... Map is a simple wrapper to mapply which does not attempt to simplify the result, similar to Common Lisp's mapcar (with arguments being recycled, however). Future versions may allow some control of the result type.
> lapply(iris, class) # same as Map(class, iris)
$Sepal.Length
[1] "numeric"

$Sepal.Width
[1] "numeric"

$Petal.Length
[1] "numeric"

$Petal.Width
[1] "numeric"

$Species
[1] "factor"

> x <- lapply(iris, class)
> do.call(c, x)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

https://stackoverflow.com/a/10801902

  • lapply applies a function over a list. So there will be several function calls.
  • do.call calls a function with a list of arguments (... argument) such as c() or rbind()/cbind() or sum or order or "[" or paste. So there is only one function call.
> X <- list(1:3,4:6,7:9)
> lapply(X,mean)
1
[1] 2

2
[1] 5

3
[1] 8
> do.call(sum, X)
[1] 45
> sum(c(1,2,3), c(4,5,6), c(7,8,9))
[1] 45
> do.call(mean, X) # Error
> do.call(rbind,X)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> lapply(X,rbind)
1
     [,1] [,2] [,3]
[1,]    1    2    3

2
     [,1] [,2] [,3]
[1,]    4    5    6

3
     [,1] [,2] [,3]
[1,]    7    8    9
> mapply(mean, X, trim=c(0,0.5,0.1))
[1] 2 5 8
> mapply(mean, X) 
[1] 2 5 8

Below is a good example to show the difference of lapply() and do.call() - Generating Random Strings.

> set.seed(1)
> x <- replicate(2, sample(LETTERS, 4), FALSE)
> x
1
[1] "Y" "D" "G" "A"

2
[1] "B" "W" "K" "N"

> lapply(x, paste0)
1
[1] "Y" "D" "G" "A"

2
[1] "B" "W" "K" "N"

> lapply(x, paste0, collapse= )
1
[1] "YDGA"

2
[1] "BWKN"

> do.call(paste0, x)
[1] "YB" "DW" "GK" "AN"

do.call + rbind + lapply

Lots of examples. See for example this one for creating a data frame from a vector.

x <- readLines(textConnection("---CLUSTER 1 ---
 3
 4
 5
 6
 ---CLUSTER 2 ---
 9
 10
 8
 11"))

 # create a list of where the 'clusters' are
 clust <- c(grep("CLUSTER", x), length(x) + 1L)

 # get size of each cluster
 clustSize <- diff(clust) - 1L

 # get cluster number
 clustNum <- gsub("[^0-9]+", "", x[grep("CLUSTER", x)])

 result <- do.call(rbind, lapply(seq(length(clustNum)), function(.cl){
     cbind(Object = x[seq(clust[.cl] + 1L, length = clustSize[.cl])]
         , Cluster = .cl
         )
     }))

 result

     Object Cluster
[1,] "3"    "1"
[2,] "4"    "1"
[3,] "5"    "1"
[4,] "6"    "1"
[5,] "9"    "2"
[6,] "10"   "2"
[7,] "8"    "2"
[8,] "11"   "2"

A 2nd example is to sort a data frame by using do.call(order, list()).

Another example is to reproduce aggregate(). aggregate() = do.call() + by().

attach(mtcars)
do.call(rbind, by(mtcars, list(cyl, vs), colMeans))
# the above approach give the same result as the following
# except it does not have an extra Group.x columns
aggregate(mtcars, list(cyl, vs), FUN=mean)

How to get examples from help file

See this post. Method 1:

example(acf, give.lines=TRUE)

Method 2:

Rd <- utils:::.getHelpFile(?acf)
tools::Rd2ex(Rd)

"[" and "[[" with the sapply() function

Suppose we want to extract string from the id like "ABC-123-XYZ" before the first hyphen.

sapply(strsplit("ABC-123-XYZ", "-"), "[", 1)

is the same as

sapply(strsplit("ABC-123-XYZ", "-"), function(x) x[1])

Dealing with date

d1 = date()
class(d1) # "character"
d2 = Sys.Date()
class(d2) # "Date"

format(d2, "%a %b %d")

library(lubridate); ymd("20140108") # "2014-01-08 UTC"
mdy("08/04/2013") # "2013-08-04 UTC"
dmy("03-04-2013") # "2013-04-03 UTC"
ymd_hms("2011-08-03 10:15:03") # "2011-08-03 10:15:03 UTC"
ymd_hms("2011-08-03 10:15:03", tz="Pacific/Auckland") 
# "2011-08-03 10:15:03 NZST"
?Sys.timezone
x = dmy(c("1jan2013", "2jan2013", "31mar2013", "30jul2013"))
wday(x[1]) # 3
wday(x[1], label=TRUE) # Tues

Nonstandard evaluation and deparse/substitute

  • Vignette from the lazyeval package. It is needed in three cases
    • Labelling: turn an argument into a label
    • Formulas
    • Dot-dot-dot
  • substitute(expr, env) - capture expression.
    • substitute() is often paired with deparse() to create informative labels for data sets and plots.
    • Use 'substitute' to include the variable's name in a plot title, e.g.: var <- "abc"; hist(var,main=substitute(paste("Dist of ", var))) will show the title "Dist of var" instead of "Dist of abc" in the title.
  • quote(expr) - similar to substitute() but do nothing?? noquote - print character strings without quotes
  • eval(expr, envir), evalq(expr, envir) - eval evaluates its first argument in the current scope before passing it to the evaluator: evalq avoids this.
  • deparse(expr) - turns unevaluated expressions into character strings. For example,
> deparse(args(lm))
[1] "function (formula, data, subset, weights, na.action, method = \"qr\", " 
[2] "    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, "
[3] "    contrasts = NULL, offset, ...) "                                    
[4] "NULL"     

> deparse(args(lm), width=20)
[1] "function (formula, data, "        "    subset, weights, "           
[3] "    na.action, method = \"qr\", " "    model = TRUE, x = FALSE, "   
[5] "    y = FALSE, qr = TRUE, "       "    singular.ok = TRUE, "        
[7] "    contrasts = NULL, "           "    offset, ...) "               
[9] "NULL"

Following is another example. Assume we have a bunch of functions (f1, f2, ...; each function implements a different algorithm) with same input arguments format (eg a1, a2). We like to run these function on the same data (to compare their performance).

f1 <- function(x) x+1; f2 <- function(x) x+2; f3 <- function(x) x+3

f1(1:3)
f2(1:3)
f3(1:3)

# Or
myfun <- function(f, a) {
    eval(parse(text = f))(a)
}
myfun("f1", 1:3)
myfun("f2", 1:3)
myfun("f3", 1:3)

# Or with lapply
method <- c("f1", "f2", "f3")
res <- lapply(method, function(M) {
                    Mres <- eval(parse(text = M))(1:3)
                    return(Mres)
})
names(res) <- method

The ‘…’ argument

See Section 10.4 of An Introduction to R. Especially, the expression list(...) evaluates all such arguments and returns them in a named list

Functions

Function argument

Argument matching from R Language Definition manual.

Argument matching is augmented by the functions

Access to the partial matching algorithm used by R is via pmatch.

Lazy evaluation in R functions arguments

R function arguments are lazy — they’re only evaluated if they’re actually used.

  • Example 1. By default, R function arguments are lazy.
f <- function(x) {
  999
}
f(stop("This is an error!"))
#> [1] 999
  • Example 2. If you want to ensure that an argument is evaluated you can use force().
add <- function(x) {
  force(x)
  function(y) x + y
}
adders2 <- lapply(1:10, add)
adders2[[1]](10)
#> [1] 11
adders2[[10]](10)
#> [1] 20
  • Example 3. Default arguments are evaluated inside the function.
f <- function(x = ls()) {
  a <- 1
  x
}

# ls() evaluated inside f:
f()
# [1] "a" "x"

# ls() evaluated in global environment:
f(ls())
# [1] "add"    "adders" "f" 
  • Example 4. Laziness is useful in if statements — the second statement below will be evaluated only if the first is true.
x <- NULL
if (!is.null(x) && x > 0) {

}

Use of functions as arguments

Just Quickly: The unexpected use of functions as arguments

Backtick sign, infix/prefix/postfix operators

The backtick sign ` (not the single quote) refers to functions or variables that have otherwise reserved or illegal names; e.g. '&&', '+', '(', 'for', 'if', etc. See some examples in this note and What do backticks do in R?.

infix operator.

1 + 2    # infix
+ 1 2    # prefix
1 2 +    # postfix

List data type

Calling a function given a list of arguments

> args <- list(c(1:10, NA, NA), na.rm = TRUE)
> do.call(mean, args)
[1] 5.5
> mean(c(1:10, NA, NA), na.rm = TRUE)
[1] 5.5

Error handling and exceptions, tryCatch(), stop(), warning() and message()

out <- try({
  a <- 1
  b <- "x"
  a + b
})

elements <- list(1:10, c(-1, 10), c(T, F), letters)
results <- lapply(elements, log)
is.error <- function(x) inherits(x, "try-error")
succeeded <- !sapply(results, is.error)
  • tryCatch(): With tryCatch() you map conditions to handlers (like switch()), named functions that are called with the condition as an input. Note that try() is a simplified version of tryCatch().
tryCatch(expr, ..., finally)

show_condition <- function(code) {
  tryCatch(code,
    error = function(c) "error",
    warning = function(c) "warning",
    message = function(c) "message"
  )
}
show_condition(stop("!"))
#> [1] "error"
show_condition(warning("?!"))
#> [1] "warning"
show_condition(message("?"))
#> [1] "message"
show_condition(10)
#> [1] 10

Below is another snippet from available.packages() function,

z <- tryCatch(download.file(....), error = identity)
if (!inherits(z, "error")) STATEMENTS

Using list type

Avoid if-else or switch

?plot.stepfun.

y0 <- c(1,2,4,3)
sfun0  <- stepfun(1:3, y0, f = 0)
sfun.2 <- stepfun(1:3, y0, f = .2)
sfun1  <- stepfun(1:3, y0, right = TRUE)

tt <- seq(0, 3, by = 0.1)
op <- par(mfrow = c(2,2))
plot(sfun0); plot(sfun0, xval = tt, add = TRUE, col.hor = "bisque")
plot(sfun.2);plot(sfun.2, xval = tt, add = TRUE, col = "orange") # all colors
plot(sfun1);lines(sfun1, xval = tt, col.hor = "coral")
##-- This is  revealing :
plot(sfun0, verticals = FALSE,
     main = "stepfun(x, y0, f=f)  for f = 0, .2, 1")

for(i in 1:3)
  lines(list(sfun0, sfun.2, stepfun(1:3, y0, f = 1))[[i]], col = i)
legend(2.5, 1.9, paste("f =", c(0, 0.2, 1)), col = 1:3, lty = 1, y.intersp = 1)

par(op)

File:StepfunExample.svg

Open a new Window device

X11() or dev.new()

par()

?par

text size and font on main, lab & axis

Examples:

  • cex.main=0.9
  • cex.lab=0.8
  • font.lab=2
  • cex.axis=0.8
  • font.axis=2
  • col.axis="grey50"

layout

http://datascienceplus.com/adding-text-to-r-plot/

reset the settings

op <- par(mfrow=c(2,1), mar = c(5,7,4,2) + 0.1) 
....
par(op) # mfrow=c(1,1), mar = c(5,4,4,2) + .1

mtext (margin text) vs title

mgp (axis label locations)

  1. The margin line (in ‘mex’ units) for the axis title, axis labels and axis line. Note that ‘mgp[1]’ affects ‘title’ whereas ‘mgp[2:3]’ affect ‘axis’. The default is ‘c(3, 1, 0)’. If we like to make the axis labels closer to an axis, we can use mgp=c(2.3, 1, 0) for example.
  2. http://rfunction.com/archives/1302 mgp – A numeric vector of length 3, which sets the axis label locations relative to the edge of the inner plot window. The first value represents the location the labels (i.e. xlab and ylab in plot), the second the tick-mark labels, and third the tick marks. The default is c(3, 1, 0).

pch

File:R pch.png

See here.

  • Full circle: pch=16

lty (line type)

File:R lty.png

See here.

las (label style)

0: The default, parallel to the axis

1: Always horizontal

2: Perpendicular to the axis

3: Always vertical

oma (outer margin), common title for two plots

The following trick is useful when we want to draw multiple plots with a common title.

par(mfrow=c(1,2),oma = c(0, 0, 2, 0))  # oma=c(0, 0, 0, 0) by default
plot(1:10,  main="Plot 1")
plot(1:100,  main="Plot 2")
mtext("Title for Two Plots", outer = TRUE, cex = 1.5) # outer=FALSE by default

Mastering R plot – Part 3: Outer margins mtext() & par(xpd).

Non-standard fonts in postscript and pdf graphics

https://cran.r-project.org/doc/Rnews/Rnews_2006-2.pdf#page=41


NULL, NA, NaN, Inf

https://tomaztsql.wordpress.com/2018/07/04/r-null-values-null-na-nan-inf/

save()/load() vs saveRDS()/readRDS() vs dput()/dget() vs dump()/source()

  1. saveRDS() can only save one R object while save() does not have this constraint.
  2. saveRDS() doesn’t save the both the object and its name it just saves a representation of the object. As a result, the saved object can be loaded into a named object within R that is different from the name it had when originally serialized. See this post.
x <- 5
saveRDS(x, "myfile.rds")
x2 <- readRDS("myfile.rds")
identical(mod, mod2, ignore.environment = TRUE)

dput: Writes an ASCII text representation of an R object. The object name is not written (unlike dump).

$ data(pbc, package = "survival")
$ names(pbc)
$ dput(names(pbc))
c("id", "time", "status", "trt", "age", "sex", "ascites", "hepato", 
"spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", 
"ast", "trig", "platelet", "protime", "stage")


==, all.equal(), identical()

  • ==: exact match
  • all.equal: compare R objects x and y testing ‘near equality’
  • identical: The safe and reliable way to test two objects for being exactly equal.
x <- 1.0; y <- 0.99999999999
all.equal(x, y)
# [1] TRUE
identical(x, y)
# [1] FALSE

See also the testhat package.

testhat

tinytest

tinytest: Lightweight but Feature Complete Unit Testing Framework

ttdo adds support of the 'diffobj' package for 'diff'-style comparison of R objects.

Numerical Pitfall

Numerical pitfalls in computing variance

.1 - .3/3
## [1] 0.00000000000000001388

Sys.getpid()

This can be used to monitor R process memory usage or stop the R process. See this post.

How to write R codes

  • Code smells and feels from R Consortium
    • write simple conditions,
    • handle class properly,
    • return and exit early,
    • polymorphism,
    • switch(),
    • case_when(),
    •  %||%.

How to debug an R code

Debug R

Locale bug (grep did not handle UTF-8 properly PR#16264)

https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16264

Path length in dir.create() (PR#17206)

https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17206 (Windows only)

install.package() error, R_LIBS_USER is empty in R 3.4.1 & .libPaths()

R_LIBS_USER=${R_LIBS_USER-'~/R/x86_64-pc-linux-gnu-library/3.4'}
R_LIBS_USER="${HOME}/R/${R_PLATFORM}-library/3.4"

On Mac & R 3.4.0 (it's fine)

> Sys.getenv("R_LIBS_USER")
[1] "~/Library/R/3.4/library"
> .libPaths()
[1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library"

On Linux & R 3.3.1 (ARM)

> Sys.getenv("R_LIBS_USER")
[1] "~/R/armv7l-unknown-linux-gnueabihf-library/3.3"
> .libPaths()
[1] "/home/$USER/R/armv7l-unknown-linux-gnueabihf-library/3.3"
[2] "/usr/local/lib/R/library"

On Linux & R 3.4.1 (*Problematic*)

> Sys.getenv("R_LIBS_USER")
[1] ""
> .libPaths()
[1] "/usr/local/lib/R/site-library" "/usr/lib/R/site-library"
[3] "/usr/lib/R/library"

I need to specify the lib parameter when I use the install.packages command.

> install.packages("devtools", "~/R/x86_64-pc-linux-gnu-library/3.4")
> library(devtools)
Error in library(devtools) : there is no package called 'devtools'

# Specify lib.loc parameter will not help with the dependency package
> library(devtools, lib.loc = "~/R/x86_64-pc-linux-gnu-library/3.4")
Error: package or namespace load failed for 'devtools':
 .onLoad failed in loadNamespace() for 'devtools', details:
  call: loadNamespace(name)
  error: there is no package called 'withr'

# A solution is to redefine .libPaths
> .libPaths(c("~/R/x86_64-pc-linux-gnu-library/3.4", .libPaths()))
> library(devtools) # Works

A better solution is to specify R_LIBS_USER in ~/.Renviron file or ~/.bash_profile; see ?Startup.

Using external data from within another package

https://logfc.wordpress.com/2017/03/02/using-external-data-from-within-another-package/

How to run R scripts from the command line

https://support.rstudio.com/hc/en-us/articles/218012917-How-to-run-R-scripts-from-the-command-line

How to exit a sourced R script

Decimal point & decimal comma

Countries using Arabic numerals with decimal comma (Austria, Belgium, Brazil France, Germany, Netherlands, Norway, South Africa, Spain, Sweden, ...) https://en.wikipedia.org/wiki/Decimal_mark

setting seed locally (not globally) in R

https://stackoverflow.com/questions/14324096/setting-seed-locally-not-globally-in-r

R's internal C API

https://github.com/hadley/r-internals

cleancall package for C resource cleanup

Resource Cleanup in C and the R API

Random number generator

#include <R.h>

void myunif(){
  GetRNGstate();
  double u = unif_rand();
  PutRNGstate();
  Rprintf("%f\n",u);
}
$ R CMD SHLIB r_rand.c
$ R
R> dyn.load("r_rand.so")
R> set.seed(1)
R> .C("myunif")
0.265509
list()
R> .C("myunif")
0.372124
list()
R> set.seed(1)
R> .C("myunif")
0.265509
list()

Random numbers: multivariate normal

Why MASS::mvrnorm() gives different result on Mac and Linux/Windows?

The reason could be the covariance matrix decomposition - and that may be due to the LAPACK/BLAS libraries. See

set.seed(1234)
junk <- biospear::simdata(n=500, p=500, q.main = 10, q.inter = 10, 
                          prob.tt = .5, m0=1, alpha.tt= -.5, 
                          beta.main= -.5, beta.inter= -.5, b.corr = .7, b.corr.by=25, 
                          wei.shape = 1, recr=3, fu=2, timefactor=1)
## Method 1: MASS::mvrnorm()
## This is simdata() has used. It gives different numbers on different OS.
## 
library(MASS)
set.seed(1234)
n <- 500
p <- 500
b.corr.by <- 25
b.corr <- .7
n.blocks <- p%/%b.corr.by
covMat <- diag(n.blocks) %x% 
  matrix(b.corr^abs(matrix(1:b.corr.by, b.corr.by, b.corr.by, byrow = TRUE) - 
                    matrix(1:b.corr.by, b.corr.by, b.corr.by)), b.corr.by, b.corr.by)
diag(covMat) <- 1
x <- MASS::mvrnorm(n, rep(0, p), Sigma = covMat)
range(x)
eS <- eigen(covMat, symmetric = TRUE) # used by MASS::mvrnorm()
eS$vectors[which(eS$vectors[, 500] != 0), 500] # mac and Linux have opposite sign

packageVersion("MASS")
# Mac: [1] ‘7.3.49’
# Linux: [1] ‘7.3.49’
# Windows: [1] ‘7.3.47’

R.version$version.string
# Mac: [1] "R version 3.4.3 (2017-11-30)"
# Linux: [1] "R version 3.4.4 (2018-03-15)"
# Windows: [1] "R version 3.4.3 (2017-11-30)"

## Method 2: mvtnorm::rmvnorm()
set.seed(1234)
x <- mvtnorm::rmvnorm(n=n, rep(0, p), sigma=covMat)
# NOTE: eigen is used by default
range(x)
# Mac: [1] -4.482566  4.459236
# Linux: [1] -4.482566  4.459236

## Method 3: mvnfast::rmvn()
set.seed(1234)
x <- mvnfast::rmvn(n, rep(0, p), covMat)
range(x)
# Mac: [1] -4.323585  4.355666
# Linux: [1] -4.323585  4.355666

library(microbenchmark)
library(MASS)
library(mvtnorm)
library(mvnfast)
microbenchmark(v1 <- rmvnorm(n=n, rep(0, p), sigma=covMat, "eigen"),
               v2 <- rmvnorm(n=n, rep(0, p), sigma=covMat, "svd"),
               v3 <- rmvnorm(n=n, rep(0, p), sigma=covMat, "chol"),
               v4 <- rmvn(n, rep(0, p), covMat),
               v5 <- mvrnorm(n, rep(0, p), Sigma = covMat))
Unit: milliseconds
expr       min        lq
v1 <- rmvnorm(n = n, rep(0, p), sigma = covMat, "eigen") 296.55374 300.81089
v2 <- rmvnorm(n = n, rep(0, p), sigma = covMat, "svd") 461.81867 466.98806
v3 <- rmvnorm(n = n, rep(0, p), sigma = covMat, "chol") 118.33759 120.01829
v4 <- rmvn(n, rep(0, p), covMat)  66.64675  69.89383
v5 <- mvrnorm(n, rep(0, p), Sigma = covMat) 291.19826 294.88038
mean    median        uq      max neval   cld
306.72485 301.99339 304.46662 335.6137   100    d 
478.58536 470.44085 493.89041 571.7990   100     e
125.85427 121.26185 122.21361 151.1658   100  b   
71.67996  70.52985  70.92923 100.2622   100 a    
301.88144 296.76028 299.50839 346.7049   100   c  

A little more investigation shows the eigen values differ a little bit on macOS and Linux.

set.seed(1234); x <- mvrnorm(n, rep(0, p), Sigma = covMat)
debug(mvrnorm) 
# eS --- macOS
# eS2 -- Linux
Browse[2]> range(abs(eS$values - eS2$values))
# [1] 0.000000e+00 1.776357e-15
Browse[2]> var(as.vector(eS$vectors))
[1] 0.002000006
Browse[2]> var(as.vector(eS2$vectors))
[1] 0.001999987
Browse[2]> all.equal(eS$values, eS2$values)
[1] TRUE
Browse[2]> which(eS$values != eS2$values)
  [1]   6   7   8   9  10  11  12  13  14  20  22  23  24  25  26  27  28  29
  ...
[451] 494 495 496 497 499 500
Browse[2]> range(abs(eS$vectors - eS2$vectors))
[1] 0.0000000 0.5636919

rle() running length encoding

citation()

citation()
citation("MASS")
toBibtex(citation())

Monitor memory usage

  • Windows: memory.size(max=TRUE)
  • Linux
    • RStudio: htop -p PID where PID is the process ID of /usr/lib/rstudio/bin/rsession, not /usr/lib/rstudio/bin/rstudio. This is obtained by running x <- rnorm(2*1e8). The object size can be obtained through print(object.size(x), units = "auto"). Note that 1e8*8/2^20 = 762.9395.
    • R: htop -p PID where PID is the process ID of /usr/lib/R/bin/exec/R. Alternatively, use htop -p `pgrep -f /usr/lib/R/bin/exec/R`
    • To find the peak memory usage grep VmPeak /proc/$PID/status

References:

Resource

Books

# R 3.4.1
.libPaths(c("~/R/x86_64-pc-linux-gnu-library/3.4", .libPaths()))
setwd("/tmp/efficientR/")
bookdown::render_book("index.Rmd", output_format = "bookdown::pdf_book")
# generated pdf file is located _book/_main.pdf

bookdown::render_book("index.Rmd", output_format = "bookdown::epub_book")
# generated epub file is located _book/_main.epub.
# This cannot be done in RStudio ("parse_dt" not resolved from current namespace (lubridate))
# but it is OK to run in an R terminal

Videos

“Do More with R” video tutorials. Search for R video tutorials by task, topic, or package. Most videos are shorter than 10 minutes.

Webinar

useR!

R consortium

https://www.youtube.com/channel/UC_R5smHVXRYGhZYDJsnXTwg/featured

Blogs, Tips, Socials, Communities

Bug Tracking System

https://bugs.r-project.org/bugzilla3/ and Search existing bug reports. Remember to select 'All' in the Status drop-down list.

Use sessionInfo().

License

Some Notes on GNU Licenses in R Packages