R repository

From 太極
Jump to navigation Jump to search

How to set up a local repository

Some information here may focus on Windows binary packages

General guide: http://cran.r-project.org/doc/manuals/R-admin.html#Setting-up-a-package-repository

Utilities such as install.packages can be pointed at any CRAN-style repository, and R users may want to set up their own. The ‘base’ of a repository is a URL such as http://www.omegahat.org/R/: this must be an URL scheme that download.packages supports (which also includes ‘ftp://’ and ‘file://’, but not on most systems ‘https://’). Under that base URL there should be directory trees for one or more of the following types of package distributions:

  • "source": located at src/contrib and containing .tar.gz files. Other forms of compression can be used, e.g. .tar.bz2 or .tar.xz files.
  • "win.binary": located at bin/windows/contrib/x.y for R versions x.y.z and containing .zip files for Windows.
  • "mac.binary.leopard": located at bin/macosx/leopard/contrib/x.y for R versions x.y.z and containing .tgz files.

Each terminal directory must also contain a PACKAGES file. This can be a concatenation of the DESCRIPTION files of the packages separated by blank lines, but only a few of the fields are needed. The simplest way to set up such a file is to use function write_PACKAGES in the tools package, and its help explains which fields are needed. Optionally there can also be a PACKAGES.gz file, a gzip-compressed version of PACKAGES—as this will be downloaded in preference to PACKAGES it should be included for large repositories. (If you have a mis-configured server that does not report correctly non-existent files you will need PACKAGES.gz.)

To add your repository to the list offered by setRepositories(), see the help file for that function.

A repository can contain subdirectories, when the descriptions in the PACKAGES file of packages in subdirectories must include a line of the form

Path: path/to/subdirectory

—once again write_PACKAGES is the simplest way to set this up.

Space requirement if we want to mirror WHOLE repository

  • Whole CRAN takes about 92GB (rsync -avn cran.r-project.org::CRAN > ~/Downloads/cran).
  • Bioconductor is big (> 64G for BioC 2.11). Please check the size of what will be transferred with e.g. (rsync -avn bioconductor.org::2.11 > ~/Downloads/bioc) and make sure you have enough room on your local disk before you start.

On the other hand, we if only care about Windows binary part, the space requirement is largely reduced.

  • CRAN: 2.7GB
  • Bioconductor: 28GB.

Misc notes

  • If the binary package was built on R 2.15.1, then it cannot be installed on R 2.15.2. But vice is OK.
  • Remember to issue "--delete" option in rsync, otherwise old version of package will be installed.
  • The repository still need src directory. If it is missing, we will get an error
Warning: unable to access index for repository http://arraytools.no-ip.org/CRAN/src/contrib
Warning message:
package ‘glmnet’ is not available (for R version 2.15.2) 

The error was given by available.packages() function.

To bypass the requirement of src directory, I can use

install.packages("glmnet", contriburl = contrib.url(getOption('repos'), "win.binary"))

but there may be a problem when we use biocLite() command.

I find a workaround. Since the error comes from missing CRAN/src directory, we just need to make sure the directory CRAN/src/contrib exists AND either CRAN/src/contrib/PACKAGES or CRAN/src/contrib/PACKAGES.gz exists.

To create CRAN repository

Before creating a local repository please give a dry run first. You don't want to be surprised how long will it take to mirror a directory.

Dry run (-n option). Pipe out the process to a text file for an examination.

rsync -avn cran.r-project.org::CRAN > crandryrun.txt

To mirror only partial repository, it is necessary to create directories before running rsync command.

cd
mkdir -p ~/Rmirror/CRAN/bin/windows/contrib/2.15
rsync -rtlzv --delete cran.r-project.org::CRAN/bin/windows/contrib/2.15/ ~/Rmirror/CRAN/bin/windows/contrib/2.15
(one line with space before ~/Rmirror)

# src directory is very large (~27GB) since it contains source code for each R version. 
# We just need the files PACKAGES and PACKAGES.gz in CRAN/src/contrib. So I comment out the following line.
# rsync -rtlzv --delete cran.r-project.org::CRAN/src/ ~/Rmirror/CRAN/src/
mkdir -p ~/Rmirror/CRAN/src/contrib
rsync -rtlzv --delete cran.r-project.org::CRAN/src/contrib/PACKAGES ~/Rmirror/CRAN/src/contrib/
rsync -rtlzv --delete cran.r-project.org::CRAN/src/contrib/PACKAGES.gz ~/Rmirror/CRAN/src/contrib/

And optionally

library(tools)
write_PACKAGES("~/Rmirror/CRAN/bin/windows/contrib/2.15", type="win.binary") 

and if we want to get src directory

rsync -rtlzv --delete cran.r-project.org::CRAN/src/contrib/*.tar.gz ~/Rmirror/CRAN/src/contrib/
rsync -rtlzv --delete cran.r-project.org::CRAN/src/contrib/2.15.3 ~/Rmirror/CRAN/src/contrib/

We can use du -h to check the folder size.

For example (as of 1/7/2013),

$ du -k ~/Rmirror --max-depth=1 --exclude ".*" | sort -nr | cut -f2 | xargs -d '\n' du -sh
30G	/home/brb/Rmirror
28G	/home/brb/Rmirror/Bioc
2.7G	/home/brb/Rmirror/CRAN

To create Bioconductor repository

Dry run

rsync -avn bioconductor.org::2.11 > biocdryrun.txt

Then creates directories before running rsync.

cd
mkdir -p ~/Rmirror/Bioc
wget -N http://www.bioconductor.org/biocLite.R -P ~/Rmirror/Bioc

where -N is to overwrite original file if the size or timestamp change and -P in wget means an output directory, not a file name.

Optionally, we can add the following in order to see the Bioconductor front page.

rsync -zrtlv  --delete bioconductor.org::2.11/BiocViews.html ~/Rmirror/Bioc/packages/2.11/
rsync -zrtlv  --delete bioconductor.org::2.11/index.html ~/Rmirror/Bioc/packages/2.11/

The software part (aka bioc directory) installation:

cd
mkdir -p ~/Rmirror/Bioc/packages/2.11/bioc/bin/windows
mkdir -p ~/Rmirror/Bioc/packages/2.11/bioc/src
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/bin/windows/ ~/Rmirror/Bioc/packages/2.11/bioc/bin/windows
# Either rsync whole src directory or just essential files
# rsync -zrtlv  --delete bioconductor.org::2.11/bioc/src/ ~/Rmirror/Bioc/packages/2.11/bioc/src
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/src/contrib/PACKAGES ~/Rmirror/Bioc/packages/2.11/bioc/src/contrib/
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/src/contrib/PACKAGES.gz ~/Rmirror/Bioc/packages/2.11/bioc/src/contrib/
# Optionally the html part
mkdir -p ~/Rmirror/Bioc/packages/2.11/bioc/html
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/html/ ~/Rmirror/Bioc/packages/2.11/bioc/html
mkdir -p ~/Rmirror/Bioc/packages/2.11/bioc/vignettes
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/vignettes/ ~/Rmirror/Bioc/packages/2.11/bioc/vignettes
mkdir -p ~/Rmirror/Bioc/packages/2.11/bioc/news
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/news/ ~/Rmirror/Bioc/packages/2.11/bioc/news
mkdir -p ~/Rmirror/Bioc/packages/2.11/bioc/licenses
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/licenses/ ~/Rmirror/Bioc/packages/2.11/bioc/licenses
mkdir -p ~/Rmirror/Bioc/packages/2.11/bioc/manuals
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/manuals/ ~/Rmirror/Bioc/packages/2.11/bioc/manuals
mkdir -p ~/Rmirror/Bioc/packages/2.11/bioc/readmes
rsync -zrtlv  --delete bioconductor.org::2.11/bioc/readmes/ ~/Rmirror/Bioc/packages/2.11/bioc/readmes

and annotation (aka data directory) part:

mkdir -p ~/Rmirror/Bioc/packages/2.11/data/annotation/bin/windows
mkdir -p ~/Rmirror/Bioc/packages/2.11/data/annotation/src/contrib
# one line for each of the following
rsync -zrtlv --delete bioconductor.org::2.11/data/annotation/bin/windows/ ~/Rmirror/Bioc/packages/2.11/data/annotation/bin/windows
rsync -zrtlv --delete bioconductor.org::2.11/data/annotation/src/contrib/PACKAGES ~/Rmirror/Bioc/packages/2.11/data/annotation/src/contrib/
rsync -zrtlv --delete bioconductor.org::2.11/data/annotation/src/contrib/PACKAGES.gz ~/Rmirror/Bioc/packages/2.11/data/annotation/src/contrib/

and experiment directory:

mkdir -p ~/Rmirror/Bioc/packages/2.11/data/experiment/bin/windows/contrib/2.15
mkdir -p ~/Rmirror/Bioc/packages/2.11/data/experiment/src/contrib
# one line for each of the following
# Note that we are cheating by only downloading PACKAGES and PACKAGES.gz files
rsync -zrtlv --delete bioconductor.org::2.11/data/experiment/bin/windows/contrib/2.15/PACKAGES ~/Rmirror/Bioc/packages/2.11/data/experiment/bin/windows/contrib/2.15/
rsync -zrtlv --delete bioconductor.org::2.11/data/experiment/bin/windows/contrib/2.15/PACKAGES.gz ~/Rmirror/Bioc/packages/2.11/data/experiment/bin/windows/contrib/2.15/
rsync -zrtlv --delete bioconductor.org::2.11/data/experiment/src/contrib/PACKAGES ~/Rmirror/Bioc/packages/2.11/data/experiment/src/contrib/
rsync -zrtlv --delete bioconductor.org::2.11/data/experiment/src/contrib/PACKAGES.gz ~/Rmirror/Bioc/packages/2.11/data/experiment/src/contrib/

and extra directory:

mkdir -p ~/Rmirror/Bioc/packages/2.11/extra/bin/windows/contrib/2.15
mkdir -p ~/Rmirror/Bioc/packages/2.11/extra/src/contrib
# one line for each of the following
# Note that we are cheating by only downloading PACKAGES and PACKAGES.gz files
rsync -zrtlv --delete bioconductor.org::2.11/extra/bin/windows/contrib/2.15/PACKAGES ~/Rmirror/Bioc/packages/2.11/extra/bin/windows/contrib/2.15/
rsync -zrtlv --delete bioconductor.org::2.11/extra/bin/windows/contrib/2.15/PACKAGES.gz ~/Rmirror/Bioc/packages/2.11/extra/bin/windows/contrib/2.15/
rsync -zrtlv --delete bioconductor.org::2.11/extra/src/contrib/PACKAGES ~/Rmirror/Bioc/packages/2.11/extra/src/contrib/
rsync -zrtlv --delete bioconductor.org::2.11/extra/src/contrib/PACKAGES.gz ~/Rmirror/Bioc/packages/2.11/extra/src/contrib/

sync Bioconductor software packages

To keep a copy of the bioc/source (software packages) code only,

$ mkdir -p ~/bioc_release/bioc/
$ rsync -zrtlv --delete master.bioconductor.org::release/bioc/src ~/bioc_release/bioc/

$ du -h ~/bioc_release/bioc/
# 20GB, 1565 items, Bioc 3.7

Note -z - compress file data during the transfer, -t - preserve modification times, -l copy symbolic links as symbolic links. The option -zrtlv can be replaced by the common options -avz.

To get the old versions of a packages (after the release of a version of Bioconductor), check out the Archive folder.

Now we can create a cron job to do sync. Note my observation is Bioconductor has a daily update around 10:45AM. So I set time at 11:00AM.

echo "00 11 * * * rsync -avz --delete master.bioconductor.org::release/bioc/src ~/bioc_release/bioc/" >> \
  ~/Documents/cronjob   # everyday at 6am & 1pm
crontab ~/Documents/cronjob
crontab -l

To test local repository

Create soft links in Apache server

su
ln -s /home/brb/Rmirror/CRAN /var/www/html/CRAN
ln -s /home/brb/Rmirror/Bioc /var/www/html/Bioc
ls -l /var/www/html

The soft link mode should be 777.

To test CRAN

Replace the host name arraytools.no-ip.org by IP address 10.133.2.111 if necessary.

r <- getOption("repos"); r["CRAN"] <- "http://arraytools.no-ip.org/CRAN"
options(repos=r)
install.packages("glmnet")

We can test if the backup server is working or not by installing a package which was removed from the CRAN. For example, 'ForImp' was removed from CRAN in 11/8/2012, but I still a local copy built on R 2.15.2 (run rsync on 11/6/2012).

r <- getOption("repos"); r["CRAN"] <- "http://cran.r-project.org"
r <- c(r, BRB='http://arraytools.no-ip.org/CRAN')
#                        CRAN                            CRANextra                                  BRB 
# "http://cran.r-project.org" "http://www.stats.ox.ac.uk/pub/RWin"   "http://arraytools.no-ip.org/CRAN"
options(repos=r)
install.packages('ForImp')

Note by default, CRAN mirror is selected interactively.

> getOption("repos")
                                CRAN                            CRANextra 
                            "@[email protected]" "http://www.stats.ox.ac.uk/pub/RWin" 

To test Bioconductor

# CRAN part:
r <- getOption("repos"); r["CRAN"] <- "http://arraytools.no-ip.org/CRAN"
options(repos=r)
# Bioconductor part:
options("BioC_mirror" = "http://arraytools.no-ip.org/Bioc")
source("http://bioconductor.org/biocLite.R")
# This source biocLite.R line can be placed either before or after the previous 2 lines
biocLite("aCGH")

If there is a connection problem, check folder attributes.

chmod -R 755 ~/CRAN/bin
  • Note that if a binary package was created for R 2.15.1, then it can be installed under R 2.15.1 but not R 2.15.2. The R console will show package xxx is not available (for R version 2.15.2).
  • For binary installs, the function also checks for the availability of a source package on the same repository, and reports if the source package has a later version, or is available but no binary version is.

So for example, if the mirror does not have contents under src directory, we need to run the following line in order to successfully run install.packages() function.

options(install.packages.check.source = "no")
  • If we only mirror the essential directories, we can run biocLite() successfully. However, the R console will give some warning
> biocLite("aCGH")
BioC_mirror: http://arraytools.no-ip.org/Bioc
Using Bioconductor version 2.11 (BiocInstaller 1.8.3), R version 2.15.
Installing package(s) 'aCGH'
Warning: unable to access index for repository http://arraytools.no-ip.org/Bioc/packages/2.11/data/experiment/src/contrib
Warning: unable to access index for repository http://arraytools.no-ip.org/Bioc/packages/2.11/extra/src/contrib
Warning: unable to access index for repository http://arraytools.no-ip.org/Bioc/packages/2.11/data/experiment/bin/windows/contrib/2.15
Warning: unable to access index for repository http://arraytools.no-ip.org/Bioc/packages/2.11/extra/bin/windows/contrib/2.15
trying URL 'http://arraytools.no-ip.org/Bioc/packages/2.11/bioc/bin/windows/contrib/2.15/aCGH_1.36.0.zip'
Content type 'application/zip' length 2431158 bytes (2.3 Mb)
opened URL
downloaded 2.3 Mb

package ‘aCGH’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\limingc\AppData\Local\Temp\Rtmp8IGGyG\downloaded_packages
Warning: unable to access index for repository http://arraytools.no-ip.org/Bioc/packages/2.11/data/experiment/bin/windows/contrib/2.15
Warning: unable to access index for repository http://arraytools.no-ip.org/Bioc/packages/2.11/extra/bin/windows/contrib/2.15
> library()

Repository directory structure

CRAN

The information below is specific to R 2.15.2. There are linux and macosx subdirecotries whenever there are windows subdirectory.

bin/winows/contrib/2.15
src/contrib
   /contrib/2.15.2
   /contrib/Archive
web/checks
   /dcmeta
   /packages
   /views

A clickable map [1]

Bioconductor

The information below is specific to Bioc 2.11 (R 2.15). There are linux and macosx subdirecotries whenever there are windows subdirectory.

bioc/bin/windows/contrib/2.15
    /html
    /install
    /license
    /manuals 
    /news
    /src
    /vignettes
data/annotation/bin/windows/contrib/2.15
               /html
               /licenses
               /manuals
               /src
               /vignettes
     /experiment/bin/windows/contrib/2.15
                /html
                /manuals
                /src/contrib
                /vignettes
extra/bin/windows/contrib
     /html
     /src
     /vignettes

Package download statistics

CRAN packages from RStudio

Query top downloaded packages, download statistics

Bioconductor packages

http://bioconductor.org/packages/stats/

List all R packages from CRAN/Bioconductor

See METACRAN for packages hosted on CRAN. The 'https://github.com/metacran/PACKAGES' file contains the latest update.