R
1 Setting up
1.1 Packages
library() lists the installed packages.
library(class) loads the package class.
search() lists the packages currently attached to the search path.
install.packages() and update.packages() install and update packages.
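The calls above can be tried out with a short sketch like the following. The package stats4 is only an illustrative choice (it ships with R); any installed package would do.

```r
# Names of the installed packages (show the first few only).
installed <- rownames(installed.packages())
head(installed)

# Attach a package that ships with R; stats4 is an arbitrary choice.
library(stats4)

# The attached package now appears on the search path.
"package:stats4" %in% search()
```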
1.2 Emacs ESS
To start an ESS session, run M-x R (M-x S starts an S process instead).
ESS creates a buffer that serves as the command interface.
Executing ?foo opens the R documentation for the function foo.
Org mode has Babel support for R, so C-c C-c on a source block evaluates it.
The first evaluation prompts to create a session.
A special feature of this Babel language is line-by-line evaluation with C-<RET> in the edit buffer.
Keep the session using the header:
#+PROPERTY: session *R*
To export a graph:
:file image.png :results output graphics :exports both
To export an ordinary result:
:exports both
To export some summary data:
:exports both :results output org
1.3 Interactive command
We need to know what is going on in the current workspace.
getwd() and setwd() get and set the current working directory.
ls() lists the objects currently stored; objects() is an alias for it.
rm(x, y, z) removes the named objects, and rm(list=ls()) removes all objects.
save.image("mydata.Rdata") and load("mydata.Rdata") save and load the workspace in the current directory, respectively.
source("a.R") runs a script.
sink("a.lis") redirects the output to a file, and sink() restores it to standard output.
help(lm) and ?lm show the documentation for a topic, ??solve searches the documentation, and example(topic), as its name indicates, runs the examples.
help.start() opens the HTML documentation page.
Finally, q() quits R.
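A minimal sketch of the workspace commands, run inside a temporary directory so that nothing in a real project is touched. It uses save() on two named objects rather than save.image(), which is an assumption made here to keep the example small; the variable names are illustrative.

```r
# Work inside a temporary directory; setwd() returns the previous one.
old_dir <- setwd(tempdir())

x <- 1:3
y <- "hello"
ls()                      # x and y are among the listed objects

# save() stores selected objects; save.image() would store the whole workspace.
save(x, y, file = "mydata.Rdata")
rm(x, y)

load("mydata.Rdata")      # x and y are back
sum(x)

setwd(old_dir)            # restore the original working directory
```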
1.4 Commonly used functions
- str() shows the structure of an arbitrary object
- summary() prints a summary
To see the data, you can use:
- dim()
- length()
- head()
- tail()
summary(dataset) shows information such as the max, min and mean.
class(data$col) gets the type of a column.
levels(data$col) gets the values if the column is a factor.
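As an illustration, these inspection functions can be applied to the built-in iris data set (chosen here only because it ships with R):

```r
data(iris)                  # built-in example data set

str(iris)                   # structure of the data frame
dim(iris)                   # 150 5
length(iris)                # 5, the number of columns
head(iris, 3)               # first three rows
tail(iris, 3)               # last three rows
summary(iris$Sepal.Length)  # min, quartiles, mean, max
class(iris$Species)         # "factor"
levels(iris$Species)        # "setosa" "versicolor" "virginica"
```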
2 Objects
Objects have a mode and a length. typeof() gets the type of an object, while mode() retrieves its mode. length() gets the length of a vector.
Objects have attributes:
- names is used for indexing
- dim is the dimension of a matrix
- dimnames is the character names of the dimensions
Missing values are NA, tested by is.na(). Illegal computations produce NaN, e.g. 0/0.
Compound objects include factors (a.k.a. enumerations) and data frames (lists of objects that all have the same length). Vectors are sequences of objects of the same mode. A list can contain objects of different modes. A data frame is a list with class "data.frame".
Selecting part of the data with a predicate is very useful. Inside subset(), column names do not need the data$ prefix; inside a plain bracket expression they do.
Type conversion: you can change the type of a vector with
- as.factor(x)
- as.numeric()
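A small sketch of these properties; all variable names are illustrative.

```r
x <- c(1.5, NA, 3)
typeof(x)        # "double"
mode(x)          # "numeric"
length(x)        # 3
is.na(x)         # FALSE TRUE FALSE
is.nan(0 / 0)    # TRUE

# Attributes: names on a vector.
names(x) <- c("a", "b", "c")
attributes(x)

# A factor stores labels as integer codes plus a levels attribute.
f <- as.factor(c("lo", "hi", "lo"))
levels(f)        # "hi" "lo"
as.numeric(f)    # 2 1 2 (the codes, not the labels)

# A data frame is a list with class "data.frame".
df <- data.frame(n = 1:2, s = c("u", "v"))
is.list(df)      # TRUE
class(df)        # "data.frame"
```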
2.1 Vector
Create a vector with c(): c(10.4, 5.6, 3.1).
This connects elements end to end, e.g. c(x, 0, x).
A sequence can be written as 1:30, which is the same as c(1, 2, ..., 29, 30).
The colon operator can also specify a backward sequence: 30:1.
The colon operator has higher precedence than arithmetic: 2*1:15 is the same as c(2, 4, …, 28, 30).
The colon operator covers the simple cases; for a more flexible and powerful expression, use seq():
- seq(2,10) produces 2:10.
- seq(-5, 5, by=.2)
- seq(length=51, from=-5, by=.2)
Repeating can be done with rep():
x <- c(1,2,3)
- rep(x, times=5): [1,2,3,1,2,3,...]
- rep(x, each=5): [1,1,1,1,1,2,2,2,2,2,...]
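The vector constructors above can be checked with a short sketch. The names s1 and s2 are illustrative, and length.out is the full name of the argument abbreviated as length in the notes.

```r
x <- c(10.4, 5.6, 3.1)
c(x, 0, x)                      # a vector of length 7

2 * 1:5                         # 2 4 6 8 10, since ':' binds tighter than '*'
30:28                           # 30 29 28

s1 <- seq(-5, 5, by = 0.2)
s2 <- seq(from = -5, by = 0.2, length.out = 51)
length(s1)                      # 51
all.equal(s1, s2)               # TRUE

rep(1:3, times = 2)             # 1 2 3 1 2 3
rep(1:3, each = 2)              # 1 1 2 2 3 3
```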
2.2 Indexing
Inside [], the index can be a number or a character string, or a vector of them.
For vectors, [] returns the selected elements.
For lists, [] returns a sub-list containing the selected elements, while [[]] returns the single element itself.
[[]] does not allow a vector as an index for selecting several elements.
If the index is an integer, it selects by position, starting from 1. If it is negative, it selects all elements except those at the given positions. The index 0 returns an empty vector. Other numeric values are truncated towards zero to an integer.
The index can be an integer vector, which selects several values.
If the index is a logical vector, the elements where it is TRUE are returned.
If the index is a character vector, it is matched against the names attribute of the vector; [] matches exactly, while $ allows partial matching.
$ can be used for indexing a list with a character name.
The empty index [] returns the entire vector with irrelevant attributes removed.
The only retained ones are the names, dim and dimnames attributes.
fruit <- c(5, 10, 1, 20)
names(fruit) <- c("orange", "banana", "apple", "peach")
lunch <- fruit[c("apple", "orange")]
# matrix / array
z <- 1:1500             # example data, so that dim<- has something to reshape
dim(z) <- c(3, 5, 100)
z[2,,]
z[,,]
A matrix can be created with the matrix() function:
matrix(1:9, nrow=3, byrow=TRUE)
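The indexing rules can be illustrated with a short sketch; the objects l, v and m are made up for the example.

```r
l <- list(a = 1, b = "two", c = 3:5)
l["a"]            # a list of length 1
l[["a"]]          # the number 1
l[c("a", "c")]    # a sub-list with two elements
l$c               # 3 4 5

v <- c(10, 20, 30, 40)
v[2]              # 20
v[-1]             # 20 30 40
v[c(1, 3)]        # 10 30
v[v > 15]         # 20 30 40
v[0]              # numeric(0)
v[2.9]            # 20, the index is truncated to 2

m <- matrix(1:9, nrow = 3, byrow = TRUE)
m[2, ]            # 4 5 6
m[, 1]            # 1 4 7
```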
2.3 Data frame
- na.omit() drops the rows of a data frame that contain NA values
A data frame is a list of equal-length vectors.
When data is read with read.csv(), the result is a data frame.
names() on a data frame returns the column names.
- Since a data frame is a list, indexing with [] also gives a list, i.e. a data frame, retaining the names. You can use a vector as the index.
- Indexing with [[]] gives the column values, dropping the names. You cannot use a vector as the index.
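A minimal sketch of these data frame operations on a made-up data frame (the column names id, score and grp are illustrative):

```r
df <- data.frame(id = 1:4,
                 score = c(3.5, NA, 4.0, 2.5),
                 grp = c("a", "b", "a", "b"))

names(df)                 # "id" "score" "grp"
class(df["score"])        # "data.frame"
class(df[["score"]])      # "numeric"
df[c("id", "grp")]        # a vector index selects several columns

clean <- na.omit(df)      # drops the row with the NA score
nrow(clean)               # 3

# Row selection with a predicate; subset() needs no df$ prefix.
subset(df, score > 3)
df[!is.na(df$score) & df$score > 3, ]
```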
2.4 data example
## (HEBI: Command line arguments)
args <- commandArgs(trailingOnly=TRUE)
csvfile <- args[1]
csv <- read.csv(csvfile, header=TRUE)
total_test <- dim(csv)[[1]]
sub <- subset(csv, reach_code >= 5)
total_reach_poi <- dim(sub)[[1]]
sub <- subset(csv, reach_code == 5 & status_code == 1)
total_fail_poi <- dim(sub)[[1]]
## keep all columns except the last two
sub <- sub[1:(length(csv)-2)]
## (HEBI: calling a function)
## TransferFunction is not defined in this excerpt
funcs <- TransferFunction(sub)

## (HEBI: define a function)
Constant <- function(data) {
  ## (HEBI: return value as a vector)
  ret <- c()
  ## (HEBI: a for loop using the vector as range)
  for (i in seq_along(data)) {
    col <- data[i]
    ## (HEBI: Get the name of a column)
    name <- names(col)
    if (substr(name, 1, 6) == "output") {
      ## (HEBI: remove of NA)
      newcol <- col[!is.na(col)]
      if (length(newcol) > 2) {
        value <- newcol[1]
        ## (HEBI: check the value of the vector is all the same)
        if (length(newcol[newcol != value]) == 0) {
          ## (HEBI: pushing a new value to the return vector)
          ## uses the column name; the original pasted the literal string "name = "
          ret <- c(ret, paste(name, "=", value))
        }
      }
    }
  }
  return(ret)
}
3 Operators
- arithmetic: + - * /, ^ for exponentiation, %% for modulus
- matrix: %*% matrix product, %o% outer product
- logic: !, then &, | for vectors (element-wise) and &&, || for single values
- relational: >, <, ==, <=, >=
- general: <-, -> assignments, $ list subset, : sequence, ~ for model formula
Built-in functions:
- log, exp, sin, cos, tan, sqrt
- min, max
- range: same as c(min(x), max(x))
- length(x), sum(x), prod(x) (product)
- mean(x): sum(x)/length(x)
- var(x): sum((x-mean(x))^2)/(length(x)-1)
- sort(x): increasing order; order() or sort.list() return the sorting permutation instead
- paste(..., sep=" "): takes an arbitrary number of arguments and concatenates them one by one into character strings. The arguments can be numeric.
- toString(8): convert an integer to a string
- round(x, digits=0)
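A few of these operators and functions in action; the values are arbitrary and the names A and x are illustrative.

```r
7 %% 3                          # 1
2 ^ 10                          # 1024

A <- matrix(1:4, nrow = 2)
A %*% A                         # matrix product
1:3 %o% 1:2                     # outer product, a 3 x 2 matrix

c(TRUE, FALSE) & c(TRUE, TRUE)  # TRUE FALSE (element-wise)
TRUE && FALSE                   # FALSE (single values only)

x <- c(4, 1, 3, 2)
range(x)                        # 1 4
all.equal(var(x), sum((x - mean(x))^2) / (length(x) - 1))  # TRUE
sort(x)                         # 1 2 3 4
order(x)                        # 2 4 3 1
paste("n", 8, sep = "=")        # "n=8"
toString(8)                     # "8"
round(3.14159, digits = 2)      # 3.14
```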
4 Control Structure
Compound statements work as in C: a body can be a single statement without braces.
4.1 Conditional
- if
if (STMT) STMT else if (STMT) STMT else STMT
- switch
switch (STMT, LIST)
  - STMT is evaluated first
  - if the value is a number between 1 and the length of LIST, the corresponding element of LIST is evaluated and returned
  - otherwise NULL is returned
  - Note that LIST is given as comma-separated arguments to switch, which means switch actually accepts ...
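A short sketch of both conditionals. The function grade is a made-up example; the last call shows the character form of switch(), which matches argument names rather than positions.

```r
grade <- function(score) {
  if (score >= 90) "A" else if (score >= 80) "B" else "C"
}
grade(85)                                # "B"

switch(2, "one", "two", "three")         # "two"
is.null(switch(5, "one", "two"))         # TRUE, the index is out of range

# With a character value, switch() matches argument names instead.
switch("b", a = "first", b = "second")   # "second"
```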
4.2 Loop
- for
for (NAME in VECTOR) STMT
- while
while (STMT) STMT
- repeat
repeat STMT
- break, next
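The three loop forms, together with next and break, in a minimal sketch (the counters are illustrative):

```r
total <- 0
for (i in 1:5) {
  if (i == 3) next        # skip 3
  total <- total + i
}
total                     # 12

n <- 0
while (n < 3) n <- n + 1  # single statement, no braces needed
n                         # 3

k <- 0
repeat {
  k <- k + 1
  if (k >= 4) break       # repeat needs an explicit break
}
k                         # 4
```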
5 Evaluation rules
- recycling rules
  - the shorter vector is recycled to the length of the longer one.
- dimensional attributes
  - the dimensions of two matrices must match; matrices are not recycled against each other.
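Both rules can be seen in a small sketch; try() is used here only to capture the error raised by mismatched matrices, and the names m, scaled and res are illustrative.

```r
1:6 + c(10, 20)               # 11 22 13 24 15 26, the shorter vector is recycled

m <- matrix(1:6, nrow = 2)
scaled <- m * c(1, 10)        # a plain vector is recycled down the columns
scaled[2, ]                   # 20 40 60

# Two matrices with different dimensions are an error, not recycled.
res <- try(m + matrix(1:4, nrow = 2), silent = TRUE)
inherits(res, "try-error")    # TRUE
```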
6 Function
function (ARGLIST) BODY
Each entry of the argument list can be a symbol, a symbol=value pair, or ... .
The body is an expression, usually a compound expression surrounded by {}.
A function can be assigned to a symbol.
The matching of formal and actual arguments is fairly tricky. It proceeds in three steps:
- exact matching on tags
- partial matching on tags
- positional matching; any remaining arguments are collected by ...
The result of partial matching must be unique, but arguments that matched exactly are excluded before this step is entered.
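The matching steps can be observed with a small sketch. The function f and its formals value and verbose are hypothetical, chosen so that the prefix v is ambiguous while ver is not.

```r
f <- function(value, verbose = FALSE, ...) {
  list(value = value, verbose = verbose, extra = list(...))
}

# Exact tag match first, then the unnamed argument fills 'value' by position.
r1 <- f(verbose = TRUE, 10)

# 'ver' partially matches 'verbose' (it is not a prefix of 'value').
r2 <- f(10, ver = TRUE)

# Anything left over, tagged or not, is collected by '...'.
r3 <- f(1, FALSE, 99, tag = "x")
length(r3$extra)               # 2

# 'v' is a prefix of both formals, so the partial match is not unique.
amb <- try(f(v = 1), silent = TRUE)
inherits(amb, "try-error")     # TRUE
```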
7 Quote
quote() wraps an expression into an object without evaluating it.
The resulting object typically has the mode call.
eval() is used to evaluate it.
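A minimal sketch of quoting and evaluating; the expression and the variable names are illustrative.

```r
e <- quote(a + b * 2)   # not evaluated, so a and b need not exist yet
mode(e)                 # "call"

a <- 1
b <- 3
eval(e)                 # 7

# The same expression can be evaluated against other values.
eval(e, list(a = 10, b = 0))   # 10
```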
8 Debugging
The print() function outputs the value of a variable.
To enter the debugger, a call to browser() inside the code suffices.
This lets you browse the values at that point.
A more powerful approach is to call debug() with the function name as argument.
Each time that function is called, you enter the debugger and can step through the execution.
Tracing is registered with trace() and removed with untrace(), given the name of the function.
The name might need to be quoted in some cases, so it is safest to always quote it with double quotes.
Every time the traced function is invoked, the call is printed as a trace.
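Since the debugger itself is interactive, the sketch below only shows how the debugging flag is set and cleared, on a made-up function square. Calling the function while the flag is set would open the debugger.

```r
square <- function(x) x^2

debug(square)            # the next call to square() would enter the debugger
isdebugged(square)       # TRUE
undebug(square)          # switch it off again before calling
isdebugged(square)       # FALSE

print(square(3))         # 9
```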
9 Data IO
- write
- write.table
- write.csv
- read.table("filename", header=TRUE, sep=",")
  - this ignores blank lines,
  - when header is not given, the first line is treated as a header if it has one field fewer than the body,
  - # starts a comment
- read.delim
- cat
  - outputs the data, no index, no newline
- attach(data)
  - puts the columns on the search path, so they can be used without the data$ prefix
- detach(data)
  - removes them again
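A round trip through a temporary CSV file, as a sketch. The data frame and its column names height and label are made up for the example.

```r
path <- tempfile(fileext = ".csv")
df <- data.frame(height = c(150L, 160L, 170L), label = c("a", "b", "c"))

write.csv(df, path, row.names = FALSE)
back <- read.table(path, header = TRUE, sep = ",")
identical(back$height, df$height)   # TRUE

cat("rows:", nrow(back), "\n")      # cat prints no index and no implicit newline

attach(back)                        # columns become visible by name
total <- sum(height)
detach(back)                        # remove them from the search path again
total                               # 480

unlink(path)                        # clean up the temporary file
```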
10 Models
10.1 Linear model.
fm = lm(y ~ x1 + x2, data = mydataframe)
The fitted model in the variable fm can be inspected with:
- coef: extract the coefficients
- deviance: the residual sum of squares
- formula: extract the model formula
- plot: produce four diagnostic plots (residuals, fitted values, diagnostics)
- predict(OBJECT, newdata=DATA.FRAME): use the model to predict
- residuals: extract the residuals
A model can be updated if the formula only changes a little.
In the following example, the . means the corresponding part of the original formula.
fs <- lm(y ~ x1 + x2, data=mydata)
fs <- update(fs, . ~ . + x3)
fs <- update(fs, sqrt(.) ~ .)
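The same workflow can be tried on the built-in mtcars data set. The choice of mpg, wt, hp and qsec as variables is purely illustrative.

```r
fm <- lm(mpg ~ wt + hp, data = mtcars)

coef(fm)                                        # intercept and two slopes
formula(fm)                                     # mpg ~ wt + hp
all.equal(deviance(fm), sum(residuals(fm)^2))   # TRUE

# Predict for a new observation.
predict(fm, newdata = data.frame(wt = 3, hp = 110))

# '.' stands for the corresponding part of the old formula.
fm2 <- update(fm, . ~ . + qsec)
fm3 <- update(fm2, sqrt(.) ~ .)
formula(fm3)                                    # sqrt(mpg) ~ wt + hp + qsec
```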
11 Plot
Process data:
- table
- cut(data, breaks=c(1,3,8))
11.1 Decoration
- box
- axis
- las attribute
- legend
- par
- text
- mtext
- points
11.2 Plot Types
11.2.1 plot
- lines
- abline
11.2.2 barplot
11.2.3 pie
11.2.4 boxplot
- quantile
11.2.5 hist
- lines(density(data))
11.2.6 TODO stem
11.2.7 TODO mosaicplot
11.2.8 pairs
11.3 Device Driver
When outputting an image, you have to tell R which format to use. The default on Linux is X11, which is why a window opens immediately after plotting. To use a device, call the device function; after that, all graphics output is sent to that device.
- X11
- png
- jpeg
When you have finished with a device, close it with dev.off().
To output to a file (TODO: open the plot in Emacs):
pdf("test1.pdf")
dev.control(displaylist = "enable")
plot(1:10)
dev.copy(pdf, "test2.pdf")
dev.off() # should now have a valid test2.pdf
dev.off() # finished
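A smaller sketch of the open, plot, close cycle, writing to a temporary file. The pdf device is used here because it needs no display; the file name is generated with tempfile() and is not part of the original notes.

```r
out <- tempfile(fileext = ".pdf")

pdf(out)                 # open the device; plots now go to the file
plot(1:10)
abline(h = 5)            # decorations go to the same device
dev.off()                # close the device so the file is complete

file.exists(out)         # TRUE
```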
12 Packages
12.1 ggplot2
library(ggplot2)
# tips is assumed to be a data frame with the columns totbill, tip, tipperc, sex, smoker and day
qplot(totbill, tip, geom="point", data=tips) # scatter plot
qplot(totbill, tip, geom="point", data=tips) + geom_smooth(method="lm") # with linear relationship line
qplot(tip, geom="histogram", data=tips) # histogram
qplot(tip, geom="histogram", binwidth=1, data=tips) # with custom binwidth
# box plots
qplot(sex, tipperc, geom="boxplot", data=tips)
qplot(smoker, tipperc, geom="boxplot", data=tips)
qplot(sex:smoker, tipperc, geom="boxplot", data=tips) # combine! plot the two sets of boxes in one graph
qplot(totbill, tip, geom="point", colour=day, data=tips) # scatter plot with colors, in regard to "day" column
12.2 plot(x, y, …)
Possible ... arguments:
- type: what type of plot: p for points, l for lines, b for both, h for histogram-like (or high-density) vertical lines
- main: an overall title for the plot: see title.
- xlab: a title for the x axis: see title.
- ylab: a title for the y axis: see title.
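The four type values can be compared with a short sketch that draws one page per type into a temporary PDF. The data and the file name are made up for the example.

```r
out <- tempfile(fileext = ".pdf")
pdf(out)                           # draw off-screen, one page per plot

x <- 1:10
y <- x^2
for (tp in c("p", "l", "b", "h")) {
  plot(x, y, type = tp,
       main = paste("type =", tp), # overall title
       xlab = "x", ylab = "x squared")
}

dev.off()
file.exists(out)                   # TRUE
```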
12.3 dplyr
A Grammar of Data Manipulation
https://cran.r-project.org/web/packages/dplyr/index.html
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html