flowchart LR
O[Objects] --> N[Numeric] --> Nt[integer<br>floating point]
O --> NN[Non-Numeric] --> log[logical] & ch[character] & F[function]
O --> Sh[Shape] --> Shs[scalar<br>vector<br>matrix<br>array]
Sh --> Li[list]
Li --> df[data.frame] & dt[data.table] & it[irregular tree]
O --> Ad["Addressing<br>(subscripts)"] --> sub[integer<br>logical<br>names]
O --> SV[Special Values] --> na[NA]
F --> cf[Common Functions<br>Write Your Own]
Dec[Decisions] --> btypes[if<br>ifelse<br>switch]

3.1 Assignment Operator

You assign a value to an R object using the assignment operator <- or the equal sign =. <- is read as “gets”.

x <- y
d <- read.csv('mydata.csv')
x = y

3.2 Object Types

Everything in R is an object. This includes functions, as demonstrated in the example below.

f <- sqrt   # copy the square root function as f
f(16)       # equivalent to sqrt(16)

[1] 4

3.2.1 Object Names

The names of objects are always case-sensitive in R. If the age variable is named Age, typing age may result in an “object not found” error. Valid symbols in object names are upper and lower case letters, numbers, periods, and underscores. A name may not begin with a number. There are contexts in which you may include any characters in object names, including spaces:

for extracting columns in data frames, data tables, matrices, and lists you can single- or double-quote the name, e.g. mydata[, "age in years"]

for wider contexts (e.g. statistical model formulas) you can put single back-ticks around the name
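As a minimal sketch of both contexts, assume a small made-up data frame `mydata` whose column names contain spaces (the data and the column names are hypothetical):

```r
# Hypothetical data; check.names=FALSE keeps the spaces in the column names
mydata <- data.frame(`age in years` = c(31, 45, 52, 60),
                     `systolic bp`  = c(118, 130, 141, 150),
                     check.names = FALSE)

# Quoted name when extracting a column
ages <- mydata[, "age in years"]

# Back-ticks when the name appears in a model formula
fit <- lm(`systolic bp` ~ `age in years`, data = mydata)
names(coef(fit))
```

Without `check.names = FALSE`, `data.frame` would convert the spaces to periods, so the quoted and back-ticked forms would no longer match the column names.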

3.2.2 Objects

Some primitive types of objects in R are below.

| Type | Meaning |
|------|---------|
| integer | whole numbers |
| logical | values of TRUE or FALSE |
| double | floating point numbers, including non-whole numbers |
| character | character strings |
| function | code defining a function |

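In the table below, objects of different shapes are described. rows and cols refer to vectors of integers or logicals, or if the elements of the object are named, character strings.

| Type | Example | Values Retrieved By |
|------|---------|---------------------|
| scalar | x <- 3 | x |
| vector | y <- c(1, 2, 5) | y[2] (2), y[2:3] (2, 5), y[-1] (2, 5), y[c(TRUE,FALSE,TRUE)] (1, 5) |
| named vector | y <- c(a=1, b=2, d=5) | y[2] (2), y['b'] (2), y[c('a','b')] (1, 2) |
| matrix | y <- cbind(1:3, 4:6) | y[rows,cols], y[rows,] (all cols), y[,cols] (all rows) |
| array | y <- array(1:30, dim=c(2,3,5)) | y[1,1,1] (1), y[2,3,5] (30) |
| list | x <- list(a='cat', b=c(1,3,7)) | x$a ('cat'), x[[1]] ('cat'), x[['a']] ('cat') |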

Named vectors provide an extremely quick table lookup and recoding capability.
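A minimal sketch of such a lookup, using made-up codes and labels: the names of the vector are the codes and its values are the labels, so subscripting by a character vector recodes every element at once.

```r
# Lookup table: names are the codes, values are the labels (made-up example)
lookup <- c(c = 'cat', d = 'dog', f = 'fox')
codes  <- c('d', 'd', 'f', 'c')

# Subscripting by name recodes all elements in one step
animals <- unname(lookup[codes])
animals
```

`unname` only drops the names carried over from `lookup`; omit it if keeping the original codes as names is useful.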

list objects are arbitrary trees and can have elements nested to any level. You can have lists of lists or lists of data frames/tables.
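To illustrate the nesting, here is a small sketch with a hypothetical list `study` that holds a character string, an inner list, and a data frame:

```r
# A made-up nested list: a string, a list of vectors, and a data frame
study <- list(title = 'pilot',
              sites = list(a = c(n = 10), b = c(n = 25)),
              data  = data.frame(id = 1:3, y = c(2.1, 3.4, 1.9)))

study$sites$b               # element b of the inner list
study[['data']]$y[2]        # second value of y in the nested data frame
str(study, max.level = 1)   # compact view of the top level of the tree
```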

Vectors can be of many different types when a class is added to them. Two of the most common are Dates and factors. Character strings are handled very efficiently in R so there is not always a need to store categorical variables as factors. But there is one reason: to order levels, i.e., distinct variable values, so that tabular and graphical output will list values in a more logical order than alphabetic. A factor variable has a levels attribute added to it to accomplish this. An example is x <- factor(x, 1:3, c('cat', 'dog', 'fox')) where the second argument 1:3 is the vector of possible numeric values x currently takes on (in order) and the three character strings are the corresponding levels. Internally factors are coded as integers, but they print as character strings.
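The following sketch uses made-up codes and the labels low/medium/high (chosen here only because their natural order differs from alphabetic order) to show the levels attribute, the internal integer coding, and the effect on tabular output:

```r
x <- c(3, 1, 2, 1, 3)                          # made-up numeric codes
x <- factor(x, 1:3, c('low', 'medium', 'high'))

x               # prints as character strings
levels(x)       # the levels attribute, in the order specified
as.integer(x)   # the internal integer codes
table(x)        # tabulated in level order, not alphabetic order
```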

Rectangular data objects, i.e., when the number of rows is the same for every column (variable), can be represented by matrices, arrays, data.frames, and data.tables. In a matrix or array, every value is of the same type. A data.frame or a data.table is an R list that can have mixtures of numeric, character, factor, dates, and other object types. A data.table is also a data.frame but the converse isn’t true. data.tables are handled by the R data.table package and don’t have row names but can be indexed, are much faster to process, and have a host of methods implemented for aggregation and other operations. data.frames are handled by base R.
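A short base R sketch of the difference between a matrix and a data frame, with made-up values (the `data.table` side is not shown because it needs the `data.table` package):

```r
m <- cbind(1:3, c('a', 'b', 'c'))   # mixing types in a matrix ...
typeof(m)                           # ... coerces every value to character

d <- data.frame(id = 1:3, grp = c('a', 'b', 'c'))
is.list(d)          # a data.frame is a list of equal-length columns
sapply(d, class)    # each column keeps its own type
```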

See Section 18.2 for an example of using arrays with named elements

Data frames are best managed by converting them to data tables and using the data.table package. When data.table is not used there are three indispensable functions for operating on data frames:

with for analyzing variables within a data frame without constantly prefixing variable names with dataframename$

transform for adding or changing variables within a data frame

Hmisc upData function for doing the same as transform but also allowing metadata to be added to the data, e.g., variable labels and units (to be discussed later)

Here are some examples of with and transform.

# Better than mean(mydata$systolic.bp - mydata$diastolic.bp) :
with(mydata, mean(systolic.bp - diastolic.bp))
# Better than mydata$pulse.pressure <- mydata$systolic.bp - mydata$diastolic.bp:
mydata <- transform(mydata,
                    pulse.pressure = systolic.bp - diastolic.bp,
                    bmi            = wt / ht ^ 2)
# Perform several operations on the same data frame
with(mydata, {
  x3 <- x1 / sqrt(x2)
  ols(y ~ x3)
} )
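To try `with` and `transform` without a real dataset, here is a self-contained sketch. The values, and the assumption that `wt` is in kg and `ht` in m, are made up for illustration.

```r
# Made-up data using the variable names from the examples above
mydata <- data.frame(systolic.bp  = c(120, 140, 135),
                     diastolic.bp = c( 80,  90,  85),
                     wt = c(70, 82, 64),         # assumed to be kg
                     ht = c(1.75, 1.80, 1.60))   # assumed to be m

with(mydata, mean(systolic.bp - diastolic.bp))

mydata <- transform(mydata,
                    pulse.pressure = systolic.bp - diastolic.bp,
                    bmi            = wt / ht ^ 2)
names(mydata)
```

Note that `transform` returns a new data frame, so the result must be assigned back to `mydata` for the new variables to be kept.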

3.3 Missing Values

R objects of any type can have elements whose values are missing. The symbol R uses for a missing value is NA. The is.na function returns TRUE/FALSE according to whether an element is missing. The following examples illustrate operations on NAs.

x <- c(1, 2, NA, 4, 5, NA)
mean(x)   # mean of all x

[1] NA

mean(x, na.rm=TRUE) # mean of non-missing x

[1] 3

is.na(x) # vector corresponding to x

[1] FALSE FALSE TRUE FALSE FALSE TRUE

sum(is.na(x)) # count # NAs

[1] 2

table(is.na(x)) # count # NAs and non-NAs

FALSE TRUE
4 2

x[!is.na(x)] # get the non-missing x's

[1] 1 2 4 5

x[1] <- NA   # make x[1] missing
x

[1] NA 2 NA 4 5 NA

y <- letters[1:6]   # first 6 lower case letters of alphabet
y[is.na(x)]         # get y for which x is NA

[1] "a" "c" "f"

As seen in the examples, most simple statistical summarization functions such as mean will result in NA if any element is NA, and you have to specify an optional argument na.rm=TRUE to remove NAs before computing so that the result will be, for example, the mean of the non-missing values.

3.4 Subscripting

Examples of subscripting are given above. Subscripting via placement of [] after an object name is used for subsetting, and occasionally for using some elements more than once:

x <- c('cat', 'dog', 'fox')
x[2:3]

[1] "dog" "fox"

x[c(1, 1, 3, 3, 2)]

[1] "cat" "cat" "fox" "fox" "dog"

Subscripting a variable or a data frame/table by a vector of TRUE/FALSE values is a very powerful feature of R. This is used to obtain elements satisfying one or more conditions:

x <- c(1, 2, 3, 2, 1, 4, 7)
y <- c(1, 8, 2, 3, 8, 9, 2)
x[y > 7]

[1] 2 1 4

The last line of code can be read as “values of x such that y > 7”.
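The same idea extends to data frames and to more than one condition. This is a sketch on a made-up data frame `d`; `subset` is a base R convenience that avoids repeating `d$`.

```r
d <- data.frame(x = c(1, 2, 3, 2, 1, 4, 7),
                y = c(1, 8, 2, 3, 8, 9, 2))

d[d$y > 7, ]                 # rows with y > 7, all columns
d[d$y > 7 & d$x < 4, 'x']    # x for rows meeting both conditions
subset(d, y > 7 | x == 7)    # rows meeting either condition
```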

3.5 Branching and If/Then

3.5.1 Decisions Based on One Scalar Value

Common approaches to this problem are if and switch.

type <- 'semiparametric'
f <- switch(type,
            parametric     = ols(y ~ x),
            semiparametric = orm(y ~ x),
            nonparametric  = rcorr(x, y, type='spearman'),
            { z <- y / x
              c(median=median(z), gmean=exp(mean(log(z)))) } )
# The last 2 lines are executed for any type other than the 3 listed
f <- if(type == 'parametric')     ols(y ~ x) else
     if(type == 'semiparametric') orm(y ~ x) else
     if(type == 'nonparametric')  rcorr(x, y, type='spearman') else {
       z <- y / x
       c(median=median(z), gmean=exp(mean(log(z))))
     }

What is inside if( ) must evaluate to a single TRUE or FALSE value, i.e., a scalar logical and not a vector of several values.
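A runnable sketch of `switch` and `if` that needs only base R follows. The function `center` and its `type` values are hypothetical names chosen for illustration; the final unnamed argument to `switch` is the default used when nothing matches.

```r
# Hypothetical helper: pick a measure of central tendency by name
center <- function(x, type = 'mean') {
  switch(type,
         mean   = mean(x),
         median = median(x),
         gmean  = exp(mean(log(x))),
         stop('unknown type: ', type))   # default when nothing matches
}

x <- c(1, 2, 4, 8, 100)
center(x, 'median')
center(x, 'gmean')

# A two-way choice written with if / else on a scalar condition
type <- 'median'
if(type == 'median') median(x) else mean(x)
```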

3.5.2 Series of Separate Decisions Over a Vector of Values

The ifelse or data.table::fifelse functions are most often used for this, but data.table::fcase is a little better. Here’s an example.

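x <- c('cat', 'dog', 'giraffe', 'elephant')
type <- ifelse(x %in% c('cat', 'dog'), 'domestic', 'wild')
type

[1] "domestic" "domestic" "wild" "wild"

require(data.table)
fcase(x %in% c('cat', 'dog'), 'domestic', default='wild')

[1] "domestic" "domestic" "wild" "wild"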
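With more than two outcomes, base R `ifelse` calls have to be nested, which is one reason a function taking a series of condition/value pairs such as `fcase` can be easier to read. A sketch of the nested form, using a made-up third category 'farm':

```r
x <- c('cat', 'dog', 'giraffe', 'elephant', 'cow')

# Each extra category adds one more level of nesting
type <- ifelse(x %in% c('cat', 'dog'),   'domestic',
        ifelse(x %in% c('cow', 'horse'), 'farm', 'wild'))
type
```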

3.5.3 if Trick

Sometimes when constructing variable-length vectors and other objects, elements are to be included in the newly constructed object only when certain conditions apply. When a condition does not apply, no element is to be inserted. We can capitalize on the fact that the result of if(...) is NULL when ... is not TRUE, and concatenating NULL results in ignoring it. Here are two examples. In the first the resulting vector will have length 2, 3, or 4 depending on sex and height. In the second example the new vector will have the appropriate element names preserved.

y <- 23; z <- 46; sex <- 'female'; height <- 71; u <- pi; w <- 7
c(y, z, if(sex == 'male') u, if(height > 70) w)

[1] 23 46  7

c(x1=3, if(sex == 'male') c(x2=4), if(height > 70) c(x3=height))

x1 x3
 3 71

# reduce clutter in case of variable name conflicts:
rm(y, z, sex, height, u, w)

3.6 Functions

There are so many functions in R that it may be better to use the stackoverflow.com Q&A to find the ones you need (as of 2022-05-26 there are 450,000 R questions there). Here are just a few of the multitude of handy R functions. The first functions listed below return the R missing value NA if any element is missing. You can specify na.rm=TRUE to remove NAs from consideration first, so they will not cause the result to be NA. Most functions get their arguments (inputs) in () after the function name. Some functions like %in% are binary operators whose two arguments are given on the left and right of %in%.

mean, median, quantile, var, sd: Compute statistical summaries on one vector

min, max: Minimum or maximum of values in a vector or of multiple variables, resulting in one number

pmin, pmax: Parallel minimum and maximum for vectors, resulting in a vector. Example: pmin(x, 3) returns a vector of the same length as x. Each element is the smaller of the original value and 3.

range: Returns a vector of length two with the minimum and maximum

table: Frequency tabulation and multi-way tabulations of any type of vector variables

unique: Return vector of distinct values, in same order as original values

union, intersect, setdiff, setequal: Set operations on two vectors (see below)

a %in% b, a %nin% b: Set membership functions that determine whether each element in a is in b (for %in%) or is not in b (for %nin%, which is in the Hmisc package)
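Several of the functions listed above can be tried on a small made-up vector containing one NA:

```r
x <- c(5, 3, 8, 3, NA, 1)    # made-up values with one NA

mean(x)                      # NA because an element is missing
mean(x, na.rm = TRUE)        # 4
range(x, na.rm = TRUE)       # 1 8
pmin(x, 3)                   # 3 3 3 3 NA 1 (NA stays NA)
unique(x)                    # 5 3 8 NA 1, in order of first appearance
table(x, useNA = 'ifany')    # frequencies, including a count of NAs
3 %in% x                     # TRUE
```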

Set operators are amazingly helpful. Here are some examples.

unique(x)                # vector of distinct values of x, including NA if occurred
sort(unique(x))          # distinct values in ascending order
setdiff(unique(x), NA)   # distinct values excluding NA if it occurred
duplicated(x)            # returns TRUE for elements that are duplicated by
                         # values occurring EARLIER in the list
union(x, y)              # find all distinct values in the union of x & y
intersect(x, y)          # find all distinct values in both x & y
setdiff(x, y)            # find all distinct x that are not in y
setequal(x, y)           # returns TRUE or FALSE depending on whether the distinct
                         # values of x and y are identical, ignoring how they
                         # are ordered
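Here is the same set of calls on two made-up vectors, with the result of each shown in a comment:

```r
x <- c(3, 1, 2, 3, NA, 1)
y <- c(2, 4, 3)

unique(x)                 # 3 1 2 NA
sort(unique(x))           # 1 2 3 (sort drops NA by default)
setdiff(unique(x), NA)    # 3 1 2
duplicated(x)             # FALSE FALSE FALSE TRUE FALSE TRUE
union(x, y)               # 3 1 2 NA 4
intersect(x, y)           # 3 2
setdiff(x, y)             # 1 NA
setequal(c(1, 2, 3), y)   # FALSE since y contains 4
```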

Find a list of subject ids that are found in baseline but not in follow-up datasets:

idn <- setdiff(baseline$id, followup$id)

Avoid repetition: Don’t say if(animal == 'cat' | animal == 'dog') ....; use %in% instead:

if(animal %in% c('cat', 'dog')) ...
# or if(animal %in% .q(cat, dog)) ... using Hmisc's .q

Likewise don’t say if(animal != 'cat' & animal != 'dog') but use if(animal %nin% c('cat', 'dog')) ...
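If `Hmisc` is not loaded, the same test can be written in base R by negating `%in%`. The local `%nin%` defined below is only a stand-in written for this sketch, not the `Hmisc` source.

```r
animal <- 'fox'

# Base R equivalent of animal %nin% c('cat', 'dog')
!(animal %in% c('cat', 'dog'))

# Stand-in definition for illustration; Hmisc provides its own %nin%
`%nin%` <- function(a, b) !(a %in% b)
animal %nin% c('cat', 'dog')
```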

To get documentation on a function type the following in the R console: ?functionname or ?packagename::functionname.

Even new R users can benefit from writing functions to reduce repetitive coding. A function has arguments, and these can have default values that are used when the user does not specify the argument in the function call. Here are some examples. One-line functions do not need to have their bodies enclosed in {}.

cuberoot <- function(x) x ^ (1/3)
cuberoot(8)

[1] 2

g <- function(x, power=2) {
  u <- abs(x - 0.5)
  u / (1. + u ^ power)
}
g(3, power=2)

[1] 0.3448276

g(3)

[1] 0.3448276

Write a function to make mean() drop missing values without our telling it.

mn <- function(x) mean(x, na.rm=TRUE)

Function to be used throughout the report to round fractional values by a default amount (here round to 0.001):

rnd <- function(x) round(x, 3)
# edit the 3 to change rounding anywhere in the report

A simple function to save coding when you need to recode multiple variables from 0/1 to no/yes:

yn <- function(x) factor(x, 0:1, c('no', 'yes'))
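A quick check of the three helper functions on made-up inputs (the definitions are repeated so the sketch runs on its own):

```r
mn  <- function(x) mean(x, na.rm=TRUE)
rnd <- function(x) round(x, 3)
yn  <- function(x) factor(x, 0:1, c('no', 'yes'))

x <- c(1.23456, NA, 2.5)   # made-up values
rnd(mn(x))                 # mean of the non-missing values, rounded: 1.867
yn(c(0, 1, 1, NA))         # no yes yes <NA>
```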

Even though functions described here returned simple results, many functions return complex tree-like objects (e.g., lists). The most common example is a statistical model-fitting function that returns a “fit object” containing estimated values such as regression coefficients, standard errors, \(R^2\), etc.
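As a sketch of what such a fit object looks like, the following uses base R's `lm` on `cars`, a small data frame that ships with R. The choice of model and dataset is only for illustration.

```r
# cars ships with R (speed and stopping distance)
fit <- lm(dist ~ speed, data = cars)

is.list(fit)         # the fit object is a list
names(fit)           # its components
coef(fit)            # regression coefficients
s <- summary(fit)    # another list, holding derived quantities
s$coefficients[, 'Std. Error']
s$r.squared
```

Extractor functions such as `coef` are generally preferable to reaching into the list directly, since they work across different types of fit objects.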

3.7 R Formula Language

R has a unified syntax for specification of statistical models. A model, or at least the major part of it, is specified by an R formula object, which is characterized by having ~ in it. The formula is almost always the first argument to a model fitting function, e.g., you may specify a standard linear model using lm(y ~ age + sex). The formula syntax has several useful effects:

Character and categorical (factor) variables are automatically expanded into the appropriate number of 0/1 indicator variables

An * in a formula automatically creates multiplicative interaction terms and adds lower-order terms (e.g., main effects). Though seldom used, you can also use : to generate product terms if you want to include lower-order terms manually.

Parentheses in a formula can be used to factor out repetitive interactions

Transformations (through function calls) can be part of formulas. Transformations can be 1-1, many-1, or 1-many:

1-1: take log or square root transformation on the fly

many-1: convert several variables or a matrix into a single column (e.g., first principal component)

1-many: expand a single column into a matrix to represent polynomials, spline functions, harmonic series, etc.

The last feature is all-powerful, as expanding one continuous variable into a multi-column matrix allows one to estimate the transformation the variable needs to receive to optimally fit the data.
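These effects can be seen without fitting a model by building the design matrix with base R's `model.matrix`. The data frame `d` below is made up, and `poly` stands in for any 1-many expansion.

```r
# Made-up data
d <- data.frame(age  = c(30, 40, 50, 60, 70, 80),
                sex  = factor(c('f', 'm', 'f', 'm', 'f', 'm')),
                race = factor(c('a', 'b', 'c', 'a', 'b', 'c')))

colnames(model.matrix(~ race, d))        # 3 levels -> 2 indicator columns
colnames(model.matrix(~ age * sex, d))   # main effects plus the age:sex product
colnames(model.matrix(~ log(age), d))    # 1-1 transformation on the fly
ncol(poly(d$age, 2))                     # 1-many: one variable -> 2 columns
```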

An R formula has a ~ in it that separates the left-hand side (dependent variable(s)) from the right-hand side (independent variables). Independent variables that do not interact (act additively) are separated by +. You can omit certain terms using the minus sign -. The following examples will help you to learn how to use the formula language.

response ~ terms
y ~ age + sex            # age + sex main effects
y ~ age + sex + age:sex  # add second-order interaction
y ~ age*sex              # second-order interaction +
                         # all main effects
y ~ (age + sex + sbp)^2  # age+sex+sbp+age:sex+age:sbp+sex:sbp
y ~ (age + sex + sbp)^2 - sex:sbp
                         # all main effects and all 2nd order
                         # interactions except sex:sbp
y ~ (age + race)*sex     # age+race+sex+age:sex+race:sex
y ~ treatment*(age*race + age*sex) # no interact. with race,sex
sqrt(y) ~ sex*sqrt(age) + race
                         # functions, with dummy variables generated if
                         # race is an R factor (classification) variable
y ~ sex + poly(age,2)    # poly generates orthogonal polynomials
                         # poly(age,2) is a matrix with 2 columns
race.sex <- interaction(race,sex)
y ~ age + race.sex       # for when you want indicator variables for
                         # all combinations of the factors
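One way to check how a formula expands is to ask base R's `terms` for its term labels; no data are needed. The helper name `tl` is made up for this sketch.

```r
# Hypothetical helper: list the terms a formula expands to
tl <- function(f) attr(terms(f), 'term.labels')

tl(y ~ age*sex)                         # age, sex, age:sex
tl(y ~ (age + sex + sbp)^2)             # 3 main effects + 3 two-way interactions
tl(y ~ (age + sex + sbp)^2 - sex:sbp)   # same, without sex:sbp
tl(y ~ (age + race)*sex)                # age, race, sex, age:sex, race:sex
```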

The update function is handy for re-fitting a model with changes in terms or data:

f <- lrm(y ~ rcs(x,4) + x2 + x3)   # lrm, rcs in rms package
f2 <- update(f, subset=sex=="male")
f3 <- update(f, .~.-x2)            # remove x2 from model
f4 <- update(f, .~. + rcs(x5,5))   # add rcs(x5,5) to model
f5 <- update(f, y2 ~ .)            # same terms, new response var.
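The same pattern can be tried with base R alone. This sketch substitutes `lm` and the `mtcars` data frame that ships with R for `lrm` and `rcs`; the variables chosen are arbitrary.

```r
f  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
f2 <- update(f, subset = am == 1)   # refit on a subset of rows
f3 <- update(f, . ~ . - hp)         # remove hp from model
f4 <- update(f, . ~ . + disp)       # add disp to model
f5 <- update(f, log(mpg) ~ .)       # same terms, new response

formula(f3)
formula(f5)
nobs(f2)                            # number of rows used in the subset fit
```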

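3.8 Resources for Learning R

Catalog of resources on Stackoverflow: https://stackoverflow.com/tags/r/info

Ten Simple Rules for Teaching Yourself R: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010372

Fast Lane to Learning R: https://github.com//matloff/fasteR

R Tutorials: https://r-bloggers.com/how-to-learn-r-2

R Programming Tutorials: https://youtube.com/user/marinstatlectures

R Bootcamp by Cole Beck: https://couthcommander.github.io/msci_rbootcamp/workshop.html

Swirlstats (interactive): https://swirlstats.com

For those who have used SPSS or SAS before: https://www.amazon.com/SAS-SPSS-Users-Statistics-Computing/dp/1461406846

R books on Amazon: http://amzn.to/15URiF6

UCLA site: https://stats.idre.ucla.edu/r

R Resources for Beginners: http://www.introductoryr.co.uk/R_Resources_for_Beginners.html

R for Data Science: https://r4ds.had.co.nz

Introduction to Data Science by Rafael Irizarry: https://rafalab.github.io/dsbook

R in Action: https://www.amazon.com/R-Action-Robert-Kabacoff/dp/1935182

Statistical modeling by Legler and Roback: https://bookdown.org/roback/bookdown-bysh

stackoverflow.com/tags/r is the best place for asking questions about the language and for learning from answers to past questions.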
