14  Graphics

flowchart LR
gg[ggplot2] --> ggp[ggplotly]
sh[Spike Histograms] --- sc[Scatterplots] --- dc[Dot Charts] --- eb[Extended<br>Box Plots]
sp[Frequency Scatterplots for Large Datasets]
sp --> gghex[ggplot2 Hexagonal Binning]
sp --> hs[Hmisc::ggfreqScatter]
md[Annotating Plots With<br>Labels and Units<br>From Metadata]

Cleveland’s The Elements of Graphing Data and Graphics in Scientific Publications are two of the best sources of how-to information on making scientific graphs. Much information may be found at hbiostat.org/bbr/descript.html and hbiostat.org/R/hreport, especially in these notes: hbiostat.org/doc/graphscourse.pdf. John Rauser has an exceptional video about principles of good graphics. Rafael Irizarry has an excellent chapter on graphics principles and recommendations. See datamethods.org/t/journal-graphics for graphical methods for journal articles.

Paul Murrell has an excellent summary of recommendations:

  • Display data values using position or length.
  • Use horizontal lengths in preference to vertical lengths.
  • Watch your data–ink ratio.
  • Think very carefully before using color to represent data values.
  • Do not use areas to represent data values.
  • Please do not use angles or slopes to represent data values.
  • Please, please do not use volumes to represent data values.

On the fifth point above, avoid the use of bars when representing a single number. Bar widths contain no information and get in the way of important information. This is addressed below.

R has superior graphics implemented in multiple models, including base graphics, ggplot2, and plotly.

For ggplot2, www.cookbook-r.com/Graphs contains a nice cookbook. See also learnr.wordpress.com. To get excellent documentation with examples for any ggplot2 function, google ggplot2 _functionname_. ggplot2 graphs can be converted into plotly graphics using the ggplotly function. But you will have more control using R plotly directly.

The older non-interactive graphics models, which are useful for producing printed and pdf output, are starting to be superseded by interactive graphics. One of the biggest advantages of the latter is the ability to present the most important graphic information front-and-center while allowing the user to hover the mouse over areas in the graphic to see tabular details.

I make heavy use of ggplot2, plotly and R base graphics. plotly is used for interactive graphics, and the R plotly package provides an amazing function ggplotly to convert a static ggplot2 graphics object to an interactive plotly one. If the user goes to the trouble of adding labels for graphics entities (usually points, lines, curves, rectangles, and circles) those labels can become hover text in plotly without disturbing anything in static graphics. As shown here you can sense whether an html or pdf report is being produced, and for html all ggplot2 objects can be automatically transformed to plotly.

With ggplotly extra text appears in front of labels, but the result of ggplotly can be run through Hmisc::ggplotlyr to remove this as shown in the example.
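
As a minimal sketch of how labels become hover text, the following uses simulated data; the data frame dd and its id variable are made up for illustration. The label aesthetic is ignored by geom_point in static output, and ggplotly is asked to show only that aesthetic in the hover text.

```r
library(ggplot2)
library(plotly)

set.seed(1)
# Simulated data; 'id' is a hypothetical labeling variable
dd <- data.frame(x  = rnorm(20),
                 y  = rnorm(20),
                 id = paste('Subject', 1:20))

# label= is not used by geom_point, so the static plot is unchanged
g <- ggplot(dd, aes(x = x, y = y, label = id)) + geom_point()

# Convert to plotly, showing only the label in the hover text
p <- ggplotly(g, tooltip = 'label')
```

Printing g gives the static graph and printing p gives the interactive one.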

Many types of graphs can be created with base graphics, e.g. hist(age, nclass=50) or Ecdf(age), but using ggplot2 for even simple graphics makes it easy to handle multiple groups on one graph or to create multiple panels for strata using faceting. ggplot2 has excellent default font sizes and axis labeling that work for most sizes of plots.
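
To illustrate faceting, here is a small sketch with simulated data (the variables age and sex in dd are invented for the example). It draws the ggplot2 counterpart of hist(age, nclass=50), with one panel per stratum.

```r
library(ggplot2)

set.seed(2)
# Simulated data; variable names are illustrative only
dd <- data.frame(age = rnorm(400, mean = 55, sd = 10),
                 sex = sample(c('female', 'male'), 400, replace = TRUE))

# One histogram panel per level of sex
g <- ggplot(dd, aes(x = age)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ sex)
```

Replacing facet_wrap(~ sex) with a color or fill aesthetic would instead put the groups on one panel.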

14.1 Recommended Graphics by Data Types

Let Y be the dependent (response) variable, also called the analysis variable, to display, and X denote an independent or descriptor variable.

14.1.1 Y Discrete

X Absent or Categorical

Compute proportions of Y categories and display using a dot chart invented by Bill Cleveland. Many examples are visible here. See R examples in Chapter 9 and here.

Dot charts can be produced using ggplot2 or a variety of Hmisc package functions.
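
A bare-bones ggplot2 dot chart might look like the following sketch. The category counts are made up, and the Hmisc functions mentioned above offer more features than this.

```r
library(ggplot2)

# Made-up category counts for illustration
counts <- c(None = 42, Mild = 31, Moderate = 19, Severe = 8)
w <- data.frame(category   = factor(names(counts), levels = names(counts)),
                proportion = as.numeric(counts) / sum(counts))

# Proportions are encoded by horizontal position; no bars are drawn
g <- ggplot(w, aes(x = proportion, y = category)) +
  geom_point() +
  xlim(0, 1) +
  xlab('Proportion') + ylab('')
```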

X Continuous

Use nonparametric smoothers or moving proportions exemplified in Chapter 15.
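
As a rough sketch of the smoother idea (not necessarily the method used in Chapter 15), a loess smooth of a 0/1 outcome estimates the proportion with Y = 1 as a function of X. The data below are simulated and all names are illustrative.

```r
library(ggplot2)

set.seed(3)
n  <- 500
x  <- runif(n, 20, 80)                      # simulated continuous X
y  <- rbinom(n, 1, plogis(-3 + 0.05 * x))   # simulated binary Y
dd <- data.frame(x, y)

# loess smooth of the 0/1 outcome; raw points are omitted
# because they fall only at 0 and 1
g <- ggplot(dd, aes(x = x, y = y)) +
  geom_smooth(method = 'loess', se = FALSE) +
  coord_cartesian(ylim = c(0, 1)) +
  ylab('Estimated proportion with Y = 1')
```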

14.1.2 Y Continuous

X Absent or Categorical

Bivariate With Continuous X

  • Scatterplot
  • Scatterplots for large datasets using color or gray scale to encode frequencies (Figure 14.2 and here)
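
For the second item, a minimal hexagonal binning sketch with ggplot2 follows. The data are simulated, the number of bins is an arbitrary choice, and geom_hex assumes that the hexbin package is installed.

```r
library(ggplot2)   # geom_hex also needs the hexbin package installed

set.seed(4)
dd <- data.frame(x = rnorm(50000), y = rnorm(50000))  # simulated data

# Fill color encodes the number of points falling in each hexagon
g <- ggplot(dd, aes(x = x, y = y)) +
  geom_hex(bins = 40) +
  scale_fill_viridis_c(name = 'Frequency')
```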

14.2 ggplot2

Here is a prototypical ggplot2 example illustrating many of the features I most often use. Ignore the ggplot2 label attribute if not using plotly. Options are given to the Hmisc label function so that it will retrieve the variable label and units (if present) and format them for axis labels or tables. The formatting takes into account whether html output is being created and plotly is being used.

require(Hmisc)     # label(), ggplotlyr()
require(ggplot2)
ishtml <- knitr::is_html_output()
hookaddcap()   # make knitr call a function at the end of each chunk
               # to try to automatically add to the list of figures
# Create a vector of formatted labels for all variables in data
# For variables without labels or units use the variable name
# as the label.  If html and plotly are not in effect use R's
# regular plotmath notation to typeset labels/units

d <- stressEcho   # assumes stressEcho is already loaded, e.g. by getHdata(stressEcho)
nam   <- names(d)
nv    <- length(nam)
vlabs <- structure(character(nv), names=nam)
for(n in nam)
  vlabs[n] <- label(d[[n]], plot=TRUE, html=ishtml, default=n)

# Define substitutes for xlab and ylab that look up our
# constructed labels.
# Could instead directly use xlab(vlabs['age'])
labx <- function(v) xlab(vlabs[[as.character(substitute(v))]])
laby <- function(v) ylab(vlabs[[as.character(substitute(v))]])
g <-
  ggplot(d, aes(x=age, y=bhr, color=gender, label=paste0('dose:', dose))) +
         geom_point() + geom_smooth() +
         scale_x_continuous(minor_breaks=seq(30, 80, by=5)) +  # minor tick marks
         guides(color=guide_legend(title='')) +
         theme(legend.position='bottom') +  # not respected by ggplotly
         labs(caption='Scatterplot of age by basal heart rate stratified by sex') +
         labx(age) + laby(bhr)
# or just xlab('Age in years') + ylab('Basal heart rate')
# To put the caption in a different font or size use e.g.
#   theme(plot.caption=element_text(family='mono', size=7))
# Likewise for the legend
#   theme(legend.text=element_text(family='mono', size=9))

ggplotlyr(g, remove='.*): ')  # removes paste0("dose:", dose): 

Figure 14.1: plotly translation of a ggplot2 graph making use of variable labels from a data table that are translated to use within-string html font changes

# dose is in hover text for each point

A simpler approach is to use ggplot2’s xlab and ylab on the result of a little generic function hlab that assumes the dataset being analyzed is d.

hlab <- function(x) {
  x <- as.character(substitute(x))   # unquoted variable name -> string
  label(d[[x]], plot=TRUE, default=x)
}
ggplot(...) + ... + xlab(hlab(age)) + ylab(hlab(bhr))

ggplot2 allows one to flexibly use transformed axes. Here is example syntax to plot the \(x\) variable on a log scale, taking charge of tick mark placement and adding minor divisions depicted with grid lines.

ggplot(d, aes(x, y)) + geom_point() +
  scale_x_continuous(trans='log',
    breaks=c(5, seq(10, 100, by=10)),
    minor_breaks=seq(5, 100, by=5))

14.3 Formatting Columns in Legends

If the text for the legend contains columns that you want to have lined up, build the columns so that they are of equal length and use mono font, e.g.

pad <- function(x, n)  # pad x to n characters
  substring(paste(x, '                       '), 1, n)
# a and b hold the two legend columns; 10 is an example width
d$z   <- paste(pad(a, 10), b)
ggplot(d, aes(x, y, color=z)) + geom_line() +
  theme(legend.text = element_text(family='mono'))

14.4 Plot Annotation

  • See this by Mine Çetinkaya-Rundel. Note that when annotating a facet or a whole plot, when the annotation does not use an aesthetic (such as colors to represent different curves on one facet), make sure that aesthetic does not appear in ggplot() but rather only in the geoms, e.g. geom_line(aes(col=region)).
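
The point about aesthetics can be sketched as follows, using simulated data in which all names are invented. The color aesthetic appears only in geom_line, so the annotation layer, whose data have no region variable, inherits only x and y.

```r
library(ggplot2)

# Simulated data; 'region' defines the curves
dd <- expand.grid(year = 2000:2010, region = c('North', 'South'))
dd$rate <- ifelse(dd$region == 'North', 10, 12) + 0.3 * (dd$year - 2000)

# Annotation data with no 'region' column
note <- data.frame(year = 2005, rate = 15, txt = 'Policy change')

g <- ggplot(dd, aes(x = year, y = rate)) +
  geom_line(aes(col = region)) +             # color used only by this geom
  geom_text(data = note, aes(label = txt))   # inherits x and y only
```

Had col=region been placed in ggplot(), the geom_text layer would also inherit it and would typically fail when drawn, because note contains no region variable.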

14.5 Separating Points Without Labeling Groups

Sometimes you need to start a new curve when moving to a new group of points, but without labeling all the groups. In this example curves representing lower confidence limits, upper confidence limits, and point estimates are separated. paste() is used to create unique groups to pass to the group ggplot2 aesthetic. The code for that example also shows how it is easier to use ggplot2 when complex data objects are melted into a single data frame.
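
A small self-contained sketch of the technique, separate from the linked example, is given here. The estimates and limits are simulated and already stacked into one long data frame; model and type are hypothetical variable names.

```r
library(ggplot2)

# Simulated estimates and confidence limits for two models,
# stacked into a single long data frame
x <- seq(0, 1, length.out = 25)
w <- expand.grid(x = x,
                 type  = c('lower', 'estimate', 'upper'),
                 model = c('A', 'B'))
shift <- c(lower = -0.2, estimate = 0, upper = 0.2)
w$y <- ifelse(w$model == 'A', w$x, w$x ^ 2) +
       unname(shift[as.character(w$type)])

# One curve per model-by-type combination; only model gets a legend
g <- ggplot(w, aes(x = x, y = y, color = model,
                   group = paste(model, type))) +
  geom_line()
```

Without the group aesthetic, the three curves within each model would be joined into a single zigzagging line.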

For large datasets you can use hexagonal binning with ggplot2 or use the Hmisc package ggfreqScatter function. Both approaches make it easy to see overlapping points by color coding the frequency of points in each small bin, allowing scatterplots to scale to very large datasets. Here is an example using ggfreqScatter:

Specifying html=TRUE is needed when the result will be converted to plotly, because otherwise axis labels are formatted using R’s plotmath, which plotly does not render.

x <- round(rnorm(2000), 1)
y <- 2 * (x > 1.5) + round(rnorm(2000), 1)
z <- sample(c('a', 'b'), 2000, replace=TRUE)
label(x) <- 'X Variable'   # could use xlab() &
label(y) <- 'Y Variable'   # ylab() in ggfreqScatter()
g <- ggfreqScatter(x, y, by=z, html=ishtml)
# If variables were inside a data table use
# g <- d[, ggfreqScatter(x, y, by=z, html=ishtml)]

Figure 14.2: ggfreqScatter example, making use of color coded frequencies of points that will work for any size dataset and any number of coincident points

Now convert the graphic to plotly if html is in effect; otherwise stay with ggplot2 output.
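
One way to write this conversion is sketched below; it is an illustration built from the functions used earlier in this chapter, with simulated data, rather than the exact code behind the figure.

```r
library(Hmisc)     # ggfreqScatter(), ggplotlyr()
library(ggplot2)

ishtml <- knitr::is_html_output()   # FALSE when run outside knitr

set.seed(5)
x <- round(rnorm(2000), 1)   # simulated data
y <- round(rnorm(2000), 1)
g <- ggfreqScatter(x, y, html = ishtml)

# Interactive plotly version for html reports, static ggplot2 otherwise
out <- if (ishtml) ggplotlyr(g) else g
# print(out) displays whichever version was chosen
```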


Figure 14.3: plotly version of Figure 14.2

When you hover the mouse over a point, its frequency pops up.

Many functions in the Hmisc and rms packages produce plotly graphics directly. These two packages’ functions using plotly try to compute optimal figure heights and widths, but it is usually better to let plotly auto-size the plots. Setting options(plotlyauto=TRUE) will override these dimensions and force plotly to auto-size. Putting this command in your .Rprofile file in the home directory makes this easy.

One of the most distinctive pure plotly functions in Hmisc is dotchartpl.