Using ggplot2 in R Language

Introduction to using ggplot2 in R Language

ggplot2 is a popular R package that implements a flexible and elegant grammar of graphics for creating a wide range of static graphics. It breaks plots down into fundamental components such as data, aesthetics, geometric objects, and statistical transformations. In this post, we will learn about using ggplot2 in R Language.

There are three main strategies for plotting in R Language:

  1. base graphics, using functions such as plot(), points(), and par()
  2. lattice graphics, which produce nice graphics, although creating graphics for high-dimensional data is not easy
  3. the ggplot2 package, an implementation of the “Grammar of Graphics”

ggplot2 is built on the principle of layering graphical elements, which makes it flexible and customizable.

To plot using ggplot2 in R Language, a data.frame object is required as input. One then defines plot layers that stack on top of each other, where each layer has visual/text elements that are mapped to aesthetics (such as size, color, and opacity). A small set of such commands is enough to produce a very informative graph.

Before drawing graphs, one needs to install the ggplot2 package. If ggplot2 is already installed, there is no need to run the command below again.

install.packages("ggplot2")
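To avoid reinstalling the package on every run, the installation can be made conditional. The following is a minimal sketch of one common pattern; the variable names pkg and is_installed are only illustrative.

```r
# Install ggplot2 only if it is not already available, then load it
pkg <- "ggplot2"
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  install.packages(pkg)
}

library(ggplot2)
```

requireNamespace() returns TRUE or FALSE without attaching the package, so the check itself has no side effects on the search path.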

Scatter Plot using ggplot2 in R

Let us draw a scatter plot of the variables $hp$ (horsepower) and $disp$ (displacement) from the mtcars dataset.

# first load the data set say mtcars
attach(mtcars)

# load the ggplot2 library
library(ggplot2)

# now specify the dataset and variables
p <- ggplot(mtcars, aes(x = disp, y = hp))

# Add a plot layer with points
p <- p + geom_point()
print(p) # display/ show the plot
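The layering idea can be sketched by stacking a second layer on the same scatter plot. The example below is only an illustration: the choice of a least-squares line via geom_smooth(method = "lm") and the axis labels are assumptions, not part of the original example.

```r
library(ggplot2)

# Layer 1: points; layer 2: a least-squares line; then axis labels
p <- ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Displacement (cu. in.)", y = "Horsepower")

print(p)
```

Each `+` adds one component to the plot object, and nothing is drawn until the object is printed.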

Note that geoms, aesthetics, and facets are three important concepts when drawing graphs using ggplot2, where

  • a geom is the type of plot (points, lines, bars, and so on)
  • aesthetics are the shape, color, size, and alpha values used in the plot
  • facets are small multiples, each displaying a different subset of the data

When certain aesthetics are defined, an appropriate legend is chosen and displayed automatically.

p <- ggplot(mtcars, aes(x = disp, y = hp))
p <- p + geom_point(aes(color = mpg))
p

Updating Graphs using aesthetics (color, size, and shape)

Graphs can be updated by mapping variables to the aesthetics color, size, and shape. For example:

p <- ggplot(mtcars, aes(x = disp, y = hp))
p <- p + geom_point(aes(color = gear, size = wt))
p

Consider the following example. Here, the $gear$ variable is taken as a factor (grouping variable).

p <- ggplot(mtcars, aes(x = disp, y = hp))
p <- p + geom_point(aes(color = as.factor(gear), size = wt))
p

Note that the behaviour of the aesthetics is predictable and customizable.

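| Aesthetic | Discrete Variable | Continuous Variable |
|-----------|-------------------|---------------------|
| color | Rainbow of colors, one per group | Color gradient (dark blue to light blue by default) |
| size | Discrete size steps (ggplot2 warns that this is not advised) | Continuous mapping between point area and value |
| shape | Different shapes for each group | Not supported (raises an error) |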
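The defaults for these aesthetics can be overridden with scale functions. The sketch below is illustrative only: the specific colors and the legend title "Gears" are arbitrary choices, not taken from the original post.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = disp, y = hp))

# Continuous variable: set the two end colors of the gradient explicitly
p_cont <- base +
  geom_point(aes(color = mpg)) +
  scale_color_gradient(low = "red", high = "blue")

# Discrete variable: one color and one shape per level of gear
p_disc <- base +
  geom_point(aes(color = factor(gear), shape = factor(gear))) +
  scale_color_manual(values = c("3" = "darkorange",
                                "4" = "steelblue",
                                "5" = "forestgreen")) +
  labs(color = "Gears", shape = "Gears")

print(p_cont)
print(p_disc)
```

Because color and shape are mapped to the same variable and share a legend title, ggplot2 merges them into a single legend.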

Faceting in ggplot2

A small multiple (sometimes called faceting, trellis chart, lattice chart, panel chart, or grid chart) is a series or grid of small, similar graphics or charts for comparison purposes. Small multiples usually display different subsets of the data and are useful for exploring conditional relationships between variables, especially when the dataset is large.

Let us examine different types of faceting. The following are some examples of splitting the scatter plot into facets:

# Create a basic scatter plot
p <- ggplot(mtcars, aes(x = disp, y = hp))
p <- p + geom_point()

# columns are cyl categories
p1 <- p + facet_grid(. ~ cyl)

# rows are cyl categories
p2 <- p + facet_grid(cyl ~ .)

# rows are carb categories
p3 <- p + facet_grid(carb ~ .)

# columns are am categories (0 = automatic, 1 = manual)
p4 <- p + facet_grid(~ am)

# plot all four in one 
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4), ncol = 2, top = "Facet Examples")
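Two further variations are worth sketching: a grid that uses rows and columns at the same time, and facet_wrap(), which wraps a single variable into a multi-row layout. The variable choices (am, cyl, carb) and ncol = 3 are illustrative assumptions.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = disp, y = hp)) + geom_point()

# rows are am categories, columns are cyl categories
p_both <- p + facet_grid(am ~ cyl)

# one panel per carb value, wrapped into three columns
p_wrap <- p + facet_wrap(~ carb, ncol = 3)

print(p_both)
print(p_wrap)
```

facet_grid() always lays out the full row-by-column cross, even for empty combinations, while facet_wrap() only creates panels for values that occur in the data.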

https://itfeature.com

https://gmstat.com

Graphical Representations in R

Many graphical representations in R Language are available for both qualitative and quantitative data types. In this post, we will only discuss graphical representations in R such as histograms, bar plots, and box plots.

Creating Histogram in R

To visualize a single variable, a histogram can be drawn using the hist() function in R. Histograms are used to judge the shape and distribution of the data graphically, and also to check the normality of a variable.

Let us attach the iris dataset.

attach(iris)
head(iris)
hist(Petal.Width)

We can enhance the histogram by using some arguments/parameters related to the hist() function in R. For example,

hist(Petal.Width,
  xlab = "Petal Width",
  ylab = "Frequency",
  main = "Histogram of Petal Width from Iris Data set",
  breaks = 10,
  col = "dodgerblue",
  border = "orange")

If these arguments are not provided, R chooses sensible defaults, in particular for the number of breaks.
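The default choice of breaks can be inspected without drawing anything. The sketch below uses iris$Petal.Width directly instead of attach(); the object names x and h are illustrative.

```r
x <- iris$Petal.Width

# Compute the histogram without plotting it
h <- hist(x, plot = FALSE)

h$breaks            # bin boundaries chosen by R
h$counts            # number of observations in each bin
nclass.Sturges(x)   # number of classes suggested by Sturges' rule (the default)
```

Note that a numeric breaks value is treated as a suggestion: hist() passes it to pretty(), so the number of bins actually drawn may differ from the number requested.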

Creating Barplots in R

Bar plots are a good choice for visual inspection of a categorical variable (or a numeric variable with a finite number of values), or a rank variable. Bar plots are usually used for comparison purposes. The barplot() function can be used for visual inspection of a categorical variable.

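# mtcars is a built-in dataset, not a package, so attach() it instead of library()
attach(mtcars)

# simple bar plot of the frequency of each cyl value
barplot(table(cyl))

# enhanced bar plot with labels and colors
barplot(table(cyl),
  ylab = "Frequency",
  xlab = "Cylinders (4, 6, 8)",
  main = "Number of cylinders",
  col = "green",
  border = "blue")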
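barplot() also returns the horizontal midpoints of the bars, which can be used to annotate the plot. The following sketch adds the counts above the bars; the names counts and mids and the extra headroom in ylim are illustrative choices.

```r
# Frequency table of the number of cylinders
counts <- table(mtcars$cyl)

# barplot() returns the x-coordinates of the bar midpoints
mids <- barplot(counts,
  ylim = c(0, max(counts) + 2),
  xlab = "Cylinders",
  ylab = "Frequency")

# write each count just above its bar
text(mids, counts, labels = counts, pos = 3)
```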

Creating Boxplots in R

One can use boxplots to assess the symmetry, skewness, and existence of outliers in the data, based on the five-number summary statistics.

boxplot(mpg)
boxplot(Petal.Width)
boxplot(Petal.Length)
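The numbers behind a boxplot can be inspected directly. The sketch below is illustrative and uses mtcars$mpg; fivenum() gives the five-number summary and boxplot.stats() gives the values that boxplot() actually draws.

```r
x <- mtcars$mpg

# minimum, lower hinge, median, upper hinge, maximum
fivenum(x)

# whisker ends, hinges, and median used by boxplot(), plus any outliers
bs <- boxplot.stats(x)
bs$stats
bs$out
```

The hinges and the median are the same in both outputs; the whisker ends differ from the minimum and maximum only when some points are flagged as outliers.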

One can also compare a numerical variable across the values of a categorical/grouping variable. For example,

boxplot(mpg ~ cyl, data = mtcars)

R reads the formula mpg ~ cyl as: “Plot the mpg variable against the cyl variable using the dataset mtcars.” The symbol ~ is used to specify a formula in R.
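A formula is an ordinary R object, so it can be stored in a variable and reused by different functions. A small sketch follows; the names f and med are illustrative, and aggregate() is used here only to show that the same formula works outside plotting.

```r
# Store the formula in a variable
f <- mpg ~ cyl

class(f)      # the class of the object
all.vars(f)   # the variables the formula refers to

# The same formula drives a grouped summary and a grouped boxplot
med <- aggregate(f, data = mtcars, FUN = median)
med

boxplot(f, data = mtcars)
```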

boxplot(mpg ~ cyl, data = mtcars,
  xlab = "Cylinders",
  ylab = "Miles per Gallon",
  pch = 20,
  cex = 2,
  col = "pink",
  border = "black")

Scatter Plots In R

Introduction to Scatter Plots in R Language

Scatter plots (scatter diagrams) are bivariate graphical representations for examining the relationship between two quantitative variables. Scatter plots are essential for visualizing correlations and trends in data. Scatter plots in R can be drawn in several ways. Here we will discuss how to make several kinds of scatter plots in R.

The plot function in R

When two numeric vectors are provided as arguments to the plot() function (one for the horizontal and the other for the vertical coordinates), its default behavior is to make a scatter diagram. For example,

library(car)
attach(Prestige)
plot(income, prestige)

will draw a simple scatterplot of prestige by income.

The interpretation of a scatterplot is often assisted by enhancing the plot with least-squares or non-parametric regression lines. For this purpose, the scatterplot() function in the car package can be used; it also adds marginal boxplots for the two variables.

scatterplot(prestige ~ income, lwd = 3 )

Note that in the scatterplot, the non-parametric regression curve is drawn by a local regression smoother. Local regression works by fitting a least-squares line in the neighborhood of each observation, placing greater weight on points closer to the focal observation. A fitted value for the focal observation is extracted from each local regression, and the resulting fitted values are connected to produce the non-parametric regression line.
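The procedure described above can be sketched in a few lines of base R. This is a simplified illustration, not the algorithm used by scatterplot(), lowess(), or loess(): the function local_fit(), its span argument, the tricube weights, and the use of the built-in cars dataset are all assumptions made for the example, and no robustness iterations are performed.

```r
# Simplified local regression: one weighted least-squares line per focal point
local_fit <- function(x, y, span = 0.5) {
  n <- length(x)
  k <- ceiling(span * n)                      # number of neighbours per fit
  fitted_vals <- sapply(seq_len(n), function(i) {
    d <- abs(x - x[i])                        # distance to the focal point
    h <- sort(d)[k]                           # distance to the k-th nearest point
    w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)  # tricube weights, zero outside the window
    fit <- lm(y ~ x, weights = w)             # local least-squares line
    predict(fit, newdata = data.frame(x = x[i]))
  })
  unname(fitted_vals)
}

x <- cars$speed
y <- cars$dist
fitted_vals <- local_fit(x, y, span = 0.5)

plot(x, y, xlab = "Speed", ylab = "Stopping distance")
lines(x, fitted_vals, lwd = 2)   # connect the fitted values
```

The cars data are already sorted by speed, so the fitted values can be connected in order; for unsorted data, sort by x before calling lines().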

Coded Scatterplots

The scatterplot() function can also be used to create coded scatterplots. For this purpose, a categorical variable is used to assign a different color or symbol to each category. For example, let us plot prestige by income, coded by the type of occupation:

scatterplot(prestige ~ income | type)

Note that variables in the scatterplot are given in a formula-style (as y ~ x | groups).

The coded scatterplot indicates that the relationship between prestige and income may well be linear within occupation types. The slope of the relationship looks steepest for blue-collar (bc) occupations, and least steep for professional and managerial occupations.

Jittering Scatter Plots

Jittering the data by adding a small random quantity to each coordinate serves to separate the overplotted points.

data(Vocab)
attach(Vocab)

# without jittering
plot(education, vocabulary)

# with jittering
plot(jitter(education), jitter(vocabulary))

The degree of jittering can be controlled via the factor argument. For example, specifying factor = 2 doubles the jitter.

plot(jitter(education, factor = 2), jitter(vocabulary, factor = 2))
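The effect of the factor argument can be checked numerically on a small integer-valued vector. The data below are made up for illustration. According to the jitter() documentation, when amount is not given the noise is uniform within plus or minus factor/5 times the smallest gap between distinct values.

```r
set.seed(42)                     # make the random noise reproducible

x <- rep(1:5, each = 4)          # integer data with many ties, gap of 1

j1 <- jitter(x)                  # default factor = 1
j2 <- jitter(x, factor = 2)      # doubled jitter

max(abs(j1 - x))                 # largest shift, bounded by 0.2 here
max(abs(j2 - x))                 # largest shift, bounded by 0.4 here
```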

Let’s add the least-squares and non-parametric regression line.

abline(lm(vocabulary ~ education), lwd = 3, lty = 2)
lines(lowess(education, vocabulary, f = 0.2), lwd = 3)

The lowess() function (short for locally weighted scatterplot smoothing) returns coordinates for the local regression curve, which is then drawn by lines(). The f argument sets the span of the local regression, that is, the proportion of points that influence the smooth at each value.
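To see how the span changes the curve, two values of f can be compared on the same plot. The sketch below uses the built-in cars dataset rather than Vocab so that it runs without extra packages; the object names are illustrative.

```r
x <- cars$speed
y <- cars$dist

fit_narrow <- lowess(x, y, f = 0.2)    # small span: follows local wiggles
fit_wide   <- lowess(x, y, f = 2/3)    # default span: smoother curve

plot(x, y, xlab = "Speed", ylab = "Stopping distance")
lines(fit_narrow, lwd = 2, lty = 1)
lines(fit_wide, lwd = 2, lty = 2)
legend("topleft", legend = c("f = 0.2", "f = 2/3 (default)"),
  lty = 1:2, lwd = 2)
```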

Using these different kinds of graphical representations of relationships between variables may help to identify some hidden information (hidden due to overplotting).


Exploring Data in R

Examination of data (exploring data), particularly graphical examination and representation of data, is an important prelude to statistical data analysis and modeling. Note that there are some limitations on the kinds of graphs that we can create.

One should be familiar with standard procedures for exploratory data analysis, statistical graphics, and data transformation. We can categorize the graphical representation of data based on the variable’s nature (or type), the number of variables, and the objective of the analysis. For example, if we are comparing groups, then comparison graphs such as bar graphs can be used. If we are interested in the kind of relationship between variables, then a scatter plot can be useful.

  • Distributional Displays:
    The distributional displays include stem and leaf displays, histograms, density estimates, quantile comparison plots, and box plots.
  • Plots of the Relationship between two variables:
    The graphical representations of the relationship between two variables include various versions of scatter plots, scatter plot smoothers, bivariate density estimates, and parallel box plots.
  • Multivariate Displays:
    Multivariate graphical representations include scatter plot matrices, coplots, and dynamic three-dimensional scatter plots.
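Two of the multivariate displays listed above are available in base R. The sketch below is illustrative: the choice of variables and the helper data frame cars_df are assumptions made for the example.

```r
# Scatter plot matrix of four numeric variables
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]
pairs(vars, main = "Scatter plot matrix of mtcars variables")

# Coplot: mpg against wt, conditioned on the number of cylinders
cars_df <- transform(mtcars, cyl = factor(cyl))
coplot(mpg ~ wt | cyl, data = cars_df, rows = 1)
```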

For exploring the data in R, the following are some examples:

Stem and Leaf Display and Histogram in R

attach(mtcars)
hist(mpg)
hist(mpg, nclass = 3, col = 3)
stem(mpg)

Exploring Data in R: Density Estimates

Consider the following R code for a representation of distribution by smoothing the histogram.

hist(mpg, probability = TRUE, ylab = 'Density')
lines(density(mpg), lwd = 2)
points(mpg, rep(0, length(mpg)), pch = "|")
lines(density(mpg, adjust = 0.9), lwd = 1)

The hist() function constructs the histogram, with probability = TRUE specifying density scaling. The first lines() call draws the density estimate on the graph, with lwd = 2 doubling the thickness of the line. The points() function draws a one-dimensional scatter plot at the bottom of the graph, using a vertical bar as the plotting symbol. The second call to density(), with adjust = 0.9, uses a bandwidth of 0.9 times the default value.
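The effect of adjust on the bandwidth can be verified from the objects that density() returns. This is a small sketch using mtcars$mpg directly; the object names are illustrative, and rug() is used as a shorthand for the one-dimensional scatter plot.

```r
x <- mtcars$mpg

d_default <- density(x)                 # default bandwidth
d_adj     <- density(x, adjust = 0.9)   # 0.9 times the default bandwidth

d_default$bw
d_adj$bw

hist(x, probability = TRUE, ylab = "Density")
lines(d_default, lwd = 2)
lines(d_adj, lty = 2)
rug(x)                                  # tick marks at the observed values
```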

Quantile Comparison Plots

Quantile plots help in comparing the distribution of a variable with a theoretical distribution such as the normal distribution.

library(car)
qqPlot(mpg)

Note that the qqPlot() function is available in the car package. The older qq.plot() function is defunct.
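If the car package is not available, a basic normal quantile comparison plot can be drawn with base R. This sketch is an alternative, not a replacement for qqPlot(), which additionally draws a confidence envelope; the plot title is an illustrative choice.

```r
x <- mtcars$mpg

# sample quantiles against theoretical normal quantiles
qq <- qqnorm(x, main = "Normal Q-Q plot of mpg")

# reference line through the first and third quartiles
qqline(x, lwd = 2)
```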

Exploring Data: Relationship Graphs

To explore the relationship between two quantitative variables, use the plot() function; for a more enhanced version of a scatter plot between two variables, use the scatterplot() function from the car package. This function plots the variables with least-squares and non-parametric regression lines. For example,

plot(mpg, wt)
scatterplot(mpg, wt)
# label unusual points with car names; car >= 3.0 takes labels via the id argument
scatterplot(mpg, wt, id = list(labels = rownames(mtcars)))
