Exploring Data in R

Master Exploring Data in R with this essential guide! Learn how to use summary statistics, data visualization, and exploratory data analysis (EDA) techniques to uncover patterns, detect outliers, and prepare your datasets for machine learning. Perfect for data scientists, analysts, and researchers!

Introduction to Exploring Data in R Language

The examination of data (Exploring Data), particularly graphical examination and representation of data, is an important prelude to statistical data analysis and modeling. Note that there are some limitations on the kinds of graphs that we can create.

One should be familiar with standard procedures for exploratory data analysis, statistical graphics, and data transformation. We can categorize graphical representations of data by the nature (or type) of the variables, the number of variables, and the objective of the analysis. For example, if we are comparing groups, then comparison graphs such as bar graphs can be used. If we are interested in the kind of relationship between variables, then a scatter plot can be useful.

  • Distributional Displays:
    The distributional displays include stem and leaf displays, histograms, density estimates, quantile comparison plots, and box plots.
  • Plots of the Relationship between two variables:
    The graphical representations of the relationship between two variables include various versions of scatter plots, scatter plot smoothers, bivariate density estimates, and parallel box plots.
  • Multivariate Displays:
    Multivariate graphical representations include scatter plot matrices, coplots, and dynamic three-dimensional scatter plots.
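
As a rough illustration, the sketch below draws one display from each of the groups above. It assumes the built-in mtcars data set and base R graphics only; the choice of variables (mpg, wt, hp, cyl) is made here purely for demonstration.

```r
# One display per category, using the built-in mtcars data
data(mtcars)

# Distributional display: box plot of miles per gallon
bp <- boxplot(mtcars$mpg, main = "Box plot of mpg", ylab = "mpg")

# Relationship between two variables: parallel box plots of mpg by cylinders
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "mpg")

# Multivariate displays: scatter plot matrix and a conditioning plot (coplot)
pairs(mtcars[, c("mpg", "wt", "hp")])
coplot(mpg ~ wt | hp, data = mtcars)
```

The object returned by boxplot() holds the five-number summary in bp$stats, which can be inspected alongside the plot.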

Before exploring data in R, it is important to understand the structure of your data set.

Understanding Your Data Structure

Use the following built-in functions to understand your data set first.

  • str() – Examine object structure
  • summary() – Quick statistical overview
  • head()/tail() – View first/last rows
  • dim() – Check dataset dimensions
  • class() – Identify variable types
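
A minimal sketch of these functions in use, assuming the built-in mtcars data frame as the data set (any data frame could be substituted):

```r
# Inspect the structure of a data set before plotting it
data(mtcars)

str(mtcars)             # compact description of every column
summary(mtcars$mpg)     # min, quartiles, mean, and max of one variable
head(mtcars, 3)         # first three rows
tail(mtcars, 3)         # last three rows
dims <- dim(mtcars)     # number of rows and columns
cls <- class(mtcars)    # class of the whole object
sapply(mtcars, class)   # class of each column
```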

Stem and Leaf Display and Histogram in R

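attach(mtcars)                   # make the mtcars columns (mpg, wt, ...) available by name
hist(mpg)                        # histogram with the default number of bins
hist(mpg, nclass = 3, col = 3)   # about three bins, filled with colour 3 of the palette
stem(mpg)                        # stem and leaf display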
[Figure: Histogram, Exploring Data in R]

Exploring Data in R: Density Estimates

Consider the following R code for a representation of distribution by smoothing the histogram.

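hist(mpg, probability = TRUE, ylab = 'Density')   # histogram on the density scale
lines(density(mpg), lwd = 2)                      # kernel density estimate, thicker line
points(mpg, rep(0, length(mpg)), pch = "|")       # one-dimensional scatter plot of the data
lines(density(mpg, adjust = 0.9), lwd = 1)        # estimate with a slightly smaller bandwidth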

The hist() function constructs the histogram, with probability = TRUE specifying density scaling. The first lines() call draws the density estimate on the graph, with lwd = 2 doubling the line width. The points() function draws a one-dimensional scatter plot at the bottom of the graph, using a vertical bar as the plotting symbol. The second lines() call draws a density estimate with adjust = 0.9, which multiplies the default bandwidth by 0.9 (the default value of adjust is 1) and therefore gives a slightly less smooth curve.
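
The effect of adjust can be checked numerically. The following sketch, which assumes the mpg column of the built-in mtcars data, compares the bandwidth stored in the two density objects:

```r
# Compare the bandwidths used by density() with and without adjust
mpg <- mtcars$mpg

d_default <- density(mpg)                 # adjust = 1 by default
d_narrow  <- density(mpg, adjust = 0.9)   # bandwidth scaled by 0.9

d_default$bw                              # default bandwidth chosen by density()
d_narrow$bw                               # 0.9 times the default bandwidth
ratio <- d_narrow$bw / d_default$bw

# Overlay both estimates on the histogram
hist(mpg, probability = TRUE, ylab = "Density")
lines(d_default, lwd = 2)
lines(d_narrow, lty = 2)
```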

Quantile Comparison Plots in R

Quantile plots help in comparing the distribution of a variable with a theoretical distribution, such as the normal distribution.

library(car)
qqPlot(mpg)

Note that the qqPlot() function is available in the car package. The older qq.plot() function is defunct.
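
If the car package is not installed, a comparable normal quantile plot can be sketched with base R alone. This is an alternative to qqPlot(), not a reproduction of it (for instance, it draws no confidence envelope), and it assumes the mpg column of the built-in mtcars data.

```r
# Normal quantile comparison plot using base R only
mpg <- mtcars$mpg

qq <- qqnorm(mpg, main = "Normal Q-Q plot of mpg")   # sample vs. theoretical quantiles
qqline(mpg, lty = 2)                                 # reference line through the quartiles
```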

Exploring Data: Relationship Graphs

To explore the relationship between two quantitative variables, use the plot() function. For an enhanced scatter plot, use the scatterplot() function from the car package, which adds least-squares and nonparametric regression lines. For example,

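plot(mpg, wt)          # basic scatter plot
scatterplot(mpg, wt)   # enhanced scatter plot from the car package
# Identify the two most unusual points, labelled with the car names
scatterplot(mpg, wt, id = list(labels = rownames(mtcars), n = 2))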


FAQs about Exploring Data in R Language

  1. What do you mean by exploring data?
  2. What are the objectives of exploratory data analysis?
  3. What are the important visualizations for exploratory data analysis?
  4. For exploratory analysis, which graph is used for comparison purposes?
  5. For exploratory analysis, which graph is used to explore the relationship between variables?
  6. What is a quantile comparison plot?
  7. What is the objective of density estimation graphs?
  8. Name some of the multivariate plots used for EDA.


Greek Letters in R Plot Label and Title

In R, plot symbols are used to represent data points in scatter plots and other types of plots, and Greek letters can be added to plot labels and titles. Both can be customized to make your graphs clearer and more visually appealing.

Common Plot Symbols in R

The R language uses numeric codes (the pch argument) to represent different plotting symbols. The following is a list of the most commonly used plot symbols and their corresponding codes:

| Symbol | Code | Description |
|---|---|---|
| Circle | 16 | Solid circle |
| Square | 15 | Solid square |
| Triangle | 17 | Solid triangle |
| Diamond | 18 | Solid diamond |
| Plus Sign | 3 | Plus sign |
| X | 4 | Cross (x) |
| Open Circle | 1 | Circle with no fill (default) |
| Open Square | 0 | Square with no fill |
| Open Triangle | 2 | Triangle with no fill |
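
A quick way to check which shape belongs to which code is to draw the codes side by side. The sketch below is only an illustration; the selection of codes and the layout are choices made here.

```r
# Draw a selection of pch codes so each shape can be checked visually
codes <- c(0, 1, 2, 3, 4, 15, 16, 17, 18)

plot(seq_along(codes), rep(1, length(codes)),
     pch = codes, cex = 2,
     xaxt = "n", yaxt = "n",
     xlab = "pch code", ylab = "", ylim = c(0.5, 1.5),
     main = "Common plot symbols")
axis(1, at = seq_along(codes), labels = codes)   # label each symbol with its code
```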

Introduction to R Plot Symbols (Greek Letters)

This post covers R plot symbols and how to write Greek letters in plot labels and titles. There are two main ways to include Greek letters in your R plot labels (axis labels, title, legend):

  1. Using the expression Function
    This is the recommended approach as it provides more flexibility and control over the formatting of the Greek letters and mathematical expressions.
  2. Using raw Greek letter Codes
    This method is less common and requires memorizing the character codes for each Greek letter.
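
The two approaches can be sketched side by side. The second part assumes that "raw codes" means Unicode escapes in an ordinary character string, which is an interpretation made here; whether the character renders correctly depends on the graphics device and font.

```r
# Two ways to put the Greek letter beta in a plot title
set.seed(1)
z <- rnorm(200)

# 1. expression(): plotmath renders the name beta as a Greek letter
hist(z, main = expression(paste("Estimates of ", beta)))

# 2. Unicode escape (assumed meaning of "raw codes"): U+03B2 is Greek small beta
beta_char <- "\u03b2"
hist(z, main = paste("Estimates of", beta_char))
```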

Question: How can one include Greek letters (symbols) in R plot labels?
Answer: Greek letters or symbols can be included in the titles and labels of a graph using the expression() function. Some examples follow.

Note that in these examples, random data is generated from a normal distribution. You can use your own data set to produce graphs that have symbols or Greek letters in their labels or titles.

Greek Letters in R Plot

The following are a few examples of writing Greek letters in R plot.

Example 1: Draw Histogram

mycoef <- rnorm(1000)                   # 1000 standard normal draws
hist(mycoef, main = expression(beta))   # title rendered as the Greek letter beta

Here, beta inside expression() is rendered as the Greek letter $\beta$. A histogram similar to the following will be produced.

[Figure: Greek letters in R plot, example 1]

Example 2:

x_sample <- rnorm(n = 100, mean = 5, sd = 1)   # named x_sample to avoid masking base::sample()
hist(x_sample, main = expression(paste("sampled values, ", mu, "=5, ", sigma, "=1")))

where mu and sigma are symbols of $\mu$ and $\sigma$ respectively. The histogram will look like

[Figure: Greek symbols in R plot, example 2]

Example 3:

curve(dnorm, from= -3, to=3, n=1000, main="Normal Probability Density Function")

will produce a curve of Normal probability density function ranging from $-3$ to $3$.

[Figure: Greek symbols in R plot, example 3]

List of Common Greek Letters in R Plot

The following is a list of common Greek letters and their corresponding R expressions:

| Greek Letter | R Expression | R Example | Symbol |
|---|---|---|---|
| Alpha | alpha | expression(alpha) | $\alpha$ |
| Beta | beta | expression(beta) | $\beta$ |
| Gamma | gamma | expression(gamma) | $\gamma$ |
| Delta | delta | expression(delta) | $\delta$ |
| Theta | theta | expression(theta) | $\theta$ |
| Pi | pi | expression(pi) | $\pi$ |
| Sigma | sigma | expression(sigma) | $\sigma$ |
| Lambda | lambda | expression(lambda) | $\lambda$ |
| Rho | rho | expression(rho) | $\rho$ |
| Phi | phi | expression(phi) | $\phi$ |
| Mu | mu | expression(mu) | $\mu$ |
| Omega | omega | expression(omega) | $\omega$ |
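
To see how each of these names is rendered on your own graphics device, the letters can be drawn in a single empty plot. This is only a sketch; the layout and the use of parse(text = ...) to build the expressions are choices made here.

```r
# Render each Greek letter name from the list as a plotmath symbol
greek <- c("alpha", "beta", "gamma", "delta", "theta", "pi",
           "sigma", "lambda", "rho", "phi", "mu", "omega")
labs <- parse(text = greek)   # one expression per name

plot(seq_along(greek), rep(1, length(greek)), type = "n",
     axes = FALSE, xlab = "", ylab = "", ylim = c(0.6, 1.2),
     main = "Greek letters via expression()")
text(seq_along(greek), 1, labels = labs, cex = 1.5)     # rendered symbols
text(seq_along(greek), 0.8, labels = greek, cex = 0.7)  # the names typed in R
```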

Complex Mathematical Expressions in R Plot

One can also combine Greek letters with other mathematical constructs, such as sums or integrals.

# Plot with complex mathematical expression
x = runif(100)
y = runif(100)
plot(x, y, main=expression(paste("Sum: ", sum(x[i]^2), " for all ", i)))

Normal Density Function

To add the formula of the normal density function to the curve from Example 3, use the text() function with paste() inside expression(), that is

text(-2, 0.3, expression(f(x) == paste(frac(1, sqrt(2*pi* sigma^2 ) ), " ", e^{frac(-(x-mu)^2, 2*sigma^2)})), cex=1.2)

Now, the updated curve of the Normal probability density function will be

[Figure: Normal Probability Density Function]

Example 4:

x <- dnorm( seq(-3, 3, 0.001))
plot(seq(-3, 3, 0.001), cumsum(x)/sum(x), 
           type="l", col="blue", xlab="x", 
           main="Normal Cumulative Distribution Function")

The Normal Cumulative Distribution function will look like

[Figure: Normal Cumulative Distribution Function]
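
The cumulative sum in Example 4 is a numerical approximation over the interval from $-3$ to $3$. As a rough check, the sketch below compares it with the exact values from pnorm(); the variable names are chosen here for illustration.

```r
# Compare the cumulative-sum approximation with the exact normal CDF
grid <- seq(-3, 3, 0.001)
dens <- dnorm(grid)

approx_cdf <- cumsum(dens) / sum(dens)   # approximation used in Example 4
exact_cdf  <- pnorm(grid)                # exact normal CDF

max_gap <- max(abs(approx_cdf - exact_cdf))   # largest difference on the grid

plot(grid, exact_cdf, type = "l", col = "blue", xlab = "x", ylab = "Probability",
     main = "Normal Cumulative Distribution Function")
lines(grid, approx_cdf, lty = 2, col = "red")
```

The two curves are close but not identical, because the approximation ignores the probability mass outside the plotted interval.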

To add the formula, use the text() function with paste() inside expression(), that is

# Phi (capital) is the conventional symbol for the normal CDF
text(-1.5, 0.7, 
       expression(Phi(x) == paste(frac(1, sqrt(2*pi)), " ", 
       integral(e^(-t^2/2)*dt, -infinity, x))), cex = 1.2)

The Curve of the Normal Cumulative Distribution Function

The curve of the Normal Cumulative Distribution Function, with its formula added to the plot, will look like this:

[Figure: Normal Cumulative Distribution Function with formula]

https://itfeature.com, https://gmstat.com

Plot Function in R

This article introduces the plot() function in the R language, explains the use and purpose of its arguments, and provides a few examples. With plot(), one can draw different graphical representations, and its arguments can be used to enhance the graph.

Introduction to Graphics in R Language

Question: Can we draw graphics in R language?
Answer: Yes. R language produces high-quality statistical graphs. There are many useful and sophisticated kinds of graphs available in R.

Question: Where are graphics displayed in R?
Answer: In R, graphs are drawn in a graphics window (a graphics device), which can be resized.

Question: What is the use of the plot function in R?
Answer: In R, plot() is a generic function that can be used to make a variety of point and line graphs. The plot() function can also be used to define a coordinate space.

Important Arguments of the Plot Function in R

Question: What are the arguments of the plot() function?
Answer: The plot() function accepts many arguments, such as x, y, type, xlab, and ylab. To see the full list of arguments of plot(), run the following command in the R console:

args(plot.default)
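
The argument names can also be collected programmatically. A minimal sketch using formals(), where the variable name arg_names is chosen here for illustration:

```r
# List the argument names of plot.default() and check for a few of them
arg_names <- names(formals(plot.default))
arg_names

wanted <- c("x", "y", "type", "xlab", "ylab", "xlim", "ylim")
wanted %in% arg_names   # TRUE for each argument that plot.default() accepts
```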

Question: Is it necessary to supply all the arguments of plot()?
Answer: No. The first two arguments x and y provide the horizontal and vertical coordinates of points or lines to be plotted and define a data-coordinate system for the graph. At least argument x is required. Note that many of the arguments are set to default values in the plot function.

Question: What is the use of the argument type in the plot() function?
Answer: In the R plot function, the argument type determines the type of graph to be drawn. The default, type = 'p', plots points at the coordinates specified by the x and y arguments. Specifying type = 'l' produces a line graph, and type = 'n' sets up the plotting region to accommodate the data but plots nothing.

Other Types of Graphs: Setting type Argument

Question: Are there other types of graphs?
Answer: Yes. Setting type = 'b' draws both points and lines. Setting type = 'h' draws histogram-like vertical lines, and type = 's' and type = 'S' draw stair-step lines that start horizontally and vertically, respectively.
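
The different values of type are easiest to compare side by side. The sketch below uses a small made-up data set (the x and y values are arbitrary) and draws one panel per type:

```r
# Compare the values of the type argument on the same small data set
x <- 1:10
y <- c(2, 5, 3, 8, 6, 9, 4, 7, 5, 10)   # arbitrary illustrative values

types <- c("p", "l", "n", "b", "h", "s", "S", "o")

op <- par(mfrow = c(2, 4))              # 2 x 4 grid of panels
for (tp in types) {
  plot(x, y, type = tp, main = paste0("type = '", tp, "'"))
}
par(op)                                 # restore the previous layout
```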

Question: What is the use of xlim and ylim in plot() function?
Answer: The arguments xlim and ylim may be used to define the limits of the horizontal and vertical axes. Usually, these arguments are unnecessary, because R language reasonably picks limits from x and y.

Question: What is the purpose of the xlab and ylab arguments in the plot() function?
Answer: The xlab and ylab arguments take character strings used to label the horizontal and vertical axes.
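
A short sketch of the axis limits and axis labels together. The data are simulated, and the particular limits are arbitrary choices made here for illustration.

```r
# Set axis limits and axis labels explicitly
set.seed(42)
x <- rnorm(50, mean = 10, sd = 10)
y <- rnorm(50)

range(x)   # limits R would pick from by default
range(y)

plot(x, y,
     xlim = c(-30, 50), ylim = c(-4, 4),   # explicit axis limits
     xlab = "X values", ylab = "Y values") # axis labels

usr <- par("usr")   # plotting region: slightly wider than xlim and ylim by default
```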

Examples of R Plot Function in R

Question: Provide a few examples of the R plot function.
Answer: The following are a few examples of R plot functions. Suppose you have a data set on variables x and y, such as

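x <- rnorm(100, mean = 10, sd = 10)   # X with mean 10 and SD 10
y <- rnorm(100)                       # Y with mean 0 and SD 1

plot(x, y)                                                                       # points (default)
plot(x, y, xlab = 'X (Mean=10, SD=10)', ylab = 'Y (Mean=0, SD=1)', type = 'l')   # lines
plot(x, y, xlab = 'X (Mean=10, SD=10)', ylab = 'Y (Mean=0, SD=1)', type = 'o')   # points over lines
plot(x, y, xlab = 'X (Mean=10, SD=10)', ylab = 'Y (Mean=0, SD=1)', pch = 10)     # circled-plus symbol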
[Figure: Introduction to plot function in R]
