Exploring Data in R
Examination of data (exploring data), particularly graphical examination and representation of data, is an important prelude to statistical data analysis and modeling. Note that there are some limitations on the kinds of graphs that we can create.
One should also be familiar with standard procedures for exploratory data analysis, statistical graphics, and data transformation. We can categorize graphical representations of data on the basis of the nature (or type) of the variables, the number of variables, and the objective of the analysis. For example, if we are comparing groups, then comparison graphs such as bar graphs can be used, and if we are interested in the kind of relationship between variables, then a scatter plot can be useful.
- Distributional Displays:
The distributional displays include stem and leaf display, histograms, density estimates, quantile comparison plots, and box plots.
- Plots of the Relationship between two variables:
The graphical representations for the relationship between two variables include various versions of scatter plots, scatter plot smoothers, bivariate density estimates, and parallel box plots.
- Multivariate Displays:
Multivariate graphical representations include scatter plot matrices, coplots, and dynamic three-dimensional scatter plots.
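Several of these displays take a single call in base R. The sketch below is a minimal illustration using the built-in mtcars data set; the choice of variables (mpg, wt, hp, cyl) is an assumption made only for the example.

```r
# Sketch using the built-in mtcars data and base graphics only.

# Distributional display: box plot of a single variable
boxplot(mtcars$mpg, ylab = "Miles per gallon")

# Two variables: parallel box plots of mpg by number of cylinders
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Cylinders", ylab = "Miles per gallon")

# Multivariate: scatter plot matrix (the column choice is arbitrary)
vars <- c("mpg", "wt", "hp")
pairs(mtcars[, vars])

# Multivariate: coplot of mpg against wt, conditioned on cylinders
coplot(mpg ~ wt | factor(cyl), data = mtcars)
```

boxplot() returns its summary statistics invisibly, so bp$n can be inspected to see how many observations fall in each group.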
The following examples illustrate exploring data in R:
Stem and Leaf display and Histogram in R
attach(mtcars)
hist(mpg)
hist(mpg, nclass=3, col=3)
stem(mpg)
The following R code represents the distribution by overlaying a smoothed density estimate on the histogram.
hist(mpg, probability=TRUE, ylab='Density')
lines(density(mpg), lwd=2)
points(mpg, rep(0, length(mpg)), pch="|")
lines(density(mpg, adjust=0.9), lwd=1)
The hist() function constructs the histogram, with probability = TRUE specifying density scaling. The lines() function draws the density estimate on the graph, with the line drawn at double thickness because of the parameter lwd=2. The points() function draws a one-dimensional scatter plot at the bottom of the graph, using a vertical bar as the plotting symbol. The second call to density(), inside the lines() function with adjust=0.9, specifies a bandwidth of 0.9 times the default value.
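To see what adjust does to the bandwidth, the density objects can be inspected directly. This is a small sketch; the variable names d_default and d_narrow are illustrative only.

```r
mpg <- mtcars$mpg

d_default <- density(mpg)                # default bandwidth selector
d_narrow  <- density(mpg, adjust = 0.9)  # default bandwidth scaled by 0.9

# Bandwidths actually used by each estimate
d_default$bw
d_narrow$bw

# Ratio of the two bandwidths
d_narrow$bw / d_default$bw
```

A smaller adjust gives a narrower bandwidth and therefore a less smooth curve; values above 1 smooth more heavily.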
Quantile Comparison Plots
Quantile plots help in comparing the distribution of a variable with a theoretical distribution such as the normal distribution.
Note that the qqPlot() function is available in the car package. The older qq.plot() function is defunct.
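A quantile comparison against the normal distribution can be sketched as follows. Base R's qqnorm() and qqline() are used here so that the example runs without extra packages; the call to car::qqPlot() is guarded and assumes the car package is installed.

```r
mpg <- mtcars$mpg

# Base R: sample quantiles against theoretical normal quantiles
qq <- qqnorm(mpg, main = "Normal Q-Q plot of mpg")
qqline(mpg, lty = 2)  # reference line through the first and third quartiles

# car version, run only if the package is available
if (requireNamespace("car", quietly = TRUE)) {
  car::qqPlot(mpg, distribution = "norm")
}
```

Points that fall close to the reference line suggest that the variable is roughly normally distributed; systematic curvature suggests skewness or heavy tails.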
To explore the relationship between two quantitative variables, use the plot() function; for an enhanced version of a scatter plot between two variables, use the scatterplot() function from the car package. This function plots the variables with least squares and nonparametric regression lines. For example,
library(car)
plot(mpg, wt)
scatterplot(mpg, wt)
scatterplot(mpg, wt, labels=rownames(mtcars))
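If the car package is not available, a similar display can be approximated in base R. The sketch below is an assumption-laden stand-in rather than a reproduction of scatterplot(): it uses lm() for the least squares line and lowess() as the nonparametric smoother, which is not necessarily the smoother that scatterplot() uses.

```r
# x and y follow the order used in plot(mpg, wt)
x <- mtcars$mpg
y <- mtcars$wt

plot(x, y, xlab = "mpg", ylab = "wt")

fit <- lm(y ~ x)
abline(fit, lty = 1)          # least squares line
lines(lowess(x, y), lty = 2)  # nonparametric smooth (lowess)
legend("topright", legend = c("least squares", "lowess"), lty = c(1, 2))
```

Where the two lines diverge, the relationship departs from linearity, which is the main thing the smoother is meant to reveal.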