Summarizing Data in R Base Package

Introduction to Summarizing Data in R

Data summarization (getting different summary statistics) is a fundamental step in exploratory data analysis (EDA). Summarizing data in R Language helps analysts to understand the patterns, detect anomalies, and derive insights. While modern R packages like dplyr and data.table offers streamlined approaches. However, Base R remains a powerful and efficient tool for quick data summarization without additional dependencies (packages).

This guide explores essential Base R functions for summarizing data, from basic statistics to advanced grouped operations, ensuring you can efficiently analyze datasets right out of the box.

For learning purposes, we will use the mtcars data set.

Key Functions for Basic Summary Statistics

There are several Base R functions for computing summary statistics. The summary() function offers a quick overview of a dataset, displaying minimum, maximum, mean, median, and quartiles for numerical variables. On the other hand, the categorical variables are summarized with frequency counts. For more specific metrics, functions like mean(), median(), sd(), and var() calculate central tendency and dispersion, while min() and max() functions can be used to identify the data range. These functions are particularly useful when combined with na.rm = TRUE to handle missing values. For example, applying summary(mtcars) gives an immediate snapshot of the dataset, while mean(mtcars$mpg, na.rm = TRUE) computes the average miles per gallon.

Frequency Counts and Cross-Tabulations

When working with categorical data, the table() function is indispensable for generating frequency distributions. It counts occurrences of unique values, making it ideal for summarizing factors or discrete variables. For more complex relationships, xtabs() or ftable() can create cross-tabulations, revealing interactions between multiple categorical variables. For instance, table(mtcars$cyl) shows how many cars have 4, 6, or 8 cylinders, while xtabs(~ gear + cyl, data = mtcars) presenting a contingency table between gears and cylinders.

attach(mtcars)

# Frequency of cylinders
table(cyl)

# contingency table of gears and cylinders
xtabs(~ gear + cyl, data = mtcars)
Summarizing Data in R Language

Group-Wise Summarization Using aggregate() and by()

To compute summary statistics by groups, Base R offers aggregate() and by(). The aggregate() function splits data into subsets and applies a summary function, such as mean or sum, to each group. For example, aggregate(mpg ~ cyl, data = mtcars, FUN = mean) calculate the average MPG per cylinder group. Meanwhile, by() provides more flexibility, allowing custom functions to be applied across groups. While tapply() is another alternative for vector-based grouping, aggregate() is often preferred for its formula interface and cleaner output.

# Average for each cylinder of the vehicle
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

## Output
  cyl      mpg
1   4 26.66364
2   6 19.74286
3   8 15.10000

Advanced Techniques: Quantiles and Custom Summaries

Beyond basic summaries, Base R supports advanced techniques like percentile analysis using quantile(), which helps assess data distribution by returning specified percentiles (e.g., quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))). For customized summaries, users can define their own functions and apply them using sapply() or lapply(). This approach is useful when needing tailored metrics, such as trimmed means or confidence intervals. Additionally, combining these functions with plotting tools like boxplot() or hist() can further enhance data interpretation.

# percentiles
quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))

## Output
   25%    50%    75% 
15.425 19.200 22.800 

boxplot(quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75)) )
Data Visualization Summarizing Data in R Base Package

When to Use Base R vs. Tidyverse for Summarization

While Base R is efficient and lightweight, the Tidyverse (particularly dplyr) offers a more readable syntax for complex operations. Functions like summarize() and group_by() simplify chained operations, making them preferable for large-scale data wrangling. However, Base R remains advantageous for quick analyses, legacy code, or environments where installing additional packages is restricted. Understanding both approaches ensures flexibility in different analytical scenarios.

Best Effective Practices for Summarizing Data in R

To maximize efficiency, always handle missing values explicitly using na.rm = TRUE in statistical functions. For large datasets, consider optimizing performance by pre-filtering data or using vectorized operations. Visualizing summaries with basic plots (e.g., hist(), boxplot()) can provide immediate insights. Finally, documenting summary steps ensures reproducibility, whether in scripts, R Markdown, or Shiny applications.

In summary, the Base R provides a robust toolkit for data summarization, from simple descriptive statistics to advanced grouped analyses. By mastering functions like summary(), table(), aggregate(), and quantile(), analysts can efficiently explore datasets without relying on external packages. While modern alternatives like dplyr enhance readability for complex tasks, Base R’s simplicity and universality make it an essential skill for every R programmer. Practicing these techniques on real-world datasets will solidify your understanding and improve your data analysis workflow.

Dimensionality Reduction in Machine Learning

Descriptive Summary in R

Introduction to Descriptive Summary in R

Statistics is a study of data: describing properties of data (descriptive statistics) and drawing conclusions about a population based on information in a sample (inferential statistics). In this article, we will discuss the computation of descriptive summary in R (Descriptive statistics in R Programming).

Example: Twenty elementary school children were asked if they live with both parents (B), father only (F), mother only (M), or someone else (S) and how many brothers has he. The responses of the children are as follows:

CaseSexNo. of His BrothersCaseSexNo. of His Brothers
MFemale3BMale2
BFemale2FMale1
BFemale3BMale0
MFemale4MMale0
FMale3MMale3
SMale1BFemale4
BMale2BFemale3
MMale2FMale2
FFemale4BFemale1
BFemale3MFemale2


Consider the following computation is required. These computations are related to the Descriptive summary in R.

  • Construct a frequency distribution table in r relative to the case of each one.
  • Draw a bar and pie graphs of the frequency distribution for each category using the R code.

Creating the Frequency Table in R

# Enter the data in the vector form 
x <- c("M", "B", "B", "M", "F", "S", "B", "M", "F", "B", "B", "F", "B", "M", "M", "B", "B", "F", "B", "M") 

# Creating the frequency table use Table command 
tabx=table(x) ; tabx

# Output
x
B F M S 
9 4 6 1 

Draw a Bar Chart and Pie Chart from the Frequency Table

# Drawing the bar chart for the resulting table in Green color with main title, x label and y label 

barplot(tabx, xlab = "x", ylab = "Frequency", main = "Sample of Twenty elementary school children ",col = "Green") 

# Drawing the pie chart for the resulting table with main title.
pie(tabx, main = "Sample of Twenty elementary school children ")
Graphical Descriptive summary in R Programming Language
Descriptive summary in R Programming Language

Descriptive Statistics for Air Quality Data

Consider the air quality data for computing numerical and graphical descriptive summary in R. The air quality data already exists in the R Datasets package.

attach(airquality)
# To choose the temperature degree only
Temperature = airquality[, 4]
hist(Temperature)

hist(Temperature, main="Maximum daily temperature at La Guardia Airport", xlab="Temperature in degrees Fahrenheit", xlim = c(50, 100), col="darkmagenta", freq=T)

h <- hist(Temperature, ylim = c(0,40))
text(h$mids, h$counts, labels=h$counts, adj=c(0.5, -0.5))
Histogram Descriptive Statistics in R Programming Language

In the above histogram, the frequency of each bar is drawn at the top of each bar by using the text() function.

Note that to change the number of classes or the interval, we should use the sequence function to divide the $range$, $Max$, and $Min$, into $n$ using the function length.out=n+1

hist(Temperature, breaks = seq(min(Temperature), max(Temperature), length.out = 7))
Histogram with breaks. Descriptive Statistics in R Programming Language

Median for Ungrouped Data

Numeric descriptive statistics such as median, mean, mode, and other summary statistics can be computed.

median(Temperature)
## Output 79
mean(Temperature)
summary(Temperature)
Numerical Descriptive Statistics in R Programming Language

A customized function for the computation of the median can be created. For example

arithmetic.median <- function(xx){
    modulo <- length(xx) %% 2
    if (modulo == 0){
      (sort(xx)[ceiling(length(xx)/2)] + sort(xx)[ceiling(1+length(xx)/2)])/2
    } else{
     sort(xx)[ceiling(length(xx)/2)]
  }
}
arithmetic.median(Temperature)

Computing Quartiles and IQR

The quantiles (Quartiles, Deciles, and Percentiles) can be computed using the function quantile() in R. The interquartile range (IQR) can also be computed using the iqr() function.

y = airquality[, 4]  # temperature variable

quantile(y)

quantile(y, probs = c(0.25,0.5,0.75))
quantile(y, probs = c(0.30,0.50,0.70,0.90))

IQR(y)
Quartiles Descriptive summary in R Programming Language

One can create a custom function for the computation of Quartiles and IQR. For example,

quart<- function(x) {
   x <- sort(x)
   n <- length(x)
   m <- (n+1)/2
   if (floor(m) != m) {
      l <- m-1/2; u <- m+1/2
     } else {
     l <- m-1; u <- m+1
     }
   c(Q1 = median(x[1:l]), 
   Q3 = median(x[u:n]), 
   IQR = median(x[u:n])-median(x[1:l]))
}

quart(y)

FAQs in R Language

  1. How one can perform descriptive statistics in R Language?
  2. Discuss the strategy of creating a frequency table in R.
  3. How Pie Charts and Bar Charts can be drawn in R Language? Discuss the commands and important arguments.
  4. What default function is used to compute the quartiles of a data set?
  5. You are interested in computing the median for group and ungroup data in R. Write a customized R function.
  6. Create a User-Defined function that can compute, Quaritles and IQR of the inputted data set.

https://itfeature.com

https://gmstat.com

Some Descriptive Statistics in R: A Comprehensive R Tutorial

Descriptive Statistics in R

There are numerous functions in the R language that are used to computer descriptive statistics. Here, we will consider the data mtcars to get descriptive statistics in R. You can use a dataset of your own choice. To learn about what are descriptive statistics, read the different posts from the Basic Statistics Section.

Getting Dataset Information in R

Before performing any descriptive or inferential statistics, it is better to get some basic information about the data. It will help to understand the mode (type) of variables in the datasets.

# attach the mtcars datasets
attach(mtcars)

# data structure
str(mtcars)

You will see the dataset mtcars contains 32 observations and 11 variables.

It is also best to inspect the first and last rows of the dataset.

# for the first six rows
head(mtcars)

# for the last six rows
tail(mtcars)

Getting Numerical Descriptive Statistics in R

To get a quick overview of the dataset, the summary( ) function can also be used. We can use the summary( ) function separately for each of the variables in the dataset.

summary(mtcars)
summary(mpg)
summary(gear)
Some Descriptive Statistics in R

Note that the summary( ) the function provides five-number summary statistics (minimum, first quartile, median, third quartile, and maximum) and an average value of the variable used as the argument. Note the difference between the output of the following code.

summary(cyl)
summary( factor(cyl) )

Remember that if for a certain variable, the datatype is defined or changed R will automatically choose an appropriate descriptive statistics in R. If categorical variables are defined as a factor, the summary( ) function will result in a frequency table.

Some other functions can be used instead of summary() function.

# average value
mean(mpg)
# median value
median(mpg)
# minimum value
min(mpg)
# maximum value
max(mpg)
# Quatiles, percentiles, deciles
quantile(mpg)
quantile(mpg, probs=c(10, 20, 30, 70, 90))
# variance and standard deviation
var(mpg)
sd(mpg)
# Inter-quartile range
IQR(mpg)
# Range
range(mpg)

Creating a Frequency Table in R

We can produce a frequency table and a relative frequency table for any categorical variable.

freq <- table(cyl); freq
rf <- prop.table(freq)

barplot(freq)
barplot(rf)
pie(freq)
pie(rf)
Barplot and Pie chart Some Descriptive Statistics in R

Creating a Contingency Table (Cross-Tabulation)

The contingency table can be used to summarize the relationship between two categorical variables. The xtab( ) or table( ) functions can be used to produce cross-tabulation (contingency table).

xtabs(~cyl + gear, data = mtcars)
table(cyl, gear)

Finding a Correlation between Variables

The cor( ) function can be used to find the degree of relationship between variables using Pearson’s method.

cor(mpg, wt)

However, if variables are heavily skewed, the non-parametric method Spearman’s correlation can be used.

cor(mpg, wt, method = "spearman")

The scatter plot can be drawn using plot( ) a function.

plot(mpg ~ wt)

FAQs about Descriptive Statistics in R

  1. How to check the data types of different variables/columns in R?
  2. What is the use of str() function in R?
  3. What is the use of head() and tail() functions in R?
  4. How numerical statistics of a variable or data set can be obtained in R?
  5. What is the use of summary() function in R?
  6. How summary() function can be used to perform descriptive statistics of a categorical variable in R?
  7. How to produce a frequency table in R?
  8. What is the use of xtab() function in R?
  9. What is the use cor(), plot(), and pie() functions, explain with the help of examples.
  10. What functions are used to compute, mean, median, standard deviation, variance, Quantiles, and IQR.

Learn more about plot( ) function: plot( ) function

Visit: Learn Basic Statistics