Exploratory Data Analysis - R Programming FAQs

R Quick Reference Guide

R Language: A Quick Reference Guide for learning R programming, with short descriptions of widely used commands. It helps beginner and intermediate users of the R programming language look up functions quickly. The Quick Reference is organized into groups. Let us start with R Language: A Quick Reference – IV.

This Quick Reference covers functions for computing descriptive statistics on vectors, matrices, lists, data frames, arrays, and factors.

Basic Descriptive Statistics in R Language

The following widely used functions are helpful when computing descriptive statistics. They are not direct descriptive statistics functions; however, they help in computing other descriptive statistics.

R Command	Short Description
sum(x1, x2, …, xn)	Computes the sum (total) of the $n$ numeric values given as arguments
prod(x1, x2, …, xn)	Computes the product of the $n$ numeric values given as arguments
min(x1, x2, …, xn)	Returns the smallest of the $n$ values given as arguments
max(x1, x2, …, xn)	Returns the largest of the $n$ values given as arguments
range(x1, x2, …, xn)	Returns both the smallest and the largest of the $n$ values given as arguments
pmin(x1, x2, …)	Returns the parallel (element-wise) minima of the input vectors
pmax(x1, x2, …)	Returns the parallel (element-wise) maxima of the input vectors
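
As a brief illustration of the table above, the sketch below applies these functions to two small vectors. The vectors x and y are made-up example data, not part of the original guide. It mainly shows the difference between min(), which returns one value, and pmin(), which works position by position.

```r
# Hypothetical example vectors
x <- c(4, 9, 2, 7)
y <- c(5, 1, 8, 3)

total    <- sum(x)        # 4 + 9 + 2 + 7 = 22
product  <- prod(x)       # 4 * 9 * 2 * 7 = 504
smallest <- min(x, y)     # single smallest value across both vectors: 1
largest  <- max(x, y)     # single largest value across both vectors: 9
extremes <- range(x, y)   # smallest and largest together: 1 9

# pmin() and pmax() compare the vectors position by position
par_min <- pmin(x, y)     # 4 1 2 3
par_max <- pmax(x, y)     # 5 9 8 7
```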

Statistical Descriptive Statistics in R Language

The following functions are used to compute measures of central tendency, measures of dispersion, and measures of position.

R Command	Short Description
mean(x)	Computes the arithmetic mean of all elements in $x$
sd(x)	Computes the standard deviation of all elements in $x$
var(x)	Computes the variance of all elements in $x$
median(x)	Computes the median of all elements in $x$
quantile(x)	Computes the median, quartiles, and extremes in $x$
quantile(x, p)	Computes the quantiles specified by $p$
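
The functions in this table can be tried on a small vector. The following is a minimal sketch using made-up data (the vector x is hypothetical). Note that var() and sd() use the sample formulas with $n - 1$ in the denominator, and quantile() uses R's default interpolation method.

```r
# Hypothetical example data
x <- c(2, 4, 4, 5, 7, 8)

avg      <- mean(x)       # 30 / 6 = 5
med      <- median(x)     # average of the two middle values: 4.5
variance <- var(x)        # sum of squared deviations (24) / (n - 1) = 4.8
std_dev  <- sd(x)         # square root of the variance

five_num <- quantile(x)                  # 0%, 25%, 50%, 75%, 100%: 2 4 4.5 6.5 8
deciles  <- quantile(x, c(0.1, 0.9))     # 10th and 90th percentiles: 3 7.5
```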

Cumulative Summaries in R Language

The following functions compute running (cumulative) summaries of a vector. Each returns a vector of the same length as $x$.

R Command	Short Description
cumsum(x)	Computes the cumulative sum of $x$
cumprod(x)	Computes the cumulative product of $x$
cummin(x)	Computes the cumulative minimum of $x$
cummax(x)	Computes the cumulative maximum of $x$
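
A short sketch of the cumulative functions on a made-up vector (x is hypothetical example data). Each result has one entry per element of x, holding the summary of all elements up to that position.

```r
# Hypothetical example data
x <- c(3, 1, 4, 1, 5)

running_sum  <- cumsum(x)    # 3  4  8  9 14
running_prod <- cumprod(x)   # 3  3 12 12 60
running_min  <- cummin(x)    # 3  1  1  1  1
running_max  <- cummax(x)    # 3  3  4  4  5
```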

Sorting and Ordering Elements in R Language

The sorting and ordering functions are especially useful in non-parametric methods.

R Command	Short Description
sort(x)	Sorts all elements of $x$ in ascending order
sort(x, decreasing = TRUE)	Sorts all elements of $x$ in descending order
rev(x)	Reverses the order of the elements in $x$
order(x)	Returns the permutation of indices that sorts $x$
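
The difference between sort() and order() is easiest to see on a small example. In the sketch below, x and labels are made-up vectors; order() returns positions, which can then be used to rearrange x itself or any other vector of the same length.

```r
# Hypothetical example data
x      <- c(30, 10, 20)
labels <- c("c", "a", "b")

ascending  <- sort(x)                      # 10 20 30
descending <- sort(x, decreasing = TRUE)   # 30 20 10
reversed   <- rev(x)                       # 20 10 30

idx <- order(x)                  # positions of the elements in sorted order: 2 3 1
sorted_by_index <- x[idx]        # same result as sort(x)
sorted_labels   <- labels[idx]   # labels rearranged to follow x: "a" "b" "c"
```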

Sequence and Repetition of Elements in R Language

These functions are used to generate a sequence of numbers or repeat the set of numbers $n$ times.

R Command	Short Description
a:b	Generates a sequence of numbers from $a$ to $b$ in steps of size 1
seq(n)	Generates a sequence of numbers from 1 to $n$
seq(a, b)	Generates a sequence of numbers from $a$ to $b$ in steps of size 1; it is the same as a:b
seq(a, b, by=s)	Generates a sequence of numbers from $a$ to $b$ in steps of size $s$
seq(a, b, length.out=n)	Generates a sequence of $n$ equally spaced numbers from $a$ to $b$
rep(x, n)	Repeats the whole of $x$ $n$ times
rep(x, each=n)	Repeats each element of $x$ $n$ times
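
The following sketch runs each form from the table with small made-up arguments. The argument name length.out is the full name of the argument that length abbreviates.

```r
one_to_five <- 1:5                          # 1 2 3 4 5
also_five   <- seq(5)                       # 1 2 3 4 5
evens       <- seq(2, 10, by = 2)           # 2 4 6 8 10
quarters    <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

whole_repeated <- rep(c(1, 2), 3)           # 1 2 1 2 1 2
each_repeated  <- rep(c(1, 2), each = 3)    # 1 1 1 2 2 2
```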


In this article, you will learn how to compute summary statistics in R for a data set and, finally, create a data quality report file. Let us start learning “Computing Summary Statistics in R”.

Each step is presented as a task, so that the work can be followed and completed in sequence.

Task 1: Load and View Data Set

It is better to confirm the working directory using getwd() and save your data there, or save the data in the required folder and then set that folder as the working directory using the setwd() function.

getwd()
data <- read.csv("data.csv")
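
If no data.csv file is at hand, the loading step can still be tried end to end. The sketch below is only an illustration under assumptions: it writes the built-in mtcars data to a temporary file (the path csv_path is hypothetical) and reads it back, checking that the file exists before loading.

```r
# Hypothetical file location: a temporary file stands in for "data.csv"
csv_path <- file.path(tempdir(), "data.csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Load the file only if it is really there
if (file.exists(csv_path)) {
  data <- read.csv(csv_path)
  dims <- dim(data)             # number of rows and columns
  first_rows <- head(data, 3)   # first three observations
}
```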

Task 2: Calculate Measure of Frequency Metrics in R

Before calculating the frequency metrics, it is better to check the structure of the data and some other useful information about it. For example,

Note: here we are using the mtcars data set.

data <- mtcars
str(data)
head(data)
length(data$cyl)
length(unique(data$cyl))
table(data$cyl)

freq <- table(data$cyl)
freq <- sort(freq, decreasing = TRUE)
print(freq)

The code above reports the number of observations in the data set, the number of unique categories of the cylinder variable (cyl), its frequency table, and finally the sorted frequencies.
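
Frequencies are often easier to read as proportions. As a possible extension of the task (prop.table() is not used in the original text), the sketch below converts the sorted counts of cyl into relative frequencies.

```r
# Counts of each cylinder category, largest first
freq <- table(mtcars$cyl)
freq_sorted <- sort(freq, decreasing = TRUE)

# Relative frequencies: each count divided by the total number of cars
rel_freq <- prop.table(freq_sorted)

# The same information as percentages, rounded for display
pct <- round(100 * rel_freq, 1)
```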

Task 3: Calculate the Measure of Central Tendency in R

Here we will calculate some common measures of central tendency: the mean, median, and mode. The mean and median can be calculated with the commands below:

mean(data$mpg)
mean(data$mpg, na.rm = T)

median(data$mpg)
median(data$mpg, na.rm = T)

Note the use of the na.rm argument. If the data contain missing values, set na.rm = TRUE so that they are removed before the calculation; otherwise the result is NA. Since the mtcars data set does not contain any missing values, both calls return the same result.

Base R has no built-in function that returns the most frequent value (the statistical mode); the mode() function reports the storage mode of an object instead. However, the mode can be computed by combining several functions. For example:

# for a numeric variable
uniquevalues <- unique(data$hp)
uniquevalues[which.max(tabulate(match(data$hp, uniquevalues)))]

# for categorical variable
uniquevalues <- unique(data$cyl)
uniquevalues[which.max(tabulate(match(data$cyl, uniquevalues)))]
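
Since the same three steps are repeated for every variable, they can be wrapped in a small helper. This is a sketch, not a standard R function: the name stat_mode is hypothetical, missing values are dropped first, and when several values tie for the highest count the one that appears first in the data is returned.

```r
# Hypothetical helper: most frequent value of a vector
stat_mode <- function(x) {
  x <- x[!is.na(x)]                             # ignore missing values
  unique_values <- unique(x)                    # candidate values
  counts <- tabulate(match(x, unique_values))   # how often each one occurs
  unique_values[which.max(counts)]              # first value with the top count
}

mode_cyl    <- stat_mode(mtcars$cyl)              # 8 (14 of the 32 cars)
mode_letter <- stat_mode(c("a", "b", "b", NA))    # "b"
```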

Task 4: Calculate Measure of Dispersion in R Programming

Measures of dispersion such as the range, variance, and standard deviation can be computed as shown below:

min(data$disp)
min(data$disp, na.rm = T)
max(data$disp)
max(data$disp, na.rm = T)
range(data$disp, na.rm = T)
var(data$disp, na.rm = T)
sd(data$disp, na.rm = T)
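
A few further dispersion measures can be derived from the same building blocks. The sketch below is an optional extension: the interquartile range and the coefficient of variation are not covered in the original text, and the variable names are illustrative.

```r
x <- mtcars$disp

value_range <- range(x, na.rm = TRUE)   # smallest and largest value
range_width <- diff(value_range)        # largest minus smallest
iqr_value   <- IQR(x, na.rm = TRUE)     # 75th percentile minus 25th percentile

# Coefficient of variation: standard deviation relative to the mean
cv <- sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
```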

Task 5: Calculate Additional Quality Data Metrics

To compute more data metrics, we must be aware of the data type of each variable. Suppose a variable holds numbers, but its data type is character. For example,

test <- as.character(1:3)

Finding the mean of such a character variable (the numbers have been converted to the character class) returns NA with a warning.

mean(test)

[1] NA
Warning message:
In mean.default(test) : argument is not numeric or logical: returning NA
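
One way to handle this situation, sketched below, is to convert the variable back to a numeric type before computing the statistic. The name test_num is illustrative.

```r
test <- as.character(1:3)

# Convert the character values back to numbers
test_num <- as.numeric(test)

was_numeric <- is.numeric(test)       # FALSE: the original is character
now_numeric <- is.numeric(test_num)   # TRUE after conversion
avg <- mean(test_num)                 # (1 + 2 + 3) / 3 = 2
```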

Therefore, one must be aware of the data type and class of the variable on which calculations are performed. The class of a variable in R can be checked using the class() function. For example:

class(data$hp)
class(mtcars)

It may also be useful if we know the number of missing observations in the data set.

test2 <- c(NA, 2, 55, 10, NA)

sum(is.na(test2))
sum(is.na(data$hp))

Note that the data set we are using does not contain any missing values.
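
Because mtcars has no missing values, the effect is easier to see on a small made-up data frame. In the sketch below, df is hypothetical example data; colSums() and complete.cases() are used as one possible way to count missing values per column and per row.

```r
# Hypothetical data frame with some missing values
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", "y", NA))

missing_per_column <- colSums(is.na(df))     # a: 1, b: 1
complete_rows      <- sum(complete.cases(df)) # rows without any NA: 1
share_missing_a    <- mean(is.na(df$a))      # proportion of NA in column a: 1/3
```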

Task 6: Computing Summary Statistics in R on all Columns

R has functions that apply a calculation to every column of a data set. For example, the apply() call below uses length as its function argument to count the observations in each column, and the sapply() call computes the minimum of each column.

apply(data, MARGIN=2, length)

sapply(data, function(x) min(x, na.rm=T))
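
For an all-numeric data frame such as mtcars, the column-wise helpers give the same answer. The sketch below compares three of them (vapply() is an addition not used at this point in the original text). One difference worth knowing: apply() first converts the data frame to a matrix, so with mixed column types every value would become character.

```r
# Column means computed three ways
means_sapply <- sapply(mtcars, mean)
means_vapply <- vapply(mtcars, mean, FUN.VALUE = numeric(1))  # checks each result is one number
means_apply  <- apply(mtcars, MARGIN = 2, mean)               # works on the matrix version

same_result <- isTRUE(all.equal(means_sapply, means_vapply)) &&
  isTRUE(all.equal(means_sapply, means_apply))
```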

Let us create a user-defined function that can compute the minimum, maximum, mean, total, number of missing values, unique values, and data type of each variable (column) of the data frame.

quality_data <- function(df = NULL) {
  # Stop early if no data frame is supplied
  if (is.null(df)) {
    stop("Please pass a non-empty data frame")
  }

  # One row per column of df, one summary measure per column of the result
  summary_tab <- do.call(data.frame,
    list(
      Min      = sapply(df, function(x) min(x, na.rm = TRUE)),
      Max      = sapply(df, function(x) max(x, na.rm = TRUE)),
      Mean     = sapply(df, function(x) mean(x, na.rm = TRUE)),
      Total    = apply(df, 2, length),
      NULLS    = sapply(df, function(x) sum(is.na(x))),
      Unique   = sapply(df, function(x) length(unique(x))),
      DataType = sapply(df, class)
    )
  )

  # Round the numeric summary columns to three decimal places
  nums <- vapply(summary_tab, is.numeric, FUN.VALUE = logical(1))
  summary_tab[, nums] <- round(summary_tab[, nums], digits = 3)

  return(summary_tab)
}

quality_data(data)
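
The function above works for mtcars because every column is numeric. With other data this may not hold: min() fails on an unordered factor, and mean() returns NA with a warning for non-numeric input. One possible precaution, sketched here on the built-in iris data (the variable names are illustrative), is to select the numeric columns before summarizing.

```r
# iris has four numeric columns and one factor column (Species)
is_num <- vapply(iris, is.numeric, FUN.VALUE = logical(1))
numeric_part <- iris[, is_num, drop = FALSE]

# Summaries that are safe on the numeric subset
col_min  <- sapply(numeric_part, function(x) min(x, na.rm = TRUE))
col_mean <- sapply(numeric_part, function(x) mean(x, na.rm = TRUE))

# Columns left out of the numeric summary
skipped <- names(iris)[!is_num]   # "Species"
```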

Task 7: Generate a Quality Data Report File

df_quality <- quality_data(data)
df_quality <- cbind(columns = rownames(df_quality),
                    data.frame(df_quality, row.names = NULL)  )

write.csv(df_quality, "Data Quality Report.csv", row.names = F)

write.csv(df_quality, paste0("Data Quality Report ",
      format(Sys.time(), "%d-%m-%Y-%H%M%S"), ".csv"),
      row.names = FALSE)

Each write.csv() call creates a file that contains the results produced by the quality_data() function; the second call adds a timestamp to the file name.
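
To check that a report file was written as expected, it can be read back in. The sketch below is self-contained and therefore uses a small stand-in report (the data frame report and the path report_path are hypothetical) written to a temporary folder with a day-month-year-time stamp in its name.

```r
# Hypothetical stand-in for the data quality report
report <- data.frame(columns = c("mpg", "cyl"),
                     Min = c(min(mtcars$mpg), min(mtcars$cyl)))

# Timestamp: day-month-year followed by hours, minutes, seconds
stamp <- format(Sys.time(), "%d-%m-%Y-%H%M%S")
report_path <- file.path(tempdir(),
                         paste0("Data Quality Report ", stamp, ".csv"))

write.csv(report, report_path, row.names = FALSE)

# Read the file back to confirm its contents
report_back <- read.csv(report_path)
```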

That’s all about Calculating Descriptive Statistics in R. There are many other descriptive measures, which we will cover in future posts.

To learn about importing and exporting different data files, see the post on Importing and Exporting Data in R.

FAQs in R

What summary statistics can easily be computed in R?
How to load the data set in the current workspace?
Which functions can be used to compute different measures of dispersion in R Language?
How to compute the summary statistics of all columns at once in R?
Which measures of central tendency can be computed in R?
What functions can be used to get information about the loaded dataset in R?
How can missing observations be identified in R?
