Summary Statistics in R

In this article, you will learn about how to perform Summary Statistics in R Language on a data set and finally, you will create a data quality Report file. Let us start learning “Computing Summary Statistics in R”.

We will follow each step as a Task for better understanding. It will also help us to complete all work in sequential tasks.

Task 1: Load and View Data Set

It is better to confirm the working directory using getwd() and save your data in the working directory, or save the data in the required folder and then set the path of this folder (directory) in R using setwd() function.

getwd()
data <- read.csv("data.csv")

Task 2: Calculate Measure of Frequency Metrics in R

Before calculating the frequency metrics it is better to check the data structure and some other useful information about the data, For example,

Note: here we are using mtcars data set.

data <- mtcars
str(data)
head(data)
length(data$cyl)
length(unique(data$cyl))
table(data$cyl)

freq <- table(data$cyl)
freq <- sort(freq, descreasing = T)
print(freq)
Descriptive summary Statistics in R

The above lines of code will tell you about the number of observations in the data set, the frequency of the cylinder variable, its unique category, and finally sorted frequency in order.

Task 3: Calculate the Measure of Central Tendency in R

Here we will calculate some available measures of central tendencies such as mean, median, and mode. One can easily calculate the measures of central tendency in R by following the commands below:

mean(data$mpg)
mean(data$mpg, na.rm = T)
median(data$mpg)
median(data$mpg, na.rm = T)

Note the use of na.rm argument. If there are missing values in the data then na.rm should be set to true. Since the mtcars data set does not contain any missing values, therefore, results for both will be the same.

There is no direct function to compute the most repeated value in the variable. However, using a combination of different functions we can calculate the mode. For example

# for continuous variable
uniquevalues <- unique(data$hp)
uniquevalues[which.max(tabulate(match(data$ho, uniquevalues)))]
# for categorical variable
uniquevalues <- unique(data$cyl)
uniquevalues[which.max(tabulate(match(data$cyl, uniquevalues)))]

Task 4: Calculate Measure of Dispersion in R Programming

The measures of dispersion such as range, variance, and standard deviation can be computed as given below. The use of different functions for the measure of dispersion in R programming is described as follows:

min(data$disp)
min(data$disp, na.rm = T)
max(data$disp)
max(data$disp, na.rm = T)
range(data$disp, na.rm = T)
var(data$disp, na.rm = T)
sd(data$disp, na.rm = T)

Task 5: Calculate Additional Quality Data Metrics

To compute more data metrics we must be aware of the data type of variables. Suppose we have numbers but its data type is set to the character. For example,

test <- as.character(1:3)

Finding the mean of such character variable (the numbers are converted to character class) will result in a warning.

mean(test)

[1] NA 
Warning message: In mean.default(test) : argument is not numeric or logical: returning NA

Therefore, one must be aware of the data type and class of the variable for which calculations are being performed. The class of variable in R can be checked using class() function. For example

class(data$hp)
class(mtcars)

It may also be useful if we know the number of missing observations in the data set.

test2 <- c(NA, 2, 55, 10, NA)

sum(is.na(test2))
sum(is.na(data$hp))
sum(is.na(data$hp))

Note that the data set we are using does not contain any missing values.

Task 6: Computing Summary Statistics in R on all Columns

There are functions in R that can be applied to each column to perform certain calculations on them. For example, apply() the function is used to compute the number of observations in the data set using length function as an argument of apply() function.

apply(data, MARGIN=2, length)

sapply(data, function(x) min(x, na.rm=T))

Let us create a user-defined function that can compute the minimum, maximum, mean, total, number of missing values, unique values, and data type of each variable (column) of the data frame.

quality_data <- function(df = NULL){
    if (is.null(df))
          print("Please Pass a non-empty data frame")
  
summary_tab <- do.call(data.frame,
     list(
           Min = sapply(df, function(x) min(x, na.rm = T) ),
           Max = sapply(df, function(x) max(x, na.rm = T) ),
           Mean = sapply(df, function(x) mean(x, na.rm = T) ),
           Total = apply(df, 2, length),
           NULLS = sapply(df, function(x) sum(is.na(x)) ),
           Unique = sapply(df, function(x) length(unique(x)) ),
           DataType = sapply(df, class)
      )
)
                         
nums <- vapply(summary_tab, is.numeric, FUN.VALUE = logical(1))
summary_tab[, nums] &lt;- round(summary_tab[, nums], digits = 3)
      
return(summary_tab)

}

quality_data(data)

Task 7: Generate a Quality Data Report File

df_quality <- quality_data(data)
df_quality <- cbind(columns = rownames(df_quality),
                    data.frame(df_quality, row.names = NULL)  )

write.csv(df_quality, "Data Quality Report.csv", row.names = F)

write.csv(df_quality, paste0("Data Quality Repor", 
      format(Sys.time(), "%d-%m-%Y-%M%M%S"), ".csv"),
      row.names = F)

The write.csv() function will create a file that contains all the results produced by the quality_data() function.

That’s all about Calculating Descriptive Statistics in R. There are many other descriptive measures, we will learn in future posts.

To learn about importing and exporting different data files, see the post on Importing and Exporting Data in R.

FAQs in R

  1. What summary statistics can easily be computed in R?
  2. How to load the data set in the current workspace?
  3. What are the functions that can be used to compute different measures of dispersions in R Language?
  4. How to compute the summary statistics of all columns at once in R?
  5. What measure of central tendencies can be computed in R?
  6. What functions can be used to get information about the loaded dataset in R?
  7. How missing observations can be identified in R?

Learn Basic Statistics

One-Way ANOVA in R

Learn how to perform One-Way ANOVA in R with code examples, post-hoc tests (Tukey HSD), assumption checks, and result interpretation. Perfect for beginners & researchers!

The two-sample t or z-test is used to compare two groups from an independent population. However, if there are more than two groups, One-Way ANOVA (analysis of variance) or its further versions can be used in R.

Introduction to One-Way ANOVA

The statistical test associated with ANOVA is the F-test (also called F-ratio). In the ANOVA procedure, an observed F-value is computed and then compared with a critical F-value derived from the relevant F-distribution. The F-value comes from a family of F-distribution defined by two numbers (the degrees of freedom). Note that the F-distribution cannot be negative as it is the ratio of variance, and variances are always positive numbers.

The One-Way ANOVA is also known as one-factor ANOVA. It is the extension of the independent two-sample test for comparing means when there are more than two groups. The data in One-Way ANOVA is organized into several groups based on grouping variables (called factor variables, too).

To compute the F-value, the ratio of “the variance between groups” and the “variance within groups” needs to be computed. The assumptions of ANOVA should also be checked before performing the test. We will learn how to perform One-Way ANOVA in R.

One-Way ANOVA in R

Suppose we are interested in finding the difference of miles per gallon based on the number of cylinders in an automobile; from the dataset “mtcars”. Let us get some basic insight into the data before performing the ANOVA.

# load and attach the data mtcars
attach(mtcars)
# see the variable names and initial observations
head(mtcars)

Let us draw the boxplot of each group

boxplot(mpg ~ cyl, main="Boxplot", xlab="Number of Cylinders", ylab="mpg")

Basic Syntax of aov() Function

# Using aov()
result <- aov(dependent_variable ~ independent_variable, data = dataset)
summary(result)

# Using lm() (alternative method)
result <- lm(dependent_variable ~ independent_variable, data = dataset)
anova(result)

Now, to perform One-Way ANOVA in R using the aov() function. The example for performing One-Way ANOVA in R is as follows

aov(mpg ~ cyl)

The variable “mpg” is continuous, and the variable “cyl” is the grouping variable. From the output note, the degrees of freedom are under the variable “cyl”. It will be one. It means the results are not correct as the degrees of freedom should be two as there are three groups on “cyl”. In the mode (data type) of grouping variable required for ANOVA  should be the factor variable. For this purpose, the “cyl” variable can be converted to a factor as

cyl <- as.factor(cyl)

Now re-issue the aov( ) function as

aov(mpg ~ cyl)

Now the results will be as required. To get the ANOVA table, use the summary() function as

summary(aov (mpg ~ cyl))

Let’s store the ANOVA results obtained from aov() function in object say res

res <- aov(mpg ~ cyl)
summary(res)

Let us find the means of each number of the cylinder group

print(model.tables(res, "means"), digits = 4)
Step-by-step One-Way ANOVA in R

Post-hoc tests for ANOVA in R (Tukey HSD)

Post-hoc tests or multiple-pairwise comparison tests help in finding out which groups differ (significantly) from one other and which do not. The post-hoc tests allow for multiple-pairwise comparisons without inflating the type-I error. To understand it, suppose the level of significance (type-I error) is 5%. Then, the probability of making at least one Type-I error (assuming independence of three events), the maximum family-wise error rate, will be

$1-(0.95 \times 0.95 \times 0.95) =  14.2%$

It will give the probability of having at least one FALSE alarm (type-I error).

To perform Tukey’s post hoc test and plot the group’s differences in means from Tukey’s test.

# Tukey Honestly Significant Differences
TukeyHSD(res)
plot(TukeyHSD(res))

Diagnostic Plots (Checking Model Assumptions)

The diagnostic plots can be used to check the assumptions of heteroscedasticity, normality, and influential observations.

layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(res)
Diagnostic Plots for One-Way ANOVA in R

Levene’s Test

To check the assumption of ANOVA, Levene’s test can be used. For this purpose leveneTest() function can be used which is available in the car package.

library(car)
leveneTest(res)

https://itfeature.com

https://gmstat.com

Statistical Models in R Language

R language provides an interlocking suite of facilities that make fitting statistical models very simple. The output from statistical models in R language is minimal, and one needs to ask for the details by calling extractor functions.

R is one of the most powerful tools for statistical modeling, offering a wide range of functions and packages for different types of analyses. This guide covers the fundamentals of building, evaluating, and interpreting statistical models in R.

Defining Statistical Models in R Language

The template for a statistical model is a linear regression model with independent, heteroscedastic errors, that is
$$\sum_{j=0}^p \beta_j x_{ij}+ e_i, \quad e_i \sim NID(0, \sigma^2), \quad i=1,2,\dots, n, j=1,2,\cdots, p$$

In matrix form, the statistical model can be written as

$$y=X\beta+e$$

where the $y$ is the dependent (response) variable, $X$ is the model matrix or design matrix (matrix of regressors), and has columns $x_0, x_1, \cdots, x_p$, the determining variables with intercept term. Usually, $x_0$ is a column of ones defining an intercept term in the statistical model.

Statistical Model Examples

Suppose $y, x, x_0, x_1, x_2, \cdots$ are numeric variables, $X$ is a matrix. The following are some examples that specify statistical models in R.

  • y ~ x    or   y ~ 1 + x
    Both examples imply the same simple linear regression model of $y$ on $x$. The first formulae have an implicit intercept term, and the second formulae have an explicit intercept term.
  • y ~ 0 + x  or  y ~ -1 + x  or y ~ x – 1
    All these imply the same simple linear regression model of $y$ on $x$ through the origin, without an intercept term.
  • log(y) ~ x1 + x2
    Imply multiple regression of the transformed variable, $(log(y)$ on $x_1$ and $x_2$ with an implicit intercept term.
  • y ~ poly(x , 2)  or  y ~ 1 + x + I(x, 2)
    Imply a polynomial regression model of $y$ on $x$ of degree 2 (second-degree polynomials), and the second formulae use explicit powers as a basis.
  • y~ X + poly(x, 2)
    Multiple regression $y$ with a model matrix consisting of the design matrix $X$ as well as polynomial terms in $x$ to degree 2.

Note that the operator ~ defines a model formula in the R language. The form of an ordinary linear regression model is, $response\,\, ~ \,\, op_1\,\, term_1\,\, op_2\,\, term_2\,\, op_3\,\, term_3\,\, \cdots $,

where

  • The response is a vector or matrix defining the response (dependent) variable(s).
  • $op_i$ is an operator, either + or -, implying the inclusion or exclusion of a term in the model. The + operator is optional.
  • $term_i$ is either a matrix or vector or 1. It may be a factor or a formula expression consisting of factors, vectors, or matrices connected by formula operators.
Statistical Models in R Language

Best Practices for Statistical Modeling in R

  1. Always check assumptions (normality, homoscedasticity, multicollinearity)
  2. Use appropriate model diagnostics (residual plots, VIF, QQ plots)
  3. Consider regularization (ridge/lasso regression) for high-dimensional data
  4. Document your modeling process for reproducibility
  5. Validate models using holdout samples or cross-validation

Important R Packages for Statistical Modeling

R Package NamePurpose
statsBase R statistical functions
lme4Mixed effects models
glmnetRegularized regression
forecastTime series analysis
caretMachine learning workflow
tidymodelsModern modeling framework

R provides an incredibly rich ecosystem for statistical modeling, from simple linear regression to advanced machine learning algorithms. By understanding these fundamental modeling techniques and how to implement them in R, one will be well-equipped to tackle a wide variety of data analysis problems.

FAQS about Statistical Models in R

  1. How are statistical models specified in R Language?
  2. How is linear regression performed in R language using the formula?
  3. How can linear regression be performed without intercept in R?
  4. How can polynomial regression be performed in R?
  5. Write about the ~ operator in R.
Statistical Models in R Language R FAQs https://rfaqs.com

https://gmstat.com
https://itfeature.com