In this post we will learn how to calculate descriptive statistics for a data set in R and, finally, write the results to a data quality report file. Let us start calculating descriptive statistics in R.

We will treat each step as a separate task. This makes the process easier to follow and lets us complete the work in sequence.

**Task 1: Load and View the Data Set**

It is a good idea to confirm the working directory using `getwd()` and save your data there. Alternatively, save the data in any folder and then set the path of this folder (directory) in R using the `setwd()` function.

    getwd()                        # confirm the working directory
    data <- read.csv("data.csv")   # load the data set
    head(data)                     # view the first six rows
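The loading step can also be written without changing the working directory, by building the file path explicitly. The following is a minimal sketch: since no real `data.csv` is available here, it assumes a temporary CSV file created from the built-in `mtcars` data set, and the names `csv_path` and `loaded` are only illustrative.

```r
# Assumption: no real "data.csv" exists, so we create a temporary CSV
# file from the built-in mtcars data set purely for illustration.
csv_path <- file.path(tempdir(), "data.csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the file using its full path instead of relying on setwd()
loaded <- read.csv(csv_path)

head(loaded)   # view the first six rows
dim(loaded)    # number of rows and columns
```

Using `file.path()` keeps the script independent of the current working directory, which can help when the same code is run on different machines.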

**Task 2: Calculate Measure of Frequency Metrics**

Before calculating the frequency metrics, it is a good idea to check the structure of the data and some other useful information about it.

Note: from here on we use the built-in `mtcars` data set.

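    data <- mtcars
    str(data)                    # structure: variables and their types
    head(data)                   # first six rows
    length(data$cyl)             # number of observations
    length(unique(data$cyl))     # number of unique cylinder values
    table(data$cyl)              # frequency of each cylinder value
    freq <- table(data$cyl)
    freq <- sort(freq, decreasing = TRUE)
    print(freq)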

The code above reports the number of observations in the data set, the number of unique categories of the cylinder variable, its frequency table, and finally the frequencies sorted in decreasing order.
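Counts are often easier to compare when shown next to their proportions. The sketch below is one possible way to do this with `prop.table()`; the object names `rel_freq` and `freq_table` are assumptions made for illustration and are not part of the original code.

```r
data <- mtcars

# sorted frequency table of the cylinder variable
freq <- sort(table(data$cyl), decreasing = TRUE)

# relative frequencies (proportions that sum to 1)
rel_freq <- prop.table(freq)

# combine counts and percentages in one data frame
freq_table <- data.frame(
  cyl     = names(freq),
  count   = as.vector(freq),
  percent = round(100 * as.vector(rel_freq), 1)
)
print(freq_table)
```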

**Task 3: Calculate Measure of Central Tendency**

Here we will calculate the common measures of central tendency: the mean, median, and mode.

    mean(data$mpg)
    mean(data$mpg, na.rm = TRUE)

    median(data$mpg)
    median(data$mpg, na.rm = TRUE)

Note the use of the `na.rm` argument. If the data contain missing values, `na.rm` should be set to `TRUE`; otherwise the result is `NA`. Since the `mtcars` data set does not contain any missing values, both calls return the same result.

R has no built-in function for computing the most frequent value (the mode) of a variable; the base `mode()` function returns the storage mode of an object instead. However, we can calculate the mode by combining several functions. For example,

    # for a continuous variable
    uniquevalues <- unique(data$hp)
    uniquevalues[which.max(tabulate(match(data$hp, uniquevalues)))]

    # for a categorical variable
    uniquevalues <- unique(data$cyl)
    uniquevalues[which.max(tabulate(match(data$cyl, uniquevalues)))]
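The expression above returns only the first value that reaches the highest frequency, because `which.max()` stops at the first maximum. If ties matter, the same idea can be wrapped in a small helper that returns every tied value. This is a sketch under that assumption; `stat_mode()` is a hypothetical name, not a base R function.

```r
# Hypothetical helper: returns every value that attains the highest frequency
stat_mode <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]                 # drop missing values if requested
  uniquevalues <- unique(x)
  counts <- tabulate(match(x, uniquevalues))   # frequency of each unique value
  uniquevalues[counts == max(counts)]          # keep all values tied for the top
}

stat_mode(mtcars$cyl)                    # 8
stat_mode(mtcars$hp)                     # may return more than one value
stat_mode(c("a", "b", "b", "a", "c"))    # "a" "b"
```

Returning all tied values avoids silently reporting a single mode for a variable that is in fact multimodal.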

**Task 4: Calculate Measure of Dispersion Metrics**

    min(data$disp)
    min(data$disp, na.rm = TRUE)
    max(data$disp)
    max(data$disp, na.rm = TRUE)
    range(data$disp, na.rm = TRUE)   # minimum and maximum together
    var(data$disp, na.rm = TRUE)     # sample variance
    sd(data$disp, na.rm = TRUE)      # sample standard deviation
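Note that `range()` returns the minimum and maximum rather than a single spread value. The sketch below shows how a few related dispersion measures might be derived from the same variable; the names `spread`, `iqr`, and `cv` are chosen here for illustration only.

```r
disp <- mtcars$disp

# range as a single number: maximum minus minimum
spread <- diff(range(disp, na.rm = TRUE))

# interquartile range: distance between the 1st and 3rd quartiles
iqr <- IQR(disp, na.rm = TRUE)

# coefficient of variation: standard deviation relative to the mean
cv <- sd(disp, na.rm = TRUE) / mean(disp, na.rm = TRUE)

# the standard deviation is the square root of the variance
all.equal(sd(disp), sqrt(var(disp)))   # TRUE
```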

**Task 5: Calculate Additional Quality Data Metrics**

To compute more data metrics, we must be aware of the data type of each variable. Suppose a variable holds numbers but its data type is set to character. For example,

    test <- as.character(1:3)

Finding the mean of such a character variable (numbers stored with the character class) returns `NA` with a warning.

    mean(test)
    [1] NA
    Warning message:
    In mean.default(test) : argument is not numeric or logical: returning NA

Therefore, one must be aware of the data type and class of the variable on which calculations are performed. The class of a variable in R can be checked using the `class()` function. For example,

    class(data$hp)
    class(mtcars)
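Once the class check shows that numbers are stored as character, they can be converted before any calculation. The following is a minimal sketch of that conversion using `as.numeric()`; the vector `mixed` is made up to show what happens with entries that are not valid numbers.

```r
test <- as.character(1:3)
class(test)                  # "character"

# convert the character vector to numeric before computing the mean
test_num <- as.numeric(test)
class(test_num)              # "numeric"
mean(test_num)               # 2

# entries that are not valid numbers become NA (normally with a warning)
mixed <- suppressWarnings(as.numeric(c("1", "2", "three")))
sum(is.na(mixed))            # 1
```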

It is also useful to know the number of missing observations in the data set.

    test2 <- c(NA, 2, 55, 10, NA)
    sum(is.na(test2))      # 2 missing values
    sum(is.na(data$hp))    # 0 missing values

Note that the data set we are using does not contain any missing values.
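To see how the missing value counts behave on data that does contain gaps, the sketch below works on a copy of `mtcars` in which a few values are deliberately set to `NA`. The copy `data_na` and the chosen positions are assumptions made purely for illustration.

```r
# Assumption: a copy of mtcars with a few values set to NA for illustration
data_na <- mtcars
data_na$hp[c(2, 5)] <- NA
data_na$mpg[10] <- NA

colSums(is.na(data_na))          # missing values per column
sum(!complete.cases(data_na))    # rows with at least one missing value: 3

mean(data_na$hp)                 # NA, because of the missing values
mean(data_na$hp, na.rm = TRUE)   # mean of the observed values only
```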

**Task 6: Calculate Descriptive Statistics on all Columns**

R has functions that apply a calculation to each column of a data set. For example, the `apply()` function is used below to compute the number of observations in each column by passing the `length` function as an argument, and `sapply()` is used to compute the minimum of each column.

    apply(data, MARGIN = 2, length)
    sapply(data, function(x) min(x, na.rm = TRUE))
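One difference between the two functions is worth keeping in mind: `apply()` first converts the data frame to a matrix, so a single non-numeric column turns every column into character, while `sapply()` processes each column with its own type. The sketch below illustrates this on a small made-up data frame `df`, which is not part of the original example.

```r
# Assumption: a small made-up data frame with one non-numeric column
df <- data.frame(x = c(1.5, 2.5, 3.5),
                 g = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

apply(df, 2, class)    # both columns are reported as "character"
sapply(df, class)      # x is "numeric", g is "character"
```

For this reason, `sapply()` is usually the safer choice for column-wise summaries of data frames with mixed column types.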

Let us create a user-defined function that computes the minimum, maximum, mean, total number of observations, number of missing values, number of unique values, and data type of each variable (column) of a data frame.

    quality_data <- function(df = NULL) {
      # stop early if no data frame is supplied
      if (is.null(df)) stop("Please pass a non-empty data frame")

      summary_tab <- do.call(data.frame, list(
        Min      = sapply(df, function(x) min(x, na.rm = TRUE)),
        Max      = sapply(df, function(x) max(x, na.rm = TRUE)),
        Mean     = sapply(df, function(x) mean(x, na.rm = TRUE)),
        Total    = apply(df, 2, length),
        NULLS    = sapply(df, function(x) sum(is.na(x))),
        Unique   = sapply(df, function(x) length(unique(x))),
        DataType = sapply(df, class)
      ))

      # round the numeric summary columns to three decimal places
      nums <- vapply(summary_tab, is.numeric, FUN.VALUE = logical(1))
      summary_tab[, nums] <- round(summary_tab[, nums], digits = 3)

      return(summary_tab)
    }

    quality_data(data)
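The function above works on `mtcars` because every column is numeric. For a data frame with non-numeric columns, `min()`, `max()`, and `mean()` may fail or return `NA`, so one option is to keep only the numeric columns before summarizing. The following sketch shows that filtering step on the built-in `iris` data set; using `iris` and the names `num_cols` and `iris_num` are assumptions made for illustration.

```r
# iris has four numeric columns and one factor column (Species)
num_cols <- sapply(iris, is.numeric)

# keep only the numeric columns; drop = FALSE preserves the data frame
iris_num <- iris[, num_cols, drop = FALSE]

names(iris_num)
sapply(iris_num, function(x) mean(x, na.rm = TRUE))
```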

**Task 7: Generate a Quality Data Report File**

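    df_quality <- quality_data(data)

    # move the variable names from the row names into a regular column
    df_quality <- cbind(columns = rownames(df_quality),
                        data.frame(df_quality, row.names = NULL))

    # write the report with a fixed file name
    write.csv(df_quality, "Data Quality Report.csv", row.names = FALSE)

    # or write it with a date and time stamp in the file name
    write.csv(df_quality,
              paste0("Data Quality Report ",
                     format(Sys.time(), "%d-%m-%Y-%H%M%S"), ".csv"),
              row.names = FALSE)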

The `write.csv()` function creates a file that contains all the results produced by the `quality_data()` function.
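After writing a report, it can be useful to read it back and confirm that the file has the expected columns. The sketch below does this in a temporary folder with a small stand-in table; the data frame `report` and the names `file_name`, `report_path`, and `check` are hypothetical and only serve the example.

```r
# Assumption: a small stand-in for the data quality table
report <- data.frame(columns = c("mpg", "cyl"), NULLS = c(0, 0))

# file name with a day-month-year and hour-minute-second stamp
file_name <- paste0("Data Quality Report ",
                    format(Sys.time(), "%d-%m-%Y-%H%M%S"), ".csv")
report_path <- file.path(tempdir(), file_name)

write.csv(report, report_path, row.names = FALSE)

file.exists(report_path)          # TRUE if the file was written
check <- read.csv(report_path)    # read the report back
names(check)                      # "columns" "NULLS"
```

Adding a time stamp to the file name keeps earlier reports from being overwritten when the script is run again.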

That is all about calculating descriptive statistics in R. There are many other descriptive measures, which we will cover in future posts.

To learn about importing and exporting different data files, see the post on Importing and Exporting Data in R.
