R Frequently Asked Questions

Statistical Computing and Graphics in R

Tag: Descriptive Statistics in R

Calculating Descriptive Statistics in R

Here we will learn how to calculate descriptive statistical metrics on a data set and, finally, create a data quality report file. Let us start learning “Calculating Descriptive Statistics in R”.

We will treat each step as a task for better understanding. This also helps us complete the work in sequence.

Task 1: Load and View the Data Set
It is good practice to confirm the working directory using getwd() and save your data there. Alternatively, save the data in any folder and then set the path of this folder (directory) in R using the setwd() function.

getwd()
data <- read.csv("data.csv") 
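
As a hedged sketch of the load step: the file data.csv is not supplied with this post, so the example below assumes a temporary copy of mtcars as a stand-in and checks that the file exists before reading it. The names csv_path and loaded are illustrative only.

```r
# Assumption: a temporary CSV made from mtcars stands in for "data.csv".
csv_path <- file.path(tempdir(), "data.csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Checking for the file first gives a clearer error than read.csv() alone.
if (!file.exists(csv_path)) {
  stop("File not found: ", csv_path)
}

loaded <- read.csv(csv_path)
dim(loaded)    # number of rows and columns that were read
head(loaded)   # first six rows
```

With a real file, replace csv_path by the path to your own data, or call setwd() first and pass only the file name.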

Task 2: Calculate Measure of Frequency Metrics
Before calculating the frequency metrics, it is better to check the structure of the data and some other useful information about it. For example,

Note: here we are using the built-in mtcars data set.

data <- mtcars
str(data)
head(data)

length(data$cyl)
length(unique(data$cyl))

table(data$cyl)
freq <- table(data$cyl)
freq <- sort(freq, decreasing = TRUE)
print(freq)

The code above reports the number of observations in the data set, the number of unique categories of the cylinder variable (cyl), its frequency table, and finally the frequencies sorted in decreasing order.
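
Counts are often easier to read as proportions. The sketch below is one possible extension, not part of the original task list: it uses the base R functions prop.table() and cumsum() to turn the sorted counts into percentages. The names rel_freq, pct, and cum_pct are illustrative.

```r
data <- mtcars

# Sorted counts of the cylinder variable, largest first
freq <- sort(table(data$cyl), decreasing = TRUE)

rel_freq <- prop.table(freq)        # proportions that sum to 1
pct      <- round(100 * rel_freq, 1)  # the same values as percentages
cum_pct  <- cumsum(pct)             # running total of the percentages

print(freq)
print(pct)
print(cum_pct)
```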

Task 3: Calculate Measure of Central Tendency
Here we will calculate the common measures of central tendency: the mean, median, and mode.

mean(data$mpg)
mean(data$mpg, na.rm = T)

median(data$mpg)
median(data$mpg, na.rm = T)

Note the use of the na.rm argument. If the data contain missing values, na.rm should be set to TRUE so that they are removed before the calculation; otherwise the result is NA. Since the mtcars data set does not contain any missing values, both calls return the same result.

Base R has no built-in function for the statistical mode, that is, the most frequently occurring value of a variable (the mode() function returns the storage mode of an object instead). However, the mode can be computed by combining several functions. For example,

# for a continuous variable
uniquevalues <- unique(data$hp)
uniquevalues[which.max(tabulate(match(data$hp, uniquevalues)))]

# for categorical variable
uniquevalues <- unique(data$cyl)
uniquevalues[which.max(tabulate(match(data$cyl, uniquevalues)))]
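
The which.max() approach returns only the first value when several values tie for the highest count. A minimal sketch of a reusable helper that returns all tied values is shown below. The name stat_mode is hypothetical and is not a base R function.

```r
# Hypothetical helper: returns every value that shares the highest count.
stat_mode <- function(x, na.rm = TRUE) {
  if (na.rm) {
    x <- x[!is.na(x)]                 # drop missing values before counting
  }
  ux     <- unique(x)                 # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))    # how often each distinct value occurs
  ux[counts == max(counts)]           # keep all values tied for the maximum
}

stat_mode(mtcars$cyl)
stat_mode(c("a", "b", "b", "c", "c"))   # a tie: both "b" and "c" are returned
```

Because the helper indexes the original unique values, the result keeps the type of the input (numeric stays numeric, character stays character).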

Task 4: Calculate Measure of Dispersion Metrics

min(data$disp)
min(data$disp, na.rm = T)
max(data$disp)
max(data$disp, na.rm = T)
range(data$disp, na.rm = T)
var(data$disp, na.rm = T)
sd(data$disp, na.rm = T)
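
The individual calls above can also be collected into one named vector, which is easier to print or store. The sketch below does this and, as an assumption beyond the original list, adds the interquartile range (IQR()) and the coefficient of variation. The names x and dispersion are illustrative.

```r
x <- mtcars$disp

dispersion <- c(
  min   = min(x, na.rm = TRUE),
  max   = max(x, na.rm = TRUE),
  range = diff(range(x, na.rm = TRUE)),   # max minus min as a single number
  var   = var(x, na.rm = TRUE),
  sd    = sd(x, na.rm = TRUE),
  IQR   = IQR(x, na.rm = TRUE),           # spread of the middle 50% of values
  cv    = sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)  # sd relative to the mean
)

round(dispersion, 3)
```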

Task 5: Calculate Additional Quality Data Metrics
To compute further data quality metrics, we must be aware of the data type of each variable. Suppose a variable holds numbers but its data type is set to character. For example,

test <- as.character(1:3)

Finding the mean of such a character variable (the numbers have been converted to the character class) returns NA with a warning.

mean(test)
[1] NA 
Warning message: In mean.default(test) : argument is not numeric or logical: returning NA

Therefore, one must be aware of the data type and class of the variable on which calculations are performed. The class of a variable in R can be checked using the class() function. For example,

class(data$hp)
class(mtcars)
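
Once the class is known, a character variable that really holds numbers can be converted before computing on it. The following is a minimal sketch of that check-then-convert step. The name test_num is illustrative.

```r
test <- as.character(1:3)
class(test)                  # "character"

# Convert only when needed; as.numeric() turns strings that are not
# valid numbers into NA (with a warning), so inspect the result afterwards.
test_num <- if (is.numeric(test)) test else as.numeric(test)

class(test_num)              # "numeric"
mean(test_num)               # 2
```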

It is also useful to know the number of missing observations in the data set.

test2 <- c(NA, 2, 55, 10, NA)
sum(is.na(test2))
sum(is.na(data$hp))

Note that the data set we are using does not contain any missing values.
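
To see the missing-value counts for every column at once, is.na() can be applied to the whole data frame. Since mtcars has no missing values, the sketch below injects two NA values into a copy purely for illustration. The names df, na_counts, and complete are assumptions.

```r
df <- mtcars
df$hp[c(2, 5)] <- NA              # inject two missing values for illustration

na_counts <- colSums(is.na(df))   # number of missing values per column
na_counts[na_counts > 0]          # show only the columns that have any

mean(is.na(df$hp))                # proportion of missing values in hp
complete <- sum(complete.cases(df))  # rows without any missing value
complete
```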

Task 6: Calculate Descriptive Statistics on all Columns
There are functions in R that can be applied to each column of a data set to perform a calculation on it. For example, the apply() function with MARGIN = 2 and length as its function argument gives the number of observations in each column, and sapply() with a user-defined function gives the minimum of each column.

apply(data, MARGIN=2, length)
sapply(data, function(x) min(x, na.rm=T))
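
One difference between the two functions matters for data quality work: apply() first converts a data frame to a matrix, so a data frame with mixed column types becomes all character, whereas sapply() visits each column as it is. The small data frame mixed below is made up for illustration.

```r
# A made-up data frame with one numeric and one character column
mixed <- data.frame(x = c(1.5, 2.5, NA), g = c("a", "b", "a"))

via_apply  <- apply(mixed, 2, class)   # columns pass through as.matrix() first
via_sapply <- sapply(mixed, class)     # each column keeps its own class

via_apply    # every column is reported as "character"
via_sapply   # x is reported as "numeric"
```

For all-numeric data such as mtcars the two give the same answers, but sapply() is the safer choice when column types differ.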

Let us create a user-defined function that can compute the minimum, maximum, mean, total, number of missing values, unique values, and data type of each variable (column) of the data frame.

quality_data <- function(df = NULL){
    # Stop early; print() alone would let the function continue with NULL
    if (is.null(df))
        stop("Please pass a non-empty data frame")
    # One row per column of df, one column per metric
    summary_tab <- do.call(data.frame,
        list(
            Min = sapply(df, function(x) min(x, na.rm = TRUE)),
            Max = sapply(df, function(x) max(x, na.rm = TRUE)),
            Mean = sapply(df, function(x) mean(x, na.rm = TRUE)),
            Total = sapply(df, length),
            NULLS = sapply(df, function(x) sum(is.na(x))),
            Unique = sapply(df, function(x) length(unique(x))),
            DataType = sapply(df, class)
        )
    )
    # Round only the numeric metric columns
    nums <- vapply(summary_tab, is.numeric, FUN.VALUE = logical(1))
    summary_tab[, nums] <- round(summary_tab[, nums], digits = 3)
    return(summary_tab)
}

quality_data(data)

Task 7: Generate a Quality Data Report File

df_quality <- quality_data(data)
df_quality <- cbind(columns = rownames(df_quality),
                    data.frame(df_quality, row.names = NULL)  )

write.csv(df_quality, "Data Quality Report.csv", row.names = F)

# Same report with a timestamp (day-month-year-HourMinuteSecond) in the file name
write.csv(df_quality, paste0("Data Quality Report ",
      format(Sys.time(), "%d-%m-%Y-%H%M%S"), ".csv"),
      row.names = F)

The write.csv() function will create a file that contains all the results produced by the quality_data() function.
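
It can be worth reading the report back to confirm it was written as expected. The sketch below is self-contained, so it builds a small stand-in report (column names and means only) rather than calling quality_data(), and it writes to a temporary folder. The names report, stamp, out_file, and check are illustrative.

```r
# Stand-in report: one row per column of mtcars with its rounded mean
report <- data.frame(columns = names(mtcars),
                     Mean = round(sapply(mtcars, mean), 3),
                     row.names = NULL)

# Timestamped file name; tempdir() keeps the working directory clean
stamp    <- format(Sys.time(), "%Y-%m-%d-%H%M%S")
out_file <- file.path(tempdir(), paste0("Data Quality Report ", stamp, ".csv"))

write.csv(report, out_file, row.names = FALSE)

# Read the file back and compare it with what was written
check <- read.csv(out_file)
file.exists(out_file)
nrow(check) == nrow(report)
```

Putting the year first in the timestamp makes the files sort chronologically by name, which is a design choice rather than a requirement.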

That’s all about Calculating Descriptive Statistics in R. There are many other descriptive measures, which we will learn about in future posts.

To learn about importing and exporting different data files, see the post on Importing and Exporting Data in R.

R Frequently Asked Questions 2023 . Powered by WordPress
