Logistic Regression Models in R

This article covers the use and application of logistic regression models in the R language. In a logistic regression model, the response variable ($y$) is categorical (binary or dichotomous), taking values such as 1 or 0 (TRUE/FALSE). The model estimates the probability of a binary response from a mathematical equation relating the response variable to the predictor(s). The built-in glm() function in R can be used to perform logistic regression analysis.

Probability and Odds Ratio

Odds are used in logistic regression. If $p$ is the probability of success and $q=1-p$ is the probability of failure, the odds in favour of success are $\frac{p}{q}=\frac{p}{1-p}$.

Note that a probability can be converted to odds, and odds can be converted back to a probability. However, unlike a probability, odds can exceed 1. For example, if the probability of an event is 0.25, the odds in favour of that event are $\frac{0.25}{0.75}\approx 0.33$, and the odds against the same event are $\frac{0.75}{0.25}=3$.
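As a small illustration of this conversion, the sketch below defines two helper functions, prob_to_odds() and odds_to_prob(). These names are hypothetical (they are not part of base R) and are used here only to reproduce the arithmetic of the example.

```r
# Hypothetical helpers (not part of base R) for converting between
# probability and odds.
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

p <- 0.25
odds_for     <- prob_to_odds(p)        # odds in favour: 0.25 / 0.75
odds_against <- prob_to_odds(1 - p)    # odds against:   0.75 / 0.25

round(odds_for, 2)
odds_against
odds_to_prob(odds_for)                 # converts the odds back to a probability
```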

Logistic Regression Models in R (Example)

In the built-in dataset “mtcars”, the column am describes the transmission type (0 = automatic, 1 = manual), which is a binary value. Let us fit logistic regression models with “am” as the response variable and other columns, such as “vs”, “wt”, “cyl”, and “hp”, as regressors:

Logistic Regression with one Dichotomous Predictor

logmodel1 <- glm(am ~ vs, family = "binomial", data = mtcars)
summary(logmodel1)

Logistic Regression with one Continuous Predictor

If the predictor variable is continuous, the logistic regression model in R is specified as given below:

logmodel2 <- glm(am ~ wt, family = "binomial", data = mtcars)
summary(logmodel2)
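
Once such a model is fitted, predicted probabilities can be obtained with predict(). The following is a minimal sketch; the data frame new_cars and its two weight values are assumptions chosen only for illustration.

```r
# Fit the one-predictor model on the built-in mtcars data.
logmodel2 <- glm(am ~ wt, family = "binomial", data = mtcars)

# Hypothetical weights (in 1000 lbs) chosen only for illustration.
new_cars <- data.frame(wt = c(2.5, 3.5))

# type = "response" returns probabilities of am = 1 rather than log-odds.
pred_prob <- predict(logmodel2, newdata = new_cars, type = "response")
pred_prob
```

With type = "link" (the default), predict() would return the values on the log-odds scale instead.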

Multiple Predictors in Logistic Regression

The following is an example of a logistic regression model with more than one predictor. Diagnostic plots are also drawn for this model.

logmodel3 <- glm(am ~ cyl + hp + wt, family = "binomial", data = mtcars)
summary(logmodel3)
plot(logmodel3)

Note: in the logistic regression model, dichotomous and continuous variables can be used as predictors.

Figure: Logistic Regression Models in R and Diagnostic Plots

In R, the coefficients returned by a logistic regression are on the logit scale, that is, the log of the odds. To convert a logit to odds (or, for a slope coefficient, to an odds ratio), exponentiate it; to convert a logit $\beta$ to a probability, use $\frac{e^\beta}{1+e^\beta}$. For example,

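logmodel1 <- glm(am ~ vs, family = "binomial", data = mtcars)
logit_coef <- coef(logmodel1)                # coefficients on the logit (log-odds) scale
exp(logit_coef)                              # odds (intercept) and odds ratio (vs)
exp(logit_coef) / (1 + exp(logit_coef))      # inverse logit of each coefficient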
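
As a hedged check of the conversion above, the sketch below uses the base R function plogis(), which computes the inverse logit, and compares the model-based probabilities with the observed proportions of am = 1 in each level of vs. The object names model_prob and observed_prob are assumptions introduced for this illustration.

```r
logmodel1 <- glm(am ~ vs, family = "binomial", data = mtcars)
b <- coef(logmodel1)

# Log-odds of am = 1 when vs = 0 and when vs = 1.
logit_vs0 <- b[["(Intercept)"]]
logit_vs1 <- b[["(Intercept)"]] + b[["vs"]]

# plogis() is the inverse logit: exp(x) / (1 + exp(x)).
model_prob <- c(vs0 = plogis(logit_vs0), vs1 = plogis(logit_vs1))

# Observed proportion of am = 1 within each level of vs.
observed_prob <- tapply(mtcars$am, mtcars$vs, mean)

model_prob
observed_prob
```

Because the model has a single dichotomous predictor, the two sets of probabilities should agree up to numerical tolerance. Note that the probability for vs = 1 comes from the sum of the intercept and the slope, not from the slope alone.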

Generalized Linear Models (GLM) in R

Generalized linear models (GLMs) can be used when the distribution of the response variable is non-normal or when the relationship between the mean response and the predictors becomes linear only after a transformation (the link function). GLMs are flexible extensions of linear models that are used to fit regression models to non-Gaussian data.

Introduction to Generalized Linear Models

Generalized Linear Models (GLMs) in R are an extension of linear regression that allow for response variables with non-normal distributions. GLMs are used to model relationships between a dependent variable and one or more independent variables. Generalized Linear Models consist of three components:

  1. Random Component: Specifies the probability distribution of the response variable (e.g., Gaussian, Binomial, Poisson).
  2. Systematic Component: The linear predictor, which is a linear combination of the predictors (independent variables).
  3. Link Function: Connects the mean of the response variable to the linear predictor (e.g., identity, logit, log).
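
In R, these components are bundled in a family object. The sketch below inspects the binomial family with a logit link; the variable name fam is an assumption used only for this illustration.

```r
# A family object carries the distribution name, the link, and related functions.
fam <- binomial(link = "logit")

fam$family          # random component: name of the distribution
fam$link            # name of the link function
fam$linkfun(0.25)   # link g(mu) = log(mu / (1 - mu)) evaluated at mu = 0.25
fam$linkinv(0)      # inverse link: mean for a linear predictor of 0
fam$variance(0.25)  # variance function mu * (1 - mu) evaluated at mu = 0.25
```

The systematic component is not stored in the family object; it is supplied separately through the model formula.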

A regression model can be classified as either linear or non-linear.


Basic Form of a Generalized Linear Model

The basic form of a generalized linear model is
\begin{align*}
g(\mu_i) &= X_i' \beta \\
&= \beta_0 + \sum\limits_{j=1}^p x_{ij} \beta_j
\end{align*}
where $\mu_i=E(Y_i)$ is the expected value of the response variable $Y_i$ given the predictors, $g(\cdot)$ is a smooth and monotonic link function that connects $\mu_i$ to the predictors, $X_i'=(x_{i0}, x_{i1}, \cdots, x_{ip})$ is the known vector of predictor values for the $i$th observation with $x_{i0}=1$, and $\beta=(\beta_0, \beta_1, \cdots, \beta_p)'$ is the unknown vector of regression coefficients.
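
To make the notation concrete, the following sketch rebuilds the linear predictor $X_i'\beta$ and the mean $\mu_i$ by hand for a fitted model and compares them with the values R reports. The model (am ~ wt on mtcars) and the object names fit, X, beta, eta, and mu are assumptions chosen for illustration.

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)

X    <- model.matrix(fit)   # design matrix; first column is the intercept x_{i0} = 1
beta <- coef(fit)           # estimated regression coefficients

eta <- drop(X %*% beta)            # linear predictor X_i' beta
mu  <- fit$family$linkinv(eta)     # mu_i = g^{-1}(eta_i)

# Both comparisons should return TRUE.
all.equal(unname(eta), unname(predict(fit, type = "link")))
all.equal(unname(mu),  unname(fitted(fit)))
```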

Syntax of glm() Function

In R, GLMs are fitted using the glm() function. The basic syntax of glm() function is

glm(formula, family, data)
  • formula: Specifies the model (e.g., y ~ x1 + x2).
  • family: Describes the distribution and link function (e.g., gaussian(link = "identity"), binomial(link = "logit"), poisson(link = "log")).
  • data: The dataset containing the variables.
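
The family argument can be given in more than one way. The sketch below, which assumes the mtcars data and a simple am ~ wt model purely for illustration, fits the same model three times and checks that the coefficients agree.

```r
# Three equivalent ways of specifying the binomial family with its default logit link.
fit_chr  <- glm(am ~ wt, family = "binomial",              data = mtcars)
fit_fun  <- glm(am ~ wt, family = binomial,                data = mtcars)
fit_call <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

all.equal(coef(fit_chr), coef(fit_fun))
all.equal(coef(fit_chr), coef(fit_call))
```

Writing the family as a call, as in the third form, is the one that allows a non-default link to be chosen.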

Fitting Generalized Linear Models

The glm() function fits a generalized linear model using the generic form shown below. The formula argument is similar to that used in the lm() function for the linear regression model.

mod <- glm(formula, family = gaussian, data = data.frame)

The family argument is a description of the error distribution and link function to be used in the model.

The class of generalized linear models is specified by giving a symbolic description of the linear predictor and a description of the error distribution. The link functions available for each family of response distributions are given below. The family name can be passed as the family argument of the glm() function.

Link Functions for Different Families

| Family Name | Link Functions |
|---|---|
| binomial | logit, probit, cloglog |
| gaussian | identity, log, inverse |
| Gamma | identity, inverse, log |
| inverse.gaussian | $1/\mu^2$, identity, inverse, log |
| poisson | log, identity, sqrt |
| quasi | logit, probit, cloglog, identity, inverse, log, $1/\mu^2$, sqrt |
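
As an illustration of choosing a non-default link from the options listed above, the sketch below fits the same binomial model with the logit and the probit link. The model am ~ wt on mtcars and the object names are assumptions made for this example.

```r
fit_logit  <- glm(am ~ wt, family = binomial(link = "logit"),  data = mtcars)
fit_probit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)

fit_probit$family$link   # confirms which link was used

# Coefficients are on different scales, so they are not directly comparable ...
cbind(logit = coef(fit_logit), probit = coef(fit_probit))

# ... but the fitted probabilities can be compared directly.
max(abs(fitted(fit_logit) - fitted(fit_probit)))
```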

Generalized Linear Models, GLM Example in R

Consider the “cars” dataset available in R. Let us fit a generalized linear model to the data, taking the “dist” variable as the response and the “speed” variable as the predictor. Both the linear and the generalized linear models are fitted in the example below.

data(cars)
head(cars)

# Scatter plot with a smoothed trend line
scatter.smooth(x = cars$speed, y = cars$dist, main = "Dist ~ Speed")

# Linear Model
lin_mod <- lm(dist ~ speed, data = cars)
summary(lin_mod)

# Generalized Linear Model (gaussian family, identity link)
glm_mod <- glm(dist ~ speed, data = cars, family = "gaussian")
summary(glm_mod)
plot(glm_mod)   # diagnostic plots
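
With the gaussian family and the identity link, the GLM is the ordinary linear model. The following sketch checks this on the cars data; the object names lin_fit and glm_fit are assumptions used only here.

```r
lin_fit <- lm(dist ~ speed, data = cars)
glm_fit <- glm(dist ~ speed, family = gaussian(link = "identity"), data = cars)

# The estimated coefficients should coincide.
all.equal(coef(lin_fit), coef(glm_fit))

# For the gaussian family, the deviance is the residual sum of squares.
all.equal(sum(residuals(lin_fit)^2), deviance(glm_fit))
```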

Diagnostic Plots of Generalized Linear Models


Generalized Linear Models Types and Applications

| GLM Type | Response Variable | Real-Life Example |
|---|---|---|
| Logistic Regression | Binary (0/1) | Customer churn, disease diagnosis |
| Poisson Regression | Count data | Insurance claims, website visits |
| Gamma Regression | Positive, skewed continuous | Insurance claim amounts, machine failure time |
| Multinomial Regression | Multi-category | Product choice, species classification |
| Negative Binomial Regression | Overdispersed count data | Accident counts, sick days |
| Ordinal Regression | Ordered categories | Customer satisfaction, disease severity |
| Tweedie Regression | Zero-inflated continuous | Insurance claims with many zeros |
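
As one example of these types, a Poisson regression for count data can be sketched with glm(). The built-in InsectSprays dataset (insect counts under six sprays) is an assumption chosen for illustration and is not one of the applications listed above.

```r
# Poisson regression of insect counts on spray type (log link).
pois_fit <- glm(count ~ spray, family = poisson(link = "log"), data = InsectSprays)

# Exponentiated coefficients: mean count for spray A (intercept) and
# rate ratios of the other sprays relative to spray A.
exp(coef(pois_fit))

# With a single factor as predictor, fitted means match the observed group means.
fitted_means   <- tapply(fitted(pois_fit), InsectSprays$spray, mean)
observed_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
cbind(fitted = fitted_means, observed = observed_means)
```

Several of the other types in the table, such as multinomial, ordinal, negative binomial, and Tweedie regression, are not covered by the families built into glm() and typically require additional packages.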


Important R Language MCQs ggplot2 with Answers 8

The quiz “R Language MCQS ggplot2” will help you check your ability to execute some basic operations on objects in the R language, and it will also help you understand some basic concepts. This quiz may also improve your computational understanding.

Quiz about R Language

1. Let us have 1000 random samples of size 6 under SRSWOR using the following population (111, 150, 121, 198, 112, 136, 114, 129, 117, 115, 186, 110, 121, 115, 114). Which is the R command for repeating this procedure 1500 times?

2. How can sampling with and without replacement be done using R?

3. When programming in R, what is a pipe used as an alternative for?

4. For the population y<-c(1,2,3,4,5), write the R command to find the mean.

5. You are cleaning a data frame with improperly formatted column names. To clean the data frame, you want to use the clean_names() function. Which column names will be changed using clean_names() with default parameters?

6. Data analysts are working with customer information from their company’s sales data. The first and last names are in separate columns, but they want to create one column with both names instead. Which of the following functions can they use?

7. What is the class of the object defined by the expression x <- c(4,5,10)?

8. Which summary functions can you use to preview data frames in R Language?

9. Which is the R command for obtaining 1000 random numbers through normal distribution with mean 0 and variance 1?

10. Which of the following are standards of tidy data?

11. Suppose you want to simulate a coin toss 20 times in R. Write the command.

12. Which R function can be used to make changes to a data frame?

13. In ggplot2, an _____ is a visual property of an object in your plot.

14. In R the following are all atomic data types EXCEPT:

15. Why are tibbles a useful variation of data frames?

16. A data analyst is working with the penguins data. They write the following code:
penguins %>%

The variable species includes three penguin species: Adelie, Chinstrap, and Gentoo. What code chunk does the analyst add to create a data frame that only includes the Gentoo species?

17. Data analysts are cleaning their data in R. They want to be sure that their column names are unique and consistent to avoid any errors in their analysis. What R function can they use to do this automatically?

18. For the population y<-c(1,2,3,4,5), write the R command to find the median.

19. Write the R commands for generating 700 random variables from normal distribution by using the following information: Mean = 14, SD = 3, n = 5, k = 2000.

20. A data scientist is trying to print a data frame, but the console output produces too many rows and columns to be readable. What could they use instead of a data frame to make printing more readable?

