Factors in R (Categorical Data)
Factors in R are used to represent categorical data. They can be ordered or unordered. One can think of a factor as an integer vector in which each integer has a label. Modeling functions such as glm() treat factors specially. A factor stores the distinct categories of a variable as its levels, and it can be built from either character or integer values.
Using factors with labels is better than using plain integers because factors are self-describing: a variable with the values “Male” and “Female” is easier to interpret than one with the values 1 and 2.
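To make the difference between unordered and ordered factors concrete, here is a minimal sketch. The variable names and the size categories are hypothetical and chosen only for illustration.

```r
# Unordered factor: levels are sorted alphabetically by default
sizes <- factor(c("small", "large", "medium", "small"))
levels(sizes)       # "large" "medium" "small"
as.integer(sizes)   # integer codes behind the labels: 3 1 2 3

# Ordered factor: the levels carry a ranking, so comparisons are meaningful
sizes_ord <- factor(c("small", "large", "medium", "small"),
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)
is.ordered(sizes_ord)   # TRUE
sizes_ord < "large"     # TRUE FALSE TRUE TRUE
```

The integer codes show the "integer vector with labels" view of a factor, while the ordered version adds a ranking to the same labels.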
Creating a Simple Factor
The following creates a simple factor that has two levels:
# Simple factor with two levels
x <- factor(c("yes", "yes", "no", "yes", "no"))
# computes the frequency of each level
table(x)
# strips out the class, leaving the underlying integer codes
unclass(x)
The order of the levels can be set using the levels argument to factor(). This can be important in linear modeling because the first level is used as the baseline level.
x <- factor(c("yes","yes","no","yes","no"), levels = c("yes","no"))
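The effect of level order on the baseline can be sketched as follows. The response values in y are made up purely to show how the coefficient names change, and relevel() is shown as one alternative way to move a level to the front of an existing factor.

```r
# Default: levels are sorted alphabetically, so "no" is the baseline
x_default <- factor(c("yes", "yes", "no", "yes", "no"))
levels(x_default)     # "no" "yes"

# Explicit order: "yes" becomes the baseline
x_yes_first <- factor(c("yes", "yes", "no", "yes", "no"),
                      levels = c("yes", "no"))
levels(x_yes_first)   # "yes" "no"

# relevel() changes the reference level of an existing factor
x_releveled <- relevel(x_default, ref = "yes")
levels(x_releveled)   # "yes" "no"

# Hypothetical response values, used only to show the coefficient names
y <- c(5.1, 4.8, 3.2, 5.4, 3.0)
names(coef(lm(y ~ x_default)))     # "(Intercept)" "x_defaultyes"
names(coef(lm(y ~ x_yes_first)))   # "(Intercept)" "x_yes_firstno"
```

With the default treatment contrasts, the coefficient is named after the non-baseline level, so the name shows which level was used as the reference.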
The levels of a factor can be given new names using the labels argument (the abbreviation label also works through R's partial argument matching). The labels argument replaces the old values of the variable with new ones. For example,
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c(1, 2))
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c("Level-1", "Level-2"))
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c("group-1", "group-2"))
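As a small sketch of what relabeling does to the resulting factor, the following inspects the levels and values afterwards. It uses the full argument name labels, and the variable names x_group and x_num are hypothetical.

```r
x_group <- factor(c("yes", "yes", "no", "yes", "no"),
                  levels = c("yes", "no"),
                  labels = c("group-1", "group-2"))
levels(x_group)         # "group-1" "group-2"
as.character(x_group)   # "group-1" "group-1" "group-2" "group-1" "group-2"

# Numeric labels are stored as character levels, not as numbers
x_num <- factor(c("yes", "yes", "no", "yes", "no"),
                levels = c("yes", "no"),
                labels = c(1, 2))
levels(x_num)           # "1" "2"
```

Each label is matched to the level at the same position, so "yes" maps to the first label and "no" to the second.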
Suppose you have a factor variable whose levels are numeric values, and you want to compute the mean. Calling mean() on the original numeric vector returns its average, but calling mean() on the factor returns NA with a warning message. To calculate the mean of the original numeric values of the factor f, convert the factor back to numbers through its levels, using the levels() function. For example,
# vector
v <- c(10, 20, 20, 50, 10, 20, 10, 50, 20)
# vector converted to factor
f <- factor(v)
# mean of the vector
mean(v)
# mean of factor: returns NA with a warning
mean(f)
# mean of the original values, recovered through the levels
mean(as.numeric(levels(f)[f]))
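A common pitfall here is calling as.numeric() directly on the factor, which returns the internal integer codes rather than the original values. The sketch below contrasts this with two conversions that do recover the values. It reuses the same v and f as above.

```r
v <- c(10, 20, 20, 50, 10, 20, 10, 50, 20)
f <- factor(v)

# Direct conversion gives the integer codes of the levels "10", "20", "50"
as.numeric(f)                 # 1 2 2 3 1 2 1 3 2

# Two ways to get the original values back
as.numeric(as.character(f))   # 10 20 20 50 10 20 10 50 20
as.numeric(levels(f))[f]      # 10 20 20 50 10 20 10 50 20

# The mean now matches the mean of the original vector
mean(as.numeric(as.character(f))) == mean(v)   # TRUE
```

Converting the levels first and then indexing with the factor only converts each distinct level once, which can matter for long vectors with few levels.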
Use of cut( ) Function to Create a Factor Variable
The cut() function can also be used to convert a numeric variable into a factor. The breaks argument describes how ranges of numbers will be converted to factor values. If breaks is set to a single number, the resulting factor is created by dividing the range of the variable into that number of equal-length intervals. However, if a vector of values is given to breaks, the values in the vector are used as the breakpoints. The number of levels of the resulting factor will be one less than the number of values in the vector provided to the breaks argument. For example,
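# mpg column of the built-in mtcars data set
mpg <- mtcars$mpg
# three equal-length intervals
cut(mpg, breaks = 3)
# five breakpoints give four intervals
factors <- cut(mpg, breaks = c(10, 18, 25, 30, 35))
table(factors)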
You will notice that the default labels for the levels produced by cut() contain the actual ranges of values that were used to divide the variable into intervals.
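The range-style labels can be replaced, and the side on which the intervals are closed can be changed. The following is a minimal sketch using the labels and right arguments of cut(). The category names ("low", "medium", and so on) are hypothetical and chosen only for illustration.

```r
mpg <- mtcars$mpg

# Custom labels instead of the default range labels
groups <- cut(mpg, breaks = c(10, 18, 25, 30, 35),
              labels = c("low", "medium", "high", "very high"))
levels(groups)   # "low" "medium" "high" "very high"
table(groups)

# By default intervals are closed on the right: 18 falls in (10,18]
as.character(cut(c(18, 25), breaks = c(10, 18, 25, 30)))
# "(10,18]" "(18,25]"

# With right = FALSE they are closed on the left: 18 falls in [18,25)
as.character(cut(c(18, 25), breaks = c(10, 18, 25, 30), right = FALSE))
# "[18,25)" "[25,30)"
```

Values that fall outside all of the intervals become NA, so it is worth checking that the breakpoints cover the full range of the data.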