# Data Frame in R Language

In R Programming language a data frame is a two-dimensional data structure. The data frame objects contain rows and columns. The number of rows for each column should have equal length. The cross-section of the row and column can be considered as a cell. Each cell of the data frame is associated with a combination of row number and column number.

One can modify, extract, and re-arrange the data contents of a data frame; the process is called the manipulation of the data frame. To create a data frame a general syntax can be followed

### Data Frame Syntax in R

df <- data.frame(first column = c(data values separated with commas,
second column = c(data values separate with commans,
......
)

An exemplary data frame in the R language is

df = data.frame(age = c(23, 24, 25, 26, 23, 25, 29, 20),
marks = c(99, 80, 67, 56, 98, 65, 45, 77),
grade = c("A", "A", "C", "D", "A", "B", "F", "B")
)
print(df)

One can name or rename the columns and rows of the data frame

# Naming / renaming columns
colnames(df) <- c("Age", "Score", "Grad")

# Naming / renaming rows
row.names(df) <- c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th")

### Subsetting a Data Frame

The subset() method can be used to create a new data set by removing specified column(s). This splits the data frame into two sets, one with excluded colujmns and the other with included columns. To understand sbusetting a data frame, let us create a data frame first.

# creating a data frame
df = data.frame(row1 = 0:3, row2 = 3:6, row3 = 6:9)

# creating a subset
df <- subset(df, select = c(row1, row2))

Question: Data Frame in R Language

Suppose we have a frequency distribution of sales from a sample of 100 sales receipts.

Calculate mean, median, variance, standard deviation and coefficient of variation by using R code.

Solution

# Crate a data frame

df <- data.frame(lower_class = seq(0, 100, by = 20), upper_class=seq(20, 120, by=20), freq = c(16, 18, 14, 24, 20, 8))

# mid points
m <- (df["lower_class"] + df["upper_class"])/2

mf <- df["freq"] * m
mfsquare <- df["freq"] * m^2

data <- cbind(df, m, mf, mfsquare)
colnames(data) <- c("LL","UL", "freq" , "M", "mf", "mf2")

# Computation
avg = sum(data$mf)/sum(data$freq)
var = (sum(data$mf2) - sum(data$mf)^2 / sum(data$freq))/(sum(data$freq)-1)
sd = sqrt(var)
CV = sd/avg * 100

## Outputs
paste("Mean = ", round(avg, 3))
paste("Variance = ", round(var, 3))
paste("Standard Deviation = ", round(sd, 3))
paste("Coefficient of Variation = ", round(CV, 3))

### Using Logical Conditions for Selecting Rows and Columns

For selecting rows and columns using logical conditions, we consider the iris data set. Here, suppose we are interested in Selecting rows who values are higher than the median for Sepal Length and whose Petal.Width >= 1.7. In the code below, each value in Sepal.Length variable (column) is compared with the median value of Sepal.Length. Similarly, the each value of Petal.Width is compared with 1.7 to extract the required values from these to columns.

attach(iris)

iris[(Sepal.Length > median(Sepal.Length) & Petal.Width >= 1.7), ]

One can select only the numeric columns from the data frame by following the code below

# Selecting Numeric Columns only
iris[ , sapply(iris, is.numeric)]

# Selecting factor columns only
iris[, sapply(iris, is.factor)]

# Selecting only certain Species
iris[Species == "virginica", ]

### Omitting Missing Observations in a Data Frame

# Omit rows with missing data
na.omit(iris)

# check for missing data across rows
apply(iris, 2, is.na)
iris[complete.cases(iris), ]

https://itfeature.com

https://gmstat.com

## Discover more from R Frequently Asked Questions

Subscribe now to keep reading and get access to the full archive.

Continue reading