Grouping functions (tapply, by, aggregate) and the *apply family

The *apply Family in R

R's *apply family of functions applies a function across the elements of a data structure without an explicit loop. Each member takes a different kind of input and returns a different kind of result, so it is easy to pick the wrong one. In this article, we will explore sapply, lapply, apply, tapply, and by, and discuss when to use each.

sapply and lapply

The sapply and lapply functions are two of the most commonly used functions in the *apply family. Both functions are used to apply a function to elements of a vector or list and return the results. Let's take a closer look at each one:

sapply

The sapply function is used when we want to apply a function to each element of a vector or list and get a simplified output. The syntax for sapply is:

            
                sapply(X, FUN)
            
        

Where X is the vector or list and FUN is the function to be applied. The shape of the output depends on what FUN returns. If each call returns a single value, sapply simplifies the result to a vector. If each call returns a vector of the same length (greater than one), sapply returns a matrix with one column per element of X. If the lengths differ, the result cannot be simplified and sapply returns a list, just as lapply would.

Example:

            
                # Applying the sqrt function to each element of a vector
                vec <- c(1, 4, 9)
                sapply(vec, sqrt)
                # Output: 1 2 3
            
        

In this example, the sqrt function is applied to each element of the vector "vec", resulting in the square root of each element as the output.
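The simplification rules described above can be seen in a short sketch. The anonymous function and the names "root" and "square" are illustrative choices, not part of the original example:

```r
vec <- c(1, 4, 9)

# FUN returns two values per element, so sapply builds a matrix
# with one column per element of vec and one row per returned value
res <- sapply(vec, function(x) c(root = sqrt(x), square = x^2))
dim(res)        # 2 3
rownames(res)   # "root" "square"

# FUN returns vectors of different lengths, so no simplification
# is possible and sapply falls back to a list
ragged <- sapply(1:3, seq_len)
is.list(ragged) # TRUE
```

Because the return type can change with the data, lapply is often the safer choice inside functions where a predictable result type matters.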

lapply

The lapply function is similar to sapply, but it returns a list instead of simplifying the output. The syntax for lapply is:

            
                lapply(X, FUN)
            
        

The "X" parameter can be a vector, list, or data frame (for a data frame, the function is applied to each column), and the "FUN" parameter is the function to be applied. The output of lapply is always a list of the same length as "X", with each element holding the result of applying the function to the corresponding element of "X".

Example:

            
                # Applying the length function to each element of a list
                lst <- list(a = c(1, 2), b = c(3, 4, 5))
                lapply(lst, length)
                # Output: a list with elements $a = 2 and $b = 3
            
        

In this example, the length function is applied to each element of the list "lst", resulting in a list with two elements. The first element represents the length of the vector "a", while the second element represents the length of the vector "b".
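Since a data frame is a list of columns, lapply can also be used column by column. The following is a minimal sketch with a made-up data frame (the column names "x" and "y" are assumptions for illustration):

```r
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# lapply visits each column and returns a list named after the columns
col_means <- lapply(df, mean)
is.list(col_means)   # TRUE
names(col_means)     # "x" "y"

# unlist flattens the list into a named numeric vector when needed
flat <- unlist(col_means)
flat[["x"]]          # 2
flat[["y"]]          # 20
```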

apply

The apply function is used when we want to apply a function to either the rows or columns of a matrix or data frame. The syntax for apply is:

            
                apply(X, MARGIN, FUN)
            
        

The "X" parameter can be a matrix or a data frame (a data frame is first converted to a matrix, so its columns should share one type), "MARGIN" specifies whether the function is applied to the rows (MARGIN = 1) or columns (MARGIN = 2), and "FUN" is the function to be applied. When FUN returns a single value, apply returns a vector with one element per row or column; when FUN returns a vector of fixed length greater than one, apply returns a matrix.

Example:

            
                # Applying the mean function to each row of a matrix
                # matrix() fills column by column, so the rows are (1, 4), (2, 5), (3, 6)
                mat <- matrix(1:6, ncol = 2)
                apply(mat, 1, mean)
                # Output: 2.5 3.5 4.5
            
        

In this example, the mean function is applied to each row of the matrix "mat", resulting in a vector with three elements. Each element represents the mean of the corresponding row.
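For comparison, here is a small sketch of the same matrix processed by column, and of a function that returns more than one value per row. The variable names are illustrative:

```r
mat <- matrix(1:6, ncol = 2)   # columns are (1, 2, 3) and (4, 5, 6)

# MARGIN = 2 applies the function to each column
col_means <- apply(mat, 2, mean)
col_means                      # 2 5

# range() returns two values per row, so the result is a matrix;
# each row's result is stored as a column of the output
row_ranges <- apply(mat, 1, range)
dim(row_ranges)                # 2 3
row_ranges[, 1]                # 1 4 (min and max of the first row)
```

Note that the result of the row-wise call comes back with one column per row of the input, so t() may be needed to restore the original orientation.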

tapply

The tapply function is used when we want to apply a function to subsets of a vector, considering one or more grouping factors. The syntax for tapply is:

            
                tapply(X, INDEX, FUN)
            
        

The "X" parameter is the vector to be grouped and analyzed, "INDEX" is a grouping factor of the same length as "X" (or a list of such factors), and "FUN" is the function to be applied. When FUN returns a single value, tapply returns an array with one dimension per grouping factor, where each element is the result for one combination of factor levels.

Example:

            
                # Applying the sum function to subsets of a vector based on a grouping factor
                vec <- c(1, 2, 3, 4)
                group <- c("A", "A", "B", "B")
                tapply(vec, group, sum)
                # Output: A  B
                #          3  7
            
        

In this example, the sum function is applied to subsets of the vector "vec" defined by the grouping variable "group". The output is a one-dimensional array with two named elements, one per group: 3 for group A (1 + 2) and 7 for group B (3 + 4).
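With more than one grouping factor, the factors are passed as a list and the result gains one dimension per factor. The sketch below uses made-up data; the second factor "type" is an assumption added for illustration:

```r
value <- c(1, 2, 3, 4, 5, 6)
group <- c("A", "A", "B", "B", "A", "B")
type  <- c("x", "y", "x", "y", "x", "y")

# Two grouping factors give a 2 x 2 matrix: rows are groups, columns are types
totals <- tapply(value, list(group, type), sum)
dim(totals)        # 2 2
totals["A", "x"]   # 6  (1 + 5)
totals["B", "y"]   # 10 (4 + 6)
```

Combinations of levels that never occur in the data would appear as NA in the result.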

by

The by function is used when we want to apply a function to subsets of a data frame, considering one or more grouping factors. The syntax for by is:

            
                by(dataframe, INDICES, FUN)
            
        

The "dataframe" parameter is the data frame to be split, "INDICES" specifies the grouping factors (one or more), and "FUN" is the function to be applied. Unlike tapply, by passes each subset to FUN as a whole data frame rather than column by column, so FUN must accept a data frame. The result is an object of class "by", which prints each group's label followed by the value FUN returned for that group.

Example:

            
                # Computing column means for subsets of a data frame based on a grouping factor
                df <- data.frame(
                  group = c("A", "A", "B", "B"),
                  value1 = c(1, 2, 3, 4),
                  value2 = c(5, 6, 7, 8)
                )
                # FUN receives a data frame, so use colMeans; mean() on a data frame returns NA
                by(df[, c("value1", "value2")], df$group, colMeans)
                # Output (abridged): df$group: A
                #   value1 1.5
                #   value2 5.5
                #
                # df$group: B
                #   value1 3.5
                #   value2 7.5
            
        

In this example, colMeans is applied to subsets of the data frame "df" based on the grouping factor "group". The output is displayed in a user-friendly format with the grouping information and the mean of each column for each group.
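Because by hands each subset to the function as a data frame, the function can combine several columns at once. The sketch below illustrates this with an arbitrary weighted sum, and also shows aggregate (named in the title) producing per-group means as an ordinary data frame. The anonymous function and variable names are illustrative assumptions:

```r
df <- data.frame(
  group = c("A", "A", "B", "B"),
  value1 = c(1, 2, 3, 4),
  value2 = c(5, 6, 7, 8)
)

# FUN sees the whole subset "d", so it can use more than one column
totals <- by(df, df$group, function(d) sum(d$value1 * d$value2))
as.numeric(totals)   # 17 53  (1*5 + 2*6 and 3*7 + 4*8)

# aggregate computes a summary per group and returns a plain data frame
agg <- aggregate(cbind(value1, value2) ~ group, data = df, FUN = mean)
agg$value1           # 1.5 3.5
agg$value2           # 5.5 7.5
```

A data frame result, as returned by aggregate, is usually easier to pass on to further processing than the printed-report style object that by returns.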

plyr and reshape

The plyr and reshape packages in R provide advanced tools for handling data manipulation and transformations. Both packages offer functions that can be used to replace or enhance the functionality of the *apply family of functions. However, mastering plyr and reshape is beyond the scope of this article. It is worth exploring these packages if you frequently encounter complex data manipulation tasks.

Conclusion

In this article, we covered the basics of the *apply family of functions in R, including sapply, lapply, apply, tapply, and by. We discussed the syntax, purpose, and output of each function, with examples demonstrating their usage. The *apply family provides a powerful set of tools for applying functions to vectors, lists, matrices, and data frames, allowing for efficient data manipulation and analysis. And while plyr and reshape offer additional functionality for advanced data transformations, understanding and utilizing the *apply family is essential for any R programmer.