Calculating mean, median, and mode with base R
The mean, median, and mode are the most popular measures of central tendency. Each one describes where a distribution is centered. The following code block shows how to calculate the first two of them:
mean(small_sample, na.rm = T)
# outputs [1] 7.546716
mean(big_sample, na.rm = T)
# outputs [1] 9.97051
median(small_sample, na.rm = T)
# outputs [1] 8.449614
median(big_sample, na.rm = T)
# outputs [1] 9.979968
The mean() and median() functions return, respectively, the mean and the median of a set of numbers. If the set contains any NA, both functions return NA by default rather than a number. Setting na.rm = T (T is shorthand for TRUE) tells the function to remove the NAs before doing the computation.
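To make the effect of na.rm concrete, here is a minimal sketch on a small made-up vector. The name scores and its values are hypothetical and are not one of the chapter's samples.

```r
# Hypothetical toy vector (an assumption for illustration only)
scores <- c(4, 8, NA, 10)

# With the default na.rm = FALSE, a single NA propagates to the result
mean_with_na <- mean(scores)
median_with_na <- median(scores)

# With na.rm = TRUE, the NA is dropped before the computation
mean_clean <- mean(scores, na.rm = TRUE)      # (4 + 8 + 10) / 3
median_clean <- median(scores, na.rm = TRUE)  # middle value of 4, 8, 10

print(c(mean_with_na, median_with_na))
print(c(mean_clean, median_clean))
```

Spelling out TRUE instead of T is slightly safer in scripts, because T is an ordinary variable that can be reassigned while TRUE is a reserved word.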
Given that the samples come from continuous data, it's very unlikely for a single value to show up more than once, even with 100,000 observations. One or more modes are much more likely to show up if we look at rounded samples. Base R does not have a dedicated function to calculate the statistical mode (its mode() function reports an object's storage mode instead), but we can easily write one. The next code block shows how to do it:
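find_mode <- function(vals) {
  counts <- table(vals)  # frequency of each distinct value
  if (max(counts) == min(counts)) {
    # every value occurs equally often, so there is no mode
    'amodal'
  } else {
    # return every value tied for the highest frequency
    names(counts)[counts == max(counts)]
  }
}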
Modes can also be found for non-numeric distributions. A distribution has no mode if every value appears as often as every other value in the sample; such distributions are called amodal (with no mode). We can now supply our newly crafted function (find_mode) with big_sample:
find_mode(big_sample)
# outputs [1] "amodal"
find_mode(round(big_sample))
# outputs [1] "10"
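As a further illustration, the sketch below applies find_mode to a non-numeric sample and to a sample with two modes. The vectors colors, rolls, and flat are made up for this example; the function is repeated only so that the snippet runs on its own.

```r
# Same logic as find_mode in the text, repeated so the snippet is self-contained
find_mode <- function(vals) {
  counts <- table(vals)
  if (max(counts) == min(counts)) {
    'amodal'
  } else {
    names(counts)[counts == max(counts)]
  }
}

# Hypothetical categorical sample: 'blue' appears most often
colors <- c('red', 'blue', 'blue', 'green', 'red', 'blue')
color_mode <- find_mode(colors)

# Hypothetical bimodal sample: 2 and 3 are tied for the highest count
rolls <- c(1, 2, 2, 3, 3, 4)
roll_modes <- find_mode(rolls)

# Every value appears exactly once, so the sample is reported as amodal
flat <- c(5, 6, 7)
flat_mode <- find_mode(flat)

print(color_mode)
print(roll_modes)
print(flat_mode)
```

Note that the modes come back as character strings, because they are taken from the names of the frequency table. Also, a sample made of a single repeated value would be reported as 'amodal' by this function, since its maximum and minimum counts coincide.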
Even for big samples of continuous variables, there is a considerable chance of not finding a mode. It's much easier to find one or more modes in a sample of integers. These are not the only measures of central tendency available. A package called psych has functions that calculate harmonic and geometric means. The following code block demonstrates how to install psych and run the calculations:
if(!require(psych)){ install.packages('psych')}
psych::harmonic.mean(big_sample)
# outputs [1] 7.419585
psych::geometric.mean(big_sample)
# outputs [1] 8.793195
# Warning message:
# In log(x) : NaNs produced
Let me break down the preceding code block:
- if(!require(psych)){ install.packages('psych')} can be read as: try to load the psych package and, if it is not installed yet, install it
- psych::harmonic.mean(big_sample) tells R to calculate the harmonic mean of big_sample using the harmonic.mean() function of psych
- psych::geometric.mean(big_sample) asks the geometric.mean() function of psych to calculate the geometric mean of big_sample; the NaNs produced warning indicates that log() was given at least one negative value from big_sample
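For intuition about what these two means measure, the following sketch computes them in base R from their textbook definitions. It assumes strictly positive values, the vector x is hypothetical, and it is not necessarily how psych implements the functions internally.

```r
# Hypothetical strictly positive values (an assumption for illustration)
x <- c(1, 2, 4)

# Arithmetic mean, for comparison
arithmetic <- mean(x)

# Harmonic mean: reciprocal of the mean of the reciprocals
harmonic <- 1 / mean(1 / x)

# Geometric mean: exponential of the mean of the logarithms
geometric <- exp(mean(log(x)))

print(c(harmonic = harmonic, geometric = geometric, arithmetic = arithmetic))
```

For positive data, the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean. The two psych outputs above follow the same ordering.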
It is more common for R users to load the entire package first, using either library(psych) or require(psych), and only then call its functions by name (without stating which package they come from).
There are far more measures of central tendency than the five presented so far. No single measure fits every case; different situations benefit from different measures. Let's move on to the next section.