Calculating mean, median, and mode with base R

Together, the mean, median, and mode are the most popular measures of central tendency. They indicate where a distribution is centered. The following code block shows how to calculate the first two:

mean(small_sample, na.rm = T)
# outputs [1] 7.546716
mean(big_sample, na.rm = T)
# outputs [1] 9.97051
median(small_sample, na.rm = T)
# outputs [1] 8.449614
median(big_sample, na.rm = T)
# outputs [1] 9.979968

To keep it simple, the arithmetic mean is the sum of all values divided by the number of observations. The median is the middle observation of a sorted sample (or the average of the two middle observations when the sample size is even), and the mode is the value (or values) that occurs most frequently in the dataset, if there is one.
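To make these definitions concrete, here is a minimal sketch that computes all three measures by hand on a small made-up vector and compares the first two with mean() and median(). The vector x and the manual_* names are assumptions chosen only for illustration; they are not the samples used in this chapter.

```r
# Hypothetical toy data, chosen only for illustration
x <- c(2, 3, 3, 7, 10)

# Arithmetic mean: sum of all values divided by the number of observations
manual_mean <- sum(x) / length(x)

# Median: middle observation of the sorted sample (the length is odd here)
sorted_x <- sort(x)
manual_median <- sorted_x[(length(x) + 1) / 2]

# Mode: the most frequent value, read from a frequency table
freq <- table(x)
manual_mode <- names(freq)[freq == max(freq)]

manual_mean == mean(x)       # TRUE, both are 5
manual_median == median(x)   # TRUE, both are 3
manual_mode                  # "3"
```

Note that the mode comes back as a character string because it is read from the names of the frequency table.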

The mean() and median() functions return the mean and median of a set of numbers, respectively. If your set contains any NA and you still want to compute the mean or median, the na.rm = T argument (T is shorthand for TRUE) tells the function to remove the NAs before doing the computation. Without it, a single NA makes the result NA.

Skip the na.rm = T argument if your data is not supposed to have any NAs. The function will then return NA if any missing value is found, and you will notice that something may have gone wrong.
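A quick way to see how missing values affect these functions is to try them on a small vector. The following is a minimal sketch; the vector with_na is a made-up example, not one of the chapter's samples.

```r
# Hypothetical vector with one missing value
with_na <- c(4, 8, NA, 6)

mean(with_na)                   # NA: the missing value propagates
median(with_na)                 # NA

mean(with_na, na.rm = TRUE)     # 6, computed from 4, 8, and 6
median(with_na, na.rm = TRUE)   # 6

# Explicit checks for missing values
anyNA(with_na)                  # TRUE
sum(is.na(with_na))             # 1
```

Checking with anyNA() before summarizing makes the presence of missing values explicit, rather than leaving it to be inferred from an NA result.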

Given that the sample comes from continuous data, even with 100,000 observations, it's very unlikely for a single value to show up more than once. One or more modes are much more likely to show up if we look at rounded samples. Base R does not have a dedicated function to calculate the statistical mode (the built-in mode() function returns an object's storage mode instead), but we can easily write one. The next code block shows how to do it:

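find_mode <- function(vals) {
  # Build the frequency table once and reuse it
  counts <- table(vals)
  if (max(counts) == min(counts)) {
    # Every value is equally frequent, so there is no mode
    'amodal'
  } else {
    # Return every value tied for the highest count (as character)
    names(counts)[counts == max(counts)]
  }
}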

Modes can also be found for non-numeric distributions. A distribution has no mode when every value occurs as often as every other value in the sample. Such distributions are called amodal. We can now call our newly written function, find_mode, on big_sample:

find_mode(big_sample)
# outputs [1] "amodal"
find_mode(round(big_sample))
# outputs [1] "10"
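Because find_mode relies only on table(), it should also work on non-numeric data. The following minimal sketch repeats the function (with the frequency table computed once) so that it runs on its own; the vectors fav_colors and tied_letters are made-up examples.

```r
# Copy of find_mode so this block is self-contained
find_mode <- function(vals) {
  counts <- table(vals)
  if (max(counts) == min(counts)) {
    'amodal'
  } else {
    names(counts)[counts == max(counts)]
  }
}

# Hypothetical character data with a single mode
fav_colors <- c('red', 'blue', 'red', 'green', 'blue', 'red')
find_mode(fav_colors)          # "red"

# Two values tied for the highest count: both are returned
tied_letters <- c('a', 'b', 'a', 'b', 'c')
find_mode(tied_letters)        # "a" "b"

# Every value appears exactly once: no mode
find_mode(c('x', 'y', 'z'))    # "amodal"
```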

Even for big samples of continuous variables, there is a considerable chance of not finding a mode. It's much easier to find one or more modes in a sample of integers. These are not the only central tendency measures available. A package called psych has functions that calculate harmonic and geometric means. The following code block demonstrates how to install psych and run the calculations:

if(!require(psych)){ install.packages('psych')}
psych::harmonic.mean(big_sample)
# outputs [1] 7.419585
psych::geometric.mean(big_sample)
# outputs [1] 8.793195
# Warning message:
# In log(x) : NaNs produced

Let me break down the preceding code block:

if(!require(psych)){ install.packages('psych')} can be read as: try to load the psych package and, if that fails because it is not installed yet, install it
psych::harmonic.mean(big_sample) tells R to calculate the harmonic mean from big_sample using the harmonic.mean() function of psych
psych::geometric.mean(big_sample) asks for the geometric.mean() function of psych to calculate the geometric mean from big_sample
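For reference, both means follow directly from their textbook definitions and can be sketched in base R. The helper names harmonic_mean_base and geometric_mean_base and the vector pos_vals are assumptions made for this illustration; they are not part of psych, and this is not a claim about how psych implements its functions. The sketch assumes strictly positive values: log() returns NaN for negative inputs, which is consistent with the warning shown in the output above.

```r
# Hypothetical strictly positive data
pos_vals <- c(1, 2, 4)

# Harmonic mean: reciprocal of the mean of the reciprocals
harmonic_mean_base <- function(x) 1 / mean(1 / x)

# Geometric mean: exponential of the mean of the logarithms
geometric_mean_base <- function(x) exp(mean(log(x)))

harmonic_mean_base(pos_vals)    # 12/7, about 1.714286
geometric_mean_base(pos_vals)   # about 2, the cube root of 1 * 2 * 4
mean(pos_vals)                  # 7/3, about 2.333333
```

For positive data, the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean. The outputs for big_sample shown earlier follow the same ordering.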

It is more common for R users to load the entire package using either library(psych) or require(psych) and only then call functions by name (without saying which package they come from).

Using library() or require() to load packages will spare you some typing while making your code cleaner. On the other hand, calling a function as <package name>::<function> makes your code more verbose but more explicit about what is being done, while also avoiding possible naming conflicts.
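The naming-conflict point can be illustrated without installing anything. In this minimal sketch, a user-defined function deliberately reuses the name median (a contrived assumption for illustration, as is the scores vector); the plain call then picks up the user's version, while stats::median still reaches the original.

```r
# Hypothetical data
scores <- c(1, 2, 3, 4, 100)

# A contrived user-defined function that masks stats::median
median <- function(x) 'not the real median'

median(scores)          # "not the real median"
stats::median(scores)   # 3, the explicit call is unaffected

# Remove the masking definition to restore the normal behavior
rm(median)
median(scores)          # 3
```

The same mechanism applies when two loaded packages export functions with the same name: the <package name>::<function> form states which one is meant.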

There are far more central tendency measures than the five presented so far. There is no one-size-fits-all measure; different situations benefit from different measures. Let's move on to the next section.