Accuracy and Asymmetry

Let’s develop a simple prediction technique to identify the sex of a person based on height. The data come from 1050 participants and have the following form.

The first step is to plot the heights for each sex and check their distributions.
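
A minimal sketch of such a check, using a small hypothetical data frame `toy` in place of the real data (the values are made up purely for illustration, and the column names are assumed to match `heights`):

```r
# Hypothetical toy data standing in for the real `heights` data frame
toy <- data.frame(
  sex = factor(c("Female", "Female", "Female", "Male", "Male", "Male", "Male")),
  height = c(60, 63, 66, 65, 68, 70, 72)
)

# Mean and standard deviation of height within each sex
means <- tapply(toy$height, toy$sex, mean)
sds <- tapply(toy$height, toy$sex, sd)
print(means)
print(sds)

# Side-by-side boxplots; drawn only in an interactive session
# so the sketch also runs in batch mode
if (interactive()) {
  boxplot(height ~ sex, data = toy, ylab = "Height (inches)")
}
```

If the two distributions overlap heavily, no single height cutoff can separate the sexes cleanly.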

A naive way to set up the prediction is to classify everyone taller than 64 inches as male and everyone else as female.

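# Predict "Male" for heights above 64 inches, "Female" otherwise
y_hat <- ifelse(heights$height > 64, "Male", "Female")
# Overall accuracy: proportion of predictions that match the recorded sex
mean(heights$sex == y_hat)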

The overall accuracy is an impressive 83%.

But how well does it predict for each sex separately?

# Accuracy among the true males
mean(heights$sex[heights$sex == "Male"] == y_hat[heights$sex == "Male"])
# Accuracy among the true females
mean(heights$sex[heights$sex == "Female"] == y_hat[heights$sex == "Female"])
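
The same per-sex accuracies can also be read off a confusion matrix. The following is a minimal sketch on hypothetical toy vectors (`actual`, `height`, and `predicted` are made-up illustrations, not the real data), using only base R:

```r
# Hypothetical toy labels and heights (not the real data)
actual <- factor(c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
                 levels = c("Female", "Male"))
height <- c(60, 63, 66, 65, 68, 70, 72)

# Same naive rule as in the text: taller than 64 inches means "Male"
predicted <- factor(ifelse(height > 64, "Male", "Female"),
                    levels = c("Female", "Male"))

# Rows are predictions, columns are the true labels
cm <- table(predicted = predicted, actual = actual)
print(cm)

# Per-class accuracy: correct predictions in each column over the column total
per_class <- diag(cm) / colSums(cm)
# Overall accuracy: all correct predictions over all cases
overall <- sum(diag(cm)) / sum(cm)
print(per_class)
print(overall)
```

Fixing the factor levels on both vectors keeps the table square even if the rule never predicts one of the classes.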

On the heights data, the accuracy for males is about 94%, while for females it is only 44%. The discrepancy prompts us to look at the number of samples of each sex in the data set.

# Count the participants of each sex
sum(heights$sex == "Female")
sum(heights$sex == "Male")

There are 238 females and 812 males, so males make up about 77% of the sample and dominate the overall accuracy.
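
The imbalance explains why the overall figure sits so close to the male accuracy: overall accuracy is the average of the per-sex accuracies, weighted by group size. A quick check using the rounded percentages quoted above (so the result is approximate; the variable names are just for illustration):

```r
# Group sizes reported in the text
n_male <- 812
n_female <- 238
n_total <- n_male + n_female

# Per-sex accuracies, rounded as quoted in the text
acc_male <- 0.94
acc_female <- 0.44

# Overall accuracy is the size-weighted average of the per-sex accuracies
overall <- (acc_male * n_male + acc_female * n_female) / n_total
round(overall, 2)
# → 0.83
```

Because males are more than three quarters of the sample, the poor performance on females barely dents the headline number.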