Let’s develop a simple prediction technique to identify the sex of a person based on height. The data come from 1050 participants, each recorded with a sex and a height in inches.
The first step is to plot the heights for each sex and check their distributions.
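As a rough sketch of that first look, the snippet below loads the data and draws one boxplot per sex. It assumes that `heights` is the data set shipped with the `dslabs` package, which the text does not state explicitly; if your `heights` comes from elsewhere, skip the first two lines.

```r
# Assumption: `heights` is the data set from the dslabs package
library(dslabs)
data(heights)

# Inspect the structure: a `sex` factor and a numeric `height` column
str(heights)

# Side-by-side boxplots of height for each sex (base R graphics)
boxplot(height ~ sex, data = heights, ylab = "Height (inches)")
```

Boxplots are only one option here; overlaid histograms or density curves would show the overlap between the two groups just as well.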
A naive way to set up the prediction is to predict male for everyone taller than 64 inches and female for everyone else.
y_hat <- ifelse(heights$height > 64, "Male", "Female")
mean(heights$sex == y_hat)
The overall accuracy is an impressive 83%.
But how well does it predict each sex individually?
# Accuracy among the true males, then among the true females
mean(heights$sex[heights$sex == "Male"] == y_hat[heights$sex == "Male"])
mean(heights$sex[heights$sex == "Female"] == y_hat[heights$sex == "Female"])
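The same per-sex accuracies can be obtained in one call by grouping on the true label. The sketch below is one possible way to do it; `accuracy_by_class` is a hypothetical helper, not part of the original analysis, and the two short vectors are made-up toy values used only to show the mechanics.

```r
# Hypothetical helper: share of correct predictions within each true class
accuracy_by_class <- function(actual, predicted) {
  tapply(predicted == actual, actual, mean)
}

# Made-up toy vectors, only to illustrate the computation
actual    <- c("Male", "Male", "Male", "Female", "Female")
predicted <- c("Male", "Male", "Female", "Male", "Female")

# Confusion table: rows are predictions, columns are true labels
table(predicted = predicted, actual = actual)

# Per-class accuracy: 1 of 2 females and 2 of 3 males are predicted correctly
accuracy_by_class(actual, predicted)
```

Applied to the objects used here, `accuracy_by_class(heights$sex, y_hat)` should give the same two numbers as the two `mean()` calls above.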
For males, the accuracy is about 94%, but for females it is only 44%. This discrepancy prompts us to look at how many samples of each sex are in the set.
length(heights$sex[heights$sex == "Female"])
length(heights$sex[heights$sex == "Male"])
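There are 238 females and 812 males. Because males make up about 77% of the sample, the overall accuracy is dominated by the high accuracy for males.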
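To see how the group sizes drive the headline number, the overall accuracy can be rebuilt as a weighted average of the two per-sex accuracies. This is a small arithmetic sketch that uses only the counts and the rounded percentages quoted above, so the result is approximate.

```r
# Counts and per-sex accuracies as reported in the text (accuracies are rounded)
n_male     <- 812
n_female   <- 238
acc_male   <- 0.94
acc_female <- 0.44

# Share of males in the sample
n_male / (n_male + n_female)   # about 0.77

# Overall accuracy = total correct / total = size-weighted average of the two
overall <- (n_male * acc_male + n_female * acc_female) / (n_male + n_female)
round(overall, 2)              # 0.83
```

The weighted average lands on the 83% reported earlier, which shows that a high overall accuracy can hide poor performance on the smaller group.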