Let’s develop a simple prediction technique to identify the sex of a person based on height. The data come from 1050 participants and have the following form.
data:image/s3,"s3://crabby-images/77ef9/77ef97339f29827a796793d58fd4e6882b304b85" alt=""
The first step is to plot the heights for each sex and check their distributions.
data:image/s3,"s3://crabby-images/828b1/828b17733bd92e9daec7b954958fd4898ebbd281" alt=""
data:image/s3,"s3://crabby-images/468d4/468d43e49944fbec8b6b68274bb686ec09ab938f" alt=""
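As a rough sketch of this step, the heights can be summarised and plotted per sex with base R. The `toy` data frame below is made up for illustration and is not the 1050-participant data; the plot type (a box plot) is also an assumption, since the original figures may have been drawn differently.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  sex    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  height = c(62, 64, 66, 67, 69, 71)
)

# Mean height within each sex
mean_by_sex <- tapply(toy$height, toy$sex, mean)
print(mean_by_sex)

# Side-by-side box plots to compare the two distributions
boxplot(height ~ sex, data = toy, ylab = "Height (inches)")
```

With the real data, the same two calls would take `heights` in place of `toy`, assuming it has `sex` and `height` columns as the code in this post suggests.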
A naive way to set up the prediction is to predict male for everyone taller than 64 inches and female for everyone else.
y_hat <- ifelse(heights$height > 64, "Male", "Female")
mean(heights$sex == y_hat)
The overall accuracy is an impressive 83%.
But how well did it predict each sex individually?
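# Accuracy within each sex: compare true labels with predictions for that group only
mean(heights$sex[heights$sex == "Male"] == y_hat[heights$sex == "Male"])
mean(heights$sex[heights$sex == "Female"] == y_hat[heights$sex == "Female"])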
For males, the accuracy is about 94% and for females, it’s only 44%. The discrepancy prompts us to look at the respective number of samples in the set.
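The per-group accuracies can also be obtained in one pass, together with a cross-tabulation of predictions against the recorded sex. This is a minimal sketch on a small made-up data frame (`toy` is hypothetical and not the data used above); the rule itself is the same 64-inch cutoff.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  sex    = c("Male", "Male", "Male", "Male", "Female", "Female"),
  height = c(70, 68, 66, 63, 66, 61)
)

# Same rule as in the text: taller than 64 inches is predicted "Male"
y_hat <- ifelse(toy$height > 64, "Male", "Female")

# Cross-tabulate predictions against the recorded sex
confusion <- table(predicted = y_hat, actual = toy$sex)
print(confusion)

# Accuracy within each actual group, computed in a single call
per_group <- tapply(y_hat == toy$sex, toy$sex, mean)
print(per_group)
```

In this toy example the rule gets 3 of the 4 males and 1 of the 2 females right, so the two group accuracies differ even though a single overall number would hide that.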
sum(heights$sex == "Female")
sum(heights$sex == "Male")
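There are 238 females and 812 males. The data set is imbalanced: males make up about 77% of the sample, so the overall accuracy is dominated by how well the rule does on males.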
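To see how the group sizes drive the headline number, the overall accuracy can be rebuilt as a weighted average of the two group accuracies. The sketch below uses only the rounded figures quoted in this post, so the result is approximate; the variable names are chosen here for illustration.

```r
# Counts and per-group accuracies as quoted in the text (rounded values)
n_female   <- 238
n_male     <- 812
acc_female <- 0.44
acc_male   <- 0.94

# Overall accuracy is the average of the group accuracies, weighted by group size
overall <- (n_male * acc_male + n_female * acc_female) / (n_male + n_female)
print(round(overall, 2))

# For comparison: the accuracy of predicting "Male" for everyone
baseline <- n_male / (n_male + n_female)
print(round(baseline, 2))
```

The weighted average comes out at about 0.83, matching the overall accuracy above, while labelling everyone as male would already score about 0.77. Overall accuracy alone is therefore a weak guide on imbalanced data like this.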