
Breast Cancer Diagnostic Data Set

August 16, 2023

The dataset, known as the Breast Cancer Wisconsin (Diagnostic) Dataset, was obtained from Kaggle. It was built by Dr Wolberg, who used fluid samples from patients with solid breast masses. Ten features of the cells are measured for each sample, and the mean value, the extreme value and the standard error of each feature are reported for the image, giving 30 variables. The ten features are:

radius
texture
perimeter
area
smoothness
compactness
concavity
concave points
symmetry
fractal dimension

The objective is to predict the outcome, the diagnosis, which takes one of two values, viz. benign (B) or malignant (M). The following plot gives an overall summary of how the various features compare between benign (B) and malignant (M) samples:

or, as a density plot:

We do a correlation plot next:

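library(corrplot)
# Correlations between the numeric features (column 1 is assumed to hold the diagnosis)
corr_mat <- cor(b_data[,2:ncol(b_data)])
corrplot(corr_mat)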


Logistic Regression

August 15, 2023

We know linear regression, which allows us to find the relationship between two variables and lets us predict a dependent variable from an independent variable.

In this example, the function associated with the red dotted line lets us estimate the fat% if a BMI value is known.

But what happens if the data looks like the following?

Here, the survey gives either a YES or NO as the answer (1 = YES, 0 = NO). The linear regression and the subsequent equation are meaningless here. In such cases, we resort to logistic regression.

The objective of the logistic regression is not to get the value of Y but the probability. E.g., if the X value is 9, there is a 50% chance of getting a YES. On the other hand, X = 2 has a higher probability of getting a NO.

The plot tells you that the data is best suited for classification. Ys with a > 50% chance of occurring will be classified in the YES category, and those with < 50% in NO.
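As a minimal sketch of the idea, the R code below fits a logistic regression to simulated YES/NO data and classifies at the 50% threshold. The data, the seed and the names `x`, `y`, `fit`, `new_x` and `prob_yes` are assumptions made for illustration; they are not the survey data behind the plot.

```r
# Simulated survey: the chance of a YES (1) rises with x and is 50% at x = 9
# (this data-generating curve is an assumption, not the original data)
set.seed(1)
x <- runif(200, min = 0, max = 18)
y <- rbinom(200, size = 1, prob = plogis(0.6 * (x - 9)))

# Logistic regression models the probability of YES, not the value of y
fit <- glm(y ~ x, family = binomial)

# Predicted probabilities of YES for a few new x values
new_x <- data.frame(x = c(2, 9, 16))
prob_yes <- predict(fit, newdata = new_x, type = "response")

# Classify with the 50% threshold
label <- ifelse(prob_yes > 0.5, "YES", "NO")
print(data.frame(new_x, prob_yes = round(prob_yes, 2), label))
```

A low x should come out as NO and a high x as YES, while values near the middle sit close to the 50% boundary.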


Logistic Regression – Cleveland Data

August 14, 2023

Let’s do a logistic regression of health data. Experiments with the Cleveland database have focused on distinguishing the presence of heart disease (values 1, 2, 3, 4) from its absence (value 0). The featured health parameters are:

Age
Sex
CP: chest pain
Trestbps: resting blood pressure (mm Hg)
Chol: serum cholesterol (mg/dl)
Fbs: fasting blood sugar > 120 mg/dl
Restecg: Rest ECG
Thalach: maximum heart rate achieved during the thallium stress test
Exang: exercise-induced angina
Oldpeak: ST depression induced by exercise relative to rest
Slope: the slope of the peak exercise ST segment
Ca: number of major vessels (0-3) coloured by fluoroscopy
Thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
Hd: diagnosis of heart disease

After cleaning up and conditioning, the data looks like this:

297 obs. of  14 variables:
 $ Age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ Sex     : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
 $ CP      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
 $ Trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ Chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ Fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ Restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ Thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ Exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
 $ Oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ Slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
 $ Ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
 $ Thal    : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
 $ Hd      : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...

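# Hd is a factor with five levels; with a binomial family, glm() treats the
# first level ("0") as failure and all other levels as success
logic <- glm(Hd ~ ., data = heart_data, family = "binomial")
# Fitted probability of heart disease next to the recorded diagnosis
predicted.data <- data.frame(Prob.HD = logic$fitted.values, HD = heart_data$Hd)
par(cex=0.8, mai=c(0.7,0.7,0.2,0.5), bg = "antiquewhite1")
# HD is a factor, so this draws one box plot of probabilities per level
plot(x = predicted.data$HD, y = predicted.data$Prob.HD)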
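The response `Hd` has five levels, while the stated goal is a two-way split between absence (0) and presence (1 to 4). The `glm` call above already behaves that way for a factor response, but the split can also be made explicit. Here is a small sketch using the first ten `Hd` values from the listing above; the names `hd` and `hd_binary` and the labels "absent" and "present" are illustrative choices.

```r
# First ten Hd values from the str() output: the integer codes
# 1 3 2 1 1 1 4 1 3 2 correspond to the levels "0", "2", "1", "0", ...
hd <- factor(c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1), levels = 0:4)

# Collapse levels 1 to 4 into a single "present" class
hd_binary <- factor(ifelse(hd == "0", "absent", "present"),
                    levels = c("absent", "present"))

print(table(hd_binary))
```

A two-level response like this also keeps plot legends and box plots down to the two classes of interest.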

An even fancier plot can be made using the following code:

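library(ggplot2)

# Refit the model and pair the fitted probabilities with the recorded diagnosis
logic <- glm(Hd ~ ., data = heart_data, family = "binomial")
predicted.data <- data.frame(Prob.HD = logic$fitted.values, HD = heart_data$Hd)
# Sort by predicted probability and store each patient's rank for the x-axis
predicted.data <- predicted.data[order(predicted.data$Prob.HD, decreasing = FALSE),]
predicted.data$Rank <- 1:nrow(predicted.data)
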
ggplot(data = predicted.data, aes(x = Rank, y = Prob.HD)) +
  geom_point(aes(color = HD), alpha = 1, shape = 4, stroke = 2) +
  xlab("Index") +
  ylab("Predicted Probability of Getting Heart Disease") +
  theme(text = element_text(color = "white"), 
        panel.background = element_rect(fill = "black"), 
        plot.background = element_rect(fill = "black"),
        panel.grid = element_blank(),
        legend.text = element_text(color = "black"),
        legend.title = element_text(color = "black"),
        axis.text = element_text(color = "white"),
        axis.ticks = element_line(color = "white"))


Shortcuts to Accidents

August 13, 2023

We saw cognitive reflection problems, where our mind (brain) wants us to lock in what it believes to be a ‘timely’ answer, which it gets via mental shortcuts. Here is one such question:

Road	Major Accidents	Minor Accidents
Road 1	2000	16
Road 2	1000	?

Fill the box with the question mark to make the accidents on the two roads equivalent.

Studies have shown that a high proportion of people answer 8. Their attempt was perhaps to maintain the same ratio (2000:16 == 1000:8). But the question was to estimate the number of minor accidents required to make a road with fewer major accidents equivalent to the one with more major accidents. Naturally, it should be much more than 1000 (the shortfall of major accidents on Road 2 vs Road 1).

Cars and workers

Another famous trick puzzle has the following form:

It takes 7 workers to make 7 cars in 7 days. How many days would it take 5 workers to make 5 cars?

Park your instinct to answer 5 (so that 5-5-5 matches 7-7-7!) for a while. Try this first:
If 7 workers can build 4 cars in 3 days, how many days would it take 8 workers to build 6 cars?
I assume more people answer the second one correctly because it shows fewer visible patterns, which may slow you down.

Answer: cars per worker per day = (4/7)/3 = 4/21. So, 8 workers can make 32/21 cars in a day. But we want 6 cars => (32/21) x X (days) = 6, so X = (21 x 6)/32 = 3.9 days.

In the same way, the first question is answered as follows:
(7/7)/7 = 1/7 car per worker per day. 5 workers can make 5/7 cars in a day. To make 5 cars, one needs (5/7) x X (days) = 5, or X = 35/5 = 7 days.
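The same rate-based reasoning can be written as a small R function. This is only a sketch; the function name `days_needed` and its argument names are made up for illustration.

```r
# Days needed for a new job, given a reference job in which
# ref_workers build ref_cars in ref_days
days_needed <- function(ref_workers, ref_cars, ref_days, workers, cars) {
  rate <- ref_cars / (ref_workers * ref_days)  # cars per worker per day
  cars / (workers * rate)
}

print(days_needed(7, 7, 7, workers = 5, cars = 5))  # first puzzle
print(days_needed(7, 4, 3, workers = 8, cars = 6))  # second puzzle
```

Working through the per-worker rate first is what removes the temptation to pattern-match on the numbers.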


Time Series Analysis – Decomposition

August 12, 2023

Here is another time series, namely, the air passengers.

A key task of time series analysis is to break down the data into signal and noise. In R, there is a function called decompose to do the job.

# AP is assumed to hold the air passengers series as a 'ts' object
decom_AP <- decompose(AP, type = "additive")
# Plot the observed data with the trend, seasonal and random components
plot(decom_AP)

Note that the data is already in a time series format. If it is a regular data frame, use the function ‘ts’ first before attempting the decompose function.

Here is the illustration – the data (blue circle), compared with the seasonality.

Here is the data compared with seasonality + trend.

And finally, data is compared with the sum of all three, seasonality + trend + random
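The conversion with ‘ts’ and the claim that the three components add back to the data can be sketched as follows. The sketch assumes the series is R's built-in `AirPassengers` data, which may differ from the file used for the plots; the data frame `df` and the name `rebuilt` are illustrative.

```r
# Assumption: the series is R's built-in AirPassengers data, stored here
# as a plain data frame to show the conversion step
df <- data.frame(passengers = as.numeric(AirPassengers))

# Monthly data starting in January 1949
AP <- ts(df$passengers, start = c(1949, 1), frequency = 12)

decom_AP <- decompose(AP, type = "additive")

# In an additive decomposition the three components add back to the data
rebuilt <- decom_AP$seasonal + decom_AP$trend + decom_AP$random
ok <- !is.na(rebuilt)  # the trend is undefined at both ends of the series
print(all.equal(as.numeric(rebuilt)[ok], as.numeric(AP)[ok]))
```

The `frequency = 12` argument tells R that one seasonal period spans twelve observations, which decompose needs to estimate the seasonal component.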


Time Series Analysis – Pollution

August 11, 2023


Time Series Analysis

August 10, 2023

A time series is data on the same entity collected at regular intervals, and the analysis of such data is time series analysis. Here, time is the independent variable (typically the X-axis), and the measured characteristic forms the dependent variable. The objective of time series analysis is to understand the pattern of changes over time and to make projections about the future.

Components of time series analysis

The long-term tendencies of the data are called trends.
The repeating feature of the pattern is called seasonality.
The repeating but non-seasonal patterns are called cycles.
The unpredictable ups and downs of the data are the last component, which is variation.
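To make the components concrete, here is a sketch that builds a synthetic monthly series from a trend, a seasonal pattern and random variation (cycles are left out for brevity). All numbers, the seed and the variable names are assumptions chosen for illustration.

```r
set.seed(42)
month <- 1:120                                 # ten years of monthly data

trend     <- 0.5 * month                       # long-term tendency
seasonal  <- 10 * sin(2 * pi * month / 12)     # pattern repeating every 12 months
variation <- rnorm(length(month), sd = 2)      # unpredictable ups and downs

# The observed series is the sum of the components
series <- ts(trend + seasonal + variation, frequency = 12)
plot(series)
```

Plotting `series` should show a rising line with a regular yearly wave and some jitter on top.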


Finite Population Correction

August 9, 2023

Finite population correction is the factor applied to reduce the standard error when the sample size is large in comparison to the total population.

If the sample size is n and the population size is N, the finite population correction factor is,

$FPC = \sqrt{\frac{N-n}{N-1}}$

To apply this correction, multiply the standard error by this factor.
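A short R sketch of the correction follows. The sample size, population size and standard deviation are hypothetical numbers, and the function name `fpc` is an illustrative choice.

```r
# Finite population correction factor for sample size n and population size N
fpc <- function(n, N) sqrt((N - n) / (N - 1))

# Hypothetical example: a sample of 100 from a population of 1000,
# with a sample standard deviation of 15
n <- 100
N <- 1000
s <- 15

se           <- s / sqrt(n)      # usual standard error of the mean
se_corrected <- se * fpc(n, N)   # standard error after the correction

print(c(fpc = fpc(n, N), se = se, se_corrected = se_corrected))
```

When n is tiny compared with N the factor is close to 1 and changes little; when the whole population is sampled (n = N) it is 0, since nothing is left to estimate.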


Surprisingly Popular

August 8, 2023

We saw Galton’s “wisdom of the crowd” before. It says that a crowd’s collective judgement is more accurate than an individual’s. The near-accurate estimate of the weight of a prize-winning ox by the common public became famous after Galton. But what happens if the masses are wrong?

This happens with questions on specialised subjects, where only a knowledgeable minority knows the answer. When such questions are asked, unsurprisingly, the wrong answer gets the majority.

Surprisingly popular algorithm

To deal with this problem, researchers from Princeton and MIT have developed a solution that involves two questions instead of one (What do they think the right answer is, and how popular do they think each answer will be?). Take this example.

1) Is Philadelphia the capital of Pennsylvania (Y/N)?
2) What do you think is the prevalent answer (Y/N)?

Philadelphia is not the capital (it’s Harrisburg), and only a minority knows that. The majority will say YES to the first question, and most of them will also expect others to say YES. The minority will answer NO but, knowing that this is specialised information, they too expect most others to say YES. Thus, the share of YES will be higher, and the share of NO lower, in the second question than in the first.

For each answer, take the share it gets in the first question minus the share predicted for it in the ‘popular’ question. For YES, the difference is negative (first YES < second YES); for NO, it is positive (first NO > second NO). Therefore, NO is surprisingly popular and is the correct answer.
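The difference calculation can be sketched in R as below. The vote shares are hypothetical numbers invented for illustration, not results from the Princeton and MIT study, and the variable names are arbitrary.

```r
# Hypothetical shares for the Philadelphia question
actual    <- c(YES = 0.65, NO = 0.35)  # question 1: what people answered
predicted <- c(YES = 0.75, NO = 0.25)  # question 2: what people expected others to answer

# An answer is "surprisingly popular" when it gets more support than predicted
surprise <- actual - predicted
answer   <- names(which.max(surprise))

print(surprise)
print(answer)
```

With these shares, YES gets fewer votes than predicted and NO gets more, so NO is selected even though it is the minority answer.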

Surprisingly Popular: Princeton University


Binomial Probability Calculator

August 7, 2023

I found this cool Binomial Probability Calculator from Stat Trek. Plug in the probability of success, the number of trials and the number of successes, and you get a set of probabilities ranging from exact to cumulative.

Here is one problem to try: In a city, it has been estimated that the probability of drivers not wearing seat belts is 10% and driving under the influence of alcohol is 5%. If the police check five people at random, what is the probability of catching at least one person who has committed at least one offence?

The first step is to estimate the probability of success in a single trial (person). Assuming the two offences are independent, the probability of not wearing a seat belt (SB) or drinking and driving (DD) is P(SB U DD) = P(SB) + P(DD) - P(SB & DD) = 0.1 + 0.05 - 0.1 x 0.05 = 0.145. The rest is simple: # trials = 5; # successes (x) = 1.

The answer we are looking for is the probability of at least one person having committed an offence, which is P(X ≥ x) = 0.543 (the last entry in the results).

Now, try this one: The probability of failure (on demand) for a safety instrument is 1 in 10000. A plant has 1000 such instruments. What is the chance that there is at least one (x = 1) failed instrument? The answer is P(X ≥ x) = 0.095, or about 10%.
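Both answers can be cross-checked in R with `pbinom`, the binomial cumulative distribution function, using P(X ≥ 1) = 1 - P(X = 0). This is a sketch rather than a replacement for the calculator, and the variable names are illustrative.

```r
# Probability that one driver has committed at least one offence,
# assuming the two offences are independent
p_offence <- 0.1 + 0.05 - 0.1 * 0.05

# P(X >= 1) = 1 - P(X = 0) for five drivers checked at random
p_drivers <- 1 - pbinom(0, size = 5, prob = p_offence)

# Same idea for 1000 instruments, each failing with probability 1 in 10000
p_instruments <- 1 - pbinom(0, size = 1000, prob = 1 / 10000)

print(round(c(p_offence, p_drivers, p_instruments), 3))
```

The rounded values should agree with the 0.145, 0.543 and 0.095 quoted in the text.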

References

Binomial probability calculator: Stat Trek

Binomial Distribution Word Problems: superprof
