Question 6.106

6.106 False-positive rate.

With the big data movement, companies are searching through thousands of variables to find patterns in the data to make better predictions on key business variables. For example, Walmart found that sales of strawberry Pop-Tarts increased significantly when the surrounding region was threatened with an impending hurricane.[22] Imagine yourself in a business analytics position at a company, trying to find variables that significantly correlate with company sales. Among the variables you are going to compare against sales are 80 variables that are truly unrelated to sales. In other words, for each of these 80 variables, the null hypothesis that the correlation between sales and the variable is 0 is true. You are unaware of this fact. Suppose that the 80 variables are independent of each other and that you perform correlation tests between sales and each of the variables at the 5% level of significance.

  1. What is the probability that you find at least one of the 80 variables to be significantly correlated with sales? This probability is referred to as a false-positive rate. If you had done only one comparison, what would be the false-positive rate?
  2. Refer to Exercise 6.103 to apply the Bonferroni procedure with α = 0.05. What is now the probability that you find at least one of the 80 variables to be significantly correlated with sales? What do you find this false-positive rate to be close to?
  3. For the significant correlations you do find in your current data, explain how you can use new data on the variables in question to feel more confident about actually using the discovered variables for company purposes.
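
The arithmetic behind parts 1 and 2 can be checked with a short Python sketch. It relies on the independence of the 80 tests stated in the exercise, so the chance of no false positives is (1 − level) raised to the number of tests. The function name `familywise_error_rate` is hypothetical, and the Bonferroni step assumes the procedure in Exercise 6.103 is the standard one that tests each variable at α divided by the number of tests.

```python
def familywise_error_rate(per_test_level: float, n_tests: int) -> float:
    """Probability of at least one false positive among n_tests
    independent tests, each run at per_test_level, when every
    null hypothesis is true."""
    # P(no false positives) = (1 - level)^n under independence.
    return 1.0 - (1.0 - per_test_level) ** n_tests


ALPHA = 0.05   # significance level given in the exercise
N_TESTS = 80   # number of truly unrelated variables

# Part 1: every test run at the 5% level.
uncorrected = familywise_error_rate(ALPHA, N_TESTS)

# Single comparison, for reference.
single = familywise_error_rate(ALPHA, 1)

# Part 2: assumed standard Bonferroni, each test run at ALPHA / N_TESTS.
bonferroni = familywise_error_rate(ALPHA / N_TESTS, N_TESTS)

if __name__ == "__main__":
    print(f"one test:          {single:.4f}")
    print(f"80 tests at 5%:    {uncorrected:.4f}")
    print(f"80 tests, Bonf.:   {bonferroni:.4f}")
```

Comparing the three printed values shows how quickly the false-positive rate grows with the number of tests and how the per-test adjustment pulls it back toward the nominal level.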