Problem 56


Explain what's wrong with the way regression is used in each of the following examples:

a. Winning times in the Boston marathon (at www.bostonmarathon.org) have followed a straight-line decreasing trend from 160 minutes in 1927 (when the race was first run at the Olympic distance of about 26 miles) to 128 minutes in 2014. After fitting a regression line to the winning times, you use the equation to predict that the winning time in the year 2300 will be about 13 minutes.

b. Using data for several cities on \(x=\) percentage of residents with a college education and \(y=\) median price of home, you get a strong positive correlation. You conclude that having a college education causes you to be more likely to buy an expensive house.

c. A regression between \(x=\) number of years of education and \(y=\) annual income for 100 people shows a modest positive trend, except for one person who dropped out after 10th grade but is now a multimillionaire. It's wrong to ignore any of the data, so we should report all results including this point. For this data, the correlation \(r=-0.28\).

Short Answer

Regression is misused in three ways: (a) extrapolating far beyond the observed data, (b) mistaking correlation for causation, and (c) letting a single outlier distort the results without comment.

Step by step solution

01

Understanding Extrapolation in Regression

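In example (a), predicting the winning time for the year 2300 extrapolates 286 years beyond data that span only 1927 to 2014, and it assumes the linear trend will continue indefinitely. That assumption ignores physiological limits on human speed: a 13-minute marathon would mean covering about 26 miles at roughly 120 miles per hour. Extended far enough, any decreasing line eventually predicts a negative winning time, which is impossible. The prediction is an unrealistic extrapolation far outside the range of the data.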
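
The effect of extending a line far past the data can be sketched in a few lines of Python. This is only an illustration under a stated assumption: it draws the line through the two endpoint values quoted in the problem (160 minutes in 1927, 128 minutes in 2014) instead of fitting all the race results, so its prediction for 2300 (roughly 23 minutes) differs from the 13 minutes given by the problem's fitted line. The helper names `line_through` and `predict` are made up for this sketch.

```python
# Sketch: extrapolating a straight line far beyond the observed years.
# Assumption: the line passes through the two endpoint values quoted in
# the problem; it is NOT the least-squares fit to all winning times.

def line_through(x0, y0, x1, y1):
    """Return (intercept, slope) of the line through two points."""
    slope = (y1 - y0) / (x1 - x0)
    intercept = y0 - slope * x0
    return intercept, slope

def predict(intercept, slope, x):
    """Predicted value on the line at x."""
    return intercept + slope * x

intercept, slope = line_through(1927, 160.0, 2014, 128.0)

# Inside the data range the line is reasonable; far outside it is not.
for year in (1927, 2014, 2300, 2400):
    print(year, round(predict(intercept, slope, year), 1))

# Year at which this line predicts a winning time of zero minutes.
zero_year = -intercept / slope
print("line reaches 0 minutes around year", round(zero_year))
```

Under this assumption the line gives a negative winning time by 2400, which shows that the equation stops describing reality well before that point.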
02

Distinguishing Correlation from Causation

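In example (b), the mistake is concluding causation from correlation. A positive correlation between the percentage of college-educated residents and the median home price does not imply that a person's education causes them to buy a more expensive home. Other factors, such as income or the local job market, may drive both variables. In addition, the data describe cities, not individuals, so they cannot by themselves support a conclusion about what any one person does.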
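
A small simulation can illustrate how a lurking variable produces a strong correlation without any direct causal link. This is a sketch only: the 200 cities, the income range, and every coefficient below are invented for illustration and do not come from the problem. Income drives both variables, and education has no direct effect on price in the simulated data.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(xs, ys):
    """Residuals of ys after fitting a least-squares line on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical cities: median income (in $1000s) is the lurking variable.
income = [rng.uniform(40, 120) for _ in range(200)]
pct_college = [0.4 * inc + rng.gauss(0, 5) for inc in income]
home_price = [4.0 * inc + rng.gauss(0, 40) for inc in income]

# Raw correlation between education share and home price.
r_raw = pearson_r(pct_college, home_price)

# Correlation after removing the part of each variable explained by income.
r_adjusted = pearson_r(residuals(income, pct_college),
                       residuals(income, home_price))

print("raw correlation:          ", round(r_raw, 2))
print("after adjusting for income:", round(r_adjusted, 2))
```

In this setup the raw correlation should come out strongly positive while the income-adjusted correlation should be close to zero, which is the pattern a lurking variable produces.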
03

Handling Outliers in Regression Analysis

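In example (c), a single outlier (the multimillionaire who left school after 10th grade) distorts the result: the other 99 people show a modest positive trend, yet the correlation for all 100 is \(r=-0.28\), which suggests a negative association. Outliers should not simply be discarded, but their influence must be examined. Reporting only the result that includes this point, without context, is misleading; a better approach is to report the analysis both with and without the outlier and explain the difference.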
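
The way one extreme point can reverse the sign of a correlation can be sketched as follows. The data are entirely hypothetical (99 simulated people plus one invented outlier with a $5 million income), so the sketch does not reproduce the \(r=-0.28\) from the problem; it only shows the sign flip.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical data for 99 people: 12 to 20 years of education, with
# income (in $1000s) rising modestly with education plus random noise.
years = [12 + i % 9 for i in range(99)]
income = [20 + 3 * y + rng.gauss(0, 12) for y in years]

r_without = pearson_r(years, income)

# Add one person with 10 years of education and a $5 million income.
years_all = years + [10]
income_all = income + [5000.0]

r_with = pearson_r(years_all, income_all)

print("r without the outlier:", round(r_without, 2))
print("r with the outlier:   ", round(r_with, 2))
```

Printing both values side by side mirrors the recommended practice of reporting results with and without the outlier.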


Key Concepts

These are the key concepts you need to understand to accurately answer the question.

Extrapolation in Regression
Extrapolation in regression means predicting values outside the range of the observed data from a trend seen within that range. It is like guessing next month's weather from this week's data. A line that describes the data well inside the observed range (interpolation) carries no guarantee beyond it. When you fit a regression line to the Boston marathon winning times and extend it to the year 2300, you assume that the conditions behind the trend will stay unchanged for nearly three centuries. Real conditions change in ways historical data cannot capture: physiological limits on running speed, training methods, environmental factors, and technology all shape winning times, so a predicted 13-minute marathon in 2300 is unrealistic. When using regression, always check whether a prediction is physically plausible, especially when it involves extrapolation.
Correlation vs Causation
It’s easy to confuse correlation with causation. Correlation means that two variables tend to move together; it does not mean that one causes the other. For instance, cities with a higher percentage of college-educated residents may also have higher median home prices, a strong positive correlation. Concluding that education causes a person to buy a costly home overlooks other factors. Income is a plausible lurking variable: higher-income populations may be both more likely to have college degrees and more able to afford expensive homes. It’s crucial to investigate such underlying factors and to remember that a statistical relationship by itself does not establish a cause-and-effect link.
Outliers in Statistics
In statistics, outliers are data points that lie far from the rest of the dataset. They can have a large impact on results, especially in regression analysis, because a single extreme point can shift both the correlation and the regression slope. The multimillionaire who didn’t finish high school is such an outlier. In this case, including that one person changes a modest positive trend into a correlation of \(r=-0.28\), giving a misleading picture of the relationship between education and income. Deciding whether to include or exclude an outlier is often difficult, so it is important to put it in context: an outlier may reflect genuine variability, a measurement error, or unusual circumstances. Reporting findings with and without the outlier, and investigating why it occurred, helps deliver a more accurate analysis.


Most popular questions from this chapter

According to data obtained from the General Social Survey (GSS) in 2014, 1644 out of 2532 respondents were female and interviewed in person, 551 were male and interviewed in person, 320 were female and interviewed over the phone, and 17 were male and interviewed over the phone. a. Explain how we could regard either variable (gender of respondent, interview type) as a response variable. b. Display the data as a contingency table, labeling the variables and the categories. c. Find the conditional proportions that treat interview type as the response variable and gender as the explanatory variable. Interpret. d. Find the conditional proportions that treat gender as the response variable and interview type as the explanatory variable. Interpret. e. Find the marginal proportion of respondents who (i) are female, (ii) were interviewed in person.

In 2013, data was collected from the U.S. Department of Transportation and the Insurance Institute for Highway Safety. According to the collected data, the number of deaths per 100,000 individuals in the U.S. would decrease by 24.45 for every 1 percentage point gain in seat belt usage. Let \(\hat{y}=\) predicted number of deaths per 100,000 individuals in 2013 and \(x=\) seat belt use rate in a given state. a. Report the slope \(b\) for the equation \(\hat{y}=a+b x\). b. If the \(y\)-intercept equals \(32.42\), then predict the number of deaths per 100,000 people in a state if (i) no one wears seat belts, (ii) \(74\%\) of people wear seat belts (the value for Montana), (iii) \(100\%\) of people wear seat belts.

In 2015, eighth-grade math scores on the National Assessment of Educational Progress had a mean of 283.56 in Maryland compared to a mean of 284.37 in Connecticut (Source: http://nces.ed.gov/nationsreportcard/naepdata/dataset.aspx). a. Identify the response variable and the explanatory variable. b. The means in Maryland were 274, 284, 285, 291, and 294 for people who reported the number of pages read in school and for homework as 0-5, 6-10, 11-15, 15-20, and 20 or more, respectively. These means were 270, 281, 284, 289, and 293 in Connecticut. Identify the third variable given here. Explain how it is possible for Maryland to have the higher mean for each class, yet for Connecticut to have the higher mean when the data are combined. (This is a case of Simpson's paradox for a quantitative response.)

Identify the values of the \(y\)-intercept \(a\) and the slope \(b\), and sketch the following regression lines, for values of \(x\) between 0 and 10: a. \(\hat{y}=7+0.5 x\) b. \(\hat{y}=7+x\) c. \(\hat{y}=7-x\) d. \(\hat{y}=7\)

Statistical studies show that a negative correlation exists between the number of flu cases reported each week throughout the year and the amount of ice cream sold in that particular week. Based on these findings, should physicians prescribe ice cream to patients who have colds and flu or could this conclusion be based on erroneous data and statistically unjustified? a. Discuss at least one lurking variable that could affect these results. b. Explain how multiple causes could affect whether an individual catches flu.
