Problem 9 (Data file: salarygov) The data ...

(Data file: salarygov) The data file gives the maximum monthly salary for 495 nonunionized job classes in a midwestern governmental unit in 1986. The variables are described in Table 5.9.

a. Examine the scatterplot of Maxsalary versus Score, and verify that simple regression provides a poor description of this figure.

b. Fit the regression with response Maxsalary and regressors given by B-splines, with \(d\) given by \(4, 5,\) and \(10\). Draw the fitted curves on a figure with the data and comment.

c. According to Minnesota statutes, and probably laws in other states as well, a job class is considered to be female dominated if \(70\%\) of the employees or more in the job class are female. These data were collected to examine whether female-dominated positions are compensated at a lower level, adjusting for Score, than are other positions. Create a factor with two levels that divides the job classes into female dominated or not. Then, fit a model that allows for a separate B-spline for Score for each of the two groups. Since the coefficient estimates for the B-splines are uninterpretable, summarize the results using an effects plot. If your program does not allow you to use B-splines, use quadratic polynomials.

Short Answer

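The scatterplot of Maxsalary versus Score should show a pattern that a straight line describes poorly. Fitting B-splines with \(d = 4, 5, 10\) gives three fitted curves of increasing flexibility to compare against the data. A two-level factor for female-dominated job classes, with a separate B-spline in Score for each level, should reveal through an effects plot whether those classes receive less compensation after adjusting for Score.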

Step by step solution

01 - Scatterplot Analysis

Plot Maxsalary against Score to see the form of their relationship. If the points do not follow a straight-line trend, for example if they bend upward, simple linear regression describes the figure poorly.
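A minimal sketch of this check, in Python with NumPy. The real salarygov values are not reproduced here, so the block generates synthetic stand-in data with a deliberately curved trend; the names `score` and `max_salary` and every numeric constant are assumptions for illustration. It fits a straight line and then averages the residuals within thirds of the Score range. A systematic sign pattern in those averages is the numerical counterpart of the curvature one would look for in the scatterplot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the salarygov data (NOT the real values):
# 495 job classes with a curved Score-to-salary relationship plus noise.
score = rng.uniform(100, 1000, size=495)
max_salary = 1000 + 2.0 * score + 0.004 * score**2 + rng.normal(0, 150, size=495)

# Simple linear regression of max_salary on score.
slope, intercept = np.polyfit(score, max_salary, deg=1)
resid = max_salary - (intercept + slope * score)

# If a straight line were adequate, the residuals would average near zero
# in every part of the Score range. Here we average them within thirds.
edges = np.quantile(score, [0, 1 / 3, 2 / 3, 1])
bin_means = [
    float(resid[(score >= lo) & (score <= hi)].mean())
    for lo, hi in zip(edges[:-1], edges[1:])
]
print(bin_means)  # positive, negative, positive indicates upward curvature
```

With real data, the synthetic arrays would be replaced by the Score and Maxsalary columns. Drawing the points with the fitted line overlaid gives the same diagnosis visually.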

02 - B-spline Regression

Fit the regression with Maxsalary as the response and B-spline functions of Score as the regressors. Repeat the fit with \(d = 4, 5, 10\), draw the three fitted curves on the scatterplot, and compare how closely each one follows the data.
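The fit can be sketched in Python with NumPy alone. Several things here are assumptions rather than the textbook's method: the data are synthetic stand-ins, \(d\) is taken to be the number of spline regressors in addition to an intercept, the splines are cubic, and the interior knots are placed at quantiles of Score. The helper names `bspline_basis` and `fit_bspline` are hypothetical.

```python
import numpy as np


def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    x = np.asarray(x, dtype=float)
    n_intervals = len(knots) - 1
    # Degree 0: indicator of each half-open knot interval [t_i, t_{i+1}).
    basis = np.zeros((x.size, n_intervals))
    for i in range(n_intervals):
        basis[:, i] = (knots[i] <= x) & (x < knots[i + 1])
    # Assign the right boundary point to the last non-empty interval.
    basis[x == knots[-1], n_intervals - degree - 1] = 1.0
    for k in range(1, degree + 1):
        raised = np.zeros((x.size, n_intervals - k))
        for i in range(n_intervals - k):
            left = knots[i + k] - knots[i]
            right = knots[i + k + 1] - knots[i + 1]
            if left > 0:
                raised[:, i] += (x - knots[i]) / left * basis[:, i]
            if right > 0:
                raised[:, i] += (knots[i + k + 1] - x) / right * basis[:, i + 1]
        basis = raised
    return basis


def fit_bspline(x, y, d, degree=3):
    """Least-squares fit spanning an intercept plus d B-spline regressors."""
    probs = np.linspace(0, 1, d - degree + 2)[1:-1]  # interior knot quantiles
    knots = np.concatenate([
        np.repeat(x.min(), degree + 1),
        np.quantile(x, probs),
        np.repeat(x.max(), degree + 1),
    ])
    # d + 1 columns that sum to 1, so the constant is already in their span.
    design = bspline_basis(x, knots, degree)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return knots, coef, design @ coef


# Synthetic stand-in data (NOT the real salarygov values).
rng = np.random.default_rng(0)
score = rng.uniform(100, 1000, size=495)
max_salary = 1000 + 2.0 * score + 0.004 * score**2 + rng.normal(0, 150, size=495)

line = np.polyval(np.polyfit(score, max_salary, deg=1), score)
rss_linear = float(np.sum((max_salary - line) ** 2))

rss = {}
for d in (4, 5, 10):
    knots, coef, fitted = fit_bspline(score, max_salary, d)
    rss[d] = float(np.sum((max_salary - fitted) ** 2))
    print(d, len(coef), round(rss[d]))
print("linear", round(rss_linear))
```

To draw a fitted curve, evaluate `bspline_basis(grid, knots, 3) @ coef` on a grid of Score values inside the observed range. The residual sum of squares for each \(d\) can be compared with the straight-line value, but a smaller value alone does not show that the extra flexibility is warranted.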

03 - Female-dominated Job Classes

Create a factor that divides the job classes into those that are female dominated and those that are not, using the criterion that a job class is female dominated if \(70\%\) or more of its employees are female. Then fit a model that allows a separate B-spline in Score for each of the two groups.
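A small sketch of building the factor, assuming the data provide the number of female employees and the total number of employees for each job class. The names `n_women` and `n_employees` and the counts below are hypothetical, not values from the data file.

```python
import numpy as np

# Hypothetical counts for five job classes (not real data).
n_women = np.array([0, 7, 14, 3, 10])
n_employees = np.array([5, 10, 20, 3, 40])

# "70% or more" is inclusive. Comparing integers (10 * women >= 7 * total)
# avoids any floating-point rounding at exactly 70%.
female_dominated = 10 * n_women >= 7 * n_employees

# Two-level factor as string labels.
labels = np.where(female_dominated, "female-dominated", "other")
print(labels)
```

The second and third job classes sit exactly at 70% and are classified as female-dominated, which is the boundary case the inclusive comparison is meant to handle.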

04 - Effects Plot

Since the coefficient estimates for the B-splines are uninterpretable, summarize the fit with an effects plot: the fitted Maxsalary drawn against Score, with one curve for each group. The plot should reveal any difference in compensation between female-dominated positions and the others after adjusting for Score.
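The values behind such a plot can be sketched as follows. To keep the block short it uses the quadratic-polynomial fallback that the exercise itself permits instead of B-splines. The data are synthetic stand-ins with a gap of 200 deliberately built in for the female-dominated group, so the printed gap says nothing about the real data; all names and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 495

# Synthetic stand-in data (NOT the real salarygov values). A gap of 200 is
# built in for the female-dominated group so the sketch has something to show.
score = rng.uniform(100, 1000, size=n)
female_dominated = rng.random(n) < 0.3
max_salary = (
    1000 + 2.0 * score + 0.004 * score**2
    - 200.0 * female_dominated
    + rng.normal(0, 150, size=n)
)

groups = {"female-dominated": female_dominated, "other": ~female_dominated}

# Common Score grid restricted to the range observed in both groups,
# so neither fitted curve is extrapolated.
lo = max(score[mask].min() for mask in groups.values())
hi = min(score[mask].max() for mask in groups.values())
grid = np.linspace(lo, hi, 50)

# Separate quadratic in Score for each level of the factor.
effects = {}
for label, mask in groups.items():
    coefs = np.polyfit(score[mask], max_salary[mask], deg=2)
    effects[label] = np.polyval(coefs, grid)

# Estimated difference between the curves, averaged over the grid.
gap = effects["other"] - effects["female-dominated"]
print(round(float(gap.mean()), 1))
```

Plotting `effects[label]` against `grid` for both labels on one set of axes gives the effects plot. The vertical distance between the two curves at a given Score is the estimated pay difference at that Score.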

Key Concepts

These are the key concepts you need to understand to accurately answer the question.

B-spline regression
B-spline regression, or basis spline regression, models a response as a smooth, possibly curved function of a predictor. It is useful when the relationship between the response and the predictor is nonlinear. Instead of fitting a single straight line, it uses a set of piecewise polynomial functions of the predictor as regressors, so the fitted curve can bend where the data require it.

In this exercise, B-splines model the relationship between Maxsalary (response) and Score (predictor). The value of \(d\), tried here at 4, 5, and 10, controls the flexibility of the fitted curve; it is best read as the size of the spline basis rather than as a polynomial degree. A larger \(d\) lets the curve follow the data more closely, at the risk of tracking noise, while a smaller \(d\) gives a smoother, less flexible fit.
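Written out, the mean function being fitted has the following form, where \(b_{1}, \ldots, b_{d}\) denote B-spline basis functions of Score. The notation is an assumption for illustration and may differ from the textbook's.

$$\mathrm{E}(\text{Maxsalary} \mid \text{Score})=\beta_{0}+\sum_{j=1}^{d} \beta_{j} b_{j}(\text{Score})$$

Each \(b_{j}\) is a piecewise polynomial in Score, and the fitted curve is a weighted sum of these overlapping pieces, which is why an individual \(\beta_{j}\) has no simple interpretation on its own.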

B-splines handle data for which simple linear regression falls short. Because the model is still linear in its coefficients, it is fit by ordinary least squares, yet the curve it produces can capture nonlinear trends.

female-dominated job classes
A job class is considered female-dominated when 70% or more of its employees are female, the definition in the Minnesota statutes cited in the exercise. This classification is what lets the analysis ask whether female-dominated job classes are compensated differently from other classes after adjusting for Score.

Creating a factor with two levels, "female-dominated" and "not female-dominated", assigns every job class to one of the two groups. The factor can then enter the regression alongside Score, so that pay is compared between groups of job classes within the same governmental unit.

Allowing a separate B-spline in Score for each level of the factor lets the salary curve differ in both height and shape between the two groups. Comparing the two fitted curves shows whether, at the same Score, female-dominated job classes tend to have lower maximum salaries. A gap between the curves is evidence of a pay difference, not by itself an explanation of its cause.
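One way to write the model with a separate curve per group uses an indicator \(F\) equal to 1 for female-dominated job classes and 0 otherwise. The symbols \(F\), \(b_{j}\), and \(\gamma_{j}\) are assumed notation, not taken from the textbook.

$$\mathrm{E}(\text{Maxsalary} \mid \text{Score}, F)=\beta_{0}+\sum_{j=1}^{d} \beta_{j} b_{j}(\text{Score})+F\left(\gamma_{0}+\sum_{j=1}^{d} \gamma_{j} b_{j}(\text{Score})\right)$$

The curve for the other job classes uses only the \(\beta\) terms, and the term multiplied by \(F\) is the difference between the two curves at each Score. If \(\gamma_{1}, \ldots, \gamma_{d}\) were all zero, the two curves would be parallel and separated by \(\gamma_{0}\).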

effects plot
An effects plot displays fitted values from a model as a function of one predictor, with separate curves for the levels of a factor. Because individual B-spline coefficients are hard to interpret directly, the plot is the natural summary of a model like this one.

In this exercise, the effects plot shows how the fitted Maxsalary varies with Score, with one curve for female-dominated job classes and one for the others.

When reading the plot:
  • A vertical gap between the two curves at a given Score indicates a difference in pay at that Score.
  • Curves that are roughly parallel suggest a constant difference, while curves that converge or cross suggest the difference varies with Score.
  • The plot summarizes the fitted model without requiring any interpretation of individual coefficients.

scatterplot analysis
A scatterplot displays the relationship between two quantitative variables. Plotting Maxsalary against Score lets you assess their association visually. Scatterplots are most useful at the start of an analysis, to check whether a straight-line trend is plausible.

Here the scatterplot indicates whether simple linear regression is appropriate for the relationship between Maxsalary and Score. If the points follow a curve, or their vertical spread changes markedly with Score, a straight-line model with constant variance is not a good description.

Key aspects to consider in scatterplot analysis:
  • Clustering and spread: are the points clumped together or widely spread?
  • Pattern: do the points follow a straight line or a curve?
  • Outliers: do any points deviate markedly from the rest?

In this exercise the scatterplot shows a poor linear relationship, which motivates more flexible models such as B-spline regression.

Most popular questions from this chapter

(Data file: MinnLand) The data file includes information on nearly every agricultural land sale in the six major agricultural regions of Minnesota for the period \(2002-2011\). The data are from the Minnesota Department of Revenue and were provided by Steven Taff. Two of the variables in the data are acrePrice, the selling price per acre adjusted to a common date within a year, and year, the year of the sale. All the variables are described in Table 5.8. a. Draw boxplots of \(\log (\text{acrePrice})\) versus year, and summarize the information in the boxplots. In particular, housing sales prices in the United States were generally increasing from about \(2002\) to \(2006\), and then began to fall beginning in 2007 or so. Is that pattern apparently repeated in Minnesota farm sales? b. Fit a regression model with \(\log (\text{acrePrice})\) as the response and a factor representing the year. Provide an interpretation of the estimated parameters. Interpret the \(t\)-statistics. (Hint: since year is numeric, you may need to turn it into a factor.) c. Fit the regression model as in the last subproblem, but this time omit the intercept. Show that the parameter estimates are the means of \(\log (\text{acrePrice})\) for each year. The standard error of the sample mean in year \(j\) is \(\mathrm{SD}_{j} / \sqrt{n_{j}}\), where \(\mathrm{SD}_{j}\) and \(n_{j}\) are the sample standard deviation and sample size for the \(j\)th year. Show that the standard errors of the regression coefficients are not the same as these standard errors and explain why they are different.

(Data file: BGSall) Refer to the Berkeley Guidance study described in Problem 3.3. Using the data file BGSall, consider the regression of HT18 on HT9 and the grouping factor Sex. a. Draw the scatterplot of HT18 versus HT9, using a different symbol for males and females. Comment on the information in the graph about an appropriate mean function for these data. b. Obtain the appropriate test for a parallel regression model. c. Assuming the parallel regression model is adequate, estimate a \(95\%\) confidence interval for the difference between males and females. For the parallel regression model, this is the difference in the intercepts of the two groups.

For a factor \(X\) with \(d\) categories, the one-factor mean function is $$\mathrm{E}\left(Y | U_{2}, \ldots, U_{d}\right)=\beta_{0}+\beta_{2} U_{2}+\cdots+\beta_{d} U_{d}$$ where \(U_{j}\) is a dummy variable equal to 1 for the \(j\)th level of the factor and 0 otherwise. a. Show that \(\mu_{1}=\beta_{0}\) is the mean for the first level of \(X\) and that \(\mu_{j}=\beta_{0}+\beta_{j}\) is the mean for all the remaining levels, \(j=2, \ldots, d\). b. It is convenient to use two subscripts to index the observations, so \(y_{j i}\) is the \(i\)th observation in level \(j\) of the factor, \(j=1, \ldots, d\) and \(i=1, \ldots, n_{j}\). The total sample size is \(n=\Sigma n_{j}\). The residual sum of squares function can then be written as $$\operatorname{RSS}(\boldsymbol{\beta})=\sum_{j=1}^{d} \sum_{i=1}^{n_{j}}\left(y_{j i}-\beta_{0}-\beta_{2} U_{2}-\cdots-\beta_{d} U_{d}\right)^{2}$$ Find the OLS estimates of the \(\beta\)s, and then show that the OLS estimates of the group means are \(\hat{\mu}_{j}=\bar{y}_{j}, j=1, \ldots, d\), where \(\bar{y}_{j}\) is the average of the \(y\)s for the \(j\)th level of \(X\). c. Show that the residual sum of squares can be written $$\mathrm{RSS}=\sum_{j=1}^{d}\left(n_{j}-1\right) \mathrm{SD}_{j}^{2}$$ where \(\mathrm{SD}_{j}\) is the standard deviation of the responses for the \(j\)th level of \(X\). What is the \(df\) for RSS? d. If all the \(n_{j}\) are equal, show that (1) the standard errors of \(\hat{\beta}_{2}, \ldots, \hat{\beta}_{d}\) are all equal, and (2) the standard error of \(\hat{\beta}_{0}\) is equal to the standard error of each of \(\hat{\beta}_{0}+\hat{\beta}_{j}, j=2, \ldots, d\).

(Data file: MinnLand) Refer to Problem 5.4. Another variable in this data file is the region, a factor with six levels that are geographic identifiers. a. Assuming both year and region are factors, consider the two mean functions given in Wilkinson-Rogers notation as: (a) log(acrePrice) \(\sim\) year \(+\) region; (b) log(acrePrice) \(\sim\) year \(+\) region \(+\) year:region. Explain the difference between these two models (no fitting is required for this problem). b. Fit model (b). Examining the coefficients of this model is unpleasant because there are so many of them, and summaries either using graphs or using tests are required. We defer tests until the next chapter. Draw an effects plot for the year by region interaction and summarize the graph or graphs.

Interpreting parameters with factors and interactions. Suppose we have a regression problem with a factor \(A\) with two levels \(\left(a_{1}, a_{2}\right)\) and a factor \(B\) with three levels \(\left(b_{1}, b_{2}, b_{3}\right)\), so there are six treatment combinations. Suppose the response is \(Y\), and further that \(\mathrm{E}\left(Y | A=a_{i}, B=b_{j}\right)=\mu_{i j}\). The estimated \(\mu_{i j}\) are the quantities that are used in effects plots. The purpose of this problem is to relate the \(\mu_{i j}\) to the parameters that are actually fit in models with factors and interactions. a. Suppose the dummy regressors (see Section 5.1.1) for factor \(A\) are named \(\left(A_{1}, A_{2}\right)\) and the dummy regressors for factor \(B\) are named \(\left(B_{1}, B_{2}, B_{3}\right)\). Write the mean function $$\mathrm{E}\left(Y | A=a_{i}, B=b_{j}\right)=\beta_{0}+\beta_{1} A_{2}+\beta_{2} B_{2}+\beta_{3} B_{3}+\beta_{4} A_{2} B_{2}+\beta_{5} A_{2} B_{3}$$ in Wilkinson-Rogers notation (e.g., (3.19) in Chapter 3). b. The model in Problem 5.5.1 has six regression coefficients, including an intercept. Express the \(\beta\)s as functions of the \(\mu_{i j}\). c. Repeat Problem 5.5.2, but start with \(Y \sim A+B\). d. We write \(\mu_{+j}=\left(\mu_{1 j}+\mu_{2 j}\right) / 2\) to be the "main effect" of the \(j\)th level of factor \(B\), obtained by averaging over the levels of factor \(A\). For the model of Problem 5.5.2, show that the main effects of \(B\) depend on all six \(\beta\)-parameters. Show how the answer simplifies for the model of Problem 5.5.3. e. Start with the model of Section 5.5.1. Suppose the combination \(\left(a_{2}, b_{3}\right)\) is not observed, so we have only five unique cell means. How are the \(\beta\)s related to the \(\mu_{i j}\)? What can be said about the main effects of factor \(B\)?
