Regression and Correlation Analysis & ANOVA
Overview and Rationale
This assignment is designed to give you hands-on experience in performing regression and correlation analysis. The data set is provided in an Excel workbook and contains a wide range of data types that you will need to work with.
This assignment is directly linked to the following key learning outcomes from the course syllabus:
CO1: Explore the use of statistical software in data analysis through hands-on applications
CO5: Conduct regression and chi-squared tests of independence to study associations between numerical and categorical variables, respectively, and justify the legitimacy of the regression model.
Using the data provided in the attached Excel workbook, apply regression and correlation analysis on two data sets.
Follow the instructions in this project document to analyze the data presented in the Excel workbook. Then complete a report summarizing the results in your Excel workbook (or R script file). Submit both the report and the Excel workbook (or R script file).
Using the Data worksheet found in the Module 6 Project_ US Occupations.xlsx Excel workbook, complete the following analyses regarding US occupation data. Place the results in the worksheet specified in each part of the assignment.
In some parts of this project, you are asked to create random samples from a given population. Random sampling methods were covered in Module 3, and tutorials are available in the Instructor Perspective folder on your Blackboard course page.
Part 1 (Q1)
The location quotients are given for NY (population 1) and LA (population 2) in columns B and D, respectively, of worksheet Q1. The location quotient (LOC_QUOTIENT) represents the ratio of an occupation’s share of employment in a given area to that occupation’s share of employment in the U.S. as a whole. For example, an occupation that makes up 10 percent of employment in a specific metropolitan area compared with 2 percent of U.S. employment would have a location quotient of 5 for the area in question.
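The worked example above can be checked with a few lines of Python. This is only a minimal sketch: the function name `location_quotient` and its arguments are illustrative and do not come from the workbook.

```python
def location_quotient(area_share, national_share):
    """Ratio of an occupation's local employment share to its U.S. share.

    Both shares must use the same unit (both percentages or both proportions).
    """
    if national_share <= 0:
        raise ValueError("national_share must be positive")
    return area_share / national_share


# Example from the text: 10 percent of area employment vs. 2 percent of U.S. employment.
lq = location_quotient(10, 2)
print(lq)  # 5.0
```

A value above 1 means the occupation is more concentrated in the area than in the U.S. as a whole; a value below 1 means it is less concentrated.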
1. Use the random sampling method explained in the Instructor Perspective of Module 3 to draw a random sample of size 350 from the NY LOC QUOTIENTs and a random sample of size 350 from the LA LOC QUOTIENTs.
2. Copy your samples into columns F and G of worksheet Q1.
3. Standardize both samples of LOC QUOTIENTs and display the standardized values (z) in columns I and J, respectively.
4. For each of the two sets of LOC QUOTIENTs values, partition the standardized values into seven groups according to the following group specifications:
Group 1: Standardized values that are less than or equal to −0.5 (that is, z ≤ −0.5)
Group 2: Standardized values that satisfy −0.5 < z ≤ ????
Group 3: Standardized values that satisfy ???? < z ≤ ????
Group 4: Standardized values that satisfy ???? < z ≤ ????
Group 5: Standardized values that satisfy ???? < z ≤ ????
Group 6: Standardized values that satisfy ???? < z ≤ ????
Group 7: Standardized values that satisfy z > ????
5. Next, count the number of NY and the number of LA standardized LOC QUOTIENT values that fall into each of the above seven groups and complete Table A.
6. Use alpha = 0.10 to perform a Chi-squared test of independence to test the claim that the standardized LOC QUOTIENTs and locations (NY and LA) are independent factors by completing Tables B, C, and D.
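If you work in a script rather than in Excel, the sampling, standardizing, and grouping steps can be sketched as follows. This is a hedged illustration in Python, not the method from the Module 3 tutorial. The data are simulated stand-ins for the NY column, the helper names are hypothetical, the sample standard deviation (n − 1) is an assumption, and every cutoff other than −0.5 is a placeholder that you must replace with the group boundaries given in the assignment.

```python
import bisect
import random
import statistics

# Simulated stand-in for the NY LOC QUOTIENT column; replace with the real data.
rng = random.Random(1)
ny_population = [rng.lognormvariate(0, 0.6) for _ in range(600)]


def draw_sample(population, n, seed):
    """Simple random sample of size n, without replacement."""
    return random.Random(seed).sample(population, n)


def standardize(values):
    """z = (x - sample mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 denominator
    return [(x - mean) / sd for x in values]


# Upper cutoffs of groups 1 to 6; group 7 is everything above the last cutoff.
# Only -0.5 comes from the assignment; the other values are PLACEHOLDERS.
CUTOFFS = [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]


def group_counts(z_values, cutoffs=CUTOFFS):
    """Count how many z values fall into each group (lower < z <= upper)."""
    counts = [0] * (len(cutoffs) + 1)
    for z in z_values:
        # bisect_left returns the index of the first cutoff that is >= z,
        # so a value equal to a cutoff lands in the lower group.
        counts[bisect.bisect_left(cutoffs, z)] += 1
    return counts


ny_sample = draw_sample(ny_population, 350, seed=42)
ny_z = standardize(ny_sample)
ny_counts = group_counts(ny_z)
print(ny_counts)  # one row of Table A; repeat the same steps for LA
```

Fixing the seed makes the sample reproducible, which helps when the report has to match the workbook or script you submit.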
Important: In your report, explain your solution procedures and your findings.
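The Chi-squared test of independence in Part 1 can be sketched in Python as shown below. The counts are made up for illustration and are not results from the workbook, and the function name is hypothetical. The critical value 10.645 is the standard table value for 6 degrees of freedom at alpha = 0.10, which applies to a table with 2 locations and 7 groups.

```python
def chi_squared_independence(observed):
    """Return (statistic, df) for a table given as a list of rows of counts.

    Assumes every row and column total is positive, so that no expected
    count is zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Made-up counts for the seven groups (each row sums to 350).
observed = [
    [120, 80, 60, 40, 25, 15, 10],  # NY
    [110, 90, 55, 45, 20, 18, 12],  # LA
]

CRITICAL_VALUE = 10.645  # chi-squared table, df = 6, alpha = 0.10

statistic, df = chi_squared_independence(observed)
reject_independence = statistic > CRITICAL_VALUE
print(round(statistic, 3), df, reject_independence)
```

If the statistic exceeds the critical value, you reject the claim that the standardized LOC QUOTIENT group and the location are independent; otherwise you fail to reject it.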
Part 2 (Q2)
The data in Q2 consist of a sample of LOC QUOTIENTs for both NY and LA for 317 randomly selected professions.
1. Complete Table A in the Q2 worksheet by calculating the slope and the intercept of the regression line and the coefficients of correlation and determination of the regression model.
2. Create a scatter plot of the LA LOC QUOTIENTs versus those of NY. Display the regression line along with its equation and the R² value on the graph.
3. Use the calculated slope and intercept in Table A to calculate the predicted Y values and the residuals in columns I and J respectively.
4. Complete Table B.
5. Perform the procedure for creating a normal probability plot of the residuals (v).
6. Check the independence of the residuals graphically (vi).
7. Check the homoscedasticity of the residuals graphically (vii).
8. Construct a frequency distribution of the residuals consisting of 18 bins (viii).
9. Then use the Chi-squared Goodness of Fit test to test the normality of the residuals.
10. Complete your work in the designated cells of worksheet Q2, and complete Table D.
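The calculations behind Table A and the predicted values and residuals can be sketched as follows. This is a minimal Python illustration under stated assumptions: NY is treated as the X variable and LA as the Y variable (the usual reading of "LA versus NY"), the five data pairs are made up, and the function name is hypothetical. In Excel, the built-in functions SLOPE, INTERCEPT, CORREL, and RSQ return the same four quantities.

```python
import math
import statistics


def simple_regression(x, y):
    """Least-squares line y = intercept + slope * x, plus r and R^2."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # coefficient of correlation
    return slope, intercept, r, r ** 2  # r ** 2 = coefficient of determination


# Made-up values for illustration only (x = NY, y = LA).
ny = [0.8, 1.0, 1.2, 1.5, 2.0]
la = [0.9, 1.1, 1.0, 1.6, 1.9]

slope, intercept, r, r_squared = simple_regression(ny, la)
predicted = [intercept + slope * xi for xi in ny]
residuals = [yi - pi for yi, pi in zip(la, predicted)]
print(slope, intercept, r, r_squared)
```

With an intercept in the model, the residuals always sum to zero (up to rounding), which is a quick check that the predicted values were computed from the right slope and intercept.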
In your report, explain your solution procedures and your findings.
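The Chi-squared Goodness of Fit test on the residuals can be sketched as shown below. Several choices here are assumptions rather than requirements of the assignment: the 18 bins have equal width across the range of the residuals, the two outer bins are treated as open-ended so that the expected counts add up to n, and the degrees of freedom are taken as 18 − 1 − 2 because the mean and the standard deviation are estimated from the data. The residuals are simulated, and the function name is hypothetical. Follow the worksheet's own bin limits and degrees of freedom if they differ.

```python
import random
import statistics
from statistics import NormalDist


def normality_gof(residuals, n_bins=18):
    """Chi-squared goodness-of-fit of residuals to a fitted normal curve."""
    n = len(residuals)
    fitted = NormalDist(statistics.fmean(residuals), statistics.stdev(residuals))
    lo, hi = min(residuals), max(residuals)
    width = (hi - lo) / n_bins

    # Observed frequencies: equal-width bins; the maximum goes in the last bin.
    observed = [0] * n_bins
    for e in residuals:
        k = min(int((e - lo) / width), n_bins - 1)
        observed[k] += 1

    # Expected frequencies from the fitted normal CDF at the interior edges;
    # the first and last bins extend to minus and plus infinity.
    edges = [lo + width * k for k in range(1, n_bins)]
    cdf = [0.0] + [fitted.cdf(edge) for edge in edges] + [1.0]
    expected = [n * (cdf[k + 1] - cdf[k]) for k in range(n_bins)]

    statistic = sum((o - ex) ** 2 / ex for o, ex in zip(observed, expected))
    df = n_bins - 1 - 2  # assumption: two estimated parameters
    return observed, expected, statistic, df


# Simulated residuals for illustration only (317 values, as in Q2).
rng = random.Random(0)
sim_residuals = [rng.gauss(0, 1) for _ in range(317)]

observed, expected, statistic, df = normality_gof(sim_residuals)
print(round(statistic, 3), df)
```

Compare the statistic with the Chi-squared critical value for the chosen alpha and degrees of freedom (in Excel, CHISQ.INV.RT). Bins with very small expected counts can inflate the statistic, so you may need to discuss or merge sparse tail bins in your report.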
Format & Guidelines
You must submit 2 files:
An Excel workbook file containing all statistical work. The formulas must be present in their corresponding cells.
The Excel file should be named: (Your last name Week 5 Data)
A report in PDF format that includes all your findings along with any important statistical issues.
The report should be named: (Your last name Week 5 Report)
The report should use the following format:
(i) Introduction (Develop ideas and concepts about US occupations and the statistical tools to be used. Be substantial and include references.)
(ii) Analysis (Separate sections by title. Include all figures and tables as JPG images, with their corresponding titles and descriptions.)
(iii) Conclusion (Include your conclusions about the observed outcomes. What did you learn from the statistical tools used and from the results? Use references to support your conclusions.)
(iv) References (Minimum of two; proper format will be graded.)
The report should be 800-900 words and be presented in APA format.