CORRELATION AND REGRESSION ANALYSIS QUESTIONS

BUS 3700: Project on regression analysis

This is to be done individually.

1. Collect data on about ten to fifteen single-family houses for sale in one locality (using websites such as realtor.com or zillow.com). Try to get a random sample: for example, do not select only high-priced or only low-priced houses. Record these variables: address, area in square feet, number of bedrooms, age in years, and asking price. Make sure that every house you select has all of this information; if any value is missing for a particular house, choose another house. If the year built is given, subtract it from 2020 to get the age in years.

Example:

address                                        area (sq. ft)   number of bedrooms   age (years)   asking price
1060 Blackhawk Dr, University Park, IL 60484   1,870           3                    19            135,000

2. Run a simple regression analysis on ‘asking price’ versus ‘area in square feet’ (the latter being the independent variable). You may use Excel or StatCrunch. Create a report with:

page 1: the data table (address, area in square feet, number of bedrooms, age, asking price)
page 2: a scatter plot of ‘asking price’ versus ‘area in square feet’
page 3: a narrative with the line of best fit (the equation), a statement about how good the fit is, an interpretation of the “fit”, whether the slope is significant, an interpretation of the slope, a 95% confidence interval for the slope, and an estimate of the ‘asking price’ for a house with an area of 2100 square feet.

(Extra credit: using StatCrunch, find the 95% confidence interval for the mean price of houses with an area of 2100 square feet, and a prediction interval for the price of a house with an area of 2100 square feet.)

3. Run a multiple regression analysis on ‘asking price’ versus ‘area in square feet’, ‘number of bedrooms’, and ‘age’. Add one more page to the previous report:

page 4: a narrative with the line of best fit, a statement about how good the fit is, an interpretation of the “fit”, whether at least one slope is (significantly) different from zero, and if so, whether each of the individual slopes is different from zero, a 95% confidence interval for each slope, and an estimate of the ‘asking price’ for a house with an area of 2100 square feet, 4 bedrooms, and 20 years old.

(Extra credit: using StatCrunch, find the 95% confidence interval for the mean price of houses with an area of 2100 square feet, 4 bedrooms, and 20 years of age; and a prediction interval for the price of a house with an area of 2100 square feet, 4 bedrooms, and 20 years of age.)

Submit the report (a Word file, with tables and charts copied and pasted into it) with a cover page that has your name. The report is due in Blackboard on Sunday, August 16, 2020 (it would be best to do it on August 9 or so, as soon as you have finished the quiz on multiple regression).
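The simple regression in step 2 can be sketched in a few lines of Python as a cross-check on what Excel or StatCrunch reports. This is only an illustration: apart from the one example house given above (1,870 sq. ft, 135,000), the data points are hypothetical placeholders, and the function name `simple_regression` is invented for this sketch. It computes the least-squares line, R-squared, and the point estimate at 2100 square feet. It does not compute the slope's significance test or confidence interval, which need the standard error and a t quantile from the statistical package.

```python
# Hypothetical sample for illustration; replace with the houses you collect.
# Only the third pair (1870, 135000) comes from the example row in the assignment.
areas = [1200, 1500, 1870, 2200, 2600]            # area in square feet
prices = [98000, 118000, 135000, 160000, 185000]  # asking price


def simple_regression(x, y):
    """Return (intercept, slope, r_squared) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation coefficient
    return intercept, slope, r_squared


intercept, slope, r_squared = simple_regression(areas, prices)
predicted_2100 = intercept + slope * 2100  # point estimate for a 2100 sq. ft house

print(f"asking price = {intercept:.1f} + {slope:.2f} * area")
print(f"R-squared = {r_squared:.3f}")
print(f"estimated asking price at 2100 sq. ft: {predicted_2100:.0f}")
```

The slope is read as the estimated change in asking price per additional square foot, and R-squared as the share of the variation in asking price explained by area.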

Assignment 1:

Plotting linear and nonlinear regressions: we downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We first created new variables: age10 = age/10 and age10.sq = (age/10) ^2, and indicators age18.29, age30.44, age45.64, and age65up for four age categories. We then fit some regressions, with the following results:

lm(formula = weight ~ age10)
            coef.est coef.se
(Intercept)  161.0     7.3
age10          2.6     1.6
  n = 2009, k = 2
  residual sd = 119.7, R-Squared = 0.00

lm(formula = weight ~ age10 + age10.sq)
            coef.est coef.se
(Intercept)   96.2    19.3
age10         33.6     8.7
age10.sq      -3.2     0.9
  n = 2009, k = 3
  residual sd = 119.3, R-Squared = 0.01

lm(formula = weight ~ age30.44 + age45.64 + age65up)
             coef.est coef.se
(Intercept)   157.2     5.4
age30.44TRUE   19.1     7.0
age45.64TRUE   27.2     7.6
age65upTRUE     8.5     8.7
  n = 2009, k = 4
  residual sd = 119.4, R-Squared = 0.01

a) On a graph of weights versus age (that is, weight on y-axis, age on x-axis), draw the fitted regression line from the first model.

b) On the same graph, draw the fitted regression line from the second model.

c) On another graph with the same axes and scale, draw the fitted regression line from the third model. (It will be discontinuous.)
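As an aid for parts a) to c), the three fitted models can be written as functions of age and tabulated before drawing. This is a minimal sketch using only the coefficients reported above. The function names are invented here, and the category boundaries (30 to 44, 45 to 64, 65 and up, with 18 to 29 as the baseline) are read off the indicator names, so treating them as half-open intervals on a continuous age scale is an assumption.

```python
def model1(age):
    """Linear fit: weight ~ age10, with age10 = age / 10."""
    age10 = age / 10
    return 161.0 + 2.6 * age10


def model2(age):
    """Quadratic fit: weight ~ age10 + age10.sq."""
    age10 = age / 10
    return 96.2 + 33.6 * age10 - 3.2 * age10 ** 2


def model3(age):
    """Step function: baseline is ages 18-29, plus an offset for each older category."""
    weight = 157.2
    if 30 <= age < 45:
        weight += 19.1
    elif 45 <= age < 65:
        weight += 27.2
    elif age >= 65:
        weight += 8.5
    return weight


# Tabulate the three fits at a few ages; these points can be passed to any plotting tool.
print(f"{'age':>4} {'model1':>8} {'model2':>8} {'model3':>8}")
for age in range(20, 90, 10):
    print(f"{age:>4} {model1(age):>8.1f} {model2(age):>8.1f} {model3(age):>8.1f}")
```

The first model gives a straight line, the second a parabola that opens downward (the age10.sq coefficient is negative), and the third a flat segment per age category with jumps at the category boundaries.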

A sales region has been divided into five territories, each of which was believed to have equal sales potential. The actual Sales Volume for several sampled days is logged in DATA. At α = 0.05, do the territories have equal Sales Volume?
( ) None of the answers are correct.
( ) H0: Territories have equal Sales Volume is not rejected with p-value 0.074. The counts are consistent with the model of equal proportions.
( ) H0: Territories have equal Sales Volume is not rejected with p-value 0.334. The counts are consistent with the model of equal proportions.
( ) H0: Territories have equal Sales Volume is rejected with p-value 0.012. The counts are not consistent with the model of equal proportions. Territories have unequal Sales Volume.
( ) H0: Territories have equal Sales Volume is rejected with p-value 0.041. The counts are not consistent with the model of equal proportions. Territories have unequal Sales Volume.
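The answer choices refer to a chi-square goodness-of-fit test against equal proportions. DATA is not reproduced here, so the sketch below uses hypothetical counts purely to show the mechanics; its result says nothing about which option is correct. The function names are invented for this sketch, and the p-value uses the closed-form upper tail of the chi-square distribution that holds for even degrees of freedom (five territories give df = 4).

```python
import math


def chi_square_equal_proportions(counts):
    """Return (statistic, df) for H0: all categories have equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, len(counts) - 1


def chi_square_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x); closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term           # term equals half**i / i!
        term *= half / (i + 1)
    return math.exp(-half) * total


# Hypothetical territory totals, NOT the values logged in DATA.
hypothetical_counts = [110, 130, 70, 90, 100]
statistic, df = chi_square_equal_proportions(hypothetical_counts)
p_value = chi_square_upper_tail_even_df(statistic, df)

alpha = 0.05
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

With the real data, H0 is rejected when the p-value falls below α = 0.05 and not rejected otherwise.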

Question 1 (Introduction to Probability)

a. The following two-way contingency table gives the breakdown of the population in a particular locale according to age and tobacco usage:

              Tobacco Use
Age           Smoker    Non-smoker
Under 30      0.05      0.20
Over 30       0.20      0.55

A person is selected at random. Find the probability of each of the following events.
i. The person is a smoker.
ii. The person is under 30.
iii. The person is under 30.

b. The following two-way contingency table gives the breakdown of the population of adults in a particular locale according to highest level of education and whether or not the individual regularly takes dietary supplements:

                          Use of Supplements
Education                 Takes    Does not take
No High school Diploma    0.04     0.06
High school Diploma       0.06     0.44
Undergraduate Degree      0.09     0.28
Graduate Degree           0.01     0.02

An adult is selected at random. Find the probability of each of the following events.
i. The person has a high school diploma and takes dietary supplements regularly.
ii. The person has an undergraduate degree and takes dietary supplements regularly.
iii. The person takes dietary supplements regularly.
iv. The person does not take dietary supplements regularly.

(25 marks)
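In a two-way table of joint probabilities, a cell gives the probability of both attributes together, and a marginal probability is the sum of a row or a column. A minimal sketch of that bookkeeping for the two tables in Question 1 follows; the dictionary layout and the helper name `marginal` are choices made for this illustration.

```python
# Joint probabilities keyed by (row label, column label), copied from the tables above.
age_tobacco = {
    ("Under 30", "Smoker"): 0.05, ("Under 30", "Non-smoker"): 0.20,
    ("Over 30", "Smoker"): 0.20, ("Over 30", "Non-smoker"): 0.55,
}
education_supplements = {
    ("No High school Diploma", "Takes"): 0.04, ("No High school Diploma", "Does not take"): 0.06,
    ("High school Diploma", "Takes"): 0.06, ("High school Diploma", "Does not take"): 0.44,
    ("Undergraduate Degree", "Takes"): 0.09, ("Undergraduate Degree", "Does not take"): 0.28,
    ("Graduate Degree", "Takes"): 0.01, ("Graduate Degree", "Does not take"): 0.02,
}


def marginal(table, position, label):
    """Sum the joint probabilities whose key has `label` at `position` (0 = row, 1 = column)."""
    return sum(p for key, p in table.items() if key[position] == label)


p_smoker = marginal(age_tobacco, 1, "Smoker")
p_under_30 = marginal(age_tobacco, 0, "Under 30")
p_hs_and_takes = education_supplements[("High school Diploma", "Takes")]
p_takes = marginal(education_supplements, 1, "Takes")
p_does_not_take = marginal(education_supplements, 1, "Does not take")

print(f"P(smoker) = {p_smoker:.2f}, P(under 30) = {p_under_30:.2f}")
print(f"P(high school diploma and takes) = {p_hs_and_takes:.2f}")
print(f"P(takes) = {p_takes:.2f}, P(does not take) = {p_does_not_take:.2f}")
```

A quick sanity check on any such table is that all cells sum to 1, and that complementary marginals (takes, does not take) sum to 1 as well.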

Question 2 (Mutually Exclusive, Independent Events, Conditional Probability)

a. Volunteers for a disaster relief effort were classified according to both specialty (C: construction, E: education, M: medicine) and language ability (S: speaks a single language fluently, T: speaks two or more languages fluently). The results are shown in the following two-way classification table:

             Language Ability
Specialty    S     T
C            12    1
E            4     3
M            6     2

A volunteer is selected at random, meaning that each one has an equal chance of being chosen. Find the probability that:
i. His specialty is medicine and he speaks two or more languages;
ii. Either his specialty is medicine or he speaks two or more languages;
iii. His specialty is something other than medicine.

b. A jar contains 12 marbles, 7 black and 5 white. Two marbles are drawn without replacement, which means that the first one is not put back before the second one is drawn.
i. What is the probability that both marbles are black?
ii. What is the probability that exactly one marble is black?
iii. What is the probability that at least one marble is black?

c. The probability that a regularly scheduled flight departs on time is P(D) = 0.83; the probability that it arrives on time is P(A) = 0.82; and the probability that it departs and arrives on time is P(A∩D) = 0.78. Find the probability that a plane:
i. Arrives on time given that it departed on time.
ii. Departed on time given that it has arrived on time.
iii. Arrives on time, given that it did not depart on time.
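Parts b and c of Question 2 rest on two rules: for draws without replacement, multiply the probability of the first draw by the probability of the second given the first; and for conditional probability, P(A|D) = P(A∩D) / P(D). The sketch below works both with exact fractions, using only the numbers stated in the question. Variable names are chosen for this illustration.

```python
from fractions import Fraction

# Part b: 12 marbles, 7 black and 5 white, two drawn without replacement.
black, white = 7, 5
total = black + white

p_both_black = Fraction(black, total) * Fraction(black - 1, total - 1)
p_no_black = Fraction(white, total) * Fraction(white - 1, total - 1)
p_at_least_one_black = 1 - p_no_black               # complement of "both white"
p_exactly_one_black = p_at_least_one_black - p_both_black

# Part c: P(D) = 0.83, P(A) = 0.82, P(A and D) = 0.78.
p_d = Fraction(83, 100)
p_a = Fraction(82, 100)
p_a_and_d = Fraction(78, 100)

p_a_given_d = p_a_and_d / p_d
p_d_given_a = p_a_and_d / p_a
# P(A and not D) = P(A) - P(A and D); P(not D) = 1 - P(D).
p_a_given_not_d = (p_a - p_a_and_d) / (1 - p_d)

print("both black:", p_both_black)
print("exactly one black:", p_exactly_one_black)
print("at least one black:", p_at_least_one_black)
print("P(A|D) =", p_a_given_d, " P(D|A) =", p_d_given_a, " P(A|not D) =", p_a_given_not_d)
```

Working in `Fraction` keeps the results exact, which makes it easy to compare them with hand calculations before converting to decimals.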

Question 3 (Discrete Event Probabilities)

a. The following table gives the probability distribution of a random variable X.

X       1       2    3      4
P(X)    0.25    c    0.5    0.125

i. Find the value of the constant c.
ii. Find:
   i. P(X>1)
   ii. P(0<X=2)

b. Three coins are tossed at once and the outcomes are noted. Let X denote the number of heads that are obtained.
i. Construct the probability distribution for X.
ii. Compute the mean μ of X.
iii. Find:
   i. P(X>=2)
   ii. P(X3)
   iv. P(1<X<X8)
   iii. P(X>=4)
iv. Calculate the Cumulative Distribution Function (CDF).
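Two facts drive Question 3: the probabilities of a discrete distribution must sum to 1, and the distribution of a count such as "number of heads" can be built by enumerating equally likely outcomes. The sketch below illustrates both; it assumes the three coins are fair, which the question does not state explicitly, and the variable names are chosen for this illustration.

```python
from fractions import Fraction
from itertools import product

# Part a: the known probabilities are 0.25, 0.5 and 0.125; c is whatever is left.
known = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 8)]
c = 1 - sum(known)

# Part b: enumerate the 8 equally likely outcomes of three fair coins (assumption).
outcomes = list(product("HT", repeat=3))
pmf = {}
for outcome in outcomes:
    heads = outcome.count("H")
    pmf[heads] = pmf.get(heads, Fraction(0)) + Fraction(1, len(outcomes))

mean = sum(x * p for x, p in pmf.items())

# Cumulative distribution function: running total of the pmf in increasing x.
cdf = {}
running = Fraction(0)
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

p_at_least_two = sum(p for x, p in pmf.items() if x >= 2)

print("c =", c)
for x in sorted(pmf):
    print(f"P(X = {x}) = {pmf[x]},  F({x}) = {cdf[x]}")
print("mean =", mean, " P(X >= 2) =", p_at_least_two)
```

Any event probability for X then follows by summing the pmf over the values that satisfy the event; values X cannot take contribute zero.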

b. The probability density function of a continuous random variable is given by
i. Find the value of the constant k.
ii. Find the cumulative distribution function of X.
iii. Use the cumulative distribution function to find the following probabilities:
   a. P(X>4)
   b. P(2<=X<=3)
   c. P(X>7)

(25 marks)

Question 5 (Central tendency)

a. A random sample of ten students is taken from the student body of a college and their GPAs are recorded as follows:

1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.00

Find the following:
i. Range
ii. Mean
iii. Mode
iv. Median

b. The daily profits for 100 shops in a departmental store are distributed as follows:

Profit per shop (in Rs)    0 - 100    100 - 200    200 - 300    300 - 400    400 - 500    500 - 600
No. of shops               12         18           27           20           17           6

Calculate the following:
i. Mean
ii. Mode
iii. Median
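For Question 5, the raw-data measures in part a come straight from Python's `statistics` module, while part b needs grouped-data formulas. The sketch below assumes the usual textbook conventions for grouped data: class midpoints for the mean, and linear interpolation inside the median class and the modal class. Whether the course expects exactly these formulas is an assumption, and the variable names are chosen for this illustration.

```python
import statistics

# Part a: raw GPAs.
gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.00]
gpa_range = max(gpas) - min(gpas)
gpa_mean = statistics.mean(gpas)
gpa_mode = statistics.mode(gpas)
gpa_median = statistics.median(gpas)

# Part b: grouped profits, classes of width 100 starting at these lower limits.
lower_limits = [0, 100, 200, 300, 400, 500]
width = 100
freqs = [12, 18, 27, 20, 17, 6]
n = sum(freqs)

# Mean: treat every shop in a class as sitting at the class midpoint.
grouped_mean = sum(f * (lo + width / 2) for lo, f in zip(lower_limits, freqs)) / n

# Median: find the class where the cumulative frequency first reaches n/2, then interpolate.
cumulative = 0
for lo, f in zip(lower_limits, freqs):
    if cumulative + f >= n / 2:
        grouped_median = lo + (n / 2 - cumulative) / f * width
        break
    cumulative += f

# Mode: interpolate inside the class with the highest frequency.
m = freqs.index(max(freqs))
f1 = freqs[m]
f0 = freqs[m - 1] if m > 0 else 0
f2 = freqs[m + 1] if m < len(freqs) - 1 else 0
grouped_mode = lower_limits[m] + (f1 - f0) / (2 * f1 - f0 - f2) * width

print(f"GPA range {gpa_range:.2f}, mean {gpa_mean:.3f}, mode {gpa_mode:.2f}, median {gpa_median:.2f}")
print(f"grouped mean {grouped_mean:.2f}, median {grouped_median:.2f}, mode {grouped_mode:.2f}")
```

Grouped-data results are approximations, since the individual profits inside each class are unknown.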