Recipes for the Design of Experiments/Chapter 7: Resampling Methods

The dataset under analysis is Release 27 of the USDA National Nutrient Database for Standard Reference. There are 8,618 different foods included, but for the Analysis of Variance, this list will be truncated to 25 levels based on the first letter of the food name. Note, there was no food name starting with the letter “X”. ANOVA is used to determine the effect of a food's Short_Description on the protein content (g) of that food. Resampling is then performed with bootstrapping and Monte Carlo simulation to overcome the limiting assumption of normality.

The following analysis is a one-factor ANOVA performed on a dataset concerning marital affairs. The dataset contains 601 entries, and the analysis seeks to determine whether a person's religion, and how religious that person is, has an effect on the number of marital affairs they have. The experiment has two parts: the first is a regular ANOVA using a linear model, and the second uses resampling via a Monte Carlo simulation and the bootstrap method to perform a secondary ANOVA test.

This analysis is a single-factor, multiple-level ANOVA on a dataset that gives the performance of minorities on a computer science exam. From the initial results (ANOVA testing) we decide on a resampling technique and again perform ANOVA on the simulated data. The resampling technique found suitable for this experiment is bootstrapping. Both the results from the initial dataset and the bootstrapped version fail to reject the null hypothesis for the experimental question set up at the beginning of the analysis.

The experiment conducted in this recipe determines whether the variation in NFL player weight can be attributed to differences between NFL teams. An analysis of variance at a 95% confidence level is performed to determine whether the team a player belongs to has an effect on his weight. After the ANOVA, a "Monte Carlo" simulation and a bootstrap method are implemented to determine the effect of resampling.

This recipe for the design of experiments uses a dataset containing crime rates in the United States to demonstrate the proper use of resampling techniques to estimate the actual distribution of a dataset for use in the analysis of variance (ANOVA). The results of the ANOVA are then more trustworthy, because the test no longer relies on its inherent assumption of normality.
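As a rough illustration of this idea, the Python sketch below bootstraps the one-way ANOVA F statistic instead of comparing it with the theoretical F-distribution. It is not taken from the recipe, which is written in R. The function names (`f_statistic`, `bootstrap_p_value`), the choice to resample from the pooled data under the null hypothesis, and the toy data are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of numbers)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        # Degenerate resample (every group is constant): treat F as infinite.
        return math.inf
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def bootstrap_p_value(groups, n_boot=2000, seed=0):
    """Estimate P(F* >= F_observed) by resampling under the null hypothesis.

    Under the null, group labels carry no information, so each bootstrap
    group is drawn with replacement from the pooled observations.
    """
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    exceed = 0
    for _ in range(n_boot):
        resampled = [rng.choices(pooled, k=size) for size in sizes]
        if f_statistic(resampled) >= observed:
            exceed += 1
    return observed, exceed / n_boot


# Hypothetical toy data: three groups of three observations each.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
observed_f, p_value = bootstrap_p_value(toy_groups)
print(observed_f, p_value)
```

The empirical p-value is the share of bootstrap F statistics at least as large as the observed one, so no normality assumption enters the calculation.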

This recipe examines the budget share of food for Spanish households, using data from the Ecdat package. With this dataset, we test a model with a single multi-level factor on a single response variable.

The following test concerns the effect of computer screen size on computer price. We use ANOVA to investigate whether the variance of price can be explained by the variance of screen size. To double-check the validity of the ANOVA with a more computational approach and to improve its accuracy, we use the bootstrap resampling technique. The resampled and theoretical F-distributions turn out to be very different, as do the ANOVA results based on the original data and on resampling, indicating that the data might not be normally distributed.

The following recipe analyzes US unemployment data and how the various reasons for unemployment may explain the variation in the response variable, duration of unemployment (in weeks). A one-way, four-level analysis of variance (ANOVA) was used for this recipe. Various resampling techniques were also used to check model adequacy.

The following analysis was performed on a dataset of ship accidents. A one-factor, multiple-level ANOVA test was performed on a single response. Bootstrapping was used as a resampling technique for model adequacy checking.

In this study, a single-factor, multi-level experiment is performed using the "Benefits" data frame from the R package "Ecdat", which contains data from the research publication “The impact of unemployment insurance benefit levels on recipiency” (McCall, B.P. 1995). The experiment tests whether the reason for job loss among blue collar workers has a statistically significant effect on the state unemployment rate (in %). In the dataset, the factor 'joblost' records the reason for job loss, and the response variable 'stateur' denotes the state unemployment rate. To assess significance, an ANOVA (with and without resampling via bootstrapping) is performed and Tukey Honest Significant Differences are computed.

In this recipe, the Earnings dataset from the Ecdat data package is analyzed using ANOVA, both with and without resampling via bootstrapping.

The purpose of this recipe is to use resampling methods to repeat the ANOVA and compare the results. The dataset used for this experiment is "Star" from the "Ecdat" package in R, which is used to explore the effects of small class sizes on learning. In this study, we focus on the effect of class type on students' total math scaled scores. ANOVA is performed on both the original data and the resampled data (by the bootstrapping method), and the results are shown to be consistent.

The following recipe examines the Crime dataset from the Ecdat package, named on the list of 100+ interesting datasets webpage. Regions in North Carolina are examined to see whether they can explain any of the variation in crime rate in a single-factor, multiple-level ANOVA. Resampling (Monte Carlo and bootstrapping) ANOVAs are also performed, followed by checks of model adequacy. Contingencies are also discussed for when model assumptions are broken.

The following analysis uses one-factor ANOVA and resampling to examine how mean singer height in inches varies across the four voice parts in the NY Choral Society.

The California Test Score Data Set from 1998-1999 was used for comparing the average expenditure per student between districts of different grade span (K-6 or K-8). The data were blocked by county to reduce the effect of socioeconomic disparities that might result between districts located in different regions of the state. A function to calculate Cohen's d was used to select the response variable. G*Power software was used to generate an ideal sample size for alpha = 0.05 and power = 0.9. ANOVA using this sample size demonstrated a statistically significant difference in the mean between samples, but the result was shown to be not robust due to violations of the normality assumption of the data. Two alternatives to null hypothesis statistical testing, Resampling and Plot Plus Error Bar, were used to analyze the data.
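The abstract mentions a function to calculate Cohen's d but does not show it. A minimal Python sketch of the standard pooled-standard-deviation form follows. The function name and the sample values are hypothetical, and the recipe's own R function may differ in detail.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical values standing in for a candidate response in two groups.
group_k6 = [2, 4, 6]
group_k8 = [1, 3, 5]
print(cohens_d(group_k6, group_k8))
```

Computing d for each candidate response variable and comparing magnitudes is one plausible way such a function could guide the choice of response.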

This recipe analyzes the effect that union representation has on the hourly wage of young male workers employed as Craftsmen from 1980-1987. These workers were employed across 12 industries. Since industry would likely impact wage, this factor was blocked in the ANOVA test. This resulted in a randomized incomplete block design being used to determine and analyze the significance of the main effect of union representation. A sample size of 50 was used along with an alpha of 0.05 and beta of 0.05. The ANOVA test output gave a p-value that indicated a statistically significant difference in the means between the two groups (union and non-union representation). Additionally, two alternatives to Null Hypothesis Statistical Testing (NHST) were used. Resampling using the t-statistic and effect size with significance interval were conducted to help validate and substantiate the conclusion from the NHST.
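The abstract does not say how the resampling with the t-statistic was carried out. One common approach, assumed here, is a permutation test: group labels are shuffled repeatedly and the t-statistic is recomputed each time. The Python sketch below uses hypothetical function names and made-up wage values, and assumes neither group is constant (so the standard error is never zero).

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided p-value: share of label shuffles with |t| >= observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign observations to the two groups at random
        if abs(t_statistic(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return hits / n_perm


# Made-up hourly wages for illustration only.
union_wages = [10, 11, 12, 13, 14]
non_union_wages = [1, 2, 3, 4, 5]
print(permutation_p_value(union_wages, non_union_wages))
```

A small p-value from the shuffled labels would corroborate the ANOVA conclusion without assuming normally distributed errors.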

(KU) BudgetFood, selected from the Ecdat package, is a cross-section of the “Budget Share of Food for Spanish Households” from 1980; the main source of the data is the Journal of Applied Econometrics. The continuous response variable is the percentage of total expenditure that the household spent on food. Gender of the reference person (income owner) is selected as the independent variable, and this factor is blocked by the town size variable. Our null hypothesis for this design is: “The percentage of total expenditure that the household spends on food is not affected by the gender of the reference person.” To test the null hypothesis, the sample size was first determined with G*Power, and the data were randomly re-organized to create a balanced design of the calculated sample size. ANOVA was then performed, after which confidence intervals and resampling methods were used as alternatives to Null Hypothesis Statistical Testing.

(MR) The Housing dataset, found in the Ecdat package, contains data on housing prices: the actual selling price, the number of bedrooms and bathrooms, whether the house has a driveway, and whether it is in a 'preferred area' of town, to name a few. An analysis was done to determine the effect that having a driveway has on the selling price of a house, blocked by whether or not the house is in a 'preferred area' of town. The null hypothesis is that neither the driveway nor the area has an effect on the selling price of a house. This was first examined using an ANOVA, then a resampling technique was implemented, followed by a power analysis using G*Power.

(M Wassick) The Star data frame, from the Ecdat package, comes from a study of the effect of class style on learning conducted from 1985-1989. Learning was assessed with test scores and analyzed based on a variety of factors. This analysis uses the blocking variable sex and the testing variable classk, which is the class style. The impact of class style on student achievement will be tested in multiple ways. First, an ANOVA will be conducted using the null hypothesis that classk does not have an impact on achievement and the alternative hypothesis that class style has a significant effect on student achievement. Additionally, the hypotheses will be tested and corroborated using CI assessment and resampling with bootstrapped ANOVA.

(Y Ding) The data used in this experiment come from the "Ecdat" package in R. The data frame contains parameters such as mileage and country of brand for 60 types of cars. The objective of this experiment was to test the variation of car price in response to changes in Mileage. Mileage and Country were categorical independent variables, while the dependent variable Price was continuous. The experiment execution was randomized, and the nuisance factor (Country) was blocked. To economize on time and expenditures, we experimented with a sample size that was just large enough to detect the effect size of Mileage. With the null hypothesis that Mileage had no effect on car Price, we performed different analyses assuming that the error in the data was normally distributed. However, according to the results, our analysis based on this assumption could be inaccurate, and another model should be considered.

(DR) A study developed from the Housing data of the city of Windsor was used to test the factors that affect the sales price of houses in urban areas. Null hypothesis testing techniques were performed where the response variable was price. Two factors were analyzed in the study: the number of bathrooms and the preference in neighborhood. Confidence intervals and resampling techniques were used as alternatives to detect significant differences between the groups defined by number of bathrooms. The significance level selected was 0.05 and the power was 0.90. Results show that the price of houses differs significantly among houses with 1, 2, or 3 bathrooms.

(LZ) In this study, we use data containing 6259 observations from 1993 to 1995 across the United States. The dataset includes price, hardware characteristics, manufacturer, and other information. The main purpose is to test whether CD-ROM installation has a sizeable effect on price. Blocking is used in our analysis. We believe the size of the screen is not a main factor affecting the price, and we are not interested in this factor, so the samples are blocked by screen size. We obtain a sample size of 84 by setting alpha = 0.05 and beta = 0.05. The ANOVA shows a significant difference based on the sample we selected. In the end, we use two alternative methods to test our results.

(MS) In this recipe, we used data containing 4165 observations from 1976 to 1982. The dataset includes logarithmic wage, union membership, marital status, gender, whether or not the individual is black, and other variables. The main purpose is to test whether an individual being black has a significant effect on logarithmic wage. Blocking is used in the analysis. An ANOVA was used, as well as resampling and effect size.

(Bok, Joonhyuk) Among the datasets in the Ecdat R package, we select “Diamond”, which is useful for predicting and calculating diamond prices. In the Diamond data, ‘colour’ and ‘clarity’ are selected as factors; they represent characteristics of a diamond and have 6 and 5 levels respectively. We choose 'price' as the response variable. After selecting two categorical independent variables and a continuous response variable, we conduct the experiment for null hypothesis statistical testing on the non-blocked independent variable, colour. A randomized incomplete block design will be used to investigate the effect of colour on price. ANOVA with two factors will be used to analyze the effect of colour while blocking on clarity. Therefore, we consider only the main effect, without an interaction effect. We also use G*Power to determine the sample size given Alpha = 0.05, Beta = 0.05, Power = 1 - Beta = 0.95 and effect size = 0.03467091. In addition to Null Hypothesis Statistical Testing (NHST), we will also conduct alternative evaluations: resampling statistics and confidence intervals.

(SW) The European Community Household Panel conducted a study on hourly wages in Belgium in 1994. The Wages in Belgium dataset contains 1472 observations of individuals and includes information on hourly wages, education level, years of work experience, and sex. The two categorical independent variables (IVs) selected were sex and educ, which are the sex and education level of the individual. The continuous dependent variable (DV), or response variable, is wage, which is the hourly wage. This analysis uses the blocking variable educ, which is education level. This is done to remove noise that could come from wage differences between individuals with different levels of education. The impact of gender on hourly wages will be tested using multiple methods. First, ANOVA will be conducted using the null hypothesis that there is no significant difference between the wages of males and females. In addition, Resampling Statistics and the Plot Plus Error Bar Procedure are conducted as alternatives to NHST.
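The abstract does not describe how the error bars were computed. As one possible sketch, the Python code below derives a percentile bootstrap confidence interval for a group mean, which could serve as the error bar around each plotted mean. The percentile method, the function name `bootstrap_ci`, and the wage values are assumptions for illustration, not the recipe's actual procedure.

```python
import random
from statistics import mean


def bootstrap_ci(sample, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        mean(rng.choices(sample, k=len(sample))) for _ in range(n_boot)
    )
    lower_index = int((1 - level) / 2 * n_boot)
    upper_index = int((1 + level) / 2 * n_boot) - 1
    return boot_means[lower_index], boot_means[upper_index]


# Made-up hourly wages for two groups, for illustration only.
wages_by_group = {
    "female": [9.5, 10.2, 11.0, 8.7, 10.9, 9.9, 10.4, 11.3],
    "male": [10.8, 12.1, 11.5, 12.9, 10.2, 11.7, 12.4, 11.1],
}
for label, wages in wages_by_group.items():
    low, high = bootstrap_ci(wages)
    print(label, mean(wages), low, high)
```

Intervals for the two groups that do not overlap would point in the same direction as a significant ANOVA result, although overlapping intervals do not by themselves rule out a difference.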

(AZ) This study analyzes data of corporate profits by industry and year from 1929-1947. Data are from the Bureau of Economic Analysis and are presented in billions of dollars. In this experiment, the two factors (categorical independent variables) are industry and year, while the continuous dependent variable is corporate profits. Industry contains twelve different levels: Corporate Profits With Adjustments, Financial, Nonfinancial, Rest Of World, Federal Reserve Banks, Other, Manufacturing, Durable Goods, Nondurable Goods, Transportation And Utilities, and Wholesale Trade Retail Trade Automobile. Year contains 19 levels (each year from 1929-1947). This experiment utilizes a randomized block design. One variable, year, will be blocked, and the hypothesis will be tested on the second, industry. Boxplot exploratory analysis will first be conducted. ANOVA will be conducted along with alternatives to null hypothesis statistical testing: confidence intervals and resampling. The experiment will culminate with model validation.

(AV) This recipe uses a dataset from the Ecdat package pertaining to extramarital affairs. There are 601 observations gathered from a Yale University study conducted in 1977. Factors include gender, whether or not the person has children, religious beliefs, age, years of marriage, and a few others. The experiment examines two independent variables and utilizes blocking. G*Power is used to determine a sample size, and a Monte Carlo simulation is used for randomization and statistical resampling. Tukey's test was the other alternative to NHST. Cohen's d is also used to determine effect size.

(FO) Using the Affairs data, we conducted tests to see the effect that the marriage self "rating" had on the "number of affairs" while blocking on the separate factor "children." This study used ANOVA to test a hypothesis and then looked at the importance of investigating ANOVA results. Following this, alternatives to hypothesis testing were demonstrated, and analysis was done to determine whether the factor did in fact have an effect on the dependent variable.

(PD) The data used for this project was the VietNamH dataset in the Ecdat package of R. The data describes the total, medical and food expenditures of households in Vietnam. For the purposes of our analysis, the response variable we chose was 'Total Expenditure' and the two independent variables we chose were gender of head of the household (male, female) and whether the household is in urban or non-urban areas. As it turns out, there is a significant influence of whether the house is in urban or non-urban areas on the household expenditure. On the other hand, there doesn't seem to be any significant influence of the gender of head of the household.

(TE) This study analyzes cigarette sales data for two decades in 42 states of the US. We evaluate the effect of two factors, year and region on the sales volume measured as cigarette packs per capita. We estimate main effect using analysis of variance, as well as hypothesis testing. We validate the adequacy of our model and perform resampling via Monte Carlo Simulation to control for the normality assumption. The results show that there is a statistically significant effect of geography on cigarette sales.