Problem 5-1, Factorial designs

From Statistics

Problem Statement

The yield of a chemical process is being studied. The two most important variables are thought to be the pressure and the temperature. Three levels of each factor are selected, and a factorial experiment with two replicates is performed. The yield data follow:

                    Pressure (psig)
Temperature (°C)     200     215     230
150                 90.4    90.7    90.2
                    90.2    90.6    90.4
160                 90.1    90.5    89.9
                    90.3    90.6    90.1
170                 90.5    90.8    90.4
                    90.7    90.9    90.1
  1. Analyze the data and draw conclusions. Use α=0.05.
  2. Prepare appropriate residual plots and comment on the model's adequacy.
  3. Under what conditions would you operate this process?

Solution

Figure 1: This plot of yield vs. pressure, at the different levels of temperature, helps to visualize the different effects in our problem.


Figure 2: This plot illustrates the various effects on the data in problem 5-1. The hats (^) indicate that these are approximate effects calculated from our sample data.

We are given a set of data that describes the yield of a chemical process, but the yield is affected by two factors: temperature and pressure. At each combination of levels of factor A (temperature) and levels of factor B (pressure), data points have been collected (see chart in problem statement). Not only do these two factors influence the yield data directly, but they might also interact to affect the yield in an unexpected manner.

The average of the data points at a specific temperature and pressure in our problem is $\bar{y}_{ij.}$, the mean over the $n = 2$ replicates at the $i$th temperature and $j$th pressure. An interaction effect would cause the difference between two values in a column of our chart to vary as pressure is changed. For example, say we set the pressure to 200 psig and observe the average yield increase from 90 to 95 as temperature is increased from 150 °C to 160 °C, but when we set the pressure to 215 psig the yield drops from 90 to 85 as temperature is again increased from 150 °C to 160 °C. This would be an interaction effect, since the effect on the yield cannot be separated into independent temperature and pressure effects. Of course, what appears to be an interaction effect may just be large random error, so we'll have to check for that.

The plot shown in Figure 1 is very useful for visualizing the factor and interaction effects. We can see an apparent temperature effect that indicates 170 °C produces the highest yield. 215 psig also appears to be the best pressure if we want high yield, as the green line is higher than the others. Interaction effects in this type of plot are indicated by non-parallel lines, so there is not much apparent interaction here. Before we draw any conclusions, we must test that this plot provides a true image of the chemical process, and is not just a product of high random error.

Each data point in our problem can be thought of as the sum of several effects:

$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \epsilon_{ijk}$

Where:

$i = 1, 2, \ldots, a$ (levels of factor A)
$j = 1, 2, \ldots, b$ (levels of factor B)
$k = 1, 2, \ldots, n$ (replicates)

and:

$y_{ijk}$ = the $k$th data point from the $i$th level of factor A and the $j$th level of factor B
$\mu$ = the grand mean
$\tau_i$ = the effect of the $i$th level of factor A
$\beta_j$ = the effect of the $j$th level of factor B
$(\tau\beta)_{ij}$ = the effect of interaction between the $i$th level of factor A and the $j$th level of factor B
$\epsilon_{ijk}$ = a random error affecting the $k$th data point from the $i$th level of factor A and the $j$th level of factor B
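As an illustration (not part of the original solution), the effect estimates in this model can be computed directly from the problem's data; the sketch below uses only the Python standard library, with the data array transcribed from the problem statement:

```python
# Estimating the model effects from the problem's yield data (stdlib only).
from statistics import mean

# data[i][j] = the two replicates at the ith temperature (150, 160, 170 C)
# and the jth pressure (200, 215, 230 psig)
data = [
    [[90.4, 90.2], [90.7, 90.6], [90.2, 90.4]],
    [[90.1, 90.3], [90.5, 90.6], [89.9, 90.1]],
    [[90.5, 90.7], [90.8, 90.9], [90.4, 90.1]],
]

grand_mean = mean(y for row in data for cell in row for y in cell)  # mu-hat

row_means = [mean(y for cell in row for y in cell) for row in data]
col_means = [mean(y for row in data for y in row[j]) for j in range(3)]

tau_hat = [m - grand_mean for m in row_means]    # temperature effects
beta_hat = [m - grand_mean for m in col_means]   # pressure effects
# interaction estimates: cell mean minus the purely additive prediction
taubeta_hat = [[mean(data[i][j]) - grand_mean - tau_hat[i] - beta_hat[j]
                for j in range(3)] for i in range(3)]

print(round(grand_mean, 3))  # 90.411
```

Note that the estimated effects in each family sum to zero by construction, which matches the constraints usually placed on this model.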

Figure 2 illustrates how these effects sum up to each data point. The blue line indicates the grand mean $\mu$, the red X indicates the sample mean of the $i$th level of temperature, $\bar{y}_{i..}$, the green marker indicates the $j$th pressure level effect, the brown triangle indicates the interaction effect, and the black marker indicates the data point from the $j$th level of pressure. The projection of each line onto the x-axis tells you quantitatively what its estimated effect is, and each point can be thought of as where the data would be if the effects closer to the real data point were not present.

Hypothesis Testing

We would like to know if any of our estimated effects are significant. Therefore, we have three tests to do, one for each type of effect. For factor A, our null hypothesis will be that all the treatment means are equal (they all come from the same distribution). Our alternative hypothesis is that they are not all equal:

$H_0\!: \tau_1 = \tau_2 = \cdots = \tau_a = 0$
$H_1\!: \text{at least one } \tau_i \ne 0$

Similarly for factor B and the interaction effect:

$H_0\!: \beta_1 = \beta_2 = \cdots = \beta_b = 0$
$H_1\!: \text{at least one } \beta_j \ne 0$

$H_0\!: (\tau\beta)_{ij} = 0$ (for all $i$, $j$)
$H_1\!: (\tau\beta)_{ij} \ne 0$ (for at least one combination of $i$ and $j$)

To test these hypotheses, we are interested in the following sums of squares. Each of the following equations estimates the effect on our data due to the subscripted factor ($A$ → factor A, $B$ → factor B, $AB$ → interaction, and $E$ → random error).

$SS_A = bn \sum_{i=1}^{a} (\bar{y}_{i..} - \bar{y}_{...})^2$

$SS_B = an \sum_{j=1}^{b} (\bar{y}_{.j.} - \bar{y}_{...})^2$

$SS_{AB} = n \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2$

$SS_E = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij.})^2$
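These sums of squares can be evaluated numerically; the sketch below (stdlib Python, with the data transcribed from the problem statement) reproduces, to rounding, the values in the ANOVA table later in this section:

```python
# Computing the four sums of squares for the two-factor factorial (stdlib only).
from statistics import mean

data = [  # data[i][j] = the two replicates at temp level i, pressure level j
    [[90.4, 90.2], [90.7, 90.6], [90.2, 90.4]],  # 150 C
    [[90.1, 90.3], [90.5, 90.6], [89.9, 90.1]],  # 160 C
    [[90.5, 90.7], [90.8, 90.9], [90.4, 90.1]],  # 170 C
]
a, b, n = 3, 3, 2

grand = mean(y for row in data for cell in row for y in cell)
row = [mean(y for cell in r for y in cell) for r in data]
col = [mean(y for r in data for y in r[j]) for j in range(b)]
cell = [[mean(data[i][j]) for j in range(b)] for i in range(a)]

SS_A = b * n * sum((m - grand) ** 2 for m in row)
SS_B = a * n * sum((m - grand) ** 2 for m in col)
SS_AB = n * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                for i in range(a) for j in range(b))
SS_E = sum((y - cell[i][j]) ** 2
           for i in range(a) for j in range(b) for y in data[i][j])

print(round(SS_A, 3), round(SS_B, 3), round(SS_AB, 3), round(SS_E, 3))
# 0.301 0.768 0.069 0.16
```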
Before we can compare these quantities to each other, we must divide by the degrees of freedom to normalize. We calculate the mean squares with:

$MS_A = \dfrac{SS_A}{a-1}$

$MS_B = \dfrac{SS_B}{b-1}$

$MS_{AB} = \dfrac{SS_{AB}}{(a-1)(b-1)}$

$MS_E = \dfrac{SS_E}{ab(n-1)}$
We test whether our factor and interaction effects are significant by dividing their mean squares by the mean square of our random error to get $F_0 = MS_{\text{effect}}/MS_E$. This value must be large for us to conclude that the effect we're testing is significant (the effect must be large relative to the random error). We pick a critical value from the F distribution based on $\alpha$ and our degrees of freedom – this determines what "large" is. Do this in Excel with FINV(α, numerator's DOF, denominator's DOF) and in R with qf(1-alpha, numerator's DOF, denominator's DOF) (DOF = degrees of freedom).

Source of Variation      Sum of Squares   Degrees of Freedom   Mean Square   F_0      F_0.05
Factor A (Temperature)   0.301            2                    0.151         8.469    4.256
Factor B (Pressure)      0.768            2                    0.384         21.594   4.256
Interaction              0.069            4                    0.017         0.969    3.633
Error                    0.160            9                    0.018
Total                    1.298            17
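The F tests above can be sketched directly from the table's sums of squares and degrees of freedom. In the sketch below the critical values are copied from the table, since the Python standard library has no F-distribution quantile function (Excel's FINV or R's qf would supply them):

```python
# Reproducing the F tests from the ANOVA table (stdlib only).
SS = {"A": 0.301, "B": 0.768, "AB": 0.069, "E": 0.160}  # sums of squares
df = {"A": 2, "B": 2, "AB": 4, "E": 9}                  # degrees of freedom

MS = {k: SS[k] / df[k] for k in SS}                     # mean squares
F0 = {k: MS[k] / MS["E"] for k in ("A", "B", "AB")}     # test statistics
Fcrit = {"A": 4.256, "B": 4.256, "AB": 3.633}           # F(0.05; df_k, 9)

for k in F0:
    # an effect is significant at alpha = 0.05 if F0 exceeds Fcrit
    print(k, round(F0[k], 2), F0[k] > Fcrit[k])
```

Running this flags factors A and B as significant and the interaction as not, matching the conclusions below.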

We conclude from these tests that the effects from factors A and B (temperature and pressure) are significant, and there is no significant interaction effect. However, these tests are only accurate if the model is correct.

Residual Plots and Model Adequacy Testing

Figure 3: These plots indicate that our data is sufficiently gaussian and we can believe the results of our hypothesis testing.

To test whether our data is sufficiently gaussian that our hypothesis testing is valid, we create a normal probability plot. As before, we calculate the residuals (by subtracting the fitted cell means from the data points, $e_{ijk} = y_{ijk} - \bar{y}_{ij.}$) and sort them in an array. Then we plot the sorted residuals against z-values, where the z-values are calculated in Excel with NORMSINV(percent) or in R with qnorm(percent). Percent is equal to index_of_array/(DOF + 1). If the plot does not fall roughly along a straight line, the data is not from a gaussian distribution.
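A minimal sketch of these plot coordinates, using the Python standard library's NormalDist.inv_cdf in place of NORMSINV or qnorm, and plotting positions k/(N+1) with N = 18 residuals (data transcribed from the problem statement):

```python
# Normal-probability-plot coordinates for the residuals (stdlib only).
from statistics import NormalDist, mean

data = [  # the two replicates at each (temperature, pressure) combination
    [[90.4, 90.2], [90.7, 90.6], [90.2, 90.4]],
    [[90.1, 90.3], [90.5, 90.6], [89.9, 90.1]],
    [[90.5, 90.7], [90.8, 90.9], [90.4, 90.1]],
]

# residual = data point minus its cell mean (the fitted value)
residuals = sorted(y - mean(data[i][j])
                   for i in range(3) for j in range(3) for y in data[i][j])

N = len(residuals)
z = [NormalDist().inv_cdf(k / (N + 1)) for k in range(1, N + 1)]

# plot residuals (x) against z (y); a roughly straight line suggests
# the errors are close to gaussian
for r, zv in zip(residuals, z):
    print(round(r, 3), round(zv, 2))
```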

There are several other useful plots we can make with residuals. Plotting the fitted values $\bar{y}_{ij.}$ against the residuals produces a plot that should have data randomly scattered throughout its entire area; if not, the data may not be gaussian. Plotting the residuals against either of the factors should indicate that the data is more or less equally random at each level of the factors. Note that because each cell has only $n = 2$ replicates, the two residuals in a cell are equal and opposite ($e_{ij1} = -e_{ij2}$), so there is some symmetry present in each of these plots.

Given that our data fits the model reasonably well, and that each of the factors is significant, the optimum conditions for producing high yield are a temperature of 170 °C and a pressure of 215 psig.