The problem with applying the logistic linear equation to partial data sets
Even though I had abandoned it for the purposes of estimating changes in “a” on the decline side of the production curve, Hubbert’s logistic linear equation was quite good at predicting Q∞ and "a" on the growth side of the curve.
One general problem with using the linear equation, however, is that I don’t always have a good estimate of Qo. For instance, for the USA production data, I only have the production rates starting in 1949. I don’t want to assume that I know anything about the accumulation, Q, up to that year. I could probably find it for the USA (after all, Hubbert had data back to 1900), but this procedure has to be generally applicable to other countries as well. Moreover, such data is probably going to be hard to come by for other countries during, and prior to, WWII.
Not knowing Qo has important implications when using Hubbert’s logistic linear equation, because a poor estimate of Qo introduces a strong non-linearity into the linearized data set, thereby causing errors in the estimation of Q∞ and "a."
This is best illustrated using simulated data.
Let’s reconsider the simulated data presented in Figure 1 in Part 4 of this series, with “a” fixed to 0.0687 yr⁻¹ and Q∞ equal to 170 bbls. The linearization of this data set was presented in Figure 2 of Part 4 and is roughly linear throughout the entire selected time span, 1901 to 2010.
But now look at what happens to this same data set if we had only been given dQ/dt data starting in 1930:
For the plot shown in Figure 15, I have made no assumptions about Qo prior to 1930. This is the same as assuming that the value of Qo the year before the data starts is equal to zero. Assuming a zero value, however, introduces a very strongly curving function at low Q values. As a result, linear regression on the whole data set gives terrible results: “a” (the y-intercept) is grossly overestimated and Q∞ (the x-intercept) is underestimated; r² is only 0.3723.
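The effect is easy to reproduce outside of a spreadsheet. Below is a minimal Python sketch, not the spreadsheet used for Figure 15. It assumes an ideal logistic curve with the “a” and Q∞ quoted above and a peak year of 1956 (PEAK_YEAR is my assumption, chosen because the growth side is taken to end in 1956). The function names are hypothetical, and the printed numbers will not match the figure exactly.

```python
import numpy as np

# A and Q_INF are the values quoted in the text; PEAK_YEAR is an assumption.
A, Q_INF, PEAK_YEAR = 0.0687, 170.0, 1956


def cumulative(year):
    """Logistic cumulative production Q at the end of `year`."""
    t = np.asarray(year, dtype=float) - PEAK_YEAR
    return Q_INF / (1.0 + np.exp(-A * t))


def linear_fit(production, q0):
    """Regress (dQ/dt)/Q on Q, where Q is q0 plus the running sum of dQ/dt.

    Returns the y-intercept ("a"), the x-intercept (Q_inf) and r^2.
    """
    q = q0 + np.cumsum(production)
    y = production / q
    slope, intercept = np.polyfit(q, y, 1)
    r2 = np.corrcoef(q, y)[0, 1] ** 2
    return intercept, -intercept / slope, r2


years = np.arange(1930, 2011)
production = cumulative(years) - cumulative(years - 1)  # annual dQ/dt

# Case 1: nothing known before 1930, so Qo is implicitly zero.
a_zero, qinf_zero, r2_zero = linear_fit(production, q0=0.0)
# Case 2: the true accumulation at the end of 1929 is supplied.
a_true, qinf_true, r2_true = linear_fit(production, q0=float(cumulative(1929)))

print(f"Qo = 0    : a = {a_zero:.4f}, Q_inf = {qinf_zero:.1f}, r2 = {r2_zero:.4f}")
print(f"Qo = true : a = {a_true:.4f}, Q_inf = {qinf_true:.1f}, r2 = {r2_true:.4f}")
```

The only difference between the two calls is the starting value added to the running sum, which is the point of the comparison.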
Eventually, the plot becomes more linear as Q → Q∞. For instance, linear regression on the last 40 data points gives better estimates of “a” and Q∞, and r² is quite high. However, “a” is still overestimated, and Q∞ is still underestimated.
I want to emphasize that this is exactly the same simulated data set as presented in Figures 1 and 2 of Part 4; the only difference is that I have just considered the production data from 1930 onward (i.e., starting about 1/3 of the way up the growth side of the curve shown in Figure 1).
The strong curvature in the earlier years of the production curve (low Q) is due to the overestimation of (dQ/dt)/Q at low Q values because we are underestimating Qo. The problem is that we only have the values of dQ/dt starting in 1930 and, in the absence of additional information, the accumulation that occurred earlier than this year is unaccounted for. The improved linear behavior as Q → Q∞ is because the relative importance of assuming that Qo equals zero diminishes for increasing Q. Nevertheless, “a” and Q∞ are still systematically over- and underestimated, respectively.
It is not clear whether or not Hubbert appreciated that this was an inherent feature of his logistic linear equation. I suspect not, because he attributed the non-linearity at low Q values to scatter in the data:
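The origin of the curvature can be made explicit with a short derivation. This is a sketch that assumes the logistic linear equation has the form (dQ/dt)/Q = a(1 − Q/Q∞), which is consistent with the y- and x-intercepts being "a" and Q∞. Here Δ stands for the unaccounted accumulation and Q' = Q − Δ for the accumulation we can actually compute; both symbols are notation introduced for this illustration only.

```latex
\frac{1}{Q'}\frac{dQ}{dt}
  = a\left(1 - \frac{Q' + \Delta}{Q_\infty}\right)\frac{Q' + \Delta}{Q'}
  = a\left(1 - \frac{2\Delta}{Q_\infty}\right)
    - \frac{a}{Q_\infty}\,Q'
    + \frac{a\,\Delta}{Q'}\left(1 - \frac{\Delta}{Q_\infty}\right)
```

The first two terms describe a straight line. The last term, proportional to 1/Q', bends the plot upward at low Q and fades as Q grows. The whole expression reaches zero at Q' = Q∞ − Δ, so the x-intercept falls short of Q∞ by roughly the missing accumulation.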
The virtue of the first of these two equations lies in the fact that it depends only upon the plotting of primary data, (dQ/dt)/Q, versus Q, with no a priori assumptions whatever. Using actual data for Q and dQ/dt, it is to be expected that there will be a considerable scatter of the plotted points as Q → 0, because in that case both Q and dQ/dt are small quantities and even small irregularities of either quantity can produce a large variation in their ratio. ("Techniques", p. 52)
No doubt Hubbert is right that at low Q, scatter in the data will be amplified in (dQ/dt)/Q. But in the present case, my simulated data set has no scatter. Thus, analyzing the simulated data has helped reveal an inherent property of the logistic linear equation: the equation is linear only when you have an accurate estimate of Qo.
A solution—SOLVER to the rescue
It occurred to me that I can resolve this issue by once again using the SOLVER feature in EXCEL to help pick the value of Qo that makes the data, transformed according to the logistic linear equation, as linear as possible.
Here, the target cell for SOLVER holds the linear regression r² value which, in EXCEL, is represented by the formula RSQ(x-first cell:x-last cell, y-first cell:y-last cell). We want to maximize the value of r² by adding a number (representing Qo) to the column of Q values that are used in the linear regression analysis. SOLVER can be set up to do this automatically in a millisecond or so, but it could also be done manually by systematically trying different values of Qo until the maximum r² in the plot of (dQ/dt)/Q vs Q is found.
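For readers without EXCEL, the "try different Qo until r² is largest" idea can be sketched in Python with a plain grid search. This is not SOLVER's algorithm, only a stand-in for it. The simulated data assume an ideal logistic curve with a peak year of 1956 (PEAK_YEAR is my assumption), the function names are hypothetical, and the result need not reproduce the Qo reported for the figures.

```python
import numpy as np

# A and Q_INF are the values quoted in the text; PEAK_YEAR is an assumption.
A, Q_INF, PEAK_YEAR = 0.0687, 170.0, 1956


def cumulative(year):
    """Logistic cumulative production Q at the end of `year`."""
    t = np.asarray(year, dtype=float) - PEAK_YEAR
    return Q_INF / (1.0 + np.exp(-A * t))


def r_squared(production, q0):
    """r^2 of (dQ/dt)/Q versus Q for a trial starting accumulation q0."""
    q = q0 + np.cumsum(production)
    return np.corrcoef(q, production / q)[0, 1] ** 2


def best_q0(production, candidates):
    """Return the candidate Qo that gives the most linear plot."""
    scores = [r_squared(production, q0) for q0 in candidates]
    return float(candidates[int(np.argmax(scores))])


years = np.arange(1930, 2011)
production = cumulative(years) - cumulative(years - 1)  # annual dQ/dt

# Trial values of Qo (the 1929 entry), in steps of 0.01.
candidates = np.arange(0.0, 60.0, 0.01)
q0_fit = best_q0(production, candidates)

# Intercepts of the regression line at the chosen Qo.
q = q0_fit + np.cumsum(production)
slope, intercept = np.polyfit(q, production / q, 1)
a_fit, qinf_fit = intercept, -intercept / slope

print(f"Qo = {q0_fit:.2f}, a = {a_fit:.4f}, Q_inf = {qinf_fit:.1f}")
```

Changing `years` to `np.arange(1930, 1957)` restricts the same search to growth-side data only.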
If I do this for the same data set as shown in Figure 15, here are the results:
The only difference between the plot shown in Figure 15 and the plot shown in Figure 16 is that Qo=22.17 in the latter and Qo=0 in the former. For Figure 16, Qo=22.17, as selected by SOLVER, was added as a 1929 entry in the column of Q values to maximize the value of r², which reaches 1. Now the resulting y- and x-intercepts are pretty close to the true values of “a” and Q∞.
Moreover, if we perform the same analysis with the data set from the growth side of the production curve (1930 to 1956), we still get reasonably good estimates of “a” and Q∞ when SOLVER picks a Qo equal to 22.82:
1930 to 1956 data only
Based on these results, I have revised the procedure for analyzing the real production/consumption data:
1) Apply the logistic linear equation to analyze the growth side of the production curve (i.e., the data up to the peak production year), with SOLVER choosing the best-fit value of Qo that maximizes r² in the plot of (dQ/dt)/Q versus Q. The resulting y-intercept and x-intercept from linear regression analysis of this plot give the best estimates of "a" and Q∞ on the growth side of the curve. Again, these estimates of “a,” Q∞, and Qo are what I will refer to as the growth-side production parameters.
2) Use the growth-side value of "a" as the initial seed value for the spreadsheet column used to estimate dQ/dt for the first of the successive 5-year segments on the decline side of the production curve, and fix Q∞ (and, indirectly, Qo) to the value determined in step (1). The "a" estimated from the first 5-year segment is then used as the seed value for the next segment, and so on.
3) Run the NLLS analysis for the successive 5-year or longer segments to predict “a” and compare this to the average “a” for the same-year segment being analyzed to determine the % difference from this true value of “a.”
I already know that this will work for the simulated data sets described in the context of Figure 10 in Part 4, because the only difference from my previous analysis is in step (1): instead of using NLLS to determine Q∞ and Qo from the growth side of the production curve (recall that Q=20 to Q=50 was used, corresponding to 1934 to 1954), I am now using the logistic linear equation to analyze the growth side of the production curve.
If I do the analysis with the 1934 to 1954 data as described for the new step (1), I get the following results (SOLVER choosing Qo (1933) = 28.75):
1934 to 1954 data only
This is pretty close to the results obtained when the NLLS was performed on the same data set:
Q∞ = 170.4 bbls; Qo (1933) = 29.13 bbls; and “a” = 0.06857 yr⁻¹
To the extent that the Q∞ from the linear analysis overestimates the true value (i.e., 170 bbls, a percentage error of 2%), we know that "a" will tend to be underestimated and that the error worsens as Q → Q∞ (see, e.g., Figure 10 of Part 4), although the error is relatively small.
Okay, I think I am ready to retry that analysis of the USA production data.