Yes, you could typeset a scientific paper in the Playfair 2·0 typeface family
You are looking at one. This document demonstrates how Playfair 2·0 can be employed to typeset a formal scientific paper: in this case, a paper first published in 1960 in the American journal ‘Psychological Bulletin’.
Though several serious objections to the null-hypothesis significance test method are raised, “its most basic error lies in mistaking the aim of a scientific investigation to be a decision, rather than a cognitive evaluation… . It is further argued that the proper application of statistics to scientific inference is irrevocably committed to extensive consideration of inverse probabilities, and to further this end, certain suggestions are offered.”
Psychological Bulletin, 1960, Vol. 57, № 5, pp. 416–428.
The theory of probability and statistical inference is various things to various people. To the mathematician, it is an intricate formal calculus, to be explored and developed with little professional concern for any empirical significance that might attach to the terms and propositions involved. To the philosopher, it is an embarrassing mystery whose justification and conceptual clarification have remained stubbornly refractory to philosophical insight. (A famous philosophical epigram has it that induction [ a special case of statistical inference ] is the glory of science and the scandal of philosophy.) To the experimental scientist, however, statistical inference is a research instrument, a processing device by which unwieldy masses of raw data may be refined into a product more suitable for assimilation into the corpus of science, and in this lies both strength and weakness. It is strength in that, as an ultimate consumer of statistical methods, the experimentalist is in a position to demand that the techniques made available to him conform to his actual needs. But it is also weakness in that, in his need for the tools constructed by a highly technical formal discipline, the experimentalist, who has specialized along other lines, seldom feels competent to extend criticisms or even comments; he is much more likely to make unquestioning application of procedures learned more or less by rote from persons assumed to be more knowledgeable of statistics than he. There is, of course, nothing surprising or reprehensible about this — one need not understand the principles of a complicated tool in order to make effective use of it, and the research scientist can no more be expected to have sophistication in the theory of statistical inference than he can be held responsible for the principles of the computers, signal generators, timers, and other complex modern instruments to which he may have recourse during an experiment. Nonetheless, this leaves him particularly vulnerable to misinterpretation of his aims by those who build his instruments, not to mention the ever present dangers of selecting an inappropriate or outmoded tool for the job at hand, misusing the proper tool, or improvising a tool of unknown adequacy to meet a problem not conforming to the simple theoretical situations in terms of which existent instruments have been analyzed. Further, since behaviors once exercised tend to crystallize into habits and eventually traditions, it should come as no surprise to find that the tribal rituals for data-processing passed along in graduate courses in experimental method should contain elements justified more by custom than by reason.
In this paper, I wish to examine a dogma of inferential procedure which, for psychologists at least, has attained the status of a religious conviction. The dogma to be scrutinized is the “null-hypothesis significance test” orthodoxy that passing statistical judgment on a scientific hypothesis by means of experimental observation is a decision procedure wherein one rejects or accepts a null hypothesis according to whether or not the value of a sample statistic yielded by an experiment falls within a certain predetermined “rejection region” of its possible values. The thesis to be advanced is that despite the awesome pre-eminence this method has attained in our experimental journals and textbooks of applied statistics, it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research. This is not a particularly original view — traditional null-hypothesis procedure has already been superseded in modern statistical theory by a variety of more satisfactory inferential techniques. But the perceptual defenses of psychologists are particularly efficient when dealing with matters of methodology, and so the statistical folkways of a more primitive past continue to dominate the local scene.
To examine the method in question in greater detail, and expose some of the discomfitures to which it gives rise, let us begin with a hypothetical case study:
Suppose that according to the theory of behavior, T₀, held by most right-minded, respectable behaviorists, the extent to which a certain behavioral manipulation M facilitates learning in a certain complex learning situation C should be null. That is, if “f” designates the degree to which manipulation M facilitates the acquisition of habit H under circumstances C, it follows from the orthodox theory T₀ that f = 0. Also suppose, however, that a few radicals have persistently advocated an alternative theory T₁ which entails, among other things, that the facilitation of H by M in circumstances C should be appreciably greater than zero, the precise extent being dependent upon the values of certain parameters in C. Finally, suppose that Igor Hopewell, graduate student in psychology, has staked his dissertation hopes on an experimental test of T₀ against T₁ on the basis of their differential predictions about the value of f.
Now, if Hopewell is to carry out his assessment of the comparative merits of T₀ and T₁ in this way, there is nothing for him to do but submit a number of Ss to manipulation M under circumstances C and compare their efficiency at acquiring habit H with that of comparable Ss who, under circumstances C, have not been exposed to manipulation M. The difference, Δ, between experimental and control Ss in average learning efficiency may then be taken as an operational measure of the degree, f, to which M influences acquisition of H in circumstances C. Unfortunately, however, as any experienced researcher knows to his sorrow, the interpretation of such an observed statistic is not quite so simple as that. For the observed dependent variable Δ, which is actually a performance measure, is a function not only of the extent to which M influences acquisition of H, but of many additional major and minor factors as well. Some of these, such as deprivations, species, age, laboratory conditions, etc., can be removed from consideration by holding them essentially constant. Others, however, are not so easily controlled, especially those customarily subsumed under the headings of “individual differences” and “errors of measurement.” To curtail a long mathematical story, it turns out that with suitable (possibly justified) assumptions about the distributions of values for these uncontrolled variables, the manner in which they influence the dependent variable, and the way in which experimental and control Ss were selected and manipulated, the observed sample statistic Δ may be regarded as the value of a normally distributed random variate whose average value is f and whose variance, which is independent of f, is unbiasedly estimated by the square of another sample statistic, s, computed from the data of the experiment.1
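The model can be made concrete with a short simulation, offered here as a sketch in Python; the effect size, standard deviation, and group size are hypothetical and serve only to illustrate the assumptions just described:

```python
# A minimal simulation of the statistical model just described; the effect
# size, SD, and group size are hypothetical, not taken from the paper.
import numpy as np

rng = np.random.default_rng(seed=1960)

f = 0.0        # true facilitation of H by M; T0 says it is null
sigma = 12.0   # SD of individual learning scores (uncontrolled variation)
n = 11         # Ss per group, giving df = 2(n - 1) = 20 as in the text below

experimental = rng.normal(loc=f, scale=sigma, size=n)    # exposed to M
control = rng.normal(loc=0.0, scale=sigma, size=n)       # not exposed to M

delta = experimental.mean() - control.mean()   # observed Δ; estimates f

# s**2 unbiasedly estimates Var(Δ) = 2 * sigma**2 / n, independently of f.
pooled_var = (experimental.var(ddof=1) + control.var(ddof=1)) / 2
s = np.sqrt(pooled_var * 2 / n)

print(f"Δ = {delta:.2f}, s = {s:.2f}")
```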
The import of these statistical considerations for Hopewell’s dissertation, of course, is that he will not be permitted to reason in any simple way from the observed Δ to a conclusion about the comparative merits of T₀ and T₁. To conclude that T₀, rather than T₁, is correct, he must argue that f = 0, rather than f > 0. But the observed Δ, whatever its value, is logically compatible both with the hypothesis that f = 0 and the hypothesis that f > 0. How, then, can Hopewell use his data to make a comparison of T₀ and T₁? As a well-trained student, what he does, of course, is to divide Δ by s to obtain what, under H₀, is a t statistic, consult a table of the t distributions under the appropriate degrees-of-freedom, and announce his experiment as disconfirming or supporting T₀, respectively, according to whether or not the discrepancy between Δ and the zero value expected under T₀ is “statistically significant” — i. e., whether or not the observed value of Δ/s falls outside of the interval between two extreme percentiles (usually the 2.5th and 97.5th) of the t distribution with that df. If asked by his dissertation committee to justify this behavior, Hopewell would rationalize something like the following (the more honest reply, that this is what he has been taught to do, not being considered appropriate to such occasions):
In deciding whether or not T₀ is correct, I can make two types of mistakes: I can reject T₀ when it is in fact correct [ Type I error ], or I can accept T₀ when in fact it is false [ Type II error ]. As a scientist, I have a professional obligation to be cautious, but a 5% chance of error is not unduly risky. Now if all my statistical background assumptions are correct, then, if it is really true that f = 0 as T₀ says, there is only one chance in 20 that my observed statistic Δ/s will be smaller than t.025 or larger than t.975, where by the latter I mean, respectively, the 2.5th and 97.5th percentiles of the t distribution with the same degrees-of-freedom as in my experiment. Therefore, if I reject T₀ when Δ/s is smaller than t.025 or larger than t.975, and accept T₀ otherwise, there is only a 5% chance that I will reject T₀ incorrectly.
If asked about his Type II error, and why he did not choose some other rejection region, say between t.475 and t.525, which would yield the same probability of Type I error, Hopewell should reply that although he has no way to compute his probability of Type II error under the assumptions traditionally authorized by null-hypothesis procedure, it is presumably minimized by taking the rejection region at the extremes of the t distribution.
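Hopewell’s reply can be checked numerically. The sketch below (Python, scipy assumed; the noncentrality of 2.0 standing in for the alternative hypothesis is purely illustrative) compares the two rejection regions, each carrying the same 5% Type I error:

```python
# Comparing Type II error for two 5% rejection regions (scipy assumed);
# the noncentrality nc = 2.0 standing in for T1 is purely illustrative.
from scipy import stats

df, nc = 20, 2.0

# Extreme-tail region: reject when Δ/s < t.025 or Δ/s > t.975.
t_crit = stats.t.ppf(0.975, df)
power_tails = stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

# Middle region: reject when t.475 < Δ/s < t.525 (also 5% under H0).
lo, hi = stats.t.ppf(0.475, df), stats.t.ppf(0.525, df)
power_mid = stats.nct.cdf(hi, df, nc) - stats.nct.cdf(lo, df, nc)

# Type II error is 1 - power; the extreme-tail region leaves far less of it.
print(f"power (tails): {power_tails:.3f}, power (middle): {power_mid:.3f}")
```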
Let us suppose that for Hopewell’s data, Δ = 8.50, s = 5.00, and df = 20. Then t.975 = 2.09 and the acceptance region for the null hypothesis f = 0 is −2.09 < Δ/s < 2.09, or −10.45 < Δ < 10.45. Since Δ does fall within this region, standard null-hypothesis decision procedure, which I shall henceforth abbreviate “NHD,” dictates that the experiment is to be reported as supporting theory T₀. (Although many persons would like to conceive NHD testing to authorize only rejection of the hypothesis, not, in addition, its acceptance when the test statistic fails to fall in the rejection region, if failure to reject were not taken as grounds for acceptance, then NHD procedure would involve no Type II error, and no justification would be given for taking the rejection region at the extremes of the distribution, rather than in its middle.) But even as Hopewell reaffirms T₀ in his dissertation, he begins to feel uneasy. In fact, several disquieting thoughts occur to him.
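Parenthetically, Hopewell’s arithmetic reduces to a few lines (again a Python sketch with scipy assumed; the values are those given above):

```python
# Reproducing the NHD computation for Hopewell's data (scipy assumed).
from scipy import stats

delta, s, df = 8.50, 5.00, 20

t_obs = delta / s                    # observed t statistic: 1.70
t_crit = stats.t.ppf(0.975, df)      # t.975 with 20 df: about 2.09

# Acceptance region: -2.09 < Δ/s < 2.09, i.e. -10.45 < Δ < 10.45.
print(f"t = {t_obs:.2f}, t.975 = {t_crit:.2f}, reject H0: {abs(t_obs) > t_crit}")
# Δ falls inside the region, so NHD procedure reports support for T0.
```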
Despite his inexperience, Igor Hopewell is a sound experimentalist at heart, and the more he reflects on these statistics, the more dissatisfied with his conclusions he becomes. So while the exigencies of graduate circumstances and publication requirements urge that his dissertation be written as a confirmation of T₀, he nonetheless resolves to keep an open mind on the issue, even carrying out further research if opportunity permits. And reading his experimental report, so of course would we — has any responsible scientist ever made up his mind about such a matter on the basis of a single experiment? Yet in this obvious way we reveal how little our actual inferential behavior corresponds to the statistical procedure to which we pay lip-service. For if we did, in fact, accept or reject the null hypothesis according to whether the sample statistic falls in the acceptance or in the rejection region, then there would be no replications of experimental designs, no multiplicity of experimental approaches to an important hypothesis — a single experiment would, by definition of the method, make up our mind about the hypothesis in question. And the fact that in actual practice, a single finding seldom even tempts us to such closure of judgment reveals how little the conventional model of hypothesis testing fits our actual evaluative behavior.
By now, it should be obvious that something is radically amiss with the traditional NHD assessment of an experiment’s theoretical import. Actually, one does not have to look far in order to find the trouble — it is simply a basic misconception about the purpose of a scientific experiment. The null-hypothesis significance test treats acceptance or rejection of a hypothesis as though these were decisions one makes on the basis of the experimental data — i. e., that we elect to adopt one belief, rather than another, as a result of an experimental outcome. But the primary aim of a scientific experiment is not to precipitate decisions, but to make an appropriate adjustment in the degree to which one accepts, or believes, the hypothesis or hypotheses being tested. And even if the purpose of the experiment were to reach a decision, it could not be a decision to accept or reject the hypothesis, for decisions are voluntary commitments to action — i. e., are motor sets — whereas acceptance or rejection of a hypothesis is a cognitive state which may provide the basis for rational decisions, but is not itself arrived at by such a decision (except perhaps indirectly in that a decision may initiate further experiences which influence the belief).
The situation, in other words, is as follows: As scientists, it is our professional obligation to reason from available data to explanations and generalities — i. e., beliefs — which are supported by these data. But belief in (i. e., acceptance of) a proposition is not an all-or-none affair; rather, it is a matter of degree, and the extent to which a person believes or accepts a proposition translates pragmatically into the extent to which he is willing to commit himself to the behavioral adjustments prescribed for him by the meaning of that proposition. For example, if that inveterate gambler, Unfortunate Q. Smith, has complete confidence that War Biscuit will win the fifth race at Belmont, he will be willing to accept any odds to place a bet on War Biscuit to win; for if he is absolutely certain that War Biscuit will win, then odds are irrelevant — it is simply a matter of arranging to collect some winnings after the race. On the other hand, the more that Smith has doubts about War Biscuit’s prospects, the higher the odds he will demand before betting. That is, the extent to which Smith accepts or rejects the hypothesis that War Biscuit will win the fifth at Belmont is an important determinant of his betting decisions for that race.
Now, although a scientist’s data supply evidence for the conclusions he draws from them, only in the unlikely case where the conclusions are logically deducible from or logically incompatible with the data do the data warrant that the conclusions be entirely accepted or rejected. Thus, e. g., the fact that War Biscuit has won all 16 of his previous starts is strong evidence in favor of his winning the fifth at Belmont, but by no means warrants the unreserved acceptance of this hypothesis. More generally, the data available confer upon the conclusions a certain appropriate degree of belief, and it is the inferential task of the scientist to pass from the data of his experiment to whatever extent of belief these and other available information justify in the hypothesis under investigation. In particular, the proper inferential procedure is not (except in the deductive case) a matter of deciding to accept (without qualification) or reject (without qualification) the hypothesis: even if adoption of a belief were a matter of voluntary action — which it is not — neither such extremes of belief nor disbelief are appropriate to the data at hand. As an example of the disastrous consequences of an inferential procedure which yields only two judgment values, acceptance and rejection, consider how sad the plight of Smith would be if, whenever weighing the prospects for a given race, he always worked himself into either supreme confidence or utter disbelief that a certain horse will win. Smith would rapidly impoverish himself by accepting excessively low odds on horses he is certain will win, and failing to accept highly favorable odds on horses he is sure will lose. In fact, Smith’s two judgment values need not be extreme acceptance and rejection in order for his inferential procedure to be maladaptive. All that is required is that the degree of belief arrived at be in general inappropriate to the likelihood conferred on the hypothesis by the data.
Now, the notion of “degree of belief appropriate to the data at hand” has an unpleasantly vague, subjective feel about it which makes it unpalatable for inclusion in a formalized theory of inference. Fortunately, a little reflection about this phrase reveals it to be intimately connected with another concept relating conclusion to evidence which, though likewise in serious need of conceptual clarification, has the virtues both of intellectual respectability and statistical familiarity. I refer, of course, to the likelihood, or probability, conferred upon a hypothesis by available evidence. Why should not Smith feel certain, in view of the data available, that War Biscuit will win the fifth at Belmont? Because it is not certain that War Biscuit will win. More generally, what determines how strongly we should accept or reject a proposition is the probability given to this hypothesis by the information at hand. For while our voluntary actions (i. e., decisions) are determined by our intensities of belief in the relevant propositions, not by their actual probabilities, expected utility is maximized when the cognitive weights given to potential but not yet known-for-certain pay-off events are represented in the decision procedure by the probabilities of these events. We may thus relinquish the concept of “appropriate degree of belief” in favor of “probability of the hypothesis,” and our earlier contention about the nature of data-processing may be rephrased to say that the proper inferential task of the experimental scientist is not a simple acceptance or rejection of the tested hypothesis, but determination of the probability conferred upon it by the experimental outcome. This likelihood of the hypothesis relative to whatever data are available at the moment will be an important determinant for decisions which must currently be made, but is not itself such a decision and is entirely subject to revision in the light of additional information.
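The point about probability-weighted decision can be made concrete with a toy computation (Python; Smith’s degree of belief and the odds offered are hypothetical figures, not drawn from the paper):

```python
# Expected net gain of a unit bet at 'odds'-to-1; a toy illustration of
# weighting pay-off events by their probabilities (figures hypothetical).
def expected_gain(p_win: float, odds: float) -> float:
    return p_win * odds - (1.0 - p_win)

p = 0.80  # Smith's (hypothetical) probability that War Biscuit wins

for odds in (0.20, 0.25, 0.50):
    print(f"odds {odds}:1 -> expected gain {expected_gain(p, odds):+.3f}")

# Gain is positive exactly when odds exceed (1 - p) / p = 0.25: a bettor
# whose belief-weights equal the probabilities accepts just those bets
# whose expected gain is positive, as the passage above claims.
```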
In brief, what is being argued is that the scientist, whose task is not to prescribe actions but to establish rational beliefs upon which to base them, is fundamentally and inescapably committed to an explicit concern with the problem of inverse probability. What he wants to know is how plausible are his hypotheses, and he is interested in the probability ascribed by a hypothesis to an observed experimental outcome only to the extent he is able to reason backwards to the likelihood of the hypothesis, given this outcome. Put crudely, no matter how improbable an observation may be under the hypothesis (and when there are an infinite number of possible outcomes, the probability of any particular one of these is, usually, infinitely small — the familiar p value for an observed statistic under a hypothesis H is not actually the probability of that outcome under H, but a partial integral of the probability-density function of possible outcomes under H), it is still confirmatory (or at least nondisconfirmatory, if one argues from the data to rejection of the background assumptions) so long as the likelihood of the observation is even smaller under the alternative hypotheses. To be sure, the theory of hypothesis-likelihood and inverse probability is as yet far from the level of development at which it can furnish the research scientist with inferential tools he can apply mechanically to obtain a definite likelihood estimate. But to the extent a statistical method does not at least move in the direction of computing the probability of the hypothesis, given the observation, that method is not truly a method of inference, and is unsuited for the scientist’s cognitive ends.
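The parenthetical point about p values bears a short numerical illustration (a Python sketch, scipy assumed; the observed t of 1.70 is carried over from Hopewell’s example, and the alternative’s noncentrality of 2.0 is hypothetical):

```python
# Distinguishing the p value (a partial integral of the sampling density
# under H0) from the density of the observed outcome itself (scipy assumed).
from scipy import stats

t_obs, df = 1.70, 20

density_h0 = stats.t.pdf(t_obs, df)        # a density, not a probability
p_value = 2 * stats.t.sf(abs(t_obs), df)   # partial integral over the tails

# However small the p value, the observation still favors H0 whenever its
# density under the alternative (here a hypothetical noncentral t) is smaller.
density_h1 = stats.nct.pdf(t_obs, df, nc=2.0)

print(f"p = {p_value:.3f}, density under H0 = {density_h0:.3f}, "
      f"under H1 = {density_h1:.3f}")
```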
The preceding arguments have, in one form or another, raised several doubts about the appropriateness of conventional significance-test decision procedure for the aims it is supposed to achieve. It is now time to bring these charges together in an explicit bill of indictment.
So far, my arguments have tended to be aggressively critical — one can hardly avoid polemics when butchering sacred cows. But my purpose is not just to be contentious, but to help clear the way for more realistic techniques of data assessment, and the time has now arrived for some constructive suggestions. Little of what follows pretends to any originality; I merely urge that ongoing developments along these lines should receive maximal encouragement.
For the statistical theoretician, the following problems would seem to be eminently worthy of research:
By Bayes’ theorem,

Pr (H₀, Δ) / Pr (H₁, Δ) = [ Pr (H₀) / Pr (H₁) ] × [ Pr (Δ, H₀) / Pr (Δ, H₁) ],   [ 1 ]²

where Pr (H, Δ) is the probability of hypothesis H given the data Δ, and Pr (Δ, H) the probability of the data Δ under H.
Therefore, if the experimental report includes the probability (or probability density) of the data under H₀ and H₁, respectively, and its reader can quantify his feelings about the relative pre-experimental merits of H₀ and H₁ (i. e., Pr (H₀) / Pr (H₁)), he can then determine the judgment he should make about the relative merits of H₀ and H₁ in light of these new data.
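A sketch of such a computation, carried through for Hopewell’s data (Python, scipy assumed; the pre-experimental odds and the alternative’s noncentrality are hypothetical stand-ins for what the reader must supply):

```python
# Applying Formula 1 with probability densities, per Footnote 2 (scipy
# assumed; prior odds and the noncentrality under H1 are hypothetical).
from scipy import stats

t_obs, df = 8.50 / 5.00, 20            # Hopewell's Δ/s and its df

density_h0 = stats.t.pdf(t_obs, df)             # density of Δ/s under H0
density_h1 = stats.nct.pdf(t_obs, df, nc=2.0)   # and under an illustrative H1

prior_odds = 1.0   # reader's Pr(H0) / Pr(H1); even odds assumed here
posterior_odds = prior_odds * density_h0 / density_h1

print(f"posterior odds Pr(H0, Δ) / Pr(H1, Δ) = {posterior_odds:.2f}")
# Here the odds fall below 1: these data should shift belief toward H1,
# even though the NHD test above reported "support" for T0.
```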
The traditional null-hypothesis significance-test method, more appropriately called “null-hypothesis decision [ NHD ] procedure,” of statistical analysis is here vigorously excoriated for its inappropriateness as a method of inference. While a number of serious objections to the method are raised, its most basic error lies in mistaking the aim of a scientific investigation to be a decision, rather than a cognitive evaluation of propositions. It is further argued that the proper application of statistics to scientific inference is irrevocably committed to extensive consideration of inverse probabilities, and to further this end, certain suggestions are offered, both for the development of statistical theory and for more illuminating application of statistical analysis to empirical data.
( Received June 30, 1959 )
s is here the estimate of the standard error of the difference in means, not the estimate of the individual SD.
When the numbers of alternative hypotheses and possible experimental outcomes are transfinite, Pr (Δ, H) = Pr (H, Δ) = Pr (H) = 0 in most cases. If so, the probability ratios in Formula 1 are replaced with the corresponding probabilistic-density ratios. It should be mentioned that this formula rather idealistically presupposes there to be no doubt about the correctness of the background statistical assumptions.
Technical note: One of the more important problems now confronting theoretical statistics is exploration and clarification of the relationships among inverse probabilities derived from confidence-interval theory, fiducial-probability theory (a special case of the former in which the estimator is a sufficient statistic), and classical (i. e., Bayes’) inverse-probability theory. While the interpretation of confidence intervals is tricky, it would be a mistake to conclude, as the cautionary remarks usually accompanying discussions of confidence intervals sometimes seem to imply, that the confidence-level α of a given confidence interval I should not really be construed as a probability that the true hypothesis, H, belongs to the set I. Nonetheless, if I is an α-level confidence interval, the probability that H belongs to I as computed by Bayes’ theorem given an a priori probability distribution will, in general, not be equal to α, nor is the difference necessarily a small one — it is easy to construct examples where the a posteriori probability that H belongs to I is either 0 or 1. Obviously, when different techniques for computing the probability that H belongs to I yield such different answers, a reconciliation is demanded. In this instance, however, the apparent disagreement is largely if not entirely spurious, resulting from differences in the evidence relative to which the probability that H belongs to I is computed. And if this is, in fact, the correct explanation, then fiducial probability furnishes a partial solution to an outstanding difficulty in the Bayes’ approach. A major weakness of the latter has always been the problem of what to assume for the a priori distribution when no pre-experimental information is available other than that supporting the background assumptions which delimit the set of hypotheses under consideration. The traditional assumption (made hesitantly by Bayes, less hesitantly by his successors) has been the “principle of insufficient reason,” namely, that given no knowledge at all, all alternatives are equally likely. But not only is it difficult to give a convincing argument for this assumption, it does not even yield a unique a priori probability distribution over a continuum of alternative hypotheses, since there are many ways to express such a continuous set, and what is an equilikelihood a priori distribution under one of these does not necessarily transform into the same under another. Now, a fiducial probability distribution determined over a set of alternative hypotheses by an experimental observation is a measure of the likelihoods of these hypotheses relative to all the information contained in the experimental data, but based on no pre-experimental information beyond the background assumptions restricting the possibilities of this particular set of hypotheses. Therefore, it seems reasonable to postulate that the no-knowledge a priori distribution in classical inverse probability theory should be that distribution which, when experimental data capable of yielding a fiducial argument are now given, results in an a posteriori distribution identical with the corresponding fiducial distribution.
Braithwaite, R. B. (1953). Scientific explanation. Cambridge, England: Cambridge University Press.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416–428. https://doi.org/10.1037/h0042040