The Core Definition of the Thurstone Scale
The Thurstone scale is a landmark in psychometrics: it was the first formal method developed specifically for measuring human attitudes. Developed in 1928 by Louis Leon Thurstone, the technique quantifies how favorable or unfavorable an individual is toward a particular psychological object, concept, or social issue. Unlike simple agree/disagree methods, the Thurstone scale operates on the principle of equal-appearing intervals: statements are calibrated so that the psychological distance between adjacent points on the scale is perceived as equivalent by the population being studied. To the extent that this calibration succeeds, researchers can treat the resulting attitude scores as measurements on an interval scale, which is what analyses such as comparing mean scores assume.
The fundamental mechanism of the Thurstone scale involves compiling a large set of statements relevant to the attitude object—typically 100 or more—and subjecting them to a panel of expert judges. These judges are tasked with sorting the statements into a specified number of categories (often 11), ranging from extremely unfavorable to extremely favorable, regardless of their personal opinion on the issue. The goal is to determine the scale value of the statement itself, independent of any individual respondent’s belief. This process yields a numerical value, or scale score, for each statement, indicating its position on the continuum of favorability. The final instrument then presents a subset of these calibrated statements to the actual research participants, who simply check every statement with which they agree. The participant’s final attitude score is computed as the mean or median scale value of all the statements they endorsed, thereby providing a precise, interval-level measurement of their stance.
This method is conceptually distinct from later scaling techniques because its primary focus is on calibrating the stimuli (the statements) rather than directly calibrating the respondents. The underlying principle is that if judges can consistently assign a statement to a specific point on the continuum, that statement possesses an objective measurable characteristic—its scale value. Therefore, the resulting measurement is not a count of endorsements but a determination of where the individual falls on the pre-established psychological continuum defined by the statements themselves. This innovative approach allowed early psychologists to move beyond nominal classifications and assign true quantitative meaning to complex internal states like attitudes.
Historical Development and Context
The development of the Thurstone scale is inextricably linked to the work of Louis Leon Thurstone in the late 1920s, a period marked by intense interest in applying rigorous statistical and mathematical principles to psychological phenomena. Prior to Thurstone’s contributions, measurement in psychology often lacked the quantitative precision found in the physical sciences. Thurstone sought to bridge this gap, arguing that if something exists, it can be measured, even complex constructs like human attitudes. His seminal work, published primarily between 1927 and 1929, introduced methods that fundamentally transformed psychological measurement, establishing the field of psychometric scaling.
The initial application for which the scale was developed was the measurement of attitudes toward religion, a highly sensitive and complex topic that provided a challenging test case for his methodology. Thurstone recognized that simply asking if a person liked or disliked religion was insufficient; a nuanced scale was necessary to capture the intensity and direction of belief. His work was heavily influenced by the earlier tradition of psychophysics, which studied the relationship between physical stimuli and human perception. He adapted and extended these principles, arguing that the judgment of an attitude statement could be treated similarly to the perception of a physical weight or brightness—a concept foundational to the success of the equal-appearing intervals method.
The historical context also highlights Thurstone’s ambition to establish a scientific foundation for social psychology. By providing a reliable, standardized, and quantifiable measure of attitudes, he offered researchers a powerful tool to study social change, prejudice, and public opinion. The scale served as a prototype for future attitude measurement techniques, including the far more widely used Likert scale, which, while simpler to construct, owes its theoretical lineage to Thurstone’s pioneering work in establishing the necessity of interval-level measurement for psychological constructs.
The Law of Comparative Judgment and Theoretical Basis
The theoretical backbone of the Thurstone scale is the Law of Comparative Judgment, which Thurstone introduced in 1927. The law is a mathematical model that relates how often one stimulus is judged greater than another to the scale values and discriminal dispersions of the two stimuli. Essentially, it posits that when an individual compares two stimuli (here, two attitude statements), the perceived difference between them varies from judgment to judgment and is normally distributed. This allows the researcher to translate observed proportions of judgments (e.g., the proportion of judges who rate statement A as more favorable than statement B) into distances on a psychological continuum expressed in standard deviation units (z-scores).
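In symbols, the law is conventionally written as below. The notation is an assumption of this sketch rather than something defined in the article: S_i and S_j are the scale values of the two statements, σ_i and σ_j their discriminal dispersions, r_ij the correlation between the two discriminal processes, and z_ij the standard normal deviate corresponding to the proportion of judgments in which statement i is rated above statement j.

```latex
\[
S_i - S_j \;=\; z_{ij}\,\sqrt{\sigma_i^{2} + \sigma_j^{2} - 2\,r_{ij}\,\sigma_i\,\sigma_j}
\]
```

When the dispersions are assumed equal and uncorrelated, the square root reduces to a constant, so z_ij is simply proportional to the separation between the two scale values.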
Thurstone’s method of pair comparisons, the operational prototype for the Law of Comparative Judgment, is a statistical procedure that assumes psychological values are normally distributed. The theory is detailed, but the algorithm for generating scale values is systematic. In the simplest theoretical scenario, known as Case V, the frequency dominance matrix (which records how often each statement is judged more favorable than each other statement) is first converted into proportions. Each proportion is then converted to a standard score through the inverse of the cumulative normal distribution, giving a matrix of z-values. The final scale values are the left-adjusted column averages of this matrix (the column means, shifted so that the lowest value is zero), which position each statement along a single, continuous psychological dimension.
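The Case V computation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the three-statement matrix, and the panel of 100 judges are all hypothetical, and two conventions are assumed here, namely that `dominance[i][j]` counts judges who rated the column statement above the row statement, and that "left-adjusted" means shifting the column means so the lowest one is zero.

```python
from statistics import NormalDist


def case_v_scale_values(dominance, n_judges):
    """Return left-adjusted Case V scale values from a frequency dominance matrix.

    Assumed convention: dominance[i][j] is the number of judges who rated the
    column statement j as more favorable than the row statement i.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    k = len(dominance)

    # Step 1: frequencies -> proportions. A statement compared with itself
    # is treated as a tie (0.5), which maps to z = 0.
    proportions = [
        [0.5 if i == j else dominance[i][j] / n_judges for j in range(k)]
        for i in range(k)
    ]

    # Step 2: proportions -> z-values via the inverse cumulative normal.
    # inv_cdf raises StatisticsError for proportions of exactly 0 or 1.
    z = [[std_normal.inv_cdf(p) for p in row] for row in proportions]

    # Step 3: average each column of the z matrix.
    col_means = [sum(z[i][j] for i in range(k)) / k for j in range(k)]

    # Step 4: shift so the lowest scale value is zero (assumed meaning of
    # "left-adjusted").
    lowest = min(col_means)
    return [m - lowest for m in col_means]


# Hypothetical judgments from 100 judges on statements A, B, C.
dominance = [
    [0, 70, 90],   # row A: 70 judges rated B above A, 90 rated C above A
    [30, 0, 75],   # row B
    [10, 25, 0],   # row C
]
scale_values = case_v_scale_values(dominance, n_judges=100)
print([round(s, 2) for s in scale_values])
```

With these made-up counts the three statements land at roughly 0, 0.55, and 1.25, so B sits a little closer to A than to C on the resulting continuum.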
The application of this law allows for the calculation of the “psychological scale separation between any two stimuli.” This separation is crucial because it is what justifies treating the resulting scale as an interval scale, meaning that the difference between scores of 2 and 3 is psychologically equivalent to the difference between scores of 8 and 9. This mathematical foundation distinguishes the Thurstone scale from simpler ordinal ranking methods and is the source of its strength: it lets researchers make quantitative statements about the relative intensity of attitudes across individuals or groups.
Methodology and Construction of the Scale
The construction of a Thurstone scale is significantly more labor-intensive than that of simpler instruments such as the Likert scale, because every statement must be calibrated by judges before any respondent is surveyed. The process is designed to yield statements that are evenly spaced along the attitude continuum, hence the term equal-appearing intervals. The methodology involves several sequential steps that emphasize the calibration of the stimulus items over the response patterns of the subjects.
The construction process typically follows these structured steps:
1. Item Generation: A large pool of statements (often 100 to 200) covering the entire spectrum of favorability toward the attitude object is collected or written. These statements must be clear, unambiguous, and express a definite opinion.
2. Judge Sorting: A panel of expert judges (ideally 50 to 300 individuals) independently sorts each statement into a predetermined number of categories (e.g., 11 categories, where 1 is extremely unfavorable and 11 is extremely favorable). Judges are explicitly instructed to ignore their personal attitude toward the topic.
3. Calculating Cumulative Proportions: For each statement, the cumulative proportion of judges who placed it into a given category or any category below it is calculated. This creates a cumulative frequency distribution for every statement.
4. Deriving Scale Values (S) and Ambiguity Measures (Q): Using the cumulative proportions, the scale value (S) for each statement is calculated, typically corresponding to the median of its distribution on the continuum. Simultaneously, the ambiguity index (Q), which is the interquartile range, is calculated. Statements with high Q values (meaning judges disagreed widely on their placement) are discarded, as they are considered unreliable or ambiguous.
5. Final Scale Selection: Approximately 20 to 30 statements are selected for the final instrument. These statements are chosen specifically to ensure they have low Q values and are evenly distributed across the entire range of S values (e.g., selecting statements with S values near 1.0, 3.0, 5.0, 7.0, 9.0, etc.).
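The calibration steps above (cumulative proportions, the median as S, the interquartile range as Q, and discarding high-Q items) can be sketched as follows. Everything specific here is an assumption for illustration: the function names, the two sets of judge ratings, and the Q cutoff of 2.0 are hypothetical, and each category k is treated as the interval from k - 0.5 to k + 0.5 so that quantiles can be interpolated within a category.

```python
from collections import Counter


def grouped_quantile(ratings, q, n_categories=11):
    """Interpolated q-th quantile (0 < q < 1) of category ratings.

    Category k is treated as the interval [k - 0.5, k + 0.5), and judges
    are assumed to be spread evenly within the category they chose.
    """
    counts = Counter(ratings)
    target = q * len(ratings)   # number of judges that must fall below the quantile
    below = 0                   # cumulative count in lower categories
    for category in range(1, n_categories + 1):
        within = counts[category]  # Counter returns 0 for unused categories
        if within > 0 and below + within >= target:
            return (category - 0.5) + (target - below) / within
        below += within
    raise ValueError("ratings must be non-empty and lie in 1..n_categories")


def scale_value_and_ambiguity(ratings, n_categories=11):
    """Return (S, Q): the median placement and the interquartile range."""
    s = grouped_quantile(ratings, 0.50, n_categories)
    q = (grouped_quantile(ratings, 0.75, n_categories)
         - grouped_quantile(ratings, 0.25, n_categories))
    return s, q


# Hypothetical sorts by 20 judges for two candidate statements.
pile_sorts = {
    "clear statement": [8] * 4 + [9] * 10 + [10] * 6,
    "ambiguous statement": [2] * 5 + [5] * 5 + [8] * 5 + [11] * 5,
}
MAX_Q = 2.0  # hypothetical cutoff; the article does not specify one

results = {text: scale_value_and_ambiguity(r) for text, r in pile_sorts.items()}
kept = [text for text, (s, q) in results.items() if q <= MAX_Q]
print(results)
print(kept)
```

In this made-up data the first statement has judges clustered in adjacent piles and survives the cutoff, while the second is spread across the whole continuum and would be discarded as ambiguous.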
Once the final scale is constructed, respondents are presented with only the selected statements, without the category numbers. They mark all statements with which they agree. The final score is then calculated as the mean or median scale value (S) of the endorsed items. This final score is an interval measurement, reflecting the respondent’s precise location on the attitude continuum established by the judges.
Practical Application and Real-World Example
To illustrate the utility of the Thurstone scale, consider a real-world scenario focused on measuring public attitude toward mandatory environmental conservation policies. A researcher aims to determine not just whether people support the policies, but the intensity of that support, treating the sentiment as a continuous variable.
First, the researcher would generate hundreds of statements, ranging from “Conservation policies are an unnecessary government interference” (very unfavorable, Scale Value ~1.0) to “Conservation policies are the single most important step for human survival” (very favorable, Scale Value ~11.0). A panel of judges then sorts these statements into 11 piles. Through statistical analysis, the researcher determines the scale value (S) and ambiguity (Q) for each statement. For instance, Statement A might receive a scale value of S=2.1 (clearly unfavorable) and Statement B might receive S=8.9 (strongly favorable).
In the final survey, a participant might encounter a set of 25 pre-calibrated statements. The participant agrees with the following three statements:
- Statement 1 (S=4.5): “Conservation policies are generally good, but need more flexibility.”
- Statement 2 (S=5.8): “It is important that the government addresses environmental issues.”
- Statement 3 (S=7.1): “Mandatory policies are necessary to ensure future environmental stability.”
Scoring consists of calculating the mean of the scale values of the endorsed items. In this example, the participant’s attitude score would be calculated as: (4.5 + 5.8 + 7.1) / 3 = 5.8. This mean score of 5.8 places the individual just below the neutral midpoint (6.0) on the 11-point conservation attitude continuum, indicating a close-to-neutral stance rather than clear support. If another participant had a mean score of 9.2, the researcher could state that the second participant’s attitude is more favorable toward conservation than the first by 3.4 scale units, a difference that is interpretable because the scale is treated as interval-level.
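The scoring arithmetic for this participant can be reproduced with a short Python sketch. The function name and the dictionary layout are hypothetical; the scale values are the three from the example above.

```python
from statistics import mean, median

# Scale values (S) of the three statements the participant endorsed.
endorsed = {
    "Statement 1": 4.5,
    "Statement 2": 5.8,
    "Statement 3": 7.1,
}


def attitude_score(scale_values, use_median=False):
    """Score a respondent as the mean (or median) S of endorsed statements."""
    values = list(scale_values)
    if not values:
        raise ValueError("respondent endorsed no statements")
    return median(values) if use_median else mean(values)


mean_score = attitude_score(endorsed.values())
median_score = attitude_score(endorsed.values(), use_median=True)
print(round(mean_score, 1), median_score)
```

Here the mean and the median happen to coincide at 5.8; with a skewed set of endorsements the two would differ, which is why the choice between them should be fixed before scoring.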
Limitations and Methodological Difficulties
Despite its theoretical elegance, the Thurstone scale faces several significant practical and methodological difficulties that have limited its widespread use compared to simpler methods like the Likert scale. The primary critique often revolves around the assumption that the judges’ subjective sorting process successfully creates an objective, interval-level scale. Critics argue that the judges’ own attitudes, which they are instructed to ignore, may subtly influence their sorting, thereby contaminating the calculated scale values. This reliance on a separate panel of judges introduces an expensive and time-consuming step that modern techniques bypass.
A core technical difficulty arises directly from the statistical requirements of the Law of Comparative Judgment, specifically the handling of extreme proportions in the frequency dominance matrix. When two stimuli are compared, a proportion of exactly 1.00 or 0.00 of judges rating one as more favorable than the other translates into a z-value of plus or minus infinity, respectively. Because an infinite value cannot be averaged, the standard procedure for arriving at the final scale values breaks down. Early psychometricians, such as Guilford and Edwards, recognized this problem and recommended not using proportions greater than .98 or less than .02, effectively truncating the data to keep the computation solvable.
The omission of these extreme values, however, leaves empty cells in the Z matrix, which must then be filled by elaborate procedures for estimating the unknown parameters, further complicating an already complex construction process. Alternative solutions, such as those proposed by Krus and Kennedy, were developed, but the statistical instability that arises with perfectly consistent judgments remains a significant constraint on the method, particularly with smaller panels of judges or very distinct attitude statements. These technical hurdles contributed to the eventual dominance of item response theory (IRT) models, which handle extreme responses more robustly.
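The effect of the .02/.98 cutoffs on the Z matrix can be illustrated with a small sketch. The function name and the proportion matrix are hypothetical, and the code only marks the cells that would be left empty; it does not attempt any of the estimation procedures for filling them.

```python
from statistics import NormalDist


def z_matrix_with_gaps(proportions, lower=0.02, upper=0.98):
    """Convert proportions to z-values, leaving extreme cells empty (None).

    Proportions outside [lower, upper] are skipped rather than converted,
    because 0.00 and 1.00 have no finite z-value.
    """
    std_normal = NormalDist()
    return [
        [std_normal.inv_cdf(p) if lower <= p <= upper else None for p in row]
        for row in proportions
    ]


# Hypothetical proportion matrix for three very distinct statements.
proportions = [
    [0.50, 0.80, 1.00],
    [0.20, 0.50, 0.99],
    [0.00, 0.01, 0.50],
]
z = z_matrix_with_gaps(proportions)
empty_cells = sum(cell is None for row in z for cell in row)
print(z)
print(empty_cells)
```

Four of the nine cells come back empty in this example, so a plain column average is no longer defined for every statement, which is the gap the estimation procedures mentioned above are meant to fill.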
Significance, Impact, and Modern Usage
The Thurstone scale holds great historical and theoretical significance because it established the feasibility of rigorous, quantitative psychological measurement. By demonstrating that attitudes could be measured on an interval scale, Thurstone shifted the focus of social science from mere categorization to precise quantification. This pioneering work laid the groundwork for later psychometric scaling techniques, including the widely adopted Likert and Guttman scale methodologies. Even though the Likert scale is simpler and more common in current practice, it relies on the same basic premise: that multiple items can be combined to reflect a single underlying continuous construct.
In contemporary psychology, the Thurstone scale is less frequently used for standard attitude surveys due to the high cost and complexity of its construction process. However, its principles remain vital in specialized research. It is still utilized when researchers require the highest degree of confidence that the scale possesses true interval properties, such as in certain areas of experimental psychology or advanced scaling projects where the exact distance between scale points is critical for modeling. Furthermore, the methodology of using expert judges to determine item characteristics remains influential in fields like content analysis and test development, where objective scaling of stimuli is paramount.
Its enduring impact is primarily theoretical, serving as the conceptual progenitor for modern item response theory (IRT). The very idea of defining an item’s difficulty or favorability independent of the respondent’s ability or attitude—a core tenet of the Thurstone method—is central to advanced modeling techniques. Thus, while the physical instrument may be rare, the underlying statistical philosophy continues to shape how psychologists conceptualize and execute measurement.
Connections to Other Psychometric Models
The Thurstone scale belongs to the family of unidimensional scaling models in psychometrics. Its primary theoretical connection is to the Rasch model, a modern item response theory approach that has superseded Thurstone’s method in many applications. The Rasch model, developed by Georg Rasch, shares a close conceptual relationship with Thurstone’s Law of Comparative Judgment, particularly in its focus on separating item characteristics (favorability/difficulty) from person characteristics (attitude/ability).
However, key differences exist between the two models. The Rasch model, unlike the Thurstone model, directly incorporates a person parameter into the equation, allowing for the simultaneous estimation of both the item scale values and the individual’s attitude level from the response data itself, eliminating the need for a separate panel of judges. Furthermore, the mathematical function employed differs: the Thurstone method is based on the cumulative normal function derived from the assumption of normally distributed judgments, whereas the Rasch model utilizes a simpler logistic function. This logistic form often provides a more computationally tractable and statistically robust framework for modern data analysis.
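The difference between the two link functions is small in practice, which a short numerical comparison can show. This is a sketch under stated assumptions: the function names and the evaluation grid are hypothetical, and the rescaling constant 1.702 is a standard approximation constant from the psychometric literature, not something given in this article.

```python
import math
from statistics import NormalDist


def normal_ogive(x):
    """Cumulative normal curve, the form underlying Thurstone's model."""
    return NormalDist().cdf(x)


def logistic(x):
    """Logistic curve, the form used by the Rasch model."""
    return 1.0 / (1.0 + math.exp(-x))


SCALING = 1.702  # assumed constant that rescales the logistic to track the normal ogive

# Compare the two curves on a grid from -4 to 4 in steps of 0.01.
grid = [i / 100 for i in range(-400, 401)]
max_gap = max(abs(normal_ogive(x) - logistic(SCALING * x)) for x in grid)
print(max_gap < 0.01)
```

After rescaling, the two curves stay within about one percentage point of each other across the grid, so the choice of the logistic form is mainly a matter of mathematical convenience rather than a different picture of how responses behave.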
The Thurstone framework also relates to other scaling innovations, such as unfolding models like the Hyperbolic Cosine Model (HCM). Unfolding models assume that respondents have an “ideal point” on the continuum and that they prefer items closer to this ideal point, regardless of the item’s inherent favorability or unfavorability. While the Thurstone scale focuses on a cumulative measurement (a higher score means more favorable agreement), unfolding models attempt to map both the item and the person onto the same continuum, suggesting a more complex, non-monotonic relationship between attitude and agreement. Despite these advanced developments, the historical context provided by Louis Leon Thurstone’s initial work remains essential for understanding the evolution of quantitative measurement in psychology.