The biologist Medawar raised the question of whether scientists should spend much of their time reflecting on how they do their research, and whether such reflection might interrupt the research itself. The philosopher Feyerabend (1988) suggested that the term scientific method is a poor description, as it implies that every finding in science stems from the same strategy in a formula-like way. Instead, he believed successful research draws on many different moves and tricks. It is an accepted view that there is no fixed strategy behind the scientific method; it is often better described as an "outlook" whose content is logical and empirically based.
Empirical reasoning is another term used to describe this process, but it is not the same as logical positivism. Popper argued against the verifiability principle, which holds that one can arrive at the truth through the accumulation of factual observations. He believed instead in falsifiable hypotheses that go through a cycle of testing and retesting, a view he himself called the "searchlight theory".
Beyond the empirical factors, additional factors affect the advance of science. Thomas S. Kuhn, a physicist turned historian of science, proposed that advances in science occur irregularly, moving in abrupt bursts and stops.
"Paradigm shifts", changes in how humans view the world brought about by revolutionary insights, mark the entire history of science. Popper held a contrary view, arguing that progress in science proceeds along the lines of Darwin's theory of natural selection.
What Do Behavioral Researchers Really Know?
Philosophers of science distinguish between why-questions and how-questions. Why-questions call for answers that involve causal inferences or inferences about purpose. How-questions look for answers that describe how something works, also called the descriptive research orientation.
Three viewpoints on the why- and how-questions of behavioral and social researchers are:
- Social constructionism
Social constructionism views reality as everything that a human mind constructs, such as discourse or narratives, with linguistic constructions as its basis. Experimentation is not a preferred source of evidence. Instead, social constructionists believe that truth is found in the study of relations, especially communication, between humans.
- Contextualism/ Perspectivism
Contextualism/perspectivism stresses the necessity of empirical research conducted from different theoretical perspectives. Knowledge representations are limited in that they are always context- and perspective-bound.
- Evolutionary epistemology (or organic evolution)
Evolutionary epistemology follows Darwin's principles of natural selection and adaptation: knowledge advances the way biological adaptation occurs. A counterargument is that hypotheses and theories, unlike mutations, are not random developments.
Social constructionism differs from the other two views in that it does not use the traditional scientific methods of psychology (e.g., the experiment) and deductive strategies for answering why-questions. Instead, constructionists use an "interpretive" model based on human narratives. Kenneth Gergen, a social psychologist and pioneer of social constructionism, argued that the empirical approach in social psychology generalizes from superficial snapshots of human behavior, and that it would be more useful to analyze what individuals themselves report verbally about their own experiences. Truth and knowledge are products of a socially constructed reality.
John Shotter said that only our own experiences create social reality and that a “real” social world does not exist.
Gergen was criticized for claiming that "real science" requires events that can be repeated, whereas social behavior is sporadic, random, and unpredictable. He was also criticized for trying to make narrative self-reports the sole research method.
Contrary to this view, the findings of Muzafer Sherif's studies shed light on how experimental manipulation can affect social construction, showing that an empirical demonstration of it is possible.
According to the contextualist/perspectivist approach, a scientist must use several strategies and units of analysis to grasp a well-rounded picture of human nature. There are several defensible views, each limited in some way, so they need to be combined. Knowledge moves within a dynamic context and depends on perspective: when the context changes, the perspective, and hence the cognitive clustering, takes on a different form.
The adding and averaging models, the evolutionary model, and the cyclical model are all different views within the contextualist/perspectivist approach; according to it, each would be relevant and applied depending on the context it fits.
Just as in constructionism, there is no single knowable reality in contextualism: one would have to know what reality is like in order to judge what the real truth is. Since we are all biased by our culture, language, previous experiences, and so on, this is impossible, and it is therefore not even a point of debate in the contextualism/perspectivism approach.
Contrary to the contextualist view that theories survive because they are applicable in the current context, evolutionary epistemology holds that a theory survives if it proves useful by winning the empirical competition with rival theories, for example in predicting a future event. The goal is to uncover the a priori reality waiting at the end of a long path of selection and elimination.
Like the contextualists, evolutionary epistemologists believe that knowledge advances dynamically as scientists make their best guesses and continue research from there.
One problem this approach faces is that mutations are random according to Darwinian theory, while the setting up of hypotheses and theories is not. Donald T. Campbell saw evolutionary epistemology as a "potentially unifying philosophy of knowledge in psychology".
Peirce's Four Ways of Knowing
The philosopher Charles Sanders Peirce defined four ways of knowing, the first three of which serve as a traditional base of knowledge (the fourth being the scientific method itself):
- The method of tenacity
This method holds that people believe certain ideas simply because those ideas have been around for so long. Peirce concluded that such beliefs are hard to get rid of, even when reasoning and demonstration show the opposite.
- The method of authority
This method is less primitive than the method of tenacity but has its own limits. Being told by an authority what to do, for example because they are an expert in a subject, can lead to many positive changes. But it can also be exploited by authorities with fake expertise; an extreme example is the false accusation of witchcraft, which has led to horrible consequences.
- The a priori method
With this method we think as independent individuals in a logical way, protecting ourselves against influence from external authorities. The multiplication table, for example, can be seen as free from all temporal bias and always precise.
Rhetoric, Perceptibility, Aesthetics
A number of extra-empirical factors also play a role in science: rhetoric, perceptibility, and aesthetics. Empirical content is the last of the four supports of conviction in total; all four are needed for conviction in science.
The term rhetorical justification describes the language used to persuade, as when a scientist wants to persuade others of his findings. To succeed, the scientist adapts to the linguistic rules of his field of expertise so that he can present information and communicate appropriately, speaking the same language as the other scientists in that field. Yet this conformity can hide deficits in the underlying pattern of reasoning. Language has the power to affect non-linguistic aspects of belief, dividing the world of science along linguistic lines. Different conclusions are therefore often drawn when there is no shared context for or experience with the subject matter, resulting in different opinions about what the truth is.
Visualizations and Perceptibility
When a scientist has a mental image that fits his hypothesis, idea, or conclusion, the stronger that image is, the more it supports belief in the idea. Historically, most scientists in behavioral and social research seem to have used an image as a starting point from which to draw hypotheses, which are then tested. A range of metaphors and proverbs is available that a scientist can use to relate a still rather unknown idea to something familiar. These metaphors are interwoven into our cognitive structure, especially those in everyday use (e.g., time is money). Formal ideas can be represented in a perceptible form, and they often stay with us longer that way.
Sometimes an idea, such as a perceptible image, is rejected not because it cannot be understood correctly but because it is not found to be aesthetic. It is judged on whether it is "beautiful" or "elegant", words not rarely used to describe models in science. The periodic table in chemistry, for example, was called beautiful because of its structured layout, but also because of the gaps left in it, which showed a certain courage on the part of its creator, Mendeleyev.
Making generalizations in science can be difficult because the empirical facts we have are often not enough to prove them. The four supports of conviction have their limitations, which matter when judging empiricism and realizing that it does not have all the properties we sometimes believe it has.
Defining Behavioral Research
When dissecting the broad spectrum of behavioral research, we can classify each field by its focus and by what it tries to explain, its explanatory emphasis. The focus ranges from most micro through more micro and more macro to most macro.
A neuroscientist has the most micro focus, investigating the inside of the object of interest in its smallest possible units; the explanatory emphasis lies on biological and biochemical factors. The field of cognition has a more micro focus, with thinking and reasoning as its explanatory variables. Social psychology takes a more macro focus and tries to explain interpersonal and group factors. Sociology focuses at the most macro level, investigating how and why societal systems work.
Researchers often do not work only with their own methodology but also look at the strategies of others, thereby gaining a broader spectrum. Jean-Paul Sartre believed that a certain switching back and forth in our focus of attention is needed to get a clear picture of the whole of what we are viewing: the more we focus on one thing, the more we lose sight of other features, which are then in danger of going unnoticed. Interdisciplinary fields in science offer the opportunity to share methods, making such switching of concentration easier.
Three Broad Research Orientations
There are descriptive, relational, and experimental researchers, each following their own research orientation.
The descriptive researcher
This type of researcher makes observations and then describes the behavior observed. This approach is usually necessary to start a study, but it does not provide any causal explanation of why some behavior occurs.
The relational researcher
The relational research orientation compares two naturally existing groups. For example, one could study smokers and nonsmokers and the proportion of each group eating versus not eating in the cafeteria: two conditions are put in relation to each other. Quantitative statements can be made once specific observations are carried out on a sample. The relational orientation hence allows one to determine not only the correlation itself but also its strength, known as the effect size, and its form (linear, nonlinear).
The experimental researcher
After receiving results, the researcher can formulate an ad hoc hypothesis tailored to those results, and can introduce manipulations to advance the working hypothesis, the hypothesis the researcher works with. This is the experimental research orientation. It centers on the cause of a certain behavior. Relational research cannot go this far: it indicates a relationship between X and Y but not whether X leads to Y or Y to X, which an experiment can show. In an experiment there is an experimental group, which receives a certain manipulation, and a control group, which does not.
Prime examples illustrating the different research orientations:
The Descriptive Research Orientation
During World War II many psychologists were recruited to help with the selection of personnel for the armed forces. The men were observed by the researchers in a certain setting. A relational research strategy can be added to a descriptive one: for the performance of men in the army, a high correlation between the researcher's assessment and the soldier's actual performance would mean that the assessment predicted the actual performance. The assessment here is the predictor variable and the actual performance the criterion variable, also called the outcome.
The Relational Research Orientation
A construct represents an abstract idea that tries to explain something.
The construct of need for social approval was supported by a variety of replications following the initial study by Douglas P. Crowne and David Marlowe. The validity of the Marlowe-Crowne Social Desirability (MCSD) scale was likewise supported by numerous studies. The studies named in the book show how replications of a construct can be achieved in relational research with a variety of techniques. Testing of a sample of participants should always occur in a reliable (consistent) manner.
The Experimental Research Orientation
Experimental research entails identifying the cause of an outcome. In Harlow and Harlow's study of how soft material affects newborn primates in the short and long term, the experimental groups consisted of infant monkeys separated from their mothers at birth, half of whom were given a wire mother and the other half a cloth mother.
Another study by Harlow and Harlow used an experimental design to test the effects of social isolation on infant monkeys, with differing degrees of isolation as the experimental groups.
Empirical Principles: Probabilistic Assertions
How-questions can be answered through descriptive and relational research; why-questions through experimental research. Experimental research works with probabilistic assertions, which state that a claim is likely to be true if evidence for it was collected through the empirical research strategy. Hans Reichenbach called this the "implicit probability value".
This implicit likelihood also appears in the deductive-statistical explanation. The Hempel-Oppenheim model, or covering law model, illustrates it with a syllogism: if all A are B, and all B are C, then all A must be C. This rests on the assumption that there is at least one universally true premise, so the implicit likelihood is 100%.
With inductive-statistical reasoning, by contrast, probabilistic assertions carry uncertainty and rest on no universally true premise. For example, if 95% of the marbles in a bowl are red, there is never a 100% chance of drawing a red marble.
The empirical principle nevertheless cannot guarantee an exact answer when predicting human behavior, which is affected by:
1. The state of mind of the individual
2. Nature of the situation, in regard to the historical moment
3. Sociocultural conditions, which are often not predictable
This results in variability, which again shows why there can be no universal truths in behavioral (and social) research.
Good Scientific Practice: Orienting Habits
A good researcher is just as important as good research strategies. A good researcher should possess the following traits:
1. Common sense
2. Role-taking ability
3. Confidence in one's own judgment
4. Consistency and care regarding details
5. Ability to communicate
Ethical guidelines have to be followed throughout the entire research and in every aspect of it. This can be a complex task, as there are always new ethical rules to adhere to.
Inspiration and Explanation
Discovery, a term used by Hans Reichenbach, occurs when theories and hypotheses are born; an idea in research can be sparked in multiple ways. These theories and hypotheses are then settled in an empirical manner, which gives rise to the term justification. Since rules and conventions are subject to change in our dynamic world of continuous advances, the term justification should be used rather than the term decision.
Null hypothesis significance testing (NHST), a dichotomous form of decision making, has limitations. The American Psychological Association advocates that the effect size, the strength of the relationship between the independent and dependent variables, become central to how researchers report the results of a study, and that confidence intervals be built around this effect size.
Theories and Hypotheses
A theory is the main skeleton of an explanation of relationships, and a hypothesis can be described as a speculation about an instance from this main structure. Theoretical thinking occurs in an inductive manner. Abstract concepts occur in every area of life, and human observations all call for some theoretical explanation; they are all theory-laden in one way or another, as Feyerabend postulated. Popper argued that properly formulated theories and hypotheses leave room for possible falsification.
Inspiration and Insight: The Sources
Peter Caws believed that progress in science is unavoidable and that discoveries will always be made: if an important scientist such as Newton had never existed, his discoveries would have been made by someone else, provided all the necessary advances up to that point had been made. Science hence develops step-wise. Usually the hypothesis is derived from a known phenomenon, not the other way around. The various ways a researcher arrives at a promising idea to research are illustrated below.
Modifying a Classic Relationship
This approach works by reversing a common relationship. A good example is Robert K. Merton's principle of the self-fulfilling prophecy, which states that certain beliefs can make a behavior become reality. This is the reverse of John Venn's principle of the suicidal prophecy, which states the opposite: negative beliefs about a future event can prevent that event from occurring.
Using a Case Study for Inspiration
Another common approach is to use a case study as a starting point from which generalizations are sought. Kanner's autism was named after Leo Kanner (1943), who used the term to describe similarities in a group of disturbed children who had individually been subject to clinical casework.
Making Sense of a Paradoxical Situation
Finding an explanation for a paradoxical situation can be a great source of inspiration for researchers. Picture a situation in which no one helps an old man who got hurt on the street, even though there were plenty of chances for many people to help him. The principle of "diffusion of responsibility" arose from making sense of such a paradoxical situation.
Using a Metaphor or Analogy
A researcher can use the creative approach of a metaphor to understand and explain a real-life concept better. Metaphors and analogies hence give rise to the development of hypotheses. An example is the metaphor of sensory overload resulting in "deindividuation", which led to the hypothesis that people in crowded urban areas feel less respected as individuals.
Serendipity in Behavioral Research
This principle builds on the idea that a certain portion of luck is needed in science to yield success. Chances occur randomly and can inspire a scientist, so it is important to be attentive at all times. An example is pathologist R. W. Porter's lucky discovery of ulcers in monkeys' stomachs, which led to a number of studies, also with rats, on the role of "executive control" in stressful situations and the development of ulcers.
Molding Ideas into Working Hypotheses
Once an idea has been developed, researchers face several questions that guide the process of producing a working hypothesis. A working hypothesis should have the following qualities: novelty, utility, consistency, testability, refutability, clarity, and conciseness.
Novelty, Utility, and Consistency
Novelty: Is this idea exciting and novel enough to make a contribution to science? A researcher should ask this beforehand, especially because conducting research on a topic irrelevant to the field is just as effortful as conducting research on a novel hypothesis.
Utility and consistency: Is the idea useful with respect to what the research aims to yield? Does it agree with existing knowledge in the field? Utility can serve a practical end as well as a theoretical one.
These three points will be covered if the researcher reviews the literature regularly and actively. Nevertheless, this is no guarantee that an idea will remain plausible throughout the research. Also, so-called "accepted wisdom" should never be accepted blindly when empirical research shows different results.
Testability and Refutability
Popper distinguished science from non-science by saying that verifiability goes with non-science, whereas in real science falsifiability should be possible. In practice, a researcher will commonly rather modify the testing process, re-analyze, or add or remove factors than abandon the hypothesis that brought the unwelcome result. According to modern philosophers, the main goal in practice hence seems to be confirmation.
Clarity and Conciseness
Operational definitions are definitions yielded through the empirical approach, such as the Marlowe-Crowne Social Desirability scale serving as a definition of the "need for social approval". Theoretical definitions differ in that they do not require empirically based operations.
Percy W. Bridgman first introduced the term operational definition, meaning that a concept is just another word for a corresponding set of operations. Operationalism, another name for this view, holds that scientific concepts can be empirically defined by certain observational procedures. One criticism of operational definitions is that they do not fully define valid concepts, but only parts of them; a single concept would require several operations.
Aggression is a good example of how difficult it is to reduce a concept with multiple possible definitions to a single measuring operation. One way to create one-sentence definitions is to create a typology, which classifies systematically; with a typology there are several measuring operations, and no need to force the concept into one. By focusing on the dimensions of the construct, a facet analysis can be used, which formulates a classification system logically.
Coherence describes whether the statements of the hypothesis hang together in a logical way, and parsimony describes whether it is in its simplest possible form. Occam's razor is the principle by which any unnecessary information is taken out, or "cut" away.
Positivism, Falsificationism, and Conventionalism
Auguste Comte was a sociologist and philosopher who advocated an approach to science called positivism (later developed into logical positivism), based on the idea that an observable statement has a greater likelihood of being true; the same goes for any sensory experience that would prove a statement correct. Comte took sciences such as chemistry and physics as role models, in which observations are frequently made and contribute to discoveries. He wanted to transfer this idea to the social sciences, calling it "social physics".
Criticism came from the philosopher David Hume (18th century), who argued from the standpoint of probabilistic assertions: there is always a possibility that observations or sensory experiences do not provide the truth.
Popper disagreed with the positivist approach and created falsificationism. Both positivism and falsificationism are based on empiricism and scientific realism, but falsificationism does not follow the verifiability principle of confirming evidence. If a proposition is not correct, a set of observations should be able to reject it; the formulation of a proposition should therefore allow for possible rejection. One theory is superior to another if it has received fewer rejections through empirical testing. Criticisms of this approach are:
- Even with the most reliable observation, it is not possible to falsify an entire theory consisting of more than one statement.
- There is often no consensus in the social and behavioral sciences on how a theory is adequately tested, or on how results should be analyzed and interpreted to conclude that the theory has been successfully falsified.
- Since no clear boundaries can be defined for theories of human behavior, in contrast to physics for example, it is hard to say exactly when a theory is falsified.
Conventionalism has language as its basis and is associated with the Duhem-Quine thesis, named in part after the science historian, philosopher, and physicist Pierre Duhem. Duhem argued that since theories derive from linguistic conventions and do not survive on the basis of empirical disconfirmation tests, crucial experiments do not exist. He also advocated adjusting theories rather than abolishing them when they are refuted, contrary to what Popper proposed with his empirical jeopardy principle.
Festinger's (1957) cognitive dissonance theory gives a good illustration of how a theory evolves through adjustments, but also of how, at the end of the metamorphosis, the old theory can no longer describe the phenomena of the new one.
An Amalgamation of Ideas
The current view of what scientific theories and hypotheses require is a mix of the different approaches. Falsificationism, conventionalism, and practicality mesh together in conclusions now widely agreed on in behavioral research:
- Hypotheses must be formulated so that, if they are false, they can be disconfirmed by "a finite set of observations"; this is called "finite testability".
- Falsifiability instead of verifiability.
- Scientific theories can add to or replace existing behavioral models.
- If a proposition is not supported, the corresponding theoretical model is not necessarily incorrect.
- Repeated failure to support a proposition should lead to revision or discarding of the theory.
- Support for a proposition does not establish the correctness of the corresponding theory, as there might be a better theory to explain the outcomes.
- Consistent findings and inconsistent findings are both said to have evidential value.
- A theory must be clearly stated to prevent confusion about its assertions.
Decision Errors: Type I and Type II Error
In null hypothesis significance testing the null hypothesis H0 states that the groups tested do not differ, or that there is no relationship between two variables, in contrast to the alternative hypothesis (H1 or Ha). A Type I error is rejecting a true H0; a Type II error is failing to reject a false H0. The probability of a Type I error is alpha (α), the significance level against which the p-value is judged; beta (β) is the probability of a Type II error. Power, 1 − β, expresses how well a false H0 would be detected, and 1 − α, known as confidence, is the probability of not making a Type I error.
In science more attention is paid to the Type I error, an error of "gullibility", than to the Type II error, an error of "blindness" in the scientific method.
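The gullibility/blindness trade-off can be made concrete with a small simulation. The sketch below is not from the text; the sample size, effect size, alpha level, and number of runs are hypothetical choices, and it simply counts how often a two-sample t-test rejects a true H0 (Type I) and fails to reject a false H0 (Type II).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, d, runs = 0.05, 30, 0.5, 10_000  # hypothetical design choices

type1 = type2 = 0
for _ in range(runs):
    # H0 true: both groups drawn from the same population
    a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        type1 += 1                    # rejected a true H0 ("gullibility")
    # H0 false: the groups differ by d standard deviations
    a, b = rng.normal(0, 1, n), rng.normal(d, 1, n)
    if stats.ttest_ind(a, b).pvalue >= alpha:
        type2 += 1                    # kept a false H0 ("blindness")

print(f"Type I rate  ~ {type1 / runs:.3f} (near alpha = {alpha})")
print(f"Type II rate ~ {type2 / runs:.3f}; power (1 - beta) ~ {1 - type2 / runs:.3f}")
```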
Statistical Significance: The Effect Size
The value of the significance test increases, meaning the p-value gets smaller, as the effect size and/or the size of the study increases. This equation represents the relationship:

Significance test = Size of effect × Size of study

From this relationship one can work out how large a sample needs to be to reach a certain significance level, since the effect size multiplied by the study size gives the value of the significance test.
With this information we can calculate a chi-square or any number of t-tests. Researchers familiar with the relationship between statistical significance and effect size can plan significance tests that are more likely to have high power.
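As an illustration of the significance test = effect size × study size relationship, the following sketch (the effect sizes are made up) uses two standard identities, χ²(1) = φ²N and t = r√(N − 2)/√(1 − r²), to show the p-value shrinking as N grows while the effect size stays fixed.

```python
import math
from scipy import stats

phi, r = 0.20, 0.20                      # fixed, hypothetical effect sizes
for N in (25, 100, 400):
    chi2 = phi**2 * N                    # chi-square(1) = phi^2 x N
    t = r * math.sqrt(N - 2) / math.sqrt(1 - r**2)
    p_chi2 = stats.chi2.sf(chi2, df=1)
    p_t = 2 * stats.t.sf(t, df=N - 2)    # two-tailed p for the t-test
    print(f"N={N:4d}  chi2={chi2:5.2f} (p={p_chi2:.4f})   t={t:5.2f} (p={p_t:.4f})")
```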
Two Families of Effect Size
The correlation family and the difference family are the two most important effect size families; a third is the ratio family. Each family has subtypes.
The three subtypes of the difference family are Cohen's d, Hedges's g, and Glass's ∆. The numerator is the same for all three, but the denominators differ:
Cohen's d = (M1 − M2) / σpooled
Hedges's g = (M1 − M2) / Spooled
Glass's ∆ = (M1 − M2) / Scontrol
S is the square root of the pooled unbiased estimate of the population variance; Scontrol is the same but computed solely for the control group.
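A minimal sketch of the three difference-family indices, computed on invented score vectors; reading σpooled as the N-denominator (biased) pooled estimate is an assumption consistent with the definitions above.

```python
import numpy as np

treat = np.array([12.0, 15.0, 14.0, 16.0, 13.0])    # hypothetical scores
control = np.array([10.0, 11.0, 12.0, 9.0, 13.0])
m1, m2 = treat.mean(), control.mean()
ss = np.sum((treat - m1) ** 2) + np.sum((control - m2) ** 2)
n1, n2 = len(treat), len(control)

sigma_pooled = np.sqrt(ss / (n1 + n2))        # N-denominator pooled SD (assumed for d)
s_pooled = np.sqrt(ss / (n1 + n2 - 2))        # unbiased pooled estimate (for g)
s_control = control.std(ddof=1)               # control-group SD alone (Glass's delta)

print("Cohen's d  :", (m1 - m2) / sigma_pooled)
print("Hedges's g :", (m1 - m2) / s_pooled)
print("Glass's D  :", (m1 - m2) / s_control)
```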
To the correlation family belong the phi coefficient (φ) and the point-biserial correlation rpb: the former is the Pearson product-moment correlation when both variables are dichotomous, the latter when one variable is continuous and the other dichotomous. Also in this family are the Fisher transformation of r, denoted z, and squared indices of r and related quantities such as the coefficient of determination r2, as well as Ω2 (omega squared), ε2 (epsilon squared), and η2 (eta squared). When such correlational indices are squared, directionality is no longer indicated, which makes them less useful as indices of effect size; squared indices can also be misleading about the practical value of small effect sizes.
The family of correlation indices proves to be generally more useful when it comes to effect size measures.
Interval Estimates Around Effect Sizes
The margin of error around the obtained effect size index represents the confidence interval of the effect size. For example, an obtained effect size r of .21 might have a confidence interval with a lower bound of .15 and an upper bound of .28. The interval gets smaller if the confidence level is lowered (e.g., from 95% to 90%) and larger if it is increased (e.g., from 90% to 95%). Increasing N, the number of observations in the sample, also narrows the confidence interval.
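The following sketch builds such an interval via the Fisher z transformation of r; the N used is hypothetical and is not the study behind the r = .21 example above.

```python
import math
from scipy import stats

r, N, conf = 0.21, 900, 0.95          # N is a made-up sample size
z = math.atanh(r)                     # Fisher transformation of r
se = 1 / math.sqrt(N - 3)             # standard error of z
crit = stats.norm.ppf(1 - (1 - conf) / 2)
lo, hi = math.tanh(z - crit * se), math.tanh(z + crit * se)
print(f"{conf:.0%} CI for r = {r}: [{lo:.2f}, {hi:.2f}]")
# Lowering conf (e.g., to 0.90) or raising N narrows the interval.
```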
The null-counternull interval can warn against falsely rejecting the alternative hypothesis in favor of the null hypothesis, that is, against a false conclusion of "no effect". Instead of being based on an alpha specified in advance, this interval works with the actually obtained p-value.
It is important to keep in mind that significance testing alone, without an estimate of effect size, does not give a full and correct picture. P-values become more useful with Peter Killeen's prep statistic, which gives the probability that a replication of the same size will show an effect in the same direction as the original study:
prep = 1 / [1 + (p / (1 − p))^(2/3)]
For p, insert the obtained significance level. Though helpful, this statistic is not as useful as effect size estimates and their interval estimates.
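A one-function sketch of Killeen's prep, applied to a made-up p value:

```python
def p_rep(p: float) -> float:
    """Killeen's p_rep: estimated probability that a same-size replication
    yields an effect in the same direction as the original study."""
    return 1.0 / (1.0 + (p / (1.0 - p)) ** (2.0 / 3.0))

print(p_rep(0.05))  # ~0.88 for an obtained p of .05
```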
Puzzles and Problems
Ludwig Wittgenstein, professor of philosophy at Cambridge, and Karl Popper disagreed on the role of philosophy in the years following World War II. Wittgenstein argued that ambiguity arises from the imprecise, ordinary language used in philosophy; for him, philosophy centered on linguistic puzzles, not real problems that must be discussed. Popper disagreed, as did Russell, who felt that Wittgenstein's declarations put philosophy in jeopardy as an academic field.
This recalls the debate sparked by ethics in research, in which ethical values, especially in their subjectivity, and scientific values can oppose each other.
The American Psychological Association (APA) has created the most widely used ethical guidelines in psychology; the Association for Psychological Science (APS) has its own. Since behavioral researchers may belong to either association, no single consensus on ethical guidelines has been established in behavioral research.
A Delicate Balancing Act
The stage-wise process that research follows presents the issue that every step is value-laden in some way. Even though a topic might seem neutral to the researcher, it can provoke strong reactions, as not everyone regards the topic as neutral. Concern about ethics and values grows when research taps an issue sensitive for society. Researchers can assess the morality of their scientific conduct by adhering to the code of ethics issued by their national psychological association. The institutional review board (IRB) is another station a researcher must pass: an independent group that evaluates the researcher's work on moral and ethical grounds.
The two most important points in conducting ethical research are not harming the subjects psychologically or physically, and doing research that is beneficial and valid in its results. Well-designed research is easier to defend against ethical objections, since it represents a good investment of resources (including the research subjects). Nevertheless, human rights often seem to be of secondary concern if one takes into account that research is often conducted with active and passive deception and that privacy is also often invaded.
The American Psychological Association Code: A Historical Context
Before the 1930s deception was seldom used in social psychology; its use increased until 1950 and rose drastically thereafter. Nowadays the use of deception is declining again.
The Tuskegee study
The Tuskegee study, which ran from 1932 to 1972, illustrates how an extreme form of deception went horribly wrong. Four hundred low-income African American men in Alabama who had syphilis were recruited for a study intended to investigate the course of the disease, presented to the subjects as free yearly check-ups for their "bad blood". They were not given penicillin, which became available in the 1940s and could cure syphilis, so that the researchers could keep investigating the course of the disease. This resulted in severe damage to the health, and even the death, of research subjects.
In the 1980s three ethical principles became installed, without exception, in all European and American codes of human subject research:
1. No physical harm should occur.
2. No psychological harm should occur.
3. Confidentiality of data is ensured.
The confidentiality agreement ensures the privacy of the participant, allows the researchers to keep the private data, and makes participants more open and honest because they know their information is safe.
The ethical principles for research with human participants are presented in a table, with principles ranging from A to J (Table 3.1, p. 66 in Research Methods).
The Belmont Report, Federal Regulations, And The Institutional Review Board
The Belmont Report
This report emphasizes: 1) the autonomy of the individual, 2) protection of less autonomous individuals, 3) maximization of benefits, 4) prevention of harm as far as possible, and 5) a fair distribution of risks and benefits. The IRB will require research studies entailing more than "minimal risk" to install specific safeguards, which assist participants by explaining the purpose, procedure, and potential discomforts and risks of the study.
The IRB has been criticized for hindering many studies without fully weighing their benefits.
Minimal risk research
Minimal risk research describes studies that pose no greater risks than subjects are exposed to in normal life. Even minimal risk studies are subject to the IRB, though their review can be expedited. Nevertheless, minimal risk research could potentially harm an individual as well.
State laws can also affect the standards of the IRB, which may have to tighten its standards accordingly. Moreover, IRB members often differ slightly in their values, which can lead to approval by the IRB of one institution but not of another.
The APA code emphasizes five principles: a) respect for persons and their autonomy, b) beneficence and nonmaleficence, c) justice, d) trust, and e) fidelity and scientific integrity.
Principle I: Respect for Persons and Their Autonomy
Informed consent is the most important feature of this principle. The informed consent form provides information about the purpose, nature, risks, and benefits of the study and, most importantly, the participant's indication that he or she is participating voluntarily. Informed consent has to be provided by the principal investigator and agreed to by the participant; for children, legal parental consent or the consent of an advocate is necessary. A criticism of informed consent is that the information provided could invalidate or generally impede the research.
Gerald T. Gardner found in his study of the effect of noise on task performance that participants did not show negative effects if they had received informed consent beforehand. He attributed this to the participants' knowledge that they could leave the study at any time without consequences, which made them feel more in control of the noise, resulting in fewer negative effects. Schwartz, on the other hand, argued that informed consent could produce paranoid imaginativeness.
Principle II: Beneficence and Nonmaleficence
Research is supposed to benefit the participant, which is known as beneficence, and to do no harm, which is known as nonmaleficence. In psychological research the two points posing the most risk are invasion of privacy and the use of deception, whether active or passive. When deception is used, a debriefing is required and the research must be of scientific value; also, any risks not mentioned must not exceed minimal risk.
Stanley Milgram's study illustrates a use of deception that raised strong concern. An authority figure told the deceived person, the "teacher", to administer electric shocks to a third person, the "learner", whenever the learner made a mistake on a task. The learner was an actor and never actually received the shocks, but the shocking result was that teachers went on giving what they believed were painful and even deadly shocks because the authority told them to. Even though a debriefing occurred afterwards, the study initially caused great distress to most of the deceived participants and was therefore judged unethical and highly criticized.
The researcher is in a constant balancing act, estimating how a participant will react to a certain deception in order to protect his or her rights. Debriefing should occur in a gentle manner and reduce negative feelings.
Principle III: Justice
The third principle, justice, holds that scientific research should be fair, with an equal distribution of benefits and burdens. The Tuskegee study shows how this was violated: the participants did not receive the medication once it became available. Also, instead of assigning the control condition to no treatment or a placebo, control participants should be assigned to an effective alternative while the experimental group receives the new form of therapy; a placebo control condition should be used only when no effective alternative exists.
A wait-list control group is one way to overcome this dilemma: people who are waiting for a treatment anyway, such as those on a wait-list for psychotherapy, serve as the control condition and receive the therapy once they are off the waiting list. This alternative can be applied in randomized experiments.
There are several orientations regarding what counts as justice. If the consequences of an action decide its fairness, this is the consequentialist view: if you take a boy's puppy away without asking, but the boy is happy about it because the puppy was too much work, consequentialists would call this a right action. The deontological view classifies certain actions as wrong even if they yield no negative, or even positive, consequences. The pluralistic orientation unites these two orientations.
There can be injustice even without deception or invasion of privacy, as in the 1973 field experiment at the Rushton Mining Company in Pennsylvania. The workers in the experimental condition, volunteers with no mining experience, benefited from higher salaries and luxurious dinners while taking on executive functions in the company, while the experienced miners received none of this.
Moral costs are another aspect of the principle of justice. They arise, for example, if a participant feels embarrassed, vulnerable, or exploited by being put in the public eye through identification in the published results.
The process of choosing people for the lifesaving procedure of chronic hemodialysis has also been debated as a question of justice. Since a hospital lacks the resources to treat everyone in need, there are several approaches to selection: first come, first served; selection by factors such as age (under 40), marital status, community contributions, and more; and selection by lottery, which is mostly considered the fairest.
Principle IV: Trust
This principle rests on the agreement of confidentiality, meaning that the subject's data are safe and no one besides the researcher has access to them. Researchers can obtain a certificate of confidentiality from the funding agency. For researchers working in the field of child abuse and neglect, the mandatory reporting of child abuse poses an issue for the principle of trust between researcher and subject. Special training programs are necessary for the correct recognition of abuse, so that neither overreporting nor underreporting occurs.
Principle V: Fidelity and Scientific Integrity
This principle highlights how ethical quality and scientific quality relate to one another. Several issues fall under it, such as hyperclaiming: promising great achievements of the research to the agencies funding it, to research colleagues, and to participants. Participants are most vulnerable to hyperclaiming, being the least familiar with the research topic. Causism is another problem: implying a causal relationship that has not (yet) been established through data analysis.
How research data are analyzed also often bears on the principle of fidelity. Among the most common ethical issues is omitting data that do not agree with the theory; rarer, but worse with regard to the ethical code, is fabricating data. The debate is especially heated over outliers, where technicality and ethicality meet: outliers that weigh against the research findings seem to be discarded more often and more easily than outliers that would support the hypothesis. Deleted outliers must be reported.
Data are often subjected to reanalysis by researchers, even though some consider this inappropriate on technical and ethical grounds. Nevertheless, reanalysis can uncover new or overlooked points of great importance. Data also deserve reanalysis when the research is of great worth and when precious resources such as money, time, and effort are at stake, not only for the researcher but for everyone involved: the research subjects, the agencies, and society as a whole.
Fraud in research occurs when data are reported that never existed. Plagiarism means taking the work or ideas of another researcher and claiming them as one's own; it comes in two forms, intentional and accidental.
Costs, Utilities, And Institutional Review Boards
The institutional review board (IRB) decides whether to approve a study by weighing its costs against its utility. Utility and costs can be pictured as a decision plane in which they form separate dimensions. Studies scoring high on utility and low on cost should be carried out; studies scoring equally on utility and cost fall on the B-C diagonal of indecision, where the decision is not easy. Values along the A-C axis mean low cost but also low utility, and this axis is often used as a criterion. This cost-utility model does not take into account the costs of not doing the research.
Scientific And Societal Responsibilities
Research on animals in psychology raises the question of how often the results of animal research are actually used. The moral contract that researchers abide by is the "three Rs principle" of Russell and Burch (1959), which describes aspects of ethical research with animals. According to Russell and Burch, it is sensible to use fewer animals in research (reduce), to improve experiments so there is less suffering (refine), and to use procedures that do not involve animals where possible (replace).
Scientists face the challenge of meeting the standard of contributing to society, which is what makes research possible, while at the same time looking out for the welfare and dignity of their subjects of study.
Random and Systematic Error
A study has validity if it measures what it claims to measure; reliability means that the measures used yield consistent results. Importantly, a measure can be reliable without being valid. It is often assumed that in the "hard sciences" (such as physics) measurements replicate better than in the behavioral and social sciences. Hedges (1987) argued that this is not necessarily true, showing through various cases in thermochemistry and astronomy that results from different laboratories differ considerably; he therefore advocated comparing results only within laboratories.
Error denotes the fluctuations that occur in measurements and can affect how reliable and valid they are; even a difference as slight as 0.00001 g on a standard supermarket scale counts. Such a difference is due to chance error, also called noise. The relationship can be described by:
Observed value = Actual (“true”) value + Chance (random) error
Chance error is the only difference that would remain between completely alike subjects. Systematic error, also known as bias, differs from random error in that measurements deviate from the exact value in one direction, resulting in mean values that differ from the actual value. This problem can be addressed by subtracting the systematic error from the mean of the observations. Random error, in contrast, consists of measurements above and below the exact value, so that the average approximates the actual value.
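A small simulation (all numbers invented) makes the contrast concrete: chance error averages out across observations, while systematic error shifts the mean and must be subtracted out.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value, n = 100.0, 10_000                        # hypothetical quantities

noisy = true_value + rng.normal(0, 2.0, n)           # chance (random) error only
biased = true_value + 3.0 + rng.normal(0, 2.0, n)    # constant +3 bias plus noise

print("mean, random error only  :", noisy.mean())    # ~100: noise cancels out
print("mean, with systematic bias:", biased.mean())  # ~103: bias does not cancel
print("bias-corrected mean      :", biased.mean() - 3.0)
```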
Internal validity describes the degree to which we can conclude that a change in one variable caused the change in another.
Assessment: Stability and Equivalence
We will look at three traditionally used types of reliability: test-retest reliability, alternate-form reliability, and internal-consistency reliability. For each type a reliability coefficient is calculated from certain mean values.
If a group is measured for a certain characteristic at one point in time and again at a later point, the measures have high test-retest reliability if they yield similar results. This is often applied to the measurement of trait characteristics, which are said not to change much over time and are therefore expected to show high retest reliability.
Alternate-form reliability becomes important when retest reliability is subject to inflation because a participant is already acquainted with the material being tested, so that a new version of the instrument has to be created. It describes whether these different versions are equal in what they measure.
Equivalent-forms reliability compares the measurement content of one version with that of another in the sense of equivalence. Both versions are given in the same session, at the same point in time, and the correlation between them is calculated. The variances of the different instruments also need to be similar to qualify as statistically equivalent.
Stability coefficients are correlations obtained from equivalent versions given to the same research subjects at different points in time; coefficients of equivalence, in contrast, represent this correlation at roughly the same point in time. A cross-lagged correlation implies time-lagged treatment of one variable in relation to the other (A or B).
The retest reliability of a test cannot be pinned down to a single value, because the time interval between the two measurements always influences the stability coefficient. An instrument should remain sensitive to behavioral change.
Internal-Consistency Reliability and Spearman Brown
If the items of an instrument measure the same feature, the instrument is said to have internal consistency, often termed the reliability of components and denoted R. Three approaches to internal consistency are discussed: 1. the Spearman-Brown equation, 2. Kuder and Richardson's Formula 20, and 3. Cronbach's alpha coefficient. Adding homogeneous items increases R. The WAIS has an average internal-consistency reliability of .87, which is considered very satisfying.
With the Spearman-Brown equation it is possible to see by how many items a test should be lengthened to yield a certain level of internal-consistency reliability. The formula is:
RSB = n × rij / [1 + (n − 1) × rij]
RSB = reliability of the sum of the n item scores
n = number of items (e.g., n = 4)
rij = item-to-item correlation, the single-item reliability, calculated as the mean of the intercorrelations among all items (e.g., rij = .60)
Then: RSB = 4(.60) / [1 + (4 − 1)(.60)] = 2.40 / 2.80 ≈ .86
If an added item has lower reliability than the present items, the overall reliability can decrease. Table 4.1 (p. 95, Research Methods) allows one to look up the number of items (n) needed or the resulting internal-consistency reliability (RSB).
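A short sketch of the Spearman-Brown prophecy formula, reproducing the worked example above and inverting the formula to answer the "how many items?" question; the target reliability used is a made-up example.

```python
def spearman_brown(n: float, r_ij: float) -> float:
    """Reliability of the sum of n items with mean inter-item correlation r_ij."""
    return n * r_ij / (1 + (n - 1) * r_ij)

print(round(spearman_brown(4, 0.60), 2))   # ~.86, matching the example above

def items_needed(target: float, r_ij: float) -> float:
    """Invert the formula: items required to reach a target reliability."""
    return target * (1 - r_ij) / (r_ij * (1 - target))

print(items_needed(0.90, 0.60))            # 6 items of r_ij = .60 give R = .90
```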
The KR20 and Cronbach’s Alpha
In the split-half method a test is split in half and the two halves are correlated, giving the split-half reliability. When the test has a dichotomous format, the KR20 (RKR20) can be applied; it takes all possible divisions into halves into account. The formula is:
RKR20 = [n / (n − 1)] × [(σ2t − Σ PQ) / σ2t]
n = number of test items
σ2t = variance of the total test scores, i.e., σ2t = Σ (tp − t̄)2 / (N − 1), where tp is the total score of person p, t̄ the mean total score, and N the number of persons
P = proportion of responses to an item labeled 1
Q = proportion of responses labeled 0 (Q = 1 − P)
The Kuder-Richardson equation is a special case of Cronbach's alpha, also called the alpha coefficient (RCronbach). Cronbach's alpha can be estimated by:
RCronbach = [n / (n − 1)] × [(S2t − Σ S2i) / S2t]
S2t = variance of the total test scores; Σ S2i = the sum of the variances of the individual items
The Spearman-Brown formula and Cronbach's alpha yield approximately the same answer when the variances of all items are equal.
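The following sketch (with an invented 0/1 data matrix) computes KR20 and Cronbach's alpha side by side; with consistent variance denominators the two coincide for dichotomous items, illustrating the special-case relationship noted above.

```python
import numpy as np

X = np.array([[1, 1, 0, 1],    # rows = persons, columns = dichotomous items
              [1, 0, 0, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 1],
              [1, 1, 0, 0]])
n = X.shape[1]                              # number of items
var_total = X.sum(axis=1).var()             # variance of the total scores

P = X.mean(axis=0)                          # proportion scored 1 on each item
kr20 = (n / (n - 1)) * (var_total - np.sum(P * (1 - P))) / var_total

item_vars = X.var(axis=0)                   # item variances (= PQ for 0/1 items)
alpha = (n / (n - 1)) * (var_total - item_vars.sum()) / var_total

print(round(kr20, 3), round(alpha, 3))      # identical values here (~0.41)
```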
Effective Reliability of Judges
If we want to estimate the reliability of the ratings of a group of judges, we are dealing with effective reliability. The Spearman-Brown procedure applies:
RSB = n × rjj / [1 + (n − 1) × rjj]
RSB = effective reliability
rjj = mean correlation among all judges; the mean judge-to-judge reliability
We can find out how many judges (n) are needed when a specific mean reliability (rjj) is given and we know the minimum effective reliability (RSB) we require. Cronbach's alpha coefficient can also be calculated from this information. The effective reliability can likewise be estimated with an analysis of variance; named after C. Hoyt (1941), the formula is:
RHoyt = (MSpersons − MSresidual) / MSpersons
The RHoyt equation and the equation for obtaining RCronbach via ANOVA are in fact the same. The intraclass correlation gives an estimate of judge-to-judge reliability, calculated by:
rintraclass = rjj = (MSpersons − MSresidual) / [MSpersons + (n − 1) × MSresidual]
RCronbach, RHoyt, and RSB as estimates of effective reliability all give very similar results, with only minimal differences. The analysis-of-variance approach is preferred as the number of items or judges increases.
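A sketch of RHoyt and the intraclass correlation from a hypothetical persons-by-judges rating matrix, computing the ANOVA mean squares by hand:

```python
import numpy as np

ratings = np.array([[6.0, 7.0, 5.0],   # rows = persons, columns = judges
                    [3.0, 4.0, 4.0],
                    [8.0, 9.0, 7.0],
                    [5.0, 5.0, 6.0]])
p, n = ratings.shape                   # p persons rated by n judges
grand = ratings.mean()

ss_persons = n * np.sum((ratings.mean(axis=1) - grand) ** 2)
ss_judges = p * np.sum((ratings.mean(axis=0) - grand) ** 2)
ss_resid = np.sum((ratings - grand) ** 2) - ss_persons - ss_judges

ms_persons = ss_persons / (p - 1)
ms_resid = ss_resid / ((p - 1) * (n - 1))

r_hoyt = (ms_persons - ms_resid) / ms_persons                  # effective reliability
r_intra = (ms_persons - ms_resid) / (ms_persons + (n - 1) * ms_resid)

print("R_Hoyt      :", round(r_hoyt, 3))
print("r_intraclass:", round(r_intra, 3))
```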
Effective Cost of Judges
Effective cost helps to select the type of judge that will ensure maximum effective reliability for a given budget:
ECj = Cj × (1 − rjj) / rjj
Cj = cost per judge of type j
rjj = average intercorrelation of type-j judges with one another
For each judge type an effective cost is computed and the types are ranked by it. Selection then proceeds from the best to the worst judge type and ends once the budgeted money is spent or the maximum number of judges is reached. The product of effective cost and number of judges equals the total cost. When there is more than one type of judge, the overall effective reliability of each type can be calculated so that they can be compared.
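A minimal sketch of the ranking procedure for hypothetical judge types; the costs and mean intercorrelations are invented.

```python
judge_types = {                        # all costs and correlations are made up
    "graduate student": {"cost": 20.0, "r": 0.40},
    "trained rater":    {"cost": 50.0, "r": 0.60},
    "expert clinician": {"cost": 120.0, "r": 0.70},
}

# Rank types by effective cost: EC_j = C_j (1 - r_jj) / r_jj
for name, j in sorted(judge_types.items(),
                      key=lambda kv: kv[1]["cost"] * (1 - kv[1]["r"]) / kv[1]["r"]):
    ec = j["cost"] * (1 - j["r"]) / j["r"]
    print(f"{name:16s} effective cost = {ec:6.1f}")
# Recruit from the top of this ranking until the budget or the maximum
# number of judges is reached; effective cost x number of judges = total cost.
```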
Effective Cost of Items
As discussed above, whether we have several types of judges or just one, we can estimate the effective reliability. The same idea can be transferred to choosing the best items for a test to yield maximum internal-consistency reliability: scoring a specific test item costs a certain amount of money, from which we can calculate its effective cost. In this way we can, for example, compare the effective cost of an essay item with that of a multiple-choice item, using the same effective-cost formula as for judges.
Interrater Agreement and Reliability
It is necessary to distinguish between interrater agreement and interrater reliability. Percentage agreement is computed with:
Percentage agreement = [A / (A + D)] × 100
A= number of agreements among judges
D= number of disagreements among judges
Similarly, the net agreement would be calculated with:
Net agreement = [(A − D) / (A + D)] × 100
These calculations cannot differentiate between accuracy and variability, so their outcome may not agree with the original data, which makes them less than ideal. Table 4.7 (p. 104, Research Methods) illustrates this issue well. Instead we can work with the product-moment correlation (r). For the example above we would use a 2 × 2 table of counts, from which we obtain the product-moment r, here the phi (φ) coefficient. For the computation use:
φ = √(χ²(1) / N)
This equation accurately reflects the differences in results in Table 4.7, unlike percentage agreement, which was quite insensitive to the actual differences.
Kappa (κ) deals with the lack-of-variability problem in percentage agreement and adjusts for it; with kappa you can compute the agreement between judges corrected for chance. In Table 4.8, which illustrates the use of kappa, we have O, E, and N, from which kappa is computed. O represents the observed number of agreements between the two judges. E denotes the expected number of agreements, taking chance agreement in the cells along the diagonal of the table into account. N stands for the number of cases. Multiplying the row total by the column total and dividing by the number of cases gives the expected number for each diagonal cell. Cohen’s kappa is then:
κ = (O − E) / (N − E)
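A small Python sketch tying these agreement indices together for an invented 2 x 2 table of counts (all numbers are illustrative):

import math

yes_yes, yes_no, no_yes, no_no = 20, 5, 5, 70   # two judges, yes/no codings
N = yes_yes + yes_no + no_yes + no_no
A = yes_yes + no_no                              # agreements
D = yes_no + no_yes                              # disagreements

pct_agreement = A / (A + D) * 100
net_agreement = (A - D) / (A + D) * 100

row1, row2 = yes_yes + yes_no, no_yes + no_no    # marginal totals
col1, col2 = yes_yes + no_yes, yes_no + no_no
E = (row1 * col1) / N + (row2 * col2) / N        # expected diagonal agreements

kappa = (A - E) / (N - E)                        # O in the formula is A here

# phi as the product-moment r of the 2 x 2 table
phi = (yes_yes * no_no - yes_no * no_yes) / math.sqrt(row1 * row2 * col1 * col2)

print(pct_agreement, net_agreement, round(kappa, 3), round(phi, 3))

With these counts the percentage agreement is 90, yet kappa and phi both come out near .73, showing how the chance-corrected indices tell a more conservative story.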
Kappa is better suited as a chance-corrected index of percentage agreement than as an index of interrater reliability. Kappas computed on tables with df > 1 are called omnibus kappas, omnibus statistical procedures being the general term; they stand in contrast to focused statistical procedures, which have 1 df. An omnibus kappa faces the problem of not being able to tell how reliably specific judgments are made, and 1-df kappas and omnibus kappas often show little or no relation to one another. Only a kappa at or approaching unity implies that all judgments are reliable.
Even a kappa with df = 1 from a 2 x 2 table can be problematic. The main concern is that kappa is not in general equal to the product-moment correlation, so it cannot receive the same variety of interpretive procedures available to product-moment correlations.
Looking at Table 4.10 (p. 108, Research Methods), if the row totals for levels A and B equal the corresponding column totals, kappa and r are equivalent, and the binomial effect-size display (BESD) can then be used to interpret that specific kappa. Table 4.13 (p. 110, Research Methods) illustrates why kappa should not be the go-to index of reliability, as other problems can occur: when the largest count in a table switches cells, producing different marginal totals for Judges 1 and 2 at level A, kappa takes a different value than r, so the BESD can be applied only to r, not to kappa.
Replication in Research
The goal of replication in research is to be able to generalize results, from generalization over time and measurements to generalization over observers and manipulations. An exact replication of previous results is never possible: strictly speaking, only the same experimenter could rerun the same experiment, and even that experimenter changes with age, just like the participants. This is why we use the term relative replications. How useful a replication is depends partly on three factors: when, how, and by whom it is conducted.
When: Replications conducted long after a research question first arose contribute less; the first replications add the most information.
How: The way a replication is done is crucial. A precise replication stays very close to the original design, whereas a varied replication deliberately adjusts some aspects. A successful varied replication has the advantage of extending generalizability; an unsuccessful precise replication hurts the theory.
By whom: This point concerns correlated replicators, describing how the independence of replicators is limited. A researcher with a special interest in a field of study will work differently from a researcher not particularly passionate about it, so sets of replications may be less independent for the former than for the latter. One speaks of a “pre-correlation” of researchers sharing the same field of interest.
Regarding research groups, it is also important to address correlated replicators. It is assumed that researchers belonging to the same research group intercorrelate more highly. The correlation between students and their major professor may be the highest of all, attributable to selection and training. Selection refers to a student’s choice of a field of interest in which they work with an investigator sharing those interests. Training refers to the training by the major professor, which often dominates the amount of training given by other instructors.
Correlated replicators can be classified into a direct correlation of the investigators’ attributes and an indirect correlation of the data the investigators obtain through working with the participants. The idea is an old one: Karl Pearson (1902) already noticed a higher correlation between judges. The replications with the greatest independence from correlated replicators, time, expectancies, and personal and physical contact are the ones that, when they produce consistent results, provide the maximum amount of information and conviction.
Validity Criteria in Assessment
Validity is a criterion for whether a test measures what it claims to measure. We will focus on content-related validity, criterion-related validity, and construct-related validity.
We have content validity if the items test what they are supposed to test. Judges are often used for this assessment, but the evaluation is usually global rather than quantitative, so the assessment yields subjective judgments. If a university exam has content validity, for example, the test reflects the material to be studied very well.
Criterion-related validity refers to the relationship between a test and a certain criterion, a strong correlation meaning that criterion validity is present. It is hence possible to use a test to make predictions about a certain criterion, such as success in a future job.
Concurrent validity means that the criterion we measure with a test is not in the future but already present now. This form of validity illustrates the sometimes thin line between the practical assessment of validity and of reliability, as it is debated whether concurrent validity actually gauges reliability rather than validity.
Predictive validity means that the test predicts a future event or behavior, such as the SAT, which predicts how students will perform in college. In the clinical setting, judges who assess some behavior in a pooled manner form the criterion. Such a criterion may not be very reliable by itself, but adding more judges increases the reliability of the pooled judgment.
Watch out when a measure correlates not only with its intended criterion but equally with criteria it should be unrelated to. This discrepancy should not be disregarded, as it indicates that the test fails to discriminate between different criteria.
Convergent and Discriminant Validity
With construct validity, the meaning of a specific test is determined. Evidence of construct validity comes from convergent and discriminant validity. Convergent validity is present when a test correlates with several other tests said to measure the same thing. In addition, a small correlation is desired between the test and other tests said not to measure the same thing (discriminant validity).
With the multitrait-multimethod matrix of intercorrelations, a better overview of the convergent and discriminant validities of a construct can be obtained. This method makes it possible to see whether shared methods of measurement, rather than convergent or discriminant validity, are responsible for high or low correlations. With the help of contrasts, combinations of convergent and discriminant validity can be made out.
The Rorschach has a criterion-related validity of r = .29 and the MMPI one of r = .30, both being multidimensional instruments. Compared to other instruments this is not very high, especially given the high validity previously attributed to these tests. On the other hand, Jacob Cohen (1988) argued that real-life criteria rarely correlate with personality measures above r = .30.
Test Validity, Practical Utility, and The Taylor-Russell Tables
Validity alone does not tell us whether a test is useful; in personnel selection another factor comes into play. This is the selection ratio, the proportion of the total number of applicants who will be chosen by the test. If the selection ratio is very high, even a high validity contributes little to personnel selection, whereas with a low ratio even a validity short of the highest still helps a great deal in the selection process. Selection is most accurate when we have a high validity coefficient and a small selection ratio, and the further these two move apart in that direction, the better the selection accuracy becomes.
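The logic of the Taylor-Russell tables can be illustrated with a small Monte Carlo sketch in Python (this simulation is ours, not the actual tables; it assumes bivariate-normal test and job-performance scores and a base rate of .50):

import random, math

def success_rate(validity, selection_ratio, base_rate=0.5, n=200_000, seed=1):
    # Proportion of "successful" employees among those selected by the test
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        x = rng.gauss(0, 1)                                                # test score
        y = validity * x + math.sqrt(1 - validity ** 2) * rng.gauss(0, 1)  # job score
        people.append((x, y))
    people.sort(key=lambda p: -p[0])                                   # select top test scorers
    selected = people[: int(selection_ratio * n)]
    cutoff = sorted(p[1] for p in people)[int((1 - base_rate) * n)]    # success threshold
    return sum(y >= cutoff for _, y in selected) / len(selected)

print(success_rate(0.50, 0.10))   # high validity, small ratio: well above the .50 base rate
print(success_rate(0.50, 0.90))   # same validity, large ratio: barely above the base rate

The same validity of .50 buys a large gain in selection accuracy with a small selection ratio and almost none with a large one, which is the point of the tables.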
Relationship of Validity to Reliability
No matter what the specific validity and reliability coefficients of a test are, it is important to know that a minimum level of internal-consistency reliability is not a prerequisite for an acceptable level of validity; Table 4.17 (p. 129, Research Methods) illustrates this point. The formula for the validity of a composite instrument (rnx,y) by J. P. Guilford (1954, p. 470) is:
rnx,y = rxy / √((1 − rxx) / n + rxx)
Three factors are involved in this equation and they become more visible if the equation is rewritten like this:
rnx,y = rxy x √n x √(1 / (1 + (n − 1) rxx))
rxy= average validity of an individual item
rxx= average intercorrelation (reliability)
n= number of items or judges
The validity of the composite increases when rxy or n increases. Yet the more rxx grows, the smaller the positive impact of an increasing n becomes. A decrease of rxx combined with an increase of n makes for a dramatic increase in composite validity.
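These three factors are easy to explore numerically; here is a minimal Python sketch of the formula as reconstructed above (the values are invented):

import math

def composite_validity(r_xy, r_xx, n):
    # Guilford's composite validity: r_xy / sqrt((1 - r_xx)/n + r_xx)
    return r_xy / math.sqrt((1 - r_xx) / n + r_xx)

print(round(composite_validity(0.20, 0.30, 10), 3))  # 10 items -> ~0.33
print(round(composite_validity(0.20, 0.30, 40), 3))  # more items help somewhat more
print(round(composite_validity(0.20, 0.80, 40), 3))  # high r_xx blunts the gain from n

As the last line shows, when the items are highly redundant, adding more of them barely raises composite validity.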
Observing, Classifying, and Evaluating
In this chapter we will explore how to make sense of qualitative data. Qualitative data consist of information gathered from people’s observed behavior or words, such as interviews, in contrast to the numerical data yielded by quantitative research. Qualitative data can also be quantified, for example when behaviors are sorted into categories along dimensions whose descriptions were fixed beforehand. The focus of this chapter is the selection of the most accurate judges, whereas the previous chapter dealt with selecting the most cost-effective judges.
Observing While Participating
Research taking the form of participant observation is often used by the social scientists called ethnographers, who study cultures. In this form of investigation the scientist participates in the culture while observing it; the informants are the people of the culture being studied. Ethnographers feel they have an advantage over scientists using other methodologies because, by participating in the culture, they experience the people’s emotions first hand. Even though there are unobtrusive forms of observation, the informants may feel their privacy invaded once they know observations are being made.
Maximizing Credibility and Serendipity
In qualitative research the scientist also has to report the procedures used, along with descriptions of them. The sense-making approach relies heavily on intuition, that is, on an intuitive interpretation of the qualitative data. Choosing the right time and place is very important for obtaining meaningful qualitative data, and often a portion of luck is involved as well.
In animal behavior research, either events or states can be observed, the latter describing longer-term occurrences. For a set of events, Altmann proposes that observation contributes information about the frequencies of a behavior. Behavioral events can be sampled in different ways. In time sampling a specific time frame is set, and within that frame any observations of interest are made and noted.
In behavioral sampling the researcher fixes several periods of observations for a behavior that occurs continuously and watches out for the occurrences within these periods.
In qualitative research, ethnographers have adopted a system for the field notes they take. They note whether a direct quote was taken down or a paraphrase. A condensed account means that single words or fragments were noted down; the expanded account is the condensed account filled in later on.
Organizing and Sense-Making in Ethnographic Research
If these points are answered, a more organized and clearer picture emerges of the qualitative data:
1. Purpose of activities
2. Procedures used
3. Time and space needed for the activities
4. How many participants, who they are, and what their roles are
Other approaches are the day-in-the-life approach, in which a whole day is described, or organizing the data around a problem. Key events can also help in the organizing process, and, just as in a mystery story, the outcomes can be left for last. “Analytical serendipity” describes the attempt to derive theoretical sense from the data. It is important that stereotypes do not influence this process, so well-documented conclusions are necessary for the sense-making. A researcher should also not only be open to expected experiences but embrace the unexpected. Furthermore, metaphors can be used to conceptualize the activities. Lastly, interpreter and observer biases have to be attended to so they can be avoided.
Interpreter and Observer Biases
Noninteractional artifacts comprise the interpreter and observer biases: here the systematic error stems from the scientist’s own mind rather than from interaction with the participants. There are thus two types of noninteractional artifacts, the interpreter bias and the observer bias.
When data are interpreted, systematic errors may be made, resulting in interpreter bias. When other researchers check a colleague’s work, they can control for interpreter bias in the results. A study by Sherwood and Nataupsky showed how different biographical characteristics of researchers led to differences in the interpretation of findings in a study of IQ and racial differences.
With observer bias the error occurs not during the interpretation of the data but while the observations are made. What we think we see can therefore strongly influence the research, mostly in favor of the hypothesis. A good illustration is that of René Blondlot at the beginning of the 20th century, who thought he had discovered “N rays”, rays that supposedly increased the intensity of reflected light. R. W. Wood proved him wrong by showing that N rays were merely the creation of humans fooled by their own imagination, but for a while people did believe in the existence of this special ray.
Unobtrusive Observations and Nonreactive Measurements
Next to noninteractional artifacts there are interactional artifacts, which the researcher cannot control but which directly influence the participants’ reactions. Furthermore, there are reactive and nonreactive measurements affecting behavior. If simply measuring the starting speed of a participant in a word-ordering task pushed the participant to work faster, that would be a reactive measurement. Nonreactive measurements entail the use of archives, physical traces, and unobtrusive observation, which are discussed below.
When information is recorded, it is put into archives from which researchers can retrieve it. We distinguish between running records on the one hand and personal documents and episodic records on the other, the two subcategories of archives.
Actuarial data, records of sales, mass-media information, industrial or institutional information, and information from the political and justice systems, to name a few, are all running records. Diaries, letters, photos, visual representations (e.g., in books or postcards), and logged rumor-control-center calls belong to the subcategory of personal documents and episodic records.
Archival Data and Content Analysis
With content analysis, pictures and spoken and textual material can be placed into categories and then evaluated. The key point is using common sense and logic to decide what will be evaluated and how it will be classified. A study by Crabb and Bielawski, for example, had judges evaluate and classify 300 pictures from children’s books by the sex of the characters and whether they were using household or non-household tools; male characters more often engaged with non-household tools. Content analysis done by hand like this dates back to the beginning of the 18th century, and it was eventually framed as a model by Harold Lasswell around 1920.
Physical traces can indicate several measures, such as wear and tear, but also quantitative values, such as the simple counting of nose prints on a glass wall in a museum at the end of each day. Physical traces are observed and noted much the way a detective would work. In using physical traces we mostly employ nonreactive measurement and unobtrusive observation.
In Crosby et al.’s study, the helping behavior of people in different cities was assessed; the confederates who “needed help” were either local people or foreigners. The experiment determined whether locals or foreigners received more help in certain scenarios. This is also a form of nonreactive measurement, because the participants are aware of the confederates’ presence but not of their role in the experiment.
This experiment can also be termed contrived unobtrusive observation because manipulation was involved.
Selecting the Most Appropriate Judges
When the researcher does not make the judgments or observations himself, as he would in observational studies, judges are selected to do the evaluation or categorization. Such studies are termed judgment studies.
The researcher seldom distinguishes among the judges within the type of judge selected; rather, he intuitively decides which judges of that type to take. Yet the most accurate judges are found not by randomly picking any judges of a type but by careful consideration, so the nonrandom approach can turn out more beneficial for the accuracy of the individual judges. With pilot studies, the accuracy of the potential judges against the criterion can be measured, and the judges with the most accurate ratings of the criterion are chosen from the pool.
Several techniques can determine the extent to which a participant is biased or accurate. A common observation is a positive correlation between bias toward a category and accuracy in that category, since a bias for one category influences the accuracy for that category. From Table 5.1 (p. 136, Research Methods), this correlation would be r = .93.
Choosing the Number of Response Alternatives
The researcher also has to decide how many response alternatives are best. With more response alternatives, it is easier to establish that a judge’s accuracy is above chance level. The statistical power for determining this is directly linked to the number of judgments employed.
Table 5.3 (p. 137, Research Methods) shows the minimum number of items or judgments needed, for a given number of response alternatives, for a judge’s accuracy to reach significance. Adding response alternatives to each item or judgment raises statistical efficiency when the number of items or judgments is small.
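A hedged sketch of the underlying binomial logic in Python (this is ours, not the book’s table, so the exact entries in Table 5.3 may differ):

from math import comb

def min_correct_for_significance(k, A, alpha=0.05):
    # Smallest number of correct judgments out of k items with A equally likely
    # alternatives for which P(that many or more correct by chance) < alpha
    p = 1 / A
    for r in range(k + 1):
        tail = sum(comb(k, i) * p ** i * (1 - p) ** (k - i) for i in range(r, k + 1))
        if tail < alpha:
            return r
    return None

for A in (2, 3, 5):
    print(A, min_correct_for_significance(20, A))

With 20 judgments, the more response alternatives there are, the smaller the fraction of correct answers a judge needs before chance becomes an implausible explanation.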
Effects of Guessing and Omission on Accuracy
We will now face the issue of the effect that guessing has on the accuracy of judges’ ratings and how this effect can be estimated. The number of response alternatives is crucial when evaluating the effects of guessing on a judge’s accuracy, since successful guessing inflates the apparent accuracy.
Correcting for guessing yields an adjusted score, R-adjusted, which is a function of the right responses (R), the wrong responses (W), and the number of response alternatives (A). R-adjusted is given by:
R-adjusted = R- (W / (A-1))
For example, if a test has three response alternatives per item and a judge got 60 of 100 items correct, the R-adjusted equation yields:
R-adjusted = 60 – (40 / (3-1)) = 40
where 40 is the adjusted score, that is, the score corrected for guessing.
To locate the zero point of the adjusted accuracy scores, note that R-adjusted equals zero exactly when:
R = K / A
K = total number of items (R + W), R = right items, W = wrong items
The sum of R and W gives K only when omitted items are counted as wrong.
Scoring omitted items as wrong gives them a score of zero, which is not obviously the best strategy compared with, say, crediting omitted items with the chance-guessing score, such as 0.20 with five response alternatives. Judges should omit items as little as possible, because they usually hold more knowledge than they believe they do.
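A minimal Python sketch of the correction and its zero point (the function name and numbers are ours):

def r_adjusted(right, wrong, alternatives):
    # Correction for guessing: R - W / (A - 1)
    return right - wrong / (alternatives - 1)

print(r_adjusted(60, 40, 3))   # 40.0, the worked example above
print(r_adjusted(33, 66, 3))   # 0.0: with K = 99 items and A = 3, R = K/A marks the zero point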
Intrinsic Factors and the Level of Accuracy
Not only extrinsic factors, such as the number of response alternatives, matter; intrinsic factors also play a role in how difficult it is for judges to make the best judgment or select the best item. These include the exposure time of the stimulus and its quality. In practice, most raw-score accuracies (R) lie between 0.70 and 0.85, with adjusted accuracies around 0.70. An average accuracy rate of 0.50 across judges would tell us little, as this equals the chance level of accuracy; an average accuracy rate of 1.00 (100%) would also not be good, as there would then be no differences in individual accuracy to use for assessment.
Next we will describe trait checklists, forced-choice judgments, and scaling for cumulative judgments, which are applied to categorical judgment studies.
Applications Of Categorical Judgments
One approach in personality assessment uses word lists presented to judges, who circle the traits they think fit a person, making a categorical judgment of “yes” or “no”, the uncircled items indicating “no”. A score is determined by adding the “likable” and “unlikable” traits, the former rated with a positive number and the latter with a negative one. For the negative ratings the absolute value counts, a high value indicating great unlikability of a trait.
In the forced-choice format, the judge is presented with characteristics that are equally favorable. This avoids the halo effect, in which, for example, a favorable overall impression colors every specific rating. The judge is thus forced to choose between items that are all favorable, or all unfavorable, to the same extent. Recent findings suggest the halo effect is not as common as was long thought.
Categorical Responses and Hierarchical Effects
“Interpersonal acumen” is the ability to discriminate between actions and the intentions behind them. The study in question built on the theory that there is a hierarchy of action-intention combinations running from low to high cognitive complexity. Differentiating complex action-intention combinations should therefore improve as a person’s interpersonal acumen increases, placing them higher in this hierarchy of differentiation ability. Judging by the study’s supporting results, the hierarchy model proved itself well in application.
Category Scales and Rating Scales
Continuous rating scales are used for dimensional judgments, which come in a numerical and a graphic form as well as a magnitude-scale format.
Contrary to what most people think, both classification and evaluation enter into all forms of judgment. The term “evaluators” was often reserved for judges using a rating system, whereas “classifiers” were thought of as judging behaviors by classification only. In fact, all judgments involve both category scales and rating scales.
Numerical, Graphic, and Magnitude Ratings
In all numerical scales, judges use defined numbers in a certain sequence. Such a scale could be, for example: How much do you like your classmates? with 1 being “very much” and 5 being “not at all”.
In the graphic format the subject places a check mark on a straight line that has one trait at the left end and its opposite at the right end. With the help of a ruler, the researcher measures where the check mark was placed. Dividing the line into equal segments allows for a numerical scale rating, also called a segmented rating scale.
Scale Points and Labels
Using more than 7 points in a rating scale brings little further improvement in reliability, for example when going from a 7-point to an 8-point scale. The end points of the rating scale should receive labels, also called anchors, which should be as precise and clear as possible; the points in between can also be labeled. Regarding the general design of rating scales, the highest number and the “positive” ratings are usually placed on the right.
In the magnitude scale format of psychologist S. S. Stevens, judges freely estimate the magnitude of a certain characteristic, usually on a range open toward the top. The value of one characteristic is fixed, and from there the subjects rate the magnitude of related characteristics.
Rating Biases and Their Control
Next to the halo effect there is another common rating error, the error of leniency: judges rate someone or something familiar better than what they are less familiar with. This can reverse when judges are aware of the error and therefore give the familiar a more negative rating than they normally would, which is called the severity error. One way to counter these errors is to stretch one end of the scale, such as the positive end: for example, we would have “poor” at one end and then several options for “good”, reaching up to “excellent”. Alternatively, two unipolar scales are provided, one for rating negative qualities and one for rating positive qualities.
The error of central tendency results when a subject only makes ratings close to the mean and avoids extremes in either direction. It is countered with the same strategy of stretching the positive side of the scale, so it is acceptable to include more points in the positive range than the researcher believes are strictly necessary.
The logical error in rating occurs when a rater mentally ties certain presented variables together because he believes they relate to each other in some way, and therefore rates them all more or less the same. Precisely formulated definitions and instructions can help reduce this error.
Bipolar Versus Unipolar Scales
Bipolar scales have one characteristic at one end and its opposite at the other end. Unipolar scales involve a single characteristic, with one end meaning that the characteristic is not present at all.
Forming Composite Values
Composite variables can be formed from dependent variables that correlate highly with one another; because of this high correlation they can be analyzed together, as separate analyses bring no benefit. Each variable entering the composite has to be standardized, and the mean of the standardized scores (z-scores) replaces the previous separate scores. Large positive z-scores indicate a high score on the composite variable, whereas large negative z-scores indicate a low one. When several composite variables occur in one context, each composite must itself be restandardized, as the averaged scores do not themselves follow a z-score distribution.
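A minimal Python sketch of this procedure (the three variables and their values are invented):

from statistics import mean, stdev

def zscores(values):
    # Standardize a variable to mean 0 and (sample) SD 1
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

var1 = [10, 12, 14, 16, 18]          # three correlated dependent variables,
var2 = [100, 110, 125, 130, 160]     # one value per subject
var3 = [3.1, 3.4, 3.3, 3.9, 4.2]

composite = [mean(triple) for triple in zip(*map(zscores, (var1, var2, var3)))]
composite_z = zscores(composite)     # restandardize before comparing composites
print([round(c, 2) for c in composite_z])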
Benefits of Forming Composite Variables
The estimates a composite variable provides of its relationships with other variables are more accurate than those provided by the individual variables left uncombined. Creating composites also facilitates the interpretation of significance levels.
Forming Composites and Increasing Effect Size
Composites can also be formed to estimate the magnitude of the effect size of interest. Correlational indices (r) are of great utility as effect-size measures. By creating composites we obtain the new correlation rcomposite, the product of a multiplier (m) and the average individual r. Hence we first need to find the multiplier m:
m = √(n / (1 + ryy (n − 1)))
n= number of variables going into composite
ryy= mean intercorrelation among variables going into composite
The new r is then estimated by:
rcomposite = rindividual x m
The typical effect size of the individual variables is multiplied by m.
Only when the individual variables correlate perfectly does forming a composite yield no benefit. The effect size rcomposite grows as more variables enter the composite, and the increase is greatest when the mean intercorrelation among the individual variables (ryy) is low. Individual variables are homogeneous when each shares the same amount of relation with the criterion variable while predicting a different part of that criterion.
By rearranging the rcomposite equation, we can solve for rindividual if that is what we are looking for instead: rindividual = rcomposite / m
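In Python, with invented values:

import math

def multiplier(n, r_yy):
    # m = sqrt(n / (1 + r_yy (n - 1))), as reconstructed above
    return math.sqrt(n / (1 + r_yy * (n - 1)))

r_individual, n, r_yy = 0.20, 5, 0.40
m = multiplier(n, r_yy)
r_composite = r_individual * m
print(round(m, 3), round(r_composite, 3))   # m ~ 1.387, r_composite ~ 0.277
print(round(r_composite / m, 3))            # inverting recovers 0.20

Lowering r_yy (less redundant variables) raises m and hence the composite effect size, matching the point made above.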
Forming Multiple Composites
It is better to form more than one composite when the intercorrelations among the dependent variables are not homogeneous; their large variability would not make for a good single composite. Table 5.8 (p. 155, Research Methods) shows a correlation matrix from which variables with similar correlations can be combined into composites. In this example, D and E would make a good composite and A, B, and C another, with a certain independence between the two composites.
The intra/intermatrix displays the mean correlations; we find it in part C of Table 5.8 (p. 155, Research Methods). There is an intracomposite average for each composite, and where composites I and II meet in the chart we have the intercomposite average, which represents the relationship of the individual variables of the first composite with those of the second.
Methods such as clustering, principal components analysis, factor analysis and dimensional analysis can help in building composites when the number of variables gets larger.
Quantifying the Clarity of Composites
To measure how much success the formation of composite variables brought, we use the r method and the g method. The range-to-midrange-ratio method is a third method, used when the basic ingredients needed for the other two are lacking.
The r Method
The point-biserial correlation correlates, via Pearson’s r, one continuous variable with one dichotomous variable. With this method, the point-biserial correlation is computed between the continuous variable and the entries of the intra/intermatrix coded 1 versus 0. From Table 5.10 (p. 157, Research Methods), which holds the mean correlations of the intra/intermatrix of Table 5.9 part B (p. 157, Research Methods), we can read off the average internal consistency of the individual composites and compare it with the average correlation between the individual ingredients of the composites themselves.
The g Method
Hedges’s g is a commonly used index of effect size, usually defined as (M1 − M2) divided by S, the square root of the pooled unbiased estimate of the population variance. Applied to composites, we compute g to quantify the clarity of the composites:
g = (rintra − rinter) / Saggregated
Here M1 − M2 is replaced by rintra − rinter, the former being the mean correlations located on the diagonal and the latter those located off the diagonal. Saggregated represents the S resulting from combining the diagonal and off-diagonal values of r. With a g of 0.50 or more, or an r of .25 or more, we can speak of clarity of composites, meaning that the composites differentiate.
The Range-to-Midrange Ratio
If the only information we have is the mean correlations on the diagonal (intracomposite mean rs) and off the diagonal (intercomposite mean rs), we use the range-to-midrange ratio:
rmr = (rintra − rinter) / [(rintra + rinter) / 2]
The raw difference rintra − rinter gains meaning when the mean in the denominator is not large. An rmr value of 0.33 or more is considered high enough to indicate clear differentiation of the composite variables.
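A sketch of the g method and the range-to-midrange ratio in Python (the intra/inter means are invented, and the way Saggregated pools the diagonal and off-diagonal rs follows our reading of the description above):

from statistics import stdev

r_intra = [0.55, 0.60]   # diagonal (intracomposite) mean correlations
r_inter = [0.10]         # off-diagonal (intercomposite) mean correlation

mean_intra = sum(r_intra) / len(r_intra)
mean_inter = sum(r_inter) / len(r_inter)

s_aggregated = stdev(r_intra + r_inter)              # combine all mean rs
g = (mean_intra - mean_inter) / s_aggregated         # g method
rmr = (mean_intra - mean_inter) / ((mean_intra + mean_inter) / 2)

print(round(g, 2), round(rmr, 2))   # both clear the 0.50 and 0.33 benchmarks here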
Experimentation in Science
Shadish et al. proposed four features that are part of an experiment:
- Variation in the treatment
- Outcomes are measured after treatment
- Observation is made on at least one unit
- A mechanism that shows what the no-treatment outcome would be
According to Shadish et al., experimentation is a systematic study design used to examine the consequences of deliberately varying a potential causal agent.
Randomized Experimental Designs
This type of design is typically employed as a between-subjects, or nested, design, meaning there is one condition per sampling unit. For example, a randomized experimental design would randomly assign a placebo to the control group and the real medication to the experimental group.
To overcome the ethical issue of withholding the benefit of the new medication from participants assigned to a placebo, we can use wait-list control groups. Even so, this form of control group should only be used when there is no better option than assigning that group the placebo. The wait-list control group is given the new drug as soon as its effectiveness has been shown.
Within-subjects designs, or crossed designs, are used when a participant judges several conditions, which adjusts for each participant’s individual baseline in making judgments. The order of the conditions judged may create a confound with the condition effect; with counterbalancing, this confound can be corrected for. The sequences of the conditions are rotated, creating a Latin square (Table 7.2, p. 192, Research Methods).
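A simple cyclic rotation of condition orders can be generated in a few lines of Python (a sketch of one common construction; fully balanced squares, such as Williams squares, need a different scheme):

def latin_square(conditions):
    # Each condition appears once per row (group) and once per column (serial position)
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

for row in latin_square(["A", "B", "C", "D"]):
    print(row)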
The factorial design is also frequently used; it requires at least two variables (factors), each with at least two levels, for example a 2 x 2 factorial design, which would be analyzed with a 2 x 2 analysis of variance. A 1 x 4 contrast is better if we are trying to predict a pattern across the means of all four conditions.
In fractional factorial designs, also known as fractional replications, only certain combinations of factor levels, rather than the full factorial, are run. Mixed factorial designs combine between- and within-subjects factors. An example of a mixed factorial design is when children from school A and school B take part in multiple treatments one after another, with measurements after each treatment: school (A vs. B) is the between-subjects factor, and the repeated measurement after each treatment is the within-subjects factor.
Characteristics of Randomization
Randomization rules out many sources of bias in experiments; as R. A. Fisher pointed out, it ensures the validity of tests of significance. With a large sample size it becomes unlikely that subjects in the treatment and control conditions differ before the random assignment takes place. Failures of randomization are cases where the subjects in the treatment condition differ considerably from those in the other condition even though no treatment has yet occurred.
One way of randomizing is to consult a table such as Table B.9 in Appendix B (Research Methods), pick a random starting spot, and take numbers from that column or row, having decided beforehand which numbers (odd/even) represent the experimental and the control group.
It is important that baseline scores are recorded for the subjects, which can be done with pretest measurements. These can indicate whether the control and experimental conditions differ so much from the start that a comparison of post-test scores alone would not make sense.
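In place of a random-number table, assignment is often done in software; a minimal Python sketch (the subject IDs and pretest scores are invented):

import random
from statistics import mean

random.seed(42)                        # fixed seed makes the assignment reproducible
subjects = list(range(1, 21))
random.shuffle(subjects)
treatment, control = subjects[:10], subjects[10:]

pretest = {s: random.gauss(50, 10) for s in subjects}   # hypothetical baseline scores
print(round(mean(pretest[s] for s in treatment), 1))
print(round(mean(pretest[s] for s in control), 1))      # similar means suggest comparable groups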
The Philosophical Puzzle of Causality
Causal relations are relations of cause and effect. According to Aristotle, causation can be divided into four types: material (elementary composition), formal (the form or design), efficient (moving force), and final (purpose). The final cause, he said, was God’s final plan. Roger Bacon, in his 1267 work Opus Maius, argued that what reason teaches us can be demonstrated only by experimentation.
Revolutions in science occurred around the 16th and 17th centuries, such as Galileo’s support of the heliocentric theory of the universe, or Descartes’s theory of the world as a complex machine in which everything that exists has a cause. Newton added to Descartes’s theory by describing nature as uniform, and he further revolutionized science with his mechanical theory of the three axiomatic laws of motion. These insights untangled some of the inexplicable theories of medieval science.
Contiguity, Priority, and Constant Conjunction
According to David Hume causality is a product of the human mind, which is based on past experiences and the expectation that they will occur like that in the future again. He proposed that motion follows laws, whereas causality derives from the conditioning of sensory repetitions. His eight rules describing cause and effect relationships follow three basic ideas:
- Contiguity: Cause and effect are connected near each other by a link in time or space
- Priority: The effect takes place after the cause
- Constant conjunction: A specific cause is always responsible for the same effect
Nevertheless, a distinction has to be made between events that are coincidentally linked and events that are causally linked. But the question remains what defines causality: is prediction the crucial element of causality, or a mechanism that would account for the connection of events?
Four Types of Experimental Control
John Stuart Mill (19th century) contributed to a marked change in the empirical strategy, by making use of control conditions. The term control in experimental research takes over multiple functions.
Control originally meant “check”, as in verifying (Boring, 1954). In 1893, “control” was used to refer to a standard of comparison, which basically means keeping the conditions constant. The term control series is used in experimentation as well, meaning a control condition in which variation occurs. Furthermore, the term behavior control describes how systematic reinforcement is used to produce behaviors, hence forming learned behaviors.
Mill’s Methods of Agreement and Difference
Four methods of experimental inquiry by John Stuart Mill (1843) summarize why a control group or a condition that yields a comparison should be used in experimental research that is randomized and controlled.
The method of agreement is best described by “If X, then Y”, where X represents the presumed cause and Y the presumed effect. X being solely responsible for bringing about Y makes it a sufficient condition: X has the capability of bringing about Y.
The method of difference is described by “If not-X, then not-Y”, which focuses on the idea that Y would not occur in the absence of X. This gives rise to the term necessary condition.
Both the sufficient and the necessary condition can be present, for example when we would come to the conclusion that virus ABC was necessary and sufficient to bring about the disease.
Between-Group Designs and Mill’s Joint Method
The method of agreement “If X, then Y” refers to the experimental condition and the method of difference “If not-X then not-Y” refers to the control condition. Together they are called the joint method of agreement and difference. This method can be applied to many cases. Yet, some cases often need other methods next to the joint method of agreement and difference to determine causality to its full extent.
In order to separate a placebo effect from e.g. the true effect of a medication we can add another control condition or take a different one. It can also be helpful to set up several stipulations regarding the status of the participants before the treatment.
Independent, Dependent, and Moderator Variables
One remaining issue in research is choosing among control groups when several are available; the kind of control condition itself must be determined, and it is up to the researcher to choose the best control group. Not choosing the best-matching control group can turn out troublesome, as wrong conclusions will be drawn from the research.
There is no guarantee of a perfect cause-and-effect relationship, owing to the many sources of variability. One source is moderator variables, which can introduce changes in the variables of the cause-and-effect relationship. Mediator variables, representing certain intervening states, influence the relationship of the independent variable with the outcome variable by carrying part of the causal process.
In the social sciences, cause and effect are nothing more than the independent and dependent variable. Causal relations were renamed functional relations and functional correlations; in research we speak not of a causal effect but of the effect X has on Y. Variables being sets of categories, we need to differentiate the independent from the dependent variable with regard to the framework we are working in. The context is crucial for determining which variable is termed independent and which dependent, as a change in context could flip the roles of the variables.
Solomon’s Extended Control Group Design
While the concept of installing control groups became very popular in the early 20th century, Richard L. Solomon, an experimental psychologist, was the first one to address the issue of the potential sensitizing effect that pretesting of subjects could induce.
This problem can be avoided not with two-group designs but with three-group or four-group designs. Subtracting the combined improvement of Control Groups I and II from the improvement of the experimental condition yields the positive or negative effect of pretest sensitization.
There are many studies illustrating the positive and negative effects of pretesting. Therefore it is advised to watch out for potential moderating effects that the pretest might create.
Threats to Internal Validity
We will now discuss four types of validity that can fall victim to violation by certain variables or circumstances: internal validity, external validity, statistical conclusion validity, and construct validity.
Internal validity, next to covariation and temporal precedence, has a major influence on causal inference. As stated by Shadish et al. (2002, p. 38), internal validity refers to “the validity of inferences about whether observed covariation between A (the presumed treatment) and B (the presumed outcome) reflects a causal relationship from A to B as those variables were manipulated or measured.” (p. 210, Research Methods).
Several factors can threaten internal validity: history, maturation, instrumentation, and selection. Regression, or regression toward the mean, can also threaten internal validity.
Regression toward the mean
Regression toward the mean concerns predicted or standard scores, the z-scores representing those standard scores, under an imperfect correlation of X and Y. If a sample mean is obtained from a pretest and another from a post-test, the post-test mean may lie closer to its population mean than the pretest mean lies to its own. Whenever the correlation is not perfect, regression toward the mean remains an issue to watch for.
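In standard-score form the effect is easy to see; a tiny Python sketch (the correlation value is invented):

r = 0.60                            # imperfect pretest-posttest correlation
for z_pre in (2.0, 1.0, -1.5):
    print(z_pre, "->", r * z_pre)   # predicted posttest z = r * pretest z, closer to 0

The most extreme pretest scores are predicted to regress the most, which is exactly what a naive pretest-posttest comparison can mistake for a treatment effect.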
Studies expressed as X-O are studies where a group is exposed to a variable (X) and possible effects are then observed (O). O-X-O adds an observation or measurement installed before the group experiences the intervention.
History is a threat to internal validity in which some event takes place between pretest and post-test that could alter the post-test outcomes. The Solomon design, in which two groups receive both pretest and post-test, uncovers possible history biasing the post-test results.
Maturation describes intrinsic changes in the participant, such as greater patience or the acquisition of more knowledge, that take place between pretest and post-test. By having a group that does not experience the treatment we can see how growing older might produce maturation.
If the measuring instruments undergo an intrinsic change, we face the threat to internal validity called instrumentation. The measuring instruments might be judges, for example, who get better over time at rating a specific characteristic. Instrumentation poses no problem in the X-O design, unlike the O-X-O design, because only a single judgment or observation is made.
Selection occurs when participants differ considerably from one condition to another. Measurements made before the treatment show how the participants differ across conditions. Again, the Solomon design solves this problem by randomizing participants into conditions.
Threats to External Validity
External validity means that results are generalizable, or representative: the cause-effect relationship applies to different people, in different environments, and when measurement and treatment variables are varied. External validity has three subtypes:
Statistical generalizability refers to the question: can the results be applied to a wider population of interest?
Conceptual replicability (robustness) is most similar to Campbell and Stanley’s (1966) idea of external validity.
Realism, or mundane realism, refers to the question: how likely would analogies of this experimental treatment appear in natural settings?
A further distinction between mundane realism and experimental realism was proposed by Aronson and Carlsmith; experimental realism concerns how the experimental manipulation psychologically impacts the participants. Lynch (1982) postulated that an implicit or explicit model of the investigated behavior is necessary for making judgments of external validity.
Before concluding that an experiment lacks the necessary external validity, Mook (1983) suggests asking oneself whether the aim of the investigation is to estimate population characteristics from sample characteristics, or whether the investigation is focused on the sample and on testing a theory of behavior.
Furthermore, it should be asked whether the study deals with a universal principle of behavior, so that the laboratory results apply to real life, or whether the laboratory results are restricted to the controlled conditions.
When the subjects and the experimental stimuli both represent the target population, we can speak of a representative research design. Experiments conducted under this type of design are ecologically valid.
Moderator variables can lead to a wrong picture of external validity with regard to the inferences made about causality. In the convenience samples of experimental psychology, samples that are easy to access such as first-year psychology students, moderator variables can be identified by having all the students take personality tests and fill in basic demographic information; correlations between this information and the total scores can then be computed.
Around 1940, two researchers disagreed about theories of the nature of learning. The first, Clark L. Hull, believed in a step-by-step improvement of animals responding to stimuli. Edward C. Tolman was of the opinion that cognitive functions are the center of learning, with one event leading to another to create cognitive “maps”; this is also called purposive behaviorism, expectancy theory, or sign-gestalt theory. It goes against an automatic view of learning and argues instead for learning through exploration, which would not always be a continuous process.
Coming back to convenience samples, the issue commonly faced is that a convenience sample often does not represent the population in mind. Besides the difference in the theories of Tolman and Hull, there was also a difference in the types of rats they used for their experiments. Knowing this, the theories are questioned not so much in their logic or internal consistency as in their external validity concerning generalized causality.
Statistical Conclusion Validity and Construct Validity
We have discussed internal validity and external validity, and will now face construct validity and statistical conclusion validity.
Statistical Conclusion Validity
Statistical conclusion validity focuses on the correlation between the independent and dependent variable and what can be inferred about it. With decreasing statistical power we are more likely to make a Type II error, which threatens our statistical conclusion validity. It is further threatened if assumptions are violated, if the tests or measurements are unreliable, or if the effect size is estimated only vaguely.
Construct validity addresses the question of which construct is really being assessed: whether the proposed concept of how X and Y relate to each other is really how they relate. Verifying a concept remains difficult, though, since on Popper’s falsificationist view the search for falsification of a theory never ends. So while construct validity focuses on the concept being measured, internal validity is concerned with eliminating any factor other than the tested independent variable X that could account for the outcome Y.
Review Of All Four Types of Validity
Researchers can be misled in the process of inferring causality when construct validity and internal validity are low. When statistical-conclusion validity or external validity is low, limitations arise in making any inferences of causality at all.
Subject and Experimenter Artifacts
Artifacts can threaten all four types of validity: factors other than the ones taken into account are responsible for the findings. Subject and experimenter artifacts are systematic errors attributable to the subjects or the experimenter. Herbert H. Hyman stressed that ignoring an error is not the same as lacking error, because there is always random and systematic error in every research effort. With cognitive psychology emerging as a new area of psychology and the great demand for psychological research after World War II, the focus on subject and experimenter artifacts increased during the 1960s and 1970s.
Illustrations of Artifacts: The Horse Clever Hans
A famous illustration of an artifact is the horse Clever Hans, said to be intelligent and able to reason, answering the questions his owner asked with hoof taps. Only when psychologist Oskar Pfungst examined a performance of Clever Hans was it revealed that subtle body movements of the owner guided the horse’s tapping and that the horse actually had no reasoning ability.
Illustrations of Artifacts: The Hawthorne Effect
Another illustration of the influence of artifacts is the Hawthorne effect, an incidental finding from experiments conducted between 1924 and 1932 in a factory, examining workers’ productivity under different conditions such as brighter light or longer breaks. The researchers saw productivity improve, and it remained improved even when the workers no longer received the brighter light or longer breaks. This led to the conclusion that the workers’ mere knowledge of being studied increased their productivity, because it made them feel special.
Clinical psychologist Saul Rosenzweig proposed three sources that produce artifacts in research:
Observational attitude of the experimenter: The researcher’s attitude about the research can have a drastic impact on the observations he makes. In hard sciences such as chemistry, Rosenzweig argued, the researcher takes the effect of his own presence on the experiment into account, such as his own body temperature.
Participants who want to outguess the experimenter: These participants focus strongly on how their behavior is evaluated and consciously act in line, or out of line, with the research.
Personality influence and errors: Errors might result from the experimenter acting in a certain way. If a researcher acts very distant and cool because that is part of his personality, this might create a source of error in the experiment. Gestures, word choice, race, and sex can also be sources of error.
Rosenzweig proposed that, to avoid the motivational attitude error, the researcher should use some form of deception. But according to Rosenzweig it is never certain who the “true deceiver” in the experiment is: the experimenter or the subject.
Demand Characteristics and Their Control
Psychiatrist and psychologist Martin T. Orne noticed in post-experimental interviews that his research participants asked whether they had done anything wrong; essentially they wanted to know whether they had performed well with respect to what the researcher wanted to test. Leaning on Kurt Lewin’s (1935) term Aufforderungscharakter, Orne gave the name demand characteristics to the cues a researcher unwittingly gives participants before the experiment, which may lead them to act according to what the researcher seems to suggest. His own studies, such as the one involving “catalepsy” reactions, supported the theory of demand characteristics. Demand characteristics are hence a source of artifacts in experimental research.
The good subject effect is what Orne called the cooperative behavior his participants exhibited in the studies testing demand characteristics: subjects would complete even the most trivial tasks because they wanted to contribute to research, thus fulfilling the demand characteristics of the experiment itself. Controlling for the good subject effect, Orne argued, is not possible; one can only watch for demand characteristics in the experimental context and then be careful in interpreting the data.
With his quasi-control strategy, Orne hoped to counter the good subject effect. In this strategy, the participants at some point become “co-investigators” themselves, actively engaged in finding the true results. Several techniques put research subjects in the position of reflecting on possible confounding variables, which is the aim of the quasi-control strategy:
The subjects take part in post-experimental interviews, which puts them in the position of being their own quasi controls.
A sample of subjects does not take part in the experiment but describes what behavior they would exhibit; a comparison is then made between these descriptions and the actual behavior of those participating in the study. Orne termed this technique the preinquiry.
The sacrifice group consists of participants who are randomly told to leave the study and are asked how they perceived the experiment up to the point they were still in it. They might also be asked about their reaction to being deceived in certain ways.
A comparison is made of the difference in behaviors between subjects that volunteered and subjects that did not volunteer to participate in the study. The good subject effect is said to be more often displayed by volunteer subjects.
Evaluation apprehension, a term coined by Rosenberg (1965), differs from the good subject effect in that subjects experiencing it worry about being evaluated on some performance in the experiment. This leads them to want to look good, or favorable, in the eyes of the researcher. Subjects experiencing evaluation apprehension might find themselves in a dilemma over whether looking good is more important than furthering the research, as in the case of the good subject effect. A confidentiality agreement can help prevent evaluation apprehension.
Interactional Experimenter Effects
Having reviewed the noninteractional artifacts, which do not concern the interaction of experimenter and subject, we now examine the interactional experimenter effects, which do. Five classes of interactional experimenter effects are reviewed: biosocial attributes, psychosocial attributes, situational factors, modeling effects, and expectancy effects.
Biosocial attributes: Social and biological factors such as gender, race, and age might influence the interaction between the experimenter and the subject. One has to watch out, for example, that there is no difference in how male and female experimenters treat their subjects. The other way around, female and male subjects should also receive the same treatment from the experimenter. If the latter does not hold, this could lead to wrongly interpreting a significant difference between male and female performance in the study.
Psychosocial attributes: Not all experimenters are alike in personality or temperament, and this can contribute to subjects responding differently in individual studies. For example, a participant might experience evaluation apprehension when intimidated by the experimenter.
Situational factors: Situational effects entail any events occurring during the experiment that the experimenter becomes aware of, such as seeing the first results. They also entail the experimenter’s experience in conducting the study and familiarity with the research subjects. These artifacts could lead subjects to behave and respond differently. By dividing the data into time blocks, it can be determined whether results differed markedly at different points in time during the study.
Modeling effects: If an experimenter influences the subjects to behave in an experiment according to how he would behave, or how he behaved in a previous test trial, we speak of modeling effects. These can become evident in the researcher’s attitude or opinion, and can lead subjects to behave in line with the experimenter’s behavior or opinion, or to do the opposite, acting and responding out of line with it.
Expectancy effects: If the experimenter holds certain expectations about how the study will turn out, this can push the subjects’ responses in that direction. The experimenter’s expectations may then create self-fulfilling prophecies. The result is the experimenter expectancy effect: the experimenter treats the subjects in ways that elicit responses in line with his expectations.
Experimenter Expectancy Effects and Their Control
Expectancy effects on research subjects are illustrated by the experiments in which rats of the same breed were given to different research groups. One group was told that they had received bright rats specially bred for running mazes; the other group was told that their rats were dull at maze running. Due to the different expectations of the researchers, and the impact these expectations had on the two experiments, the results differed markedly: the group given the supposedly bright maze-running rats obtained results showing far better learning than the group told their rats were dull.
Numerous studies conducted after the striking findings of the rat maze-running experiments supported experimenter expectancy effects, both inside and outside the laboratory.
Table 7.9 (p.227, Research Methods) displays six reduction strategies of the experimenter expectancy effect:
1. Number of experimenters is increased
2. The behavior of experimenters is monitored
3. Experiments are analyzed for order effects
4. “Blind” contact should be maintained
5. As little contact as possible between experimenter and subjects
6. Use of expectancy control groups
The fourth point addresses the single-blind study and the double-blind study. In the single-blind study, the participant does not know in which condition, e.g. control or experimental, he has been placed; the assignment of participants to conditions is randomized. In the double-blind study, neither the researcher nor the subjects know who is assigned to which condition. Because these strategies are difficult to implement, they are used less often than would be desirable for controlling expectancy effects.
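As a concrete illustration, here is a minimal sketch (not taken from the text) of randomized assignment in which the conditions carry neutral code labels, so the experimenter who runs the sessions stays blind to which condition is which; the participant labels and group size are assumptions.

```python
import random

random.seed(11)
participants = [f"P{i:02d}" for i in range(1, 21)]
conditions = ["X", "Y"]  # neutral code labels, decoded only after data collection

# Shuffle, then alternate labels: a random 10/10 split the session-runner cannot decode.
random.shuffle(participants)
assignment = {p: conditions[i % 2] for i, p in enumerate(participants)}
print(assignment)
```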
The sixth point addresses expectancy control groups, in which the behavior elicited from research subjects under different experimenter expectancies is compared. This technique allows discrimination between the effects of the independent variable and the effects that are actually due to expectancy.
The mediation model by Rosnow and collaborators proposes intervening in the process by which an artifact comes to life, beginning with its source. The model is less occupied with the various starting variables responsible for particular artifacts than with breaking the chain of causes that leads to the artifact being created.
McGuire (1969) describes the life of an artifact as passing through three stages: ignorance, coping, and exploitation.
In general, we can say that knowing about the various sources of artifacts has greatly increased our knowledge of the experimental setting. In addition, we can see more clearly where the limits lie in trying to understand behavior.
Nonrandomized and Quasi-Experimental Studies
When randomization is not an option in a given research setting, mostly with human subjects, we can use quasi-experiments, which approximate the conditions of a randomized experiment. Campbell and Stanley argued that only randomized experiments are “true experiments,” but also recognized that randomization is not always possible in research with human participants. Instead of measuring causality in nonrandomized experiments, we can measure association. Association implies covariation, but covariation does not equal causation.
Methodological pluralism, meaning that several empirical methods are used because each single one has limitations, is often applied to nonrandomized studies.
There are several types of nonrandomized strategies. We will focus on five: nonequivalent-groups designs, historical control trials, interrupted time-series designs, single-case studies, and correlational designs.
Diachronic research means that a variable is tracked over time in succession, while synchronic research means that behavior is measured in one session. Experimental studies mostly follow the synchronic style. With propensity scores, subjects can be matched across conditions when the researcher cannot control which condition the subjects are placed in. These scores are calculated by considering all the available information.
Nonequivalent Groups and Historical Controls
The nonequivalent-groups design uses observations made before and after the treatment. Because participants are not randomly assigned to the different groups, there are methods we can apply to limit how much the groups differ from each other and so allow for comparison. For example, after the experiment has taken place, participants can be randomly assigned to the existing groups. Regarding medical trials, this poses the issue of whether it is ethical to randomly assign participants and have some people in a control or wait-list control condition wait for, or not receive, treatment.
Historical Control Trials
To get around this ethical issue, we can use historical control trials, which consist of people who had the same disorder and received treatment for it in the past. Yet these trials show much more improvement than randomized control trials, which would suggest that the latter provided false-negative conclusions. But since the historical controls were doing worse than the randomized controls from the beginning, there was more room for improvement among the former. Bias from selecting only certain people to be historical controls might also occur, calling the interpretability of the research into question.
A commonly faced problem in clinical trials is that pooled effects (net effects) might mask the true effects on individuals. When only the pooled effects are reported, which is not uncommon in medical research, false conclusions might be drawn, leading, for example, to Simpson’s paradox: when another factor is added to the analysis of a bivariate statistical relationship, the findings may be the opposite of those obtained when the factor was not yet in the model.
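A small worked example may make the paradox concrete. The sketch below uses the classic kidney-stone figures purely as an illustration of how a pooled comparison can reverse the subgroup comparisons; the numbers are not from the text.

```python
# (treatment, stone size) -> (patients recovered, patients total)
recovered = {
    ("A", "small"): (81, 87),   ("A", "large"): (192, 263),
    ("B", "small"): (234, 270), ("B", "large"): (55, 80),
}

for t in ("A", "B"):
    for size in ("small", "large"):
        r, n = recovered[(t, size)]
        print(f"treatment {t}, {size} stones: {r/n:.0%}")
    r = sum(recovered[(t, s)][0] for s in ("small", "large"))
    n = sum(recovered[(t, s)][1] for s in ("small", "large"))
    print(f"treatment {t}, pooled: {r/n:.0%}")
# Treatment A is better within each subgroup, yet B looks better in the pooled data,
# because the subgroup factor (stone size) was left out of the pooled comparison.
```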
Interrupted Time Series and the Autoregressive Integrated Moving Average
When an intervention is made and measurements are taken systematically at specific time points before and after it, we are dealing with an interrupted time-series design. Time series implies that each point in time has a corresponding data point. The word interrupted is included because the beginning of the intervention is clearly marked, dividing it from the measuring done beforehand. In selecting a sampling interval, the researcher wants to capture the effects he is looking for: he has to obtain enough samples, but also not too many, so as to catch the most important data.
A standard analysis procedure here is the autoregressive integrated moving average (ARIMA) model. Using ARIMA involves three steps: a model is identified that could serve as the basis of the serial effects, the model parameters are estimated, and the model fit is checked. When a researcher proposes, say, three models for the underlying serial effects, all three go through these steps and the best-fitting one is chosen.
Autocorrelation describes the dependency between the data points (or observations). There is regular autocorrelation and seasonal autocorrelation: in the former, neighboring data points are dependent; in the latter, data points are dependent across cycles or periods.
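To make the three ARIMA steps concrete, here is a minimal sketch using Python’s statsmodels package; the simulated series, the intervention point, and the three candidate orders are illustrative assumptions, not the text’s own example.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0, 1, 100))   # drifting baseline series
y[50:] += 5                            # hypothetical intervention effect at t = 50

# Regular (lag-1) autocorrelation: neighboring observations are dependent.
lag1 = np.corrcoef(y[:-1], y[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.2f}")

# Step 1: identify candidate models for the serial effects (orders are assumptions).
candidates = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]
# Steps 2 and 3: estimate each model's parameters and check its fit.
fits = {order: ARIMA(y, order=order).fit() for order in candidates}
best = min(fits, key=lambda order: fits[order].aic)
print("best-fitting order by AIC:", best)
```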
Single-Case Experimental Designs
Several observations can also be made on a single research unit, typically one subject, in what is called a single-case experimental design; the experimenter controls the intervention. This design is mostly occupied with causal reasoning in everyday situations. For example, if a doctor gives a patient an injection after the patient stepped on a rusty nail, the injection cannot be made into the same foot, as the swelling from the injection could not be distinguished from the effects of the rusty nail.
Just as in interrupted time-series studies, several observations are made before and after the intervention, but interrupted time-series studies require many more data points in succession, which is what allows the use of ARIMA. The single-case experimental design, in contrast, uses visual inspection and therefore relies on effects pronounced enough to be detected on a graph, for example.
A baseline of the subject’s behavior is determined before the intervention takes place. This matters because individual effects are often erased through averaging. The A-B-A design, also known as the reversal design, represents this scheme, with A being the baseline observation and B the intervention/treatment. There are variations such as A-B-A-B or A-B-BC-B, with C representing a different treatment than B. This way it can be determined whether B and C together work better as treatments than B alone.
An advantage of single-case experimental designs is that they are low in cost, since one subject can serve in both the control and the treatment condition. Another advantage is that they measure effects at several time points throughout all stages of the experiment rather than obtaining only one cross-section of the effect at a single point in time. Yet experiments with this type of design take a lot of time, and the question remains how generalizable the results are when only a minimal number of research subjects participated.
Cross-Lagged Correlational Design
In the cross-lagged panel design, variables are measured at two or more points in time, so that some correlations between a variable and an outcome measure involve a temporal lag. Panel study, a term from sociological research, means that the research is conducted in a longitudinal manner. Longitudinal research follows the same participants across conditions over the time of the research and takes into account how they respond differently over that period.
Even if no correlation can be observed in the cross-lagged correlational design, we still cannot conclude that there is no causation, as the design might have overlooked a causal relationship.
There are three forms of paired correlations: test-retest correlations (rA1A2 and rB1B2), synchronous correlations (rA1B1 and rA2B2), and cross-lagged correlations (rA1B2 and rB1A2). These are displayed in Figure 8.1 (p. 243, Research Methods) and are represented by the notations in parentheses. Both variables, A and B, are measured at two successive points in time.
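A minimal sketch of computing the three kinds of paired correlations from simulated panel data follows; the variable names match the notation above, but the data-generating assumptions (for example, a cross-lagged path from A1 to B2) are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
A1 = rng.normal(size=n)
B1 = 0.4 * A1 + rng.normal(size=n)              # hypothetical synchronous link
A2 = 0.7 * A1 + rng.normal(size=n)              # test-retest stability of A
B2 = 0.5 * A1 + 0.6 * B1 + rng.normal(size=n)   # hypothetical cross-lagged path A1 -> B2

r = pd.DataFrame({"A1": A1, "B1": B1, "A2": A2, "B2": B2}).corr()
print("test-retest:  rA1A2 =", round(r.loc["A1", "A2"], 2), " rB1B2 =", round(r.loc["B1", "B2"], 2))
print("synchronous:  rA1B1 =", round(r.loc["A1", "B1"], 2), " rA2B2 =", round(r.loc["A2", "B2"], 2))
print("cross-lagged: rA1B2 =", round(r.loc["A1", "B2"], 2), " rB1A2 =", round(r.loc["B1", "A2"], 2))
```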
The best interpretations can be made if the correlations are approximately the same at each period. Temporal erosion describes the decrease in correlation that we might observe after a longer period of time has passed; attenuation is this reduction in the observed correlation.
Invisible Variables and the Mediation Problem
Path analysis is a non-experimental approach whose goal is to infer causation. Often the subject of controversy, it works by eliminating associations between variables in order to crystallize out the causal influence and exclude influences that are not the cause. The third-variable problem is the problem of the confounding variable: a hidden variable that is responsible for the apparent causal relationship between two variables by being correlated with each of them. Another problem in non-experimental studies that aim to infer causation is the mediator variable, a variable that intervenes between X and Y.
Path Analysis and Causal Inferences
In Figure 8.4 (p. 249, Research Methods) we can see how several causal pathways are proposed, each representing a hypothesis. These pathways are constructed from the various correlations of Figure 8.3 (p. 247, Research Methods). One pathway/hypothesis after another is then rejected, until the best fit that can explain the observed data remains. This method is illustrated by the study in which Eron and Huesmann investigated the relationship between violent TV and children’s real-life violent acts. Even though it is often difficult to pin down a one-to-one causal relationship, the direction of the data often makes a suggestion strong enough to attempt an inference of causation.
The Cohort in Longitudinal Research
When talking about longitudinal research, the issue of cohorts has to be addressed. A cohort consists of people who were all born in the same generation and have hence lived through similar historically and personally important life events at approximately the same time. Longitudinal studies make use of prospective data collection, as does a panel study. In addition, longitudinal studies also take in information about the past, meaning data are gathered retrospectively as well. Longitudinal studies are common not only in social and behavioral research but also in medicine and in human developmental research generally.
Different Forms of Cohort Studies
Table 8.2 (p. 252, Research Methods) shows clearly how cross-sectional designs differ from cohort designs when drawing conclusions about a phenomenon. Cross-sectional curves can be displayed in graphs, such as the one shown in Figure 8.6 (p. 254, Research Methods), which improves the analysis of an individual cross-sectional sample in the context of the others. The fallacy of period centrism occurs when results from one time period are generalized on the assumption that other time periods would yield the same results. That is why plotting by cohort is important whenever possible.
An age effect reflects the natural process of aging and concerns the average changes that come with it. When events take place in chronological order and are measured at their occurrence, this is a time-of-measurement effect. When we look at generations and measurements of a given generation (or cohort), we speak of a cohort effect.
When working with longitudinal or cross-sectional designs, we pay attention to three variables: age, period, and cohort. It is not possible to account for all three and their effects concurrently, which is why each of the following designs falls short on at least one of them:
In the simple cross-sectional design, subjects of different ages are observed at one point in time. The problem is that all people of the same age are treated as belonging to the same generation, confounding the participants’ age with their cohort.
When periodic observations of one cohort are made, we are using a simple longitudinal design. Yet historical events that occur along the way act as confounding variables on the results.
The cohort-sequential design allows the study of cohorts and the examination of age, but disregards time of measurement to a certain extent.
In the time-sequential design, neither the ages of the participants nor the times of observation are the same. Age and time are taken into account, while cohort is neglected to some extent.
The cross-sequential design does not fully account for age, but does account for cohort and time of measurement. Different cohorts are observed over several periods, with the first measurement made in the same period for all of them.
Even though no single design can answer all questions, we can combine designs, as methodological pluralism proposes.
Subclassification on Propensity Scores
Propensity scores are formed by looking at the variables of treated (manipulated) and untreated (observed) participants and combining all those on which the groups differ into one composite variable. This composite variable is hence a summary of the differences across all these variables, which are also called covariates.
By making subclasses, certain differences, such as age differences, can be adjusted for. With age differences, subclasses of similar ages would be formed and all conditions then compared within each subclass. All conditions must have the same number of subclasses, and the more subclasses there are, the more precise the analysis.
Multiple Confounding Covariates
If there is more than one confounding covariate, then one has to adjust not only for differences in, say, age but also for each further covariate. The propensity score is then estimated under the assumption that there are multiple confounding covariates: all of them are combined, yielding one single composite. A participant is put into either group 1 or group 2 depending on the probability, built on the participant’s covariate scores, of predicting group membership. Group membership, and not the confounding covariates, is then responsible for the differences in outcomes.
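A minimal sketch of the procedure under assumed data: a logistic regression (one common choice for estimating propensity scores, not necessarily the text’s) combines two confounding covariates into a single score predicting group membership, and subjects are then subclassified on that score. Very small or empty cells in the output would immediately reveal a lack of overlap, a point taken up below.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
age = rng.normal(50, 10, n)
severity = rng.normal(0, 1, n)
# Treatment assignment depends on the covariates, hence the confounding.
p_treat = 1 / (1 + np.exp(-(0.05 * (age - 50) + 0.8 * severity)))
treated = rng.random(n) < p_treat

# Combine the covariates into one composite: the propensity score.
X = np.column_stack([age, severity])
propensity = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Subclassify into quintiles of the score; compare treated/untreated within each.
df = pd.DataFrame({"treated": treated, "propensity": propensity})
df["subclass"] = pd.qcut(df["propensity"], 5, labels=False)
print(df.groupby(["subclass", "treated"]).size().unstack(fill_value=0))
```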
Adjustment for covariates can also be made with regression methods. A problem with using regression for this purpose is that insufficient overlap on the confounding covariates will not be signaled to us. In contrast, when using propensity scores we recognize a lack of overlap immediately, by observing very small subclass sample sizes that may even be zero.
A limitation of propensity scores remains that adjustment cannot be made for hidden confounding covariates, only for observed confounding covariates. Larger sample sizes are best for using the propensity score method.
Sampling a Small Part of the Whole World
When trying to find out something about a population, a sample is taken. These samples can be selected in multiple ways. In probability sampling, for example, every sampling unit has a known nonzero probability of being chosen, the units are drawn at random, and estimates are made using the probabilities.
It has been criticized that sophomores so often serve as research subjects, as they might be more prone to responding to demand characteristics in order to please their teachers. When only volunteers, often recruited from the internet, take part in questionnaire studies, this poses another threat to generalizing the research findings.
To ensure external validity, the probability sampling method is recommended, but as Abraham Kaplan noted, faith is also involved in probability sampling, which he termed the “paradox of sampling”: representativeness should not be attributed to the sample itself; it is the right procedure, the sampling plan, that allows us to speak of representativeness.
How selection of respondents takes place is defined by a sampling plan. Probability sampling allows randomness of the selection process. This increases the chance of obtaining a representative sample of the population to the extent that researchers can make assumptions about representativeness.
Bias and Instability in Surveys
In survey research, point estimates describe the central values of sampling distributions, and interval estimates describe variability. Interval estimates are often expressed as a margin of error, which indicates the lower and upper bounds of a confidence interval. Associated with the margin of error is the standard error (SE), which expresses the imprecision of the estimates. The term bias is interchangeable with systematic error and is present whenever the true population value and its estimate differ. If there is no such difference, we can speak of an unbiased sampling plan. There is more stability, or precision, when there is little variability in the measurements or observations.
With few observations and much variability among them, we have higher instability. We need fewer samples when the members of a population are homogeneous. Often it is falsely assumed that all subjects in convenience samples are alike. This can lead to false subgroup creation, such as simply dividing between male and female and assuming the members of each subgroup are alike.
Simple Random-Sampling Plans
In simple random sampling, the word simple implies an undivided population from which the sample is taken, and random refers to an equal chance for every population unit to be picked. Table 9.1 (p. 265, Research Methods) shows randomly generated digits; by blindly selecting a starting position and reading across the row, subjects can be chosen at random for the sample. When selected subjects are not put back into the pool to possibly be selected again, we are doing random sampling without replacement. This is common in survey research, because a subject usually should not respond twice. In random sampling with replacement, the likelihood of being chosen remains the same throughout the selection process for all units in the population.
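The two schemes are easy to express in code. A minimal sketch, with an assumed population of 1,000 numbered units:

```python
import random

random.seed(42)
population = list(range(1, 1001))

# Without replacement: a unit can be drawn at most once (usual in surveys).
without = random.sample(population, k=10)

# With replacement: every unit keeps the same selection probability on each draw.
with_repl = random.choices(population, k=10)

print("without replacement:", sorted(without))
print("with replacement:   ", sorted(with_repl))
```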
Improving Accuracy in Random Sampling
The sample size needs to be determined in order to estimate the mean of the true population. Overestimation or underestimation of population values occurs even under an unbiased simple random-sampling plan. The standard error also refers to the accuracy of the sampling plan and is obtained by taking the standard deviation of the estimate errors.
With stratified random sampling we can make our samples more accurate. It works by dividing the population into strata on the basis of available information, and then sampling randomly within each stratum. If the population is variable, we can randomly sample these strata, which are less variable than the population as a whole; this increases accuracy.
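A minimal sketch of stratified random sampling with proportional allocation; the two strata and their sizes are illustrative assumptions.

```python
import random

random.seed(7)
strata = {
    "urban": [f"urban_{i}" for i in range(600)],
    "rural": [f"rural_{i}" for i in range(400)],
}
total = sum(len(units) for units in strata.values())
n = 100  # desired overall sample size

sample = []
for name, units in strata.items():
    k = round(n * len(units) / total)       # proportional allocation per stratum
    sample.extend(random.sample(units, k))  # simple random sampling within stratum

print(len(sample), "units sampled;", sum(u.startswith("urban") for u in sample), "urban")
```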
Confidence Intervals for Population Estimates
For a 95 % confidence interval, we can be 95 % confident that our confidence interval includes our true population parameter. When we increase the confidence level (e.g. from 90 % to 95 %) our confidence interval gets larger, expanding the margin of error. We can also create a confidence interval for binomial population proportions (example: 90 % confidence interval):
90 % CI = P ± 1.64 · √(PQ/N)
where P is the estimate of the population proportion and Q = 1 − P. Here √(PQ/N) represents the standard error of the proportion (the standard deviation of its sampling distribution). Conversely, this equation can be used to calculate the sample size needed to achieve a certain margin of error at a given level of confidence.
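A minimal worked example of the formula, with assumed values for P and N, including the reverse calculation of the N needed for a desired margin of error:

```python
import math

z = 1.64   # z value for a 90% confidence level, as in the formula above
P = 0.60   # estimated population proportion (assumption)
N = 400    # sample size (assumption)
Q = 1 - P

margin = z * math.sqrt(P * Q / N)   # z times the standard error of the proportion
print(f"90% CI: {P - margin:.3f} to {P + margin:.3f}")

# Conversely: the N needed for a desired margin of error.
desired = 0.03
needed = math.ceil(P * Q * (z / desired) ** 2)
print("required N for a +/-0.03 margin:", needed)
```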
Speaking of Confidence Intervals
When we speak of the lower and upper bounds of a confidence interval and say that the population parameter lies in this range with a certain level of confidence, we are taking up the Bayesian approach to statistics. A different view is presented by the traditional, classical, sampling-theory, or frequentist approach, which focuses on the role of repeated sampling and repeated calculation in interpreting confidence intervals. This meaning is just as correct as the Bayesian meaning of a confidence interval.
Other Selection Procedures
Area probability sampling is a form of stratification in which the population is divided into area units, and the units that end up being selected are just as likely to be chosen as those that are not. The process takes place in stages and is therefore called multistage cluster sampling.
In systematic sampling, a selection interval separates the sampling units on a list, and the units are selected methodically. This method is often used when sampling can only be done manually and both sample and population are large. To do systematic sampling, a sampling interval and a random starting point are necessary. Nevertheless, the samples obtained with systematic sampling are not always exactly random.
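A minimal sketch, using an assumed list of 1,000 units: the interval k is the population size divided by the desired sample size, and the random start falls within the first interval.

```python
import random

random.seed(3)
units = list(range(1, 1001))
n = 50
k = len(units) // n          # sampling interval
start = random.randrange(k)  # random starting point within the first interval

sample = units[start::k][:n] # take every k-th unit from the start
print("interval:", k, "start:", start, "first five sampled:", sample[:5])
```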
Haphazard or fortuitous samples are most commonly used in nonrandom selection. These samples often do not represent the population of interest well as the sampling does not take biases or limits into account. For example, all sampled people might be alike on a certain characteristic, which would lead to wrong conclusions drawn from this unrepresentative sample. Informal polls are examples of haphazard sampling.
Quota sampling aims to recruit enough respondents of each kind that the sample is roughly in proportion to the population. In this type of sampling, participants with certain characteristics are sought out, and finding them can be tricky.
Nonresponse Bias and Its Control
If there are many non-responses to a survey, such as a telephone survey, this can itself introduce a bias, making the study less valid. The bias of the point estimate can be reduced when more effort is made to get the non-respondents to respond.
When a non-response bias is suspected in the selection of samples, it is often hard to estimate how great the bias is, given that we have no statistics for the non-respondents. In telephone surveys it is also hard to say whether the person on the phone was really not eligible to participate or simply did not want to, the latter counting as non-response.
To address the problem of non-response bias, Wainer proposed that a theory-based model could serve the selection process, describing not only the observations that were made but also those that cannot be made. According to Wainer, a model of this type was developed by Abraham Wald for his statistical projects in World War II.
Studying the Volunteer Subject
Problems similar to those of non-response bias also arise when only volunteers are studied. To handle this issue, the characteristics of volunteers and non-volunteers can be compared, using several approaches:
The first approach is simply comparing the subjects who participated in a study with subjects who did not participate in it.
In the second approach, non-volunteers are identified by recruiting volunteers from, for example, a specific college class. The whole class then completes the questionnaire, handed out by a person independent of the one who recruited the volunteers.
A third approach is to collect only volunteers and introduce a second questionnaire after the first. Those who do not want to take the second questionnaire are then considered non-volunteers and are compared with those who go on to take it. This might not make the best distinction between volunteers and non-volunteers, as the non-volunteers were actually volunteers to begin with.
A fourth approach is to keep asking non-volunteers to become volunteers, repeating the process several times. The characteristics of those who join after each successive request are recorded and compared with those of the non-volunteers who never became volunteers over the course of the repeated requests.
The fifth approach consists of making data points for the characteristics of those becoming volunteers at different latencies after only one request was made. This approach is most commonly used for survey research.
Characteristics of The Volunteer Subject
The characteristics of the volunteer subjects can be summarized by looking at the results of the studies the individuals participated in (Table 9.6, p. 279, Research Methods). Two further tables (Table 9.7 and Table 9.8, p. 281, Research Methods) indicate how confident we can be that certain characteristics are in association with volunteering, and make a distinction between the categories/levels: maximum confidence, considerable confidence, some confidence and minimum confidence. Under each category certain characteristics are listed. The cutoff scores that determine membership of volunteer characteristics in the different categories can be found in Table 9.8. Furthermore, conclusions for each category of confidence are made, which describe and sum up the characteristics of a volunteer under that certain category.
Implications for The Interpretation of Research Findings
Results cannot be generalized when only volunteer subjects are used, as there is either a positive or a negative bias, and we can predict the direction of this bias. Figure 9.2 (p. 284, Research Methods) shows the positive bias that results when only volunteer subjects are used. To minimize such a threat to generalizability, an increase in sample size is not enough; instead, recruitment has to bring in more volunteers, and probability sampling should be done as well.
In randomized experimental studies, a prediction of the direction of the volunteer bias should be possible. One has to take into account the volunteers’ baseline level on the variable of interest, and how it relates to the characteristics of volunteers that are present in the control condition regardless of treatment.
Keep in mind that sampling biases are usually not controlled for when working with randomized experimental designs. This is a threat to generalizability.
Situational Correlates and the Reduction of Volunteer Bias
Our confidence in a situational correlate of volunteering depends on the number of studies relating that situational correlate to volunteering, and on the proportion of those studies supporting the hypothesized direction of the bias. To qualify for the category some confidence, for example, three studies related to the situational correlate are needed, and all of them must support the relationship.
The same table as for the volunteer characteristics is created for situational correlates of volunteering. Table 9.9 (p. 286, Research Methods) displays these results. Furthermore, just like with the volunteer characteristics, conclusions are made for the individual categories of confidence, which are again maximum confidence, considerable confidence, some confidence, and minimum confidence.
These recommendations might look merely like ways to recruit more volunteers, but they also aid in making considerations and plans for the research.
The Problem of Missing Data
Missing data can be expressed as a proportion, where .00 means that all data were provided and 1.00 means that the person did not even come to the study. Missing data cause the problems of biased estimates and reduced statistical power. There are degrees of randomness in missing data:
MCAR (missing completely at random): there is no relation between the missingness of data and the variables of interest.
MAR (missing at random): there is a relation between the missingness of the data and the variables of interest, but it runs through other observed variables. Hence, the estimates might be biased, as a correlation is present.
MNAR (missing not at random): there is a relation between the missingness of the data and the variables of interest, and it is not completely due to other observed variables. These correlations cannot be explained by the observed variables and are not zero, which is why the estimates are biased.
Procedures For Dealing With Missing Data
Nonimputational and imputational procedures are used to handle missing data. Table 9.10 (p. 289, Research Methods) displays these procedures and the approaches that can be used with them. The difference between the two is that nonimputational procedures make parameter estimates without filling in the missing data, while imputational procedures make parameter estimates after the missing data have been filled in.
With pairwise deletion, all the data that are present are used to yield parameter estimates. This procedure is powerful, but only under the condition that the computation results in unbiased estimates, as with MCAR data. Listwise deletion is almost as powerful as pairwise deletion. A problem with both procedures is that the question of whether the data are MCAR can never be answered definitively.
Maximum likelihood estimation and Bayesian estimation provide unbiased estimates in combination with MAR data; the statistical model of the data set very much guides the computation of the results. Two general types of imputational procedures are single imputational procedures and multiple imputational procedures. Single imputational procedures fill in the missing data with estimates of the values and analyze this data set just like a full data set, apart from minor changes such as reducing the degrees of freedom.
The mean substitution procedure fills in missing values with the mean of the variable. The regression substitution procedure fills in missing values with the predicted value for that variable, computed from the cases with no missing data. Stochastic regression imputation is a refinement of regression substitution: accuracy is increased by adding a random residual term to the estimates. Hot-deck imputation matches cases with missing data to similar cases with complete data, and one similar case supplies the needed values for the incomplete case.
Multiple imputation represents quite a different approach, in which a set of m estimates is used to fill in each missing observation. This results in m pseudocomplete data sets, which are all analyzed in the normal way. The m analyses, when combined, give results that are less biased, and the variability estimates also gain accuracy.
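A minimal sketch contrasting mean substitution with a multiple-imputation-style procedure; the simulated data and the use of scikit-learn’s IterativeImputer with sample_posterior=True, run m times, are assumptions made for illustration, not the text’s own method.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
X[:, 2] += 0.8 * X[:, 0]                # correlated columns help the imputation
X[rng.random(X.shape) < 0.15] = np.nan  # introduce roughly 15% missing values

# Single imputation: mean substitution.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Multiple imputation: m stochastic imputations, each analyzed, then combined.
m = 5
estimates = []
for i in range(m):
    Xi = IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    estimates.append(Xi[:, 2].mean())   # the "analysis" here is just a mean

print("mean substitution estimate:        ", round(X_mean[:, 2].mean(), 3))
print("combined multiple-imputation value:", round(float(np.mean(estimates)), 3))
```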
NHST – Null Hypothesis Significance Testing
In psychological research, we use null hypothesis significance testing to find out whether the mean differences between groups in an experiment are larger than differences that are expected due to error variation. When carrying out null hypothesis testing, you first need to assume that there is no difference between groups, i.e. that the independent variable had no effect. This is called the null hypothesis. When you have made this assumption, you then have to test it using probability theory. Probability theory is used to determine how likely it is that we obtained the difference we did, assuming that the null hypothesis is true. If the probability is small, then you can reject the null hypothesis and conclude that the independent variable had an effect on the dependent variable. An outcome is statistically significant when it leads us to reject the null hypothesis.
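A minimal sketch of such a test, an independent-groups t test on simulated data; the groups, their means, and the alpha level are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
control = rng.normal(100, 15, 30)
treatment = rng.normal(110, 15, 30)  # hypothetical true effect of the IV

t, p = stats.ttest_ind(treatment, control)
alpha = 0.05                         # level of significance chosen in advance
print(f"t = {t:.2f}, p = {p:.4f}")
print("reject H0" if p < alpha else "fail to reject H0")
```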
The level of significance is the probability you choose to indicate whether an outcome is statistically significant. The most common level of significance chosen in the scientific community is 0.05, though 0.1 and 0.01 are also sometimes chosen, depending on the type of study. As a researcher, you must decide the level of significance before the experiment takes place. This is to avoid the temptation of choosing your level of significance based on the probability of your obtained results.
When using an inferential statistical test, there are two possible conclusions: rejecting the null hypothesis or failing to reject it. If a null hypothesis test is not significant, this does not mean you accept the null hypothesis; you can only fail to reject it. This is because the experiment may have involved factors that prevented an effect of the independent variable from being found. For example, a small sample is often one of the reasons why a null hypothesis is not rejected.
Statistical inference relies on probability and therefore there is always the chance of making an error. The two types of errors one can make are Type I Errors and Type II Errors. A Type I error is rejecting the null hypothesis when it is actually true. A Type II error is failing to reject the null hypothesis when it is in fact false.
Sensitivity in Experiments and Power
The sensitivity of an experiment is the likelihood of detecting an effect of an independent variable when the independent variable does actually have an effect. A statistical test on the other hand is discussed in terms of having power. Power is the probability in a statistical test of rejecting the null hypothesis when it is actually false. Power can also be calculated as 1 minus the probability of a Type II error.
There are three influential factors when calculating power in statistical tests. These are the sample size, the size of the treatment effect and the level of statistical significance. The sample size is the main factor used by researchers to control power. This is because of the huge influence that sample size can have in detecting effects. Very small effects can be detected if the sample size is big enough.
Repeated measures designs tend to have more sensitivity and higher statistical power than independent-groups designs, because in repeated measures designs the estimates of error variation are likely to be smaller.
Type II errors are the most common in psychological research. When statistical significance is not found, it does not mean that there is no effect. This is a reason why it should never be concluded that a null hypothesis is true, only that it failed to be rejected. It is also one of the reasons why it is important to obtain an effect size: the obtained effect can be compared with those found in other studies, which is the reason a meta-analysis is carried out.
To determine power before the study is conducted, you first need to estimate the effect size you anticipate in your experiment; for example, you can look up effect sizes found in previous studies of the independent variable you would like to examine. After estimating the effect size, you can consult power tables to find the sample size you will need in order to detect that effect. If you have a good estimate of the effect size in your study, it is strongly recommended that you carry out a power analysis prior to the research.
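A minimal sketch of such an a priori power analysis using Python’s statsmodels package in place of a power table; the anticipated effect size, alpha, and target power are illustrative assumptions.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # anticipated Cohen's d
                                   alpha=0.05,        # level of significance
                                   power=0.80)        # desired power (1 - beta)
print("required sample size per group:", round(n_per_group))
```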