Summaries per chapter with the 4th edition of Statistics: The art and science of learning from data by Agresti & Franklin - Bundle
What is statistics? - Chapter 1
Most professions nowadays rely heavily on statistical methods. In a competitive job market, insight into statistics and statistical methods offers an important advantage. But it is also important to understand statistics even if you will never use it in your work. Understanding statistics can help you make better choices, because you are bombarded every day with statistical information from news reports, advertisements, political campaigns, and surveys. A good understanding of the statistical reasoning - and in some cases statistical misconceptions - underlying these judgments will help you deal with all this information.
How to use data to answer statistical questions?
Data is the information we gather with experiments and surveys.
Statistics is the art and science of designing studies and analyzing the data that those studies produce. Its ultimate goal is translating data into knowledge and understanding of the world around us. In short, statistics is the art and science of learning from data.
Researchers want to investigate questions in an objective manner. Statistical methods make that possible. Statistical problem solving is an investigative process that involves four components:
Think of a statistical question.
Gather data.
Analyze the data.
Interpret the results.
Statistics has three main components for answering a statistical question:
Design: thinking of how to get the data necessary to answer the question.
Description: the obtained data needs to be summarized and analyzed.
Inference: making decisions and predictions based on the obtained data for answering the question. (Infer means to arrive at a decision or prediction by reasoning from known evidence).
Statistical description and inference complement each other. Description provides useful summaries and helps find patterns in your data; inference lets you make predictions and decide whether observed patterns are meaningful.
We need to think carefully about the questions that we want to answer by analyzing data. The nature of the statistical questions has an impact on design, description and inference.
The word probability is used to refer to a framework for quantifying how likely various possible outcomes are.
What is a sample?
Subjects are the entities that are being measured in a study. These can be people, but do not have to be.
All the subjects of interest are referred to as the population. In practice, we usually have data for only some of the subjects who belong to that population. This smaller set of subjects is called a sample. We plan to gather data from the sample, which is often randomly selected. It is more practical to get data for a sample, because obtaining data from an entire population is often too costly and time-consuming.
Descriptive statistics refers to methods for summarizing the collected data (where the data constitutes either a sample or a population). The summaries usually consist of graphs and numbers such as averages. The main purpose of descriptive statistics is to reduce the data to simple summaries without distorting or losing much information.
If we want to make a decision or prediction about an entire population, but we only have data for a sample, inferential statistics are used. Inferential statistics thus refers to methods of making decisions or predictions about a population, based on data obtained from a sample of that population.
Reporting the likely precision of a prediction is an important aspect of inferential statistics.
The absolute size of the sample matters much more than the size relative to the population total.
It is crucial to distinguish between the following terms:
Parameter: a numerical summary of the population.
Statistic: a numerical summary of a sample taken from the population.
Because the true parameter values are almost always unknown, we use sample statistics to estimate the parameter values.
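As an illustration of the parameter/statistic distinction, here is a minimal Python sketch. The population values are simulated and invented for illustration, not from the book: the population mean is the parameter, and the mean of a random sample is the statistic used to estimate it.

```python
import random
import statistics

# Simulated population of 10,000 exam scores (numbers invented for illustration)
random.seed(1)
population = [random.gauss(70, 10) for _ in range(10_000)]

mu = statistics.mean(population)         # parameter: the population mean
sample = random.sample(population, 100)  # random sample of n = 100 subjects
xbar = statistics.mean(sample)           # statistic: the sample mean

# In practice mu is unknown, and xbar is our estimate of it.
print(round(mu, 1), round(xbar, 1))
```

Here we can check the estimate against the parameter only because we simulated the whole population; in a real study, only the sample statistic would be available.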
A sample tends to be a good reflection of a population when each subject in the population has the same chance of being included in that sample. That’s the basis of random sampling, which is designed to make the sample representative of the population. Important to know:
Random sampling allows us to make powerful inferences about populations.
Randomness is also crucial to performing experiments well.
Samples do vary. The measure of the expected variability from one random sample to the next random sample is referred to as the margin of error.
Results are called statistically significant when the difference between the results for two condition groups is so large that it would be rare to see such a difference by ordinary random variation.
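Sample-to-sample variability and the margin of error can be made concrete with a small simulation (the population proportion of 50% and the poll sizes are invented): most sample proportions fall within roughly 1/sqrt(n) of the population value.

```python
import random
import math

# Simulate 200 polls of n = 1000 people from a population in which
# 50% hold some opinion, recording each poll's sample proportion.
random.seed(2)
n = 1000
props = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(200)]

# A rough margin of error for a sample proportion is 1/sqrt(n).
moe = 1 / math.sqrt(n)  # about 0.03 for n = 1000
inside = sum(abs(p - 0.5) <= moe for p in props) / len(props)
print(round(moe, 3), inside)  # most polls land within the margin of error
```

This is why polls of about 1000 people commonly report a margin of error of roughly 3 percentage points.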
How do you make use of computers for statistics?
MINITAB and SPSS are two popular statistical software packages on college campuses. The TI-83+ and TI-84 graphing calculators, which have similar output, are useful as portable tools for generating simple statistics and graphs. The Microsoft Excel software can conduct some statistical methods, sorting and analyzing data with its spreadsheet program, but its capabilities are limited.
Large sets of data are organized in a data file to make statistical analysis easier. This file usually has the form of a spreadsheet. It is the way statistical software receives the data.
Most studies design experiments or surveys to collect data to answer the questions of interest.
Databases are existing archived collections of data files. Sometimes it is adequate to use these databases to answer the questions of interest.
How to explore data with graphs and numerical summaries? - Chapter 2
Any characteristic observed in a study is referred to as a variable. The values of a variable vary across subjects. In a data set, the variables are usually listed in the columns, while the rows refer to the different observations. Observations are the data values that are observed; an observation can be a number or a category. Numerical values that represent different magnitudes of the variable are called quantitative. If a variable takes values in a set of distinct categories, the variable is called categorical. Sometimes numbers are used to represent categorical variables. These remain categorical variables, not quantitative ones, because the numbers do not represent different magnitudes of the variable.
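A tiny invented data set makes the distinction concrete: recoding a categorical variable with numbers does not turn it into a quantitative one.

```python
# Invented mini data file: each row is an observation (a subject),
# each column is a variable.
subjects = [
    {"height_cm": 170, "smoker": "no"},   # height_cm: quantitative
    {"height_cm": 182, "smoker": "yes"},  # smoker: categorical
]

# Recoding smoker as 0/1 does not make it quantitative: the numbers
# are only labels, not magnitudes.
coded = [0 if s["smoker"] == "no" else 1 for s in subjects]
print(coded)  # [0, 1]
```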
What roles do contingency, correlation and regression play in association testing? - Chapter 3
When data on two variables are analyzed, the first step a researcher has to make is to distinguish between the response variable and the explanatory variable. The response variable is also called the outcome variable; the explanatory variable is also called a predictor variable. If the explanatory variable is categorical, it defines the groups that are compared with each other; if the explanatory variable is quantitative, we examine how different values of this variable relate to changes in the response variable.
How do you gather data? - Chapter 4
Study design and data of good quality are crucial elements of statistical practice. This chapter discusses ways of gathering data that is useful and valid.
What role does probability have in our daily lives? - Chapter 5
In everyday life you have to make a lot of decisions based on uncertainty. In this chapter we introduce probability - the way we quantify uncertainty. You will learn to measure the chances of possible outcomes for random phenomena.
Researchers rely on randomness to make sure that there will be no bias in the data. Randomness also applies to the outcomes of a response variable. It also helps to make games fair: everyone has the same chances for the possible outcomes.
When you roll a die, it is said that you have a one-in-six chance of getting a 6 on any given roll. What does this mean? In a relatively short run, such as 10 rolls of a die, the cumulative proportion of 6s can fluctuate. But as the number of trials keeps increasing, the proportion of 6s becomes more predictable and less random. Jacob Bernoulli proved that as the number of trials increases, the proportion of occurrences of any given outcome approaches a particular number in the long run. This is known as the law of large numbers: the probability of an outcome is the proportion of times it occurs in the long run. With random phenomena, if something has not happened in quite a while, people feel sure it is due to happen soon; they tend to think that the probability of the event goes up until it finally happens. But this is not true. What happens on previous trials does not affect the trial that is about to occur: trials are independent of each other.
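The die-rolling example above can be simulated in a few lines of Python: the proportion of 6s fluctuates in a short run but settles near 1/6 ≈ 0.167 as the number of rolls grows.

```python
import random

# Law of large numbers with simulated die rolls
random.seed(42)

def proportion_of_sixes(n_rolls):
    """Roll a fair die n_rolls times and return the proportion of 6s."""
    return sum(random.randint(1, 6) == 6 for _ in range(n_rolls)) / n_rolls

short_run = proportion_of_sixes(10)       # can fluctuate a lot
long_run = proportion_of_sixes(100_000)   # settles close to 1/6

print(short_run, round(long_run, 3))
```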
What are probability distributions? - Chapter 6
In statistics, possible outcomes and their probabilities are summarized in a probability distribution. There are two sorts of probability distributions someone can use, namely a normal and a binomial distribution. The normal distribution is known for its bell-shaped form, and plays a key role in statistical inference.
If you use proper methods for gathering data in research, the numerical values that the variables take should be the result of a random phenomenon, for example selecting a random sample from the population one is investigating. In such cases, the variables are called random variables.
Letters near the end of the alphabet, such as x, y and z, are used to symbolize the values of variables. When people refer to the random variable itself instead of the value the variable takes, they use the capital letter, such as X, Y and Z. Each random variable refers to the outcome of a random phenomenon, and each outcome has a specific probability. The probability distribution of a random variable specifies the possible values and their probabilities.
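As a concrete sketch (the coin-flip setting is our own example, not the book's), here is the probability distribution of X = number of heads in three fair coin flips, computed from the binomial formula C(n, x) · p^x · (1 − p)^(n − x).

```python
from math import comb

# X = number of heads in 3 fair coin flips
n, p = 3, 0.5
distribution = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

print(distribution)                # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
print(sum(distribution.values()))  # the probabilities sum to 1
```

Each possible value of X gets a probability, and the probabilities over all possible values sum to 1 - exactly what a probability distribution is.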
What are sampling distributions? - Chapter 7
In practice, you seldom know the values of parameters. For example, when elections are coming up, candidates are interested in gauging where they stand with the voters, so they rely on surveys/polls to help predict who is going to win. This section is going to introduce a type of probability distribution called the sampling distribution that helps us determine how close to the population parameter a sample statistic is likely to fall.
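The polling idea can be sketched with a simulation (the 54% support level and poll size are invented): repeating many polls shows how the sample proportion varies around the population parameter, i.e. its sampling distribution.

```python
import random
import statistics

# Suppose 54% of all voters favor a candidate. Each poll of n = 500
# voters yields a sample proportion; simulating many polls traces out
# the sampling distribution of that statistic.
random.seed(3)
true_p, n = 0.54, 500
sample_props = [sum(random.random() < true_p for _ in range(n)) / n
                for _ in range(1000)]

center = statistics.mean(sample_props)   # close to the parameter 0.54
spread = statistics.stdev(sample_props)  # close to sqrt(p(1-p)/n), about 0.022
print(round(center, 3), round(spread, 3))
```

The spread of this distribution tells us how close to 0.54 a single real poll's result is likely to fall.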
Statistical inference: what are confidence intervals? - Chapter 8
Statistical inference methods help us predict how close a certain sample statistic falls to the population parameter. We can then make decisions and predictions about a population even if we have data for relatively few subjects from that population. There are a few relevant concepts in statistical inference, such as the role of randomization, concepts of probability, the normal distribution and the use of the sampling distribution. These concepts are important for two reasons:
- Statistical inference uses probability calculations that assume that data were gathered with a random sample or randomized experiment.
- The probability calculations refer to a sampling distribution of a statistic, which is often a normal distribution.
There are two types of statistical inference, namely estimation and testing hypotheses. This chapter discusses the estimation in statistical inference. The most informative estimation method is about an interval of numbers, mainly known as the confidence interval.
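A minimal sketch of such an interval estimate, using the standard large-sample formula p̂ ± 1.96 · se for a 95% confidence interval for a population proportion (the survey numbers are invented for illustration):

```python
import math

# Invented survey: 540 of 1000 sampled people favor some proposal.
successes, n = 540, 1000
p_hat = successes / n                    # point estimate of the proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
lower = p_hat - 1.96 * se                # 95% confidence interval endpoints
upper = p_hat + 1.96 * se

print(f"95% CI: ({lower:.3f}, {upper:.3f})")  # 95% CI: (0.509, 0.571)
```

The interval of numbers, rather than the single point estimate 0.54, conveys how precise the estimate is.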
Statistical inference: What do significance tests say about hypotheses? - Chapter 9
In this chapter you will learn how to use inferential statistics to answer questions about predictions and claims, such as those of astrology: the belief that the positions of the planets and the moon at the moment of your birth determine your personality traits. To do this, researchers use a method called significance testing.
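A hedged sketch of a significance test for a proportion (the experiment and its numbers are invented, not the book's): we test whether people guess better than chance by comparing the sample proportion with the value claimed under the null hypothesis.

```python
import math

# Invented experiment: people try to guess which of three hidden cards
# was chosen. H0: p = 1/3 (pure guessing). Suppose 40 correct in 100 trials.
successes, n, p0 = 40, 100, 1 / 3
p_hat = successes / n
se = math.sqrt(p0 * (1 - p0) / n)  # standard error assuming H0 is true
z = (p_hat - p0) / se              # test statistic

# Two-sided P-value from the standard normal distribution
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), round(p_value, 2))
```

A large P-value like this one means the data are consistent with pure guessing; only a small P-value would count as evidence against H0.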
How do you compare two groups? - Chapter 10
Consider a study that compares female and male college students on the proportion who say they have participated in binge drinking. You have two variables: the response variable is binge drinking, and the explanatory variable is gender. The analysis examines how binge drinking behavior differs between female and male students. An analysis that looks at any type of relationship between two variables is called a bivariate analysis. Comparing two groups is the special case in which the explanatory variable is binary, taking only two values.
How do you analyze the association between categorical variables? - Chapter 11
In Chapter 3, you have learned that two variables have an association when particular values for one variable are more likely to occur with certain values of the other variable.
When you want to investigate an association, first it is very important to identify the response and the explanatory variable. It is, for instance, more natural to study the influence of income (high/low) on happiness instead of the other way around. So, income is the explanatory variable and happiness the response variable. You can put this data in a contingency table. The percentages in a row are called the conditional percentages. Here, they refer to the distribution of happiness. The distribution is called the conditional distribution. You also have proportions that are called conditional probabilities of, in this case, happiness.
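The income/happiness example can be sketched with an invented contingency table: for each income row we compute the conditional distribution of happiness by dividing the cell counts by the row total.

```python
# Hypothetical contingency table (counts invented for illustration):
# rows = income (explanatory variable), columns = happiness (response).
table = {
    "low income":  {"not happy": 40, "pretty happy": 100, "very happy": 60},
    "high income": {"not happy": 20, "pretty happy": 90,  "very happy": 90},
}

# Conditional distribution of happiness within each income row
# (each row's conditional proportions sum to 1).
conditional = {}
for income, counts in table.items():
    total = sum(counts.values())
    conditional[income] = {h: c / total for h, c in counts.items()}

print(conditional["low income"])
# {'not happy': 0.2, 'pretty happy': 0.5, 'very happy': 0.3}
```

Comparing the two rows' conditional distributions is how you see whether income and happiness are associated.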
How do you analyze the association between quantitative variables: regression analysis? - Chapter 12
In this chapter you will learn more about using a regression line to predict the response variable y and the correlation to describe the strength of the association. A regression line is a straight line that predicts the value of a response variable y from the value of an explanatory variable x. The correlation, denoted by the letter r, is a summary measure of the association that falls between -1 and +1. You'll learn how to make inferences about the regression line for a population and how the variability of data points around the regression line helps us predict how far from the line a value of y is likely to fall.
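A minimal least-squares sketch with invented (x, y) data: fit the prediction line ŷ = a + b·x and compute the correlation r from the same sums of squares.

```python
# Invented data with a strong positive linear trend
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

xm = sum(x) / len(x)
ym = sum(y) / len(y)
sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
sxx = sum((xi - xm) ** 2 for xi in x)
syy = sum((yi - ym) ** 2 for yi in y)

b = sxy / sxx                 # slope of the least-squares regression line
a = ym - b * xm               # intercept
r = sxy / (sxx * syy) ** 0.5  # correlation, always between -1 and +1

print(round(a, 2), round(b, 2), round(r, 3))
```

Here r is very close to +1, reflecting how tightly the points cluster around the line.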
What is multiple regression? - Chapter 13
When you have several explanatory variables, you can make better predictions by using all of the variables at once. That is the idea behind multiple regression. But besides helping you predict the response variable better, multiple regression lets you analyze the association between two variables while controlling for another variable, i.e. keeping it fixed. That is very important, because the effect of an explanatory variable can change substantially after you take a potential lurking variable into account.
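A sketch of the multiple regression prediction equation ŷ = a + b1·x1 + b2·x2, with entirely invented coefficients (not estimated from any data), showing what "controlling for" a variable means: compare predictions while holding one explanatory variable fixed.

```python
# Hypothetical equation: predicted house price (in $1000s) from
# size (sq ft) and age (years). Coefficients are invented for illustration.
def predict(size, age, a=50.0, b1=0.12, b2=-1.5):
    return a + b1 * size + b2 * age

# Controlling for age: compare two sizes while holding age fixed at 10.
price_small = predict(1500, 10)  # about 215
price_large = predict(2000, 10)  # about 275
print(price_small, price_large)
```

The difference between the two predictions isolates the effect of size at a fixed age, which is exactly the "keeping it fixed" idea.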
How do you compare groups: analysis of variance methods - Chapter 14
The methods in this chapter apply when a quantitative response variable has a categorical explanatory variable. The categories of the explanatory variable identify the groups to be compared in terms of their means on the response variable. The inferential method for comparing means of several groups is called analysis of variance, abbreviated ANOVA. The name refers to the significance test, which focuses on two types of variability in the data. The categorical explanatory variables in multiple regression and in ANOVA are often referred to as factors. An ANOVA with one factor is called a one-way ANOVA; an ANOVA with two factors is called a two-way ANOVA.
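The two types of variability can be computed by hand on a tiny invented data set: the one-way ANOVA F statistic is the ratio of between-group variability to within-group variability.

```python
# Three invented groups of a quantitative response variable
groups = [[3, 5, 4], [8, 9, 10], [5, 6, 7]]

k = len(groups)                        # number of groups (one factor)
n = sum(len(g) for g in groups)        # total sample size
grand = sum(sum(g) for g in groups) / n

means = [sum(g) / len(g) for g in groups]
ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

# F = (between-group variability) / (within-group variability)
F = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(F, 1))  # a large F suggests the group means truly differ
```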
What does nonparametric statistics mean? - Chapter 15
Nonparametric statistics offer an alternative way to compare two groups without having to assume a normal distribution for the response variable. They use only the ranking of the subjects on the response variable. They are especially useful in two cases:
- When the data are ranks for the subjects rather than quantitative measurements
- When it is inappropriate to assume normality, and the ordinary statistical method is not robust to violations of the normality assumption. We might prefer not to assume normality because we expect the distribution to be skewed, or because we have no idea about the distribution's shape and the sample size is too small to check it. In such cases, nonparametric methods still give valid results.
This chapter will give you some sort of idea behind the nonparametric methods, and you will learn more about the most popular nonparametric test, the Wilcoxon test for comparing groups. The nonparametric methods in this chapter are special cases of permutation tests when applied to the ranks of the observations instead of using the original values.
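The core idea of the Wilcoxon rank-sum approach can be sketched on invented, tie-free data: rank all observations from both groups together, then sum the ranks of one group.

```python
# Two invented groups of a quantitative response variable (no tied values,
# so each observation gets a unique rank)
group_a = [12, 15, 11]
group_b = [18, 20, 16]

combined = sorted(group_a + group_b)
ranks = {value: i + 1 for i, value in enumerate(combined)}

rank_sum_a = sum(ranks[v] for v in group_a)
print(rank_sum_a)  # 6: group A holds the three lowest ranks
```

Only the ranks matter, not the actual measurements, which is why no normality assumption is needed; a rank sum this extreme is what the test compares against what random group assignment would produce.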