Summaries per chapter with the 2022 edition of Analysing Data using Linear Models by Van den Berg - Bundle
What are variables, variation and co-variation? - Chapter 1
- What is this chapter about?
- What is a data matrix?
- What is meant by wide and long data formats?
- Which different types of measurement level exist?
- What are frequency tables?
- What are quartiles, quantiles and percentiles?
- Which three measures of central tendency are there?
- How to use measures of variation?
- What is a normal distribution and how can you define it using the empirical rule?
What is this chapter about?
This chapter discusses how data are organised and which types of variables exist. It then explains frequency tables, quartiles, quantiles and percentiles, followed by measures of central tendency and variation. Lastly, the normal distribution is introduced.
What is a data matrix?
Data (plural) are facts and statistics collected together for reference or analysis. In data analysis, we almost always put data in a matrix format. Usually, the objects of the study -called units- are put in rows, and their properties -called variables- in columns. A data matrix thus is a matrix (a collection of rows and columns) that contains information on units (in the rows) in the form of variables (in the columns). An example of such a data matrix is given below. In this matrix, there are four units and two variables.
name | grade |
Laura | 8 |
Lisa | 7 |
Luna | 6 |
Lena | 9 |
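As a minimal sketch, the data matrix above could be represented in Python as a list of rows, where each row is one unit and each key is one variable. The names `data_matrix`, `n_units`, `variables` and `grades` are chosen for this illustration only.

```python
# Each dict is one unit (row); each key is one variable (column).
data_matrix = [
    {"name": "Laura", "grade": 8},
    {"name": "Lisa", "grade": 7},
    {"name": "Luna", "grade": 6},
    {"name": "Lena", "grade": 9},
]

n_units = len(data_matrix)                        # number of rows
variables = list(data_matrix[0])                  # column names
grades = [unit["grade"] for unit in data_matrix]  # one column, read top to bottom

print(n_units, variables, grades)
```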
What is meant by wide and long data formats?
Often, units of analysis are observed on multiple variables, meaning that there are several observations for every unit of analysis. These data can be stored in either a wide or a long format. In a wide format, variables are simply added to the row (unit of analysis): each new observation of the same variable on the same unit of analysis leads to a new column in the data matrix. Below is an example of a wide format.
client | depression_1 | depression_2 | depression_3 | depression_4 |
1 | 115 | 110 | 100 | 95 |
2 | 105 | 100 | 103 | 102 |
3 | 106 | 105 | 103 | 103 |
An alternative way to store these data is to stick to one variable and add rows instead of columns. This is called a long format. Below is an example of a long format. Note that these are exactly the same data, only arranged differently.
client | time | depression |
1 | 1 | 115 |
1 | 2 | 110 |
1 | 3 | 100 |
1 | 4 | 95 |
2 | 1 | 105 |
2 | 2 | 100 |
2 | 3 | 103 |
2 | 4 | 102 |
3 | 1 | 106 |
3 | 2 | 105 |
3 | 3 | 103 |
3 | 4 | 103 |
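The conversion from wide to long can be sketched in a few lines of Python. This is only an illustration using the standard library. The function name `wide_to_long` and its parameters are hypothetical, and the sketch assumes that the measurement occasion is encoded as the number after the `depression_` prefix.

```python
# Wide format: one row per client, one column per measurement occasion.
wide = [
    {"client": 1, "depression_1": 115, "depression_2": 110, "depression_3": 100, "depression_4": 95},
    {"client": 2, "depression_1": 105, "depression_2": 100, "depression_3": 103, "depression_4": 102},
    {"client": 3, "depression_1": 106, "depression_2": 105, "depression_3": 103, "depression_4": 103},
]

def wide_to_long(rows, id_key="client", prefix="depression_"):
    """Turn every repeated-measure column into a row of its own."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                time = int(key[len(prefix):])  # "depression_3" -> 3
                long_rows.append({id_key: row[id_key], "time": time, "depression": value})
    return long_rows

long_data = wide_to_long(wide)
for r in long_data:
    print(r["client"], r["time"], r["depression"])
```

Three clients with four measurements each give twelve rows in the long format, one per combination of client and time.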
Which different types of measurement level exist?
Data analysis is in essence about describing how different values in one variable relate to different values in one or more other variables (co-variation). When describing such co-varying variables, linear models are an important tool. In differentiating between these different variables, one important distinction is the measurement level of the variables: numeric, ordinal or categorical.
Numeric variables are variables that have values describing a quantity that can be measured as a number, such as ‘how many’ students are in a classroom or ‘how much’ you weigh in kg. A numeric variable can be a count variable, for instance the number of children in a classroom. A count variable can only take whole, non-negative values: 0, 1, 2, 3, etcetera. But a numeric variable can also be a continuous variable. Continuous variables can take any value from the set of real numbers, such as weight: 60.2, 58.8, 93.2 and so on. The number of decimals can be as large as the instrument of measurement allows. Examples of continuous variables include height, time, age, blood pressure and temperature.
For numeric variables, one can further distinguish between interval variables and ratio variables. The difference between interval and ratio variables is that for ratio variables, the ratio between two measurement values is meaningful, and for interval variables it is not. When a variable has a fixed zero-point, it is a ratio variable. In case the variable has an arbitrary zero-point, it is called an interval variable. What ratio and interval variables do have in common however, is that they are both numeric variables, expressing quantities in terms of units of measurements. This implies that the distance between 10 and 20 is the same as the distances between 30 and 40, 40 and 50 and so forth. This distinguishes them from ordinal variables.
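The difference between interval and ratio variables can be illustrated with a change of units. The sketch below uses temperature in degrees Celsius as an assumed example of an interval variable (its zero-point is arbitrary) and weight in kg as an assumed example of a ratio variable (its zero-point is fixed). The function names and the numbers are illustrative only.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def kg_to_pounds(kg):
    return kg * 2.20462  # approximate conversion factor

# Interval variable: the ratio depends on the arbitrary zero-point of the scale.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio variable: the ratio is the same in every unit of measurement.
ratio_kg = 80 / 40
ratio_pounds = kg_to_pounds(80) / kg_to_pounds(40)

# For both types, equal distances stay equal after changing units.
step_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
step_high = celsius_to_fahrenheit(40) - celsius_to_fahrenheit(30)

print(ratio_celsius, ratio_fahrenheit)
print(ratio_kg, ratio_pounds)
print(step_low, step_high)
```

Here 20 degrees is "twice" 10 degrees only on the Celsius scale, so that ratio carries no meaning, whereas 80 kg is twice 40 kg in any unit of weight.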
Ordinal variables are not measured in units. However, their values do have a meaningful order. An example is the size of clothing: small, medium, large. Ordinal variables are usually discrete: there is not an infinite number of levels of the variable. In our example with the sizes small, medium and large, there are no meaningful other values in between these values. Categorical variables have no order at all. Their values describe a quality of the study objects rather than a quantity.
What are frequency tables?
A frequency table describes how often each value of a variable occurs. Below is an example of a frequency table with frequencies, proportions, cumulative frequencies and cumulative proportions (the cumulative proportions add up to 1.0, that is, 100%).
age | frequency | proportion | cum_frequency | cum_proportion |
0 | 5 | 0.1 | 5 | 0.1 |
1 | 10 | 0.2 | 15 | 0.3 |
2 | 10 | 0.2 | 25 | 0.5 |
3 | 20 | 0.4 | 45 | 0.9 |
4 | 5 | 0.1 | 50 | 1.0 |
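The columns of such a table can be computed from the raw observations. The following is a minimal sketch using only the Python standard library. The list `ages` is reconstructed from the frequencies in the table above, and the function name `frequency_table` is hypothetical.

```python
from collections import Counter

# Raw observations reconstructed from the table: 5 zeros, 10 ones, and so on.
ages = [0] * 5 + [1] * 10 + [2] * 10 + [3] * 20 + [4] * 5

def frequency_table(values):
    """Return one row per distinct value, in ascending order."""
    counts = Counter(values)
    n = len(values)
    table = []
    cum_frequency = 0
    for value in sorted(counts):
        cum_frequency += counts[value]
        table.append({
            "value": value,
            "frequency": counts[value],
            "proportion": counts[value] / n,
            "cum_frequency": cum_frequency,
            "cum_proportion": cum_frequency / n,
        })
    return table

table = frequency_table(ages)
for row in table:
    print(row)
```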
These data can also be plotted in a frequency plot with age on the x-axis and frequency on the y-axis. Further, one could plot these data using a histogram. Histograms contain the same information as frequency plots, except that groups of values are taken together. Such a group of values is called a bin. In our example, each age could be a bin, which would give five bins, each with its own frequency.
What are quartiles, quantiles and percentiles?
Quartiles (from quarter, a fourth) are used to make a division into four groups. For example, you could divide 100 children by height by assigning the 25% tallest children to the first group, the 25% shortest children to the last group, and dividing the remaining 50% into two equally sized groups in the middle. Next, a quantile is the value below which a given proportion of the observations in a group of observations falls. Finally, percentiles are very much like quantiles, except that they refer to percentages rather than proportions. Thus, the 25th percentile is the same as the 0.25 quantile, and the 0.75 quantile is the same as the 75th percentile.
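The link between quartiles, quantiles and percentiles can be checked with `statistics.quantiles` from the Python standard library (available from Python 3.8). The scores below are made-up illustration data, and the function's default interpolation method is only one of several conventions, so other software may return slightly different cut points.

```python
from statistics import quantiles

# Hypothetical data: 99 scores running from 1 to 99.
scores = list(range(1, 100))

# n=4 gives the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(scores, n=4)

# n=100 gives the 99 percentile cut points; index 24 is the 25th percentile.
percentiles = quantiles(scores, n=100)
p25 = percentiles[24]
p75 = percentiles[74]

print(q1, q2, q3)
print(p25, p75)
```

In this example the first quartile, the 0.25 quantile and the 25th percentile are all the same value.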
Which three measures of central tendency are there?
There are three measures of central tendency:
- The mean is the average value, which can be computed by adding up all the values and dividing the sum by the number of values.
- The median simply is the middle value when the values are sorted. When there is an even number of values, it is the average of the two middle values.
- The mode is the value that occurs most.
For numeric variables, all three measures of central tendency are valuable. For ordinal variables, the mean is not meaningful, but the median and mode are. For categorical variables, only the mode is valuable.
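The three measures can be computed with the `statistics` module of the Python standard library. This is a small sketch: `grades` reuses the four grades from the data matrix earlier in this chapter, and `scores` is a made-up list added to show a mode.

```python
from statistics import mean, median, mode

grades = [8, 7, 6, 9]        # even number of values
scores = [6, 7, 7, 8, 9]     # hypothetical data in which 7 occurs most often

mean_grades = mean(grades)      # (8 + 7 + 6 + 9) / 4
median_grades = median(grades)  # average of the two middle values, 7 and 8

mean_scores = mean(scores)
median_scores = median(scores)  # the middle value of the sorted list
mode_scores = mode(scores)      # the most frequent value

print(mean_grades, median_grades)
print(mean_scores, median_scores, mode_scores)
```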
How to use measures of variation?
Next to summarising distributions by measures of central tendency, we can summarise distributions by measures of variation. First, the range is the distance between the lowest and highest value. Suppose the lowest value is 100 and the highest value is 130, then the range is 130 - 100 = 30. Second, the interquartile range (IQR) is the distance between the first and third quartile, that is, the difference between the value below which 75% of the measurements lie and the value below which 25% of the measurements lie. Third, the sum of squares, or sum of squared deviations, is obtained by taking the deviation of every value from the mean, squaring it, and adding up all the squared deviations. The variance, then, is the sum of squared deviations divided by the number of observations. The standard deviation is the square root of the variance. It is often used to indicate how deviant a particular value is from the rest of the values. For example, suppose we have a mean of 100 and a standard deviation of 5. Then a score of 105 is one standard deviation away from the mean, and a score of 110 is two standard deviations away from the mean.
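The measures of variation can be worked out step by step as follows. This is a sketch on made-up values with a lowest value of 100 and a highest value of 130. It divides the sum of squares by the number of observations, as described above, which corresponds to `pvariance` and `pstdev` in the Python standard library. The quartiles come from `statistics.quantiles`, whose default method is one of several conventions.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance, quantiles

values = [100, 110, 115, 125, 130]  # hypothetical data

value_range = max(values) - min(values)

q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

m = mean(values)
sum_of_squares = sum((x - m) ** 2 for x in values)  # squared deviations from the mean
variance = sum_of_squares / len(values)             # divided by the number of observations
standard_deviation = sqrt(variance)

print(value_range, iqr, sum_of_squares, variance, standard_deviation)
```

The hand-computed variance and standard deviation agree with `pvariance(values)` and `pstdev(values)`.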
Standard deviations are useful because they make it possible to compare values from different variables. More specifically, a standardised score can be computed by subtracting the mean from a value and dividing the result by the standard deviation. Such a standardised score is called a z-score (also known as a standard score), and it gives you an idea of how far from the mean a data point is. In more technical terms, it is a measure of how many standard deviations below or above the population mean a raw score is. Z-scores are a way to compare results to a “normal” population.
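The standardisation described above can be sketched in a single function. The function name `z_score` is chosen for this illustration, and the mean of 100 and standard deviation of 5 are example values.

```python
def z_score(value, mean, standard_deviation):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / standard_deviation

z_105 = z_score(105, mean=100, standard_deviation=5)
z_110 = z_score(110, mean=100, standard_deviation=5)
z_90 = z_score(90, mean=100, standard_deviation=5)

print(z_105, z_110, z_90)
```

A negative z-score, as for the score of 90, indicates a value below the mean.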
What is a normal distribution and how can you define it using the empirical rule?
It is important to know that for a normal distribution (a bell-shaped distribution), the mean, median and mode are all the same. Moreover, about 68% of all values lie within 1 standard deviation of the mean. In addition, we know that 5% of the values lie more than 1.96 standard deviations away from the mean (2.5% on each side). Because these percentages depend only on the number of standard deviations from the mean, it is convenient to work with the standard normal distribution: the distribution of standardised values (z-scores), which has a mean of 0 and a standard deviation of 1.
Although tables of the standard normal distribution are readily found online, it’s helpful to memorise the so-called 68 – 95 – 99.7 rule, also called the empirical rule. It says that about 68% of normally distributed values are at most 1 standard deviation away from the mean, about 95% of the values are at most 2 standard deviations away (more precisely, 1.96), and about 99.7% of the values are at most 3 standard deviations away. In other words, 68% of standardised values are between -1 and +1, 95% of standardised values are between -2 and +2 (more precisely, -1.96 and +1.96), and 99.7% of standardised values are between -3 and +3.
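The percentages of the empirical rule can be verified numerically with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The helper `share_within` is a hypothetical name for this sketch. It returns the proportion of a standard normal distribution that lies within k standard deviations of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of values between -k and +k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"within {k} standard deviations: {share_within(k):.4f}")
```

The result for 1.96 is almost exactly 0.95, while the result for 2 is slightly larger, which is why the rule's "2 standard deviations" is a rounded figure.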