What are distributions in the realm of statistics? - Chapter 1

This chapter focuses on distributions. The distribution of a variable tells us which values the variable takes across individuals and how often each value occurs. Distributions can be described numerically, for example with a mean or a median. This chapter also describes several ways to represent distributions graphically, for example with a histogram or a box plot.


How can you learn from data?

Statistics is the science of acquiring knowledge from data. Data are numerical or qualitative descriptions of the objects under study. This first part deals with data handling. First, the different types of data we can collect and the way datasets are organized are discussed. After that, attention is paid to examining data with graphs. Graphs are useful because they provide a visual image that can be used to discover patterns in data. The next step is to calculate numerical summaries, which describe patterns in the distribution of the data. Finally, we make the transition from data summaries to statistical models: density curves are explained and the normal distribution is introduced. These distributions play a critical role in methods for drawing conclusions from data.

What is a dataset?

Statistical analysis starts with a dataset. A dataset is a structured collection of data.

  • Data consist of numerical or categorical values.

  • A dataset is constructed by determining which cases (or units) we want to study. For each case, we collect information about properties called variables.

  • Cases are the objects described by a dataset. These can be customers, companies, test subjects or other objects.

  • A label is a special variable that is used in some datasets to distinguish different cases from each other.

  • A variable is a property of a case.

  • Different cases can have different values on the variables.

  • A categorical variable places an individual in one of two or more groups or categories. An example is gender.

  • A quantitative variable takes numerical values for which arithmetic operations make sense. An example is height: someone who is two meters tall is twice as tall as someone who is one meter tall.

  • The distribution of a variable tells us which values of the variable occur in individuals and how often these values occur.

  • We use the term units of measurement to refer to the way a variable is measured. For example, time is measured in hours, minutes or seconds, the height of a child in meters or centimeters. These units of measure are an important part of the description of a quantitative variable.

What are the main properties of a dataset?

Each dataset is accompanied by certain background information that helps interpret the data. Consider the following points:

  • Who? Which cases describe the data? How many of these cases does the dataset contain?

  • What? How many variables does the data contain? What are the precise definitions of those variables? What are the units of measure for each quantitative variable?

  • Why? What purpose do the data have? Do we hope to answer a specific question? Do we want to draw conclusions about cases for which we have no data? Are the variables used fit for purpose?

A spreadsheet can be used to process the data. This can be done in Excel, for example. It is important to avoid spaces in variable names, as these are not allowed in some statistical software. Instead of a space, an underscore (_) can be used.

If we want to make a variable suitable for calculation, we can transform the variable. For example, the letter grades of the American school system can be converted into numbers (A = 4, B = 3, etc.). This only makes sense when the difference between A and B is the same as, for example, the difference between C and D.

Part of becoming good at statistics is knowing which variables are important and how they can best be measured. Different types of variables may require different measuring instruments, for example a breath test to measure lung capacity and a survey to measure personality. Often, the details of a measurement require knowledge of the specific field of study. In any case, make sure that each variable really measures what you want it to measure. A poor choice of variables can lead to misleading conclusions.

How can you graphically display distributions?

What is Exploratory Data Analysis?

Exploratory data analysis (EDA) involves describing the most important characteristics of a dataset. The following two strategies can be used in this regard:

  • First, examine each variable individually. Only then should the relationship between the variables be considered.

  • Graphically display the values of the variables first. Numerical summaries of these values can then be added.

The values of a categorical variable are labels for the categories, such as "female" and "male". The distribution of a categorical variable lists, for each category, how many of the individuals studied fall into it (the count). This can also be expressed as a percentage.
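As a minimal sketch, the counts and percentages of a categorical variable can be computed as follows. The list `answers` is made-up illustration data, not data from this chapter.

```python
from collections import Counter

# Hypothetical answers of five respondents to a categorical question.
answers = ["female", "male", "female", "female", "male"]

# Count how often each category occurs.
counts = Counter(answers)

# Express each count as a percentage of all respondents.
percentages = {
    category: 100 * count / len(answers)
    for category, count in counts.items()
}

for category in counts:
    print(f"{category}: {counts[category]} ({percentages[category]:.0f}%)")
```

The counts are what a bar graph would show on its vertical axis; the percentages are what a pie chart would show.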

Which categorical variable charts are there?

A distribution can be represented graphically by a:

  • Bar graph: the height of each bar shows how often a category occurs. The frequencies are on the y-axis, so the heights of the bars must correspond to them.

  • Pie chart: shows each category as a share of the whole. For example, you can immediately see whether more men than women took part in a survey. Because pie charts have no scale, quantities are expressed as percentages. A pie chart requires that all categories that make up the whole are included.

Bar graphs are easier to interpret and more flexible than pie charts. Both can be used when you want readers to see the frequencies of the values of a variable at a glance.

What charts for quantitative variables are there?

Stem-and-leaf diagram

A stem-and-leaf diagram (stem plot) quickly shows the shape of a distribution while retaining each value in its original form. Such a diagram is most useful when there are not too many observations (all of which are greater than zero). To create a stem-and-leaf diagram, perform the following steps:

  • First, split each value into a stem and a leaf. The leaf is the last digit and the stem consists of all the other digits (for the number 35, 3 is the stem and 5 is the leaf). A stem can contain multiple digits (for the number 135, 13 is the stem), but a leaf always consists of a single digit.

  • Then list the stems in a vertical column, with the smallest stem at the top, and draw a vertical line to the right of the stems.

  • Finally, write each leaf in the row of its stem, to the right of the line. Within each row, start with the smallest leaf.
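The steps above can be sketched in a few lines of code. This is only an illustration that assumes non-negative whole numbers; the function name `stem_and_leaf` and the example scores are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf diagram for non-negative integers."""
    leaves_by_stem = defaultdict(list)
    # Sorting first guarantees that the leaves in each row are in order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # leaf = last digit, stem = the rest
        leaves_by_stem[stem].append(leaf)

    rows = []
    # List every stem from smallest to largest, including empty ones,
    # so that gaps in the distribution stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return rows

scores = [35, 41, 38, 52, 47, 41, 60, 59]  # made-up example data
for row in stem_and_leaf(scores):
    print(row)
```

Each printed row shows one stem, the vertical line, and the leaves that belong to that stem.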

Back-to-back diagram

A back-to-back diagram is a variation of the stem-and-leaf diagram that makes it possible to compare two related distributions. Such a diagram uses common stems. For example, you can display the weights of men and women in a back-to-back diagram. The stems of the weights are then in the middle, with a vertical line on both the left and the right of the stems. You can then write the women's leaves on the right side and the men's leaves on the left side.

Which diagrams are there for a large dataset?

Stem-and-leaf diagrams and back-to-back diagrams are not useful for a large dataset. It takes a very long time to enter each value in the diagram and the result looks cluttered. Two adjustments help:

  • Splitting: split each stem into two, one for the leaves 0 to 4 and one for the leaves 5 to 9. This doubles the number of stems.

  • Trimming: when the observed values have many digits, remove the last digit or digits before creating the stem-and-leaf diagram.

What are histograms?

A histogram divides the values of a variable into groups (classes) and shows only the frequency or percentage of each group. You can decide how many groups to create, but the groups must be of equal width. Note that the appearance of a histogram can change when the classes are changed. Creating a histogram by hand takes longer than creating a stem-and-leaf diagram. Also, the original data values do not appear in a histogram, whereas they do in a stem-and-leaf diagram. To make a histogram, perform three steps:

  • Make groups. For example, for a dataset with the IQ scores of fifty people, you can make the intervals 75 ≤ IQ < 85, 85 ≤ IQ < 95, etc.

  • Assign each value to its group and count how many values fall in each group (the frequencies). A table of the frequencies of all groups is called a frequency table.

  • Finally, draw the histogram. In our case, the horizontal axis (X axis) shows the IQ scores and the vertical axis (Y axis) shows the frequencies. Each bar represents a group. There is no space between the bars unless a group is empty. This is the case, for example, if no one has an IQ score between 75 and 84.
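The first two steps (making classes and counting) can be sketched as follows. The function `frequency_table` and the thirteen IQ scores are hypothetical; a real analysis would normally leave this to statistical software.

```python
def frequency_table(values, start, width, n_classes):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = [0] * n_classes
    for value in values:
        index = int((value - start) // width)  # which class the value belongs to
        if 0 <= index < n_classes:             # ignore values outside all classes
            counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, counts[i])
        for i in range(n_classes)
    ]

# Made-up IQ scores, grouped in classes of width 10 starting at 75.
iq_scores = [78, 84, 85, 91, 94, 95, 99, 102, 103, 104, 110, 118, 121]
table = frequency_table(iq_scores, start=75, width=10, n_classes=5)

# A crude text histogram: one star per observation in the class.
for lower, upper, count in table:
    print(f"{lower} <= IQ < {upper}: {'*' * count}")
```

Changing `start` or `width` regroups the same data and can change the look of the histogram, which is the sensitivity to class choice mentioned above.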

What are the differences between histograms and bar charts?

Histograms and bar charts are similar, but not the same. In a bar chart the bars do not touch each other, while in a histogram they do. A histogram shows the counts or percentages of the values of a quantitative variable. A bar chart compares the sizes of different items. The horizontal axis of a bar chart does not need a measurement scale; it can consist of labels. If you want to know how many students are studying biology, psychology or medicine, these fields of study are categories that you put on the X axis, so a bar chart should be made. If the variable is numeric (e.g. IQ, height or weight), a histogram should be used.

So bar charts are useful for categorical variables, while histograms are important for quantitative variables.

What are the main features of distributions?

After a dataset has been processed in a graph or diagram, the main characteristics of the distribution must be investigated. In this context, it is important to consider the following:

  • View the overall pattern and pay close attention to striking deviations from it.

  • Consider the shape, the center and the spread of the distribution. The center of a distribution is the value for which half of the observations are smaller and the other half are larger. The spread can be described by looking at the smallest and largest values. For the shape, check how many peaks the distribution has. If there is only one peak (mode), we call the distribution unimodal. Also check whether the distribution is symmetrical or skewed to the left or right. A distribution is symmetrical when the parts below and above the center are mirror images of each other. If a distribution is skewed to the right, the right tail (which consists of larger values) is much longer than the left tail (which consists of smaller values). Height and IQ are variables that often have an (approximately) symmetrical distribution: few people are extremely short or extremely tall and most people are close to average, and the same goes for IQ scores. House prices have a distribution that is skewed to the right: many houses are about the same price, while there are some very expensive villas.

  • An important kind of deviation is an outlier. This is an individual score that clearly falls outside the general pattern.

What are outliers?

There are no fixed rules for identifying outliers. You have to form your own judgment about which scores should be labeled as deviating. Always look for values that clearly differ from most other values; these are not necessarily only the most extreme observations in a distribution. It is also important to try to explain outliers. For example, an outlier can be the result of unusual circumstances.

What are time plots?

When data are collected over time, it is a good idea to plot the observations in time order. Histograms and stem-and-leaf diagrams can be misleading in this case, because they ignore the time order and can hide systematic changes over time.

  • A time plot of a variable plots each observation against the time at which it was measured. Time should always be on the horizontal axis (X axis), while the measured variable should be on the Y axis. Connecting the data points with lines shows whether changes have taken place over time. Trends can also be discovered in this way.

  • Many datasets are time series. These are measurements of a variable that have been taken at different times. Consider, for example, the measurement of national unemployment per quarter.

  • A trend in a time series is a sustained rise or fall over the long term. A pattern that keeps repeating itself at specific moments in a time series is called seasonal variation. In that case, seasonal adjustment is carried out, so that research results do not have a misleading effect. The fact that the unemployment rate increased in December and January does not necessarily mean that more people have become unemployed. Unemployment figures always rise during this period, because temporary workers, for example, often stop working at the end of the year. Taking such a phenomenon into account is a form of seasonal adjustment.

How can you describe distributions numerically?

What is the mean?

A numerical description of a distribution starts with a measure of the center. The best-known measures of the center are the mean and the median. The mean is the average value, while the median is the middle value.

To find the mean, add up all scores and divide by the number of scores. If n people have the scores $x_1, x_2, x_3, \dots, x_n$, their mean is:

$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$

Another notation is: $\bar{x} = \frac{1}{n} \sum x_i$. In this formula, the Greek letter Σ (sigma) stands for "adding everything together".

The disadvantage of the mean is that it is very sensitive to the influence of a few extreme observations. These extreme scores can be outliers, but they do not have to be. Because the mean is influenced by extreme scores, we say that the mean is not a robust (resistant) measure of the center. This is also evident from the fact that changing a single score in the distribution is enough to change the mean.
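A small sketch makes this sensitivity concrete. The scores are made up for illustration, and `mean` is just a helper written for this example.

```python
def mean(scores):
    """Add up all scores and divide by the number of scores."""
    return sum(scores) / len(scores)

scores = [4, 5, 5, 6, 7]            # made-up scores
with_extreme = [4, 5, 5, 6, 70]     # same scores, but one is changed to 70

print(mean(scores))        # 5.4
print(mean(with_extreme))  # 18.0
```

Changing one score out of five moves the mean from 5.4 to 18.0, a value that is larger than four of the five scores.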

What is the median?

The median M is the literal center of a distribution. Half of the observations fall below the median, while the other half are above the median. The median of a distribution can be found as follows:

  • Put all scores in order first (from smallest to largest).

  • If the number of observations is odd, then the median is exactly the middle number. For example, if there are five numbers, the median is the third number. The position of the median in this case can be found as follows: (n + 1) / 2. In our example, that is: (5 + 1) / 2 = 3. This formula does not say what the median is, but where the median is in the series of numbers.

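  • If the number of observations is even, the median M is the mean of the two middle observations in the ordered list. The position of the median is found in the same way: (n + 1) / 2. For example, with four numbers the position is (4 + 1) / 2 = 2.5, halfway between the second and third number.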
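The procedure for the median can be sketched like this. The function name and the example numbers are only for illustration; Python's standard library also offers `statistics.median`, which gives the same result.

```python
def median(scores):
    """Return the middle value of the scores after sorting them."""
    ordered = sorted(scores)   # step 1: order from smallest to largest
    n = len(ordered)
    middle = n // 2            # zero-based index of the upper middle value
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 when counting from 1.
        return ordered[middle]
    # Even n: mean of the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 3, 5, 1, 9]))  # five numbers: the third one after sorting
print(median([1, 3, 5, 7]))     # four numbers: halfway between 3 and 5
```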

What is the difference between the mean and the median?

If a distribution is completely symmetrical, the median and the mean are the same. In a distribution that is skewed to the left or right, the mean lies further out in the long tail than the median. This is because the mean is much more influenced by extreme scores, and the long tail of a distribution consists of extreme scores.

What is spread (variability)?

The simplest numerical description of a distribution consists of a measure of the center (such as the mean or the median) and a measure of the spread. We can describe the spread of a distribution by calculating several percentiles. The median divides the distribution exactly in two, which is why the median is also called the fiftieth percentile. In addition, there is an upper quartile in the top half of the data and a lower quartile in the bottom half. The quartiles, together with the median, divide the data into quarters: about a quarter of the data lies between each pair of adjacent dividing points. Quartiles can be calculated as follows:

  • First, rank all scores from smallest to largest and calculate the median of the whole set.

  • The first quartile (Q1) is the median of the scores that lie below the position of the overall median in the ordered list.

  • The third quartile (Q3) is the median of the scores that lie above the position of the overall median in the ordered list.

The pth percentile of a distribution is the value such that p percent of the scores are equal to it or fall below it.
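A minimal sketch of the quartile procedure, assuming the convention that the overall median itself is left out of both halves when the number of scores is odd. Statistical packages use several slightly different conventions, so their results can differ a little from this one. The function name and data are hypothetical.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, M, Q3) using the median-of-halves method."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # scores below the median's position
    upper = ordered[half + n % 2:]   # scores above the median's position
    return median(lower), median(ordered), median(upper)

scores = [1, 3, 4, 6, 7, 8, 10, 13, 15]   # made-up example data
q1, m, q3 = quartiles(scores)
print(q1, m, q3)
```

For these nine scores the median is the fifth value, so Q1 is the median of the four values below it and Q3 the median of the four values above it.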

What is the five number summary?

To describe the center and spread of a distribution, it is helpful to have (1) the lowest score, (2) Q1, (3) M (the median), (4) Q3, and (5) the highest score. These values are collectively referred to as the five-number summary. These five values are visible in a box plot.

  • The two outer edges of the box in a box plot represent Q1 and Q3.

  • The median is represented by the line inside the box.

  • Two lines (one on each side of the box) extend to the highest value and the lowest value.

What is the interquartile range?

The largest and smallest values say little about the spread of the bulk of the data, because they are sensitive to extreme observations. The distance between the first and third quartiles is a more robust measure of spread. This distance is called the interquartile range (IQR) and is calculated as follows:

  • IQR = Q3 - Q1.

  • The IQR is often used in a rule of thumb to identify outliers: a score is called a suspected outlier if it falls more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile.
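The rule of thumb can be sketched as follows. The quartiles are computed with the median-of-halves method, which is an assumption of this example, and the function name and data are hypothetical.

```python
from statistics import median

def iqr_outliers(scores):
    """Return the scores flagged by the 1.5 x IQR rule of thumb."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])           # median of the lower half
    q3 = median(ordered[half + n % 2:])   # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in ordered if x < low_fence or x > high_fence]

scores = [1, 3, 4, 6, 7, 8, 10, 13, 40]   # made-up data with one extreme score
print(iqr_outliers(scores))
```

Here Q1 = 3.5 and Q3 = 11.5, so the IQR is 8 and the fences lie at -8.5 and 23.5; only the score 40 falls outside them. A flagged score is a candidate for closer inspection, not automatically an error.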

What are skewed distributions?

Quartiles and the IQR are not affected by changes in the tails of a distribution, so they are quite robust. However, no single numerical measure of spread (such as the IQR) is very useful for describing the spread of a skewed distribution. The two sides of a skewed distribution have different spreads, so one number cannot be sufficient. Skewness can be noticed by comparing how far the first quartile and the lowest score are from the median (left tail) with how far the third quartile and the highest score are from the median (right tail).

What are variance and standard deviation?

Much more often than the five-number summary, the standard deviation (along with a measure of the center point) is used to get a picture of a distribution. The standard deviation measures the dispersion by looking at how far observations are from the mean.

  • The variance (s²) of a dataset is the average of the squared deviations from the mean, where the sum is divided by n - 1 rather than n. In formula form: $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$. An equivalent notation is: $s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$. Here n - 1 is the number of degrees of freedom.

  • To find the standard deviation (s), take the square root of the variance. The standard deviation is especially useful for normal distributions, which are discussed in the next section. The standard deviation is preferred over the variance, because taking the square root of the variance ensures that spread is measured on the original scale of the variable.

The deviations from the mean ($x_i - \bar{x}$) show how far scores are from the mean. Some of these deviations are positive, while others are negative. In fact, the sum of the deviations from the mean is always zero. For this reason, the deviations are squared; this way the sum does not come to zero. The variance and standard deviation will be large if the scores are widely spread around the mean.

The variance and standard deviation will be small when the scores are close to the mean.
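The calculation can be followed step by step in a short sketch. The scores are made up, and `sample_variance` is a helper written for this example; the standard library functions `statistics.variance` and `statistics.stdev` compute the same quantities.

```python
import math

def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]
    return sum(d ** 2 for d in deviations) / (n - 1)

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up scores with mean 5
mean = sum(scores) / len(scores)

# The raw deviations cancel out, which is why they are squared.
print(sum(x - mean for x in scores))   # 0.0

variance = sample_variance(scores)     # 32 / 7
std_dev = math.sqrt(variance)          # back on the original scale
print(round(variance, 3), round(std_dev, 3))
```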

What are the characteristics of the standard deviation?

  • The standard deviation s measures the spread around the mean and should only be used when the mean (and not the median) is chosen as the measure of the center.

  • The standard deviation is zero when there is no spread in a distribution. This only happens if all values are the same. Otherwise, the standard deviation is greater than zero. The more spread there is, the greater s becomes.

  • The standard deviation s, like the mean, is not robust. The presence of a few outliers can make s very large. The standard deviation is even more sensitive to extreme scores than the mean.

  • Strongly skewed distributions (to the left or right) have large standard deviations, and in that case the standard deviation is not very informative. The five-number summary is often more useful than the mean and standard deviation when a skewed distribution must be described or when a distribution has extreme outliers. The mean and standard deviation are more useful when the distribution is roughly symmetrical and has few outliers.

How can you transform units of measurement?

The same variable can often be measured in different units of measurement. For example, temperature can be measured in both Fahrenheit and Celsius. Fortunately, converting units of measurement is easy, because a change of unit is a linear transformation of the measurements. Such a transformation does not change the shape of a distribution. If temperature measurements in Fahrenheit have a distribution that is skewed to the right, the distribution remains skewed to the right after the values have been converted to Celsius. However, the center and spread do change. A linear transformation turns the original variable x into a new variable (xnew) according to the following formula:

  • xnew = a + bx. Adding the constant a shifts all values of x by the same amount. Such an adjustment changes the zero point of the variable. Multiplying by the positive constant b changes the size of the unit of measurement.

  • Multiplying each observation by a positive number b multiplies the measures of the center (median and mean) and the measures of spread (standard deviation and IQR) by b.

  • Adding the same number a (whether positive or negative) to each observation adds a to the mean, median, quartiles, and percentiles. Measures of spread are not affected, however.
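These effects can be checked with a small sketch that converts Fahrenheit to Celsius, a linear transformation with a = -160/9 and b = 5/9. The temperature readings are made up for illustration.

```python
from statistics import mean, stdev

fahrenheit = [50.0, 59.0, 68.0, 77.0, 86.0]   # made-up readings

# Celsius = (Fahrenheit - 32) * 5/9, written in the form a + b * x.
a, b = -160 / 9, 5 / 9
celsius = [a + b * x for x in fahrenheit]

# The center is shifted by a and scaled by b ...
print(mean(fahrenheit), round(mean(celsius), 2))

# ... while the spread is only scaled by b; adding a has no effect on it.
print(round(stdev(fahrenheit), 2), round(stdev(celsius), 2))
```

The mean in Celsius equals a + b times the mean in Fahrenheit, whereas the standard deviation in Celsius is simply b times the standard deviation in Fahrenheit.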

What are normal distributions?

What are density curves?

Creating histograms by hand is inconvenient. Today, scientists usually use computer programs to make histograms. An advantage of computer programs is that they can also fit a smooth curve to a histogram. Such curves are called density curves. A density curve smooths out the irregularities of the histogram, as it were. Areas under the curve represent proportions of scores.

  • A density curve always lies on or above the horizontal axis.

  • The total area under the curve equals 1.

  • A density curve describes the general pattern of a distribution. Density curves, like distributions, can take all kinds of shapes. A special case is the normal distribution, whose curve is symmetrical and bell-shaped. Outliers are not described by a density curve.

How do you measure the center and the spread with normal distributions?

The mode of a distribution is the peak of the curve, that is, the place where the curve is highest. Since areas under the curve represent proportions, the median is the point with half of the area on each side.

The quartiles can be estimated by dividing the area under the curve into four approximately equal parts. The IQR is then the distance between the first and third quartiles. There are mathematical methods to calculate areas under a curve, which allow us to calculate the median and the quartiles precisely.

The mean of a density curve is the point at which the curve would balance if it were made of solid material. With a symmetrical curve, the median and the mean are at the same point. This is not the case with a skewed distribution. For a curve that is skewed to the right, the median is closer to the peak of the curve than the mean; the mean lies further towards the tail. With a skewed distribution it is difficult to determine the balance point with the naked eye. There are mathematical methods to calculate the mean and standard deviation of a density curve. In short:

  • The median of a density curve is the point that divides the area under the curve in half.

  • The mean of a density curve is the balance point at which the curve would balance if it were made of solid material.

  • The median and the mean are the same for a symmetrical density curve. The mean of a skewed distribution lies further towards the long tail, while the median lies closer to the peak.

What are characteristics of normal distributions?

We denote the mean of a density curve with the letter µ and the standard deviation with the symbol σ. These values are approximated by the mean ($\bar{x}$) and the standard deviation (s) of the observed scores. Normal distributions are symmetrical and unimodal, so they have only one peak. Changing µ (while keeping σ unchanged) shifts the position of the curve along the horizontal axis, while the spread remains the same. A curve with a larger standard deviation is wider and lower. The standard deviation σ is the natural measure of spread for a normal distribution. Together, µ and σ completely determine a normal distribution.

Why are normal distributions important in statistics?

  • Normal distributions are good descriptions of distributions that belong to real data. These are distributions that are distributed almost normally. Examples are distributions of height, weight and IQ.

  • Normal distributions are good approximations of the results of many chance outcomes, for example the number of heads when a coin is tossed many times.

  • Finally, normal distributions are useful, because statistical calculations (made on the basis of normal distributions) can be used for other, almost symmetrical distributions.

What are common features of normal distributions?

There are many types of normal distributions, but they have some common features. The main features are set out below.

  • About 68% of the scores fall within 1 standard deviation (σ) of the mean (µ).

  • About 95% of the scores fall within two standard deviations of the mean.

  • About 99.7% of the scores fall within three standard deviations of the mean.

These three features are collectively known as the 68-95-99.7 rule. The normal distribution with mean µ and standard deviation σ is written as N(µ, σ). For example, when researching the height of Dutch women, it is possible that N(1.70, 0.10) is found, with both values in meters.
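The three percentages can be verified numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8) and the standard normal distribution; because the rule holds for every normal distribution, any other µ and σ would give the same proportions.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

proportions = {}
for k in (1, 2, 3):
    # Area under the curve between k standard deviations below
    # and k standard deviations above the mean.
    proportions[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {proportions[k]:.4f}")
```

The printed values round to 68%, 95% and 99.7%, which is where the rule gets its name.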

What are standardized values?

If someone has scored sixty points on a test, you don't know if this is a high or low score compared to all other scores. It is therefore important to standardize the value.

  • If x is a score from a distribution with mean µ and standard deviation σ, then the standardized value of x is: z = (x - µ) / σ. A standardized value is often referred to as a z-score.

  • The standardized values of a distribution together have a mean of 0 and a standard deviation of 1. Standardizing a normally distributed variable therefore yields the standard normal distribution, N(0, 1).

What are Cumulative Proportions?

The precise calculation of the proportions under the normal distribution can be done by means of z-tables or software.

  • Z-tables and software usually give a cumulative proportion: the proportion of observations in a distribution that are less than or equal to a certain value.

When a distribution is described by a density curve, the cumulative proportion is the area under the curve to the left of a given value. Keep this in mind if you want the proportion to the right of the value: in that case you have to calculate 1 minus the proportion on the left. The z-table can be used to find proportions under the normal curve. To do this, scores must first be standardized. Suppose you want to know which proportion of students scored at least 820 on a certain test. The mean turns out to be 1026 and the standard deviation is 209.

  • The corresponding z-score is: (820 - 1026) / 209 = -0.99.

  • Then look up in the z-table which proportion belongs to -0.99. That turns out to be 0.1611. The area to the right of -0.99 is therefore 1 - 0.1611 = 0.8389.

  • If you had wanted to know which proportion of students scored at most 820, the answer would have been 0.1611.
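The same example can be worked out with software instead of a z-table. This sketch uses `NormalDist` from Python's standard `statistics` module and assumes, as the example does, that the test scores are normally distributed.

```python
from statistics import NormalDist

mu, sigma, score = 1026, 209, 820

# Standardize the score.
z = (score - mu) / sigma

# A z-table only lists z to two decimals, so round first to mimic it.
table_value = NormalDist().cdf(round(z, 2))   # proportion at or below z = -0.99

# Software can use the unrounded z-score directly.
below = NormalDist().cdf(z)   # proportion scoring at most 820
above = 1 - below             # proportion scoring at least 820

print(f"z = {z:.2f}, table value = {table_value:.4f}")
print(f"at most 820: {below:.4f}, at least 820: {above:.4f}")
```

Because the software does not round z to two decimals, its proportions differ slightly (in the third decimal) from the table-based values 0.1611 and 0.8389.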

What is a normal quantile plot?

Stem-and-leaf diagrams and histograms are often used to check whether a distribution is approximately normal. However, the normal quantile plot is the best graphical way to assess normality. It is not practical to make such a plot by hand, so software is used in most cases. The outline below shows how such a plot is constructed.

  • First, the scores are ordered from smallest to largest, and for each value it is noted which percentile of the data it corresponds to.

  • Next, the z-values that correspond to these percentiles in the standard normal distribution are found. These are referred to as normal scores.

  • Finally, each data point is plotted against its normal score. If the distribution is (almost) normal, the data points lie close to a straight line. Systematic deviations from the straight line indicate a non-normal distribution. Outliers show up as data points that lie far from the general pattern in the plot.
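The points of such a plot can be computed with a short sketch. It assumes that the i-th smallest of n scores is assigned the percentile (i - 0.5) / n; this is one common convention, and software packages may use a slightly different one. The function name and data are hypothetical, and the actual drawing is left to plotting software.

```python
from statistics import NormalDist

def normal_scores(values):
    """Pair each ordered value with its normal score (z-value)."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    pairs = []
    for i, x in enumerate(ordered, start=1):
        percentile = (i - 0.5) / n            # assumed plotting position
        z = standard_normal.inv_cdf(percentile)
        pairs.append((z, x))
    return pairs

heights = [172, 165, 180, 168, 175, 170, 177]   # made-up data in centimeters
for z, x in normal_scores(heights):
    print(f"normal score {z:6.2f}  ->  observed value {x}")
```

Plotting the observed values against the normal scores gives the normal quantile plot; points close to a straight line suggest an approximately normal distribution.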
