Learning Statistics with R - Navarro - 2014 - Article


Useful tools for data analysis go beyond what is covered in undergraduate classes. Several statistical topics that fall outside the scope of an introductory course are nonetheless essential to data analysis:

  • Other types of correlations. Pearson and Spearman correlations both measure the association between numeric variables, but other correlation measures exist for variables on a nominal scale.
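
A minimal base-R sketch of the difference between the two familiar methods, using made-up data: Pearson measures linear association, while Spearman correlates the ranks, so any monotonic relationship gets a Spearman correlation of exactly 1.

```r
# Illustrative data: a monotonic but non-linear relationship (y = x^3)
x <- 1:20
y <- x^3

# Pearson measures linear association, so it is high but not perfect here
pearson <- cor(x, y, method = "pearson")

# Spearman correlates the ranks, so any monotonic relationship scores 1
spearman <- cor(x, y, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 3)
```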

  • Effect sizes. There are more ways to quantify effect size than just the most popular measure (such as Cohen's d).

  • Dealing with violated assumptions. Bootstrapping, Bayesian methods and cross-validation are tools for analysing data when the assumptions of a standard test are violated.

  • Interaction terms for regression. Interaction terms, familiar from ANOVA, can also be included in a regression model, to capture cases where the effect of one predictor depends on another.
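
In R's formula notation an interaction is written with `*` (or `:` for the interaction term alone); a small sketch with simulated data in which the slope of `x` differs between two groups:

```r
set.seed(42)
# Simulated data: the effect of x on y depends on group membership
n     <- 100
x     <- rnorm(n)
group <- factor(rep(c("a", "b"), each = n / 2))
y     <- ifelse(group == "a", 2 * x, -1 * x) + rnorm(n, sd = 0.5)

# y ~ x * group expands to y ~ x + group + x:group,
# i.e. both main effects plus the interaction term
fit <- lm(y ~ x * group)
coef(fit)
```

The `x:groupb` coefficient captures how much the slope of `x` changes in group b relative to group a.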

  • Method of planned comparisons. Post-hoc corrections (like Tukey's HSD) are not always needed. When the comparisons of interest are specified ahead of time, before the data are examined, planned comparisons can be used instead.

  • Multiple comparison methods. There is more than one way to correct for multiple comparisons, and it is not necessary to stick to a single method.
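
Base R's `p.adjust()` illustrates this: it implements several correction methods side by side, e.g. Bonferroni, Holm, and the Benjamini-Hochberg false discovery rate. A sketch with made-up p-values:

```r
# Five made-up p-values from five hypothetical tests
p <- c(0.001, 0.01, 0.02, 0.04, 0.2)

# Different corrections trade statistical power against error control
round(rbind(
  bonferroni = p.adjust(p, method = "bonferroni"),
  holm       = p.adjust(p, method = "holm"),
  BH         = p.adjust(p, method = "BH")   # false discovery rate
), 3)
```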

What are important non-traditional statistical methods?

There are a lot of statistical tools used in statistical modelling. Some important ones are described here.

  • Analysis of covariance. ANOVA and regression are closely related linear models. Analysis of covariance (ANCOVA) combines them: it is a method where some of the predictors are continuous and others are categorical.
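
In R, ANCOVA can be expressed as an ordinary linear model whose formula mixes both predictor types; a sketch with simulated data (the variable names are made up for illustration):

```r
set.seed(11)
# One continuous predictor (hours of study) and one categorical (group)
hours <- runif(90, 0, 10)
group <- factor(rep(c("control", "treatment"), length.out = 90))
score <- 5 + 2 * hours + ifelse(group == "treatment", 3, 0) + rnorm(90)

# ANCOVA as a linear model: continuous and categorical predictors together
fit <- lm(score ~ hours + group)
coef(fit)
```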

  • Nonlinear regression. The relationship between predictors and outcome does not always have to be linear; for example, it may be merely monotonic. Methods such as isotonic, polynomial or lowess regression can handle this.

  • Logistic regression. When the outcome variable is binary but the predictors are continuous, logistic regression is used.
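
In R this is `glm()` with a binomial family; a minimal sketch on simulated data:

```r
set.seed(1)
# Simulated data: the probability of a 'success' increases with x
x <- rnorm(200)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))   # inverse logit
y <- rbinom(200, size = 1, prob = p)

# glm() with a binomial family fits a logistic regression
fit <- glm(y ~ x, family = binomial)
coef(fit)
```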

  • The generalised linear model (GLM) is a family of models that includes regression and logistic regression. It allows for the idea that the data might not be normally distributed, and for non-linear relationships between predictors and outcomes.

  • Survival analysis is used when observations are censored, for example when values at one end are missing because of (time) restrictions on the study. It is often used in the medical field.

  • Repeated measures ANOVA is used when participants are measured under multiple conditions. Repeated measures violate the independence assumption: observations from the same participant are more related to one another than observations from different participants, so part of the variation in the data can be attributed to individual differences.

  • Mixed models are used when repeated measures ANOVA is insufficient, for instance when people's changes over time are measured. Mixed models are designed to learn about individual units as well as overall effects.

  • Reliability analysis is used to check the correlations between questions within a questionnaire. Measures such as Cronbach's α check the assumption that questions covering the same topic are correlated.
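
As a sketch, Cronbach's α can be computed directly from its standard formula in base R (simulated questionnaire data; dedicated reliability functions also exist in add-on packages such as psych):

```r
set.seed(7)
# Simulated questionnaire: 5 items all driven by one underlying trait
n     <- 300
trait <- rnorm(n)
items <- sapply(1:5, function(i) trait + rnorm(n, sd = 0.8))

# Cronbach's alpha from its standard formula:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
k     <- ncol(items)
alpha <- k / (k - 1) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
alpha
```

Because all five items share the same underlying trait, α comes out high, as expected for a reliable scale.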

  • Factor analysis is useful when more than a single construct is being measured. For example, an IQ test measures several things at once, and factor analysis helps to see what these things are. It attempts to express the pattern of correlations between variables using a smaller number of underlying variables (factors).

  • Multidimensional scaling (MDS) is used when variables cannot be divided into predictors and outcomes; it is an example of unsupervised learning. It is used for analysing similarities between items, objects or people. The goal of MDS is to find a geometric representation of the data: each item is plotted as a point in a low-dimensional (often two-dimensional) space, so that the distances between points reflect the similarities between items.
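
Base R's `cmdscale()` performs classical MDS. A toy sketch: given only the pairwise distances between four points that really lie on a line, MDS recovers one-dimensional coordinates whose distances reproduce the input:

```r
# Pairwise distances between four points that truly lie on a line
d <- dist(c(0, 1, 3, 7))

# Classical MDS: find k-dimensional coordinates whose distances match d
coords <- cmdscale(d, k = 1)
coords
```

The recovered coordinates are only determined up to shifting and mirroring, but the distances between them match `d` exactly.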

  • Clustering is another example of unsupervised learning, and the idea is to figure out what groups exist in the data. There are different flavours: fully unsupervised clustering such as k-means, semi-supervised clustering, and supervised clustering, which is usually referred to as classification.
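
A minimal k-means sketch in base R: two well-separated simulated blobs of points, handed to `kmeans()` without any labels:

```r
set.seed(123)
# Two clearly separated blobs of 50 two-dimensional points each
blob1 <- matrix(rnorm(100, mean = 0), ncol = 2)
blob2 <- matrix(rnorm(100, mean = 5), ncol = 2)
pts   <- rbind(blob1, blob2)

# k-means looks for k groups without being told any labels
fit <- kmeans(pts, centers = 2)
table(fit$cluster)
```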

  • Causal models are useful tools for learning about causal relationships between variables. Correlated variables alone do not reveal causation; when there are three or more related events it is useful to be able to say something about the causal relationships between them, for example whether event A happened prior to event B or C. Causal models and structural equation modelling (SEM) can be used to clarify such relationships.

What other approaches to inferential statistics can be used?

Besides traditional null hypothesis significance testing based on p-values, other methods are used for data analysis.

  • Bayesian methods. The Bayesian interpretation of probability treats probability as a degree of belief. This makes it possible to assign probabilities to one-off events, rather than restricting probability to events that can be replicated, and it leads to different tools for data analysis.

  • Bootstrapping is useful when not all underlying assumptions about your data are met, which often happens with small sample sizes. It is a simple method in which the results of the study are simulated many times under the assumptions that (a) the null hypothesis is true and (b) the unknown population distribution looks like the raw data.
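
One common flavour of the idea, a percentile bootstrap confidence interval for the mean, can be sketched in a few lines of base R: resample the data with replacement many times, treating the sample as a stand-in for the population:

```r
set.seed(99)
# A small, skewed sample for which normality is doubtful
x <- rexp(30, rate = 1)

# Bootstrap: resample the data with replacement many times,
# recomputing the statistic of interest on each resample
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))

# A 95% percentile confidence interval for the mean
quantile(boot_means, c(0.025, 0.975))
```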

  • Cross-validation is a method for assessing how well a model generalises. Divide the data into two subsets, X1 and X2; use subset X1 to train the model and check whether it performs comparably on subset X2. This gives an indication of how well results from one dataset generalise to another, and hence of how good the model really is.
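
A minimal hold-out sketch in base R with simulated data (a single train/test split; k-fold cross-validation repeats this with several splits):

```r
set.seed(5)
# Simulated data with a real linear relationship plus noise
n <- 200
x <- rnorm(n)
y <- 3 * x + rnorm(n)

# Hold-out validation: fit on one half, evaluate on the other
train <- sample(n, n / 2)
fit   <- lm(y ~ x, data = data.frame(x = x[train], y = y[train]))
pred  <- predict(fit, newdata = data.frame(x = x[-train]))

# Out-of-sample error indicates how well the model generalises
rmse <- sqrt(mean((y[-train] - pred)^2))
rmse
```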

  • Robust statistics is used when data are messier than they are supposed to be: variables are not normally distributed and relationships are not linear. Some statistical inferences are robust and still work when the underlying assumptions are not met. Robust statistics is about how to make safe inferences from data in the face of such contamination.

What are miscellaneous topics in statistics?

  • Missing data can be handled by making a plausible guess about what the missing values should be (imputation).

  • Power analysis is used to check how likely a study is to detect an effect if it really exists.
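
Base R's `power.t.test()` gives a feel for this; the effect size and group size below are made-up inputs for illustration:

```r
# Power of a two-sample t-test to detect a medium effect (d = 0.5)
# with 30 participants per group
pwr <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)$power

# Sample size per group needed to reach 80% power for the same effect
n_needed <- power.t.test(power = 0.8, delta = 0.5, sd = 1,
                         sig.level = 0.05)$n

c(power_at_n30 = pwr, n_per_group_for_80pct = n_needed)
```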

  • Data analysis using theory-inspired models means using psychological theory to obtain better statistical analyses.

Why should all the basics in statistics be learned?

The pragmatism argument states that the basics should be learned because they are widely used. The incremental knowledge argument is that understanding the basics helps in understanding more advanced statistics. The extensibility of statistical knowledge is the biggest payoff.
