Training: Introductory statistics with R – 24 August 2015

Objective:

• Exploratory data analysis with R. Simple and inferential statistics with a focus on graphics, both for exploration and interpretation. All topics will be illustrated using R, where possible with the R standard example data sets. The training will take one day, and consist of 6-8 sessions, each starting with an introduction to some of the issues, followed by hands-on exercises.

Prerequisites:

Contents:

• 1. Descriptive statistics
Types of variables: nominal/categorical (including binary), ranked, interval, ratio; discrete vs continuous. Examples from standard R datasets: mtcars. How to represent as variables (which type/class?). ‘Factors’ in R: PITA or blessing?

• 1.1. Univariate
Numeric summaries: parametric (mean, variance, standard deviation, coefficient of variation), non-parametric (median, quartiles, quantiles), mode & range. 5-number summaries.
Graphs: box plot, bar graph, histogram (and how it differs from a bar graph), rug plot. Symmetrical vs skewed distributions. Difference between mean and median in skewed distributions. Outliers.
• 1.2. Bivariate: two nominal variates
Numeric summaries: contingency tables; contingency coefficients: measure of association (e.g. hair colour vs eye colour)
Graphs: (heat maps); mosaic plots
• 1.3. Bivariate: nominal vs numerical
Numeric summaries: differences in the mean/median…
Graphs: overlapping histograms or bar graphs; box plot
• 1.4. Bivariate: ordinal vs numerical
Numeric summaries: covariance, correlation
Graphs: heat map, scatter plots
• 1.5. Multivariate:
Numeric summaries: correlation matrix
Graphs: ‘pairs’ (matrix of scatter plots). 3D graphs are difficult to interpret; it is usually better to inspect the individual bivariate relationships separately. Colour, symbol type and symbol size can be varied to represent additional variables.
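
The descriptive topics above can be tried out in a few lines of base R. What follows is a minimal sketch rather than the actual course material: the choice of variables (mpg, wt, cyl, am), the factor labels and the use of the HairEyeColor table for the hair colour vs eye colour example are assumptions made for illustration.

```r
# Illustrative sketch only; variable choices and labels are assumptions.
cars <- mtcars
# Store nominal variables as factors instead of numeric codes
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
cars$cyl <- factor(cars$cyl)
str(cars[, c("mpg", "wt", "cyl", "am")])

# 1.1 Univariate numeric summaries
mpg_summary <- c(mean = mean(cars$mpg), var = var(cars$mpg), sd = sd(cars$mpg),
                 cv = sd(cars$mpg) / mean(cars$mpg),   # coefficient of variation
                 median = median(cars$mpg))
print(mpg_summary)
print(quantile(cars$mpg))   # quartiles
print(fivenum(cars$mpg))    # 5-number summary
print(range(cars$mpg))

# 1.1 Univariate graphs
hist(cars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
rug(cars$mpg)               # adds one tick per observation to the histogram
boxplot(cars$mpg, horizontal = TRUE, xlab = "Miles per gallon")
barplot(table(cars$cyl), xlab = "Number of cylinders", ylab = "Count")

# 1.2 Two nominal variates: contingency table and mosaic plot
hair_eye <- margin.table(HairEyeColor, c(1, 2))   # collapse over Sex
print(hair_eye)
mosaicplot(hair_eye, main = "Hair colour vs eye colour", color = TRUE)

# 1.3 Nominal vs numerical: group means/medians and side-by-side box plots
print(tapply(cars$mpg, cars$am, mean))
print(tapply(cars$mpg, cars$am, median))
boxplot(mpg ~ am, data = cars, xlab = "Transmission", ylab = "Miles per gallon")

# 1.4 Covariance, correlation and a scatter plot
print(cov(cars$mpg, cars$wt))
print(cor(cars$mpg, cars$wt))
plot(mpg ~ wt, data = cars, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# 1.5 Multivariate: correlation matrix and 'pairs' plot,
# with colour and symbol type encoding two extra variables
num_vars <- c("mpg", "disp", "hp", "wt")
print(round(cor(cars[, num_vars]), 2))
pairs(cars[, num_vars], col = as.integer(cars$am), pch = as.integer(cars$cyl))
```

When run non-interactively (for example with Rscript), the plots are written to a PDF file instead of being shown on screen.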
• 2. Inferential statistics
R basics to allow simulations: random numbers, set.seed. Importance of simulated data in understanding concepts and models

• 2.1. Concepts needed for inferential stats:
Law of large numbers; simulation demonstrating how sums of IID (independent and identically distributed) variates approach a normal distribution; form and variance of the distribution of the mean; not everything is normally distributed (e.g. multiplicative rather than additive processes, where the log is normally distributed, hence the need for transformations)
Probability distributions, normal distribution; Z-scores, basis of testing; type I and type II errors; Z test; confidence intervals; significance and power; dangers of repeated testing
• 2.2. Simple statistical tests
• 2.2.1. t-tests: compensating for uncertainty in the estimate of the variance. Different forms of t-test: one- and two-sample; paired sample; equal and unequal variance. ANOVA as an extension of the t-test
• 2.2.2. Linear regression: predicting one variate from another. Significance of correlation, intercept, slope. Confidence bands. Diagnostic plots.
• 2.2.3. Contingency tables, chi-squared test
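
The inferential topics above can be explored with simulated and built-in data. The sketch below is one possible illustration, not the course's own exercises: the seed, the sample sizes, the exponential parent distribution and the choice of the mtcars, sleep and HairEyeColor data sets are all assumptions.

```r
# Illustrative sketch only; seed, sample sizes and data sets are assumptions.
set.seed(42)   # arbitrary seed so the simulation is reproducible

# 2.1 Means of IID draws from a skewed (exponential) distribution
# are approximately normal, with variance sigma^2 / n
n     <- 30     # size of each sample
n_rep <- 5000   # number of simulated samples
sample_means <- replicate(n_rep, mean(rexp(n, rate = 1)))
hist(sample_means, breaks = 40, freq = FALSE,
     main = "Distribution of the sample mean", xlab = "Mean of 30 draws")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE)   # normal approximation

# 2.2.1 t-tests
print(t.test(mtcars$mpg, mu = 20))                         # one-sample
print(t.test(mpg ~ am, data = mtcars))                     # two-sample, unequal variances (Welch)
print(t.test(mpg ~ am, data = mtcars, var.equal = TRUE))   # two-sample, equal variances
print(with(sleep, t.test(extra[group == "1"], extra[group == "2"],
                         paired = TRUE)))                  # paired sample
print(summary(aov(mpg ~ factor(cyl), data = mtcars)))      # ANOVA: more than two groups

# 2.2.2 Linear regression: slope, intercept, confidence band, diagnostics
fit <- lm(mpg ~ wt, data = mtcars)
print(summary(fit))    # estimates with standard errors and p-values
print(confint(fit))    # confidence intervals for intercept and slope
new_wt <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
band <- predict(fit, newdata = new_wt, interval = "confidence")
plot(mpg ~ wt, data = mtcars, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
lines(new_wt$wt, band[, "fit"])
lines(new_wt$wt, band[, "lwr"], lty = 2)   # lower confidence band
lines(new_wt$wt, band[, "upr"], lty = 2)   # upper confidence band
par(mfrow = c(2, 2))
plot(fit)                                  # standard diagnostic plots
par(mfrow = c(1, 1))

# 2.2.3 Chi-squared test of independence on a contingency table
hair_eye <- margin.table(HairEyeColor, c(1, 2))
print(chisq.test(hair_eye))
```

Changing n or replacing rexp with another random number generator is a quick way to see how fast the distribution of the mean approaches the normal curve.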

Trainer:

• Edward Vanden Berghe holds a PhD in Science, with a background in Marine Biodiversity. He has been actively using R for several years, and teaching statistics for several decades.

Details:

• Duration: 8h
• Date and time: 24 August 2015, start at 9 am
• Location: European Data Innovation Hub @ AXA, Vorstlaan 23, 1170 Brussel
• Price: 300€

Registration:

Please register via the Eventbrite web site, at https://www.eventbrite.com/e/training-introductory-statistics-with-r-24-august-2015-tickets-17937387208.