Even though data science has been called the “sexiest job of the 21st century,” most people still shudder at the thought of statistical analysis and all that it means. The main reason this discipline has always seemed so intimidating is probably its close relationship with mathematics.

Whether you are confident in your ability to learn statistical data analysis or are simply keen to find out more about it, this article will help you gain a better understanding of what data science and analytics means, and will look at the various forms of statistical testing and statistical analysis tools that are available today.

At the heart of data science and analytics are five important concepts that form the basis of statistical data analysis. The first four are easier to explain because there is no need to go into detail about their equations.

  • Mean: This is the average value, calculated as the sum of the observations divided by the number of observations.
  • Median: This is the midpoint of the dataset, found by arranging all the observations from the least to the greatest and then taking the value in the middle.
  • Variance: This is the overall spread of the data, calculated as the mean of the squared deviations from the mean.
  • Standard Deviation: This is also a measure of overall spread, calculated by taking the square root of the variance.
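
As a rough illustration, the four measures can be computed with Python's built-in statistics module. The ages below are invented for the example, and the population versions of the variance and standard deviation (pvariance and pstdev) are used on the assumption that the list is the whole group rather than a sample of it.

```python
import statistics

# Invented ages of eight people in a restaurant (illustrative data only).
ages = [21, 22, 22, 23, 24, 25, 26, 29]

mean_age = statistics.mean(ages)            # sum of the values / number of values
median_age = statistics.median(ages)        # middle value of the sorted data
variance_age = statistics.pvariance(ages)   # mean of squared deviations from the mean
std_dev_age = statistics.pstdev(ages)       # square root of the variance

print("mean:", mean_age)
print("median:", median_age)
print("variance:", variance_age)
print("standard deviation:", round(std_dev_age, 2))
```

If the ages were only a sample from a larger crowd, statistics.variance and statistics.stdev (which divide by n - 1) would be the more usual choice.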

Much like the witnesses in a detective novel, these four concepts start to build a picture, or story, around a particular set of data. For instance, if you decided to take a look at the people around you in a restaurant, it would be very difficult to build an interpretation of the crowd based only on their appearance.

However, if you also had information about their monthly income, age, level of education, taste in music, and gender, for instance, then the first two measures could end up telling you that the crowd is a group of twenty-somethings making their way through university, or a group of elderly people who are investing in hedge funds.


How well these measures describe the crowd depends on how much variation there is in what you are measuring. In this example, that is the variability within the crowd.

The more similar the people in the crowd, the better the average represents each of them. The more the crowd varies, the less the average tells you on its own, and the more you need a measure of spread to complete the story.

Variance and standard deviation are both measures of variability: they tell you how far the observations in your data set typically are from the average for a specific variable.
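
To make this concrete, here is a small sketch comparing two invented crowds that share the same mean age but differ greatly in spread. The numbers are made up for illustration, and the population standard deviation is assumed.

```python
import statistics

# Two invented crowds with the same mean age but very different spread.
students = [21, 22, 23, 24, 25]
mixed = [3, 10, 23, 35, 44]

mean_students = statistics.mean(students)
mean_mixed = statistics.mean(mixed)
sd_students = statistics.pstdev(students)
sd_mixed = statistics.pstdev(mixed)

print("students:", mean_students, round(sd_students, 2))
print("mixed:   ", mean_mixed, round(sd_mixed, 2))
```

Both crowds average 23 years, but only in the first one is 23 a fair description of a typical person. The standard deviation is what reveals the difference.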

Statistical testing is almost never-ending, and you could keep going with this example to see how similar the gathering is in terms of age. To do this, you would start by calculating the mean age and then subtracting it from every individual's age. Each resulting number tells you how far that individual is from the average age, and averaging the squares of these numbers gives the variance. The standard deviation, on the other hand, reveals how tightly or loosely your data is clustered around the average. Under a normal distribution, about two-thirds of the observations fall within one standard deviation of the mean.

The standard deviation is just like the variance in terms of what it reveals about the spread of your information. In fact, the standard deviation is calculated by taking the square root of the variance. The difference is that the standard deviation is easier to report because it is expressed in the same units as the original data, whereas the variance is expressed in squared units.
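
The steps described above can be sketched by hand in a few lines of Python. The ages are invented, and dividing by the number of observations (rather than by one less) assumes the list is the entire group.

```python
ages = [21, 22, 22, 23, 24, 25, 26, 29]  # invented ages, in years

mean_age = sum(ages) / len(ages)

# Step 1: how far each person is from the mean age (in years).
deviations = [age - mean_age for age in ages]

# Step 2: square the deviations and average them to get the variance
# (in "years squared", which is hard to interpret directly).
variance = sum(d ** 2 for d in deviations) / len(ages)

# Step 3: take the square root to get back to years.
std_dev = variance ** 0.5

print("deviations:", deviations)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
```

The deviations always sum to zero, which is why they are squared before being averaged.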


What is Probability?

Now that you have the necessary background on the four basic concepts needed for statistical testing, it is possible to introduce the fifth and most important of the statistical analysis tools: probability theory.

Probability theory has a reputation for being intimidating. However, at this stage you only need it to understand the most essential graph you will encounter at the beginning of your data science and analytics journey.

This graph is the bell curve of the normal probability distribution, in which the data is arranged symmetrically around the average. Probability theory is what lets you make sense of it, by way of the central limit theorem, or CLT.

The CLT states that if you repeatedly draw random samples of the same, sufficiently large size from a population and calculate the mean of each one, the sampling distribution of those means will approach a normal distribution.

In other words, regardless of what the distribution in the population looks like, the distribution of the sample means looks more and more like the normal curve as the sample size grows.

Understanding probability as a statistical testing tool not only gives us the language to adequately explain sampling distributions, it is also the precise tool that allows us to calculate them.
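
A small simulation can make the central limit theorem easier to picture. The sketch below is only an illustration under stated assumptions: the population is an invented, strongly skewed set of exponential values, the sample sizes (2, 30, 200) are arbitrary, and the helper names sample_means and share_within_one_sd are made up for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same numbers

# A deliberately skewed, invented population: exponential values with mean 1.
population = [random.expovariate(1.0) for _ in range(100_000)]


def sample_means(sample_size, n_samples=2_000):
    """Draw n_samples random samples and return the mean of each one."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(n_samples)
    ]


def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of their mean."""
    centre = statistics.fmean(values)
    spread = statistics.stdev(values)
    return sum(abs(v - centre) <= spread for v in values) / len(values)


summary = {}
for size in (2, 30, 200):
    means = sample_means(size)
    summary[size] = (statistics.stdev(means), share_within_one_sd(means))
    print(size, round(summary[size][0], 3), round(summary[size][1], 3))
```

For a normal distribution, roughly 68% of values fall within one standard deviation of the mean, so the last column should drift towards about 0.68 as the sample size grows, while the spread of the sample means shrinks.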

Choosing a Statistical Test

Determining the correct method of statistical testing is critical to successful statistical data analysis.

Once you have familiarised yourself with the fundamentals and understand the basic concepts of statistics, the next step can be challenging: deciding which test to run on your data. There is a wide variety of statistical tests available, which can be boiled down to four main categories.

  • Association
  • Comparison
  • Prediction
  • Nonparametric (data that doesn’t follow a normal distribution)

In order to choose which test to perform, it is important to differentiate between the kinds of information you have, based on the different variables that you are analysing. Variables can be categorical or scale variables.

Categorical variables are qualitative and fall into two distinct categories.

  • Ordinal: this has a noticeable order, much like a scale that rates happiness from 1 to 10.
  • Nominal: this has no significant order, like gender.

Scale variables also fall into two categories and are quantitative.

  • Continuous: this can represent any value, like height.
  • Discrete: this can take only separate, countable values, like the number of children.

Tests of Association and When to Use Them

Association tests are intended for observing the relationship between variables. They show whether, and how strongly, two variables move together, but an association on its own does not prove that one variable causes the other. For instance, if you wanted to determine whether there is an association between marital status and level of education, a test of association would measure the strength of the relationship between these two variables.

 

| Type of Test | Variable Types | Example |
| --- | --- | --- |
| Pearson Correlation | Two continuous variables | When shoe size is associated with height. |
| Spearman Correlation | Two ordinal variables | When a strong association between economic status and health is indicated. |
| Chi-Square | Two categorical variables | To determine whether favourite colour and gender can be associated. |
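
To show what these three tests actually compute, here is a minimal pure-Python sketch of the test statistics. All data is invented, the function names are hypothetical helpers written for this example, and no p-values are calculated; in practice a library such as SciPy (for example scipy.stats.pearsonr, spearmanr and chi2_contingency) would normally be used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


def chi_square(table):
    """Chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


heights = [160, 165, 170, 175, 180]   # invented heights in cm
shoe_sizes = [37, 38, 40, 42, 43]     # invented shoe sizes
colour_by_gender = [[20, 30],         # invented counts: rows are genders,
                    [30, 20]]         # columns are two favourite colours

r = pearson_r(heights, shoe_sizes)
rho = spearman_rho(heights, shoe_sizes)
chi2 = chi_square(colour_by_gender)
print(round(r, 3), round(rho, 3), round(chi2, 3))
```

A correlation close to 1 or -1 indicates a strong association and a value near 0 a weak one. For a 2 x 2 table (one degree of freedom), a chi-square statistic above roughly 3.84 is conventionally taken as evidence of an association at the 5% level.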

Means: Tests of Comparison

Tests of comparison look at the differences between groups by comparing their means. For example, you could use one to find out whether where one goes to school makes a difference to standardised exam results.

 

| Type of Test | Variable Types | Example |
| --- | --- | --- |
| Paired T-Test | Two related measurements (the same subjects measured twice) | Differences in weight, before and after supplement programmes. |
| Independent T-Test | Two independent groups | Fuel expenditure of people living in Johannesburg versus Durban. |
| One-Way ANOVA (Analysis of Variance) | One independent variable containing distinct levels and one continuous variable | Comparing average exam results across differing education levels. |
| Two-Way ANOVA | Two independent categorical variables and one continuous variable | Comparing exam results across three education levels and the twelve Zodiac signs. |
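
The statistics behind the first three tests can be sketched in plain Python. Everything below is illustrative: the data is invented, the function names are hypothetical, the independent t-test assumes equal variances in both groups (the pooled, Student's version), and turning a statistic into a p-value is left to a t or F table or to a library such as SciPy (scipy.stats.ttest_rel, ttest_ind and f_oneway).

```python
import math
import statistics


def paired_t(before, after):
    """t statistic for paired data: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(after, before)]
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_error


def independent_t(group_a, group_b):
    """Student's t statistic for two independent groups (pooled variance)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / std_error


def one_way_anova_f(*groups):
    """F statistic: variation between group means over variation within groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented data for illustration only.
weight_before = [82, 90, 76, 88, 95]
weight_after = [80, 87, 75, 84, 92]
fuel_johannesburg = [120, 135, 128, 140, 132]
fuel_durban = [110, 118, 125, 115, 122]
exam_by_level = ([60, 65, 70], [70, 75, 80], [80, 85, 90])

t_paired = paired_t(weight_before, weight_after)
t_independent = independent_t(fuel_johannesburg, fuel_durban)
f_stat = one_way_anova_f(*exam_by_level)
print(round(t_paired, 3), round(t_independent, 3), round(f_stat, 3))
```

In each case, the further the statistic is from zero (or, for the F statistic, the larger it is), the less plausible it becomes that the group means are really the same.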

Linear Regression: Tests of Prediction

Prediction tests are useful for determining whether a change in one or more variables predicts a change in another. For example, with information on gender, income, and diet, you could investigate whether these predict a person's height.

| Type of Test | Variable Types | Example |
| --- | --- | --- |
| Simple Linear Regression | One dependent scale variable with one predictor. | Can height predict weight? |
| Multiple Linear Regression | One dependent variable with two or more predictors. | Can height, age and income predict weight? |
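
As a minimal sketch of simple linear regression, the snippet below fits a straight line by ordinary least squares using a single predictor. The heights and weights are invented, and fit_line is a hypothetical helper written for this example rather than a library function.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


heights = [150, 160, 170, 180, 190]   # invented heights in cm
weights = [50, 58, 65, 74, 83]        # invented weights in kg

slope, intercept = fit_line(heights, weights)
predicted = slope * 175 + intercept   # predicted weight for a height of 175 cm

print("slope:", round(slope, 2))
print("intercept:", round(intercept, 2))
print("predicted weight at 175 cm:", round(predicted, 1))
```

The slope is the expected change in weight for each extra centimetre of height. Multiple linear regression follows the same idea with several predictors, but the arithmetic is usually left to a statistics library.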

Nonparametric Data Tests

These tests should be used when the data does not meet the assumptions of the other tests. For example, this could be when the data does not follow a normal distribution and is very skewed.

 

| Type of Test | Variable Types | Example |
| --- | --- | --- |
| Wilcoxon Rank-Sum Test | Two independent groups | Of two different types of medication, which offers the greatest relief when given to two separate, randomly selected groups from a population? |
| Wilcoxon Signed-Rank Test | Two related measurements | Of two different types of medication, which offers the greatest relief when given to the same group of patients? |
| Friedman Test | Three or more related ordinal or metric variables | Three different adverts are each scored by the same group of people. |
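
Nonparametric tests typically work on the ranks of the data instead of the raw values, which is what makes them robust to skew. The sketch below computes the Wilcoxon rank-sum statistic and the equivalent Mann-Whitney U for two invented groups of relief scores. The helper names are hypothetical and no p-value is calculated; a library routine such as scipy.stats.mannwhitneyu would normally handle that.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def rank_sum(group_a, group_b):
    """Return (W, U) for group_a: its rank sum and the Mann-Whitney U statistic."""
    combined_ranks = ranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    w_a = sum(combined_ranks[:n_a])
    u_a = w_a - n_a * (n_a + 1) / 2
    return w_a, u_a


# Invented relief scores for two separate groups of patients.
medication_a = [2, 4, 5, 7, 9]
medication_b = [1, 3, 3, 6, 8]

w_a, u_a = rank_sum(medication_a, medication_b)
print("rank sum for A:", w_a)
print("U for A:", u_a, "out of a maximum of", len(medication_a) * len(medication_b))
```

A U value close to 0 or to the maximum suggests that one group tends to score higher than the other, while a value near the middle suggests little difference between them.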

There is no doubt that data science and analytics is a complex subject, and most students need extra tuition at some point. If this is you, consider using a tutoring website like Superprof, which hosts Statistics tutors located all over South Africa.

This way, you can home in on the areas of data science where you need help, at a time and place that is convenient for you.

>


Niki

Niki is a content writer from Cape Town, South Africa, who is passionate about words, strategic communication and using words to help create and maintain brand personas. Niki has a PR and marketing background, but her happiest place is when she is bringing a story to life on a page.