# Statistics… Lesson 1

Data Science. Machine Learning.

Chances are you’ve heard those terms a few times. From insights into customer behaviour patterns to categorizing your photos and recognizing faces, Data Science and Machine Learning have changed how we experience technology and business.

So… how does one learn these powers? It all starts with statistics.

**Statistics… better than Calculus?**

Well, no. Calculus plays an important role in machine learning, and deep learning in particular. But that’s a discussion for another time.

Statistics, however, is the foundation of data science and machine learning. And if you want to learn one of these incredible topics, you’ll need to know stats.

# Where to begin?

Good question. Let’s start with a simple definition of what statistics is.

Statistics is simply a way to get information from data. That’s all.

Data can be anything. How many steps did you take today? How fast does your heart beat each minute? How many calories are in that burrito you’re craving? Data provides us with details that, when combined with statistics, can help us make informed decisions.

# Descriptive versus Inferential Statistics

Statistics can be broken down into two categories:

(1) Descriptive Statistics

(2) Inferential Statistics

Descriptive statistics provides us with methods to arrange, summarize and present data in a way that supports better decision making.

To put it another way, let’s say you chose to lead a healthier lifestyle. With a Fitbit on your wrist and My Fitness Pal downloaded on your phone, you begin to track specific information about your day: how many steps you took, the total calories, protein and sugar in the food you ate, and your current weight. Putting that data into a table might look something like this:

It looks nice and organized, but what good is it? Collecting and organizing the data now allows us to consider what descriptive statistics has to offer.

In fact, by putting the raw data (steps, calories, protein, sugar and weight) into the form of a table we’ve already taken the first step: **arranging our data**.

From here, we can **summarize our data** very easily, similar to the below.

Looking at the table above, you will see a new line has been added to the bottom called “Total”. Under each category of data except weight (that is, steps, calories, protein and sugar), the data has been **summarized**. Using a very basic descriptive statistics technique, we are now able to see the total of each column of data for the week that has been collected.

Why was the “weight” column left out of the Total? Good question. That decision stems from understanding the data you are considering. Though weight may fluctuate throughout the week, it wouldn’t make sense to total it. What may make sense is to take the average, which is something we will see later on.
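To make the arranging and summarizing steps concrete, here is a minimal Python sketch. The numbers are invented purely for illustration (they are not the values from the table in this lesson), and the names `week`, `totals` and `average_weight` are my own. It totals the four columns where a total makes sense and takes the average of the weight column instead.

```python
# Hypothetical week of tracking data. Every value here is made up
# for illustration; it is not the data from the lesson's table.
week = [
    {"day": "Mon", "steps": 8000, "calories": 2100, "protein": 90, "sugar": 40, "weight": 180.0},
    {"day": "Tue", "steps": 10000, "calories": 2000, "protein": 95, "sugar": 35, "weight": 179.5},
    {"day": "Wed", "steps": 7500, "calories": 2200, "protein": 85, "sugar": 50, "weight": 180.5},
    {"day": "Thu", "steps": 12000, "calories": 1900, "protein": 100, "sugar": 30, "weight": 179.0},
    {"day": "Fri", "steps": 9000, "calories": 2300, "protein": 80, "sugar": 55, "weight": 179.5},
    {"day": "Sat", "steps": 11000, "calories": 2400, "protein": 90, "sugar": 60, "weight": 180.0},
    {"day": "Sun", "steps": 6500, "calories": 2000, "protein": 90, "sugar": 45, "weight": 179.5},
]

# Columns where adding up the week gives a meaningful number.
columns_to_total = ["steps", "calories", "protein", "sugar"]

# Summarize: one total per column.
totals = {col: sum(row[col] for row in week) for col in columns_to_total}

# Weight is not totalled; an average describes it better.
average_weight = sum(row["weight"] for row in week) / len(week)

for col in columns_to_total:
    print(f"Total {col}: {totals[col]}")
print(f"Average weight: {average_weight:.1f}")
```

Notice that the decision about which columns to total lives in `columns_to_total`. The code cannot make that choice for you; it comes from understanding what each column means.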

# Population versus Sample

Get your highlighter out: this is going to be important.

Critical to statistics in general is understanding the concept of a **population** and the concept of a **sample**.

**A population is the entire data set being studied.**

Continuing with our example above, consider this: every day for an entire year, you diligently collected all of the data in the table above. That means how many steps you took, how many calories you consumed, the amount of protein and sugar in the food you ate and, finally, your weight.

The data that came out of this year long study is called the **population**. A measure from this population — such as the total number of steps you took for the year — would be called a **parameter**.

Now pretend for a moment that, for whatever reason, you chose one month in particular to represent the data you collected for the whole year. This smaller set of data contains the steps, calories, protein, sugar and weight for one month out of the twelve you collected.

This subset — or slice — of the total data is called a **sample**.

A **sample** is a portion of the total data, taken for further study and used to represent the full data set, which is its **population**. A measure from the sample, such as the total number of calories in the month, is called a **statistic**.

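Why is this distinction important? Populations can be quite large, which often makes it impractical to measure every member. With a sufficiently large and representative sample, you can use statistical methods to describe the population and make inferences about it with reasonable accuracy. Drawing conclusions about a population from a sample is the job of inferential statistics.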
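Here is a small Python sketch of the population and sample idea. The step counts are generated from a made-up formula so the example is reproducible; they are not real tracking data, and the variable names are my own. Treating the first 30 days as "one month" is also just an assumption for the example.

```python
from statistics import mean

# Hypothetical population: one year (365 days) of daily step counts.
# The formula is arbitrary and exists only to produce varied values.
population_steps = [8000 + (day * 37) % 4000 for day in range(365)]

# Parameter: a measure computed from the entire population.
population_mean = mean(population_steps)

# Sample: one month (assumed here to be the first 30 days)
# chosen to represent the whole year.
sample_steps = population_steps[:30]

# Statistic: the same measure, computed from the sample only.
sample_mean = mean(sample_steps)

print(f"Parameter (mean steps, full year): {population_mean:.0f}")
print(f"Statistic (mean steps, one month): {sample_mean:.0f}")
```

The two numbers will usually differ a little. How close the statistic lands to the parameter depends on how large the sample is and how well it represents the population.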

Here’s another key point: an element of the population or sample that is being considered, such as steps or calories, is called a **variable**. **Data** is the actual value of the variable (such as 10,000 steps).

# 3 Types of Data

There are three types of data that can be considered:

(1) Interval Data

(2) Nominal Data

(3) Ordinal Data

**Interval Data** are numerical observations, such as your age or how many steps you took in a day.

**Nominal Data** are categorical observations, such as the color of your shirt or whether you walk, bike or take transit to work or school.

**Ordinal Data** is categorical data that can be placed into a particular order. For instance, you might be asked to rate your experience using an app as “Good”, “Needs Work” or “Would never use again”. These responses can be ranked, which is what makes the data ordinal rather than nominal.

The distinction between Interval, Nominal and Ordinal data is important because it describes what you are able to do with the data that you’ve collected.

**Interval data**, since it is numerical, has the most operations available. You are able to calculate descriptive statistical measurements such as the **mean**, **mode** or **standard deviation**.

**Ordinal data** can be studied by considering proportions — how many “Good”, “Needs Work” or “Would never use again” responses were recorded, and in what **frequency**.

**Nominal Data** has the least choice with regard to what can be done with it. Basically, you can count how many observations were made for each category. How many white shirts, how many blue shirts, you get the idea. Proportions can be studied here as well (3/5 shirts were blue).
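The sketch below shows what those differences might look like in Python, using only the standard library. All three lists hold invented observations, and the names (`steps`, `ratings`, `shirts`, `rating_order`) are my own. It assumes the ratings rank from best to worst in the order they were listed above.

```python
from collections import Counter
from statistics import mean, mode, stdev

# Hypothetical observations, invented for illustration.
steps = [8000, 10000, 7500, 12000, 10000]                                     # interval
ratings = ["Good", "Needs Work", "Good", "Would never use again", "Good"]   # ordinal
shirts = ["blue", "white", "blue", "blue", "white"]                          # nominal

# Interval data: numerical measures are available.
steps_mean = mean(steps)
steps_mode = mode(steps)
steps_stdev = stdev(steps)  # sample standard deviation

# Ordinal data: count each response and report the counts in rank order.
rating_order = ["Good", "Needs Work", "Would never use again"]
rating_counts = Counter(ratings)
rating_frequencies = [(level, rating_counts[level]) for level in rating_order]

# Nominal data: count each category and compute a proportion.
shirt_counts = Counter(shirts)
blue_proportion = shirt_counts["blue"] / len(shirts)

print(f"Steps: mean={steps_mean}, mode={steps_mode}, stdev={steps_stdev:.1f}")
print(f"Ratings by rank: {rating_frequencies}")
print(f"Proportion of blue shirts: {blue_proportion}")
```

Calling `mean(shirts)` would raise an error, which is the point: the type of data decides which calculations are meaningful.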

# What’s Next?

I think we’ll stop here for now.

In Lesson 2 we will discuss graphical techniques (charts) and some basic descriptive statistical calculations.

Thank you for reading!