Lesson 2: Basic Statistics | BuildYour.Academy

Lesson Content

Introduction to Data and Statistics

Data is the raw material of data science, and statistics provides the tools to make sense of it. We use statistics to summarize data, identify patterns, and make informed decisions. We'll start with the basics. Think of it like this: if you're building a house (a data analysis project), data is the bricks and statistics is the blueprint.

Types of Data:

Categorical Data: Represents categories or groups. Examples include colors (red, blue, green), customer segments (low, medium, high), or product categories (electronics, clothing, food). Visualizations: Bar charts, pie charts.
Numerical Data: Represents quantities or measurements. Examples include age, income, website traffic, or sales figures. Further divided into:
- Discrete Data: Can only take specific, separate values (e.g., number of children, number of clicks - whole numbers).
- Continuous Data: Can take any value within a range (e.g., height, temperature - decimals possible).

Descriptive Statistics: These are measures that summarize and describe a dataset. They help us get a quick understanding of the data's characteristics.

Mean: The average of a dataset (sum of all values divided by the number of values).
- Example: Data: 2, 4, 6, 8, 10. Mean = (2+4+6+8+10)/5 = 6.
Median: The middle value in a sorted dataset. If there are two middle values, it's the average of those two.
- Example: Data: 1, 3, 5, 7, 9. Median = 5. Data: 1, 3, 5, 7. Median = (3+5)/2 = 4.
Mode: The value that appears most frequently in a dataset. A dataset can have more than one mode if several values tie for the highest frequency.
- Example: Data: 1, 2, 2, 3, 4. Mode = 2.
Range: The difference between the highest and lowest values in a dataset.
- Example: Data: 1, 3, 5, 7, 9. Range = 9 - 1 = 8.
Standard Deviation: A measure of how spread out the data is from the mean. A higher standard deviation indicates more variability.
- While we won't calculate this by hand, understand it is an important measure of dispersion.
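
The measures above can be checked with a few lines of Python. This is a minimal sketch using only the standard-library statistics module and the example datasets from this lesson; the variable names are illustrative. Note that there are two standard deviations: pstdev treats the data as the whole population (divides by n), while stdev treats it as a sample (divides by n - 1).

```python
import statistics

values = [2, 4, 6, 8, 10]
odd_data = [1, 3, 5, 7, 9]

mean_value = statistics.mean(values)             # (2+4+6+8+10)/5 -> 6
median_odd = statistics.median(odd_data)         # middle value -> 5
median_even = statistics.median([1, 3, 5, 7])    # average of 3 and 5 -> 4.0
mode_value = statistics.mode([1, 2, 2, 3, 4])    # most frequent value -> 2
value_range = max(odd_data) - min(odd_data)      # 9 - 1 -> 8

# Standard deviation of 2, 4, 6, 8, 10 around its mean of 6.
population_sd = statistics.pstdev(values)        # sqrt(8), about 2.83
sample_sd = statistics.stdev(values)             # sqrt(10), about 3.16

print(mean_value, median_odd, median_even, mode_value, value_range)
print(round(population_sd, 2), round(sample_sd, 2))
```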

Let's move to visualizing data.

Data Visualization & Distributions

Visualizations bring your data to life! The right chart helps us quickly grasp the data's story.

Choosing the Right Visualization:

Bar Charts: Best for comparing categorical data.
- Example: Compare sales by product category.
Histograms: Display the distribution of numerical data (especially good for showing frequencies across ranges).
- Example: View the distribution of customer ages.
Scatter Plots: Show the relationship between two numerical variables.
- Example: Examine the relationship between ad spend and sales.
Pie Charts: Best for showing proportions of a whole (use sparingly – sometimes bar charts are clearer!).
- Example: Represent market share.
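
To make the idea of a histogram concrete, here is a rough sketch that groups numerical values into equal-width ranges and prints one bar per range as text. The ages are invented for illustration, and bin_counts is a hypothetical helper, not part of any library. In practice you would likely draw this with a plotting library instead.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Return {bin_start: count} for equal-width bins, sorted by bin start."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical customer ages, made up for this example.
ages = [22, 25, 27, 31, 34, 35, 36, 38, 41, 44, 45, 52, 58, 63]

bins = bin_counts(ages, bin_width=10)
for start, count in bins.items():
    # One '#' per customer whose age falls in this 10-year range.
    print(f"{start}-{start + 9}: {'#' * count}")
```

The tallest bar (ages 30-39) shows where most customers fall, which is exactly what a histogram is meant to reveal.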

Data Distributions: A distribution shows how often each value appears in a dataset.

Normal Distribution (Bell Curve): The most widely used distribution in statistics. It's symmetrical, with the mean, median, and mode at the center. Many real-world measurements are approximately normal (e.g., height, test scores). Understanding the normal distribution is key because many statistical tests assume it.
- Key Characteristics: Symmetrical, defined by mean and standard deviation. Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
Other Distributions: There are other distributions like Uniform (all values equally likely), Skewed (asymmetrical), and more. Recognizing them helps us choose the right analytical tools.
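
The 68%, 95%, and 99.7% figures for the normal distribution can be reproduced from its cumulative distribution function. The sketch below uses NormalDist from Python's standard-library statistics module; share_within is a hypothetical helper name chosen for this example.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The same shares hold for any mean and standard deviation, because the rule is stated in units of standard deviations.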

Deep Dive

Explore advanced insights, examples, and bonus exercises to deepen understanding.

Day 2: Data Scientist - Experiment Design & A/B Testing - Extended Learning

Welcome back! You've grasped the fundamentals of descriptive statistics and data distributions. Now, let's explore these concepts further to strengthen your understanding and prepare you for designing and analyzing A/B tests.

Deep Dive Section: Beyond the Basics

Let's expand on the concepts you learned by exploring:

Outliers and Their Impact: While we've discussed mean, median, and mode, consider how outliers (extreme values) can skew the mean, making the median a more robust measure. Understand how to identify outliers using methods like the interquartile range (IQR) and how to handle them (e.g., removing, transforming, or investigating their cause).
Visualizing Data Effectively: Beyond simple histograms and bar charts, consider more sophisticated visualizations based on the data type. Explore box plots (useful for comparing distributions and highlighting outliers), scatter plots (for visualizing relationships between two numerical variables), and violin plots (combining box plots with kernel density estimation).
The Central Limit Theorem (CLT) - A Sneak Peek: This theorem is fundamental to inferential statistics, which you'll delve into later. The CLT states that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the population's distribution, provided the population has finite variance. This is critical for A/B testing, where you compare averages computed from samples rather than from the whole population.
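
The Central Limit Theorem can be illustrated with a small simulation. The sketch below is one possible setup, not a proof: it assumes an exponential population with mean 1 (strongly right-skewed), draws many samples of size 50, and looks at how the sample means behave. The sample size, number of samples, and seed are arbitrary choices for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # observations per sample (assumed for illustration)
NUM_SAMPLES = 2000   # number of samples drawn

def sample_mean(size):
    """Mean of one sample from an exponential population with mean 1."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(size)])

sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

center = statistics.fmean(sample_means)   # should be close to the population mean, 1
spread = statistics.stdev(sample_means)   # should be close to 1 / sqrt(SAMPLE_SIZE)
expected_spread = 1 / SAMPLE_SIZE ** 0.5

# For a roughly normal shape, about 68% of sample means lie within one
# standard deviation of their center.
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / NUM_SAMPLES

print(f"center of sample means: {center:.3f}")
print(f"spread of sample means: {spread:.3f} (theory: {expected_spread:.3f})")
print(f"share within one standard deviation: {within_one_sd:.1%}")
```

Even though individual values from this population are far from bell-shaped, the sample means cluster symmetrically around 1, which is the behavior the theorem describes.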

Bonus Exercises

Exercise 1: Outlier Identification

You have a dataset of website loading times in milliseconds: [250, 300, 310, 320, 330, 340, 350, 360, 370, 1500]. Calculate the IQR and determine if there are any outliers. What might cause an outlier in this context? (Hint: IQR = Q3 - Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile. Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are considered outliers.)
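
The rule in the hint can be sketched in Python as follows. The dataset here is made up (it is deliberately not the loading-time data, so the exercise stays yours to solve), and iqr_outliers is a hypothetical helper. Be aware that there are several conventions for computing quartiles; this sketch assumes the "inclusive" method of statistics.quantiles, and other tools may give slightly different Q1 and Q3 values.

```python
import statistics

def iqr_outliers(values):
    """Return (q1, q3, iqr, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q3, iqr, outliers

# Hypothetical measurements with one extreme value.
measurements = [12, 14, 15, 15, 16, 17, 18, 19, 60]

q1, q3, iqr, outliers = iqr_outliers(measurements)
print(q1, q3, iqr, outliers)  # 15.0 18.0 3.0 [60]

# The extreme value pulls the mean up, while the median barely moves.
print(round(statistics.mean(measurements), 2), statistics.median(measurements))  # 20.67 16
```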

Exercise 2: Visualization Challenge

Imagine you have data on customer purchase amounts for two different product pages (Page A and Page B). Page A's purchase amounts: [10, 15, 20, 25, 30, 100]. Page B's purchase amounts: [12, 18, 22, 28, 32]. Which visualization would be most appropriate to compare the distributions of purchase amounts, and why? (Consider box plots, histograms, and other options.) Sketch or describe how each visualization would look.

Real-World Connections

Understanding data distributions and descriptive statistics is vital across various domains:

Marketing: Analyzing website traffic, conversion rates, and customer behavior. Outliers in purchase values can indicate fraudulent activity or high-value customers.
Healthcare: Monitoring patient vital signs (heart rate, blood pressure) and identifying anomalous readings.
Finance: Analyzing stock prices, detecting unusual trading patterns, and assessing risk.
E-commerce: Evaluating A/B tests on product descriptions, analyzing the distribution of sales, and identifying top-performing products.

Challenge Yourself (Optional)

Download a small, publicly available dataset (e.g., from Kaggle or UCI Machine Learning Repository) related to a topic that interests you (e.g., housing prices, customer churn, or product reviews). Calculate the mean, median, and mode for a few numerical features. Create visualizations (histograms, box plots, etc.) to explore the data distributions. Identify any outliers and discuss their potential impact.

Further Learning

To deepen your understanding, consider exploring these topics:

Inferential Statistics: Hypothesis testing, p-values, and confidence intervals (essential for A/B testing).
Specific Data Visualization Libraries: Learn to use libraries like Matplotlib, Seaborn (Python), or ggplot2 (R) for creating more sophisticated visualizations.
Statistical Software: Explore R or Python (with libraries like pandas and scikit-learn) for advanced data analysis and statistical modeling.
Online Courses/Tutorials: Search for courses on "Data Visualization" or "Descriptive Statistics" on platforms like Coursera, edX, or DataCamp.

Interactive Exercises

Calculate Descriptive Statistics

Calculate the mean, median, and mode for the following dataset: 1, 2, 2, 3, 4, 4, 4, 5, 6.

Data Type Identification

Identify the data type (categorical or numerical, and if numerical, discrete or continuous) for each of the following: Customer age, Customer satisfaction rating (1-5 stars), Product color, Number of website visitors per day.

Visualizing Data

Imagine you have data on the number of clicks a user makes on an A/B tested button. What type of chart would you use to show the number of clicks? Why?

Normal Distribution Practice

Describe the key characteristics of a normal distribution in your own words. How does understanding the normal distribution help with data analysis in A/B testing?

Question 1: A dataset contains the following values: 10, 15, 20, 25, 30. What is the median?

Question 2: Which of the following data types would best be visualized with a histogram?

Question 3: If a dataset has a high standard deviation, what does that indicate?

Question 4: What is the mode of the following dataset: 2, 2, 3, 4, 4, 4, 5?

Question 5: You are analyzing the results of an A/B test. The data suggests that the results are NOT normally distributed. What should you consider?
