Introduction to Pandas & Data Exploration

In this lesson, you'll be introduced to Pandas, a powerful Python library for data manipulation and analysis. You'll learn the fundamentals of loading, exploring, and cleaning data using Pandas, which are essential skills for any data scientist. We'll focus on how to wrangle data to prepare it for deep learning models.

Learning Objectives

  • Load data into Pandas DataFrames from CSV files and from Python dictionaries.
  • Explore data using methods like `head()`, `info()`, and `describe()`.
  • Select specific columns and rows from a DataFrame.
  • Handle missing values using basic techniques.

Lesson Content

Introduction to Pandas

Pandas is a Python library built for data analysis. It provides two main data structures: Series (one-dimensional labeled arrays) and DataFrames (two-dimensional labeled data structures with columns of potentially different types). We'll primarily work with DataFrames in this lesson. To use Pandas, you'll first need to import it:

import pandas as pd
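
To make the difference between the two structures concrete, here is a minimal sketch. The names `ages` and `people` and the sample values are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# A Series: one-dimensional, each value paired with an index label
ages = pd.Series([25, 30, 28], index=["Alice", "Bob", "Charlie"], name="Age")

# A DataFrame: two-dimensional; each column is itself a Series
people = pd.DataFrame(
    {"Age": [25, 30, 28], "City": ["New York", "London", "Paris"]},
    index=["Alice", "Bob", "Charlie"],
)

print(ages["Bob"])          # look up a value by its label
print(type(people["Age"]))  # a single column comes back as a Series
print(people.dtypes)        # columns may hold different types
```

Selecting one column of a DataFrame gives you a Series, which is why the two structures share many of the same methods.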

Loading Data into a DataFrame

The most common way to load data is from a CSV (Comma Separated Values) file. The pd.read_csv() function does the job:

# Assuming you have a file named 'my_data.csv'
df = pd.read_csv('my_data.csv')

# Print the first few rows to see the data
print(df.head())
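
Not every file uses commas or marks missing entries the same way. The sketch below shows two commonly used `read_csv` parameters, `sep` and `na_values`. To stay self-contained it reads from an in-memory string rather than a file on disk; the content of `csv_text` and the marker `missing` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name;age;city
Alice;25;New York
Bob;missing;London
Charlie;28;Paris
"""

# sep sets the delimiter; na_values lists extra strings to treat as missing
sample = pd.read_csv(io.StringIO(csv_text), sep=";", na_values=["missing"])

print(sample.head())
print(sample.dtypes)
```

With a real file you would pass its path in place of `io.StringIO(csv_text)`.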

You can also build a DataFrame directly from a dictionary of lists, where each key becomes a column name and each list becomes that column's values:

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

Exploring Your Data

Once you have your DataFrame, it's crucial to understand it. Here are some useful methods:

  • df.head(): Shows the first few rows (default: 5).
  • df.tail(): Shows the last few rows (default: 5).
  • df.info(): Prints a summary of the DataFrame, including each column's data type, its non-null count, and memory usage.
  • df.describe(): Generates descriptive statistics (count, mean, standard deviation, min, quartiles, max) for numerical columns.
  • df.shape: Returns a tuple representing the dimensions of the DataFrame (rows, columns).

Example:

# info() prints its summary directly and returns None, so no print() is needed
df.info()
print(df.describe())
print(df.shape)
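
The results of these methods are ordinary Python and Pandas objects, so you can work with them further. The following sketch rebuilds the small name/age/city example under the assumed name `people` and shows a few possibilities, including `value_counts()`, which is not covered above.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# shape is a tuple, so it can be unpacked
n_rows, n_cols = people.shape
print(f"{n_rows} rows, {n_cols} columns")

# describe() returns a DataFrame, so individual statistics can be looked up by label
summary = people.describe()
print(summary.loc["mean", "Age"])

# include="all" also summarises non-numeric columns (count, unique, top, freq)
print(people.describe(include="all"))

# value_counts() tallies how often each value occurs in a column
print(people["City"].value_counts())
```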

Selecting Data (Columns and Rows)

You can select specific columns using their names:

# Select the 'Name' column
names = df['Name']
print(names)

# Select multiple columns by passing a list of names (note the double brackets)
subset = df[['Name', 'Age']]
print(subset)

You can select rows by position using slicing (the end position is excluded):

# Select the first 3 rows
rows = df[0:3]
print(rows)

Or using boolean indexing (filtering based on conditions):

# Select rows where age is greater than 28
age_above_28 = df[df['Age'] > 28]
print(age_above_28)
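
Pandas also offers the `.iloc` (position-based) and `.loc` (label-based) indexers, which let you pick rows and columns in a single step. The sketch below is one way to use them alongside combined conditions; it assumes the same illustrative name/age/city data, rebuilt here as `people`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 28],
    "City": ["New York", "London", "Paris"],
})

# iloc selects by integer position: first two rows, first two columns
first_two = people.iloc[0:2, 0:2]
print(first_two)

# loc selects by label and accepts a boolean mask for the rows
older_names = people.loc[people["Age"] > 28, ["Name", "City"]]
print(older_names)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_or_paris = people[(people["Age"] < 26) | (people["City"] == "Paris")]
print(young_or_paris)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used here.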

Handling Missing Values

Missing values, which Pandas represents as NaN (Not a Number), are common in real-world datasets. Here are basic ways to handle them:

  • df.isnull(): Returns a DataFrame of the same shape as the original, with True where values are missing and False otherwise.
  • df.fillna(value): Returns a copy with missing values replaced by the specified value.
  • df.dropna(): Returns a copy with every row that contains at least one missing value removed.

Example:

# When loading real data, read_csv converts empty fields and common markers such as 'NA' to NaN.
# Here we build a small DataFrame with one missing value by hand instead.
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, float('nan'), 9, 10]})

# Check for missing values
print(df.isnull())

# Fill missing values with the mean of the column
mean_col2 = df['col2'].mean()
df['col2'] = df['col2'].fillna(mean_col2)
print(df)

# Remove rows with any remaining missing values (none are left after the fill above, so df is unchanged)
df = df.dropna()
print(df)
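
Before choosing between filling and dropping, it often helps to count how many values are missing in each column. The sketch below shows one way to do that, along with the `subset` parameter of `dropna()`. The DataFrame `scores` and its values are assumptions made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "col1": [1, 2, None, 4, 5],
    "col2": [6, 7, None, 9, None],
})

# Summing the boolean mask gives the number of missing values per column
print(scores.isnull().sum())

# dropna() returns a new DataFrame; the original is left unchanged
complete_rows = scores.dropna()
print(len(complete_rows), "complete rows out of", len(scores))

# subset restricts the check to the listed columns
has_col1 = scores.dropna(subset=["col1"])
print(has_col1)
```

Dropping rows is simple but discards data, so checking the counts first gives you a sense of how much you would lose.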