Data Manipulation with Pandas
This lesson introduces Pandas, a powerful Python library for data manipulation. You'll learn how to load, explore, clean, and transform data using Pandas DataFrames, a fundamental skill for any aspiring data scientist.
Learning Objectives
- Understand the basic structure of a Pandas DataFrame.
- Load data from a CSV file into a Pandas DataFrame.
- Perform essential data manipulation tasks like filtering, sorting, and handling missing values.
- Apply basic data transformations using Pandas.
Lesson Content
Introduction to Pandas
Pandas is a core library in Python used for data analysis and manipulation. It provides data structures like DataFrames, which are similar to tables or spreadsheets. Think of them as organized containers for your data. You'll use Pandas to load, clean, transform, and analyze data efficiently. First, you need to import the Pandas library using import pandas as pd. This allows you to call all of Pandas' functions using the shorthand 'pd'.
Creating and Accessing DataFrames
You can create a DataFrame from various data sources, including lists, dictionaries, or reading from a file. Let's start by creating a DataFrame from a dictionary:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)
This will output a table-like structure. To access a specific column, use bracket notation: df['Name'] returns the 'Name' column as a Series. To access a row by its index label, use .loc[]: for example, df.loc[0] returns the first row (index 0). .iloc[] accesses rows and columns by integer position, starting from 0: df.iloc[0, 1] returns 25, Alice's age (row 0, column 1).
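A minimal sketch tying these accessors together, reusing the example DataFrame from above:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

print(df['Name'])     # the 'Name' column as a Series
print(df.loc[0])      # the first row, selected by index label 0
print(df.iloc[0, 1])  # row 0, column 1 -> 25 (Alice's age)
```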
Loading Data from CSV Files
A common task is loading data from a CSV (Comma Separated Values) file. Let's assume you have a file named 'data.csv'. The Pandas function read_csv() makes this easy:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
The .head() method displays the first few rows of the DataFrame, letting you quickly inspect the data. Make sure the CSV file is in the same directory as your Python script, or specify the full file path. Use df.tail() to show the last few rows, and df.info() to see a summary of the columns, their data types, and their non-null counts.
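Since .tail() and .info() are not shown above, here is a small sketch using an in-memory DataFrame in place of a loaded CSV:

```python
import pandas as pd

# A small in-memory DataFrame stands in for data loaded with read_csv()
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28]})

print(df.tail(2))  # the last two rows
df.info()          # column names, dtypes, and non-null counts
```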
Filtering and Sorting Data
Pandas allows you to filter and sort your data based on specific criteria. Let's say you want to filter for people older than 28:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
older_than_28 = df[df['Age'] > 28]
print(older_than_28)
This creates a new DataFrame older_than_28 containing only the rows where the 'Age' column is greater than 28. To sort the DataFrame by age in ascending order:
sorted_df = df.sort_values(by='Age')
print(sorted_df)
To sort in descending order, pass ascending=False inside the parentheses: df.sort_values(by='Age', ascending=False).
Handling Missing Values
Real-world datasets often contain missing values, represented as NaN (Not a Number) in Pandas. You can use the fillna() method to replace missing values. For example, to replace missing values in a column 'Score' with the mean of that column:
import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, np.nan, 78]}
df = pd.DataFrame(data)
df['Score'] = df['Score'].fillna(df['Score'].mean())
print(df)
This replaces the NaN value with the calculated mean. You can also fill with a specific value: df['Score'] = df['Score'].fillna(0) replaces NaN with 0. Note that fillna() returns a new Series, so you assign the result back to the column.
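An alternative to filling, not covered above, is dropping rows that contain missing values with dropna(); a minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, np.nan, 78]})

# Keep only rows where 'Score' is not NaN
cleaned = df.dropna(subset=['Score'])
print(cleaned)  # Alice and Charlie remain
```

Whether to fill or drop depends on how much data you can afford to lose and whether the missing values are meaningful.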
Basic Data Transformations
You can transform data within your DataFrame. Let's say you want to add 5 to each value in the 'Age' column:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
df['Age'] = df['Age'] + 5
print(df)
Another example: you can create a new column based on an existing one. For instance, to create an 'Age_in_Months' column:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
df['Age_in_Months'] = df['Age'] * 12
print(df)
Deep Dive
Explore advanced insights, examples, and bonus exercises to deepen understanding.
Day 4: Data Scientist Interview Prep - Pandas Deep Dive
Welcome back! Today, we're expanding your Pandas toolkit. We'll build upon the foundational concepts you learned yesterday, diving into more advanced manipulation techniques and exploring how Pandas fits into real-world data science workflows. Remember, mastering Pandas is crucial for data science interviews!
Deep Dive Section: Advanced Pandas Techniques
1. Understanding Data Types and Optimizing Memory Usage
Pandas automatically infers data types, but sometimes this can lead to inefficient memory usage. You can explicitly specify data types during loading (e.g., using the `dtype` parameter in `pd.read_csv()`) or after loading using `.astype()`. This is particularly important when dealing with large datasets where memory constraints are a factor. Understanding data types also helps in avoiding unexpected behavior during calculations. Consider how numerical columns might be stored (e.g., `int64`, `float64`) and how object columns are handled.
import pandas as pd
# Example: Load CSV with specified data types
df = pd.read_csv('your_data.csv', dtype={'col1': 'int32', 'col2': 'float32', 'col3': 'category'})
# Example: Change data type after loading
df['col4'] = df['col4'].astype('int16')
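A self-contained sketch of the memory savings (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'n': range(100_000)})   # default integer dtype is int64
before = df['n'].memory_usage(deep=True)

df['n'] = df['n'].astype('int32')          # half the bytes per value
after = df['n'].memory_usage(deep=True)

print(before, after)  # the int32 column uses roughly half the memory
```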
2. The Power of `apply()`, `map()`, and `applymap()`
These methods allow you to apply custom functions to your data in various ways.
- `apply()`: Applies a function along an axis (row or column) of the DataFrame. Excellent for complex operations.
- `map()`: Applies a function to each element of a Series. Often used for simple transformations or mapping values.
- `applymap()`: Applies a function to each element of the entire DataFrame (use with caution, as it can be slow for large datasets; recent Pandas versions deprecate it in favor of `DataFrame.map()`).
import pandas as pd
# Example: Apply a custom function to a column
def square(x):
    return x * x
df['squared_value'] = df['numeric_column'].apply(square) # Using apply()
# Example: Using map() for category mapping
mapping = {'A': 'Category 1', 'B': 'Category 2'}
df['category_column'] = df['category_column'].map(mapping)
# Example: Using applymap() - not recommended for large datasets
# df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)
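The snippet above assumes an existing `df`; here is a runnable version of the `apply()` and `map()` examples with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'numeric_column': [1, 2, 3], 'category_column': ['A', 'B', 'A']})

# apply(): run a function on each element of a column
df['squared_value'] = df['numeric_column'].apply(lambda x: x * x)

# map(): translate values via a dictionary
df['category_name'] = df['category_column'].map({'A': 'Category 1', 'B': 'Category 2'})

print(df)
```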
3. Grouping and Aggregation (GroupBy)
The `groupby()` operation is fundamental for data analysis. It allows you to split your data into groups based on some criteria (e.g., values in a column), apply a function to each group (aggregation, transformation), and then combine the results. Common aggregation functions include `mean()`, `sum()`, `count()`, `min()`, `max()`, and `std()`. You can also use `.agg()` to apply multiple aggregations at once. Mastering `groupby()` is essential for calculating statistics and identifying trends within your data.
import pandas as pd
# Example: Group by 'category' and calculate the average 'value'
grouped = df.groupby('category')['value'].mean()
# Example: Group by 'category' and calculate multiple aggregations
grouped = df.groupby('category').agg({'value': ['mean', 'sum'], 'other_column': 'count'})
# Example: Counting unique values using groupby
unique_counts = df.groupby('category')['value'].nunique()
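A self-contained version of the `groupby()` examples with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'value': [10, 20, 30, 40, 50],
})

# Mean per group: A -> (10+20)/2 = 15.0, B -> (30+40+50)/3 = 40.0
print(df.groupby('category')['value'].mean())

# Count of distinct values per group
print(df.groupby('category')['value'].nunique())
```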
Bonus Exercises
- Data Type Optimization: Load a sample dataset (you can find one online, like a CSV of sales data). Identify columns with numerical data and experiment with changing their data types to optimize memory usage. Measure the memory footprint before and after. Hint: Use `df.info(memory_usage='deep')`
- `apply()` Challenge: Create a DataFrame with a column of dates. Use `apply()` to create a new column containing the day of the week for each date. Then, group the data by the day of the week.
- `groupby()` Practice: Load a dataset (e.g., from Kaggle). Use `groupby()` to answer the following: What is the average and maximum value of a specific column, grouped by a categorical variable?
Real-World Connections
Pandas is heavily used in various real-world scenarios:
- Data Cleaning and Preprocessing: Preparing raw data for analysis is a major application. Transforming data into the appropriate format, handling missing values, and identifying outliers are all common tasks.
- Exploratory Data Analysis (EDA): Pandas helps uncover patterns and insights in data through filtering, grouping, and aggregation. This includes calculating descriptive statistics, generating visualizations (often with Matplotlib or Seaborn, built upon Pandas DataFrames), and creating summary reports.
- Financial Modeling: Analyzing stock prices, portfolio performance, and risk management often relies on Pandas.
- Business Intelligence: Creating reports and dashboards from databases, often using Pandas for manipulation.
Challenge Yourself
Load a large dataset (e.g., one with millions of rows). Time how long each of the following operations takes, and try to optimize them:
- Filtering a large number of rows based on multiple conditions.
- Grouping by a large category column and calculating aggregate statistics.
- Using `apply()` on a large numerical column. Compare it against a vectorized NumPy approach where possible.
Tip: Experiment with techniques like parallel processing and chunking to see how to increase efficiency.
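As a rough sketch of the apply-versus-vectorized comparison from the last bullet (absolute timings vary by machine):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(1_000_000)})

start = time.perf_counter()
slow = df['x'].apply(lambda v: v * v)  # one Python function call per element
apply_time = time.perf_counter() - start

start = time.perf_counter()
fast = df['x'] ** 2                    # a single vectorized NumPy operation
vector_time = time.perf_counter() - start

print(f"apply: {apply_time:.3f}s, vectorized: {vector_time:.3f}s")
```

The vectorized form is typically orders of magnitude faster because the loop runs in compiled code rather than in Python.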
Further Learning
Expand your knowledge with these topics:
- Time Series Data Analysis with Pandas: Learn how to work with date and time data, including resampling, rolling window calculations, and time-based filtering.
- Merging and Joining DataFrames: Understand how to combine data from multiple sources using `merge()`, `join()`, and `concat()`.
- Pandas and SQL Integration: Explore how to load data from SQL databases directly into Pandas DataFrames and execute SQL queries using Pandas.
- Visualization with Pandas: Explore how to generate plots and graphs directly from Pandas DataFrames using its built-in plotting functions.
Remember to practice consistently and work on projects to solidify your Pandas skills! Good luck with your interview preparation.
Interactive Exercises
Create a DataFrame
Create a Pandas DataFrame from a dictionary with the following information: Name, Sales, and Region. Include at least three entries. Print the DataFrame.
Load and Explore a Dataset
Download a sample CSV file (you can find one online, e.g., a small sales dataset). Load it into a Pandas DataFrame. Use `.head()` and `.info()` to inspect the data.
Filter Data
Using the loaded dataset from the previous exercise, filter the DataFrame to show only rows where a specific condition is met (e.g., sales are above a certain threshold, or a specific region). Print the filtered DataFrame.
Handle Missing Values
Using the dataset, if there are any missing values, use `.fillna()` to fill in the missing values. Describe what you chose to fill the missing values with and why.
Practical Application
Imagine you're analyzing sales data for a retail store. You need to identify the top-selling products, the regions with the highest sales, and any missing data that needs to be addressed. Use Pandas to load the data, filter it based on sales figures, and fill any missing values.
Key Takeaways
Pandas DataFrames are the fundamental data structure for data manipulation in Python.
You can load data from various sources, especially CSV files.
Filtering and sorting data allows you to focus on specific information.
Handling missing values is crucial for data cleaning and analysis.
Next Steps
Review and practice the Pandas concepts covered in this lesson.
In the next lesson, we will move on to more advanced Pandas operations and data visualization techniques.