Data Selection and Filtering with Pandas

In this lesson, you will learn how to select and filter data using the powerful Pandas library in Python. We'll explore various methods for choosing specific columns, rows, and subsets of your data, enabling you to focus on the information you need for analysis. This is a crucial skill for data wrangling and cleaning, setting the stage for more in-depth data exploration.

Learning Objectives

  • Identify and select specific columns from a Pandas DataFrame.
  • Filter rows based on simple and complex criteria using boolean indexing.
  • Understand the difference between `.loc` and `.iloc` for data selection.
  • Apply these techniques to real-world datasets for data cleaning and preparation.

Text-to-Speech

Listen to the lesson content

Lesson Content

Introduction to Data Selection

Data selection is a fundamental part of data wrangling. It involves choosing the specific data you need from a larger dataset. Pandas provides versatile methods to select columns, rows, and subsets based on various criteria. This allows you to focus your analysis on relevant information, making it easier to identify patterns, trends, and anomalies. Let's start by understanding how to select columns.

Selecting Columns:

You can select one or more columns using square brackets [] and the column names. The output is always a Pandas Series (for a single column) or a Pandas DataFrame (for multiple columns).

import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# Select the 'Name' column
name_column = df['Name']
print("Name Column:\n", name_column)

# Select multiple columns
selected_columns = df[['Name', 'Age']]
print("Selected Columns:\n", selected_columns)

Filtering Rows with Boolean Indexing

Boolean indexing allows you to filter rows based on a condition. You create a boolean mask (a series of True/False values) where True indicates the row should be included, and False indicates it should be excluded. This mask is used to index the DataFrame.

# Filter for people older than 25
older_than_25 = df[df['Age'] > 25]
print("People older than 25:\n", older_than_25)

# Filter for people living in London
living_in_london = df[df['City'] == 'London']
print("People living in London:\n", living_in_london)

You can combine multiple conditions using logical operators: & (and), | (or), and ~ (not).

# Filter for people older than 25 AND living in London
filtered_data = df[(df['Age'] > 25) & (df['City'] == 'London')]
print("Filtered data:\n", filtered_data)

`.loc` and `.iloc` - Location-based Selection

Pandas offers two powerful methods for selection based on labels (.loc) and integer-based positions (.iloc). Understanding their differences is crucial for effective data manipulation.

  • .loc: Selects data based on labels (row and column names). The syntax is df.loc[row_label, column_label]. If you provide a row label (or a list/slice of labels), it includes the row with that label. If you provide a column label (or a list/slice of labels), it includes the column with that label.
# Select the row with label 0 and the 'Name' column
name_at_index_0 = df.loc[0, 'Name']
print("Name at index 0:\n", name_at_index_0)

# Select rows with labels 0 and 1, and columns 'Name' and 'Age'
subset = df.loc[[0, 1], ['Name', 'Age']]
print("Subset using .loc:\n", subset)

# Slicing with .loc includes the end index
subset_rows_0_to_2 = df.loc[0:2, :]
print("Slice of rows 0 to 2 (inclusive) with all columns:\n", subset_rows_0_to_2)
  • .iloc: Selects data based on integer positions (index numbers). The syntax is df.iloc[row_index, column_index]. This is similar to how you access elements in a Python list.
# Select the row at index 0 and the column at index 0
first_cell = df.iloc[0, 0]
print("First cell:\n", first_cell)

# Select rows at indices 0 and 1, and columns at indices 0 and 1
subset_iloc = df.iloc[[0, 1], [0, 1]]
print("Subset using .iloc:\n", subset_iloc)

# Slicing with .iloc excludes the end index
subset_rows_0_to_2_iloc = df.iloc[0:2, :]
print("Slice of rows 0 to 1 (exclusive) with all columns:\n", subset_rows_0_to_2_iloc)

Key Differences:

  • .loc uses labels (row and column names). .iloc uses integer positions (index numbers).
  • .loc slice includes the stop index, while .iloc slice excludes the stop index.
Progress
0%