Data Loading, Selection, and Filtering with Pandas

In this lesson, you'll learn how to load data into Pandas, a powerful Python library for data manipulation. You'll then select specific data and filter it based on conditions, two core skills for exploring and preparing data for analysis.

Learning Objectives

  • Load data from CSV files into a Pandas DataFrame.
  • Select specific columns and rows from a DataFrame using various methods.
  • Filter data based on single and multiple conditions using boolean indexing.
  • Understand the importance of data selection and filtering in data exploration and preparation.

Lesson Content

Introduction to Data Loading with Pandas

Pandas provides easy-to-use functions for loading data from different sources; CSV files are the most common. First, import the Pandas library using import pandas as pd. Then call pd.read_csv() to load the file into a DataFrame.

import pandas as pd

# Assuming you have a file named 'my_data.csv'
data = pd.read_csv('my_data.csv')

# Display the first 5 rows
print(data.head())

Ensure that 'my_data.csv' is in the current working directory (usually the folder you run your script from), or provide the full file path. The .head() method returns the first five rows of the DataFrame by default, providing a quick way to inspect the loaded data.
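
If you don't have a CSV file at hand, the loading step can be tried on in-memory text. The following is a minimal sketch: the column names and values are made up for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; StringIO lets read_csv treat a string like a file
csv_text = """Name,Age,Gender
Alice,34,Female
Bob,28,Male
Carla,41,Female
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # column types inferred by read_csv
print(data.head(2))  # head() accepts an optional row count
```

Checking the shape and dtypes right after loading is a cheap way to catch a wrong separator or a numeric column that was read as text.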

Selecting Columns and Rows

Once your data is loaded, you'll often need to select specific columns or rows for analysis.

  • Selecting Columns: Use square brackets [] and specify the column name(s).

    ```python
    # Select a single column (returns a Series)
    column_A = data['ColumnA']
    print(column_A.head())

    # Select multiple columns (returns a DataFrame)
    subset = data[['ColumnA', 'ColumnB', 'ColumnC']]
    print(subset.head())
    ```

  • Selecting Rows: Use .loc[] (label-based indexing) or .iloc[] (integer-based indexing).

    • .loc[]: Selects rows by index label. With the default index, the labels are the integers 0, 1, 2, and so on; with a custom index, they can be names or dates.

      ```python
      # Select rows with index labels 0, 1, and 2
      row_selection_loc = data.loc[[0, 1, 2]]
      print(row_selection_loc)
      ```

    • .iloc[]: Selects rows by integer position.

      ```python
      # Select the first three rows
      row_selection_iloc = data.iloc[0:3]  # Note: the upper bound is exclusive
      print(row_selection_iloc)
      ```

      **Important Note:** The slice 0:3 in .iloc[] selects the rows at positions 0, 1, and 2. The upper bound (3) is exclusive. Label slices in .loc[], by contrast, include both endpoints.
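
The difference between labels and positions only becomes visible when the index is not the default 0, 1, 2, and so on. Here is a minimal sketch, assuming a made-up DataFrame whose index labels run from 100 to 103:

```python
import pandas as pd

# Hypothetical data with non-default index labels
data = pd.DataFrame(
    {"ColumnA": [10, 20, 30, 40], "ColumnB": ["w", "x", "y", "z"]},
    index=[100, 101, 102, 103],
)

by_label = data.loc[[100, 101]]   # rows whose index labels are 100 and 101
by_position = data.iloc[0:2]      # rows at positions 0 and 1 (2 is excluded)
label_slice = data.loc[100:102]   # label slices include both endpoints

print(by_label.equals(by_position))  # True: both select the first two rows
print(len(label_slice))              # 3

# Rows and columns can be selected in one step: rows first, then columns
cell_block = data.loc[[100, 102], ["ColumnA"]]
print(cell_block)
```

With this index, data.loc[[0, 1]] would raise a KeyError because no row carries the label 0, while data.iloc[[0, 1]] still works.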

Filtering Data with Boolean Indexing

Filtering allows you to select rows that meet specific criteria. This is done using boolean indexing. You create a boolean mask (an array of True/False values) and use it to select the desired rows.
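
To see what a mask actually contains, here is a minimal sketch on a small made-up DataFrame (the names and ages are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data for illustration
data = pd.DataFrame({"Name": ["Alice", "Bob", "Carla"], "Age": [34, 28, 41]})

mask = data["Age"] > 30   # a boolean Series aligned with data's index
print(mask.tolist())      # [True, False, True]
print(mask.sum())         # number of matching rows (True counts as 1)

print(data[mask])         # keeps only the rows where the mask is True
```

The same pattern applies to a DataFrame loaded from a file: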

# Assuming you have a column named 'Age'
# Create a boolean mask: select rows where 'Age' is greater than 30
filter_mask = data['Age'] > 30

# Apply the mask to the DataFrame
filtered_data = data[filter_mask]

# Print the filtered data
print(filtered_data.head())

  • Multiple Conditions: You can combine multiple conditions using logical operators: & (AND), | (OR), and ~ (NOT).

    ```python
    # Filter for rows where age is greater than 30 AND gender is 'Male'
    filtered_data = data[(data['Age'] > 30) & (data['Gender'] == 'Male')]
    print(filtered_data.head())

    # Filter for rows where age is greater than 30 OR gender is 'Female'
    filtered_data = data[(data['Age'] > 30) | (data['Gender'] == 'Female')]
    print(filtered_data.head())

    # Filter for rows where gender is NOT 'Female'
    filtered_data = data[~(data['Gender'] == 'Female')]
    print(filtered_data.head())
    ```

    Important: When combining conditions, enclose each condition in its own parentheses, because & and | bind more tightly than comparison operators such as > and ==. Also, use & and | rather than the Python keywords and and or, which do not work element-wise on Pandas Series.
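
The following minimal sketch, again on made-up data, shows a correctly combined filter next to what happens when the keyword and is used in its place:

```python
import pandas as pd

# Hypothetical data for illustration
data = pd.DataFrame({
    "Age": [34, 28, 41, 25],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Correct: each comparison is parenthesized and combined element-wise with &
older_males = data[(data["Age"] > 30) & (data["Gender"] == "Male")]
print(older_males)

# `and` asks Python for a single True/False for a whole Series,
# which pandas refuses by raising a ValueError
try:
    data[(data["Age"] > 30) and (data["Gender"] == "Male")]
except ValueError as err:
    print("ValueError:", err)
```

Only the row with Age 41 passes both conditions; the second attempt fails before any filtering takes place.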
