Data Manipulation with Pandas

This lesson introduces Pandas, a powerful Python library for data manipulation. You'll learn how to load, explore, clean, and transform data using Pandas DataFrames, a fundamental skill for any aspiring data scientist.

Learning Objectives

  • Understand the basic structure of a Pandas DataFrame.
  • Load data from a CSV file into a Pandas DataFrame.
  • Perform essential data manipulation tasks like filtering, sorting, and handling missing values.
  • Apply basic data transformations using Pandas.

Lesson Content

Introduction to Pandas

Pandas is a widely used Python library for data analysis and manipulation. Its central data structure is the DataFrame, a two-dimensional labeled table similar to a spreadsheet: each column has a name and each row has an index label. You'll use Pandas to load, clean, transform, and analyze data efficiently. To get started, import the library with import pandas as pd. The alias pd is a convention that lets you call Pandas functions with a short prefix, as in pd.DataFrame() or pd.read_csv().

Creating and Accessing DataFrames

You can create a DataFrame from various data sources, including lists, dictionaries, and files. Let's start by creating a DataFrame from a dictionary, where each key becomes a column name and each list becomes that column's values:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

This prints a table with one row per person and an automatically generated index (0, 1, 2) on the left. To access a specific column, use bracket notation: df['Name'] returns the 'Name' column as a Series. To access a row by its index label, use .loc[label]. For example, df.loc[0] returns the first row here, because the default index labels start at 0. To access rows and columns by integer position instead, use .iloc[], which also counts from 0. For example, df.iloc[0, 1] returns Alice's age: row position 0 (the first row) and column position 1 (the second column, 'Age').
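The access patterns described above can be tried side by side. The following is a minimal sketch that reuses the same sample data; the variable names names, first_row, and alice_age are chosen here purely for illustration.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris'],
}
df = pd.DataFrame(data)

# Bracket notation returns a single column as a pandas Series.
names = df['Name']

# .loc selects by index label; the default labels are 0, 1, 2.
first_row = df.loc[0]

# .iloc selects by integer position: row 0, column 1 (the 'Age' column).
alice_age = df.iloc[0, 1]

print(names.tolist())     # ['Alice', 'Bob', 'Charlie']
print(first_row['City'])  # New York
print(alice_age)          # 25
print(df.shape)           # (3, 3): three rows, three columns
```

With the default index, .loc[0] and .iloc[0] happen to return the same row. They differ once the index labels are no longer 0, 1, 2, for example after sorting or filtering.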

Loading Data from CSV Files

A common task is loading data from a CSV (Comma Separated Values) file. Let's assume you have a file named 'data.csv'. The Pandas function read_csv() makes this easy:

import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())

The .head() method displays the first five rows of the DataFrame by default, allowing you to quickly inspect the data; pass a number, as in df.head(10), to see more or fewer. A relative path such as 'data.csv' is resolved against the current working directory, so either run your script from the directory that contains the file or specify the full file path. Use df.tail() to show the last few rows. To see a summary of your data, call df.info(), which prints the column names, their data types, and the number of non-null values in each column.
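To try these inspection methods without an existing file, the sketch below first writes a small sample CSV to a temporary directory and then loads it. The file contents and the temporary location are assumptions made only so the example is self-contained; in practice you would pass the path to your own file.

```python
import os
import tempfile

import pandas as pd

# Write a small sample CSV so the example runs without any external file.
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,London\nCharlie,28,Paris\n"
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'data.csv')
with open(csv_path, 'w') as f:
    f.write(csv_text)

# The first line of the file is used as the column names.
df = pd.read_csv(csv_path)

print(df.head(2))  # first two rows
print(df.tail(1))  # last row
df.info()          # prints column names, dtypes, and non-null counts
```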

Filtering and Sorting Data

Pandas allows you to filter and sort your data based on specific criteria. Let's say you want to filter for people older than 28:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

older_than_28 = df[df['Age'] > 28]
print(older_than_28)

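This creates a new DataFrame, older_than_28, containing only the rows where the 'Age' column is greater than 28. With the sample data, that is only Bob's row; Charlie is excluded because 28 is not greater than 28. To sort the DataFrame by age in ascending order: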

sorted_df = df.sort_values(by='Age')
print(sorted_df)

To sort in descending order, add ascending=False inside the parentheses, for example df.sort_values(by='Age', ascending=False). In both cases sort_values() returns a new, sorted DataFrame and leaves df unchanged.
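It can help to see the intermediate step behind the filter. The sketch below, using the same sample data, stores the comparison result in a variable (called mask here for illustration) before using it to select rows, and then sorts in descending order.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris'],
}
df = pd.DataFrame(data)

# The comparison produces a boolean Series with one value per row.
mask = df['Age'] > 28
print(mask.tolist())  # [False, True, False]

# Indexing with the boolean Series keeps only the rows where it is True.
older_than_28 = df[mask]
print(older_than_28['Name'].tolist())  # ['Bob']

# sort_values returns a new DataFrame; df keeps its original order.
by_age_desc = df.sort_values(by='Age', ascending=False)
print(by_age_desc['Name'].tolist())  # ['Bob', 'Charlie', 'Alice']
print(df['Name'].tolist())           # ['Alice', 'Bob', 'Charlie']
```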

Handling Missing Values

Real-world datasets often contain missing values, represented as NaN (Not a Number) in Pandas. You can use the fillna() method to replace missing values. For example, to replace missing values in a column 'Score' with the mean of that column:

import pandas as pd
import numpy as np

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, np.nan, 78]}
df = pd.DataFrame(data)

df['Score'] = df['Score'].fillna(df['Score'].mean())
print(df)

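This replaces the NaN value with the mean of the remaining scores, (85 + 78) / 2 = 81.5, because mean() skips missing values by default. You can also fill with a specific value: df['Score'].fillna(0) returns a copy of the column with NaN replaced by 0. As in the example above, assign the result back to df['Score'] to keep the change.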
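The sketch below walks through both fill strategies on the same sample data and counts the missing values along the way with isna().sum(). The helper variable names (missing_before, filled_with_zero, still_missing) are illustrative only.

```python
import numpy as np
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, np.nan, 78]}
df = pd.DataFrame(data)

# Count the missing values in the column before filling.
missing_before = int(df['Score'].isna().sum())

# fillna returns a new Series; the original column is unchanged
# until the result is assigned back.
filled_with_zero = df['Score'].fillna(0)
still_missing = int(df['Score'].isna().sum())

# Fill with the mean of the non-missing values: (85 + 78) / 2 = 81.5.
df['Score'] = df['Score'].fillna(df['Score'].mean())

print(missing_before)             # 1
print(filled_with_zero.tolist())  # [85.0, 0.0, 78.0]
print(still_missing)              # 1
print(df['Score'].tolist())       # [85.0, 81.5, 78.0]
```

Which fill value is appropriate depends on the data: filling with 0 treats a missing score as a zero score, while filling with the mean leaves the column average unchanged.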

Basic Data Transformations

You can transform data within your DataFrame. Let's say you want to add 5 to each value in the 'Age' column:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

df['Age'] = df['Age'] + 5
print(df)

You can also create a new column based on an existing one. For instance, to create an 'Age_in_Months' column by multiplying each age by 12:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

df['Age_in_Months'] = df['Age'] * 12
print(df)
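Both transformations above rely on element-wise arithmetic: the operation is applied to every value in the column at once, with no explicit loop. The sketch below shows the resulting values; the column name 'Age_plus_5' is a hypothetical choice made here so that the original 'Age' column stays intact.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
})

# Arithmetic on a column is applied to every element at once.
df['Age_plus_5'] = df['Age'] + 5
df['Age_in_Months'] = df['Age'] * 12

print(df['Age_plus_5'].tolist())     # [30, 35, 33]
print(df['Age_in_Months'].tolist())  # [300, 360, 336]
print(list(df.columns))              # ['Name', 'Age', 'Age_plus_5', 'Age_in_Months']
```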