**Data Transformation: Adding/Removing Columns, Renaming Columns, and Combining DataFrames

In this lesson, you'll learn essential data transformation techniques, focusing on manipulating columns and combining datasets. You'll gain practical skills in adding, removing, and renaming columns, along with methods for merging and joining data from multiple sources. These skills are fundamental for preparing data for analysis.

Learning Objectives

  • Add new columns to a DataFrame based on existing columns.
  • Remove unwanted columns from a DataFrame.
  • Rename columns for clarity and consistency.
  • Combine multiple DataFrames using merge and join operations.

Text-to-Speech

Listen to the lesson content

Lesson Content

Adding New Columns

Often, you'll need to create new columns based on existing data. This is crucial for feature engineering. You can create new columns using mathematical operations, string manipulations, or conditional statements.

Example: Suppose you have a DataFrame with 'price' and 'quantity' columns. You can create a new column called 'total_cost' by multiplying these two columns.

import pandas as pd

data = {'price': [10, 20, 30], 'quantity': [2, 3, 4]}
df = pd.DataFrame(data)
df['total_cost'] = df['price'] * df['quantity']
print(df)

Output:

   price  quantity  total_cost
0     10         2          20
1     20         3          60
2     30         4         120

Removing Columns

You might need to remove irrelevant or unnecessary columns to simplify your analysis and reduce noise. The drop() function is your go-to tool for this.

Example: Continuing with the previous example, if we want to remove the 'price' column:

import pandas as pd

data = {'price': [10, 20, 30], 'quantity': [2, 3, 4], 'total_cost': [20, 60, 120]}
df = pd.DataFrame(data)
df = df.drop('price', axis=1)
print(df)

Output:

   quantity  total_cost
0         2          20
1         3          60
2         4         120

Important: axis=1 specifies that we are dropping a column. To drop a row, you'd use axis=0 (which we'll cover later in more advanced lessons). The inplace=True argument can be added to modify the DataFrame directly, without creating a copy, but be cautious as it can modify the original data.

Renaming Columns

Clear and descriptive column names are critical for readability and understanding. You can rename columns using the rename() function or by directly assigning a list of new names. rename() is generally preferred as it is more flexible.

Example: Renaming 'quantity' to 'units_sold':

import pandas as pd

data = {'price': [10, 20, 30], 'quantity': [2, 3, 4], 'total_cost': [20, 60, 120]}
df = pd.DataFrame(data)
df = df.rename(columns={'quantity': 'units_sold'})
print(df)

Output:

   price  units_sold  total_cost
0     10           2          20
1     20           3          60
2     30           4         120

You can rename multiple columns at once by providing a dictionary to the columns argument in rename().

Combining DataFrames: Merging and Joining

Often, your data is split across multiple files or tables. You'll need to combine them. Pandas provides merge() and various join methods to accomplish this.

Example: Imagine you have two DataFrames: one with product information and another with sales data.

import pandas as pd

# Product DataFrame
product_data = {'product_id': [1, 2, 3], 'product_name': ['A', 'B', 'C']}
product_df = pd.DataFrame(product_data)

# Sales DataFrame
sales_data = {'product_id': [1, 2, 1, 3], 'sales': [100, 150, 120, 200]}
sales_df = pd.DataFrame(sales_data)

# Merge DataFrames on 'product_id'
merged_df = pd.merge(sales_df, product_df, on='product_id')
print(merged_df)

Output:

   product_id  sales product_name
0           1    100            A
1           1    120            A
2           2    150            B
3           3    200            C
  • pd.merge(df1, df2, on='column_name'): This is the most common use. It merges df1 and df2 based on the specified column.
  • Join operations have different types such as inner, outer, left, and right that determine how to handle missing data based on the key column.
  • inner: Only keep rows where the key column exists in both dataframes (default).
  • outer: Keep all rows; fills in missing values with NaN.
  • left: Keep all rows from the left DataFrame; fills in missing values from the right DataFrame with NaN.
  • right: Keep all rows from the right DataFrame; fills in missing values from the left DataFrame with NaN.
Progress
0%