Working with DataFrames: Data Manipulation and Analysis

In this lesson, you'll learn how to manipulate and analyze data stored in Spark DataFrames, the fundamental data structure for working with structured data in Spark. You'll explore common operations like filtering, selecting columns, grouping, and performing basic aggregations to gain insights from your data.

Learning Objectives

  • Understand how to select specific columns from a DataFrame.
  • Learn how to filter data based on conditions using the `where` or `filter` methods.
  • Grasp the concept of grouping data and performing aggregate calculations.
  • Apply basic data manipulation techniques to derive meaningful information from a DataFrame.


Lesson Content

Introduction to DataFrames and SparkSession (Review)

A DataFrame in Spark is like a table in a relational database, but distributed across a cluster so it can handle large datasets. It is a structured collection of data organized into named columns and rows. Every Spark program begins with a SparkSession, your entry point to Spark functionality, created with spark = SparkSession.builder.appName("DataFrameExample").getOrCreate(). This establishes a connection to your Spark cluster; we'll use this SparkSession in all of the following examples to load and manipulate data.

Selecting Columns

To select specific columns from a DataFrame, you can use the .select() method. This method takes one or more column names as arguments.

Example:
Assume we have a DataFrame called peopleDF with columns like name, age, and city.

# Assuming peopleDF is already loaded
from pyspark.sql.functions import col

selected_df = peopleDF.select("name", "age") # select columns by name
selected_df.show()

# Selecting columns using col from pyspark.sql.functions
selected_df2 = peopleDF.select(col("name"), col("age")) # equivalent to the above
selected_df2.show()

This code will display a new DataFrame containing only the 'name' and 'age' columns.

Filtering Data

Filtering allows you to select rows that meet specific criteria. You use the .filter() or .where() method (the two are aliases), together with boolean conditions built from comparison operators (e.g., ==, >, <, !=).

Example:

# Assuming peopleDF is already loaded
filtered_df = peopleDF.filter(peopleDF["age"] > 25) # rows where age is greater than 25
# OR
filtered_df = peopleDF.where(peopleDF["age"] > 25) # identical result: where is an alias for filter
filtered_df.show()

This code will display a new DataFrame containing only those rows where the age is greater than 25. You can also combine conditions with & (AND) and | (OR) to express more complex criteria; each condition must be wrapped in parentheses because these operators bind more tightly than the comparisons. For example: peopleDF.filter((peopleDF["age"] > 25) & (peopleDF["city"] == "New York")).

Grouping and Aggregation

Grouping allows you to group data based on the values in one or more columns, and then perform aggregations (e.g., calculating the sum, average, count) within each group. The .groupBy() method groups the data, and aggregation functions (e.g., .count(), .sum(), .avg()) are then applied.

Example:

# Assuming peopleDF is already loaded, with a 'city' column
from pyspark.sql.functions import count, avg

grouped_df = peopleDF.groupBy("city").agg(
    count("*").alias("count"),   # number of people per city
    avg("age").alias("avg_age")  # average age per city
)
grouped_df.show()

This example groups the data by the 'city' column and calculates the count of people and the average age for each city. Note that alias() is used to rename the aggregated columns for readability.
