Introduction to Pandas: DataFrames & Data Exploration

This lesson introduces Pandas, a powerful Python library for data manipulation and analysis. You'll learn how to load data into Pandas DataFrames, explore its structure, clean missing values, and perform basic exploratory data analysis (EDA) to understand your data better.

Learning Objectives

  • Understand the purpose and importance of the Pandas library.
  • Learn how to load data from CSV and Excel files into Pandas DataFrames.
  • Explore and understand the structure of a DataFrame, including its rows, columns, and data types.
  • Apply basic data cleaning techniques, such as handling missing values.

Lesson Content

Introduction to Pandas

Pandas is a fundamental library for data science in Python. It provides high-performance, easy-to-use data structures and data analysis tools. The core data structure in Pandas is the DataFrame, which can be thought of as a table or a spreadsheet with rows and columns. With it, you can clean, transform, and analyze structured data efficiently.

To use Pandas, you first need to import it. The standard convention is to import it as pd:

import pandas as pd

Creating DataFrames (Background)

While loading data is our main focus, it's useful to understand how DataFrames are constructed. You can create a DataFrame from various data structures, such as lists, dictionaries, or NumPy arrays.

import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

This creates a DataFrame where the keys of the dictionary become column headers and the values become the data within each column.
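The same constructor also accepts the other inputs mentioned above. The sketch below is illustrative only: the variable names and sample values are made up for this example. It builds one DataFrame from a list of dictionaries (one dictionary per row) and one from a 2-D NumPy array.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row,
# and the dictionary keys become the column names.
rows = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_from_rows = pd.DataFrame(rows)

# From a 2-D NumPy array: the array has no labels of its own,
# so column names are passed explicitly.
values = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(values, columns=["a", "b"])

print(df_from_rows)
print(df_from_array)
```

Whichever input you start from, the result is an ordinary DataFrame, so everything covered later in this lesson applies to it.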

Loading Data from CSV Files

CSV (Comma Separated Values) files are a common format for storing data. Pandas makes it easy to load CSV files into DataFrames using the read_csv() function. You need to specify the path to your CSV file.

import pandas as pd

# Assuming you have a file named 'data.csv' in the same directory
df = pd.read_csv('data.csv')
print(df)

By default, Pandas treats the first row of the file as the column names, which suits the common case of a file with a header row. If your file uses a separator other than a comma, pass sep; if it has no header row, pass header=None. For example:

df = pd.read_csv('data.csv', sep=';', header=None)

This would read a semicolon-separated file with no header and automatically assign column names (0, 1, 2, ...).
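To try this without a file on disk, the sketch below feeds read_csv() an in-memory string through io.StringIO. The sample text and the column names passed to names are assumptions made for illustration; names is a read_csv() parameter that supplies labels when the file has none.

```python
import io

import pandas as pd

# Stand-in for the contents of a semicolon-separated file with no header row.
raw = "Alice;25;New York\nBob;30;London\n"

# header=None tells Pandas that the first row is data, not column names.
df = pd.read_csv(io.StringIO(raw), sep=";", header=None)
print(df.columns.tolist())  # default integer labels

# names= assigns meaningful column labels instead of 0, 1, 2.
df_named = pd.read_csv(
    io.StringIO(raw), sep=";", header=None, names=["Name", "Age", "City"]
)
print(df_named)
```

With a real file, replace io.StringIO(raw) with the file path; the other arguments stay the same.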

Loading Data from Excel Files

Pandas also supports loading data from Excel files using the read_excel() function. Reading .xlsx files requires the openpyxl library (install with pip install openpyxl).

import pandas as pd

# Assuming you have a file named 'data.xlsx'
df = pd.read_excel('data.xlsx', sheet_name='Sheet1') #  Specify the sheet name
print(df)

You can specify the sheet_name to read a specific sheet from the Excel file. Common options are 'Sheet1', 0 (for the first sheet), or None (to read all sheets into a dictionary of DataFrames).

Exploring the DataFrame

Once you have your data loaded, you need to explore it to understand its structure and content.

  • df.head(): Displays the first 5 rows of the DataFrame (by default). You can specify the number of rows with df.head(10) to view the first 10 rows.
    print(df.head())
  • df.tail(): Displays the last 5 rows of the DataFrame. Similar to head(), you can specify the number of rows to show.
    print(df.tail())
  • df.info(): Prints a summary of the DataFrame, including the number of non-null values, data types of each column, and memory usage.
    df.info()  # prints the summary itself and returns None, so no print() is needed
  • df.describe(): Generates descriptive statistics of numerical columns (count, mean, standard deviation, min, max, and quartiles).
    print(df.describe())
  • df.columns: Displays the column names. This is a useful way to understand what features your dataset contains.
    print(df.columns)
  • df.shape: Returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).
    print(df.shape)
  • df.dtypes: Shows the data type of each column (e.g., int64, float64, object (for strings)).
    print(df.dtypes)
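The inspection tools above can be tried together on a small table. This is a minimal sketch; the DataFrame and its values are invented for illustration and stand in for whatever data you have loaded.

```python
import pandas as pd

# Small made-up dataset with one text column and two numeric columns.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana", "Eli", "Fay"],
    "Age": [25, 30, 28, 41, 35, 22],
    "Score": [88.5, 92.0, 79.5, 85.0, 90.5, 73.0],
})

print(df.shape)    # (number of rows, number of columns)
print(df.head(3))  # first three rows
print(df.tail(2))  # last two rows
print(df.dtypes)   # data type of each column
df.info()          # prints its own summary, so no print() around it

# describe() summarizes only the numeric columns here (Age and Score).
summary = df.describe()
print(summary)
```

A common first pass on a new dataset is shape, then head(), then info(), since together they show how big the table is, what it looks like, and where values are missing.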

Data Cleaning: Handling Missing Values

Real-world datasets often have missing values, represented as NaN (Not a Number) in Pandas. You'll need to handle these before analysis. Here are common techniques:

  • df.isnull(): Returns a DataFrame of the same shape as the original, with True where values are missing and False otherwise.
    print(df.isnull())
  • df.isnull().sum(): Counts the number of missing values in each column.
    print(df.isnull().sum())
  • df.dropna(): Removes rows with missing values. By default, dropna() removes any row that contains at least one missing value. Set how='all' to drop only rows where every value is missing, or use the subset parameter to check specific columns only: df.dropna(subset=['column_name']) removes the rows where column_name is missing.
    df_cleaned = df.dropna()
  • df.fillna(): Replaces missing values. You can fill with a constant (e.g., df.fillna(0)), with each numeric column's mean or median (e.g., df.fillna(df.mean(numeric_only=True))), or with other strategies. Like dropna(), fillna() returns a new DataFrame and leaves the original unchanged.
    df_filled = df.fillna(0)
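The techniques above can be compared side by side on a small example. The sketch below uses an invented DataFrame with one gap in a numeric column and one in a text column; the data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing Age and one missing City.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, np.nan, 28, 41],
    "City": ["New York", "London", None, "Paris"],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop rows only when 'Age' is missing; a missing City is tolerated.
df_age_known = df.dropna(subset=["Age"])

# Fill numeric columns with their own mean. numeric_only=True skips
# text columns, which have no mean, so City stays untouched.
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
```

Whether to drop or fill depends on the analysis: dropping loses rows, while filling with a mean keeps them but flattens the variation in that column.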

Basic Exploratory Data Analysis (EDA)

EDA involves summarizing the main characteristics of a dataset, often using visualizations and basic statistics. We've already covered some aspects of EDA. Here are a few more basic techniques:

  • df['column_name'].value_counts(): Counts the occurrences of each unique value in a specific column. Useful for categorical data.
    print(df['Category'].value_counts())
  • Histograms: While we won't cover visualizations in detail here (that comes in later lessons), you can create simple histograms using df['column_name'].hist(). This shows the distribution of values in a numerical column.

    import matplotlib.pyplot as plt

    df['Age'].hist()
    plt.show()  # display the plot
    (Note: matplotlib must be installed, and matplotlib.pyplot needs to be imported separately. Visualization is covered in more detail later.)
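As a quick illustration of value_counts(), the sketch below counts categories in an invented column. The normalize=True argument is a value_counts() option that returns proportions instead of raw counts; the data is made up for this example.

```python
import pandas as pd

# Made-up categorical column.
df = pd.DataFrame({"Category": ["A", "B", "A", "C", "A", "B"]})

# Raw counts, sorted from most to least frequent.
counts = df["Category"].value_counts()
print(counts)

# Proportions of the total instead of counts.
shares = df["Category"].value_counts(normalize=True)
print(shares)
```

Proportions are often easier to read than raw counts when comparing datasets of different sizes.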
