Introduction to Pandas: DataFrames and Series

This lesson introduces the Pandas library, a fundamental tool for data scientists. You'll learn about the core data structures: DataFrames and Series, and how to create, manipulate, and access data within them. This will lay the groundwork for more advanced data wrangling and exploration techniques.

Learning Objectives

  • Define and differentiate between Pandas Series and DataFrames.
  • Create Pandas Series and DataFrames from various data sources.
  • Understand how to access and select data within DataFrames using indexing and slicing.
  • Describe common DataFrame attributes and methods for data inspection.

Lesson Content

Introduction to Pandas

Pandas is a Python library built for data manipulation and analysis. Its two primary data structures are Series and DataFrames, and you'll work with both extensively as a data scientist. To get started, install Pandas if you don't already have it (pip install pandas), then import it in your Python environment (import pandas as pd). The pd alias is the standard convention.

Pandas Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). Think of it like a single column in a spreadsheet.

Creating a Series:

import pandas as pd

data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64

Series with Custom Index:

import pandas as pd

data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)

Output:

a    10
b    20
c    30
d    40
e    50
dtype: int64
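
A Series can also be created from a dictionary, in which case the keys become the index labels. The following is a minimal sketch; the fruit names and prices are made-up illustration data, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical data: dictionary keys become the index labels.
prices = {'apple': 1.5, 'banana': 0.25, 'cherry': 4.0}
price_series = pd.Series(prices, name='price')

print(price_series)
print(price_series.index.tolist())  # ['apple', 'banana', 'cherry']
print(price_series.dtype)           # float64
```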

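Accessing Series Data: You can access elements by index label with square brackets, or by integer position with .iloc. On a Series with a custom (non-integer) index, position-style access such as series[1] is deprecated in recent Pandas versions, so use .iloc for positions:

print(series['b'])     # Accessing the element with label 'b'
print(series.iloc[1])  # Accessing the element at integer position 1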
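
Slicing works on a Series as well. As a rough sketch using the custom-index Series from above, the explicit .loc (label-based) and .iloc (position-based) accessors make the intent unambiguous. Note that a label slice includes its end label, while a position slice excludes its end position.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = series.loc['b']          # 20, looked up by label
by_position = series.iloc[1]        # 20, looked up by integer position

label_slice = series.loc['b':'d']   # includes 'd': values 20, 30, 40
pos_slice = series.iloc[1:3]        # excludes position 3: values 20, 30

print(label_slice)
print(pos_slice)
```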

Pandas DataFrames

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet or a SQL table. It's the most commonly used Pandas object. Each column of a DataFrame is a Series, and all columns share the same row index.
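
To illustrate the relationship between the two structures, here is a small sketch that assembles a DataFrame from two Series and then pulls one column back out. The variable names (names, ages, age_column) are chosen for illustration only.

```python
import pandas as pd

# Two Series with the same default index (0, 1, 2).
names = pd.Series(['Alice', 'Bob', 'Charlie'])
ages = pd.Series([25, 30, 28])

# Each dictionary key becomes a column label.
df = pd.DataFrame({'Name': names, 'Age': ages})

# Selecting a single column gives back a Series that shares the row index.
age_column = df['Age']
print(type(age_column).__name__)          # Series
print(age_column.index.equals(df.index))  # True
```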

Creating a DataFrame from a dictionary:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris

Creating a DataFrame from a list of lists:

import pandas as pd

data = [['Alice', 25, 'New York'],
        ['Bob', 30, 'London'],
        ['Charlie', 28, 'Paris']]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)

Output (same as above):

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris
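
Another common source is a list of dictionaries, one per row, which is the shape data often takes after parsing JSON. This sketch reuses the same sample values; the records variable name is an assumption for illustration.

```python
import pandas as pd

# One dictionary per row; keys become column labels.
records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'London'},
    {'Name': 'Charlie', 'Age': 28, 'City': 'Paris'},
]
df = pd.DataFrame(records)

print(df)
print(list(df.columns))  # ['Name', 'Age', 'City']
```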

Accessing DataFrame Data:

  • Accessing Columns: df['column_name'] returns a Series.

print(df['Age'])  # Accessing the 'Age' column

  • Accessing Rows (using .loc and .iloc):
      • .loc (label-based): df.loc[row_label] or df.loc[row_label, column_label]
      • .iloc (integer-based): df.iloc[row_index] or df.iloc[row_index, column_index]

print(df.loc[0])          # Accessing the first row by label
print(df.iloc[0])         # Accessing the first row by integer position
print(df.loc[0, 'Name'])  # Accessing the value at row 0, column 'Name'
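
The same accessors also accept lists and slices, which is how you select several rows or columns at once. The sketch below uses the three-row sample DataFrame; the result variable names are illustrative. As with a Series, .loc slices include the end label while .iloc slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 28],
                   'City': ['New York', 'London', 'Paris']})

# A list of column labels returns a DataFrame, not a Series.
subset = df[['Name', 'City']]

# Position-based slice: rows at positions 0 and 1 (position 2 is excluded).
first_two = df.iloc[0:2]

# Label-based slice: row labels 0 through 1 inclusive, two named columns.
loc_rows = df.loc[0:1, ['Name', 'Age']]

# Single value by position: row 2, column 1 (the 'Age' column).
cell = df.iloc[2, 1]

print(subset)
print(first_two)
print(loc_rows)
print(cell)  # 28
```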

DataFrame Attributes and Methods

DataFrames have various attributes and methods for inspection and manipulation.

  • .head(): Displays the first few rows (default: 5).
  • .tail(): Displays the last few rows (default: 5).
  • .info(): Provides a concise summary of the DataFrame (column names, data types, non-null values).
  • .describe(): Generates descriptive statistics (count, mean, std, min, max, etc.) for numerical columns.
  • .shape: Returns a tuple representing the dimensions of the DataFrame (rows, columns).
  • .columns: Returns an index of the column labels.
  • .index: Returns the index (row labels) of the DataFrame.

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 28, 22, 35],
        'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']}
df = pd.DataFrame(data)

print(df.head(2))
df.info()  # info() prints its summary directly and returns None, so no print() is needed
print(df.describe())
print(df.shape)
print(df.columns)
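
The remaining items from the list, .tail() and .index, behave in the same way. As a short sketch on the same five-row data, the snippet below also shows that .describe() returns a DataFrame of its own, so individual statistics can be read with .loc. The summary variable name is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 30, 28, 22, 35],
                   'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']})

print(df.tail(2))  # Last two rows: David and Eve
print(df.index)    # Default row labels 0 through 4

# describe() covers only the numeric column ('Age') by default.
summary = df.describe()
print(summary.loc['mean', 'Age'])  # 28.0
```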