Data Science with Python: Introduction and Data Manipulation

This lesson introduces you to data science using Python, focusing on how to load and manipulate data using the powerful Pandas library. You'll learn the basics of creating and modifying DataFrames, the fundamental structure for organizing data in Python, and get your hands dirty with practical examples.

Learning Objectives

  • Understand the role of Python in data science.
  • Learn how to install and import the Pandas library.
  • Create and understand Pandas DataFrames.
  • Perform basic data manipulation techniques like selecting, filtering and calculating new columns.

Lesson Content

Python and Data Science: A Quick Overview

Python is a versatile and popular programming language widely used in data science. Its clear syntax and extensive libraries make it ideal for tasks like data analysis, machine learning, and visualization. We will use Python and the Pandas library in this lesson.

Why Python?
* Readability: Python's syntax is designed to be easy to read and understand.
* Libraries: It boasts a vast ecosystem of libraries specifically for data science, such as Pandas, NumPy, and Scikit-learn.
* Community: A large and active community means plenty of resources and support.

Setting up Your Environment
* Install Python: Download the latest version of Python from the official website (python.org), choosing the installer for your operating system.
* Install Pandas: Open your terminal/command prompt and run the command: pip install pandas.

Introducing Pandas: The Data Wrangler

Pandas is a Python library built for data manipulation and analysis. Its core data structure is the DataFrame, which is essentially a table of data (like a spreadsheet or SQL table) with rows and columns. Each column has a name and a data type, and each row has an index label, which makes structured data straightforward to select, filter, and transform.

Importing Pandas
To use Pandas, you first need to import it into your Python environment. The conventional way is:

import pandas as pd

The as pd part creates a short alias for the library, so you can call Pandas functions as pd.function_name().

Creating a DataFrame (Example)
Let's create a simple DataFrame from a dictionary of lists:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data)
print(df)

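This prints your DataFrame as a table with three rows, one column per dictionary key, and an automatically generated integer index (0, 1, 2) down the left side.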
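
Before manipulating a DataFrame, it often helps to check its size and the data type Pandas inferred for each column. The following is a minimal sketch that rebuilds the same example data; the variable names n_rows, n_cols and column_names are chosen here for illustration only.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

n_rows, n_cols = df.shape          # (number of rows, number of columns)
column_names = list(df.columns)    # column labels, in the order they were given

print(n_rows, n_cols)
print(column_names)
print(df.dtypes)                   # data type Pandas inferred for each column
```

The exact type names printed for text columns can vary between Pandas versions, so treat the dtypes output as informational.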

Data Selection and Manipulation

Now that you have a DataFrame, let's look at how to select and manipulate data.

Selecting Columns:

print(df['Name'])  # Selects the 'Name' column (returns a Series)
print(df[['Name', 'Age']]) # Selects multiple columns (returns a DataFrame)
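
The two selection styles above return different kinds of objects, which matters once you start chaining operations. A small sketch, using the same example data, that checks the type of each result; names and subset are illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 28],
                   'City': ['New York', 'London', 'Paris']})

names = df['Name']             # single label: a Series (one column of values)
subset = df[['Name', 'Age']]   # list of labels: a DataFrame (a smaller table)

print(type(names).__name__)    # Series
print(type(subset).__name__)   # DataFrame
```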

Filtering Rows (Conditional Selection):

print(df[df['Age'] > 28]) # Selects rows where 'Age' is greater than 28
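
Filters can also combine several conditions. The sketch below, built on the same example data, uses & for AND and | for OR; each condition needs its own parentheses because of Python's operator precedence. The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 28],
                   'City': ['New York', 'London', 'Paris']})

# AND: older than 25 and not living in London
over_25_not_london = df[(df['Age'] > 25) & (df['City'] != 'London')]
print(over_25_not_london)

# OR: younger than 26, or living in Paris
young_or_paris = df[(df['Age'] < 26) | (df['City'] == 'Paris')]
print(young_or_paris)
```

Note that Python's and / or keywords do not work here; Pandas needs the element-wise & and | operators.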

Adding New Columns:

df['Salary'] = [50000, 60000, 55000]  # one value per row, in row order
print(df)

Calculating Columns:

df['Age_in_Dog_Years'] = df['Age'] * 7
print(df)
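
Calculated columns work element by element: the operation is applied to every row at once, with no explicit loop. The following sketch repeats the steps above on a fresh copy of the example data and adds two more derived columns; Monthly_Salary and Is_Over_27 are hypothetical column names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 28],
                   'City': ['New York', 'London', 'Paris']})
df['Salary'] = [50000, 60000, 55000]

# Arithmetic on a column applies to every row
df['Age_in_Dog_Years'] = df['Age'] * 7
df['Monthly_Salary'] = df['Salary'] / 12

# A comparison produces a column of True/False values
df['Is_Over_27'] = df['Age'] > 27

print(df)
```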

Loading Data from External Sources

Working with data stored in files is a crucial skill. Pandas simplifies this with functions like read_csv() and read_excel().

Loading from CSV (Comma Separated Values):
Assuming you have a file named 'data.csv' in your current working directory (usually the folder you run the script from):

df = pd.read_csv('data.csv')
print(df.head()) # Displays the first 5 rows of the DataFrame
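
If you do not have a CSV file at hand, you can try read_csv() on text held in memory. This is a sketch under the assumption that your file looks like the made-up csv_text below; read_csv() accepts a file-like object as well as a path, so io.StringIO stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,28,Paris
"""

# The first line is used as the column headers by default
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first 2 rows; head() with no argument shows up to 5
print(df.shape)     # (rows, columns)
```

Once loaded, the DataFrame behaves like one built from a dictionary, so the same selection and filtering techniques apply.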

Loading from Excel:

df = pd.read_excel('data.xlsx', sheet_name='Sheet1') # replace 'Sheet1' with the sheet name
print(df.head())

Reading .xlsx files requires the openpyxl package. If it is not already installed, run: pip install openpyxl.
