Data Cleaning

This lesson focuses on crucial data cleaning techniques for data scientists. You'll learn how to convert data types, manipulate strings, and handle duplicate entries, ensuring data quality and usability.

Learning Objectives

  • Identify and correct incorrect data types in a dataset.
  • Apply string manipulation techniques to clean and standardize text data.
  • Detect and remove duplicate records from a dataset.
  • Understand the importance of data cleaning in the data science workflow.


Lesson Content

Type Conversion: Making Data Usable

Data often arrives in the wrong format. For example, numbers might be read as strings, or dates as plain text. This prevents calculations and analysis. We'll use the pandas astype() method on DataFrame columns to convert data types.

Example: Let's say we have a column called 'Age' that's been imported as a string.

import pandas as pd

data = {'Name': ['Alice', 'Bob'], 'Age': ['30', '25']}
df = pd.DataFrame(data)
print(df.dtypes) # Check initial data types
df['Age'] = df['Age'].astype(int) # Convert 'Age' to integer
print(df.dtypes) # Check new data types

Here, astype(int) converts the 'Age' column from string to integer. Other common conversions include float, datetime, and bool. Always start by checking data types with df.dtypes to spot the columns that need converting.
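Note that astype(int) raises an error if any value cannot be parsed as a number. One common way to handle messy columns, sketched below with a made-up 'unknown' entry, is pd.to_numeric with errors='coerce', which turns unparseable values into NaN instead of failing:

```python
import pandas as pd

# 'Age' contains a non-numeric value, so astype(int) would raise an error.
data = {'Name': ['Alice', 'Bob', 'Carol'], 'Age': ['30', '25', 'unknown']}
df = pd.DataFrame(data)

# errors='coerce' converts unparseable entries to NaN instead of raising.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
print(df.dtypes)  # 'Age' becomes float64, since NaN requires a float dtype
print(df)
```

Because NaN is a float, the resulting column is float64 rather than int; you can decide afterwards whether to drop or fill the missing values.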

String Manipulation: Cleaning Text Data

Text data often needs cleaning. We use string methods to standardize it. Common operations include:

  • str.lower(): Converts text to lowercase.
  • str.upper(): Converts text to uppercase.
  • str.strip(): Removes leading and trailing whitespace.
  • str.replace(old, new): Replaces occurrences of a substring with another.

Example:

import pandas as pd

data = {'Name': [' Alice ', '  BOB  ', 'Carol']}
df = pd.DataFrame(data)

df['Name'] = df['Name'].str.strip()
df['Name'] = df['Name'].str.lower()
print(df)

This example cleans names by removing extra spaces and converting to lowercase.
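str.replace() is equally useful for stripping unwanted characters. As a small sketch (the phone-number column here is purely illustrative), regex=True lets one call remove several characters at once:

```python
import pandas as pd

# Hypothetical phone-number column with inconsistent formatting.
data = {'Phone': ['555-1234', '555 5678', '(555) 9012']}
df = pd.DataFrame(data)

# Remove dashes, spaces, and parentheses in one pass using a
# regular expression character class.
df['Phone'] = df['Phone'].str.replace(r'[-() ]', '', regex=True)
print(df)
```

After this step every entry contains only digits, so the column is consistent and ready for validation or comparison.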

Handling Duplicates: Ensuring Data Integrity

Duplicate data entries can skew analysis. We use duplicated() and drop_duplicates() in Pandas to address this.

  • df.duplicated(): Identifies duplicate rows (returns a boolean series).
  • df.drop_duplicates(): Removes duplicate rows based on all or selected columns.

Example:

import pandas as pd

data = {'ID': [1, 2, 2, 3], 'Value': [10, 20, 20, 30]}
df = pd.DataFrame(data)

print(df.duplicated())
df = df.drop_duplicates()
print(df)

In this case, the second row with ID 2 is a duplicate and is removed by default (based on all columns). You can specify columns to check for duplicates, e.g., df.drop_duplicates(subset=['ID']).
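The subset parameter matters when rows repeat a key but differ elsewhere. In the sketch below (the data is invented for illustration), two rows share ID 2 with different values, so only a subset-based check catches them; keep='last' chooses which copy survives:

```python
import pandas as pd

# Two rows share ID 2 but have different values, so a full-row
# duplicate check would not flag them.
data = {'ID': [1, 2, 2, 3], 'Value': [10, 20, 25, 30]}
df = pd.DataFrame(data)

# Compare only the 'ID' column; keep='last' retains the most
# recent entry for each ID instead of the first.
deduped = df.drop_duplicates(subset=['ID'], keep='last')
print(deduped)
```

Choosing keep='first' (the default) or keep='last' depends on which record you trust more, e.g. the most recently loaded one.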
