Working with Data in Python: From Basics to Advanced Techniques
Python has emerged as one of the most popular programming languages for data manipulation and analysis, thanks to its simplicity and the power of its libraries.
🕒 Estimated reading time: 10 minutes
Reading and writing files are fundamental operations in Python, often needed for tasks such as data processing, configuration management, and logging. Working with data is a crucial skill: whether you are reading and writing basic text files, loading datasets, or performing complex data manipulations, Python has you covered.
In this post, we'll start with the basics of file handling using open() and then dive into powerful libraries like Pandas for more advanced data manipulation tasks. We'll cover practical examples for beginner, intermediate, and advanced users.
Now, let's dive into the world of data in Python with some practical examples that are also available on Google Colab here 👨‍🔬.
Reading files is one of the fundamental skills you need to acquire. Python's built-in open() function makes this task straightforward. Here's how you can do it:
First, we will download a sample file for our use.
    # Import the urlretrieve function from the urllib.request module
    # This function helps in downloading a file from a given URL
    from urllib.request import urlretrieve

    # Define the URL of the file to be downloaded
    url = 'https://infinitepy.s3.amazonaws.com/samples/lorem_ipsum.txt'

    # Define the local filename where the downloaded file will be saved
    filename = 'lorem_ipsum.txt'

    # Use the urlretrieve function to download the file from the given URL
    # and save it locally with the specified filename
    urlretrieve(url, filename)
We can now open the file.
    # Open the downloaded file for reading ('r' stands for read mode)
    # The 'with' statement ensures that the file is properly closed after its suite finishes
    with open(filename, 'r') as file:
        # Read the entire content of the file
        content = file.read()

    # Print the content of the file to the console
    print(content)
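For larger files, loading everything with read() may use more memory than necessary. Below is a minimal sketch of reading line by line instead. The helper name read_lines and the temporary sample.txt file are assumptions made for illustration, not part of the original example.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a text file with trailing newlines removed."""
    lines = []
    with open(path, 'r', encoding='utf-8') as file:
        # Iterating over the file object yields one line at a time
        for line in file:
            lines.append(line.rstrip('\n'))
    return lines


# Create a small sample file so the sketch runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as file:
    file.write('first line\nsecond line\nthird line\n')

lines = read_lines(sample_path)
print(lines)
```

The same loop works on the downloaded lorem_ipsum.txt if you pass its filename instead.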
Writing data to a file is equally simple:
    # Basic file writing
    # 'with' statement is used to ensure proper acquisition and release of resources.
    # It's a context manager that simplifies working with files by handling file closing automatically.
    with open('output.txt', 'w') as file:
        # 'open' function is used to open a file.
        # The first argument is the name of the file. If the file does not exist, it will be created.
        # The second argument 'w' stands for 'write mode', which allows us to write to the file.
        # 'file.write' method is used to write the provided string to the file.
        # Here, it writes the string 'Hello, world!' to 'output.txt'.
        file.write('Hello, world!')

    # At this point, the 'with' context manager will automatically close the file,
    # even if an error occurs, ensuring that resources are properly released.
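One detail worth checking for yourself: 'w' mode truncates an existing file before writing. The sketch below writes to the same path twice and reads back what is left. The temporary directory and the placeholder strings are chosen only for illustration.

```python
import os
import tempfile

# Use a temporary directory so no real file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# First write: creates the file
with open(path, 'w', encoding='utf-8') as file:
    file.write('First version')

# Second write in 'w' mode: the previous content is discarded
with open(path, 'w', encoding='utf-8') as file:
    file.write('Second version')

with open(path, 'r', encoding='utf-8') as file:
    content = file.read()

print(content)
```

Only the text from the second write remains, so 'w' is the right choice when you want to replace a file rather than extend it.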
Practical example
Below is a practical example that demonstrates how to read from and write to a text file. Let's create a simple task tracker. We'll have two functions: one for reading tasks from a file and another for writing new tasks to a file.
Writing Tasks to a File
First, let's create a function to write tasks to a file. Each task will be written on a new line.
    def write_tasks_to_file(filename, tasks):
        """
        Write a list of tasks to a specified file.

        Args:
            filename (str): The name of the file to write tasks to.
            tasks (list): A list of tasks to be written to the file.
        """
        # Open a file with the given filename in write ('w') mode.
        # The 'with' statement ensures proper acquisition and release of resources.
        # The file is automatically closed after the indented block of code.
        with open(filename, 'w') as file:
            # Iterate over each task in the provided tasks list.
            for task in tasks:
                # Write each task to the file followed by a newline character.
                file.write(task + '\n')

        # Print a message to the console indicating that tasks have been successfully written to the file.
        print(f"Tasks have been written to {filename}")
Reading Tasks from a File
Next, let's read tasks from the file. We'll read each line and return it as a list of tasks.
    def read_tasks_from_file(filename):
        """
        Read tasks from a specified file.

        Args:
            filename (str): The name of the file to read tasks from.

        Returns:
            list: A list of tasks read from the file.
        """
        tasks = []  # Initialize an empty list to hold the tasks
        try:
            # Attempt to open the file in read-only mode ('r')
            with open(filename, 'r') as file:
                # Read all lines from the file into the list 'tasks'
                tasks = file.readlines()
                # Strip newline characters '\n' from the end of each task
                tasks = [task.strip() for task in tasks]
        except FileNotFoundError:
            # If the file is not found, print an error message
            print(f"The file {filename} does not exist.")

        # Return the list of tasks (or an empty list if the file was not found)
        return tasks
Using the Functions
Here’s an example of how you can use these functions:
    filename = 'tasks.txt'

    # Define a list of tasks to write to the file
    tasks_to_write = ['Buy groceries', 'Complete Python tutorial', 'Walk the dog']

    # Call the function to write tasks to the file
    write_tasks_to_file(filename, tasks_to_write)

    # Call the function to read tasks from the file
    tasks_read = read_tasks_from_file(filename)

    print("Tasks read from file:")
    # Print each task read from the file
    for task in tasks_read:
        print("- " + task)
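Because the writing function opens the file in 'w' mode, each call replaces the tasks already saved. If you would rather add to the list, one possible approach is a small helper that opens the file in append mode ('a'). The name append_task is hypothetical, and the file lives in a temporary directory so the sketch runs on its own.

```python
import os
import tempfile


def append_task(filename, task):
    """Add one task to the end of the file (hypothetical helper)."""
    # 'a' mode keeps existing content and creates the file if it is missing
    with open(filename, 'a', encoding='utf-8') as file:
        file.write(task + '\n')


tasks_path = os.path.join(tempfile.mkdtemp(), 'tasks.txt')

append_task(tasks_path, 'Buy groceries')
append_task(tasks_path, 'Walk the dog')

# Read the file back to confirm both tasks were kept
with open(tasks_path, 'r', encoding='utf-8') as file:
    saved_tasks = [line.strip() for line in file]

print(saved_tasks)
```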
Working with CSV Files in Python Pandas
CSV (Comma-Separated Values) files are one of the most common data storage formats in data science, analytics, and machine learning. Pandas, a powerful data manipulation and analysis library for Python, has built-in support for handling CSV files effortlessly.
Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for Python. It is particularly well-suited for working with structured data, like CSV files.
Reading CSV Files
To read a CSV file into a DataFrame, you use the read_csv function:
    # Import the pandas library which is very commonly used for data manipulation and analysis in Python
    import pandas as pd

    # Read a CSV file from the provided URL into a DataFrame
    # A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types,
    # similar to a table in a database or an Excel spreadsheet.
    df = pd.read_csv('https://infinitepy.s3.amazonaws.com/samples/employees.csv')

    # Display the first few rows of the DataFrame to verify the data has been loaded correctly
    print(df.head())  # Optional line for quick data inspection
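If you want to experiment without downloading anything, read_csv also accepts a file-like object. Here is a minimal sketch using io.StringIO. The Name and Salary columns mirror the ones used in this post, while the Department column and all values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """Name,Department,Salary
Alice,Engineering,7200
Bob,Marketing,4800
Carla,Engineering,6100
"""

# StringIO wraps the string so pandas can read it like a file
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # the data type pandas inferred for each column
```

Checking shape and dtypes right after loading is a quick way to confirm that numeric columns were parsed as numbers rather than text.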
Displaying Data
You can display the first few rows of the DataFrame using the head method:
    # The 'head' method of a pandas DataFrame returns the first n rows.
    # By default, it returns the first 5 rows if no argument is given.
    # So, the following line of code prints the first 5 rows of the DataFrame 'df'.
    print(df.head())
Simple Data Manipulation
Let's say you want to select a specific column from the DataFrame:
    # Print the 'Name' column from the DataFrame 'df'
    # The 'Name' column is accessed using df['Name']
    print(df['Name'])
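Selecting one column returns a Series, while passing a list of column names returns a smaller DataFrame. The sketch below shows the difference on a tiny hand-built table. The Department column and all values are invented for the example.

```python
import pandas as pd

# Small illustrative table (values are made up)
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carla'],
    'Department': ['Engineering', 'Marketing', 'Engineering'],
    'Salary': [7200, 4800, 6100],
})

# Single brackets with one label give a Series
names = employees['Name']

# A list of labels inside the brackets gives a DataFrame
subset = employees[['Name', 'Salary']]

print(type(names).__name__)
print(type(subset).__name__)
```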
Filtering and Sorting Data
Filtering data can be done using conditions:
    # The following line filters the DataFrame 'df' to include only rows where the 'Salary' column is greater than 5000.
    # The result is assigned to 'filtered_df', which is a new DataFrame containing the filtered rows.
    filtered_df = df[df['Salary'] > 5000]

    # The following line prints the first 5 rows of 'filtered_df' to the console.
    # This is useful for quickly inspecting the filtered data to ensure it meets our expectations.
    print(filtered_df.head())
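Conditions can also be combined. In pandas this is done with & (and) and | (or), with each condition wrapped in parentheses. Below is a small sketch on invented data, where the Department column is an assumption that may not exist in the sample file.

```python
import pandas as pd

# Illustrative data (values are made up)
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carla', 'Dan'],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Marketing'],
    'Salary': [7200, 4800, 6100, 5300],
})

# Each condition produces a boolean Series; & keeps rows where both are True.
# The parentheses are required because & binds more tightly than > and ==.
high_paid_engineers = employees[
    (employees['Salary'] > 5000) & (employees['Department'] == 'Engineering')
]

print(high_paid_engineers)
```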
To sort data, use the sort_values method:
    # Sort the DataFrame 'df' by the values in the 'Salary' column in descending order (largest to smallest)
    sorted_df = df.sort_values(by='Salary', ascending=False)

    # Print the first 5 rows of the sorted DataFrame
    # We use the head() method to display the top 5 entries of sorted_df
    print(sorted_df.head())
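sort_values can also take several columns, each with its own direction. The sketch below sorts by department alphabetically and then by salary from highest to lowest within each department. The data and the Department column are invented for illustration.

```python
import pandas as pd

# Illustrative data (values are made up)
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carla', 'Dan'],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Marketing'],
    'Salary': [7200, 4800, 6100, 5300],
})

# 'by' lists the sort keys in priority order;
# 'ascending' gives one direction per key.
by_dept_then_salary = employees.sort_values(
    by=['Department', 'Salary'],
    ascending=[True, False],
)

print(by_dept_then_salary)
```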
Project in Action: From Theory to Practice
Now, let's apply what we've learned in a real-world scenario. Suppose you have a CSV file with sales data for different regions, products, and months. You'll analyze this data to gain insights.
Read Data:
    # Read data from a CSV file available at the given URL and store it in a DataFrame called sales_df
    # The read_csv function fetches the data and structures it into a format suitable for analysis
    sales_df = pd.read_csv('https://infinitepy.s3.amazonaws.com/samples/sales_data.csv')
Inspect Data:
    # 'head()' is a built-in method in pandas DataFrame that returns the first n rows.
    # By default, it returns the first 5 rows. You can specify a different number as an argument like 'head(10)' for the first 10 rows.
    sales_df.head()
    # The .info() method is called on 'sales_df', used to get a concise summary of the DataFrame
    # The output will include:
    # - The class type of the DataFrame
    # - The number of non-null entries in each column
    # - The column names and their data types (dtype)
    # - Memory usage of the DataFrame
    # This method is very useful for getting a quick overview of your data,
    # which can aid in understanding its structure and identifying any potential data issues.
    sales_df.info()
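Before cleaning, it can help to count exactly how many values are missing and how many rows are duplicated. Here is a minimal sketch using isna() and duplicated(). The Region, Product, and Sales column names follow the ones used in this project, but the rows are invented.

```python
import pandas as pd

# Illustrative data with one missing Region, one missing Sales value,
# and one fully duplicated row (values are made up)
sales = pd.DataFrame({
    'Region': ['North', 'South', None, 'North'],
    'Product': ['Widget', 'Gadget', 'Widget', 'Widget'],
    'Sales': [100.0, None, 250.0, 100.0],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = sales.isna().sum()

# duplicated() marks rows that repeat an earlier row
duplicate_count = sales.duplicated().sum()

print(missing_per_column)
print(duplicate_count)
```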
Clean Data: Remove duplicates and fill missing values:
    # Drop duplicate rows from the DataFrame 'sales_df' in place.
    # This means any duplicate rows will be removed directly from the 'sales_df' object,
    # and no copy of the DataFrame will be created.
    sales_df.drop_duplicates(inplace=True)

    # Fill any missing (NaN) values in the DataFrame 'sales_df' with 0.
    # The 'inplace=True' parameter means that the DataFrame will be modified directly,
    # without creating a copy.
    sales_df.fillna(0, inplace=True)
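Note that fillna(0) fills every column, including text columns such as Region. If that is not what you want, one alternative is to pass a dictionary with a fill value per column and to chain the steps so the original DataFrame stays untouched. The sketch below uses invented rows, and the 'Unknown' placeholder is an arbitrary choice.

```python
import pandas as pd

# Illustrative data (values are made up)
sales = pd.DataFrame({
    'Region': ['North', 'North', 'South', None],
    'Product': ['Widget', 'Widget', 'Gadget', 'Widget'],
    'Sales': [100.0, 100.0, None, 250.0],
})

cleaned = (
    sales.drop_duplicates()                             # remove repeated rows
         .fillna({'Sales': 0, 'Region': 'Unknown'})     # per-column fill values
         .reset_index(drop=True)                        # renumber rows from 0
)

print(cleaned)
```

Each method returns a new DataFrame, so sales still holds the raw data if you need to compare before and after.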
Analyze Data: Compute total sales per region:
    # Use groupby() to group the data by the 'Region' column
    # Then, chain .sum() to calculate the total sales for each region
    total_sales_region = sales_df.groupby('Region')['Sales'].sum()

    # Print the result which shows the total sales per region
    print(total_sales_region)
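The same grouping can produce several statistics at once through agg(). Below is a small sketch on invented numbers that computes the total, the average, and the number of sales per region.

```python
import pandas as pd

# Illustrative data (values are made up)
sales = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'North'],
    'Sales': [100, 200, 150, 50, 250],
})

# agg() with a list of function names returns one column per statistic
region_summary = sales.groupby('Region')['Sales'].agg(['sum', 'mean', 'count'])

print(region_summary)
```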
Analyze Data: Find the top 5 products by sales:
    # Group the sales data by 'Product' and sum the 'Sales' for each product.
    top_products_sales = sales_df.groupby('Product')['Sales'].sum()

    # Sort the summed sales in descending order to get the best selling products at the top.
    top_products_sales_sorted = top_products_sales.sort_values(ascending=False)

    # Select the top 5 products with the highest sales.
    top_5_products_sales_sorted = top_products_sales_sorted.head(5)

    # Print the result to see the names and sales of the top 5 products.
    print(top_5_products_sales_sorted)
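As an alternative to sorting and then taking head(), a Series also offers nlargest(), which does both in one call. The sketch below compares the two on invented data, using the top 2 products to keep the example short.

```python
import pandas as pd

# Illustrative data (product names and values are made up)
sales = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 'Monitor'],
    'Sales': [10, 40, 25, 30, 5, 15],
})

# Total sales per product
totals = sales.groupby('Product')['Sales'].sum()

# Two ways to get the best sellers
top_via_sort = totals.sort_values(ascending=False).head(2)
top_via_nlargest = totals.nlargest(2)

print(top_via_nlargest)
```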
Save Results: Save cleaned data and analysis results:
    # Export the cleaned sales DataFrame to a CSV file
    # 'index=False' specifies not to write the row index to the file.
    sales_df.to_csv('cleaned_sales_data.csv', index=False)

    # Export the total sales per region (a Series indexed by region) to a CSV file
    # The index is kept here because it holds the region names.
    total_sales_region.to_csv('total_sales_region.csv')

    # Export the top 5 products by sales (a Series indexed by product) to a CSV file
    top_5_products_sales_sorted.to_csv('top_products.csv')
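To confirm that a saved result can be loaded again, you can read the file back and compare. Here is a minimal sketch with an invented per-region total, written to a temporary directory so it does not touch your working folder.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a grouped result: a Series indexed by region (values are made up)
total_sales_region = pd.Series(
    [500, 250],
    index=pd.Index(['North', 'South'], name='Region'),
    name='Sales',
)

out_path = os.path.join(tempfile.mkdtemp(), 'total_sales_region.csv')

# The index (region names) is written as the first column
total_sales_region.to_csv(out_path)

# index_col tells pandas to use the Region column as the index again
reloaded = pd.read_csv(out_path, index_col='Region')

print(reloaded)
```

Note that the reloaded object is a DataFrame with a single Sales column rather than a Series.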
Conclusion
Whether you are aiming to become a data scientist or just looking to use data more effectively in your current role, a solid grasp of Python's data capabilities will serve you well.
From reading and writing basic files with Python's open() function to leveraging the powerful Pandas library for complex data manipulations, we've covered a range of techniques for working with data in Python. No matter your skill level, these methods form the foundation of effective data handling and analysis in Python.
Feel free to reply to this newsletter with any questions or topics you'd like us to cover in the future.
If you liked this newsletter, don't forget to subscribe to receive regular updates. Share with your friends and colleagues interested in Python and let's grow together in our community of programmers!
Remember, the key to mastery is practice and persistence. Happy coding! Until the next edition, keep programming! 👨‍💻
InfinitePy Newsletter - Your source for Python learning and inspiration.