Analyzing Excel Sales Data with Python Pandas and Seaborn - Part II

In the modern data-driven world, analyzing sales data can offer tremendous insights into business performance, customer behaviour, and potential growth opportunities.

๐Ÿ•’ Estimated reading time: 8 minutes

Efficient data manipulation is a cornerstone of any data analysis process, and mastering the ability to select, filter, and aggregate data is essential for deriving meaningful insights.

In this multi-part series, we will delve into some foundational techniques for working with DataFrames in pandas, focusing on selecting and filtering data, merging DataFrames, and calculating aggregated metrics.

You'll learn how to filter rows based on specific conditions, merge multiple DataFrames to combine disparate data, create new calculated columns, and even visualize the results using a horizontal bar chart. Whether you're a seasoned data professional or just starting out, these techniques will help enhance your data manipulation skills and make your analyses more robust.

Now, let's dive into the world of analyzing Excel data with some practical examples that are also available on Google Colab here ๐Ÿ‘จโ€๐Ÿ”ฌ.

To delve into the first article on Analyzing Excel Sales Data with Python Pandas and Seaborn, please click here.

Selecting and Filtering Data

Selecting and filtering data is a fundamental part of working with datasets. Let's filter the DataFrame rows where the Product column contains the word Apple.

# Filter the orders DataFrame to only include orders for products that contain "Apple" in their name.
# The str.contains() method is used to check if the 'Product' column contains the string "Apple".
product_orders_df = orders_df[orders_df['Product'].str.contains("Apple")]

# Display the first few rows of the filtered DataFrame to verify the results.
# head() method returns the first 5 rows by default.
product_orders_df.head()
Orders for products that contain "Apple" in their name.

Orders for products that contain "Apple" in their name.

Here's a breakdown of the steps:

  • orders_df[orders_df['Product'].str.contains("Apple")]: This line filters orders_df based on the boolean Series created by str.contains().

  • Only the rows where the Product column contains Apple are kept in the new DataFrame product_orders_df.

Now, let's filter the customers_df DataFrame to keep only the rows where the Address column includes two-letter state or territory abbreviations followed by a five-digit postal code, similar to a USA ZIP code.

# Define a regex pattern that matches:
# - A word boundary (\b)
# - Two uppercase letters ([A-Z]{2})
# - A space
# - Five digits (\d{5})
# - Another word boundary (\b)
pattern = r'\b[A-Z]{2} \d{5}\b'

# Apply a filter on the 'Address' column of the DataFrame:
# - `str.contains(pattern, regex=True, na=False)` uses the regex pattern to check each element in the 'Address' column
# - `na=False` means we exclude any rows where the 'Address' is NaN (missing) from the result
filtered_customers_df = customers_df[customers_df['Address'].str.contains(pattern, regex=True, na=False)]

# Display the first 5 rows of the filtered DataFrame to get a glimpse of the data
filtered_customers_df.head()
Rows where the Address column includes postal code, similar to a USA ZIP code.

Rows where the Address column includes postal code, similar to a USA ZIP code.

Here's a breakdown of what each part of the regex pattern does:

  • \b: Matches a word boundary, ensuring that the pattern is at the start of a word.

  • [A-Z]{2}: Matches exactly two uppercase letters.

  • \d{5}: Matches exactly five digits.

  • \b: Matches a word boundary, ensuring that the pattern is at the end of a word.

Merging DataFrames

Merging is similar to SQL joins and it allows you to combine multiple DataFrames into a single one, facilitating the integration and comparison of datasets.

Syntax

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

Parameters:

  • left, right: DataFrames to be merged.

  • how: Type of merge to be performed (left, right, outer, inner). Default is inner.

  • on: Column or index level names to join on. Must be found in both DataFrames.

  • left_on, right_on: Columns or index levels from the left and right DataFrames to use as keys.

  • left_index, right_index: Use the index from the left or right DataFrame as the join key(s).

  • sort: Sort the result DataFrame by the join keys.

  • suffixes: Suffix to apply to overlapping column names.

  • indicator: Adds a column to the output DataFrame called _merge with information on the source of each row.

# Merge the two DataFrames on the 'CustomerID' column
# 'inner' join means only rows with matching CustomerID in both DataFrames will be included.
# The result is a new DataFrame that combines columns from both original DataFrames.
merged_df = pd.merge(orders_df, customers_df, on='CustomerID', how='inner')

# Display the first 5 rows of the merged DataFrame
# This is useful for quickly inspecting the result of the merge operation.
merged_df.head()
Merge the two DataFrames on the 'CustomerID' column.

Merge the two DataFrames on the 'CustomerID' column.

Creating a new column Total in the merged_df DataFrame that will store the result of multiplying Quantity by Price for each row in the DataFrame

# Assume 'merged_df' is a DataFrame that has already been created.

# 'Quantity' likely represents the number of units of a product or items.
# 'Price' is the cost of one unit of the product or item.

# Create a new column 'Total' in the 'merged_df' DataFrame.
# This new column will store the result of multiplying 'Quantity' by 'Price' for each row in the DataFrame.
merged_df['Total'] = merged_df['Quantity'] * merged_df['Price']

# Display the first few rows of the 'merged_df' DataFrame to verify the new column.
merged_df.head()
Column 'Total' in the 'merged_df' DataFrame.

Column 'Total' in the 'merged_df' DataFrame.

This operation is vectorized, meaning that it will be performed for each row in the DataFrame without the need for explicit loops. Vectorized operations are efficient and fast, taking advantage of low-level optimizations in the underlying libraries (like NumPy) that pandas rely on.

Calculating the total revenue for each product

  • Step 1: Group the data by the Product column.

    • This step will create a DataFrame with unique products as rows.

    • Each row will represent a unique product.

  • Step 2: Aggregate the data by summing up the Price for each product.

    • The sum() function calculates the total revenue for each product.

    • The result is a DataFrame where each product's total revenue is calculated.

  • Step 3: Reset the index (.reset_index()) to turn the grouped object back into a normal DataFrame.

    • This step makes the Product column a regular column again instead of the index.

  • Step 4: Renames the Price column to Revenue to clearly indicate that it represents total revenue for each product.

# Assuming merged_df is a DataFrame that has already been created and contains the relevant data.

product_revenue_df = merged_df.groupby('Product')['Total'].sum().reset_index().rename(columns= {'Total':'Revenue'})

# Display the result DataFrame
product_revenue_df
Total revenue for each product.

Total revenue for each product.

Selecting the Top 10 Products based on its Revenue

# Sort the 'product_revenue_df' DataFrame by the 'Revenue' column in descending order.
# The 'Revenue' column likely contains revenue figures to determine the top-selling products.
sorted_products = product_revenue_df.sort_values(by='Revenue', ascending=False)

# Select the top 10 rows from the sorted DataFrame to identify the top 10 products by revenue.
top_10_products = sorted_products.head(10)

# Display the result DataFrame
top_10_products
Top 10 Products based on its Revenue.

Top 10 Products based on its Revenue.

Horizontal bar chart displaying the top 10 Products by Revenue

# Create a figure and a set of subplots with a specific size
plt.figure(figsize=(12, 8))  # Size of the figure is 12x8 inches

# Create a horizontal bar plot using seaborn's barplot function.
# y='Product': Specifies that the y-axis should represent the 'Product' column from the DataFrame.
# x='Revenue': Specifies that the x-axis should represent the 'Revenue' column from the DataFrame.
# data=top_10_products: Uses the 'top_10_products' DataFrame as the data source.
# hue='Product': Different colors for different products.
# dodge=False: Ensures bars are not separated if 'hue' is specified (but here it's redundant).
# palette='viridis': This sets the colormap to 'viridis' for the bars.
bar_plot = sns.barplot(
    y='Product',
    x='Revenue',
    data=top_10_products,
    hue='Product',
    dodge=False,
    palette='viridis'
)

# Loop through the top 10 products to add labels on each bar.
# This loop uses the index (i) and the DataFrame (top_10_products)
for i in range(len(top_10_products)):
    # Extract the 'Revenue' value for the current product
    total_revenue = top_10_products.iloc[i]['Revenue']

    # Add text labels to the bar plot.
    # total_revenue: x-coordinate for the text label.
    # i: y-coordinate for the text label.
    # f'${total_revenue:,.2f}': Format the 'Revenue' as currency with commas and 2 decimal places.
    # color='black': Color of the text label.
    # ha="center": Horizontal alignment of the text centered on the bar.
    bar_plot.text(
        total_revenue,
        i,
        f'${total_revenue:,.2f}',
        color='white',
        ha="right"
    )

# Set the title of the plot
plt.title('Top 10 Products by Total Revenue')

# Set the label for the x-axis
plt.xlabel('Total Revenue')

# Set the label for the y-axis
plt.ylabel('Product')

# Display the plot to the screen
plt.show()
Top 10 Products by Total Revenue.

Top 10 Products by Total Revenue.

Conclusion

Mastering the techniques of selecting, filtering, merging, and aggregating data in pandas provides a powerful toolkit for data analysis. You can now confidently filter data to focus on specific subsets, merge DataFrames to combine related datasets, and calculate meaningful metrics like total revenue for each product.

Additionally, visualizing the results with a clear and insightful plot can make your analysis even more compelling. These skills not only make your data manipulation more efficient but also add depth to your analytical capabilities, empowering you to uncover and present data-driven insights more effectively.

Feel free to reply to this newsletter with any questions or topics you'd like us to cover in the future.

If you liked this newsletter, don't forget to subscribe to receive regular updates. Share with your friends and colleagues interested in Python and let's grow together in our community of programmers!

Remember, the key to mastery is practice and persistence. Happy coding! Until the next edition, keep programming! ๐Ÿ‘จโ€๐Ÿ’ป

InfinitePy Newsletter - Your source for Python learning and inspiration.