Analyzing Excel Sales Data with Python Pandas and Seaborn - Part III

Visualizing Seasonal Trends, Customer Revenue, and Top Products for Optimal Sales Strategies

🕒 Estimated reading time: 10 minutes

Welcome back to the third part of our series on analyzing Excel sales data using Python's libraries — Pandas and Seaborn. If you missed our earlier articles, you can catch up by clicking here and here.

In today’s session, we'll cover:

  • Deep Analysis of Sales Patterns for 2024: Track product sales activity over time and observe any seasonal variations.

  • Total and Average Revenue per Customer: Estimate the total and average revenue each customer brings to the business, improving our understanding of customer value.

  • Histogram of Average Order Value (AvgOrderValue): Understand how average order values are distributed.

  • Distribution of Order Quantities: Examine the frequency distribution of order quantities to understand typical order sizes and identify any anomalies.

  • Top 10 Customers Based on Total Expenses: Identify key clients by their contribution to sales and review their buying patterns.

  • Orders from the Top 10 Customers: Analyze what our most loyal customers are buying.

  • Top 10 Best-Selling Products: Identify our most frequently sold products.

All the examples are also explained in a corresponding Google Colab notebook 👨‍🔬, making your learning even more interactive.

Deep Analysis of Sales Patterns for 2024

Analyzing 2024 sales data over time is crucial for understanding trends and seasonal patterns. By leveraging the robust visualization tools offered by Seaborn, you can gain insights not just into the peaks and troughs of your sales performance, but also into the factors driving them.

# Convert OrderDate to datetime
# This converts the 'OrderDate' column of the DataFrame to a datetime format.
# This is important for time series analysis, as it allows for proper date-based indexing and plotting.
merged_df['OrderDate'] = pd.to_datetime(merged_df['OrderDate'])

# Filter data for the year 2024 only
merged_df_2024 = merged_df[merged_df['OrderDate'].dt.year == 2024]

# Set OrderDate as the index
# By setting 'OrderDate' as the index of the DataFrame, we can perform easier time-based operations and filtering.
time_series = merged_df_2024.set_index('OrderDate')

# Set the figure size
plt.figure(figsize=(14, 8))  # You can adjust the width and height as per your requirement

# Set a color palette for the plot
# 'flare' is a sequential color palette. sns.set_palette() applies it to subsequent plots;
# sns.color_palette("flare", as_cmap=True) would only return a colormap without changing the plot.
sns.set_palette("flare")

# Plot sales data over time
# This uses Seaborn to create a line plot, with 'OrderDate' on the x-axis and 'Price' on the y-axis.
# 'data=time_series' specifies our data source and columns for plotting.
sns.lineplot(data=time_series, x='OrderDate', y='Price')

# Set the title for the plot
plt.title('2024 Sales Over Time')

# Rotate x labels for better readability
plt.xticks(rotation=60)

# Display the plot
plt.show()

Deep Analysis of Sales Patterns for 2024.
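
If the raw order-level line looks noisy, one optional refinement is to resample the series to monthly totals before plotting, which often makes seasonal patterns easier to spot. The snippet below is a minimal sketch, assuming the same time_series DataFrame built above; the 'ME' frequency alias requires a recent pandas (use 'M' on older versions).

# Aggregate the order-level data into monthly sales totals
monthly_sales = time_series['Price'].resample('ME').sum()

# Plot the monthly totals
plt.figure(figsize=(14, 6))
sns.lineplot(x=monthly_sales.index, y=monthly_sales.values)
plt.title('2024 Monthly Sales Totals')
plt.ylabel('Total Sales')
plt.show()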

Calculate total and average revenue per customer

When analyzing revenue, two key metrics stand out: total revenue and Average Revenue Per User (ARPU).

Average Revenue Per User (ARPU) measures the revenue generated from each customer over a specific period. This metric is crucial for assessing customer profitability, evaluating the success of pricing strategies, and understanding the effectiveness of market segmentation.

# Step 1: Calculate total revenue per customer
# Group the data by 'CustomerID' and sum up the 'Price' for each customer
# Reset the index to turn the grouped object back into a DataFrame
customer_revenue_df = merged_df.groupby('CustomerID')['Price'].sum().reset_index()

# Step 2: Calculate average order value per customer
# Group the data by 'CustomerID' and calculate the mean of 'Price' for each customer
# Reset the index to turn the grouped object back into a DataFrame
customer_avg_order_df = merged_df.groupby('CustomerID')['Price'].mean().reset_index()

# Step 3: Merge the total revenue and average order value dataframes
# Merge the two DataFrames on 'CustomerID' to get a comprehensive overview
customer_lifetime_value_df = customer_revenue_df.merge(customer_avg_order_df, on='CustomerID')

# Rename the columns for better readability
# 'TotalRevenue' represents the total revenue per customer
# 'AvgOrderValue' represents the average order value per customer
customer_lifetime_value_df.columns = ['CustomerID', 'TotalRevenue', 'AvgOrderValue']

# Display the final DataFrame
customer_lifetime_value_df

Total and average revenue per customer.

Explanation

  1. Grouping and Aggregation:

    1. The code utilizes the groupby() function to group the data by the CustomerID column.

    2. Two metrics are calculated for each customer:

      1. Total Revenue (sum()): The total amount spent by each customer.

      2. Average Order Value (mean()): The average amount spent per order by each customer.

  2. Resetting Index:

    1. The reset_index() function is used to convert the resulting grouped object back into a regular DataFrame which includes the CustomerID as a normal column.

  3. Merging Data:

    1. The merge() function combines the two DataFrames, customer_revenue_df and customer_avg_order_df, on the CustomerID. This creates a new DataFrame with both the total revenue and the average order value for each customer.

  4. Renaming Columns:

    1. Column names are updated to be more descriptive, making it clear what each column represents.

  5. Final Output:

    1. The resulting customer_lifetime_value_df DataFrame gives a comprehensive view of each customer's total revenue and the average order value, which can be useful for further analysis or reporting.
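
As a side note, the two groupby operations above can also be combined into a single named aggregation. The snippet below is a minimal sketch, assuming the same merged_df and Price column as above; it should yield the same three columns in one pass.

# Compute both metrics in one pass with named aggregation on the 'Price' column
customer_lifetime_value_df = (
    merged_df.groupby('CustomerID')['Price']
    .agg(TotalRevenue='sum', AvgOrderValue='mean')
    .reset_index()
)

# Display the result, equivalent to the step-by-step version above
customer_lifetime_value_df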

Plotting the histogram of the Average Order Value (AvgOrderValue)

Monitoring the Average Order Value (AvgOrderValue) is an important KPI in the e-commerce and retail business. It indicates, on average, how much customers spend per order. This metric is very useful for analyzing customer buying behavior, so that appropriate adjustments can be made to a firm’s marketing and sales strategies.

Creating a histogram of the AvgOrderValue column produces a chart showing how frequently different average order values occur. This visualization helps in:

  1. Understanding Distribution: It shows how widely customers differ in their spending patterns, as indicated by AvgOrderValue.

  2. Identifying Trends: A histogram makes it easy to spot the most frequent AvgOrderValue range, or to see whether many low-value or high-value orders are placed; all of this can inform marketing and sales decisions.

  3. Strategic Insights: The visualized distribution can support decisions about promotional discounts, upselling and cross-selling, pricing strategies, and customized marketing approaches aimed at improving customer value.

# The kind='hist' specifies that we want a histogram, and bins=20 specifies the number of bins in the histogram.
# The title argument provides a title to the plot.
customer_lifetime_value_df['AvgOrderValue'].plot(kind='hist', bins=20, title='AvgOrderValue')

# plt.gca() gets the current Axes instance, which we use to modify the plot's visual components.
# The spines are the lines that frame the data area of the plot.
# Hiding the 'top' and 'right' spines gives the plot a cleaner look.
plt.gca().spines[['top', 'right']].set_visible(False)

Histogram of the Average Order Value (AvgOrderValue).
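
If you prefer to stay within Seaborn, a roughly equivalent plot can be produced with sns.histplot, which can also overlay a kernel density estimate. This is a minimal sketch, assuming the same customer_lifetime_value_df built above.

# Create the same histogram with Seaborn; kde=True overlays a smoothed density curve
plt.figure(figsize=(10, 6))
sns.histplot(data=customer_lifetime_value_df, x='AvgOrderValue', bins=20, kde=True)
plt.title('AvgOrderValue')
plt.show()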

Analyze the Distribution of Order Quantities

When examining order quantities, a useful visualization is the box plot. A box plot (also called a box-and-whisker plot) is a standardized way of displaying a data distribution that makes it easy to see the spread and skewness of a dataset.

A boxplot is constructed from a five-number summary (which you can also compute directly in pandas, as shown in the sketch after this list):

  1. Minimum: The smallest value in the dataset that is not considered an outlier.

  2. First Quartile (Q1): The median of the lower half of the dataset. It marks the 25th percentile, meaning 25% of the values lie below it.

  3. Median: The middle value when the data is sorted in increasing or decreasing order; it splits the dataset into two equal halves and corresponds to the 50th percentile.

  4. Third Quartile (Q3): The median of the upper half of the dataset. It marks the 75th percentile, meaning 75% of the values lie below it.

  5. Maximum: The largest value in the dataset that is not considered an outlier.
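
To connect these definitions to code, here is a minimal sketch, assuming the same orders_df used in the plot below; quantile() returns the minimum, Q1, median, Q3 and maximum in one call.

# Five-number summary of the order quantities
# Note: unlike the boxplot whiskers, the 0 and 1 quantiles here include any outliers.
orders_df['Quantity'].quantile([0, 0.25, 0.5, 0.75, 1])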

# Create a new figure with a specific size (width=10, height=6)
plt.figure(figsize=(10, 6))

# Generate a boxplot for the 'Quantity' column in the orders_df DataFrame
# The boxplot will help visualize the distribution, central tendency, and spread of the 'Quantity' data
sns.boxplot(y='Quantity', data=orders_df)

# Set the title of the plot to 'Distribution of Order Quantities'
plt.title('Distribution of Order Quantities')

# Label the y-axis; the x-axis carries no variable in this single-column boxplot
plt.ylabel('Order Quantity')

# Display the plot
plt.show()

Distribution of Order Quantities.
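
By default, Seaborn draws the whiskers at 1.5 times the interquartile range (IQR) beyond Q1 and Q3, and any point outside that range is plotted individually as an outlier. The sketch below applies the same rule directly in pandas, again assuming orders_df.

# Compute the quartiles and the interquartile range of the order quantities
q1 = orders_df['Quantity'].quantile(0.25)
q3 = orders_df['Quantity'].quantile(0.75)
iqr = q3 - q1

# Keep only the orders whose quantity falls outside the 1.5 * IQR whiskers
outliers_df = orders_df[
    (orders_df['Quantity'] < q1 - 1.5 * iqr) | (orders_df['Quantity'] > q3 + 1.5 * iqr)
]

# Display the potential outliers
outliers_df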

Summarizing Sales Data

# Group the merged DataFrame by 'CustomerID' and calculate the sum of the 'Total' column for each customer
# The result is a new DataFrame with 'CustomerID' and their respective total revenue
customer_revenue = merged_df.groupby('CustomerID')['Total'].sum().reset_index()

# Merge the customer revenue summary DataFrame with the original customers DataFrame
# This brings additional customer details (e.g., name, contact info) into the resulting DataFrame
# The merge is performed on the 'CustomerID' column; 'how='inner'' ensures only customers present in both DataFrames are included
customer_revenue_df = pd.merge(customer_revenue, customers_df, on='CustomerID', how='inner')

# Sort the combined DataFrame by the 'Total' column in descending order
# This means the customers with the highest revenue will be at the top of the DataFrame
customer_revenue_df.sort_values(by=['Total'], ascending=False)

Combined DataFrame sorted by the 'Total' column in descending order.
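
Because the source data comes from Excel, you may also want to write this summary back to a spreadsheet. The line below is a minimal sketch with a hypothetical output file name; it assumes an Excel writer engine such as openpyxl is installed.

# Sort so the highest-revenue customers appear first, then export without the DataFrame index
customer_revenue_df.sort_values(by='Total', ascending=False).to_excel('customer_revenue_summary.xlsx', index=False)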

Top 10 customers based on their total expenses

# Group by 'CustomerID' and sum the 'Total' expenses for each customer
total_expenses_df = merged_df.groupby('CustomerID')['Total'].sum().reset_index() 

# Sort the resulting DataFrame based on the 'Total' expenses in descending order
total_expenses_df = total_expenses_df.sort_values(by=['Total'], ascending=False)

# Select the top 10 customers based on their total expenses
top10_customers_by_total_expenses = total_expenses_df[0:10]['CustomerID']

# Filter the original customers DataFrame to include only the top 10 customers by total expenses
top10_customers_by_total_expenses_df = customers_df[customers_df['CustomerID'].isin(top10_customers_by_total_expenses)]

# Display the DataFrame containing information of the top 10 customers by total expenses
top10_customers_by_total_expenses_df

Top 10 customers based on their total expenses.
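
As a shorter alternative, pandas' nlargest can replace the sort-and-slice steps. This is a minimal sketch, assuming the same merged_df and customers_df; it should select the same ten customers.

# Sum the 'Total' per customer, keep the 10 largest sums, and take their CustomerIDs from the index
top10_ids = merged_df.groupby('CustomerID')['Total'].sum().nlargest(10).index

# Filter the customers DataFrame down to those IDs
customers_df[customers_df['CustomerID'].isin(top10_ids)]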

Orders from the top 10 customers based on their total expenses

# Filter the DataFrame `orders_df` to include only the rows where the 'CustomerID' 
# is present in the list `top10_customers_by_total_expenses`.
# The .isin() method checks if each 'CustomerID' in `orders_df` exists 
# within the `top10_customers_by_total_expenses` list and returns a boolean Series.
orders_from_top10_customers_by_total_expenses_df = orders_df[orders_df['CustomerID'].isin(top10_customers_by_total_expenses)]

# Display the resulting DataFrame which contains orders from the top 10 customers 
# based on their total expenses.
orders_from_top10_customers_by_total_expenses_df

Orders from the top 10 customers based on their total expenses.
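
To summarize what these customers actually buy, one option is to count their order lines per product. This is a minimal sketch, assuming the same merged_df and the top10_customers_by_total_expenses Series defined above.

# Keep only the rows belonging to the top 10 customers
top10_merged_df = merged_df[merged_df['CustomerID'].isin(top10_customers_by_total_expenses)]

# Count how many order lines each product appears in for this group, most popular first
top10_merged_df.groupby('Product').size().sort_values(ascending=False).head(10)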

Top 10 best-selling products

By identifying which products have higher sales, you can prioritize your efforts more effectively. Whether it's marketing, inventory management, or product development, focusing on top-selling items ensures that your time and resources are spent on what's most impactful.

# Group the merged DataFrame by the 'Product' column and aggregate data.
# The new 'total' column counts the number of order lines for each product.
product_revenue_df = merged_df.groupby('Product').agg(
    total=pd.NamedAgg(column="Product", aggfunc="count")
).sort_values(by=['total'], ascending=False).reset_index()

# Select the top 10 products with the most sales based on the 'total' column.
top10_products_with_more_sales = product_revenue_df[0:10]

# Output the result: a DataFrame containing the top 10 products by sales count.
top10_products_with_more_sales

Top 10 best-selling products.
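
The same ranking can be obtained more compactly with value_counts, since we are simply counting how often each product appears. This is a minimal sketch, assuming the same merged_df.

# Count product occurrences (sorted descending by default) and reshape into the same two-column layout
merged_df['Product'].value_counts().head(10).rename_axis('Product').reset_index(name='total')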

Plot of the top 10 best-selling products

# Set the size of the plot figure (width, height) in inches
plt.figure(figsize=(10, 6))

# Create the bar plot using Seaborn
# data: the DataFrame containing the data to plot
# x: the column name in the DataFrame to use for the x-axis
# y: the column name in the DataFrame to use for the y-axis
# hue: the column name to group data by different colors for each unique value
# palette: the color palette to use for the bars
sns.barplot(
    data=top10_products_with_more_sales, 
    x='Product', 
    y='total', 
    hue='Product', 
    palette='viridis'
)

# Set the x-axis label to 'Product'
plt.xlabel('Product')

# Set the y-axis label to 'Sales'
plt.ylabel('Sales')

# Set the title of the plot
plt.title('Top 10 Best-Selling Products')

# Rotate the x-axis labels by 45 degrees for better readability
plt.xticks(rotation=45)

# Display the plot on the screen
plt.show()
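
If you want to reuse the chart in a report or presentation, you can save it to an image file. This is a minimal sketch with a hypothetical file name; note that savefig must be called before plt.show(), so place it just above that line in the block above.

# Save the current figure to disk; bbox_inches='tight' trims surrounding whitespace
plt.savefig('top10_products.png', dpi=150, bbox_inches='tight')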

Conclusion

Here we highlighted the significance of utilizing histograms and boxplots to visualize data distributions. Additionally, this article delved into identifying the top customers and products.

By harnessing these insights, you should be better equipped to tackle tasks related to sales performance. Leveraging the collected data for analysis will make it easier to identify trends, gauge customer value, and evaluate product performance. This groundwork is essential for refining marketing strategies, boosting sales tactics, and enhancing organizational productivity.

Effectively using these tools and techniques not only enables you to analyze historical data but also primes you and your business for future analysis, ensuring continued relevance. Continuously analyze and process your datasets, and feel free to explore and implement various functions from Python's data analysis libraries.

Feel free to reply to this newsletter with any questions or topics you'd like us to cover in the future.

If you liked this newsletter, don't forget to subscribe to receive regular updates. Share with your friends and colleagues interested in Python and let's grow together in our community of programmers!

Remember, the key to mastery is practice and persistence. Happy coding! Until the next edition, keep programming! 👨‍💻

InfinitePy Newsletter - Your source for Python learning and inspiration.