Integrating Python Pandas with ChatGPT: A new frontier

Utilizing high-caliber Python libraries like Pandas and integrating them with powerful tools such as ChatGPT can substantially enhance productivity and streamline the process of extracting valuable insights from organizational data assets.

🕒 Estimated reading time: 7 minutes

Over the years, I have been teaching Python and specifically focusing on Pandas. Recently, I discovered an exciting new project called PandasAI. This innovative library allows you to write queries in plain English, which are then automatically translated into Pandas commands in Python. I have experimented with PandasAI and uncovered some interesting results.

PandasAI is great for data scientists, analysts, and engineers who’d like to engage with data less artificially. This is especially helpful if you are beginners and do not understand how SQL or Python works or if you are able to spend weeks doing data manipulation. The natural language querying efficiency is also added for experienced professionals can also find it useful.

All the examples are also explained here👨‍🔬, a corresponding Google Colab notebook to make your learning even more interactive.

What is PandasAI?

PandasAI is open source Python software for labeled data, which allows to query your data with natural language. Not only does it translate your questions into python pandas code but also it has several others features that make the data analysis more efficient and also more complete.

Key Features of PandasAI

  • Natural Language Querying: Enables you to make questions in simple words without high-level terminologies.

  • Data Visualization: Simple and effective tools that are useful in the generation of graphs and charts for the clarification of the data you are dealing with.

  • Data Cleansing: Is useful especially when dealing with datasets with missing values.

  • Feature Generation: Improves the quality of the data you enter by creating new variables.

  • Data Connectors: There is a connection to different data such as CSV, XLSX, PostgreSQL, MySQL, BigQuery, Databricks, Snowflake, etc.

How it Works

PandasAI uses a generative AI type that discerns the meaning of a user’s question and translates it into Python code and SQL queries. Your actual data then engages with this code and gives you the results that you want.

Project Work & Practice

The datasets and query examples presented here were originally featured in the article, "Working with Data in Python: From Basics to Advanced Techniques." If you are not yet comfortable with Python Pandas, I highly recommend reviewing that article.

# Running the following line will install the 'pandasai' library
!pip install pandasai
# Import the pandas library, commonly used for data manipulation and analysis
import pandas as pd

# Import the pyplot module from the matplotlib library, used for plotting and visualization
import matplotlib.pyplot as plt

# Import the Agent class from the pandasai module, which can interface with large language models (LLMs)
from pandasai import Agent

# Import the SmartDataframe class from the pandasai module, which adds enhanced data manipulation features
from pandasai import SmartDataframe

# Import the OpenAI class within the pandasai.llm.openai submodule, which allows interfacing with OpenAI's language models
from pandasai.llm.openai import OpenAI

Once you’ve completed the installation, we can begin our PandasAI experiment.

Read Data

# Read data from a CSV file available at the given URL and store it in a DataFrame called sales_df
# The read_csv function fetches the data and structures it into a format suitable for analysis
sales_df = pd.read_csv('https://infinitepy.s3.amazonaws.com/samples/sales_data.csv')

Inspect Data

# 'head()' is a built-in method in pandas DataFrame that returns the first n rows.
# By default, it returns the first 5 rows. You can specify a different number as an argument like 'head(10)' for the first 10 rows.
sales_df.head()

Inspect Data.

🤖 Creating a SmartDataframe

SmartDataframe is a class from the pandasai.llm.openai library, which is part of the Pandas AI ecosystem. This particular class leverages the capabilities of Large Language Models (LLMs), like OpenAI's GPT, to perform advanced data frame operations that typically require a deep understanding of the data context or might be too complex to implement using traditional methods.This allows you to perform complex operations and queries on your data frames using natural language commands.

In case you don't have an OpenAI API key yet, in this article, we will guide you through the process of generating your OpenAI API key for ChatGPT.

# Importing the 'userdata' module from google.colab, which allows you to access user-specific data.
from google.colab import userdata

# Fetching the OpenAI API key from the user's data stored within Google Colab.
api_token = userdata.get('OPENAI_API_KEY')

# Creating an instance of the OpenAI class by passing the API token.
# This instance will be used to interact with OpenAI's language model (LLM).
llm = OpenAI(api_token=api_token)

# Wrapping an existing DataFrame (sales_df) into a SmartDataframe object.
# SmartDataframe likely adds additional capabilities or configurations to the standard DataFrame.
# We pass the 'llm' instance as part of the configuration, possibly to enable natural language capabilities or other features powered by the OpenAI model.
sales_smart_df = SmartDataframe(sales_df, config={"llm": llm})

Now that we have everything set up, let's begin our comparison between the traditional way of working with Python Pandas 🐼 and the innovative approach of Natural Language Querying with PandasAI 🤖.

1️⃣ Remove duplicates and fill missing values

🐼 Traditional method
# Drop duplicate rows from the DataFrame 'sales_df' in place.
# This means any duplicate rows will be removed directly from the 'sales_df' object,
# and no copy of the DataFrame will be created.
sales_df.drop_duplicates(inplace=True)

# Fill any missing (NaN) values in the DataFrame 'sales_df' with 0.
# The 'inplace=True' parameter means that the DataFrame will be modified directly,
# without creating a copy.
sales_df.fillna(0, inplace=True)
sales_df
🤖 Natural Language Querying with PandasAI
sales_smart_df.chat("Remove duplicates and fill missing values")

2️⃣ Compute total sales per region

🐼 Traditional method
# Use groupby() to group the data by the 'Region' column
# Then, chain .sum() to calculate the total sales for each region
total_sales_region = sales_df.groupby('Region')['Sales'].sum()

# Print the result which shows the total sales per region
total_sales_region
🤖 Natural Language Querying with PandasAI
sales_smart_df.chat("Compute total sales per region")
Result

Total sales per region.

3️⃣ Find the top 5 products by sales

🐼 Traditional method
# Group the sales data by 'Product' and sum the 'Sales' for each product.
top_products_sales = sales_df.groupby('Product')['Sales'].sum()

# Sort the summed sales in descending order to get the best selling products at the top.
top_products_sales_sorted = top_products_sales.sort_values(ascending=False)

# Select the top 5 products with the highest sales.
top_5_products_sales_sorted = top_products_sales_sorted.head(5)

# Print the result to see the names and sales of the top 5 products.
top_5_products_sales_sorted
🤖 Natural Language Querying with PandasAI
sales_smart_df.chat("Find the top 5 products by sales")
Result

Top 5 products by sales.

4️⃣ Calculating the sales trend on a monthly basis for each region.

🐼 Traditional method
# Drop duplicates and forward fill missing values
sales_df.drop_duplicates(inplace=True)
sales_df.fillna(method='ffill', inplace=True)

# Define the correct chronological order for months
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Convert the 'Month' column to a categorical type with the specified order
sales_df['Month'] = pd.Categorical(sales_df['Month'], categories=month_order, ordered=True)

# Pivot table to calculate total sales per month for each region
monthly_sales = sales_df.pivot_table(index='Month', columns='Region', values='Sales', aggfunc='sum')

# Filter the DataFrame to include only data from January to June
filtered_monthly_sales_df = monthly_sales.loc['January':'June']

# Plotting the monthly sales trend for each region
fig, ax = plt.subplots(figsize=(12, 8))
filtered_monthly_sales_df.plot(kind='line', marker='o', ax=ax)

# Adding grid lines
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Additional plot settings
plt.title('Monthly Sales Trend per Region (January to June)')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend(title='Region')
plt.xticks(rotation=45)
plt.show()
🤖 Natural Language Querying with PandasAI
sales_smart_df.chat("Calculates the sales trend on a monthly basis for each region.")
Result

Monthly Sales Trend per Region.

5️⃣ Sales Share by Product

🐼 Traditional method
# Compute the sales share for each product
sales_share = sales_df.groupby('Product')['Sales'].sum() / sales_df['Sales'].sum() * 100

# Plotting the sales share as a pie chart
sales_share.plot(kind='pie', autopct='%1.1f%%', figsize=(8, 8))
plt.title('Sales Share by Product')
plt.ylabel('')  # Hide the y-label
plt.show()
🤖 Natural Language Querying with PandasAI
sales_smart_df.chat("Calculates the Sales Share by Product and create a pie chart")
Result

Sales Share by Product.

Conclusion

Combining Pandas with ChatGPT allows anyone to build efficient data processing solutions based on the language interface that simplifies machine learning computations. We have gone all the way from basic data loading with Pandas to complex data analysis with ChatGPT, thus addressing the novices as well as the experts among the learners. This guide will help you to effectively manage and get the best out of both kinds of technologies.

Integrating these tools in practice shows great promise. FAC members and developers can gain deeper insights and access more context-rich information from the work of data analysts more quickly. Continue experimenting with various datasets and prompts to unlock the full potential of these combined tools.

🔔 Subscribe to the InfinitePy Newsletter for more resources and a step-by-step approach to learning Python, and stay up to date with the latest trends and practical tips.

InfinitePy Newsletter - Your source for Python learning and inspiration.