Getting started with PySpark on Google Colab
Welcome to our journey into the world of PySpark! PySpark is the Python API for Apache Spark, the open source framework designed for distributed data processing.
🕒 Estimated reading time: 8 minutes
In this text, we'll walk through the core architecture of a Spark cluster, delve into the anatomy of a Spark application, and explore Spark's powerful structured APIs using DataFrames.
To facilitate learning, we will configure the Spark environment in Google Colab, providing a practical and efficient platform to perform our experiments and analysis. Let's discover together how PySpark can transform the way we deal with large volumes of data.
All examples are also explained in a corresponding Google Colab notebook 👨‍🔬, making your learning even more interactive.
Understanding Spark: Terminology and basic concepts
Normally, when you think of a car 🚗, you imagine a single vehicle sitting in your garage or parked at the office. This car 🚗 is perfectly suited for daily errands or commuting to work. However, there are some tasks that your car simply can't handle due to its limited power and capacity. For example, if you want to transport an entire rock band's gear 🎸 across the country, a single car won't be enough - it doesn't have enough space and the trip would be too much work.
In scenarios like these, a fleet of trucks 🚚🚚🚚 comes in handy. A fleet brings together the storage capabilities of many vehicles, allowing us to transport all items as if they were in a giant truck. But just having a fleet doesn't solve the problem; you need a well-coordinated system to manage logistics. Think of Spark as that sophisticated logistics tool, managing and orchestrating equipment transportation tasks across your entire fleet.
Apache Spark is a powerful distributed computing system designed for speed and ease of use. Unlike traditional batch processing systems, Spark offers in-memory processing capabilities, making it significantly faster for data analysis and machine learning tasks.
Main components of Spark
Think of a Spark application like a busy kitchen in a restaurant 🍽️, where the head chef 👨‍🍳 oversees the entire cooking process while multiple sous chefs 👨🏼‍🍳 perform specific tasks. In this analogy, the head chef 👨‍🍳 is the driver process (the driver), and the sous chefs 👨🏼‍🍳 are the executor processes (the executors).
The head chef 👨‍🍳 (the driver process) works in the kitchen and has three main responsibilities:
maintaining control over general kitchen operations,
responding to customer orders, and
planning, distributing and scheduling tasks for the sous chefs 👨🏼‍🍳.
Without the head chef 👨‍🍳, the kitchen would fall into chaos; likewise, the driver process is the brain and central command of a Spark application, crucial for maintaining order and assigning tasks throughout the application's lifecycle.
In this well-organized kitchen, the sous chefs 👨🏼‍🍳 (the executors) have two main functions:
they carefully execute the recipes assigned by the head chef, and
they keep the head chef 👨‍🍳 informed about the status of their cooking tasks.
The last vital component of this culinary operation is the restaurant manager 🧑🏻‍💼 (the cluster manager). The restaurant manager oversees the entire restaurant (the physical machines) and allocates kitchen space and resources to different chefs (Spark applications).
As a brief review, the key points to remember are:
Spark has a cluster manager (the restaurant manager 🧑🏻‍💼) that keeps track of the available resources.
The driver process (the head chef 👨‍🍳) executes instructions from our main program on the executors (the sous chefs 👨🏼‍🍳) to complete tasks.
While the executors predominantly run Spark code, the driver can operate in multiple languages through Spark's language APIs, just as a head chef can communicate recipes in different cooking styles. In Google Colab, all of these roles run on a single machine, as the sketch below shows.
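Concretely, Colab gives us no separate cluster: the driver and the executors share one local JVM on the notebook's machine. Here is a minimal sketch of what that looks like when creating a session (runnable only after the environment setup in the next section); the master setting local[*] asks Spark to use every core of that single machine:

from pyspark.sql import SparkSession

# In local mode the driver and the executors run as threads inside one JVM,
# using all cores available on the machine - the typical arrangement in Colab.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Local mode sketch") \
    .getOrCreate()

print(spark.sparkContext.master)               # local[*]
print(spark.sparkContext.defaultParallelism)   # number of cores Spark will use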
🛠️ Environment setup
Here we walk through the downloads required to install and configure Apache Spark on Google Colab. This step is essential to ensure that all dependencies are acquired correctly, providing a functional and optimized environment for running tasks and analyses with Apache Spark.
Make sure you follow each step carefully, ensuring a smooth and efficient setup of Spark in the Colab environment.
# The variable spark_version contains the value of the Spark version to be used.
# To find out available versions visit https://dlcdn.apache.org/spark
spark_version = "3.5.2"

# OpenJDK installation
# Apache Spark is written in Scala, which runs on the Java Virtual Machine (JVM).
# OpenJDK provides an open-source implementation of the JVM, ensuring compatibility and support for running Spark.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download spark-{spark_version}-bin-hadoop3, a specific distribution of Apache Spark
!wget -q https://dlcdn.apache.org/spark/spark-{spark_version}/spark-{spark_version}-bin-hadoop3.tgz

# Unpack the contents of the spark-{spark_version}-bin-hadoop3.tgz file into the file system
!tar xf spark-{spark_version}-bin-hadoop3.tgz

# Configuration of environment variables
# JAVA_HOME points to the Java installation directory on the machine and is essential for Spark.
# SPARK_HOME points to the Apache Spark installation directory; Spark uses it to locate its own components and libraries.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/spark-{spark_version}-bin-hadoop3"

# The findspark library makes it easier to configure and launch Apache Spark in local environments,
# such as development machines.
!pip install -q findspark

# findspark.init() adds the Spark installation to sys.path so that pyspark can be imported.
import findspark
findspark.init()
🛤️ Step-by-step description of installation and configuration
Initially, we define the version of Spark to be installed in the environment using the variable spark_version.
spark_version = "3.5.2"
Spark is built on Scala and requires the Java Virtual Machine (JVM) to run. Next, we install OpenJDK 8 with the command:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
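If you want to confirm that the JDK is in place, an optional check (assuming Colab's default installation path for OpenJDK packages) is to list the installed JVMs:

# Optional check: after the apt-get step, a java-8-openjdk-amd64 directory should appear here.
!ls /usr/lib/jvm/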
We then download a compressed archive containing a pre-built distribution of Apache Spark, using:
!wget -q https://dlcdn.apache.org/spark/spark-{spark_version}/spark-{spark_version}-bin-hadoop3.tgz
To install and start using this version of Spark, we need to unpack the archive using the command below. It extracts the necessary files into a directory, from which we can then configure and run Spark.
!tar xf spark-{spark_version}-bin-hadoop3.tgz
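As an optional check, you can list the extracted directory; it should contain subdirectories such as bin, jars and python (this uses the same {spark_version} interpolation as the wget command above):

# Optional check: the extracted Spark distribution should contain bin/, jars/, python/, etc.
!ls spark-{spark_version}-bin-hadoop3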
The configuration of the JAVA_HOME and SPARK_HOME environment variables is carried out with the code:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/spark-{spark_version}-bin-hadoop3"
The JAVA_HOME environment variable points to the Java installation directory on the machine. It is essential for Spark to know where Java is installed so that it can use the JVM for execution. Without this configuration, Spark may not be able to find Java correctly.
SPARK_HOME points to the Apache Spark installation directory. It is used by Spark to locate its own components and libraries.
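As a quick, optional sanity check (using the same paths set above), you can confirm that both directories actually exist before starting Spark:

import os

# Both checks should print True; if not, revisit the installation steps above.
print(os.path.isdir(os.environ["JAVA_HOME"]))
print(os.path.isdir(os.environ["SPARK_HOME"]))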
Now, with all dependencies installed in Colab, we make Spark visible to Python so that PySpark can run. We use the findspark library to simplify the configuration of Apache Spark in local environments, installing it with
!pip install -q findspark
and starting with:
import findspark
findspark.init()
This last call adds the Spark installation's Python libraries to Python's sys.path, allowing you to import and use PySpark as a regular Python library in local development environments without having to adjust PYTHONPATH manually.
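You can see this effect directly: after findspark.init(), entries pointing into the Spark installation appear on sys.path (a small illustrative check):

import sys

# After findspark.init(), $SPARK_HOME/python and its py4j dependency are on sys.path,
# which is what makes `import pyspark` work.
print([p for p in sys.path if "spark" in p.lower()])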
👨‍🔬 Testing the installation
Now, we can test our installation with a simple example of DataFrame manipulation in PySpark. This assumes you have already installed and configured Spark as shown above.
from pyspark.sql import SparkSession

# Initializing a Spark session
# SparkSession is the main entry point to Spark SQL functionality.
# We configure the Spark application name and create or get an existing session.
spark = SparkSession.builder \
    .appName("PySpark DataFrame Example") \
    .getOrCreate()

# Example data
# Here we create a list of tuples with the example data.
data = [
    ("John", 28, "São Paulo"),
    ("Mary", 22, "Rio de Janeiro"),
    ("Peter", 35, "Belo Horizonte"),
    ("Ana", 23, "Curitiba")
]

# Defining the schema (column names)
# We define a list of column names for the DataFrame.
columns = ["Name", "Age", "City"]

# Creating the DataFrame
# We use SparkSession's createDataFrame function to transform the data and columns into a DataFrame.
df = spark.createDataFrame(data, columns)

# Displaying the DataFrame
# We use the show() method to display the first rows of the DataFrame.
print("Initial DataFrame")
df.show()

# Filtering data
# We display records where the age is greater than or equal to 25.
# The filter() method is used with a Boolean expression to filter the data.
print("Records aged 25 or over")
df_filtered = df.filter(df["Age"] >= 25)
df_filtered.show()

# Selecting columns
# We select only the "Name" and "City" columns from the DataFrame.
# The select() method is used for this operation.
print("Selection of Name and City columns only")
df_selected = df.select("Name", "City")
df_selected.show()

# Grouping data
# We group the data by the "City" column and count the number of records in each group.
# The groupBy() method is used to group, followed by the count() method to count the records in each group.
print("Counting total records by city")
df_grouped = df.groupBy("City").count()
df_grouped.show()

# Saving the DataFrame
# We save the DataFrame to a CSV file at the specified location.
# The write.csv() method is used for this operation, with header=True to include the header.
df.write.csv("/content/output.csv", header=True)
Running the above code will produce the following output.
Initial DataFrame
+-----+---+--------------+
| Name|Age| City|
+-----+---+--------------+
| John| 28|     São Paulo|
| Mary| 22|Rio de Janeiro|
|Peter| 35|Belo Horizonte|
| Ana| 23| Curitiba|
+-----+---+--------------+
Records aged 25 or over
+-----+---+--------------+
| Name|Age| City|
+-----+---+--------------+
| John| 28|     São Paulo|
|Peter| 35|Belo Horizonte|
+-----+---+--------------+
Selection of Name and City columns only
+-----+--------------+
| Name| City|
+-----+--------------+
| John|     São Paulo|
| Mary|Rio de Janeiro|
|Peter|Belo Horizonte|
| Ana| Curitiba|
+-----+--------------+
Counting total records by city
+--------------+-----+
| City|count|
+--------------+-----+
|     São Paulo|    1|
|Rio de Janeiro| 1|
| Curitiba| 1|
|Belo Horizonte| 1|
+--------------+-----+
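One detail worth noting: df.write.csv() writes a directory of partitioned files rather than a single CSV file. As a small follow-up sketch (using the same path as above), you can read that directory back into a DataFrame to verify the result, and then stop the session:

# Read the CSV directory written above back into a DataFrame.
# header=True treats the first row of each part file as column names;
# inferSchema=True asks Spark to infer column types (Age becomes an integer).
df_read = spark.read.csv("/content/output.csv", header=True, inferSchema=True)
df_read.show()

# When you are finished, it is good practice to release resources by stopping the session.
spark.stop()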
Conclusion
Congratulations! You've taken your first steps into the world of PySpark. Throughout this text, we explored the Apache Spark architecture, configured the development environment in Colab, and performed essential operations on DataFrames using PySpark.
We hope you have gained a solid understanding of how Spark works and how to use PySpark for data manipulation and analysis.
This knowledge is just the beginning. In upcoming articles we will explore how PySpark offers a wide range of functionality to process large data sets quickly and efficiently.