Optimizing performance in PySpark with Parquet files - Part I
Optimizing Big Data Processing: An In-Depth Look at Parquet Encoding Methods.
🕒 Estimated reading time: 7 minutes
Parquet is a columnar storage file format optimized for use with big data processing frameworks like Apache Spark. With its efficient storage capabilities, Parquet plays a crucial role in improving performance, especially when dealing with large data sets.
One of the standout features of Parquet files is the use of advanced encoding techniques such as Bit Packing, Dictionary Encoding, and Run-Length Encoding (RLE), together with compression algorithms, all of which are essential for boosting performance and reducing storage space.
In this article and the next, we will explore these concepts in depth. But before we delve into the encoding techniques, let's briefly look at row-based storage and then at column-based storage.
Row-based storage
Row-based storage (also known as row-oriented storage) organizes data by rows. Each row stores a complete record, and each record includes all fields (or columns) of that record. This format is typical in traditional relational databases such as MySQL and PostgreSQL.
Rows format
In row-based storage, each record (or row) is stored together, so when you read or write a record, you access all columns in that row simultaneously.
Advantages of Row-Based Storage
Efficient for transactional operations: Row-based storage is optimized for transaction-oriented operations where you frequently need to access and modify entire records, such as entering new customer details or updating an order.
Simple data access: accessing a complete record is straightforward, which makes operations such as adding or updating rows easy to handle.
Disadvantages of Row-Based Storage
Inefficient for analytical queries: when running analytical queries that require only specific columns, row-based storage can be inefficient because it reads entire rows, including unnecessary data.
Storage space: Row-based storage can use more space if not optimized correctly, as entire rows are stored together, including any unused or redundant data fields.
In short, this format is intuitive and works well when frequently accessing entire records. However, it can be inefficient when dealing with analytics, where you often only need specific columns from a large data set.
For example, imagine a table with 50 columns and millions of rows. If you are only interested in analyzing 3 of those columns, a row-based format would still require reading all 50 columns for every row.
Examples of row-based file formats: CSV, JSON, Avro, etc.
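To make the difference concrete, here is a minimal PySpark sketch (the file names and column names are hypothetical): reading three columns from a CSV file still forces Spark to scan every row in full, while reading the same three columns from a Parquet file touches only those columns on disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-vs-column").getOrCreate()

# Reading from a hypothetical CSV file (row-based): even though only three
# columns are needed, Spark still has to scan every row in full.
csv_df = spark.read.csv("sales.csv", header=True, inferSchema=True)
csv_df.select("customer_id", "order_date", "total").show()

# Reading the same hypothetical data from Parquet (columnar): Spark reads
# only the three requested columns and skips the rest entirely.
parquet_df = spark.read.parquet("sales.parquet")
parquet_df.select("customer_id", "order_date", "total").show()
```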
Column-Based Storage
Unlike row-based formats, columnar storage saves data column by column, which improves performance for analytical queries by allowing you to read only relevant data.
Column format
Storing data in a column-based layout also has some disadvantages. A big challenge is that writing or updating a record touches many separate column segments, leading to numerous input/output operations. This can significantly slow down the process, especially when dealing with large volumes of data.
Additionally, if a query requires information from multiple columns, the system must gather records from these different columns and combine them. The more columns your query involves, the more complex and demanding this task becomes.
The upside is that a hybrid storage format combines the best aspects of the traditional row-based and columnar approaches. This hybrid method can offer improved performance and flexibility, making it a practical choice for handling a variety of data storage needs.
Parquet Format
Overview
Apache Parquet is a columnar storage file format specifically optimized for use with big data processing frameworks such as Apache Hadoop and Spark. Apache Parquet was released by the Apache Software Foundation in 2013. It gained traction relatively quickly due to its efficiency and compatibility with big data processing needs.
Parquet development was initiated by engineers at Twitter and Cloudera, with key contributions from Julien Le Dem of Twitter and Todd Lipcon of Cloudera.
Parquet was created to be compatible with multiple data processing tools and languages, predominantly within the Hadoop ecosystem but also beyond. It supports the Avro serialization system, making it flexible for use in a variety of systems and scenarios.
The format also supports schema evolution, making it easier to handle changes in data structure over time without breaking compatibility or requiring extensive data migration efforts.
Due to its columnar nature, Parquet supports features such as predicate pushdown, allowing query engines to quickly skip irrelevant blocks of data.
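As a rough illustration, consider the hedged PySpark sketch below (the dataset path, column names, and filter are hypothetical): a filter applied to a Parquet scan is pushed down to the reader, which can use per-row-group statistics to skip blocks that cannot contain matching rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("predicate-pushdown").getOrCreate()

# Hypothetical Parquet dataset with an "event_date" column. The filter can
# be pushed down to the Parquet reader, which uses the min/max statistics
# stored per row group to skip row groups with no matching rows.
df = spark.read.parquet("events.parquet")
recent = df.filter(df.event_date >= "2024-01-01").select("event_id", "event_date")

# The physical plan lists the condition under PushedFilters for the Parquet scan.
recent.explain()
```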
With these features, Parquet has become a standard for storing and processing large and complex data sets, especially in industries and fields where big data tools are prevalent. Its popularity continues to grow due to its efficiency and adaptability across multiple environments, including cloud-based data lake architectures.
About the format
Parquet files organize data into "row groups", each representing a horizontal segment of the overall data set. Think of it as dividing the data into sections by rows, which helps structure the data efficiently. Within each row group, the data is then split by column into what are known as "column chunks". This step acts as a vertical partitioning of the data.
It is important to note that all column chunks within a row group are stored together sequentially on disk, ensuring that they remain a cohesive unit.
Parquet file.
You might assume, as I initially did, that Parquet is just a columnar storage format. However, this is not entirely true. Parquet employs a hybrid model for data storage.
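One way to see this hybrid layout for yourself is to inspect a Parquet file's footer metadata. The sketch below uses PyArrow (the file name is hypothetical) to list the row groups and, within each one, the column chunks with their encodings and compressed sizes.

```python
import pyarrow.parquet as pq

# Open only the footer metadata of a hypothetical file (no data is read yet).
meta = pq.ParquetFile("sales.parquet").metadata
print("row groups:", meta.num_row_groups)

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    print(f"row group {rg}: {row_group.num_rows} rows")
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        print(f"  column chunk '{chunk.path_in_schema}': "
              f"encodings={chunk.encodings}, "
              f"compressed={chunk.total_compressed_size} bytes")
```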
Parquet Encoding
Parquet encoding makes the data smaller, so less I/O is performed and less memory is used. Although extra CPU cost is required to decode Parquet data, trading CPU cost for I/O cost is still a good deal.
For modern computers, it is much cheaper to increase computing power than to increase I/O performance. Especially for big data workloads, I/O is the bottleneck of query performance.
Parquet encoding algorithms
When dealing with data storage in the Parquet format, it is important to understand the various encoding algorithms used to optimize data storage and retrieval. Let's look at some important encoding techniques through simple examples, using a small dataset as the running example.
Simple dataset as an example.
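As a hypothetical stand-in for the dataset in the figure above (the real values are not reproduced here), the snippets that follow assume a small table with a low-cardinality "Product" column and a repetitive "Quantity" column, the two columns referenced in the examples below.

```python
# Hypothetical stand-in for the example dataset: a small table with a
# low-cardinality "Product" column and a repetitive "Quantity" column.
rows = [
    ("Apple", 2), ("Apple", 2), ("Apple", 2),
    ("Banana", 5), ("Banana", 5),
    ("Cherry", 2), ("Cherry", 2),
]
products = [product for product, _ in rows]
quantities = [quantity for _, quantity in rows]
```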
Dictionary encoding
Think of dictionary encoding as a unique lookup table for data in a column. Imagine you have a column for "Product" names. With dictionary encoding, each distinct product name is stored only once in a separate list called the "Product" dictionary, and each is assigned a unique index. Instead of storing the full product name repeatedly, the column now contains only the index values.
Dictionary encoding.
Performance: the fewer unique values (lower cardinality) a column has, the better this method compresses. For example, a "Gender" column, which typically has very few unique values, compresses better than a "Zip Code" column.
Query consistency: regardless of the original data type (e.g. text, numbers), query performance remains consistent once the column is encoded, because the index values are uniformly sized integers.
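As a toy illustration of the idea (not Parquet's actual implementation), a dictionary encoder only needs to keep the list of distinct values plus the per-row indices:

```python
def dictionary_encode(values):
    """Toy sketch of dictionary encoding: store each distinct value once
    and replace the column with integer indices into that dictionary."""
    dictionary = []   # distinct values, in order of first appearance
    position = {}     # value -> its index in the dictionary
    indices = []      # the encoded column
    for value in values:
        if value not in position:
            position[value] = len(dictionary)
            dictionary.append(value)
        indices.append(position[value])
    return dictionary, indices

products = ["Apple", "Apple", "Banana", "Apple", "Cherry", "Banana"]
dictionary, indices = dictionary_encode(products)
print(dictionary)  # ['Apple', 'Banana', 'Cherry']
print(indices)     # [0, 0, 1, 0, 2, 1]
```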
Run-Length Encoding (RLE)
RLE is like an abbreviation for repeated data. For a column like "Quantity", where values may repeat sequentially, RLE compresses the data by storing the value once and counting how many times it repeats consecutively.
Run length encoding (RLE)
Efficiency with patterns: high repetition means better compression. For example, if multiple rows in "Quantity" share the same repeating number, RLE can drastically reduce the storage size.
Benefit of sorting: sorting the data first can greatly improve compression, as RLE thrives on consecutive identical values.
Limitations: columns with constantly changing values are not a good fit for RLE, which can even increase the size of the data.
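A minimal sketch of the idea (again, not Parquet's actual implementation) collapses consecutive repeats into (value, count) pairs:

```python
def run_length_encode(values):
    """Toy sketch of RLE: collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # same value as the previous run: extend it
        else:
            runs.append([value, 1])   # new value: start a new run
    return [tuple(run) for run in runs]

quantities = [2, 2, 2, 2, 5, 5, 3]
print(run_length_encode(quantities))  # [(2, 4), (5, 2), (3, 1)]
```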
Combining RLE and dictionary encoding
These two methods can work together. After using dictionary encoding to convert values to numeric indices, RLE can compress them further if there is a repeating pattern.
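Here is a small self-contained sketch of that combination, using the hypothetical "Product" column: first replace each name with its dictionary index, then run-length encode the indices.

```python
from itertools import groupby

# Dictionary step: replace each product name with its index in a small
# dictionary built in order of first appearance (Apple=0, Banana=1).
products = ["Apple", "Apple", "Apple", "Banana", "Banana", "Apple"]
dictionary = sorted(set(products), key=products.index)  # ['Apple', 'Banana']
indices = [dictionary.index(p) for p in products]       # [0, 0, 0, 1, 1, 0]

# RLE step: collapse consecutive repeated indices into (index, count) pairs.
rle = [(index, sum(1 for _ in group)) for index, group in groupby(indices)]
print(rle)  # [(0, 3), (1, 2), (0, 1)]
```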
Delta Encoding
Delta encoding is like a sequence of small steps rather than complete leaps. Instead of storing entire data values, it records the difference between each consecutive value. This is especially useful for data that changes slightly over time, such as time series data.
Delta Encoding.
Ideal for small variations: when values change incrementally, such as timestamps recorded in seconds, delta encoding significantly reduces storage needs. Only the small differences are saved, not the complete values.
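A toy sketch of the idea, using hypothetical Unix timestamps: the encoder stores the first value and then only the successive differences, and decoding simply accumulates them back.

```python
def delta_encode(values):
    """Toy sketch of delta encoding: keep the first value, then store only
    the difference between each value and the one before it."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]

def delta_decode(deltas):
    """Reverse the encoding by accumulating the stored differences."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values

timestamps = [1700000000, 1700000001, 1700000003, 1700000004]
encoded = delta_encode(timestamps)
print(encoded)                              # [1700000000, 1, 2, 1]
print(delta_decode(encoded) == timestamps)  # True
```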
Conclusion
In short, Parquet uses these encoding strategies to optimize the storage and processing of large data sets. Each technique has its strengths, depending on the nature of the data, and can significantly impact performance and efficiency in data operations.
While Parquet's columnar design inherently provides benefits such as improved performance and reduced storage, in future texts we will explore how these techniques combined with compression algorithms and partitioning features can take Parquet's efficiency to the next level.