Introducing PyMongo: Your Gateway to MongoDB in Python Development

Simplify MongoDB Operations with Python: A Beginner's Guide to Inserting, Querying, and Aggregating Data Efficiently for Your Data-Driven Projects

🕒 Estimated reading time: 16 minutes

In the era of big data, having a powerful and flexible way to manage your database operations is indispensable. MongoDB is a highly scalable NoSQL database known for its speed and ease of use. Coupled with the Python programming language, it gives you a potent combination for dynamic, data-driven projects. PyMongo is the go-to library for connecting MongoDB with Python. In this guide we walk through basic, intermediate, and advanced features of PyMongo, with practical examples along the way.

Now, let's dive into the world of MongoDB in Python with some practical examples, which are also available on Google Colab 👨‍🔬.

Introduction to PyMongo

PyMongo is the official Python driver for MongoDB, allowing you to interact seamlessly with MongoDB databases. Whether you're working on a simple script or a complex data-driven application, PyMongo provides a comprehensive toolkit for all your database operations.

Before diving into the various operations, you'll need to install PyMongo:

pip install pymongo

Basic Operations

Connecting to MongoDB

First, let's connect to a MongoDB server. In this example, we are using a free MongoDB Atlas instance provided by InfinitePy rather than a local server. Note that the username and password are hard-coded here only to keep the example short. For security reasons, real applications should manage these credentials with environment variables or a secret manager.

# Import the MongoClient class from the pymongo library.
# This will allow us to connect to a MongoDB database.
from pymongo import MongoClient

# Define the username for the MongoDB database. This should be a valid user with appropriate permissions.
username = 'infinitepy'

# Define the password for the MongoDB database. Make sure that this password is correct.
# Never expose or hard-code passwords in your code for security reasons. Use environment variables or secret managers in real applications.
password = 'ilovepyth0n'

# Create a MongoClient object. This object represents a client connection to the MongoDB database.
# We use an f-string to format the connection string with the username and password.
# The connection URI includes the username, password, and cluster information.
client = MongoClient(f'mongodb+srv://{username}:{password}@infinitepy-cluster.s05fljv.mongodb.net/?retryWrites=true&w=majority&appName=infinitepy-cluster')

Notes on connection URI parameters
  • mongodb+srv: Selects the DNS seed list connection format, in which the driver discovers the cluster's hosts through a DNS SRV record.

  • {username}:{password}: Replaces placeholders with the actual username and password. ⚠️ Ensure you keep your credentials safe and secure.

  • ?retryWrites=true: Ensures that write operations are retried once if they fail due to a network error.

  • &w=majority: Ensures that write operations report success only after a majority of the cluster nodes have acknowledged them.

  • &appName=infinitepy-cluster: Specifies a custom name for this connection. Useful for monitoring purposes.
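The warning about hard-coded credentials can be put into practice with a small helper. The sketch below is one possible approach, not the setup used in this newsletter: the environment variable names MONGO_USERNAME and MONGO_PASSWORD, the function name build_mongo_uri, and the host cluster.example.mongodb.net are all hypothetical. It reads the credentials from the environment and percent-encodes them with urllib.parse.quote_plus, so that characters such as @ or / in a password do not break the URI.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(host, env=None):
    """Build a mongodb+srv URI from credentials stored in environment variables.

    MONGO_USERNAME and MONGO_PASSWORD are hypothetical variable names chosen
    for this sketch; use whatever names your deployment defines.
    """
    env = os.environ if env is None else env
    # Percent-encode the credentials so reserved characters stay valid in a URI.
    username = quote_plus(env["MONGO_USERNAME"])
    password = quote_plus(env["MONGO_PASSWORD"])
    return f"mongodb+srv://{username}:{password}@{host}/?retryWrites=true&w=majority"


# Demonstration with an in-memory mapping instead of the real environment.
example_env = {"MONGO_USERNAME": "app_user", "MONGO_PASSWORD": "p@ss/word"}
print(build_mongo_uri("cluster.example.mongodb.net", example_env))
```

The resulting string can be passed to MongoClient in place of the f-string shown earlier.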

# Using the client instance, fetch the list of all database names
# list_database_names() returns a list of strings, where each string is the name of a database.
for db_name in client.list_database_names():
    # Print each database name to the console
    print(db_name)

Running the code above will produce the following output.

library
sample_db
sample_mflix
admin
local

Inserting a Document

One of the most basic operations is inserting documents into a collection. Here's a simple example:

# Access the sample_db database; if it doesn't exist, MongoDB creates it when you first store some data in it.
# The 'client.sample_db' syntax is used to access the database.
db = client.sample_db

# Access the collection named 'my_collection' in the sample_db database
collection = db.my_collection

# Create a dictionary representing the document to be inserted into the collection
# The document contains information about a person: name, age, and city
document = {"name": "John Doe", "age": 29, "city": "New York"}

# Insert the document into the 'my_collection' collection
# The insert_one method returns a result object that contains information about the insertion
result = collection.insert_one(document)

# Print the unique ID of the inserted document
# The inserted_id attribute of the result object contains this ID
print("Document id inserted: {document_id}".format(document_id=result.inserted_id))

Querying Documents

To retrieve a single document, you can use the find_one method:

# Access the sample_db database; if it doesn't exist, MongoDB creates it when you first store some data in it.
# The 'client.sample_db' syntax is used to access the database.
db = client.sample_db

# Access the collection named 'my_collection' in the sample_db database.
# A collection in MongoDB is similar to a table in a relational database.
collection = db.my_collection

# Find a single document in the 'my_collection' collection where the 'name' field equals "John Doe".
# The method 'find_one' returns the first match found.
result = collection.find_one({"name": "John Doe"})

# Print the result to the console.
# If a document with the 'name' "John Doe" is found, it will print that document.
# If no such document exists, it will print 'None'.
print(result)

Running the code above will produce the following output.

{'_id': ObjectId('6668b417ddfc8c94966ada18'), 'name': 'John Doe', 'age': 29, 'city': 'New York'}

⚠️ The ObjectId will be different for each newly created document.
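While find_one returns a single document, the find method returns a cursor over every match. Here is a minimal sketch, assuming the same name, age, and city fields used above; the helper name people_query is hypothetical. It builds a filter with the $gte comparison operator and a projection that keeps only name and age.

```python
def people_query(city, min_age):
    """Return a (filter, projection) pair suitable for find().

    The filter matches documents in the given city whose age is at least
    min_age; the projection keeps only name and age and hides _id.
    """
    query_filter = {"city": city, "age": {"$gte": min_age}}
    projection = {"_id": 0, "name": 1, "age": 1}
    return query_filter, projection


query_filter, projection = people_query("New York", 18)
print(query_filter)
print(projection)

# With the 'collection' object from the example above and a reachable server:
# for person in collection.find(query_filter, projection):
#     print(person)
```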

Inserting Multiple Documents

To insert several documents in a single call, pass a list of dictionaries to insert_many:

# Access the sample_db database; MongoDB creates it if it doesn’t exist.
# The 'client.sample_db' syntax is used to access the database.
db = client.sample_db

# Access the collection named 'my_collection' within the sample_db database.
# A collection in MongoDB is similar to a table in a relational database.
collection = db.my_collection

# Define a list of documents to be inserted into 'my_collection'.
# Each document is represented as a Python dictionary with key-value pairs.
documents = [
    {"name": "Alice", "age": 25, "city": "London"},
    {"name": "Bob", "age": 30, "city": "San Francisco"},
]

# Insert the list of documents into the collection.
# The insert_many method inserts multiple documents at once and returns an object containing the inserted_ids.
result = collection.insert_many(documents)

# Print the unique IDs of the inserted documents.
# The inserted_ids attribute of the result object contains these IDs.
print("Documents inserted with IDs:", result.inserted_ids)

Running the code above will produce the following output.

Documents inserted with IDs: [ObjectId('6668ec777220f360ac35571f'), ObjectId('6668ec777220f360ac355720')]
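When the documents come from a large source, you may prefer to send them in several smaller insert_many calls, for example to log progress between batches. The sketch below illustrates that idea under stated assumptions: the helper name chunked, the sample data, and the batch size of 2 are hypothetical.

```python
def chunked(documents, batch_size):
    """Yield successive slices of 'documents' with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


# Hypothetical sample data, used only to show the batch sizes.
people = [{"name": f"Person {i}", "age": 20 + i} for i in range(5)]
for batch in chunked(people, 2):
    print(len(batch), "document(s) in this batch")

# With the 'collection' object from the example above and a reachable server:
# for batch in chunked(people, 2):
#     collection.insert_many(batch)
```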

Updating Documents

In real-world applications, you'll often need to update existing records. Here's how you can accomplish that:

# Access the sample_db database; if it doesn't exist, MongoDB creates it when you first store some data in it.
# The 'client.sample_db' syntax is used to access the database.
db = client.sample_db

# Access the collection named 'my_collection' in the sample_db database
collection = db.my_collection

# Update the first document whose 'name' is "John Doe", setting its 'age' to 30 with the $set operator
# The update_one method returns a result object that contains information about the update
result = collection.update_one(
    {"name": "John Doe"},
    {"$set": {"age": 30}}
)

# Check whether a document was changed; modified_count is the number of documents the update actually modified
if result.modified_count > 0:
    print("Document updated successfully.")
else:
    print("No document was updated.")
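The result of update_one exposes both matched_count and modified_count, and the two can differ. If John Doe's age is already 30, the filter matches one document but nothing changes, so modified_count is 0. The helper below is a hypothetical sketch (describe_update is not part of PyMongo) that turns the two counters into a clearer message.

```python
def describe_update(matched_count, modified_count):
    """Turn the two counters of an update result into a readable message."""
    if matched_count == 0:
        return "No document matched the filter."
    if modified_count == 0:
        return "A document matched, but it already had the requested values."
    return f"Updated {modified_count} of {matched_count} matched document(s)."


print(describe_update(0, 0))
print(describe_update(1, 0))
print(describe_update(1, 1))

# With the 'result' object returned by update_one above:
# print(describe_update(result.matched_count, result.modified_count))
```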

Deleting Documents

Removing documents is just as straightforward:

# Access the sample_db database; MongoDB creates it if it doesn’t exist.
# The 'client.sample_db' syntax is used to access the database.
db = client.sample_db

# Access the collection named 'my_collection' within the sample_db database.
# A collection in MongoDB is similar to a table in a relational database.
collection = db.my_collection

# Attempt to delete a single document from the 'my_collection' collection
# where the 'name' field is "John Doe".
# `delete_one` method deletes the first document that matches the filter criteria.
result = collection.delete_one({"name": "John Doe"})

# Check if a document was actually deleted by examining the 'deleted_count' attribute.
# This attribute returns the number of documents that were deleted.
if result.deleted_count > 0:
    # If deleted_count is greater than 0, it means at least one document was deleted.
    print("Document deleted successfully.")
else:
    # If deleted_count is 0, it means no document matched the filter criteria.
    print("No document was deleted.")
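To remove every matching document rather than only the first, PyMongo also provides delete_many. Here is a minimal sketch; the function name delete_people_in_city is hypothetical, and it expects a PyMongo collection such as the one used above.

```python
def delete_people_in_city(collection, city):
    """Delete every document whose 'city' equals the given value.

    Returns the number of documents removed, taken from the
    deleted_count attribute of the result object.
    """
    result = collection.delete_many({"city": city})
    return result.deleted_count


# With the 'collection' object from the example above and a reachable server:
# removed = delete_people_in_city(collection, "London")
# print(f"Removed {removed} document(s).")
```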

Advanced Operations

Aggregation Framework

The MongoDB Aggregation Framework is a powerful tool for processing data and transforming documents within a collection. Instead of using imperative code to process data, you can define a pipeline of operations that MongoDB will execute. Each stage in the pipeline transforms the documents as they pass through. The aggregation framework is particularly useful for tasks like filtering, grouping, averaging, and other types of data processing.

Key Concepts and Stages

  • Pipeline: A sequence of stages. Documents enter the pipeline and are processed sequentially by each stage.

  • Stage: Each stage operates on the documents and can perform operations such as filtering, grouping, projecting, sorting, and more.

  • Common Stages:

    • $match: Filters documents by a specified condition. Similar to the WHERE clause in SQL.

    • $group: Groups documents by a specified key and can compute aggregate values such as sums, averages, and counts.

    • $project: Reshapes documents, including or excluding fields or computing new fields.

    • $sort: Sorts documents based on one or more fields.

    • $limit and $skip: Control the number of documents to return, allowing for pagination.

In the code sample below, the pipeline has two stages:

  1. Step 1 ($match): The pipeline starts with filtering documents. In this case, only documents where the age field is greater than or equal to 25 are passed to the next stage.

  2. Step 2 ($group): The filtered documents are grouped by the city field (_id is set to $city). For each city, the average age is calculated using the $avg accumulator operator, and the result is stored in the average_age field.

This two-stage pipeline effectively computes the average age of all people aged 25 or older for each city in the collection.

# Access the sample_db database; MongoDB creates it if it doesn’t exist.
# The 'client.sample_db' syntax is used to access the database.
db = client.sample_db

# Access the collection named 'my_collection' within the sample_db database.
# A collection in MongoDB is similar to a table in a relational database.
collection = db.my_collection

# Define an aggregation pipeline to process documents in the collection.
# The pipeline is a list of stages, where each stage transforms the documents as they pass through.
# The stages are processed in order from the first to the last.
pipeline = [
    # The $match stage filters the documents by the specified condition.
    # Here, we are only interested in documents where the age field is greater than or equal to 25.
    {"$match": {"age": {"$gte": 25}}},

    # The $group stage groups the documents by a specified identifier.
    # Here, we are grouping documents by the 'city' field.
    # The '_id' field is used as the group identifier.
    # The 'average_age' field is calculated as the average value of the 'age' field for each group.
    {"$group": {"_id": "$city", "average_age": {"$avg": "$age"}}},
]

# Execute the aggregation pipeline on the collection.
# The aggregate() method returns an iterable with the results of the pipeline.
result = collection.aggregate(pipeline)

# Iterate over the result and print each document.
# Each document in the result represents a city and its corresponding average age.
for doc in result:
    print(doc)

Running the code above will produce the following output.

{'_id': 'San Francisco', 'average_age': 30.0}
{'_id': 'London', 'average_age': 25.0}
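To make the two stages concrete, here is a plain-Python sketch of what the $match and $group pipeline computes. It runs without a database and is only an illustration, not how MongoDB executes the pipeline. The function name average_age_by_city and the extra "Carol" document are hypothetical, and the sketch assumes every document has both an age and a city field.

```python
from collections import defaultdict
from statistics import mean

# Sample documents; "Carol" is a hypothetical entry added to show the filter at work.
people = [
    {"name": "Alice", "age": 25, "city": "London"},
    {"name": "Bob", "age": 30, "city": "San Francisco"},
    {"name": "Carol", "age": 19, "city": "London"},
]


def average_age_by_city(documents, min_age=25):
    """Plain-Python counterpart of the $match + $group pipeline."""
    ages_by_city = defaultdict(list)
    for doc in documents:
        # $match: keep only documents with age >= min_age
        if doc["age"] >= min_age:
            # $group: collect the ages under their city key
            ages_by_city[doc["city"]].append(doc["age"])
    # $avg: average the collected ages for each city
    return {city: float(mean(ages)) for city, ages in ages_by_city.items()}


print(average_age_by_city(people))
```

Carol is filtered out by the age condition, so London's average is computed from Alice alone.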

Practical Example

The following pipeline finds the maximum age per city and sorts the cities by that value in descending order:

# Access the sample_db database; MongoDB creates it if it doesn’t exist.
# The 'client.sample_db' syntax is used to access the database.
db = client.sample_db

# Access the collection named 'my_collection' within the sample_db database.
# A collection in MongoDB is similar to a table in a relational database.
collection = db.my_collection

# Define an aggregation pipeline to process the data in the collection.
# The pipeline contains a sequence of stages, each represented by a document.
pipeline = [
    # The $group stage groups documents by the 'city' field.
    # It creates a new document for each group where '_id' is the city name.
    # It also calculates the maximum age for each group and stores it in 'max_age'.
    {"$group": {"_id": "$city", "max_age": {"$max": "$age"}}},

    # The $sort stage sorts the grouped documents by the 'max_age' field in descending order (-1).
    {"$sort": {"max_age": -1}}
]

# Execute the aggregation pipeline on the collection.
# 'collection.aggregate(pipeline)' returns a cursor to the result of the aggregation.
result = collection.aggregate(pipeline)

# Iterate over the documents in the result cursor.
# Each document in 'result' is a dictionary containing '_id' (the city name) and 'max_age'.
for doc in result:
    # Print the city name and maximum age for each group.
    print(f"City: {doc['_id']}, Max Age: {doc['max_age']}")

Running the code above will produce the following output.

City: San Francisco, Max Age: 30
City: London, Max Age: 25
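The $skip and $limit stages mentioned earlier can be appended to a pipeline like this one to page through the results. Here is a sketch under stated assumptions: the helper name paginate and the 1-based page numbering are hypothetical choices made for illustration.

```python
def paginate(pipeline, page, page_size):
    """Return a new pipeline with $skip and $limit stages appended.

    'page' is 1-based: page 1 skips nothing, page 2 skips page_size
    documents, and so on. The original pipeline list is not modified.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    return pipeline + [
        {"$skip": (page - 1) * page_size},
        {"$limit": page_size},
    ]


base_pipeline = [
    {"$group": {"_id": "$city", "max_age": {"$max": "$age"}}},
    {"$sort": {"max_age": -1}},
]
print(paginate(base_pipeline, page=2, page_size=10))

# With the 'collection' object from the example above and a reachable server:
# for doc in collection.aggregate(paginate(base_pipeline, page=1, page_size=10)):
#     print(doc)
```

Placing $sort before $skip and $limit keeps the page contents stable between calls.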

Project in Action: From Theory to Practice

Now that we've covered the basics, let's put our knowledge into practice by creating a simple application to manage a book library. This project demonstrates the fundamental CRUD (Create, Read, Update, Delete) operations using MongoDB with the help of PyMongo, showcasing these concepts in a real-world scenario. Additionally, we will apply the Python logging techniques discussed in our latest text.

import logging

# Create a logger object with the name 'mongo_logger'
mongo_logger = logging.getLogger('mongo_logger')

# Set the logging level to DEBUG. This means that all messages with the level DEBUG and above will be logged
mongo_logger.setLevel(logging.DEBUG)

# Create two handlers: one prints log messages to the console (StreamHandler writes to standard error by default)
# and the other writes log messages to a file named 'mongo_logger.log'
console_handler = logging.StreamHandler()
file_handler = logging.FileHandler('mongo_logger.log')

# Define a formatter that specifies the format of the log messages
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# Set the formatter for the console handler and for the file handler
console_handler.setFormatter(formatter)
file_handler.setFormatter(formatter)

# Add the console handler to the logger. This means that log messages will be output to the console
mongo_logger.addHandler(console_handler)
# Add the file handler to the logger. This means that log messages will be written to the 'mongo_logger.log' file
mongo_logger.addHandler(file_handler)

# Function to get a specific database
def get_database(db_name):
    # Log that we are accessing the specified database
    mongo_logger.info(f"Accessing the '{db_name}' database.")
    # Return the database object from the client
    return client[db_name]

# Function to get a specific collection from a database
def get_collection(db, collection_name):
    # Log that we are accessing the specified collection within the given database
    mongo_logger.info(f"Accessing the '{collection_name}' collection within the '{db.name}' database.")
    # Return the collection object from the database
    return db[collection_name]

# Function to insert multiple book documents into a collection
def insert_books(collection, books):
    # Log that we are inserting new book documents
    mongo_logger.info("Inserting new book documents into the collection.")
    # Insert the list of books into the collection using insert_many
    result = collection.insert_many(books)
    # Log the IDs of the inserted documents
    mongo_logger.info(f'Inserted book IDs: {result.inserted_ids}')
    # Return the result of the insert operation
    return result

# Function to find books by a specific author in a collection
def find_books_by_author(collection, author_name):
    # Log that we are querying for books by the specified author
    mongo_logger.info(f"Querying the collection for all books authored by {author_name}.")
    # Query the collection for books with the given author name.
    # find() returns a cursor that can be consumed only once, so store the results in a list.
    books = list(collection.find({"author": author_name}))
    # Log the number of books found
    mongo_logger.info(f"Found {len(books)} books by {author_name}.")
    # Return the list of books found
    return books

# Function to update the genre of a book by its title
def update_book_genre(collection, title, new_genre):
    # Log that we are updating the genre of the specified book
    mongo_logger.info(f"Updating the genre of '{title}' to '{new_genre}'.")
    # Update the genre of the book with the given title
    update_result = collection.update_one(
        {"title": title},
        {"$set": {"genre": new_genre}}
    )
    # Log the number of matched and modified documents
    mongo_logger.info(f'Matched {update_result.matched_count} document(s) and modified {update_result.modified_count} document(s).')
    # Return the result of the update operation
    return update_result

# Function to delete a book by its title
def delete_book(collection, title):
    # Log that we are deleting the book with the specified title
    mongo_logger.info(f"Deleting the book with the title '{title}'.")
    # Delete the book document with the given title
    delete_result = collection.delete_one({"title": title})
    # Log the number of documents deleted
    mongo_logger.info(f'Deleted {delete_result.deleted_count} document(s).')
    # Return the result of the delete operation
    return delete_result
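One detail worth knowing about find_books_by_author: the cursor returned by find can be consumed only once, and once exhausted it yields nothing when iterated again. Counting the results with len(list(cursor)) therefore uses the cursor up. The sketch below uses a plain Python iterator as a stand-in for a cursor (an assumption made so that it runs without a database; make_cursor is a hypothetical name) to show the effect.

```python
def make_cursor(documents):
    """Stand-in for a database cursor: an iterator that can be consumed only once."""
    return iter(documents)


books = [{"title": "The Catcher in the Rye", "author": "J.D. Salinger"}]

cursor = make_cursor(books)
count = len(list(cursor))   # counting consumes the iterator
leftover = list(cursor)     # nothing remains to iterate
print(count, leftover)      # prints: 1 []

# Materializing the results once keeps both the count and the documents available.
results = list(make_cursor(books))
print(len(results), results[0]["title"])
```

Converting the cursor to a list once and returning that list avoids handing an exhausted cursor back to the caller.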

# The following script performs basic operations on a library database.
# It assumes the existence of certain helper functions: get_database, get_collection,
# insert_books, find_books_by_author, update_book_genre, and delete_book.

if __name__ == "__main__":
    # This line checks if the script is run as the main module and not imported as a part of another module.

    # Initialize the database connection with the database named 'library'
    db = get_database('library')

    # Access the 'books' collection from the 'library' database
    books_collection = get_collection(db, 'books')

    # Define a list of new books to be added to the collection
    new_books = [
        {
            "title": "The Catcher in the Rye",
            "author": "J.D. Salinger",
            "year": 1951,
            "genre": "Fiction"
        },
        {
            "title": "To Kill a Mockingbird",
            "author": "Harper Lee",
            "year": 1960,
            "genre": "Fiction"
        },
        {
            "title": "Python Crash Course",
            "author": "Eric Matthes",
            "year": 2015,
            "genre": "Programming"
        }
    ]

    # Insert the list of new books into the 'books' collection
    insert_books(books_collection, new_books)

    # Find all books by the author J.D. Salinger in the 'books' collection
    salinger_books = find_books_by_author(books_collection, "J.D. Salinger")

    # Log the books found by the author J.D. Salinger

Conclusion

By integrating PyMongo into your Python projects, you can put MongoDB to work with very little code. From basic insertions and queries to updates, deletions, and aggregation pipelines, PyMongo offers a robust set of tools to streamline your database operations. Whether you are a beginner, intermediate, or advanced Python developer, this guide gives you a foundation for exploring the rest of what PyMongo offers.

Feel free to reply to this newsletter with any questions or topics you'd like us to cover in the future.

If you liked this newsletter, don't forget to subscribe to receive regular updates. Share with your friends and colleagues interested in Python and let's grow together in our community of programmers!

Remember, the key to mastery is practice and persistence. Happy coding! Until the next edition, keep programming! 👨‍💻

InfinitePy Newsletter - Your source for Python learning and inspiration.