Quick Start

This tutorial is intended as an introduction to working with PyMongoArrow. The reader is assumed to be familiar with basic PyMongo and MongoDB concepts.


Before we start, make sure that you have the PyMongoArrow distribution installed. In the Python shell, the following should run without raising an exception:

import pymongoarrow as pma

This tutorial also assumes that a MongoDB instance is running on the default host and port. Assuming you have downloaded and installed MongoDB, you can start it like so:

$ mongod

Extending PyMongo

The pymongoarrow.monkey module provides an interface to patch PyMongo, in place, and add PyMongoArrow’s functionality directly to Collection instances:

from pymongoarrow.monkey import patch_all
patch_all()

After running patch_all(), new instances of Collection will have PyMongoArrow’s APIs, e.g. find_pandas_all().


Users can also use any of PyMongoArrow’s APIs directly by importing them from pymongoarrow.api. The only difference in usage is that when calling an API method directly, you must pass the Collection on which the operation is to be run as the first argument.

Test data

Before we begin, we must first add some data to our cluster that we can query. We can do so using PyMongo:

from datetime import datetime
from pymongo import MongoClient
client = MongoClient()
client.db.data.insert_many([
    {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1)},
    {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11)},
    {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9)},
    {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31)}])

Defining the schema

PyMongoArrow relies upon a user-specified data schema to marshal query result sets into tabular form. Users can define the schema by instantiating pymongoarrow.api.Schema using a mapping of field names to type-specifiers, e.g.:

from datetime import datetime
from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})

There are multiple permissible type-identifiers for each supported BSON type. For a full list of supported types and associated type-identifiers see Supported Types.

Find operations

We are now ready to query our data. Let’s start by running a find operation to load all records with a non-zero amount as a pandas.DataFrame:

df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)

We can also load the same result set as a pyarrow.Table instance:

arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)

Or as numpy.ndarray instances:

ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)

In the NumPy case, the return value is a dictionary where the keys are field names and values are the corresponding arrays.
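To make that shape concrete, here is a minimal illustration built with plain NumPy (no MongoDB connection required; the values mirror the non-zero amounts from the test data above):

```python
import numpy as np

# The dict-of-arrays shape that find_numpy_all returns, mocked up here:
ndarrays = {
    '_id': np.array([1, 2, 3]),
    'amount': np.array([21.0, 16.0, 3.0]),
}

# Each field's array is addressable by name:
print(ndarrays['amount'].mean())
```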

Aggregate operations

Running aggregate operations is similar to find. Here is an example of an aggregation that loads all records with an amount less than or equal to 10:

# pandas
df = client.db.data.aggregate_pandas_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
# arrow
arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
# numpy
ndarrays = client.db.data.aggregate_numpy_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)

Writing to other formats

Result sets that have been loaded as Arrow’s Table type can be easily written to one of the formats supported by PyArrow. For example, to write the table referenced by the variable arrow_table to a Parquet file example.parquet, run:

import pyarrow.parquet as pq
pq.write_table(arrow_table, 'example.parquet')

Pandas also supports writing DataFrame instances to a variety of formats including CSV, and HDF. For example, to write the data frame referenced by the variable df to a CSV file out.csv, run:

df.to_csv('out.csv', index=False)