Quick Start¶
This tutorial is intended as an introduction to working with PyMongoArrow. The reader is assumed to be familiar with basic PyMongo and MongoDB concepts.
Prerequisites¶
Before we start, make sure that you have the PyMongoArrow distribution installed. In the Python shell, the following should run without raising an exception:
import pymongoarrow as pma
This tutorial also assumes that a MongoDB instance is running on the default host and port. Assuming you have downloaded and installed MongoDB, you can start it like so:
$ mongod
Extending PyMongo¶
The pymongoarrow.monkey module provides an interface to patch PyMongo,
in place, and add PyMongoArrow’s functionality directly to
Collection instances:
from pymongoarrow.monkey import patch_all
patch_all()
After running patch_all(), new instances of
Collection will have PyMongoArrow’s APIs,
e.g. find_pandas_all().
Note
Users can also directly use any of PyMongoArrow’s APIs
by importing them from pymongoarrow.api. The only difference in
usage would be the need to manually pass the instance of
Collection on which the operation is to be
run as the first argument when directly using the API method.
Test data¶
Before we begein, we must first add some data to our cluster that we can query. We can do so using PyMongo:
from datetime import datetime
from pymongo import MongoClient
client = MongoClient()
client.db.data.insert_many([
{'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1)},
{'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11)},
{'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9)},
{'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31)}])
Defining the schema¶
PyMongoArrow relies upon a user-specified data schema to marshall
query result sets into tabular form. Users can define the schema by
instantiating pymongoarrow.api.Schema using a mapping of field names
to type-specifiers, e.g.:
from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
There are multiple permissible type-identifiers for each supported BSON type. For a full-list of supported types and associated type-identifiers see Supported Types.
Find operations¶
We are now ready to query our data. Let’s start by running a find
operation to load all records with a non-zero amount as a
pandas.DataFrame:
df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)
We can also load the same result set as a pyarrow.Table instance:
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
Or as numpy.ndarray instances:
ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
In the NumPy case, the return value is a dictionary where the keys are field names and values are the corresponding arrays.
Aggregate operations¶
Running aggregate operations is similar to find. Here is an example of
an aggregation that loads all records with an amount less than 10:
# pandas
df = client.db.data.aggregate_pandas_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
# arrow
arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
# numpy
ndarrays = client.db.data.aggregate_numpy_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
Writing to other formats¶
Result sets that have been loaded as Arrow’s Table type can
be easily written to one of the formats supported by
PyArrow. For example,
to write the table referenced by the variable arrow_table to a Parquet
file example.parquet, run:
import pyarrow.parquet as pq
pq.write_table(arrow_table, 'example.parquet')
Pandas also supports writing DataFrame instances to a variety
of formats including CSV, and HDF. For example, to write the data frame
referenced by the variable df to a CSV file out.csv, run:
df.to_csv('out.csv', index=False)