Quick Start¶
This tutorial is intended as an introduction to working with PyMongoArrow. The reader is assumed to be familiar with basic PyMongo and MongoDB concepts.
Prerequisites¶
Before we start, make sure that you have the PyMongoArrow distribution installed. In the Python shell, the following should run without raising an exception:
import pymongoarrow as pma
This tutorial also assumes that a MongoDB instance is running on the default host and port. Assuming you have downloaded and installed MongoDB, you can start it like so:
$ mongod
Extending PyMongo¶
The pymongoarrow.monkey module provides an interface to patch PyMongo in place, adding PyMongoArrow's functionality directly to Collection instances:
from pymongoarrow.monkey import patch_all
patch_all()
After running patch_all(), new instances of Collection will have PyMongoArrow's APIs, e.g. find_pandas_all().
Note
Users can also use any of PyMongoArrow's APIs directly by importing them from pymongoarrow.api. The only difference in usage is that the Collection instance on which the operation is to be run must be passed manually as the first argument.
Test data¶
Before we begin, we must first add some data to our cluster that we can query. We can do so using PyMongo:
from datetime import datetime
from pymongo import MongoClient
client = MongoClient()
client.db.data.insert_many([
{'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1)},
{'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11)},
{'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9)},
{'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31)}])
Defining the schema¶
PyMongoArrow relies upon a user-specified data schema to marshal query result sets into tabular form. Users can define the schema by instantiating pymongoarrow.api.Schema using a mapping of field names to type-specifiers, e.g.:
from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
There are multiple permissible type-identifiers for each supported BSON type. For a full list of supported types and associated type-identifiers, see Supported Types.
Find operations¶
We are now ready to query our data. Let's start by running a find operation to load all records with a non-zero amount as a pandas.DataFrame:
df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)
We can also load the same result set as a pyarrow.Table instance:
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
Or as numpy.ndarray instances:
ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
In the NumPy case, the return value is a dictionary where the keys are field names and values are the corresponding arrays.
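The sketch below shows how such a result dictionary can be unpacked, using a stand-in dictionary of the same shape as the real return value:

```python
import numpy as np

# Stand-in for the value returned by find_numpy_all() on the test data above;
# the real call returns a dict mapping field names to numpy.ndarray instances.
ndarrays = {
    '_id': np.array([1, 2, 3]),
    'amount': np.array([21.0, 16.0, 3.0]),
}

amounts = ndarrays['amount']  # one ndarray per field in the schema
total = amounts.sum()         # ordinary NumPy operations apply from here on
```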
Aggregate operations¶
Running aggregate operations is similar to find. Here is an example of an aggregation that loads all records with an amount of at most 10:
# pandas
df = client.db.data.aggregate_pandas_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
# arrow
arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
# numpy
ndarrays = client.db.data.aggregate_numpy_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
Writing to other formats¶
Result sets that have been loaded as Arrow's Table type can easily be written to one of the formats supported by PyArrow. For example, to write the table referenced by the variable arrow_table to a Parquet file example.parquet, run:
import pyarrow.parquet as pq
pq.write_table(arrow_table, 'example.parquet')
Pandas also supports writing DataFrame instances to a variety of formats, including CSV and HDF. For example, to write the data frame referenced by the variable df to a CSV file out.csv, run:
df.to_csv('out.csv', index=False)
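As a quick check, pandas.read_csv() can load the file back. A self-contained sketch with a stand-in frame in place of the query results:

```python
import pandas as pd

# Stand-in for the query-result DataFrame above.
df = pd.DataFrame({'_id': [1, 2], 'amount': [21.0, 16.0]})
df.to_csv('out.csv', index=False)  # index=False drops the positional index column

out = pd.read_csv('out.csv')  # columns and dtypes survive the round trip here
```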