Quick Start =========== This tutorial is intended as an introduction to working with **PyMongoArrow**. The reader is assumed to be familiar with basic `PyMongo `_ and `MongoDB `_ concepts. Prerequisites ------------- Before we start, make sure that you have the **PyMongoArrow** distribution :doc:`installed `. In the Python shell, the following should run without raising an exception:: import pymongoarrow as pma This tutorial also assumes that a MongoDB instance is running on the default host and port. Assuming you have `downloaded and installed `_ MongoDB, you can start it like so: .. code-block:: bash $ mongod Extending PyMongo ^^^^^^^^^^^^^^^^^ The :mod:`pymongoarrow.monkey` module provides an interface to patch PyMongo, in place, and add **PyMongoArrow**'s functionality directly to :class:`~pymongo.collection.Collection` instances:: from pymongoarrow.monkey import patch_all patch_all() After running :meth:`~pymongoarrow.monkey.patch_all`, new instances of :class:`~pymongo.collection.Collection` will have PyMongoArrow's APIs, e.g. :meth:`~pymongoarrow.api.find_pandas_all`. .. note:: Users can also directly use any of **PyMongoArrow**'s APIs by importing them from :mod:`pymongoarrow.api`. The only difference in usage would be the need to manually pass the instance of :class:`~pymongo.collection.Collection` on which the operation is to be run as the first argument when directly using the API method. Test data ^^^^^^^^^ Before we begein, we must first add some data to our cluster that we can query. We can do so using **PyMongo**:: from datetime import datetime from pymongo import MongoClient client = MongoClient() client.db.data.insert_many([ {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1)}, {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11)}, {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9)}, {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31)}]) Defining the schema ------------------- **PyMongoArrow** relies upon a **user-specified** data schema to marshall query result sets into tabular form. Users can define the schema by instantiating :class:`pymongoarrow.api.Schema` using a mapping of field names to type-specifiers, e.g.:: from pymongoarrow.api import Schema schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime}) There are multiple permissible type-identifiers for each supported BSON type. For a full-list of supported types and associated type-identifiers see :doc:`supported_types`. Find operations --------------- We are now ready to query our data. Let's start by running a ``find`` operation to load all records with a non-zero ``amount`` as a :class:`pandas.DataFrame`:: df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema) We can also load the same result set as a :class:`pyarrow.Table` instance:: arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema) Or as :class:`numpy.ndarray` instances:: ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema) In the NumPy case, the return value is a dictionary where the keys are field names and values are the corresponding arrays. Aggregate operations -------------------- Running ``aggregate`` operations is similar to ``find``. Here is an example of an aggregation that loads all records with an ``amount`` less than 10:: # pandas df = client.db.data.aggregate_pandas_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema) # arrow arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema) # numpy ndarrays = client.db.data.aggregate_numpy_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema) Writing to other formats ------------------------ Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type can be easily written to one of the formats supported by `PyArrow `_. For example, to write the table referenced by the variable ``arrow_table`` to a Parquet file ``example.parquet``, run:: import pyarrow.parquet as pq pq.write_table(arrow_table, 'example.parquet') Pandas also supports writing :class:`~pandas.DataFrame` instances to a variety of formats including CSV, and HDF. For example, to write the data frame referenced by the variable ``df`` to a CSV file ``out.csv``, run:: df.to_csv('out.csv', index=False)