Data Types

PyMongoArrow supports most BSON types. Because Arrow and Polars provide first-class support for lists and structs, this includes embedded arrays and documents.

Support for additional types will be added in subsequent releases.

Note

For more information about BSON types, see the BSON specification.

Note

Decimal128 types are only supported on little-endian systems. On big-endian systems, null will be used.

BSON Type	Type Identifiers
String	`py.str`, an instance of `pyarrow.string`
Embedded document	`py.dict`, an instance of `pyarrow.struct`
Embedded array	an instance of `pyarrow.list_`
ObjectId	`py.bytes`, `bson.ObjectId`, an instance of `pymongoarrow.types.ObjectIdType`, an instance of `pymongoarrow.pandas_types.PandasObjectId`
Decimal128	`bson.Decimal128`, an instance of `pymongoarrow.types.Decimal128Type`, an instance of `pymongoarrow.pandas_types.PandasDecimal128`
Boolean	an instance of `pyarrow.bool_()`, `bool`
64-bit binary floating point	`py.float`, an instance of `pyarrow.float64()`
32-bit integer	an instance of `pyarrow.int32()`
64-bit integer	`int`, `bson.int64.Int64`, an instance of `pyarrow.int64()`
UTC datetime	an instance of `pyarrow.timestamp` with `ms` resolution, `py.datetime.datetime`
Binary data	`bson.Binary`, an instance of `pymongoarrow.types.BinaryType`, an instance of `pymongoarrow.pandas_types.PandasBinary`
JavaScript code	`bson.Code`, an instance of `pymongoarrow.types.CodeType`, an instance of `pymongoarrow.pandas_types.PandasCode`

Type identifiers can be used to specify that a field is of a certain type during pymongoarrow.api.Schema declaration. For example, if your data has fields ‘f1’ and ‘f2’ bearing types 32-bit integer and UTC datetime respectively, and ‘_id’ that is an ObjectId, your schema can be defined as:

from bson import ObjectId
import pyarrow
from pymongoarrow.api import Schema

schema = Schema({
  '_id': ObjectId,
  'f1': pyarrow.int32(),
  'f2': pyarrow.timestamp('ms')
})

Unsupported data types in a schema cause a ValueError identifying the field and its data type.

Embedded Array Considerations

The schema used for an embedded array must use the pyarrow.list_() type, so that the type of the array elements can be specified.

Extension Types

The ObjectId, Decimal128, Binary data, and JavaScript code types are implemented as extension types for PyArrow and pandas. For Arrow tables, values of these types will have the appropriate pymongoarrow extension type (e.g. pymongoarrow.types.ObjectIdType). The corresponding bson Python object can be obtained by calling .as_py() on a value, or by calling .to_pylist() on the table.

>>> from pymongo import MongoClient
>>> from bson import ObjectId
>>> from pymongoarrow.api import find_arrow_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> table = find_arrow_all(coll, {})
>>> table
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
foo: int32
----
_id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]]
foo: [[100,200]]
>>> table["_id"][0]
<pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')>
>>> table["_id"][0].as_py()
ObjectId('64408b0d5ac9e208af220142')
>>> table.to_pylist()
[{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100},
 {'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}]

When converting to pandas, the extension type columns will have an appropriate pymongoarrow extension type (e.g. pymongoarrow.pandas_types.PandasDecimal128). The value of the element in the dataframe will be the appropriate bson type.

>>> from pymongo import MongoClient
>>> from bson import Decimal128
>>> from pymongoarrow.api import find_pandas_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> df = find_pandas_all(coll, {})
>>> df
                        _id  foo
0  64408bf65ac9e208af220144  0.1
1  64408bf65ac9e208af220145  0.1
>>> df["foo"].dtype
<pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90>
>>> df["foo"][0]
Decimal128('0.1')
>>> df["_id"][0]
ObjectId('64408bf65ac9e208af220144')

As of this writing, Polars does not support Extension Types.

Null Values and Conversion to Pandas DataFrames

In Arrow (and Polars), all arrays are nullable. Pandas offers experimental nullable data types, such as “Int64” (note the capital “I”). You can instruct Arrow to create a pandas DataFrame that uses nullable dtypes with the code below (adapted from the Apache Arrow documentation on pandas integration):

>>> import pandas as pd
>>> import pyarrow as pa
>>> # Map each Arrow type to its nullable pandas counterpart.
>>> dtype_mapping = {
...     pa.int8(): pd.Int8Dtype(),
...     pa.int16(): pd.Int16Dtype(),
...     pa.int32(): pd.Int32Dtype(),
...     pa.int64(): pd.Int64Dtype(),
...     pa.uint8(): pd.UInt8Dtype(),
...     pa.uint16(): pd.UInt16Dtype(),
...     pa.uint32(): pd.UInt32Dtype(),
...     pa.uint64(): pd.UInt64Dtype(),
...     pa.bool_(): pd.BooleanDtype(),
...     pa.float32(): pd.Float32Dtype(),
...     pa.float64(): pd.Float64Dtype(),
...     pa.string(): pd.StringDtype(),
... }
>>> # arrow_table is an existing pyarrow.Table, e.g. one returned by find_arrow_all.
>>> df = arrow_table.to_pandas(
...     types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True
... )
>>> del arrow_table

Defining a conversion for pa.string() additionally converts Arrow strings to the pandas string dtype rather than to Python objects.

Nested Extension Types

Pending ARROW-179, extension types such as ObjectId that appear in nested documents will not be converted to the corresponding PyMongoArrow extension type, but will instead have the raw Arrow type (FixedSizeBinaryType(fixed_size_binary[12])).

These values can either be consumed as-is or converted individually to the desired extension type, e.g. _id = out['nested'][0]['_id'].cast(ObjectIdType()).