Data Types#

PyMongoArrow supports a majority of the BSON types. As Arrow and Polars provide first-class support for Lists and Structs, this includes Embedded arrays and documents.

Support for additional types will be added in subsequent releases.

Note

For more information about BSON types, see the BSON specification.

Note

Decimal128 types are only supported on little-endian systems. On big-endian systems, null will be used.

BSON Type: Type Identifiers

String: py.str, an instance of pyarrow.string
Embedded document: py.dict, an instance of pyarrow.struct
Embedded array: an instance of pyarrow.list_
ObjectId: py.bytes, bson.ObjectId, an instance of pymongoarrow.types.ObjectIdType, an instance of pymongoarrow.pandas_types.PandasObjectId
Decimal128: bson.Decimal128, an instance of pymongoarrow.types.Decimal128Type, an instance of pymongoarrow.pandas_types.PandasDecimal128
Boolean: an instance of pyarrow.bool_, py.bool
64-bit binary floating point: py.float, an instance of pyarrow.float64()
32-bit integer: an instance of pyarrow.int32()
64-bit integer: py.int, bson.int64.Int64, an instance of pyarrow.int64()
UTC datetime: an instance of pyarrow.timestamp with ms resolution, py.datetime.datetime
Binary data: bson.Binary, an instance of pymongoarrow.types.BinaryType, an instance of pymongoarrow.pandas_types.PandasBinary
JavaScript code: bson.Code, an instance of pymongoarrow.types.CodeType, an instance of pymongoarrow.pandas_types.PandasCode

Type identifiers can be used to specify that a field is of a certain type during pymongoarrow.api.Schema declaration. For example, if your data has fields ‘f1’ and ‘f2’ of types 32-bit integer and UTC datetime respectively, and an ‘_id’ field that is an ObjectId, your schema can be defined as:

from bson import ObjectId
from pymongoarrow.api import Schema
import pyarrow

schema = Schema({
  '_id': ObjectId,
  'f1': pyarrow.int32(),
  'f2': pyarrow.timestamp('ms')
})

Unsupported data types in a schema cause a ValueError identifying the field and its data type.
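As a rough sketch of what triggering that error might look like (the field name ‘f1’ and the unsupported Python type complex are purely illustrative):

# Hypothetical sketch: 'f1' is an illustrative field name and complex is not
# a supported type identifier, so declaring the schema raises ValueError.
from pymongoarrow.api import Schema

try:
    Schema({'f1': complex})
except ValueError as exc:
    print(exc)  # the message identifies the offending field and its type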

Embedded Array Considerations#

The schema used for an Embedded Array must use the pyarrow.list_() type, so that the type of the array elements can be specified, as in the sketch below.
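A minimal sketch, assuming a hypothetical ‘tags’ field that holds an array of strings:

# Illustrative schema for documents shaped like {"_id": ..., "tags": ["a", "b"]};
# the 'tags' field name is hypothetical.
import pyarrow
from bson import ObjectId
from pymongoarrow.api import Schema

schema = Schema({
  '_id': ObjectId,
  'tags': pyarrow.list_(pyarrow.string())
})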

Extension Types#

The ObjectId, Decimal128, Binary data, and JavaScript code types are implemented as extension types for PyArrow and Pandas. For Arrow tables, values of these types will have the appropriate PyMongoArrow extension type (e.g. pymongoarrow.types.ObjectIdType). The corresponding bson Python object can be obtained using the .as_py() method, or by calling .to_pylist() on the table.

>>> from pymongo import MongoClient
>>> from bson import ObjectId
>>> from pymongoarrow.api import find_arrow_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> table = find_arrow_all(coll, {})
>>> table
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
foo: int32
----
_id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]]
foo: [[100,200]]
>>> table["_id"][0]
<pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')>
>>> table["_id"][0].as_py()
ObjectId('64408b0d5ac9e208af220142')
>>> table.to_pylist()
[{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100},
 {'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}]

When converting to pandas, the extension type columns will have the appropriate PyMongoArrow extension dtype (e.g. pymongoarrow.pandas_types.PandasDecimal128), and the elements of the DataFrame will be the corresponding bson objects.

>>> from pymongo import MongoClient
>>> from bson import Decimal128
>>> from pymongoarrow.api import find_pandas_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> df = find_pandas_all(coll, {})
>>> df
                        _id  foo
0  64408bf65ac9e208af220144  0.1
1  64408bf65ac9e208af220145  0.1
>>> df["foo"].dtype
<pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90>
>>> df["foo"][0]
Decimal128('0.1')
>>> df["_id"][0]
ObjectId('64408bf65ac9e208af220144')

As of this writing, Polars does not support Extension Types.

Null Values and Conversion to Pandas DataFrames#

In Arrow (and Polars), all Arrays are nullable. Pandas has experimental nullable data types, such as “Int64” (note the capital “I”). You can instruct Arrow to create a pandas DataFrame that uses nullable dtypes by passing a types_mapper, as in the example below (adapted from the Arrow documentation):

>>> import pandas as pd
>>> import pyarrow as pa
>>> dtype_mapping = {
...     pa.int8(): pd.Int8Dtype(),
...     pa.int16(): pd.Int16Dtype(),
...     pa.int32(): pd.Int32Dtype(),
...     pa.int64(): pd.Int64Dtype(),
...     pa.uint8(): pd.UInt8Dtype(),
...     pa.uint16(): pd.UInt16Dtype(),
...     pa.uint32(): pd.UInt32Dtype(),
...     pa.uint64(): pd.UInt64Dtype(),
...     pa.bool_(): pd.BooleanDtype(),
...     pa.float32(): pd.Float32Dtype(),
...     pa.float64(): pd.Float64Dtype(),
...     pa.string(): pd.StringDtype(),
... }
>>> df = arrow_table.to_pandas(
...     types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True
... )
>>> del arrow_table

Including the mapping for pa.string() additionally converts Arrow strings to the pandas string dtype rather than NumPy objects.

Nested Extension Types#

Pending ARROW-179, extension types such as ObjectId that appear in nested documents will not be converted to the corresponding PyMongoArrow extension type, but will instead have the raw Arrow type (FixedSizeBinaryType(fixed_size_binary[12])).

These values can either be consumed as-is or converted individually to the desired extension type, e.g. _id = out['nested'][0]['_id'].cast(ObjectIdType()).
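A minimal sketch of that manual conversion, assuming a hypothetical collection whose documents carry an ObjectId inside a ‘nested’ subdocument (the collection and field names are illustrative):

# Hypothetical sketch: an ObjectId inside a nested document comes back with the
# raw Arrow storage type (fixed_size_binary[12]) and is cast to the extension
# type by hand.
from bson import ObjectId
from pymongo import MongoClient
from pymongoarrow.api import find_arrow_all
from pymongoarrow.types import ObjectIdType

coll = MongoClient().test.nested_demo             # illustrative collection name
coll.insert_one({"nested": {"_id": ObjectId()}})
out = find_arrow_all(coll, {})
raw = out["nested"][0]["_id"]                     # fixed-size binary scalar
_id = raw.cast(ObjectIdType())                    # PyMongoArrow ObjectId scalar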