Extension Types

This tutorial introduces the extension types that PyMongoArrow provides. It assumes familiarity with basic PyMongo and MongoDB concepts. For more information, see the Arrow extension type docs.

Extension types with Arrow

The two extension types, pymongoarrow.types.ObjectIdType and pymongoarrow.types.Decimal128StringType, are only partially supported in PyArrow. They can be used in a schema, but the resulting table stores them as fixed_size_binary(12) and string columns respectively, and accessing individual values returns pyarrow.lib.FixedSizeBinaryScalar and pyarrow.lib.StringScalar objects:

from pymongoarrow.api import Schema, find_arrow_all
from pymongoarrow.types import Decimal128StringType, ObjectIdType

# coll is an existing pymongo Collection.
schema = Schema({"_id": ObjectIdType(), "data": Decimal128StringType()})
table = find_arrow_all(coll, {}, schema=schema)
print(table)
# pyarrow.Table
# _id: fixed_size_binary[12]
# data: string
# ----
# _id: [[63C003BF0A1D5281D33B0AFD,63C003BF0A1D5281D33B0AFE,63C003BF0A1D5281D33B0AFF,63C003BF0A1D5281D33B0B00]]
# data: [["0.1","1.0","0.00001",null]]
# ...
print(type(table["_id"][0]))
print(type(table["data"][0]))
# <class 'pyarrow.lib.FixedSizeBinaryScalar'>
# <class 'pyarrow.lib.StringScalar'>

Extension types with Pandas/NumPy

The extension types are likewise only partially supported in Pandas and NumPy. They can be used in a schema, but the corresponding columns of the resulting DataFrame have the object dtype, and accessing individual values returns Python bytes or str objects respectively:

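from pymongoarrow.api import Schema, find_pandas_all
from pymongoarrow.types import Decimal128StringType, ObjectIdType

# coll is an existing pymongo Collection.
schema = Schema({"_id": ObjectIdType(), "data": Decimal128StringType()})
table = find_pandas_all(coll, {}, schema=schema)
# info() prints its summary directly and returns None, so it is not wrapped in print().
table.info()
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 2 columns):
#  #   Column      Non-Null Count  Dtype
# ---  ------      --------------  -----
#  0   _id         4 non-null      object
#  1   data        3 non-null      object
print(type(table["_id"][0]))
print(type(table["data"][0]))
# <class 'bytes'>
# <class 'str'>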
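Because the values arrive as raw bytes and str, a common follow-up step is converting them into something easier to work with. The sketch below uses only the standard library; the helper names object_id_hex and decimal_from_string are hypothetical, the sample values are made up, and it assumes missing values arrive as None.

```python
from decimal import Decimal
from typing import Optional


def object_id_hex(raw: Optional[bytes]) -> Optional[str]:
    """Return the 24-character hex form of a 12-byte ObjectId value."""
    if raw is None:
        return None
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    return raw.hex()


def decimal_from_string(text: Optional[str]) -> Optional[Decimal]:
    """Parse a Decimal128 string value, passing missing values through."""
    if text is None:
        return None
    return Decimal(text)


# Sample values shaped like the cells shown above (hypothetical data).
ids = [bytes.fromhex("63c003bf0a1d5281d33b0afd")]
data = ["0.1", "1.0", "0.00001", None]

hex_ids = [object_id_hex(value) for value in ids]
decimals = [decimal_from_string(value) for value in data]
```

With a DataFrame like the one above, the same helpers could be applied column-wise, for example table["_id"].map(object_id_hex). If a real ObjectId is needed rather than its hex string, bson.ObjectId from PyMongo accepts the 12 raw bytes directly.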