Data Types#
PyMongoArrow supports a majority of the BSON types. As Arrow and Polars provide first-class support for Lists and Structs, this includes Embedded arrays and documents.
Support for additional types will be added in subsequent releases.
Note
For more information about BSON types, see the BSON specification.
Note
Decimal128
types are only supported on little-endian systems.
On big-endian systems, null
will be used.
BSON Type |
Type Identifiers |
---|---|
String |
|
Embedded document |
|
Embedded array |
An instance of |
ObjectId |
|
Decimal128 |
|
Boolean |
an instance of |
64-bit binary floating point |
|
32-bit integer |
an instance of |
64-bit integer |
|
UTC datetime |
an instance of |
Binary data |
|
JavaScript code |
|
Type identifiers can be used to specify that a field is of a certain type
during pymongoarrow.api.Schema
declaration. For example, if your data
has fields ‘f1’ and ‘f2’ bearing types 32-bit integer and UTC datetime
respectively, and ‘_id’ that is an ObjectId
, your schema can be defined as:
schema = Schema({
'_id': ObjectId,
'f1': pyarrow.int32(),
'f2': pyarrow.timestamp('ms')
})
Unsupported data types in a schema cause a ValueError
identifying the
field and its data type.
Embedded Array Considerations#
The schema used for an Embedded Array must use the pyarrow.list_()
type,
so that the type of the array elements can be specified. For example,
Extension Types#
The ObjectId
, Decimal128
, Binary data
and JavaScript code
are implemented as extension types for PyArrow and Pandas.
For arrow tables, values of these types will have the appropriate
pymongoarrow
extension type (e.g. pymongoarrow.types.ObjectIdType
). The appropriate bson
Python object can be obtained using the .as_py()
method,
or by calling .to_pylist()
on the table.
>>> from pymongo import MongoClient
>>> from bson import ObjectId
>>> from pymongoarrow.api import find_arrow_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> table = find_arrow_all(coll, {})
>>> table
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
foo: int32
----
_id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]]
foo: [[100,200]]
>>> table["_id"][0]
<pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')>
>>> table["_id"][0].as_py()
ObjectId('64408b0d5ac9e208af220142')
>>> table.to_pylist()
[{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100},
{'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}]
When converting to pandas, the extension type columns will have an appropriate
pymongoarrow
extension type (e.g. pymongoarrow.pandas_types.PandasDecimal128
). The value of the element in the
dataframe will be the appropriate bson
type.
>>> from pymongo import MongoClient
>>> from bson import Decimal128
>>> from pymongoarrow.api import find_pandas_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> df = find_pandas_all(coll, {})
>>> df
_id foo
0 64408bf65ac9e208af220144 0.1
1 64408bf65ac9e208af220145 0.1
>>> df["foo"].dtype
<pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90>
>>> df["foo"][0]
Decimal128('0.1')
>>> df["_id"][0]
ObjectId('64408bf65ac9e208af220144')
As of this writing, Polars does not support Extension Types.
Null Values and Conversion to Pandas DataFrames#
In Arrow (and Polars), all Arrays are nullable. Pandas has experimental nullable data types as, e.g., “Int64” (note the capital “I”). You can instruct Arrow to create a pandas DataFrame using nullable dtypes with the code below (taken from here)
>>> dtype_mapping = {
... pa.int8(): pd.Int8Dtype(),
... pa.int16(): pd.Int16Dtype(),
... pa.int32(): pd.Int32Dtype(),
... pa.int64(): pd.Int64Dtype(),
... pa.uint8(): pd.UInt8Dtype(),
... pa.uint16(): pd.UInt16Dtype(),
... pa.uint32(): pd.UInt32Dtype(),
... pa.uint64(): pd.UInt64Dtype(),
... pa.bool_(): pd.BooleanDtype(),
... pa.float32(): pd.Float32Dtype(),
... pa.float64(): pd.Float64Dtype(),
... pa.string(): pd.StringDtype(),
... }
... df = arrow_table.to_pandas(
... types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True
... )
... del arrow_table
Defining a conversion for pa.string()
in addition converts Arrow strings to NumPy strings, and not objects.
Nested Extension Types#
Pending ARROW-179
, extension types such as ObjectId
that appear in nested documents will not
be converted to the corresponding PyMongoArrow extension type, but will
instead have the raw Arrow type (FixedSizeBinaryType(fixed_size_binary[12])
).
These values can either be consumed as-is or converted individually to the
desired extension type, e.g. _id = out['nested'][0]['_id'].cast(ObjectIdType())
.