Problems with columnar data¶
If we read JSON into Polars, it might have the wrong schema.
Columnar dtype does not permit union data types¶
For example, if we have a union dtype (a concept Polars doesn't support), then we will either silently lose data or fail to parse completely (usually the latter).
Take this data with a union dtype for the strs column, either a string or a list of strings:
df = pl.DataFrame(
{
"json_data": [
'{"id": 1, "nums": [0], "strs": "a"}',
'{"id": 2, "nums": [1, 2], "strs": ["b"]}',
'{"id": 3, "nums": [3, 4, 5], "strs": ["c", "d", "e"]}',
]
}
)
While we can store it as a string dtype json_data column, a naive implementation of schema
parsing will decide at the first string that we have a string column:
polars_schema = df.genson.infer_polars_schema("json_data")
print(polars_schema)
print("\nSchema details:")
for name, dtype in polars_schema.items():
print(f" {name}: {dtype}")
Inferred Polars Schema:
Schema({'id': Int64, 'nums': List(Int64), 'strs': String})
Schema details:
id: Int64
nums: List(Int64)
strs: String
The JSON schema column is more expressive, and has an anyOf concept to display union dtypes,
because a JSON path is not a 'column' and does not have to have a single data type:
Inferred JSON Schema:
{
"$schema": "http://json-schema.org/schema#",
"properties": {
"id": {
"type": "integer"
},
"nums": {
"type": "array",
"items": {
"type": "integer"
}
},
"strs": {
"anyOf": [
{
"type": "array",
"items": {
"type": "string"
}
},
{
"type": "string"
}
]
}
},
"required": [
"id",
"nums",
"strs"
],
"type": "object"
}
If we were to write that out as if it were Python annotations, it'd be