azure-kusto-python
Data Corruption in ingest_from_dataframe() for Float Values Above ~1.845 Billion
Description
QueuedIngestClient.ingest_from_dataframe() corrupts large float values (above roughly 1.845 billion in this reproduction) when ingesting with CSV format. The same data ingested via ingest_from_file() is stored correctly, which points to the DataFrame serialization path as the source of the bug.
Environment
- SDK Version: azure-kusto-ingest==5.0.5
- Python Version: 3.11
- Data Format: CSV (also reproduces with PARQUET)
Bug Behavior
Expected vs Actual Results
| Original Value | Expected in Kusto | Actual in Kusto | Corruption |
|---|---|---|---|
| 1845541179.654398 | 1,845,541,179.654398 | 866,772.283... | ✗ Corrupted |
| 1843594418.8130903 | 1,843,594,418.813... | Correct | ✓ Works |
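As a quick sanity check on the table above, the sketch below uses only the Python standard library to confirm that both sample values are ordinary double-precision floats that survive a text round trip. This suggests the ~1.845 billion threshold is not a floating-point precision limit. It checks local behavior only and does not call the SDK or Kusto; the variable names are illustrative.

```python
# Both sample values from the table, as Python floats (IEEE 754 double precision).
corrupted_in_kusto = 1845541179.654398
correct_in_kusto = 1843594418.8130903

for value in (corrupted_in_kusto, correct_in_kusto):
    text = repr(value)            # string form that Python guarantees to round-trip
    assert float(text) == value   # parsing the text recovers the exact float
    assert abs(value) < 2 ** 53   # far below the point where doubles lose integer precision
    print(text)
```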
Minimal Reproduction Code
from datetime import datetime

import pandas as pd
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import IngestionProperties, QueuedIngestClient

df = pd.DataFrame([
    {
        "Group": "{}",
        "MetricName": "Total_Sum",
        "MetricValue": 1845541179.654398,  # Above threshold - Gets CORRUPTED
        "MetricTime": datetime(2025, 12, 4),
        "UpdatedTime": datetime(2025, 12, 4, 11, 46, 5, 620000),
        "CreatedTime": datetime(2025, 12, 4, 11, 46, 6, 126000),
    },
    {
        "Group": "{}",
        "MetricName": "Total_Sum",
        "MetricValue": 1843594418.8130903,  # Below threshold - Works CORRECTLY
        "MetricTime": datetime(2025, 12, 3),
        "UpdatedTime": datetime(2025, 12, 3, 11, 13, 37, 9000),
        "CreatedTime": datetime(2025, 12, 3, 11, 13, 6, 176000),
    },
])
# Setup Kusto connection
kcsb = KustoConnectionStringBuilder.with_interactive_login(
    "https://ingest-YOUR_CLUSTER.kusto.windows.net/"
)
# Method 1: Using ingest_from_dataframe() - CORRUPTS values above threshold
ingestion_props = IngestionProperties(
    database="YOUR_DATABASE",
    table="YOUR_TABLE",
    data_format=DataFormat.CSV,
)
with QueuedIngestClient(kcsb) as ingest_client:
    ingest_client.ingest_from_dataframe(df, ingestion_properties=ingestion_props)
# Result in Kusto after batch ingestion completes:
# Row 0: MetricValue = 866,772.283442... (CORRUPTED)
# Row 1: MetricValue = 1,843,594,418.8130903 (CORRECT)
# Method 2: Using ingest_from_file() - WORKS CORRECTLY for all values
import os
import tempfile

# Write the DataFrame to a plain-text temporary CSV file
with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False, newline='') as tmp_file:
    tmp_path = tmp_file.name
    df.to_csv(tmp_file, index=False)
# Ingest the file once it has been closed and flushed
with QueuedIngestClient(kcsb) as ingest_client:
    ingest_client.ingest_from_file(tmp_path, ingestion_properties=ingestion_props)
os.unlink(tmp_path)
The Bug: GZIP + Binary Mode
The SDK writes CSV to a gzip-compressed binary stream ("wb" mode), but pandas to_csv() expects a text stream when writing CSV.
What Happens:
- gzip.open(temp_file_path, "wb") opens the file in binary write mode
- df.to_csv(temp_file, ...) tries to write text to the binary stream
- Python does automatic encoding/conversion
- Large float values get corrupted during this text→binary→gzip conversion
- Values above ~1.845B trigger precision loss in the encoding step
The bug is specifically in:
- File: azure-kusto-ingest/azure/kusto/ingest/base_ingest_client.py
- Line: ~140-145
- Issue: Writing CSV to gzip binary stream corrupts large float values
- Fix: Use gzip.open(temp_file_path, "wt") instead of "wb"
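The proposed change can be tried in isolation. The following is a minimal standard-library sketch, not the SDK's actual code: it writes CSV rows to a gzip file opened in text mode ("wt") and reads them back to check that each float survives. The helper names write_gzip_csv and read_gzip_csv are hypothetical, and the csv module stands in for pandas here.

```python
import csv
import gzip
import os
import tempfile


def write_gzip_csv(path, rows):
    """Write rows as CSV into a gzip file opened in text mode ("wt")."""
    # newline="" lets the csv module control line endings itself.
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_gzip_csv(path):
    """Read the rows back as lists of strings."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))


rows = [["Total_Sum", 1845541179.654398], ["Total_Sum", 1843594418.8130903]]

fd, tmp_path = tempfile.mkstemp(suffix=".csv.gz")
os.close(fd)  # gzip.open reopens the path, so release the raw descriptor first
try:
    write_gzip_csv(tmp_path, rows)
    round_tripped = read_gzip_csv(tmp_path)
finally:
    os.unlink(tmp_path)

# Each float should parse back to exactly the value that was written.
for (name, value), (name_back, value_back) in zip(rows, round_tripped):
    assert name_back == name
    assert float(value_back) == value
```

A local round trip like this only shows that text-mode gzip output preserves the values. Confirming the fix end to end still requires querying the ingested rows in Kusto.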
Output of pip freeze
annotated-doc==0.0.3
annotated-types==0.7.0
antlr4-python3-runtime==4.13.2
anyio==4.11.0
applicationinsights==0.11.10
argcomplete==3.5.3
azure-ai-agents==1.1.0
azure-ai-projects==1.0.0
azure-appconfiguration==1.7.2
azure-batch==15.0.0b1
azure-cli==2.81.0
azure-cli-core==2.81.0
azure-cli-telemetry==1.1.0
azure-common==1.1.28
azure-core==1.35.1
azure-cosmos==3.2.0
azure-data-tables==12.4.0
azure-datalake-store==1.0.1
azure-identity==1.25.0
azure-keyvault==4.2.0
azure-keyvault-administration==4.4.0
azure-keyvault-certificates==4.7.0
azure-keyvault-keys==4.11.0
azure-keyvault-secrets==4.7.0
azure-keyvault-securitydomain==1.0.0b1
azure-kusto-data==5.0.5
azure-kusto-ingest==5.0.5
azure-mgmt-advisor==9.0.0
azure-mgmt-apimanagement==4.0.0
azure-mgmt-appconfiguration==5.0.0
azure-mgmt-appcontainers==2.0.0
azure-mgmt-applicationinsights==1.0.0
azure-mgmt-authorization==5.0.0b1
azure-mgmt-batch==17.3.0
azure-mgmt-batchai==7.0.0b1
azure-mgmt-billing==6.0.0
azure-mgmt-botservice==2.0.0
azure-mgmt-cdn==12.0.0
azure-mgmt-cognitiveservices==14.1.0
azure-mgmt-compute==34.1.0
azure-mgmt-containerinstance==10.2.0b1
azure-mgmt-containerregistry==14.1.0b1
azure-mgmt-containerservice==40.1.0
azure-mgmt-core==1.6.0
azure-mgmt-cosmosdb==9.8.0
azure-mgmt-datalake-store==1.1.0b1
azure-mgmt-datamigration==10.0.0
azure-mgmt-eventgrid==10.2.0b2
azure-mgmt-eventhub==12.0.0b1
azure-mgmt-extendedlocation==1.0.0b2
azure-mgmt-hdinsight==9.1.0b2
azure-mgmt-imagebuilder==1.3.0
azure-mgmt-iotcentral==10.0.0b2
azure-mgmt-iothub==5.0.0b1
azure-mgmt-iothubprovisioningservices==1.1.0
azure-mgmt-keyvault==12.1.0
azure-mgmt-loganalytics==13.0.0b4
azure-mgmt-managementgroups==1.0.0
azure-mgmt-maps==2.0.0
azure-mgmt-marketplaceordering==1.1.0
azure-mgmt-media==9.0.0
azure-mgmt-monitor==7.0.0
azure-mgmt-msi==7.0.0
azure-mgmt-mysqlflexibleservers==1.0.0b3
azure-mgmt-netapp==10.1.0
azure-mgmt-policyinsights==1.1.0b4
azure-mgmt-postgresqlflexibleservers==1.1.0b2
azure-mgmt-privatedns==1.0.0
azure-mgmt-rdbms==10.2.0b17
azure-mgmt-recoveryservices==4.0.0
azure-mgmt-recoveryservicesbackup==9.2.0
azure-mgmt-redhatopenshift==1.5.0
azure-mgmt-redis==14.5.0
azure-mgmt-resource==23.3.0
azure-mgmt-resource-deployments==1.0.0b1
azure-mgmt-resource-deploymentscripts==1.0.0b1
azure-mgmt-resource-deploymentstacks==1.0.0b1
azure-mgmt-resource-templatespecs==1.0.0b1
azure-mgmt-search==9.2.0
azure-mgmt-security==6.0.0
azure-mgmt-servicebus==10.0.0b1
azure-mgmt-servicefabric==2.1.0
azure-mgmt-servicefabricmanagedclusters==2.1.0b1
azure-mgmt-servicelinker==1.2.0b3
azure-mgmt-signalr==2.0.0b2
azure-mgmt-sql==4.0.0b22
azure-mgmt-sqlvirtualmachine==1.0.0b5
azure-mgmt-storage==24.0.0
azure-mgmt-synapse==2.1.0b5
azure-mgmt-trafficmanager==1.0.0
azure-mgmt-web==9.0.0
azure-monitor-query==1.2.0
azure-multiapi-storage==1.6.0
azure-storage-blob==12.26.0
azure-storage-common==1.4.2
azure-storage-queue==12.13.0
azure-synapse-accesscontrol==0.5.0
azure-synapse-artifacts==0.21.0
azure-synapse-managedprivateendpoints==0.4.0
azure-synapse-spark==0.7.0
bcrypt==5.0.0
certifi==2025.8.3
cffi==2.0.0
chardet==5.2.0
charset-normalizer==3.4.3
click==8.3.0
colorama==0.4.6
cryptography==46.0.1
decorator==5.2.1
Deprecated==1.3.1
distro==1.9.0
fabric==3.2.2
fastapi==0.121.0
h11==0.16.0
humanfriendly==10.0
idna==3.10
ijson==3.4.0
invoke==2.2.1
isodate==0.7.2
javaproperties==0.5.2
jmespath==1.0.1
jsondiff==2.0.0
knack==0.11.0
microsoft-security-utilities-secret-masker==1.0.0b4
msal==1.34.0b1
msal-extensions==1.2.0
msrest==0.7.1
numpy==2.3.3
oauthlib==3.3.1
packaging==25.0
pandas==2.3.3
paramiko==3.5.1
pkginfo==1.12.1.2
portalocker==2.10.1
psutil==7.1.3
py-deviceid==0.1.1
pycomposefile==0.0.34
pycparser==2.23
pydantic==2.11.9
pydantic_core==2.33.2
PyGithub==1.59.1
Pygments==2.19.2
PyJWT==2.10.1
PyNaCl==1.5.0
pyOpenSSL==25.3.0
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.3
requests==2.32.5
requests-oauthlib==2.0.0
scp==0.13.6
semver==3.0.4
six==1.17.0
sniffio==1.3.1
sshtunnel==0.1.5
starlette==0.49.3
statsd==4.0.1
tabulate==0.9.0
tenacity==9.1.2
typing-inspection==0.4.1
typing_extensions==4.15.0
tzdata==2025.2
urllib3==2.5.0
uvicorn==0.37.0
websocket-client==1.3.3
wrapt==2.0.1
xmltodict==0.15.1
@AsafMah, are you a maintainer or owner of this repo? Could you take a look at this issue and the PR?
Fixed in 6.0.1. If the problem still persists, please comment.
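To check whether an environment already has the fixed release, a small sketch like the following can compare the installed version against 6.0.1. It assumes plain numeric version strings such as 5.0.5 (pre-release suffixes are not handled), and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (6, 0, 1)  # release named in the comment above


def parse_version(text):
    """Turn '5.0.5' into (5, 0, 5); assumes purely numeric components."""
    return tuple(int(part) for part in text.split("."))


def has_fix(version_text):
    """Return True if version_text is at or above the fixed release."""
    return parse_version(version_text) >= FIXED_IN


def installed_has_fix(package="azure-kusto-ingest"):
    """Return True/False for the installed package, or None if it is not installed."""
    try:
        return has_fix(version(package))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_has_fix())
```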