azure-kusto-python

Data Corruption in ingest_from_dataframe() for Float Values Above ~1.845 Billion

Open dylanw-oss opened this issue 1 month ago • 1 comments

Description

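QueuedIngestClient.ingest_from_dataframe() corrupts large float values when ingesting with CSV format. In the reproduction below, 1845541179.654398 is corrupted while 1843594418.8130903 is not, so the threshold appears to be around 1.845 billion. The same data ingested via ingest_from_file() works correctly, which suggests the bug is in the DataFrame serialization path.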

Environment

  • SDK Version: azure-kusto-ingest==5.0.5
  • Python Version: 3.11
  • Data Format: CSV (also reproduces with PARQUET)

Bug Behavior

Expected vs Actual Results

Original Value      | Expected in Kusto     | Actual in Kusto | Status
1845541179.654398   | 1,845,541,179.654398  | 866,772.283...  | ✗ Corrupted
1843594418.8130903  | 1,843,594,418.813...  | Correct         | ✓ Works

Minimal Reproduction Code

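from datetime import datetime

import pandas as pd
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import IngestionProperties, QueuedIngestClient

df = pd.DataFrame([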
    {
        "Group": "{}",
        "MetricName": "Total_Sum",
        "MetricValue": 1845541179.654398,  # Above threshold - Gets CORRUPTED
        "MetricTime": datetime(2025, 12, 4),
        "UpdatedTime": datetime(2025, 12, 4, 11, 46, 5, 620000),
        "CreatedTime": datetime(2025, 12, 4, 11, 46, 6, 126000),
    },
    {
        "Group": "{}",
        "MetricName": "Total_Sum",
        "MetricValue": 1843594418.8130903,  # Below threshold - Works CORRECTLY
        "MetricTime": datetime(2025, 12, 3),
        "UpdatedTime": datetime(2025, 12, 3, 11, 13, 37, 9000),
        "CreatedTime": datetime(2025, 12, 3, 11, 13, 6, 176000),
    },
])

# Setup Kusto connection
kcsb = KustoConnectionStringBuilder.with_interactive_login(
    "https://ingest-YOUR_CLUSTER.kusto.windows.net/"
)

# Method 1: Using ingest_from_dataframe() - CORRUPTS values above threshold
ingestion_props = IngestionProperties(
    database="YOUR_DATABASE",
    table="YOUR_TABLE",
    data_format=DataFormat.CSV,
)

with QueuedIngestClient(kcsb) as ingest_client:
    ingest_client.ingest_from_dataframe(df, ingestion_properties=ingestion_props)

# Result in Kusto after batch ingestion completes:
# Row 0: MetricValue = 866,772.283442... (CORRUPTED)
# Row 1: MetricValue = 1,843,594,418.8130903 (CORRECT)

# Method 2: Using ingest_from_file() - WORKS CORRECTLY for all values
import tempfile
with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False, newline='') as tmp_file:
    tmp_path = tmp_file.name
    df.to_csv(tmp_file, index=False)

with QueuedIngestClient(kcsb) as ingest_client:
    ingest_client.ingest_from_file(tmp_path, ingestion_properties=ingestion_props)

import os
os.unlink(tmp_path)
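
Until a fixed SDK version is installed, Method 2 can be wrapped in a small helper so callers keep a DataFrame-based call. This is only a sketch: ingest_dataframe_via_file is a hypothetical name, not part of the SDK, and it assumes the client exposes ingest_from_file() exactly as used above. It also cleans up the temp file even if ingestion raises.

```python
import os
import tempfile


def ingest_dataframe_via_file(ingest_client, df, ingestion_properties):
    """Hypothetical workaround: write df to a plain CSV file, then ingest that file.

    Mirrors Method 2 from the reproduction; the helper itself is not an SDK API.
    """
    # Write the CSV as uncompressed text, the path reported to work correctly.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", delete=False, newline=""
    ) as tmp_file:
        tmp_path = tmp_file.name
        df.to_csv(tmp_file, index=False)

    try:
        return ingest_client.ingest_from_file(
            tmp_path, ingestion_properties=ingestion_properties
        )
    finally:
        # Remove the temp file whether or not ingestion succeeded.
        os.unlink(tmp_path)
```

Usage would be ingest_dataframe_via_file(ingest_client, df, ingestion_props) inside the same "with QueuedIngestClient(kcsb) as ingest_client:" block shown above.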

The Bug: GZIP + Binary Mode

The SDK writes CSV to a gzip-compressed binary stream ("wb" mode), but pandas to_csv() produces text when writing CSV.

What Happens:

  1. gzip.open(temp_file_path, "wb") opens the temp file in binary write mode
  2. df.to_csv(temp_file, ...) then writes CSV text to that binary stream
  3. The text is implicitly encoded to bytes on the way in
  4. Large float values appear to be corrupted somewhere in this text→binary→gzip conversion
  5. In the reproduction, the value above ~1.845B is corrupted while the value just below it is not

The bug is specifically in:

  • File: azure-kusto-ingest/azure/kusto/ingest/base_ingest_client.py
  • Lines: ~140-145
  • Issue: Writing CSV to gzip binary stream corrupts large float values
  • Suggested fix: Use gzip.open(temp_file_path, "wt") instead of "wb"
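
The gzip hypothesis can be probed in isolation, without the SDK or pandas. The sketch below uses only the standard library. roundtrip_float_through_gzip is a hypothetical helper, and the "wb" branch approximates a text writer on top of a binary gzip handle by wrapping it in io.TextIOWrapper. That wrapping is an assumption about how text reaches the stream, not a copy of the SDK's code. If both modes return the input unchanged, the text→bytes→gzip step alone does not account for the corruption and the cause should be sought elsewhere in the pipeline.

```python
import csv
import gzip
import io
import os
import tempfile


def roundtrip_float_through_gzip(value, mode):
    """Write one float as a CSV field to a gzip file in `mode`, then read it back.

    mode is "wt" (gzip text mode) or "wb" (binary handle wrapped for text).
    """
    fd, path = tempfile.mkstemp(suffix=".csv.gz")
    os.close(fd)
    try:
        if mode == "wt":
            with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
                csv.writer(f).writerow([repr(value)])
        else:
            # Assumption: text is encoded by a wrapper around the binary stream.
            with gzip.open(path, "wb") as raw, io.TextIOWrapper(
                raw, encoding="utf-8", newline=""
            ) as text:
                csv.writer(text).writerow([repr(value)])

        # Decompress and parse the single field back into a float.
        with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
            return float(next(csv.reader(f))[0])
    finally:
        os.unlink(path)


if __name__ == "__main__":
    for v in (1845541179.654398, 1843594418.8130903):
        for m in ("wt", "wb"):
            print(v, m, roundtrip_float_through_gzip(v, m) == v)
```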

Output of pip freeze

annotated-doc==0.0.3
annotated-types==0.7.0
antlr4-python3-runtime==4.13.2
anyio==4.11.0
applicationinsights==0.11.10
argcomplete==3.5.3
azure-ai-agents==1.1.0
azure-ai-projects==1.0.0
azure-appconfiguration==1.7.2
azure-batch==15.0.0b1
azure-cli==2.81.0
azure-cli-core==2.81.0
azure-cli-telemetry==1.1.0
azure-common==1.1.28
azure-core==1.35.1
azure-cosmos==3.2.0
azure-data-tables==12.4.0
azure-datalake-store==1.0.1
azure-identity==1.25.0
azure-keyvault==4.2.0
azure-keyvault-administration==4.4.0
azure-keyvault-certificates==4.7.0
azure-keyvault-keys==4.11.0
azure-keyvault-secrets==4.7.0
azure-keyvault-securitydomain==1.0.0b1
azure-kusto-data==5.0.5
azure-kusto-ingest==5.0.5
azure-mgmt-advisor==9.0.0
azure-mgmt-apimanagement==4.0.0
azure-mgmt-appconfiguration==5.0.0
azure-mgmt-appcontainers==2.0.0
azure-mgmt-applicationinsights==1.0.0
azure-mgmt-authorization==5.0.0b1
azure-mgmt-batch==17.3.0
azure-mgmt-batchai==7.0.0b1
azure-mgmt-billing==6.0.0
azure-mgmt-botservice==2.0.0
azure-mgmt-cdn==12.0.0
azure-mgmt-cognitiveservices==14.1.0
azure-mgmt-compute==34.1.0
azure-mgmt-containerinstance==10.2.0b1
azure-mgmt-containerregistry==14.1.0b1
azure-mgmt-containerservice==40.1.0
azure-mgmt-core==1.6.0
azure-mgmt-cosmosdb==9.8.0
azure-mgmt-datalake-store==1.1.0b1
azure-mgmt-datamigration==10.0.0
azure-mgmt-eventgrid==10.2.0b2
azure-mgmt-eventhub==12.0.0b1
azure-mgmt-extendedlocation==1.0.0b2
azure-mgmt-hdinsight==9.1.0b2
azure-mgmt-imagebuilder==1.3.0
azure-mgmt-iotcentral==10.0.0b2
azure-mgmt-iothub==5.0.0b1
azure-mgmt-iothubprovisioningservices==1.1.0
azure-mgmt-keyvault==12.1.0
azure-mgmt-loganalytics==13.0.0b4
azure-mgmt-managementgroups==1.0.0
azure-mgmt-maps==2.0.0
azure-mgmt-marketplaceordering==1.1.0
azure-mgmt-media==9.0.0
azure-mgmt-monitor==7.0.0
azure-mgmt-msi==7.0.0
azure-mgmt-mysqlflexibleservers==1.0.0b3
azure-mgmt-netapp==10.1.0
azure-mgmt-policyinsights==1.1.0b4
azure-mgmt-postgresqlflexibleservers==1.1.0b2
azure-mgmt-privatedns==1.0.0
azure-mgmt-rdbms==10.2.0b17
azure-mgmt-recoveryservices==4.0.0
azure-mgmt-recoveryservicesbackup==9.2.0
azure-mgmt-redhatopenshift==1.5.0
azure-mgmt-redis==14.5.0
azure-mgmt-resource==23.3.0
azure-mgmt-resource-deployments==1.0.0b1
azure-mgmt-resource-deploymentscripts==1.0.0b1
azure-mgmt-resource-deploymentstacks==1.0.0b1
azure-mgmt-resource-templatespecs==1.0.0b1
azure-mgmt-search==9.2.0
azure-mgmt-security==6.0.0
azure-mgmt-servicebus==10.0.0b1
azure-mgmt-servicefabric==2.1.0
azure-mgmt-servicefabricmanagedclusters==2.1.0b1
azure-mgmt-servicelinker==1.2.0b3
azure-mgmt-signalr==2.0.0b2
azure-mgmt-sql==4.0.0b22
azure-mgmt-sqlvirtualmachine==1.0.0b5
azure-mgmt-storage==24.0.0
azure-mgmt-synapse==2.1.0b5
azure-mgmt-trafficmanager==1.0.0
azure-mgmt-web==9.0.0
azure-monitor-query==1.2.0
azure-multiapi-storage==1.6.0
azure-storage-blob==12.26.0
azure-storage-common==1.4.2
azure-storage-queue==12.13.0
azure-synapse-accesscontrol==0.5.0
azure-synapse-artifacts==0.21.0
azure-synapse-managedprivateendpoints==0.4.0
azure-synapse-spark==0.7.0
bcrypt==5.0.0
certifi==2025.8.3
cffi==2.0.0
chardet==5.2.0
charset-normalizer==3.4.3
click==8.3.0
colorama==0.4.6
cryptography==46.0.1
decorator==5.2.1
Deprecated==1.3.1
distro==1.9.0
fabric==3.2.2
fastapi==0.121.0
h11==0.16.0
humanfriendly==10.0
idna==3.10
ijson==3.4.0
invoke==2.2.1
isodate==0.7.2
javaproperties==0.5.2
jmespath==1.0.1
jsondiff==2.0.0
knack==0.11.0
microsoft-security-utilities-secret-masker==1.0.0b4
msal==1.34.0b1
msal-extensions==1.2.0
msrest==0.7.1
numpy==2.3.3
oauthlib==3.3.1
packaging==25.0
pandas==2.3.3
paramiko==3.5.1
pkginfo==1.12.1.2
portalocker==2.10.1
psutil==7.1.3
py-deviceid==0.1.1
pycomposefile==0.0.34
pycparser==2.23
pydantic==2.11.9
pydantic_core==2.33.2
PyGithub==1.59.1
Pygments==2.19.2
PyJWT==2.10.1
PyNaCl==1.5.0
pyOpenSSL==25.3.0
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.3
requests==2.32.5
requests-oauthlib==2.0.0
scp==0.13.6
semver==3.0.4
six==1.17.0
sniffio==1.3.1
sshtunnel==0.1.5
starlette==0.49.3
statsd==4.0.1
tabulate==0.9.0
tenacity==9.1.2
typing-inspection==0.4.1
typing_extensions==4.15.0
tzdata==2025.2
urllib3==2.5.0
uvicorn==0.37.0
websocket-client==1.3.3
wrapt==2.0.1
xmltodict==0.15.1

dylanw-oss, Dec 09 '25 20:12

@AsafMah, are you a maintainer or owner of this repo? Could you take a look at this issue and the PR?

dylanw-oss, Dec 11 '25 23:12

Fixed in 6.0.1. If the problem persists, please comment.

AsafMah, Dec 28 '25 07:12