Parsing and conversion to `DataFrame` performance enhancement
This PR introduces substantial performance improvements and architectural changes to KML.jl. Due to the extensive nature of these modifications, I understand if you prefer that I create a separate public fork instead. I'm submitting this PR to give you the option to incorporate these changes, if it's not too much.
Performance improvements
Benchmarks comparing main branch vs this PR on Windows 11, Julia 1.11.5:
| Operation | File Size | Main Branch | This PR | Improvement |
|---|---|---|---|---|
| KMLFile reading | 100 placemarks | 73.35 ms | 5.09 ms | 14.4x faster |
| KMLFile reading | 20,000 placemarks | 23.03 s | 1.05 s | 21.9x faster |
| DataFrame extraction | 100 placemarks | 114.9 ms | 0.52 ms | 221x faster |
| DataFrame extraction | 20,000 placemarks | 19.51 s | 81.4 ms | 240x faster |
| Memory usage | All sizes | - | - | 82% reduction |
I used the KML test file produced with the function below:
function create_test_kml(n_placemarks::Int; filename="test.kml")
    open(filename, "w") do io
        println(io, """<?xml version="1.0" encoding="UTF-8"?>
        <kml xmlns="http://www.opengis.net/kml/2.2">
        <Document>
        <name>Test Document</name>
        <Folder>
        <name>Test Folder</name>""")
        for i in 1:n_placemarks
            lat = -90 + 180 * rand()
            lon = -180 + 360 * rand()
            println(io, """ <Placemark>
            <name>Place $i</name>
            <description>Description for place $i</description>
            <Point>
            <coordinates>$lon,$lat,0</coordinates>
            </Point>
            </Placemark>""")
        end
        println(io, """ </Folder>
        </Document>
        </kml>""")
    end
end
I also benchmarked against ArchGDAL.jl for converting a layer from a KML file to a DataFrame, and this PR is between 1.5 and 4 times faster. Here are some KML files found in the wild that I used for the benchmark:
- https://www.dec.ny.gov/data/der/enzones/enzone2022.kmz
- https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/atoms/files/WRS-2_bound_world_0.kml
- https://earthquake.usgs.gov/static/lfs/nshm/qfaults/qfaults.kmz
Architectural changes
1. Module restructuring
On the way to enhancing performance, I reorganized the codebase into the following modules:
- `types.jl` - Type definitions with thread-safe tag mapping
- `xml_parsing.jl` - XML to KML object parsing
- `coordinates.jl` - Automa-based coordinate parsing
- `field_conversion.jl` - Type-stable field conversions
- `tables.jl` - Tables.jl interface implementation
- `time_parsing.jl` - ISO 8601 time parsing with Automa
- `html_entities.jl` - HTML entity decoding
- `layers.jl` - Layer navigation and selection
- Additional utility modules
2. New LazyKMLFile type
# Loads file without materializing KML objects
lazy_kml = read("file.kml", LazyKMLFile)
- Caches layer information on first access
- Optimized for DataFrame extraction workflows
3. PlacemarkTable for efficient data extraction
# Direct path to DataFrame
df = DataFrame(PlacemarkTable("file.kml"))
# Or with the extension loaded
df = DataFrame("file.kml")
- Streaming placemark extraction
- Minimal object materialization
- Tables.jl compliant interface
4. Type-stable parsing implementation
- Pre-compiled tag-to-type mappings
- Thread-safe symbol caches
- Zero-allocation coordinate parsing
- Optimized field assignment with type inference
5. Extensions for optional dependencies
- `KMLDataFramesExt` - DataFrame integration
- `KMLGeoInterfaceExt` - GeoInterface.jl support
- `KMLMakieExt` - Makie.jl plotting recipes
- `KMLZipArchivesExt` - KMZ file support
Key implementation details
Coordinate parsing
Replaced regex-based parsing with Automa.jl FSM:
# Before: Multiple regex passes
# After: Single-pass FSM with pre-allocated output
parse_coordinates_automa("0,0 1,1") # 10x faster
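To illustrate the single-pass idea, here is a minimal sketch in plain Julia. It is a hand-written scanner standing in for the Automa-generated FSM, not the code from this PR; the name `parse_coordinates_sketch` is hypothetical, and it assumes tuples of the form `lon,lat[,alt]` separated by whitespace, with a missing altitude stored as `0.0`.

```julia
# Hypothetical helper for illustration only; not part of KML.jl or this PR.
# Scans the string once, left to right, and collects (lon, lat, alt) tuples.
function parse_coordinates_sketch(s::AbstractString)
    coords = NTuple{3,Float64}[]
    # Upper bound on the number of tuples: one per whitespace-separated chunk.
    sizehint!(coords, count(isspace, s) + 1)
    vals = [0.0, 0.0, 0.0]  # scratch buffer reused for every tuple
    n = 0                   # values read so far in the current tuple
    start = 0               # start index of the number being scanned (0 = none)
    stop = ncodeunits(s)
    i = firstindex(s)
    while i <= stop + 1
        # A virtual trailing space flushes the last number and tuple.
        c = i <= stop ? s[i] : ' '
        if c == ',' || isspace(c)
            if start != 0   # a number just ended right before index i
                n < 3 || throw(ArgumentError("more than 3 values in a tuple"))
                n += 1
                vals[n] = parse(Float64, SubString(s, start, prevind(s, i)))
                start = 0
            end
            if isspace(c) && n > 0   # whitespace closes the current tuple
                n >= 2 || throw(ArgumentError("a tuple needs at least lon,lat"))
                push!(coords, (vals[1], vals[2], n == 3 ? vals[3] : 0.0))
                n = 0
            end
        elseif start == 0
            start = i       # first character of a new number
        end
        i = i <= stop ? nextind(s, i) : i + 1
    end
    return coords
end

coords = parse_coordinates_sketch("0,0 1,1")
```

The sketch still calls `parse(Float64, ...)` on substrings, so it makes no claim about allocation counts; it only shows how one scan can replace several regex passes.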
Memory optimizations
- Removed intermediate string allocations
- Pre-sized collections based on file structure
- Lazy evaluation of nested elements
- Efficient text extraction without concatenation
Thread safety
- Immutable tag/type caches created at module initialization
- ReentrantLock for LazyKMLFile cache access
- No global mutable state
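As a rough sketch of the lock-guarded cache pattern mentioned above (not the actual `LazyKMLFile` implementation), the following uses a `ReentrantLock` so that a cached value is computed only once even when several tasks ask for it concurrently. The names `LazyCacheSketch`, `cached!`, and `compute_layers` are hypothetical.

```julia
# Hypothetical stand-in for a lazily populated cache; not the PR's LazyKMLFile.
struct LazyCacheSketch
    lk::ReentrantLock
    cache::Dict{Symbol,Any}
end
LazyCacheSketch() = LazyCacheSketch(ReentrantLock(), Dict{Symbol,Any}())

# Return the cached value for `key`, computing it with `f()` on first access.
# The lock is held while `f` runs, so concurrent callers never compute twice.
function cached!(f, c::LazyCacheSketch, key::Symbol)
    lock(c.lk) do
        get!(f, c.cache, key)
    end
end

# Usage: several tasks request the same entry; the expensive step runs once.
c = LazyCacheSketch()
calls = Threads.Atomic{Int}(0)
compute_layers() = (Threads.atomic_add!(calls, 1); ["Test Folder"])
tasks = [Threads.@spawn cached!(compute_layers, c, :layers) for _ in 1:8]
results = fetch.(tasks)
```

Holding the lock during the computation trades some parallelism for simplicity; that is an assumption of this sketch, not a statement about how the PR does it.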
Breaking changes
None. All existing APIs maintained with identical behavior.
New dependencies
- Parsing & performance: Automa (coordinate/time parsing), Parsers (number parsing), StaticArrays (coordinate storage)
- HTML entity handling: Scratch (caching), JSON3 (parsing entity definitions), Downloads (fetching definitions), Serialization (cache storage)
- Data handling: Tables (Tables.jl interface), TimeZones & Dates (temporal data)
- User interface: REPL (interactive layer selection)
The package also moves GeoInterface from a direct dependency to a weak dependency (extension), along with new weak dependencies for DataFrames, Makie, and ZipArchives support.
Testing
All existing tests pass after adjusting the test code. However, I have added only a minimal set of tests for the new code, pending your feedback on this PR.
Alternative approach
If these changes are too extensive for the main package, I'm happy to maintain them as a separate public fork (e.g., FastKML.jl or similar) to avoid fragmenting the ecosystem while providing an alternative for performance-critical applications.
Whoa!
Lots to digest here! My gut reaction is that this is too much to take on in this package. My goal was to make KML.jl as lightweight as possible so that it required little or no maintenance. It was built to satisfy a specific need for a specific customer.
That being said, I've only glanced at the changes. If we merge this, would you be able to unofficially commit to supporting any issues that pop up? I'd be happy to add you as a maintainer.
Of course, after such a significant change, I would handle the issues that will inevitably arise.
Cool, can you get the CI to go green?
OK, I'll make a PR to update the CI, which is failing for other reasons:
Current runner version: '2.325.0'
Runner Image Provisioner
Operating System
Runner Image
GITHUB_TOKEN Permissions
Secret source: None
Prepare workflow directory
Prepare all required actions
Getting action download info
Error: This request has been automatically failed because it uses a deprecated version of `actions/cache: v1`. Please update your workflow to use v3/v4 of actions/cache to avoid interruptions. Learn more: https://github.blog/changelog/2024-12-05-notice-of-upcoming-releases-and-breaking-changes-for-github-actions/#actions-cache-v1-v2-and-actions-toolkit-cache-package-closing-down
Still have to loosen compat, to match CI script
Good. CI is green. @joshday: ready for review
Looking over the PR, I think your stuff belongs as a new package. I think there's room in the Julia ecosystem for both a lightweight KML.jl as well as a feature-full FastKML.jl (or whatever you call it). I'd be happy to refer to your fork in the README.