sqlite_scanner icon indicating copy to clipboard operation
sqlite_scanner copied to clipboard

feature: Add Remote (HTTP(S)) Support for SQLite Databases

Open ak2k opened this issue 8 months ago • 5 comments

Remote (HTTP(S)) Support for SQLite Databases

This PR adds support for querying remote SQLite databases over HTTP/HTTPS to the SQLite scanner extension, leveraging DuckDB's CachingFileSystem.

Features

  • Query SQLite databases directly from HTTP/HTTPS URLs without downloading the entire file
  • Seamless integration with sqlite_scan() and ATTACH syntax
  • Adaptive read-ahead optimization (1MB-128MB) reduces network round trips

Implementation

  • SQLiteDuckDBCacheVFS: Custom SQLite VFS that delegates file I/O to DuckDB's CachingFileSystem
    • Integrates with DuckDB's external file cache for efficient block caching
  • DuckDBCachedFile: Wrapper around DuckDB's CachingFileHandle with adaptive read-ahead
    • Dynamically adjusts read-ahead size based on access patterns
  • Integration:
    • Leverages DuckDB's httpfs extension for HTTP client functionality

Usage

-- Load required extensions
LOAD sqlite_scanner;
INSTALL httpfs;
LOAD httpfs;

-- Query remote SQLite database
SELECT * FROM sqlite_scan('https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite', 'Artist');

-- Attach as a database  
ATTACH 'https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite' AS remote_db (TYPE sqlite);
SELECT * FROM remote_db.Album;

Related Issues: #39, #141

ak2k avatar Jun 05 '25 07:06 ak2k

[Updated]

@Maxxen Thanks for the excellent suggestions! I've re-implemented based on your feedback.

Integrated with DuckDB's filesystem and caching infrastructure

  • Now wraps DuckDB's FileSystem API for all remote file access (HTTP/HTTPS via httpfs)
  • Uses DuckDB's CachingFileSystem which provides intelligent block caching
  • Cache is managed by DuckDB's CachingFileSystem infrastructure

Simplified concurrency model

  • Removed per-file mutexes - file operations are lock-free
  • Only uses mutex for VFS registry operations (registration/unregistration)
  • Relies on CachingFileSystem's internal synchronization for thread safety

Adaptive read-ahead optimization

  • Implements dynamic read-ahead (1MB-128MB) to reduce network round trips
  • Adjusts read size based on sequential vs random access patterns
  • Works on top of CachingFileSystem's block caching

The implementation is much cleaner now - it properly delegates to DuckDB's existing infrastructure rather than reimplementing caching logic. Thanks again for the great foundation and suggestions!

ak2k avatar Jun 05 '25 09:06 ak2k

This is very cool! Does it work with hosted SQLite solutions like Turso?

Alex-Monahan avatar Jun 05 '25 14:06 Alex-Monahan

This is very cool! Does it work with hosted SQLite solutions like Turso?

Thank you @Alex-Monahan! I wasn't familiar with Turso, but it seems to implement a custom wire format rather than providing HTTP-like access to the stored file format, so regrettably, it likely wouldn't.

ak2k avatar Jun 06 '25 12:06 ak2k

I've reorganized this PR into 5 logical commits to attempt to make the review process easier. Each commit is self-contained, compiles, and passes all tests:

Commit 1: Add ClientContext parameter to SQLiteDB::Open methods

  • Minimal API change to support future extensibility
  • Required for subsequent commits that need access to DuckDB's context

Commit 2: Lazy initialization when opening remote SQLite databases

  • Implements lazy initialization to prevent deadlocks during remote file access
  • Includes necessary validation logic for busy_timeout

Commit 3: Add SQLite VFS implementation for remote file support

  • Core VFS implementation that integrates with DuckDB's CachingFileSystem
  • Handles HTTP error mapping and adaptive read-ahead optimization
  • No user-facing changes yet

Commit 4: Enable HTTP/HTTPS SQLite database access via sqlite_scan

  • Activates HTTP/HTTPS support for sqlite_scan() function
  • Includes 7 working tests demonstrating basic functionality
  • ATTACH support intentionally deferred to next commit

Commit 5: Add HTTP ATTACH support and improve error handling

  • Completes the implementation with ATTACH functionality
  • Improves error handling throughout the PR
  • Adds remaining tests for complex queries

Let me know if you'd like me to adjust the commit structure or if you have any questions about specific changes or feedback.

ak2k avatar Jun 27 '25 23:06 ak2k

Curious if this is blocked by something? Would love to help get this merged if any additional contributions are necessary

ngalluzzo avatar Oct 20 '25 22:10 ngalluzzo