Current status of custom FDW?
What's the current status of creating a FDW? I know you were working on it when we chatted over a year ago and had made some progress but still had a way to go.
Is it something where subgoals would make sense? E.g., supporting a read-only FDW first, then adding support for 'insert' and 'update'?
Is there anything that someone could help with?
For people with no idea what I'm referring to - a "FDW" = "Foreign Data Wrapper". In my side-project I have user-defined types that can hold cryptographic material like digital certificates (aka SSL certs) and encrypted private keys. They need to be loaded somehow - I can create user-defined functions that read them from local files.
However I think a cleaner approach would be creating a custom FDW for them - you would specify the location of the file and its top-level encryption key when you define the FDW, but could then treat it like any other database table when you're looking for encryption keys. This is somewhat more secure than keeping the values within the database itself since a database dump won't include them, and the top-level encryption key used with the FDW should have been provided by something outside of the database itself. (An attacker could still get the keys from a filesystem dump but the top-level encryption key won't be included in that dump.)
There's also a modern twist since many sites will now prefer to use something like HashiCorp Vault instead of deploying files to the server instances. In this case the FDW would make a call to an external REST service rather than reading a local file, but the resulting 'table(s)' should look the same.
Here are the required callbacks: https://www.postgresql.org/docs/current/fdw-callbacks.html
What I remember from looking at it last is that the callbacks involve a bunch of native PG structs that would somehow need to be made available for manipulation in Java.
That makes it something likely to be much more pleasant to implement on top of the work currently in progress in PR #399. I don't know if you've had a look at that yet.
Okay. I have the attention span of a squirrel (thanks ADHD!) but I can probably hit a few milestones below:
- dummy package - you can define the FDW but nothing else. (No jar yet.)
- add: you can provide a jar in the FDW definition but nothing beyond a 'success' notice.
- add: additional options available to the java code.
- add: you can retrieve the metadata.
- add: you can retrieve values.
Advanced read-only:
You can push queries onto the FDW. By default PostgreSQL will load everything and then perform any filtering itself. If you can push (supported) queries onto the FDW then it can perform that filtering itself.
I consider this more important than adding read-write since my primary objective is to support extraction of cryptographic credentials from standard external files and then immediately putting them into a corresponding UDT. For obvious reasons I don't want the FDW to return everything.
(In fact an unrestricted query should probably result in an error - convention be damned!)
For most developers the main concern will be efficiency. There's no point in burning cycles and network bandwidth to provide data that will be immediately dropped.
This'll probably need a good design document that evolves for a while before code is written. At first glance, it appears to me:
- The FDW handler function must be in a compiled language. So PL/Java will need to have a single, fixed one.
- The handler function returns addresses of callbacks, and it is passed no arguments, so the callback addresses it returns can't depend on anything. They, too, will have to be fixed, and it will have to be possible for each callback, when called at runtime, to determine from its arguments which PL/Java FDW it is being called for, so the right Java code can be invoked.
- It might be good, as a start, to run down the whole list of callbacks and make sure each has a viable way to do that (from which argument, and how, will the target FDW be identified?).
- If PostgreSQL makes certain decisions based on whether certain callback addresses are or aren't present in the struct returned by the handler ... and as the handler is passed no arguments, it might be necessary that all PL/Java FDWs claim the same capabilities (in terms of which callbacks are present). Or, one might identify a small finite set of FDW types differing in their capabilities, with PL/Java supplying an appropriately-named handler function for each, and the right handler function would have to be specified in the `CREATE FOREIGN DATA WRAPPER` command.
I should have something soon.
I don't know what - but as several people have pointed out to me recently "some progress is better than no progress" (when seen from the outside) so even a dummy FDW that does nothing but provide some canned values would be a big step forward since it's something other people could build on. If nothing else it gets the idea past the initial hurdle of simply having something to modify vs. creating something from scratch in a complex environment.
The "minimal useful functionality" is probably the ability to replace a stored procedure that returns a cursor. Static (or null) values for things like cost, estimated record size, etc.
However this could result in a game changer since a FDW is tied to the server, not session or query, so it will have a single long-running JVM. I don't know if there's still a separate JVM per session (or even per query?) but having a long-lived JVM means you can start doing things like internal caching or speculative execution.
As for the callbacks - Holy Ramen Noodles you're right about that - but if you look at the dummy implementations, or even the PostgreSQL file_fdw implementation, you'll see that the FDW only needs to implement around 8 functions for read-only access, and it looks like that will translate to only a few Java methods. Basically an implementation of 'scan' and a way to get some metadata. I can look at your existing code to see how you load the jar, call a specific class/method, and create the resulting Tuple.
Re speculative execution. Some of the existing FDW wrap REST calls to external servers. In many cases you don't need to provide any information in the request and it's okay if the data is a little outdated. Think servers that provide routine weather information, tv schedules, etc.
In that case you can dramatically improve performance by performing speculative execution and caching the results. E.g., the FDW might specify that the cache should be updated every 10 minutes -- so the FDW sets up a loop so it makes the REST call and caches the results every 10 minutes. That cache is used when the user queries the FDW tables - they'll get an immediate response instead of waiting for a REST call to complete.
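The refresh loop described above might be sketched with a `ScheduledExecutorService`; this is purely an illustration, and the class and method names are hypothetical rather than part of any existing FDW code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative sketch only: a snapshot refreshed on a fixed schedule, so the
// scan path can answer immediately instead of waiting on a slow REST call.
class SpeculativeCache<T> {
    private final AtomicReference<T> snapshot = new AtomicReference<>();
    private final Supplier<T> fetcher;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "fdw-cache-refresh");
                t.setDaemon(true);  // never keep the server process alive
                return t;
            });

    SpeculativeCache(Supplier<T> fetcher, long refreshMinutes) {
        this.fetcher = fetcher;
        snapshot.set(fetcher.get());  // prime the cache once, synchronously
        scheduler.scheduleAtFixedRate(
                this::refresh, refreshMinutes, refreshMinutes, TimeUnit.MINUTES);
    }

    void refresh() {
        try {
            snapshot.set(fetcher.get());  // e.g. the REST call
        } catch (RuntimeException e) {
            // keep serving the stale snapshot if the remote call fails
        }
    }

    /** Called from the scan path: always immediate, possibly stale. */
    T current() {
        return snapshot.get();
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

The scan callbacks would only ever touch `current()`, so query latency is independent of the remote service's latency.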
You could certainly write a stored procedure that makes the REST call and updates a PostgreSQL table, and a second stored procedure that reads from that table. (You would do this instead of reading from the table directly for encapsulation, logging access, etc.) But I can't think of an easy way to perform the periodic updates short of setting up a cron task.
> However this could result in a game changer since a FDW is tied to the server, not session or query, so it will have a single long-running JVM. I don't know if there's still a separate JVM per session (or even per query?)
PL/Java itself is a JVM per session (well, per session where PL/Java is used).
I was not picturing a FDW behaving inherently differently. In what process do you picture such a JVM running, and what would be responsible for starting/stopping that process and the JVM within it?
This is one reason for the steps I mentioned - it provides a way to explore that question before committing to a full implementation.
Loading and unloading the FDW's JVM is easy - there are _PG_init(void) and _PG_fini(void) hooks. There are several options on finding the JVM binary:
- Reuse the existing per-session implementation, if possible.
- Ideally JAVA_HOME will be available.
- On Linux check whether `/etc/alternatives/java` is defined. (E.g., on my system it's a symlink to `/usr/lib/jvm/java-11-openjdk-amd64/bin/java`.)
- On Linux check `/usr/lib/jvm`. That's where Java is usually installed (at least on Debian derivatives) and there are symlinks using simple names like `openjdk-11`. The `_PG_init(void)` could check a short list of compatible JVMs. (E.g., on my system I could use `openjdk-11` but probably wouldn't use `openjdk-18` due to the changes in the security model.)
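The lookup order above could be sketched roughly like this; the candidate paths and the `JvmLocator` name are only illustrations of the idea, not proposed code:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Illustrative only: the class name and candidate paths are assumptions.
class JvmLocator {
    /** Returns the first candidate path that actually exists on this host. */
    static Optional<Path> findJava(List<Path> candidates) {
        return candidates.stream().filter(Files::exists).findFirst();
    }

    /** The lookup order sketched above: JAVA_HOME, then well-known locations. */
    static List<Path> defaultCandidates() {
        String javaHome = System.getenv("JAVA_HOME");
        return List.of(
                Path.of(javaHome == null ? "/nonexistent" : javaHome, "bin", "java"),
                Path.of("/etc/alternatives/java"),
                Path.of("/usr/lib/jvm/openjdk-11/bin/java"));
    }
}
```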
So I consider that a solved problem. It needs to be implemented but there are several clear paths forward.
However there are many other questions that can best be resolved by testing a minimal working implementation. Prior research is important, of course, but that assumes the people who wrote the documentation anticipated this use case.
Three obvious questions:
- is there a way for the FDW (given what it has) to obtain the contents of the session when the methods are called? This would allow the FDW to use the session JVM instead of a local JVM.
- if not - what will be required to pass data directly between the FDW JVM and the per-session JVM if desired?
- long-term and definitely 'opt-in' - is there a way for the FDW to replace the per-session JVM? This would require care in order to maintain isolation, e.g., the FDW would use a child classloader to load the jar + dependencies.
The second question is because I can see scenarios where a per-session JVM might want to use a persistent JVM for some tasks. Caching wouldn't be a good idea - it would break on a clustered database.
A better example would be access to a remote system that requires high security. (Think Kerberos, OAuth2, mutual TLS authentication, etc.) You could implement Kerberos and mutual TLS in the per-session JVM - although a careful security review might be concerned about access to the required credentials. However OAuth2 requires a HTTP/S callback to get the initial token and to renew it and that introduces uncertainty into the per-session JVM. E.g., how long should it wait for a response?
In this case I think a hybrid approach where the FDW is responsible for obtaining and (pro-actively) renewing the credentials and the per-session JVM retrieves it (e.g., via a standard SQL query) would make much more sense.
That said... it's not a big leap from "the per-session JVM makes the network call using credentials" to "the per-session JVM directly passes the information to the FDW and it makes the network call. The credentials are never exposed."
You could handle this using Tuples - the per-session JVM calls "INSERT .... RETURNING ..." so it's already an option once you support a read/write FDW. But that exposes the contents to any auditing tools with access to the actual data flow and I could see the security folks insisting on a direct call if possible.
(Hmm.... vague memories of the early Java java.rmi classes.... That might be the way to provide direct access despite using separate JVMs.)
> Loading and unloading the FDW's JVM is easy - there are `_PG_init(void)` and `_PG_fini(void)` hooks. There are several options on finding the JVM binary:
Sure. In what OS process, though, do you envision this happening?
PL/Java does it in the backend process serving your session.
After that essay... I should be clear that my current focus is much more limited. Think "FileFDW" but with two extensions:
- it can perform arbitrary manipulation of the data before providing the results in the expected manner.
- it can use Java to perform that manipulation.
The FDW implementation will be a skeleton with a single required option - the location of the jar file. (Ideally something already uploaded via `sqlj`, but initially it will be an external jar.)
The jar will need to implement one or more interfaces - TBD.
The FDW skeleton can either use the standard `META-INF` entry to locate the implementation, or it could scan the jar for all classes that implement it.
That's enough for my immediate needs, and I suspect enough for other people.
Everything else is "hmm, that could be useful since I've encountered a similar task at work" but definitely far beyond the scope of what I'm looking at at the moment. However these are important questions to ask in order to avoid implementations that fix the immediate problem but can't be extended.
The more I think about it the more I'm convinced that there's some way to get the session at the appropriate time, e.g., when the scan is initialized. It's just too useful to not provide.
In the meanwhile, or if we want to stick with a single JVM, the _PG_init() function can make a fork-exec call. The parent process writes the PID to the usual location, with some mechanism that supports multiple instances. (Future task....). This is a good idea even if the JVM launcher performs its own fork-exec since it gives us a place to hang additional resources that can be controlled via IPC.
_PG_fini() retrieves the PID and calls kill() so the JVM can perform a clean shutdown.
> The more I think about it the more I'm convinced that there's some way to get the session at the appropriate time, e.g., when the scan is initialized.
I'm becoming a little concerned: there's enough vagueness in what you've written there that I am not sure what your mental model is of how PostgreSQL itself works.
> In the meanwhile, or if we want to stick with a single JVM, the `_PG_init()` function can make a fork-exec call.
PL/Java starts up from _PG_init() in the backend process serving your session. So that's already a JVM per session (that uses PL/Java). Doing a fork there to create another process and another JVM in it would lead to two processes and two JVMs per session, which I don't think anyone is pining for, and I don't think would be at all necessary for the example uses you've indicated.
I don't think there's anything about a `FileFDW`-with-some-Java-manipulations that can't be done happily within the same JVM that PL/Java is already going to start for your session, and I would urge focusing on a solution that will do that, as doing anything else will be a significant complexity jump.
> The jar will need to implement one or more interfaces - TBD.
I would also suggest front-loading the effort to at least sketch out those interfaces. They're likely to need some iteration, and some difficulties may be identified early that way.
> a single required option - the location of the jar file. (Ideally something already uploaded via `sqlj`, but initially it will be an external jar.)
Because PL/Java already supplies facilities for loading jars, defining class paths, and associating those with PostgreSQL schemas, I would suggest instead that the options name a PostgreSQL schema (thereby, its class path) and the name of a class that implements the required interface.
That way, you can put your FDW implementation all in one jar, or have it and supporting libraries in separate jars, or include it in the jars of your other PL/Java functionality, and you don't have to reinvent any wheels that PL/Java already has.
... also, while a schema and class name would be required in all cases, it will be important for additional options to be accepted. The Java interface needs to include a validation method to make sure any additional options and values supplied make sense to that FDW.
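A rough sketch of what such a validation method might look like; the `OptionValidator` interface, its `Origin` enum, and the option names are all assumptions for illustration, not an existing PL/Java API:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the interface, enum, and option names are assumptions.
interface OptionValidator {
    enum Origin { WRAPPER, SERVER, TABLE, COLUMN, USER_MAPPING }

    /** Throws IllegalArgumentException for unknown or malformed options. */
    void validate(Origin origin, Map<String, String> options);
}

// Trivial implementation that only constrains server-level options.
class DemoValidator implements OptionValidator {
    private static final Set<String> SERVER_OPTIONS = Set.of("schema", "class");

    @Override
    public void validate(Origin origin, Map<String, String> options) {
        if (origin != Origin.SERVER) {
            return;  // this demo accepts anything at the other levels
        }
        for (String key : options.keySet()) {
            if (!SERVER_OPTIONS.contains(key)) {
                throw new IllegalArgumentException("unknown server option: " + key);
            }
        }
    }
}
```

Passing the origin as an argument mirrors the fact that PostgreSQL calls the validator separately for each level of declaration.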
Two things finally occurred to me at the same time.
First, there's no reason the first FDW needs to be able to interact with java. We can start with something simple and entirely written in C just to have a working example so we have a baseline. Once we have a good handle on it we can then start incorporating JNI calls.
Second, I've been thinking about a 'bytecode' UDT for a while. Long story, no need to go into it here.
Candidate FDW
This suggests a candidate FDW: jar_fdw.
It should be simple to implement:
- Copy the existing `file_fdw`
- Add zip file support
- Expose the information required by the `ClassLoader`
From the javadoc we only require two methods:
```java
class NetworkClassLoader extends ClassLoader {
    String host;
    int port;

    public Class findClass(String name) {
        byte[] b = loadClassData(name);
        return defineClass(name, b, 0, b.length);
    }

    private byte[] loadClassData(String name) {
        // load the class data from the connection
        . . .
    }
}
```
The latter is trivial - it's a select.
Implementing a SecureClassLoader only requires adding a CodeSource object. This contains the URL (required) and an optional array of either CodeSigner or Certificate. We can and should use the URL of the jar file.
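For illustration, constructing the `CodeSource` from a jar URL is a one-liner; the jar path is made up, and a null certificate array simply means the jar is treated as unsigned:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.security.CodeSource;
import java.security.cert.Certificate;

// Sketch only; the jar path used by callers is illustrative.
class JarCodeSource {
    /** A null Certificate[] means "no signers known" (unsigned jar). */
    static CodeSource forJar(String jarUrl) throws MalformedURLException {
        return new CodeSource(new URL(jarUrl), (Certificate[]) null);
    }
}
```

The cast to `Certificate[]` disambiguates between the two `CodeSource` constructors (one takes `Certificate[]`, the other `CodeSigner[]`).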
Implementation
There are three candidate implementations of the zip functionality.
- develop our own using zlib.
  - eliminates the need for additional dependencies
  - more secure since we have full control of the code
  - easy to use the standard TOC for 'push down' searches
  - (or create our own if the latter is missing)
  - I have experience implementing the zip archive in C
- use the libzip library (read and write)
- use the libzzip library (read-only)
We only need 'read only' access - that means there's no benefit to using the libzip library unless we know it is more secure.
Note - limited initial implementation
Note: the following topics are mostly ideas for future enhancements. The initial work needs to be focused on POC and may include temporary measures like using a third-party zip implementation even if the ultimate goal is to write our own.
Classloader concerns
We could create a separate FDW for each jar but it makes a lot more sense for the FDW to allow the user to provide multiple files since the only thing that matters is the classpath. Each row can contain a short foreign key into a jar file table.
"Backwards compatibility"
A FDW can access the host database. This means the FDW can transparently access either local files or the existing tables.
Usage
You need to perform three steps before you can access a FDW:
- define the SERVER (you can have multiple instances)
- define the USER MAPPING
- define the FOREIGN TABLE
All three steps allow you to provide configuration options. For now we can restrict the list of jars to the server but keep the door open for adding it to other levels for more sensitive jars. Of course all of the standard database PERMISSIONS still apply - but those permissions can't reach the ClassLoader.
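For reference, the three steps look roughly like this in standard PostgreSQL DDL; the FDW, server, table, and option names are all hypothetical:

```sql
-- All object names and option values here are hypothetical.
CREATE SERVER jar_server
    FOREIGN DATA WRAPPER jar_fdw
    OPTIONS (jarfile '/opt/jars/example.jar');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER jar_server;

CREATE FOREIGN TABLE jar_classes (
    class_name text,
    bytecode   bytea
) SERVER jar_server;
```

Each statement accepts its own OPTIONS clause, which is where the per-level configuration mentioned above would go.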
Maven
This is much more advanced - but a natural place to look at integrating java later. It's also a possible solution when running in a hosted environment where we have limited to no access to the local filesystem a priori. (We should still be able to use a temp directory for caching, etc.)
For Java devs the Maven coordinates make much more sense than the location of the jar file. It's also enough information to find the jar file if we know the location of the local repo and it follows the standard layout.
It's also a natural component of the CodeSource UDT, although at the cost of being an unsupported schema.
The C-based implementation would be limited to local files or at most explicit URLs to a download site. However a java-based implementation can use the standard maven archive classes to find and download the jars from anywhere.
A quick update....
I feel a bit like that Bugs Bunny cartoon set in opposing forts where Yosemite Sam keeps opening a door to find another door behind it, while Bugs is nailing new doors on the end of the stack as fast as YS is opening them.
For the FDW I started from the PostgreSQL documentation, then realized I should reduce the problem to one that's entirely in C, reviewed some existing classes before discovering a 'blackhole_fdw' implementation that provides a working skeleton. It doesn't do anything but you can build it and install it. You just get no records with SELECT, and INSERT, UPDATE, and DELETE have no effect. But it's a working blank slate....
... however it hasn't been updated since PostgreSQL 10 or so (so it's missing a number of methods that we (probably) don't need), and
... more importantly it has a null implementation. That doesn't mean it doesn't initialize and return structures, just that there's nothing but nulls and integers. Not even named fields. It's not hard to look at the functions called to add this ourselves but it takes a bit of time.
What I had hoped for was a skeleton implementation that replaced the actual queries with placeholders (e.g., something accessing an internal cache) while fully implementing the glue. We would still need to implement the actual access to an external resource, provide the proper schema, etc.
At the moment I've forked the project and am adding this myself. I'll probably check for other skeleton implementations in a few days... but it depends on how quickly I make progress.
Once this is done it will be easy to add the jarfile code - one of the libraries I mentioned above uses memory-mapped files and that should be fast enough. (I'll document a few ideas if we need to improve the performance.)
This is all a bit ironic since this effort started by revisiting a separate project that would require several FDWs (in C) and I thought pljava would be a simpler version of those FDWs since a read-only implementation would still be good enough to get started. Now I have to open another door, one with no guarantees that there isn't yet another door behind it.
PL/Java aims to support implementing PostgreSQL functionality in Java. For 20 years it has supported implementing functions and triggers and UDTs in Java, and more recently casts and aggregates in Java, and in the ongoing refactoring it grows to support implementing whole new PLs in Java, which can support functions and procedures and inline blocks and triggers and UDTs....
It's a natural idea to extend PL/Java to also support writing FDWs in Java, where one would write a Java class that implements some interface like PLJavaBasedFDW, in much the same way that a new PL is simply a new class that implements PLJavaBasedLanguage.
So I would rather expect an early effort at PL/Java FDW support to look something like this.
Ah, the eternal question of whether you do top-down or bottom-up....
I've focused on bottom-up since there's an existing API - it shouldn't drive the final interface but it should definitely inform it to minimize the amount of 'hidden code' that can cause unexpected behavior.
I think I've made enough progress to see whether the pieces can meet in the middle and actually run some tests.
API vs SPI
Important note:
- an API is called by the client
- an SPI is called by the server
In both cases the purpose is abstraction and encapsulation. The API, e.g., jdbc, doesn't expose the database internals. The SPI, e.g., for an external library, allows the service to avoid exposing any internals. The nature of a FDW will always be that of an SPI - there is no way to call these methods other than via the FDW (or explicitly added UDF) - so the implementation of the SPI should rely on as few database internal structures and methods as possible. Ideally none, although there may be practical reasons for a few exceptions. (E.g., it should be possible to write to the server's log and/or user console - but we can still mask the use of elog()).
Simplified Internal SPI
I've been focused on a simplified internal SPI. The goal is to handle the PostgreSQL-specific stuff in a single implementation that calls a simplified internal SPI. With PL/Java this would be a set of methods that call the provided Java functions.
I don't know whether some database-provided information should be pulled by the java or pushed from the C via an SPI. A single interface can include both, of course.
Usage
The skeleton FdwRoutines and FdwValidator will need to include implementations of the corresponding functions. I'm in the 'has-a' camp here - they should use dependency injection.
Current status
I can successfully build a .so that can be loaded and respond to select * from my_foreign_table. At the moment it only writes the equivalent of ResultSetMetaData to the client via 'NOTIFY'. I'm sure I can also populate the results since I have both the attribute name (for a Map key) and the required attribute type (via attrtypid).
I haven't taken this final step since the existing file_fdw also includes a fair amount of memory management in the 'read-only' functions. I know that it's a wrapper to the COPY command but don't know if the memory management is only required by the use of that functionality or if it's always required.
Minimal SPI
```java
public interface MinimalFDW<T> {
    void configure(Map<String, String> options); // see notes

    // method used to provide 'import schema' statement ?
    // alternate: method to provide ResultSetMetaData ?
}
```
There needs to be either three methods (for wrapper, server, and table) or an additional parameter added to the method signature. It is important to keep track of the origin since the same option may occur in more than one location and priority matters.
In addition there's no ban on duplicate keys. In fact it may be preferred, e.g., if it's the frontend to a cluster the designer may prefer one entry per server instead of one entry with a list of servers. This would require replacing the value with List<String> even though nearly every entry would have a single value.
However... that also makes it easy to support options that are simple flags - use an empty list.
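One possible shape for that, purely as a sketch (the class and method names are invented): an identifier-to-list mapping where an empty list marks a bare flag:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: class and method names are invented for illustration.
class OptionSet {
    // identifier -> list of values; an empty list marks a bare flag
    private final Map<String, List<String>> options = new LinkedHashMap<>();

    void add(String key, String value) {
        options.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    void addFlag(String key) {
        options.computeIfAbsent(key, k -> new ArrayList<>());
    }

    List<String> get(String key) {
        return options.getOrDefault(key, List.of());
    }

    boolean isFlag(String key) {
        return options.containsKey(key) && options.get(key).isEmpty();
    }
}
```

Insertion order is preserved, which matters if wrapper, server, and table options are merged in priority order.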
PLANNER SPI
We can skip this for now - there's no need for it until query optimization is supported. At most we could use a method that returns the estimated number of rows, and possibly the estimated cost.
SELECT only SPI
```java
public interface ReadOnly {
    void open(User user, Details details);
    void next(Record record);
    void reset();
    void close();
}
```
open
- `User` is user information. (At the moment I only have the OID.)
- `Details` contains information about this specific table. (Actual row metadata, etc.)
- does not open the resource
- ... except it can briefly open it to verify content, count number of rows, etc.

next
- `Record` is passed due to the way the underlying code works, but...
- ... with a better grasp of memory management it may be possible to have this return a `ResultSet`
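To make the lifecycle concrete, here is a toy in-memory version of that scan contract, with the `User`/`Details`/`Record` types reduced to plain maps and lists; everything here is illustrative, not the proposed SPI:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for the open/next/reset/close contract above.
class CannedScan {
    private final List<Map<String, Object>> rows;
    private Iterator<Map<String, Object>> cursor;

    CannedScan(List<Map<String, Object>> rows) {
        this.rows = rows;
    }

    /** Does not touch any external resource, matching the 'open' note above. */
    void open() {
        cursor = rows.iterator();
    }

    /** Returns the next row, or null at end of scan. */
    Map<String, Object> next() {
        return cursor.hasNext() ? cursor.next() : null;
    }

    /** A rescan restarts from the beginning. */
    void reset() {
        cursor = rows.iterator();
    }

    void close() {
        cursor = null;
    }
}
```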
INSERT/UPDATE/DELETE/TRUNCATE SPI
TBD
Optimizations
- The biggest optimization is support for push-down join restrictions. By default the FDW provides all records but using this allows the FDW to be far more selective about what it returns.
- `analyze()` provides up-to-date statistics for the query optimizer. It is valuable if push-down joins are supported.
- `vacuum()` compacts the data. It only makes sense if read/write operations are supported.
I forgot to add - here's the work-to-date: https://github.com/coyotesong/simple_fdw
And I just created a pull request that shows the pljava side of this. There's a lot of handwaving since this is a lot like the existing UDT in some ways and different in other ways, but it's a start. I think the biggest difference may be that the UDT requires four exported methods while the FDW mostly calls unpublished methods. The two published methods are only used when you create a new FOREIGN TABLE.
That makes it harder for me to simply steal, I mean legally repurpose, the existing UDT implementation.
And it also means that the suggested FDW class may be completely off-base since we don't need Datums and PG_ARGS for internal functions/methods. But it's a start... (and I definitely think we should avoid making these functions visible unless absolutely required.)
I just created a second pull request with a much more serious effort to prototype working code. It works with a clean Java implementation of the required tasks - there is NO dependency on PostgreSQL internals.
https://github.com/tada/pljava/pull/534
It has known issues but I've documented them. The serious ones are mostly related to memory management issues (which other people should be able to answer) and the details of how the persistent objects are actually managed by the database. This will make a big difference in terms of supporting multiple servers per FDW, and multiple foreign tables per server.
I also called out the fact that current user OID is known but not yet provided to the java code. I'm sure it's just a matter of knowing the right functions to call to extract the desired information from both the system and session info.
> Important note:
> - an API is called by the client
> - an SPI is called by the server
I appreciate the distinction you're making here, but I don't think you'll find it used as fastidiously in the PostgreSQL community; there are recurring discussions at the developer conferences and on the -hackers list concerning just what the PostgreSQL APIs are or aren't, and those discussions don't often get bogged down in just which is an API and which an SPI, and I wouldn't want that to happen to this one either.
The challenge in designing a programming interface, of whatever kind, is how to cleanly present the fundamental objects and operations needed, uncluttered by details that aren't fundamental. For example, when looking at RegClass, RegRole, RegType, and so on, there is no need to expose just how PL/Java grovels them out of PostgreSQL caches, keeps them coherent in the presence of updates, and so on. That would be clutter at the interface level. It is not clutter to expose classes, roles, and types, as those are fundamental things code needs to be able to pass around and talk about.
> It has known issues but I've documented them. The serious ones are mostly related to memory management issues (which other people should be able to answer) and the details of how the persistent objects are actually managed by the database. This will make a big difference in terms of supporting multiple servers per FDW, and multiple foreign tables per server.
That reminds me that I have wanted for a while to write down in one place the things I have noticed so far about just what PostgreSQL gives us to work with here. This may as well be the place....
FDWs—what choices are predetermined by what PostgreSQL provides?
I know that you have been looking at the same details, so I hope that, no matter our disagreements on overall approach, we can agree that these points below are an objective description of what PostgreSQL requires us to work with.
The persistent objects
Handler and validator functions
PostgreSQL will allow any number of these to be declared, and it will allow multiple CREATE FOREIGN DATA WRAPPER declarations naming the same functions.
Foreign Data Wrappers
PostgreSQL allows any number of these to be declared and for multiple FDWs to share the same handler/validator functions. In addition to those functions, an FDW is declared with a name, owning role, USAGE ACL, and zero or more options: an arbitrary identifier-to-string mapping meaningful only to the implementing functions. FDWs that share implementing functions may commonly be declared with different options, but could also be identical in all but name and would still be distinct objects.
Foreign Servers
PostgreSQL allows any number of these to be declared and for multiple servers to share an FDW. In addition to the associated FDW, a foreign server is declared with a name, owning role, USAGE ACL, type, version, and its own set of server-level options. Type and version are two more uninterpreted strings meaningful only to the FDW in question—essentially two more options by another name, only not passed to the validator function.
Here again, you might commonly expect that distinct servers declared on the same FDW will differ in their type, version, or some options, but it is also possible they could be identical in all but name and still be distinct servers.
Foreign Tables
PostgreSQL allows any number of these to be declared and for multiple tables to share a foreign server. A foreign table is declared and represented in nearly all respects just like any local table, but for having an associated foreign server, its own table-level options, and also possibly per-column FDW options.
User Mappings
Essentially a map from (RegRole,ForeignServer) to a set of options for accessing that server.
Common elements
"Set of options" appears as a common element for FDWs, servers, tables, attributes, and user mappings. They all have the same grammar in PostgreSQL: an arbitrary SQL identifier mapped to an arbitrary string. One undocumented restriction, though: the identifier may not contain =, because an options declaration like `OPTIONS ("foo=bar" 'baz')` will be silently accepted, but seen by the FDW validator and handler as `"foo" 'bar=baz'`.
Notes on the validator function
The validator function is called separately to validate each of: options supplied with an FDW declaration, options specified with a server declaration, those specified with a table declaration, those specified for an individual column in a table declaration, and those in a user mapping.
When called to validate a server declaration, the validator is not passed any reference to the server's declared FDW. Therefore, if there are any server options whose validity depends on the options chosen at the FDW level, the validator cannot check those. Also, the server's type and version are not passed to the validator at all and so cannot be validated.
Likewise, when called to validate table options or user mapping options, the validator is not passed any reference to the associated server. So if there are table options or user mapping options that depend on the options chosen at the server level, the validator cannot check those.
Column-level options are validated without identifying the table in which the column appears, so if any column options depend on the table options (or anything else, like the column type!), the validator can't check them.
Discussion
If I had designed the validator interface I might have put one more oid argument in the signature, so that column validation could be passed the table, table and user-mapping validation could be passed the server, and server validation could be passed the FDW. As for what to pass in the extra argument when validating an FDW declaration, I'd consider passing the handler function. Without that, an FDW implementation can't even detect at validation time that the CREATE FOREIGN DATA WRAPPER is naming a handler function from a different implementation than the validator's.
But really, what surprises me more is that the excellent precedent of PL validators wasn't followed, where the object to be validated is just provisionally entered in the catalog, visible to no one else before the transaction commits, its oid is passed to the validator, the validator can consult whatever fields it needs to, and if it rejects the declaration the provisional entry goes away with the rollback. Simple, solves all the shortcomings here, and with existing precedent back to PG 7.3.
But in any case, the limitations of the existing implementation we have to work with are as given above.
Notes on the handler function
The handler function has no arguments, not even which FDW it is being called for, so the struct of function pointers it returns must be equally suitable for any FDW that can be declared with that handler.
PostgreSQL is quite profligate in making calls to this function: even though the struct it returns will likely be nothing but unchanging static function pointers, PostgreSQL expects a freshly palloc'd copy on each call, keeps it only to the end of the query, and throws it away. (That means, in passing, that my initial idea, which could look at first glance as if it instantiated the FDW class in the handler function, would appear wildly impractical. But really it wasn't, because of the way PLJavaBasedLanguage treats function handling as staged programming. A closer look at the initial idea shows the class instantiation happening at the prepare stage, meaning it would happen once and automatically be cached in the Template for the lifespan of the handler RegProcedure without any extra ceremony needed. Such are the benefits of offering an idiom that makes staged programming simple enough that you can forget you're doing it.)
Any Java / implementation state that is worth keeping around has got to be cached with no help from PostgreSQL. And because all of the FdwRoutine function pointers have to work for any FDW declared on the same handler, every one of those functions must have some way of identifying the table(s), server, and FDW in question from the PostgreSQL structures that are passed to it.
It must be the case that whatever structures are passed to each of those functions can always somehow be tortured into revealing that information. It would feel like a useful milestone, though, just to write up a table with n rows for the n FdwRoutine functions, stating the steps each one needs to go through to extract the table/server/fdw identity from the arguments it gets.
Some of the structures have 'private' fields that can be populated by FdwRoutine functions that are called earlier and referred to by those called later (within a query, still, and all thrown away at the end). Just which functions can be counted on to be called before which others seems to be a topic that is neither completely undocumented nor completely documented.
PostgreSQL does check the presence / nullness of individual function pointers in the FdwRoutine struct to learn the capabilities of the FDW. So while the implementation may in fact simply have n static function pointers to put into that struct, the handler function should still, when it populates that palloc'd FdwRoutine struct on the fly, store each function pointer or NULL depending on whether that capability is supported.
Discussion
Although PostgreSQL's profligate use of the handler function makes it a bad idea to instantiate implementing Java classes there, it may be a natural idea to be sure the Java class in question is loaded at that point, and determine, from which optional interfaces it implements, which function pointers to include in FdwRoutine.
That's about all I've got for now.
I wrote:
It must be the case that whatever structures are passed to each of those functions can always somehow be tortured into revealing that information. It would feel like a useful milestone, though, just to write up a table with n rows for the n FdwRoutine functions, stating the steps each one needs to go through to extract the table/server/fdw identity from the arguments it gets.
Here's a start. It seems I can do about four at a time without my eyes starting to cross. It turns out that n = 45.
| Function | How to get table / server / wrapper from function's args |
|---|---|
| GetForeignRelSize | foreigntableid |
| GetForeignPaths | foreigntableid |
| GetForeignPlan | foreigntableid unless it's InvalidOid; then what? (baserel->fdw_private may be a shortcut if that was set. Or?) |
| BeginForeignScan | node->ss.ps.plan is the ForeignScan we may have planted private info in; or look at its scan.scanrelid or fs_base_relids and grab a range table entry at the corresponding index from node->ss.ps.state->es_range_table |
| IterateForeignScan | |
| ReScanForeignScan | |
| EndForeignScan | |
| GetForeignJoinPaths | |
| GetForeignUpperPaths | |
| AddForeignUpdateTargets | |
| PlanForeignModify | |
| BeginForeignModify | |
| ExecForeignInsert | |
| ExecForeignBatchInsert | |
| GetForeignModifyBatchSize | |
| ExecForeignUpdate | |
| ExecForeignDelete | |
| EndForeignModify | |
| BeginForeignInsert | |
| EndForeignInsert | |
| IsForeignRelUpdatable | |
| PlanDirectModify | |
| BeginDirectModify | |
| IterateDirectModify | |
| EndDirectModify | |
| GetForeignRowMarkType | |
| RefetchForeignRow | |
| RecheckForeignScan | |
| ExplainForeignScan | |
| ExplainForeignModify | |
| ExplainDirectModify | |
| AnalyzeForeignTable | |
| ImportForeignSchema | |
| ExecForeignTruncate | |
| IsForeignScanParallelSafe | |
| EstimateDSMForeignScan | |
| InitializeDSMForeignScan | |
| ReInitializeDSMForeignScan | |
| InitializeWorkerForeignScan | |
| ShutdownForeignScan | |
| ReparameterizeForeignPathByChild | |
| IsForeignPathAsyncCapable | |
| ForeignAsyncRequest | |
| ForeignAsyncConfigureWait | |
| ForeignAsyncNotify | |
I am cribbing some of these details from jdbc_fdw.
In cribbing from there, I notice that looking at a range table entry seems necessary in any case to obtain the user identity that is to be mapped.