Rethinking the code structure
Per discussions on the video meeting with @typhoonzero @shendiaomo
Two Components in SQLFlow
Compiler
- Frontend: the parser package parses SQL statement(s) and generates IR(s).
- Semantic analysis (runtime), which includes feature derivation, the verifier, the attribute filler and checker, and model reloading.
- Optimizer (static), which analyzes the dependencies among SQL statements and generates a parallel graph.
- Backend: various code generators, which produce a YAML (Argo workflow) file, an AI program (TF/XGBoost), or an optflow program (see the pipeline sketch below).
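A minimal sketch of this pipeline, with the four stages passed in as functions because their real implementations live in the packages above; the names and signatures here are illustrative, not existing SQLFlow APIs:

```python
def compile_sql_program(sql_program, session, parse, analyze, optimize, codegen):
    """Compose the four compiler stages described above; the stage functions are
    injected because this is only an illustration of the data flow."""
    statements = parse(sql_program)                        # frontend: SQL text -> IR(s)
    irs = [analyze(stmt, session) for stmt in statements]  # semantic analysis: feature derivation,
                                                           # verification, attribute filling/checking
    plan = optimize(irs)                                   # optimizer: dependency analysis -> parallel plan
    return codegen(plan)                                   # backend: Argo YAML, TF/XGBoost, or optflow program
```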
Interpreter
The SQLFlow compiler generates a two-layer graph, and two kinds of interpreters execute the two layers:
- First graph (Argo workflow): the Argo controller is the interpreter.
- Second graph (AI program): the Python/PAI command-line/EDL command-line is the interpreter.
The Desired Code Structure
/pkg
/interpreter(executor)
/graph(Argo)
/node(python/pai/alisa)
/compiler
/parser
/semantics analysis(runtime)
/feature_derivation
/verifier
/model_reload
/attribute filler && checker
/optimizer(static)
/parallel graph
/backend(codegen)
Incomplete Thoughts About the Final Design
Shortcomings of the Current System
- The workflow graph loses much detailed information.
  We hope SQLFlow generates a more detailed graph. For example, if the graph can describe a group of TensorFlow ops running on CPU/GPU, we can optimize the throughput of the AI pipeline.
- The workflow cannot achieve the best throughput.
  A streaming graph can achieve better throughput. For example, we can write a custom TensorFlow op to read the data generated by the SELECT clause in a streaming fashion, instead of creating a temporary table (see the sketch after this list).
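For illustration, a minimal sketch of the streaming idea, assuming a DB-API connection and that `output_types` matches the query's columns (not an existing SQLFlow API):

```python
import tensorflow as tf

def dataset_from_select(conn, select_stmt, output_types):
    """Stream the rows of a SELECT statement into tf.data instead of
    materializing them into a temporary table."""
    def row_generator():
        cursor = conn.cursor()
        cursor.execute(select_stmt)
        rows = cursor.fetchmany(1000)
        while rows:
            for row in rows:
                yield row
            rows = cursor.fetchmany(1000)
    return tf.data.Dataset.from_generator(row_generator, output_types=output_types)
```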
Code Structure
pkg/
/interpreter
/argo(graph executor)
/node(subgraph executor)
/semantics analysis(runtime?JIT?)
/feature_derivation
/verifier
/model
/attribute filler && checker
/graph
/compiler
/parser
/optimizer(static)
/parallel graph
/backend(graph)
Things that need to be discussed next:
- What program each graph node executes.
- A flattened graph or a two-layer graph.
Is SQLFlow a compiler or an interpreter? It doesn't make sense for it to be both.
We don't have, and probably will never have, two layers of graphs. We don't have any graph in our system. The top level is a workflow represented by YAML. A workflow is not a graph: an Argo/Tekton workflow can have conditionals, loops, and even function definitions and function calls, whereas a graph cannot. The lower level is a Python program, which is not a graph either; a Python program can have all kinds of control flow, but a graph cannot.
It is disappointing to see our team members still sticking to the simplistic idea of a "graph", which early versions of TensorFlow used as a very unprofessional form of IR, especially those who experienced PaddlePaddle, which tried so hard to propose an IR that is much more powerful than a graph. All along, innovators like Chris Lattner have been introducing professional forms of IR into TensorFlow, but sadly people cannot see those efforts.
The Current Structure
After several structure-adjustment PRs (#2481 #2484 #2491 #2500 #2502 #2505 ), the current package structure has become:
pkg
├── attribute # semantics
├── codegen # codegen
├── database # basic lib
├── executor # step execution
├── ir # intermediate representation
├── log # basic lib
├── model # basic lib
├── modelzooserver # server
├── parser # syntax
├── pipe # basic lib
├── proto # protobuf definitions
├── sql # step execution
├── sqlflowserver # server
├── sqlfs # basic lib
├── step # step execution
├── tar # basic lib
├── test # basic lib
├── verifier # semantics
└── workflow # workflow execution
The Proposed Structure
There are still several problems:
1. We can restructure the packages according to their functionalities as standard components of a compiler, for example: put `attribute` and `verifier` in a `semantics` package, put all basic libraries in a `basic` package, and put `sqlflowserver` and `modelzooserver` in a `server` package.
2. The `executor` generates code for step execution and then executes that code. We should decouple the code generation phase from the execution phase, and put the decoupled code in `codegen` and `step` respectively. Similarly, because the `sql` package calls `executor` for step execution, the files in `sql` should be moved into `step`. After this stage, the package structure should be:
   pkg
   ├── basic
   │   ├── database
   │   ├── log
   │   ├── model
   │   ├── pipe
   │   ├── sqlfs
   │   ├── tar
   │   └── test
   ├── codegen
   │   ├── alisa.go
   │   ├── pai.go
   │   ├── ...
   │   └── couler
   ├── ir
   ├── parser
   ├── proto
   ├── semantics
   │   ├── attribute
   │   └── verifier
   ├── server
   │   ├── modelzooserver
   │   └── sqlflowserver
   └── execution
       ├── step
       │   └── executor.go
       └── workflow
3. We have a 2-pass compilation architecture: 1) the first pass generates the workflow YAML and submits it; 2) the second pass happens during step execution, which uses `step -e` to generate and execute the Python scripts. This architecture makes SQLFlow neither a "pure" compiler nor a "pure" interpreter. We can make SQLFlow a one-pass compiler: the only pass generates the YAML and all the scripts, and the scripts are placed in a directory to be used as Argo input artifacts (see the sketch after this list). After this phase, we don't need `pkg/step` and `cmd/step` anymore.
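A sketch of the one-pass idea, under the assumption that per-statement code generation can happen at compile time; the helper functions `parse`, `gen_step_script`, and `gen_workflow_yaml` are hypothetical:

```python
import os

def compile_one_pass(sql_program, session, out_dir, parse, gen_step_script, gen_workflow_yaml):
    """Emit the workflow YAML and every step script in a single compilation pass,
    so no `step -e` re-compilation happens inside the workflow steps."""
    script_paths = []
    for i, stmt in enumerate(parse(sql_program)):
        path = os.path.join(out_dir, "step_%d.py" % i)
        with open(path, "w") as f:
            f.write(gen_step_script(stmt, session))  # statement -> Python entry point
        script_paths.append(path)
    # The YAML refers to the scripts as Argo input artifacts and runs `python step_i.py`.
    with open(os.path.join(out_dir, "workflow.yaml"), "w") as f:
        f.write(gen_workflow_yaml(script_paths))
```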
I agree that there is a two-layer architecture in the current code base, and that makes SQLFlow unclear.
- The 1st layer: SQLFlow translates a SQL program into a workflow, which is a YAML file; the Argo controller is the executor that executes this workflow. -- SQLFlow is a compiler.
- The 2nd layer: each workflow step executes a SQL statement using the SQLFlow step command-line, which translates the SQL statement into a Python script and executes it. -- the SQLFlow step command-line is much like an interpreter.
To make it clearer, I think we can keep the two-layer architecture, and SQLFlow becomes a pure compiler.
- The 1st layer: SQLFlow generates a workflow; each workflow step includes an entry-point program, which is a Python program.
- The 2nd layer: each workflow step executes this Python script using the Python interpreter.
> After this phase, we don't need `pkg/step` and `cmd/step` anymore.
So we don't need the `pkg/execution/step` folder?
> 3. We have a 2-pass compilation architecture: 1) the first pass generates the workflow YAML and submits it; 2) the second pass happens during step execution, which uses `step -e` to generate and execute the Python scripts. This architecture makes SQLFlow neither a "pure" compiler nor a "pure" interpreter. We can make SQLFlow a one-pass compiler: the only pass generates the YAML and all the scripts, and the scripts are placed in a directory to be used as Argo input artifacts.
In the current architecture, we always run `step -e {sql_statement}` in each step. This brings the limitation that one SQL statement is mapped to exactly one step. Also, the step binary parses the statement and builds the IR again inside the step image, which duplicates work.
In the future, one SQL statement may be translated into several steps. So it would be better if, after translating the SQL program into a workflow, we could see explicitly what each step executes, such as data analysis, data exploration, or model training, instead of executing a general `step -e` command (see the sketch below).
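For example, a single TO TRAIN statement might expand into several explicitly named steps; the step names and scripts below are purely illustrative:

```python
def expand_train_statement(train_ir):
    """Illustrative expansion of one TO TRAIN statement into purpose-named
    workflow steps instead of a single generic `step -e` call; in a real
    implementation, train_ir would decide which steps are actually needed."""
    return [
        {"name": "data-analysis",      "command": ["python", "analyze.py"]},
        {"name": "feature-derivation", "command": ["python", "derive_features.py"]},
        {"name": "model-training",     "command": ["python", "train.py"]},
    ]
```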
> After this phase, we don't need `pkg/step` and `cmd/step` anymore.
>
> So we don't need the `pkg/execution/step` folder?
No, we don't. We only have to move something like `table_writer` to the `basic` package.
> I agree that there is a two-layer architecture in the current code base, and that makes SQLFlow unclear.
> - The 1st layer: SQLFlow translates a SQL program into a workflow, which is a YAML file; the Argo controller is the executor that executes this workflow. -- SQLFlow is a compiler.
> - The 2nd layer: each workflow step executes a SQL statement using the SQLFlow step command-line, which translates the SQL statement into a Python script and executes it. -- the SQLFlow step command-line is much like an interpreter.
>
> To make it clearer, I think we can keep the two-layer architecture, and SQLFlow becomes a pure compiler.
> - The 1st layer: SQLFlow generates a workflow; each workflow step includes an entry-point program, which is a Python program.
> - The 2nd layer: each workflow step executes this Python script using the Python interpreter.
In a discussion with @Yancey1989, we found that we still have to implement a feature derivation mechanism in Python, like the previous migration prototype, to make the proposed structure workable.
The problem is that:
- The feature derivation mechanism must run during step execution.
- The `codegen` package in the current architecture depends heavily on feature derivation to generate Python code.
As a result, we have to first generate a `.yaml` in `sqlflowserver` and then generate the `.py` files in the `step` binary.
> As a result, we have to first generate a `.yaml` in `sqlflowserver` and then generate the `.py` files in the `step` binary.
Since SQLFlow is a compiler, which doesn't care about execution, it seems that we should have a command-line compiler. What would be a good name for the compiler's binary file?
> As a result, we have to first generate a `.yaml` in `sqlflowserver` and then generate the `.py` files in the `step` binary.
I think we can generate a `.yaml` file and, for each step, call the TensorFlow/XGBoost/PAI code generator to generate a submitter program entry point, which is a `.py` script. The following `.yaml` snippet is a very simple example:
steps:
- name: step-1
  command: ["python", "-c"]
  args:
  - |
    from sqlflow.runtime import tensorflow
    tensorflow.train(....)
That `tensorflow.train` calls feature derivation and the verifier, and then trains the TensorFlow model.
We can also separate feature derivation and the verifier into separate steps to decouple the workflow step logic. A rough sketch of such a `train` entry point follows.
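The module layout and helper names below are hypothetical, not existing code; the sketch only shows the order of operations inside the step:

```python
# Hypothetical sqlflow.runtime.tensorflow.train -- a sketch, not existing code.
def train(datasource, select, label, model_params,
          derive_features, verify, build_estimator, input_fn):
    """At step time: derive features, verify the schema, then train the model."""
    feature_columns = derive_features(datasource, select)  # sample rows, infer feature columns
    verify(datasource, select, feature_columns, label)     # fail early on schema mismatches
    estimator = build_estimator(feature_columns, **model_params)
    estimator.train(input_fn=lambda: input_fn(datasource, select, feature_columns, label))
    # saving the trained model (e.g. into sqlfs) is omitted in this sketch
```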
> Since SQLFlow is a compiler, which doesn't care about execution, it seems that we should have a command-line compiler. What would be a good name for the compiler's binary file?
@wangkuiyi sqlflow is good.
> That `tensorflow.train` calls feature derivation and the verifier, and then trains the TensorFlow model.
I am afraid that if we go this way, we would move a lot of Go code into Python.
In this way, `sqlflowserver` may only do the following things:
- parse the SQL statements
- attribute checking
- monitor the workflow status
- codegen to call `tensorflow.train/predict/evaluate/...`
All other things would be done in Python, for example:
- database connection (fetch samples to do any verification or derivation).
- feature derivation.
- generate feature column API calls of TensorFlow/XGBoost.
- ...
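For instance, the Python-side feature derivation could look roughly like this; it assumes a DB-API connection and grossly simplifies the real type-inference rules:

```python
def derive_feature_columns(conn, select_stmt, sample_size=1000):
    """Sample the SELECT result and guess a feature column type per column:
    numeric columns stay numeric, everything else is treated as categorical.
    The real derivation has to handle many more cases (CSV strings, embeddings, ...)."""
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM (%s) AS t LIMIT %d" % (select_stmt, sample_size))
    names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
    columns = {}
    for i, name in enumerate(names):
        values = [row[i] for row in rows if row[i] is not None]
        if values and all(isinstance(v, (int, float)) for v in values):
            columns[name] = {"type": "numeric"}
        else:
            columns[name] = {"type": "categorical"}
    return columns
```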
Python code may be less maintainable than Go code.
The updated code structure is as follows, based on https://github.com/sql-machine-learning/sqlflow/issues/2494#issuecomment-647692915:
- move semantics from Go to the `runtime` Python package.
- remove the `basic` top folder.
- move Go code to the `go` folder.
|-go
| |--cmd
| | |--sqlflowserver // SQLFlow gRPC server
| | |--modelzooserver // SQLFlow Model Zoo gRPC server
| | `--sqlflow // SQLFlow command-line tool
| |--pkg
| | |--ir
| | |--parser
| | |--log
| | |--model
| | |--pipe
| | |--sqlfs
| | |--tar
| | |--test
| | |--codegen
| | | |--pai
| | | | |--tensorflow
| | | | |--xgboost
| | | | |--kmeans
| | | |--alisa
| | | |--tensorflow
| | | |--couler
| | | `--xgboost
| | |--server // SQLFlow server interface implementation
| | | |--proto
| | | |--run
| | | `--fetch
| | |--modelzoo
| | |--executor
| | | |--argo // Argo is workflow executor
| | | `--python // Python is workflow step executor
|-python
| |--sqlflow.runtime
| | |--pai
| | | `--tensorflow/xgboost/shap
| | |--alisa
| | |--tensorflow
| | |--xgboost
| | |--feature_derivation
| | |--verifier
| | `--db_writer
| `--couler
`-java
The following tasks should be done:
- move Go code to the `go` folder.
- move feature derivation to the `runtime` Python package.
- move the verifier to the `runtime` Python package.
- update codegen based on the new feature derivation and verifier code.
- move the `pai`/`alisa` `Submitter` to the `runtime` Python package and only keep Python as the workflow step interpreter.
> Python code may be less maintainable than Go code.
@sneaxiy I agree with that. In my opinion, we can keep the Go packages `feature_derivation`/`verifier`/`sqlfs` and export them as a Python API, so that we can call them from the Python `runtime` package. What do you think?
> @sneaxiy I agree with that. In my opinion, we can keep the Go packages `feature_derivation`/`verifier`/`sqlfs` and export them as a Python API, so that we can call them from the Python `runtime` package. What do you think?
@Yancey1989 It may be more complex. Let us find some ways to make the Python code more maintainable, such as improving code coverage, etc.
TODOs for SQLFlow compiler refactor:
- Move `feature_derivation` to the `runtime` Python package.
- Separate `verifier` into two parts (see the sketch after this list):
  - Compile time does the attribute checking.
  - Runtime verifies the data schema.
- Refactor the existing code generators:
  - update them based on the new feature derivation and verifier code.
  - add `codegen/pai` to generate the PAI submitter program.
  - add `codegen/alisa` to generate the Alisa submitter program.
- Move the workflow step `response` Go package to Python.
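A minimal sketch of the runtime half, assuming a DB-API connection; the compile-time attribute checker would stay in the Go `attribute` package:

```python
def verify_schema(conn, select_stmt, expected_columns):
    """Runtime data-schema verification: check that every column the compiled
    statement refers to actually appears in the SELECT result."""
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM (%s) AS t LIMIT 1" % select_stmt)
    actual = {desc[0] for desc in cursor.description}
    missing = set(expected_columns) - actual
    if missing:
        raise ValueError("columns %s not found in the SELECT result" % sorted(missing))
```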
There are two main problems with the above plans:
- Python code is harder to maintain than Go. We can do the following to address it:
  - Follow the Google code style.
  - Improve code coverage.
  - Use type-checking tools, e.g. pytype, to check Python types statically.
- ROI. We would have to move a lot of Go code to Python, which takes about two man-months; should we do that immediately?
Supply a Python db-api to access alisa. @Yancey1989
> - add `codegen/pai` to generate the PAI submitter program.
During workflow compilation, for the TO RUN statement, we will have the flexibility to generate a different command-line call for the step according to the deployment platform and the execution program. Upgrade the compiler to generate different step code according to these two (or more) variables, for example (a dispatch sketch follows this list):
- Vanilla Kubernetes && Python program: `python a_python_program.py param_a param_b`
- Vanilla Kubernetes && executable binary: `an_execution_binary param_a param_b`
- MaxCompute && Python program: `alisa.submitter a_python_program.py param_a param_b`
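A dispatch sketch mirroring the examples above; the platform and program-kind names are placeholders, not settled identifiers:

```python
def step_command(platform, program_kind, program, params):
    """Pick the step's command line by deployment platform and execution program."""
    if platform == "kubernetes" and program_kind == "python":
        return ["python", program] + params
    if platform == "kubernetes" and program_kind == "binary":
        return [program] + params
    if platform == "maxcompute" and program_kind == "python":
        return ["alisa.submitter", program] + params
    raise ValueError("unsupported combination: %s && %s" % (platform, program_kind))

# e.g. step_command("maxcompute", "python", "a_python_program.py", ["param_a", "param_b"])
# -> ["alisa.submitter", "a_python_program.py", "param_a", "param_b"]
```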