[DISCUSSION][java] Architecture Discussion for Java Repository
Describe the enhancement requested
We need to have a focused discussion on the overall architecture of the Java repository. While we’ve had preliminary conversations in comments (#738, #739), those discussions lack a consolidated summary and clear direction.
To move forward, we should revisit the key architectural details and ultimately produce a design document that will guide the development of the remaining work.
Component(s)
Java
I have some questions:
Q1: Is the I/O granularity at the chunk level?
For example, in java-io-parquet, does each I/O operation read one entire chunk, with filtering and computation within the chunk then done in memory?
Q2: After the data is read into memory, what in-memory structure should be used: Arrow, or a new class such as GarTable?
I have some questions: Q1: Is the I/O granularity at the chunk level? For example, in java-io-parquet, does each I/O operation read one entire chunk, with filtering and computation within the chunk then done in memory?
In that case we lose all the benefits of GAR (except better compression). Because GAR data is sorted by IDs and we also have an index table, we can benefit from pushdowns: if, for example, a user wants to read only the 2-hop neighborhood of a small subset of IDs in the graph, we should first check the index table and the per-column min-max statistics stored in the headers of the Parquet files to skip most of the chunks. Reading everything and filtering in memory sounds crazy to me, like, why?
I think we should first build a proper AST of the whole query, then apply all the optimizations, like pushing filters down to the reader, and use this information to filter chunks by analyzing the headers and/or the index table.
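For illustration, a minimal sketch of the chunk pruning described above could look like this; ChunkStats, ChunkPruner and the way the statistics are obtained are hypothetical, nothing like them exists in the repository yet:

import java.util.ArrayList;
import java.util.List;

// Hypothetical per-chunk metadata, built from the index table or Parquet footer statistics.
class ChunkStats {
    final int chunkIndex;
    final long minId; // smallest vertex id stored in the chunk
    final long maxId; // largest vertex id stored in the chunk

    ChunkStats(int chunkIndex, long minId, long maxId) {
        this.chunkIndex = chunkIndex;
        this.minId = minId;
        this.maxId = maxId;
    }
}

class ChunkPruner {
    // Keep only the chunks whose [minId, maxId] range can contain one of the
    // requested ids; everything else is skipped without any I/O.
    static List<Integer> selectChunks(List<ChunkStats> stats, List<Long> requestedIds) {
        List<Integer> selected = new ArrayList<>();
        for (ChunkStats s : stats) {
            for (long id : requestedIds) {
                if (id >= s.minId && id <= s.maxId) {
                    selected.add(s.chunkIndex);
                    break;
                }
            }
        }
        return selected;
    }
}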
Pushdown would be very good if we could do it, and we did implement it in the C++ library. But we may have many storage backends and formats; for example, chunks on S3 may be stored in CSV format, so it seems that we cannot read the specified rows through an offset there.
I'm thinking about a more general interface for the java-io-common module.
Or, if the store can't push down, we can fall back to in-memory processing.
I think "common" should pushdown to "io" as much as possible and "io" will do or not do optimizations. io-parquet can scan headers and use index table (to skip whol parquer row groups), but csv can only get offsets of rows from index table (to skip rows without parsing) and connot do min-max skipping.
Something like Array[Filter] pushDownPredicates and Array[String] pushDownProjections, where the concrete io implementation decides what to do with them.
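As a rough illustration of that split (everything here, including ChunkReader and CommonLayer, is a placeholder name rather than a proposed final API), "common" could always hand the predicates to "io" and fall back to in-memory filtering when the backend cannot apply them:

import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// R stands in for whatever row type the io layer returns.
interface ChunkReader<R> {
    // true if this backend can evaluate the predicates itself
    // (e.g. Parquet can use row-group statistics, plain CSV cannot)
    boolean supportsPushdown();

    List<R> readChunk(int chunkIndex,
                      List<Predicate<R>> pushDownPredicates,
                      List<String> pushDownProjections);
}

class CommonLayer {
    // "common" always hands the predicates to "io"; if the backend cannot push
    // them down, we fall back to filtering the returned rows in memory.
    static <R> List<R> read(ChunkReader<R> reader, int chunkIndex,
                            List<Predicate<R>> predicates, List<String> projections) {
        List<R> rows = reader.readChunk(chunkIndex, predicates, projections);
        if (reader.supportsPushdown()) {
            return rows;
        }
        return rows.stream()
                   .filter(r -> predicates.stream().allMatch(p -> p.test(r)))
                   .collect(Collectors.toList());
    }
}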
Do you think the io level should act as a proxy for the underlying I/O libraries? That is, an interface similar to the following is provided at the io level:
interface FileReader {
    List<Row> readFile(URI fileUri, Filters filters, Projections projections /* , ... */);
}

class ParquetReaderProxy implements FileReader {
    @Override
    public List<Row> readFile(URI fileUri, Filters filters, Projections projections /* , ... */) {
        // use the apache-parquet library to read the file
    }
}
and at higher levels we provide an access interface:
class PropertyGroupReader {
    FileReader fileReader;

    List<Row> readFromFile(int chunkIndex, PropertyGroup pg, URI baseUri) {
        // use baseUri to open the file
        // use chunkIndex to decide which chunk to read (and to push down)
        // use pg to decode the properties
        return fileReader.readFile(..., ..., ...);
    }
}
What do you think about building a whole AST of the user query at the top level, in PropertyGroupReader (or even one level above)? That would allow us to determine the filters/projections and push them down to readFile(URI fileUri, Filters filters, Projections projections, ...).
Or we can postpone it but leave a place for it in the future.
I think the AST is highly query-related, but at the moment we only have simple traversal query requirements. Therefore, I think it can be built later as an "optimization" feature once the high-level functions are completed.
My point is: let's leave a place for it. For now we can have dummy placeholders.
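For example, a dummy placeholder could be as small as a query-plan object that currently always means "read everything" (QueryPlan and its fields are made-up names, just to show where a future AST-based optimizer would plug in):

import java.util.Collections;
import java.util.List;

class QueryPlan {
    final List<String> projections; // empty = keep all columns
    final List<String> filters;     // empty = no filtering; later: pushed-down predicates

    QueryPlan(List<String> projections, List<String> filters) {
        this.projections = projections;
        this.filters = filters;
    }

    // Today every read uses this; an AST-based planner could fill in real plans later.
    static QueryPlan readEverything() {
        return new QueryPlan(Collections.emptyList(), Collections.emptyList());
    }
}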
And it matters even for simple traversal queries! For example, in a graph where people like movies, if I want to do something like MATCH (:people {id: [1, 2, 3]}) -[likes]-> (:movies {name: [Terminator, Matrix]}) RETURN people, it is definitely not a bad idea to skip all the people chunks that do not contain vertices with id 1, 2, 3. The same is true for the likes edges: we can simply skip most of the Parquet chunks and read into memory only the chunks related to edges that start from 1, 2, 3, etc.
Otherwise we may read millions of nodes and billions of edges into memory just to scan them and filter most of them out...
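As a tiny sketch of why the skipping is cheap, assuming the usual GAR layout where vertex chunks have a fixed chunk size and are sorted by internal id (VertexChunkLocator is a made-up name):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class VertexChunkLocator {
    // Because the vertex data is sorted by id and split into fixed-size chunks,
    // the chunk that holds a given id is simply id / chunkSize, so the chunks for
    // {1, 2, 3} can be computed without touching the data at all.
    static List<Long> chunksForIds(List<Long> ids, long chunkSize) {
        TreeSet<Long> chunks = new TreeSet<>(); // sorted and de-duplicated
        for (long id : ids) {
            chunks.add(id / chunkSize);
        }
        return new ArrayList<>(chunks);
    }
}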
I agree to leave a place for it.
I think it's enough to leave an interface/abstract class at the api level; we don't need to add a new level/module.
The main job of GAR is to provide a storage management format, not a computing engine.
For Q2, I think we can provide an abstract interface, and different in-memory representations can implement it.
interface Row {
    <T> T getValue(long columnIndex);
}

class ArrowProxy implements Row {
    @Override
    public <T> T getValue(long columnIndex) {
        // use the Arrow library to decode the value at columnIndex
    }
}
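A brief, hypothetical usage sketch of that interface; the caller does not need to know whether the rows are backed by Arrow or by some new structure such as a GarTable:

import java.util.List;

class RowUsageExample {
    static void printFirstColumn(List<Row> rows) {
        for (Row row : rows) {
            String value = row.getValue(0); // assumes column 0 holds a String property
            System.out.println(value);
        }
    }
}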
To summarize the proposed layering:
I/O Layer
- readFile(fileUri, filters, projections, chunkIndex) – read data from a file.
- supportsPushdown() – check if filters/projections can be applied.
High-Level Layer
- readFromFile(chunkIndex, propertyGroup, baseUri) – read a specific chunk.
- readAll(propertyGroup, baseUri) – read all data.
Memory Layer
- getValue(columnIndex) – get value from a row.
- getSchema() – get column types.
Future Optimization (placeholder)
- buildAST(query) – plan for query optimization.
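For the design document, one possible consolidation of this summary into Java interfaces could look like the following; all names, including the empty placeholder types, are proposals rather than an existing API:

import java.net.URI;
import java.util.List;

// Placeholder types carried over from the sketches above.
class Filters {}
class Projections {}
class PropertyGroup {}
class Schema {}
class Query {}
class QueryPlan {}

interface FileReader {            // I/O layer
    boolean supportsPushdown();
    List<Row> readFile(URI fileUri, Filters filters, Projections projections, int chunkIndex);
}

interface PropertyGroupReader {   // high-level layer
    List<Row> readFromFile(int chunkIndex, PropertyGroup pg, URI baseUri);
    List<Row> readAll(PropertyGroup pg, URI baseUri);
}

interface Row {                   // memory layer
    <T> T getValue(long columnIndex);
    Schema getSchema();
}

interface QueryPlanner {          // future optimization (placeholder)
    QueryPlan buildAST(Query query);
}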