databend icon indicating copy to clipboard operation
databend copied to clipboard

Specification of type system

Open leiysky opened this issue 4 years ago • 11 comments

It's important to provide a well defined type system.

In ANSI SQL specification, there are many data types, while all of the databases have their own type systems that always have difference fron ANSI SQL.

The most complicated part is the implicit conversion between data types.

For example, there isn't a real BOOLEAN type in MySQL. They represents BOOLEAN with TINYINT(1), thus literal TRUE value is actually 1 with type TINYINT.

Besides, MySQL allows implicit conversion from string type to float, that is, if you write down a SQL query like SELECT SUM(char_column) FROM table, MySQL can still produce a bizzare result(it will implicitly convert char_column into a double_column, and the conversion rule is not well defined).

While PostgreSQL do have BOOLEAN type, and it doesn't allow non-reasonable implicit conversion.

We've discussed about this in https://github.com/datafuselabs/datafuse/issues/1492, and the detailed discussion about type system will be finished in this issue.

leiysky avatar Aug 19 '21 05:08 leiysky

cc @sundy-li

leiysky avatar Aug 19 '21 05:08 leiysky

Thanks for the issue, I think in the alpha version, the types we must have are:

PrimitiveTypes:

  • UInt8
  • UInt16
  • UInt32
  • UInt64
  • Int8
  • Int16
  • Int32
  • Int64
  • Float32
  • Float64

UInt8/Int8 aka tinyint, UInt16/Int16 aka smallint, UInt32/Int32 aka int, UInt64/Int64 aka bigint, Float32 aka float, Float64 aka double.

Boolean

  • Boolean Some database system takes UInt8/tinyint(1) as Boolean, but In Datafuse I think it's ok to have a native boolean type, which stores one value as a bit. I think it can work as well as UInt8.

String, Binary

  • Utf8 <---> LargeUtf8 in arrow2
  • Binary <---> LargeBinary in arrow2

Binary can cover the case of Utf8, seems we can keep Binary type and allow users to add charset constraints to the type.

Complex types

  • List(Box<DataField>) <---> LargeList in arrow2
  • Struct(Vec<DataField>)

List aka Array, Struct aka tuple.

Date, DateTime Types

Date, DateTime types are logic types, we can map these types into primitive types. In Arrow's date Specification, they are signed types. I don't thinks we must follow arrow's specification, we can have our own date/datetime types.

  • Date16 <--> UInt16

  • Date32 <--> UInt32

  • Date64 <--> UInt64

  • DateTime32 <--> UInt32

  • DateTime64 <--> UInt64

sundy-li avatar Aug 19 '21 07:08 sundy-li

PrimitiveTypes:

Maybe we should involve DECIMAL type as well?

leiysky avatar Aug 19 '21 07:08 leiysky

Note that Utf8 and Binary have a maximum number of characters (over all items) of i32:MAX; this is not a per-item constraint; sum over all characters over all items < i32:MAX.

wrt to not following the arrow spec, the constraint there is that if you need FFI (e.g. UDFs written in python for ML and things like that), it will be more difficult to integrate those types. I am not sure that is a problem per se; just mentioning.

jorgecarleitao avatar Aug 19 '21 08:08 jorgecarleitao

Maybe we should involve the DECIMAL type as well?

Yes, we can add it later, add Decimal(P, S) after the alpha version.

Note that Utf8 and Binary have a maximum number of characters (over all items) of i32:MAX; this is not a per-item constraint; sum over all characters over all items < i32:MAX.

Thanks, in Datafuse, we take Utf8 and Binary as LargeUtf8 and LargeBinary in arrow2. Because using i32 to represent the offset is really not enough for large data. I even want to change to offset to be u64 instead of i64, but it will destroy the arrow spec.

the constraint there is that if you need FFI

That's true, still need some considerations and discussions about that.

sundy-li avatar Aug 19 '21 08:08 sundy-li

out of curiosity, why 65535? I though that that was the max of u16:

fn main() {
    println!("i32::MAX: {}", i32::MAX);
    println!("i16::MAX: {}", i16::MAX);
    println!("u16::MAX: {}", u16::MAX);
}

prints

i32::MAX: 2147483647
i16::MAX: 32767
u16::MAX: 65535

jorgecarleitao avatar Aug 19 '21 08:08 jorgecarleitao

out of curiosity, why 65535

Oh, that's my bad.

sundy-li avatar Aug 19 '21 08:08 sundy-li

Hi @sundy-li, it seems this specification basically follows the DataType defined here: https://github.com/datafuselabs/datafuse/blob/681efe24288e41adacd9e5ce978f361a6d37ab42/common/datavalues/src/types/data_type.rs#L23-L51

I have need for this in #1538 and #1537 , can I just treat it as a stable API and rely on it?

leiysky avatar Aug 20 '21 06:08 leiysky

Can I just treat it as a stable API and rely on it?

Yes, currently we treat it as our own type.

sundy-li avatar Aug 20 '21 07:08 sundy-li

FYI @andylokandy

Xuanwo avatar Jul 02 '22 12:07 Xuanwo

@leiysky, maybe we can close this issue now?

Xuanwo avatar Sep 16 '22 06:09 Xuanwo