c3c
Help getting headers working
I added a compiler option, --emit-c-headers, to emit C headers for the entire project. The problem is that it's clearly incomplete; I only got part of the way.
Currently I got stuck in static void header_print_type(FILE *file, Type *type) because there's no handling of typedefs.
When I tried
case TYPE_TYPEDEF:
OUTPUT("%s", type->name); //this clearly is the typedef's name, but I need to output the typedef somewhere
//header_print_type(file, type->decl->typedef_decl.type_info->type); //digging through helpers this seemed like a good shot
return;
I got the resulting c struct which seems right
struct lexer_TokenInfo__
{
usize offset;
usize length;
int32_t col;
int32_t line;
};
I suspect what's needed although I haven't worked it out is what really belongs in here:
static void header_gen_typedef(FILE* file, int indent, Decl* decl)
{
if (decl->extname)
OUTPUT("typedef struct %s__ %s;\n", decl->extname, decl->extname);
}
The resulting file is clearly having problems with usize and isize etc...
Solved the issue with gotos referencing other code I found, I'll slowly work through this:
case TYPE_TYPEDEF:
type = type->canonical;
goto RETRY;
usize should be replaced by size_t, so you can do a check: if (type == type_usize) ... to print the right thing. Do that for void*, usize, isize, iptr, uptr, iptrdiff, uptrdiff
This needs to be done before resolving the typedef.
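A minimal sketch of that check, assuming a made-up SketchKind enum in place of the compiler's real type objects (the actual code would compare against type_usize and friends). The C spellings chosen for isize, iptrdiff and uptrdiff are guesses, not something settled in this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the compiler's builtin alias types. */
typedef enum {
    SK_USIZE, SK_ISIZE, SK_UPTR, SK_IPTR,
    SK_UPTRDIFF, SK_IPTRDIFF, SK_VOIDPTR, SK_OTHER
} SketchKind;

/* Returns the C spelling for a builtin alias, or NULL when the caller
 * should fall through and resolve the typedef to its canonical type. */
static const char *sketch_c_name(SketchKind kind)
{
    switch (kind)
    {
        case SK_USIZE:    return "size_t";
        case SK_ISIZE:    return "ptrdiff_t";  /* assumption */
        case SK_UPTR:     return "uintptr_t";
        case SK_IPTR:     return "intptr_t";
        case SK_UPTRDIFF: return "size_t";     /* assumption */
        case SK_IPTRDIFF: return "ptrdiff_t";  /* assumption */
        case SK_VOIDPTR:  return "void*";
        case SK_OTHER:    break;
    }
    return NULL;
}

/* Copies the chosen name into out; returns 1 if a builtin mapping was
 * used, 0 if the canonical name was used instead. */
static int sketch_emit(char *out, size_t cap, SketchKind kind,
                       const char *canonical_name)
{
    assert(out != NULL && cap > 0 && canonical_name != NULL);
    const char *mapped = sketch_c_name(kind);
    const char *src = mapped ? mapped : canonical_name;
    strncpy(out, src, cap - 1);
    out[cap - 1] = '\0';
    return mapped != NULL;
}
```

The point of the ordering is that the alias check runs first, so usize never gets flattened to a fixed-width integer by the canonical lookup.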
Any thoughts on how you'd handle function pointers / arrays?
I'm realizing that in C3 you seem to have to define function pointers (which appear to be typedefs). So it might be simpler to print the typedef's name wherever it's used and try to keep all typedefs near the top of the file.
Do you have a naming scheme for slices and other array-like structs? I think I've managed to output C-esque types. I'll need to reorder how decls are evaluated so those get handled first.
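For illustration, this is roughly what "typedef near the top, name everywhere else" could look like in the emitted C. All names here are hypothetical and not taken from the compiler's output:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emitted once, near the top of the header. */
typedef int32_t (*sketch_CompareFn)(int32_t a, int32_t b);

/* Later declarations refer to the function pointer by its typedef name. */
struct sketch_Sorter__ { sketch_CompareFn cmp; };

static int32_t sketch_apply(struct sketch_Sorter__ s, int32_t a, int32_t b)
{
    assert(s.cmp != NULL);
    return s.cmp(a, b);
}

/* A function matching the typedef'd signature. */
static int32_t sketch_diff(int32_t a, int32_t b) { return a - b; }
```

This avoids spelling out the nested function pointer declarator syntax at every use site, which is where hand-written printers tend to go wrong.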
I haven't decided on anything. Try things out and we'll see what works.
Have a reserved identifier schema I could use to avoid collisions with user structs?
Right now I'm thinking of encoding subarrays and flexible arrays with names like:
struct __subarray_<base64_name_of_c3_type / base64_name_of_c_type>
struct __flexarray_<base64_name_of_c3_type / base64_name_of_c_type>
I think collisions are ok to be honest. Let's add it and then have ways to resolve them. For most user defined types there is @extname to redefine it already. And @extname on modules would solve the problem where my_module_foo collides with something else and you want c3_my_module_foo or something else.
Slight problem: I realized base64 is guaranteed to produce at least one character that doesn't fit within 0-9a-zA-Z_, since that identifier alphabet has only 63 characters. I'll probably work out a different encoding to use tomorrow.
I would not recommend using base64. For structs, the extname is expected. For subarrays:
I'm slightly confused, are slices / flexible arrays not both internal structs which should be typed? I might do:
struct _slice_<base32_name_of_c3_type>
When I was trying the header_gen algorithm the flexible array case did get hit, is it supposed to be equivalent to the array case?
For this:
case TYPE_FLEXIBLE_ARRAY:
TODO
case TYPE_SUBARRAY:
TODO
Replace with:
case TYPE_FLEXIBLE_ARRAY:
header_print_type(file, type->array.base);
OUTPUT("[]");
break;
case TYPE_SUBARRAY:
OUTPUT("c3_subarray");
break;
Then elsewhere define:
#ifndef c3_subarray
typedef struct c3_subarray__ { void* ptr; size_t len; } c3_subarray;
#endif
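One caveat with the guard above: #ifndef only tests preprocessor macros, so it cannot see a typedef name. A sketch of the same idea using a separate guard macro (C3_SUBARRAY_DEFINED is a made-up name, not something c3c emits):

```c
#include <assert.h>
#include <stddef.h>

#ifndef C3_SUBARRAY_DEFINED
#define C3_SUBARRAY_DEFINED
typedef struct c3_subarray__ { void *ptr; size_t len; } c3_subarray;
#endif

/* A second copy of the same text, as if another generated header were
 * included, is skipped instead of redefining the struct. */
#ifndef C3_SUBARRAY_DEFINED
#define C3_SUBARRAY_DEFINED
typedef struct c3_subarray__ { void *ptr; size_t len; } c3_subarray;
#endif

/* Hypothetical convenience constructor for the untyped slice. */
static c3_subarray sketch_make(void *ptr, size_t len)
{
    assert(ptr != NULL || len == 0);
    c3_subarray s = { ptr, len };
    return s;
}
```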
Well, that's what I was getting at with the typing issue: if { void* ptr; size_t len; } is used, the C compiler will know that ptr and len exist, but the element type is lost, so the function signatures written against it might break things. The ptr member should match the element type.
Hence why I was going to do roughly what you're suggesting except uniquely naming a struct with the corresponding type.
struct _subarray_<an_encoding_of_the_c_type> { c_type *ptr; size_t len; }
Then, provided I've written an algorithm that can actually write out C types properly, it's just a matter of introducing that struct into the compiler, and it should get printed as part of the header.
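To make the typed variant concrete, here is a small sketch that declares one slice struct per element type. The macro and struct names are hypothetical; a generator would print the expanded structs directly rather than use a macro:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one slice struct per element type. */
#define SKETCH_DEFINE_SUBARRAY(name, elem_type) \
    struct name { elem_type *ptr; size_t len; }

SKETCH_DEFINE_SUBARRAY(sketch_subarray_int32, int32_t);
SKETCH_DEFINE_SUBARRAY(sketch_subarray_u8, uint8_t);

/* Because ptr is typed, callers can index it without casting. */
static int32_t sketch_sum(struct sketch_subarray_int32 s)
{
    assert(s.ptr != NULL || s.len == 0);
    int32_t total = 0;
    for (size_t i = 0; i < s.len; i++) total += s.ptr[i];
    return total;
}
```

Passing a sketch_subarray_u8 where a sketch_subarray_int32 is expected becomes a compile error, which is the type safety the untyped void* version gives up.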
Digging around across github I found a base62 implementation*, I rewrote it in c. I'm sure there's other better bithackery approaches but this version seemed fairly simple. Since there's only a few additional characters in the type system compared to base62, I'm sure there's a better way to encode the typenames safely.
https://gist.github.com/Andersama/dd214ee85c8fefb9cb7f36f3b03c11d2
Identifiers (a-zA-Z0-9_) would technically be base63; base62 is alphanumeric only (a-zA-Z0-9). Whitespace isn't required, so the encoder could be rewritten to skip it. The additional characters required would be [, ] and *. If you considered C++ as a potential output, references (& and &&) would only need & as an additional character, <> would be needed for templates, and : for namespaces. I'm likely forgetting something, but worst case the required alphabet would be something like base70. Then again, since C++ output would allow all of those characters, you wouldn't need a specialized function like this.
If I'm not mistaken, C type names are technically base66 once whitespace is removed (the 63 identifier characters plus [, ] and *), so with a small edit the output shouldn't be much longer than the input.
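As a point of comparison, and not the base62 encoder from the gist, a simple escape scheme also yields identifier-safe names: alphanumerics pass through and every other byte (including '_' itself, so the result stays unambiguous) becomes '_' plus two hex digits. The function name is hypothetical:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Writes an identifier-safe encoding of in to out.
 * Returns the encoded length, or -1 if out is too small. */
static int sketch_mangle(const char *in, char *out, size_t cap)
{
    static const char hex[] = "0123456789abcdef";
    assert(in != NULL && out != NULL && cap > 0);
    size_t n = 0;
    for (; *in; in++)
    {
        unsigned char c = (unsigned char)*in;
        if (isalnum(c))
        {
            /* Keep room for this character and the terminator. */
            if (n + 1 >= cap) return -1;
            out[n++] = (char)c;
        }
        else
        {
            /* Keep room for the 3-character escape and the terminator. */
            if (n + 3 >= cap) return -1;
            out[n++] = '_';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 15];
        }
    }
    out[n] = '\0';
    return (int)n;
}
```

With this scheme "int*" becomes "int_2a". Names made mostly of identifier characters grow very little, at the cost of tripling every punctuation byte.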
Ok, took me a few days to work out a base62 algorithm. Could you do an overview of your type system? EG: What "failable" and "fault" types look like? I'm assuming those are structs as well. Is the "vector" type a c++ "vector" or an "llvm" vector like a simd register?
Here's some example output of the base62 encoder
struct lexer_TestSlice__
{
struct __subarray_c3RydWN0IGxleGVyX1Rva2VuX18QA tkns;
};
typedef struct lexer_TestDynSlice__ lexer_TestDynSlice;
struct lexer_TestDynSlice__
{
uint32_t member_0;
struct __flexarray_c3RydWN0IGxleGVyX1Rva2VuX18QA tkns;
};
Here's the branch I'm working off of: https://github.com/Andersama/c3c/tree/gen-headers-on-build
Getting much further, but I've run into this:
/* function */
void scan_nothing(struct std_array_list$$lexer_Token_List__*, struct std_array_list$$lexer_TokenInfo_List__*, struct std_array_list$$lexer_TokenData_List__*, struct std_array_list$$char_List__*, struct lexer_TokenInfo__*);
It looks as though the parameterized modules are outputting $$ which I'm not sure is legal for c syntax.
fault should just be typedef uintptr_t OptionalResult
The failable (optional) type is a bit more complicated. A parameter cannot be an optional but a return value can. In that case, move the normal result into the first position as an out value. So:
fn int! getFoo(void* z) => OptionalResult getFoo(int *result, void *z)
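A sketch of how the C side of that signature might be used. The function body, the fault constant, and the convention that 0 means "no fault" are assumptions made for illustration; real fault values would come from the compiled C3 code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t OptionalResult;  /* 0 is assumed to mean "no fault" */

/* Hypothetical fault value for this sketch only. */
#define SKETCH_FAULT_NULL_ARG ((OptionalResult)1)

/* C view of: fn int! getFoo(void* z)
 * The normal result moves into the first parameter as an out value. */
static OptionalResult getFoo(int *result, void *z)
{
    assert(result != NULL);
    if (z == NULL) return SKETCH_FAULT_NULL_ARG;
    *result = 42;
    return 0;
}
```

A C caller checks the return value first and only reads the out parameter when no fault was reported.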
I don't know if it's suitable for parameterized types to be emitted into the C headers. $ is available as GCC extension so it will work for Clang/GCC at least.
And I'll have a look at your branch tomorrow
You'll probably make way more headway than I am, I only just barely started making sense of how the ast is structured and only because I'm looking at the llvm codegen for reference.
I thought you were doing the headers, but this looks like you're doing C3 -> C compilation?
Well...at the point I'd be outputting c headers which work for the compiled program I figured I might as well be able to output the function definitions (in equivalent C form).
Is there a context object similar to the LLVM one that the header.c file could use? I'm not sure if it's needed, but it looks like you're using the context object for some optimizations while you're using llvm. Otherwise what I'm doing so far seems like a fairly confusing tree walk with the ast which doesn't exactly match up with your llvm codegen.
Hey, I'm not sure, I might've found an llvm codegen bug in llvm_emit_const_expr:
case CONST_POINTER:
llvm_value_set(be_value, LLVMConstNull(llvm_get_type(c, type)), type);
For handling const_pointer this seems to be generating a null pointer, so I can only assume this is not for handling fixed addresses
I have successfully managed to convert some c3 functions into their c equivalents. Lot of things are still broken, but some basic functionality is there. I'm definitely picking up more of how the ast is structured.
/* function: is_second_idnt */
bool lexer_is_second_idnt(uint8_t c) {
return c >= 48 && c <= 57 || c >= 97 && c <= 122 || c >= 65 && c <= 90 || c == 95;
}
/* function: is_any_newline */
bool lexer_is_any_newline(uint8_t c) {
return c == 10 || c == 13 || c == 11 || c == 12;
}
/* function: is_eof */
bool lexer_is_eof(uint8_t c) {
return (uint32_t)c == 0 || (uint32_t)c == 26;
}
Making some pretty decent headway, if you'd like to take a look at the branch again. There is an apparent memory leak into the scratch buffer that I'll find eventually; it's slow enough that fairly large files can still output most of their contents.
Couple of things left to do:
- add the "includes" at the top of the file so that functions from other modules can be found
- properly handling member access: I just spit out . and the extname, but things break where -> should have been used instead
- fixing of failables and other c3 types that get introduced in places
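For the member access item, one possible shape is to pick the operator from whether the parent expression's type is a pointer. This is a sketch with hypothetical helper names, working on plain strings rather than the compiler's AST:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Picks the C member-access operator for the parent expression. */
static const char *sketch_access_op(bool parent_is_pointer)
{
    return parent_is_pointer ? "->" : ".";
}

/* Formats "parent<op>member" into out; returns snprintf's result, which
 * is the length the full text needs (excluding the terminator). */
static int sketch_emit_access(char *out, size_t cap, const char *parent,
                              bool parent_is_pointer, const char *member)
{
    assert(out != NULL && cap > 0);
    return snprintf(out, cap, "%s%s%s", parent,
                    sketch_access_op(parent_is_pointer), member);
}
```

In the real tree walk the pointer check would come from the type of the parent expression instead of a boolean passed in by hand.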
You shouldn't ever convert expressions or statements though @Andersama
Not sure what you mean. I'm more or less converting the c3 ast back to c. For just the headers you're right I likely don't need to convert expressions or statements.
It could easily be a flag to output function bodies or not.
Or, if you mean the output should be similar to LLVM IR instructions but written as C, I'm sure I could follow along with your LLVM codegen to make much more sense of it.
What you're doing if you're doing pure C output is essentially a C backend. Getting headers is so that C code can easily use C3 code that's either in .o or as libraries. So for header gen underlying expressions and statements should never be generated. If you are looking at doing a C backend, then that is something that could be added, but it should probably be organized with the normal codegen flow.
I can integrate what you did for the headers (types + function declarations + globals) into the main branch, but the expression and statement lowering is a huge work and should be moved to a separate codegen rather than the header gen.
Right, ok so we're on the same page. I'm not exactly too familiar with the codegen stage of compilers, I've written quite a few lexers and parsers, some type checking. You can probably tell I'm more or less mirroring what you've got for llvm. I was somewhat falling back on just trying to convert the ast as is and then working out later what lowering and optimizations might look like.
The additional builtin types are the pain point for me at the moment, because defining the functions after structs would be easy enough, but it doesn't seem like outside structs and enums the other types are collected together anywhere. If for example there were a vec of decl* 's to substructs or the other types that'd be great.
The header generation functionality is intended for creating static and dynamic libraries to be used from C/C++. So there should be no need to emit statements, and no need to do any lowering.