
Help getting headers working

Open Andersama opened this issue 3 years ago • 38 comments

I added a compiler option, --emit-c-headers, to emit headers for the entire project. The problem I ran into is that header generation is clearly incomplete. I only got partway.

Currently I got stuck in static void header_print_type(FILE *file, Type *type) because there's no handling of typedefs.

When I tried

case TYPE_TYPEDEF:
	OUTPUT("%s", type->name); //this clearly is the typedef's name, but I need to output the typedef somewhere
	//header_print_type(file, type->decl->typedef_decl.type_info->type); //digging through helpers this seemed like a good shot
	return;

I got the resulting c struct which seems right

struct lexer_TokenInfo__
{
   usize offset;
   usize length;
   int32_t col;
   int32_t line;
};

I suspect what's needed, although I haven't worked it out, is whatever really belongs in here:

static void header_gen_typedef(FILE* file, int indent, Decl* decl)
{
	if (decl->extname)
		OUTPUT("typedef struct %s__ %s;\n", decl->extname, decl->extname);
}

The resulting file is clearly having problems with usize and isize etc...

Solved the issue with a goto, following other code I found. I'll slowly work through this:

		case TYPE_TYPEDEF:
			type = type->canonical;
			goto RETRY;

Andersama avatar Aug 02 '22 12:08 Andersama

usize should be replaced by size_t, so you can do a check: if (type == type_usize) ... to print the right thing. Do that for void*, usize, isize, iptr, uptr, iptrdiff, uptrdiff

lerno avatar Aug 02 '22 12:08 lerno

This needs to be done before resolving the typedef.
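A minimal sketch of that ordering, using a hypothetical `ToyType` in place of the compiler's real `Type`: the alias check runs at every step before `canonical` is followed, so `usize` prints as `size_t` instead of its underlying integer type. Only the `usize` to `size_t` mapping comes from this thread; the other table entries are assumptions, and `uptrdiff` is left out because it has no obvious standard C spelling.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the compiler's Type struct. */
typedef struct ToyType {
    const char *name;                /* name as written in C3 */
    const struct ToyType *canonical; /* NULL when already canonical */
} ToyType;

/* Returns the C spelling for a C3 alias, or NULL if the name is not special. */
static const char *c_name_for_alias(const char *c3_name)
{
    static const struct { const char *c3; const char *c; } table[] = {
        { "usize",    "size_t"    }, /* from the thread */
        { "isize",    "ptrdiff_t" }, /* assumption */
        { "uptr",     "uintptr_t" }, /* assumption */
        { "iptr",     "intptr_t"  }, /* assumption */
        { "iptrdiff", "ptrdiff_t" }, /* assumption */
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(table[i].c3, c3_name) == 0) return table[i].c;
    }
    return NULL;
}

/* Checks for a special alias first, and only then resolves the typedef. */
static const char *c_spelling(const ToyType *type)
{
    for (;;) {
        const char *alias = c_name_for_alias(type->name);
        if (alias) return alias;
        if (!type->canonical) return type->name;
        type = type->canonical;
    }
}
```

If the typedef were resolved first, the `usize` name would already be gone by the time the check ran, which is why the order matters.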

lerno avatar Aug 02 '22 12:08 lerno

Any thoughts on how you'd handle function pointers / arrays?

I'm realizing that in C3 you apparently have to define function pointers (which appear to be typedefs). It might be simpler to print the typedef's name wherever it's used and try to keep all typedefs near the top of the file.

Andersama avatar Aug 02 '22 14:08 Andersama

Do you have a naming scheme for slices and other array-like structs? I think I've managed to output C-esque types. I'll need to reorder how decls are evaluated to handle those first.

Andersama avatar Aug 02 '22 20:08 Andersama

I haven't decided on anything. Try things out and we'll see what works.

lerno avatar Aug 02 '22 20:08 lerno

Have a reserved identifier schema I could use to avoid collisions with user structs?

Right now I'm thinking of encoding subarrays and flexible arrays with names like:

struct __subarray_<base64_name_of_c3_type / base64_name_of_c_type>
struct __flexarray_<base64_name_of_c3_type / base64_name_of_c_type>

Andersama avatar Aug 02 '22 21:08 Andersama

I think collisions are ok to be honest. Let's add it and then have ways to resolve them. For most user defined types there is @extname to redefine it already. And @extname on modules would solve the problem where my_module_foo collides with something else and you want c3_my_module_foo or something else.

lerno avatar Aug 02 '22 22:08 lerno

Slight problem: I realized base64 output can contain characters outside 0-9a-zA-Z_ (the standard alphabet includes + and /, plus = for padding), so it isn't safe for C identifiers. I'll probably work out a different encoding to use tomorrow.

Andersama avatar Aug 03 '22 05:08 Andersama

I would not recommend using base64. For structs, the extname is expected. For subarrays: _slice is fine. There is no need to name the flexible array members.

lerno avatar Aug 03 '22 08:08 lerno

I'm slightly confused, are slices / flexible arrays not both internal structs which should be typed? I might do:

struct _slice_<base32_name_of_c3_type>

When I was trying the header_gen algorithm the flexible array case did get hit, is it supposed to be equivalent to the array case?

Andersama avatar Aug 03 '22 23:08 Andersama

For this:

case TYPE_FLEXIBLE_ARRAY:
  TODO
case TYPE_SUBARRAY:
  TODO
  

Replace with:

case TYPE_FLEXIBLE_ARRAY:
   // Print the element type, then an empty bound for the flexible member.
   header_print_type(file, type->array.base);
   OUTPUT("[]");
   break;
case TYPE_SUBARRAY:
   OUTPUT("c3_subarray");
   break;

Then elsewhere define:

#ifndef C3_SUBARRAY_DEFINED
#define C3_SUBARRAY_DEFINED /* guard macro: typedef names are invisible to #ifndef */
typedef struct c3_subarray__ { void* ptr; size_t len; } c3_subarray;
#endif

lerno avatar Aug 03 '22 23:08 lerno

Well that's what I was trying to get at about the typing issue, if { void* ptr, size_t len} is used the c compiler's going to know that ptr and len exist, but when the signatures are written it might break things. The void* ptr member should match up with the type.

Hence why I was going to do roughly what you're suggesting except uniquely naming a struct with the corresponding type.

struct _subarray_<an_encoding_of_the_c_type> { c_type *ptr; size_t len; };

Then...provided I've written an algorithm that can actually properly write out C types, it's just a matter of introducing that struct into the compiler and it should just get printed as part of the header.
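As a rough illustration of the per-element-type struct described above, here is a sketch that formats one such definition into a buffer. `format_typed_slice` is a hypothetical helper, not something from the branch, and it assumes `encoded` has already been turned into a valid identifier suffix.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats "struct _subarray_<encoded> { <c_type> *ptr; size_t len; };" into buf.
   Hypothetical helper: `encoded` is assumed to be identifier-safe already.
   Returns 1 on success, 0 if buf is too small or formatting failed. */
static int format_typed_slice(char *buf, size_t cap,
                              const char *c_type, const char *encoded)
{
    int n = snprintf(buf, cap,
                     "struct _subarray_%s { %s *ptr; size_t len; };\n",
                     encoded, c_type);
    return n >= 0 && (size_t)n < cap;
}
```

Compared with a single `void*` based struct, this keeps the element type visible to the C compiler at the cost of one struct definition per distinct element type.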

Andersama avatar Aug 04 '22 01:08 Andersama

Digging around across github I found a base62 implementation*, I rewrote it in c. I'm sure there's other better bithackery approaches but this version seemed fairly simple. Since there's only a few additional characters in the type system compared to base62, I'm sure there's a better way to encode the typenames safely.

https://gist.github.com/Andersama/dd214ee85c8fefb9cb7f36f3b03c11d2

Considering identifiers, a-zA-Z0-9_ technically would be base63; base62 is alphanumeric only, so a-zA-Z0-9. Technically speaking whitespace is not required, so the encoder could be rewritten to just skip it. The additional characters required would be []*. If you considered C++ as a potential output, references (& and &&) would only need & as an additional character, <> would be needed for templates, and : for namespaces. I'm likely forgetting something, but worst case the alphabet required would be something like base70. Since outputting to C++ would allow all those characters, you wouldn't need a specialized function like this.

If I'm not mistaken technically c's base66, provided whitespace is removed, so with a small edit the length of the output shouldn't be much longer than the input.
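For comparison, a simpler alternative to base62 is plain escaping: pass alphanumerics through, skip whitespace, and write every other byte (including `_` itself, so decoding stays unambiguous) as `_` plus two hex digits. This is a sketch of that swapped-in technique, not the algorithm from the gist; `encode_ident` is a hypothetical name.

```c
#include <ctype.h>
#include <stddef.h>

/* Writes an identifier-safe encoding of `in` to `out` (capacity `cap`,
   including the terminating NUL). Returns the number of characters written,
   or 0 if `out` is too small (an empty input also yields 0). */
static size_t encode_ident(const char *in, char *out, size_t cap)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;
    for (; *in; in++) {
        unsigned char ch = (unsigned char)*in;
        if (isspace(ch)) continue;            /* whitespace carries no information here */
        if (isalnum(ch)) {
            if (n + 1 >= cap) return 0;       /* keep room for the NUL */
            out[n++] = (char)ch;
        } else {
            if (n + 3 >= cap) return 0;
            out[n++] = '_';                   /* escape marker */
            out[n++] = hex[ch >> 4];
            out[n++] = hex[ch & 15];
        }
    }
    if (n >= cap) return 0;
    out[n] = '\0';
    return n;
}
```

With this scheme `int *` becomes `int_2A` and `char[]` becomes `char_5B_5D`, so the output only grows where a non-alphanumeric character appears.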

Andersama avatar Aug 04 '22 06:08 Andersama

Ok, took me a few days to work out a base62 algorithm. Could you do an overview of your type system? EG: What "failable" and "fault" types look like? I'm assuming those are structs as well. Is the "vector" type a c++ "vector" or an "llvm" vector like a simd register?

Here's some example output of the base62 encoder

struct lexer_TestSlice__
{
   struct __subarray_c3RydWN0IGxleGVyX1Rva2VuX18QA tkns;
};
typedef struct lexer_TestDynSlice__ lexer_TestDynSlice;
struct lexer_TestDynSlice__
{
   uint32_t member_0;
   struct __flexarray_c3RydWN0IGxleGVyX1Rva2VuX18QA tkns;
};

Here's the branch I'm working off of: https://github.com/Andersama/c3c/tree/gen-headers-on-build

Getting much further, but I've run into this:

/* function */
void scan_nothing(struct std_array_list$$lexer_Token_List__*, struct std_array_list$$lexer_TokenInfo_List__*, struct std_array_list$$lexer_TokenData_List__*, struct std_array_list$$char_List__*, struct lexer_TokenInfo__*);

It looks as though the parameterized modules are outputting $$ which I'm not sure is legal for c syntax.

Andersama avatar Aug 05 '22 21:08 Andersama

fault should just be typedef uintptr_t OptionalResult. The failable (optional) type is a bit more complicated. A parameter cannot be an optional, but a return value can. In that case, move the normal result into the first position as an out value. So:

fn int! getFoo(void* z) => OptionalResult getFoo(int *result, void *z)
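From the C side, that lowered signature might be used as sketched below. The `getFoo` body here is a hypothetical stand-in for the compiled C3 function, and treating zero as "no fault" and nonzero as "fault" is an assumption, not something stated in this thread.

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t OptionalResult;

/* Hypothetical stand-in for the compiled C3 function `fn int! getFoo(void* z)`.
   In real use only the declaration would appear in the generated header. */
OptionalResult getFoo(int *result, void *z)
{
    if (z == NULL) return (OptionalResult)1; /* assumed: nonzero means a fault */
    *result = 42;
    return 0;                                /* assumed: zero means success */
}

/* Caller pattern: check the fault before reading the out value. */
int get_foo_or(void *z, int fallback)
{
    int value;
    if (getFoo(&value, z) != 0) return fallback;
    return value;
}
```

The out parameter is only read after the fault check, since nothing here guarantees it is written on failure.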

lerno avatar Aug 06 '22 02:08 lerno

I don't know if it's suitable for parameterized types to be emitted into the C headers. $ is available as GCC extension so it will work for Clang/GCC at least.
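If such names were emitted for compilers that do not accept `$`, one option would be to rewrite them first. The sketch below replaces each `$` with `_`; `sanitize_extname` is a hypothetical helper, this scheme was not decided in the thread, and it can collide with an existing name that already uses underscores in the same positions.

```c
#include <stddef.h>

/* Copies `name` to `out`, replacing each '$' with '_'.
   Returns 1 on success, 0 if `out` (capacity `cap`, including NUL) is too small. */
static int sanitize_extname(const char *name, char *out, size_t cap)
{
    size_t i = 0;
    for (; name[i]; i++) {
        if (i + 1 >= cap) return 0;   /* keep room for the NUL */
        out[i] = (name[i] == '$') ? '_' : name[i];
    }
    if (i >= cap) return 0;
    out[i] = '\0';
    return 1;
}
```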

lerno avatar Aug 06 '22 02:08 lerno

And I'll have a look at your branch tomorrow

lerno avatar Aug 06 '22 02:08 lerno

You'll probably make way more headway than I am, I only just barely started making sense of how the ast is structured and only because I'm looking at the llvm codegen for reference.

Andersama avatar Aug 06 '22 04:08 Andersama

I thought you were doing the headers, but this looks like you're doing C3 -> C compilation?

lerno avatar Aug 06 '22 14:08 lerno

Well...at the point I'd be outputting c headers which work for the compiled program I figured I might as well be able to output the function definitions (in equivalent C form).

Andersama avatar Aug 06 '22 21:08 Andersama

Is there a context object similar to the LLVM one that the header.c file could use? I'm not sure if it's needed, but it looks like you're using the context object for some optimizations while you're using llvm. Otherwise what I'm doing so far seems like a fairly confusing tree walk with the ast which doesn't exactly match up with your llvm codegen.

Andersama avatar Aug 07 '22 05:08 Andersama

Hey, I'm not sure, I might've found an llvm codegen bug in llvm_emit_const_expr:

		case CONST_POINTER:
			llvm_value_set(be_value, LLVMConstNull(llvm_get_type(c, type)), type);

For handling const_pointer this seems to be generating a null pointer, so I can only assume this is not for handling fixed addresses

Andersama avatar Aug 07 '22 07:08 Andersama

I have successfully managed to convert some c3 functions into their c equivalents. Lot of things are still broken, but some basic functionality is there. I'm definitely picking up more of how the ast is structured.

/* function: is_second_idnt */
bool lexer_is_second_idnt(uint8_t c) {
return c >= 48 && c <= 57 || c >= 97 && c <= 122 || c >= 65 && c <= 90 || c == 95;
}
/* function: is_any_newline */
bool lexer_is_any_newline(uint8_t c) {
return c == 10 || c == 13 || c == 11 || c == 12;
}
/* function: is_eof */
bool lexer_is_eof(uint8_t c) {
return (uint32_t)c == 0 || (uint32_t)c == 26;
}

Andersama avatar Aug 07 '22 08:08 Andersama

Making some pretty decent headway if you'd like to take a look at the branch again, there is an apparent memory leak into the scratch buffer...I'll find it eventually, it's slow enough fairly large files can still output most of their contents.

Couple of things left to do:

  1. add the "includes" at the top of the file so that functions from other modules can be found
  2. properly handling member access, I just spit out . and the extname, but things break where -> should have been used instead
  3. fixing of failables and other c3 types that get introduced in places

Andersama avatar Aug 08 '22 07:08 Andersama

You shouldn't ever convert expressions or statements though @Andersama

lerno avatar Aug 08 '22 07:08 lerno

Not sure what you mean. I'm more or less converting the c3 ast back to c. For just the headers you're right I likely don't need to convert expressions or statements.

It could easily be a flag to output function bodies or not.

Or* if you mean the output should be similar to LLVM IR instructions but as C, I'm sure it would make much more sense if I followed along with your LLVM codegen as an example.

Andersama avatar Aug 08 '22 09:08 Andersama

What you're doing if you're doing pure C output is essentially a C backend. Getting headers is so that C code can easily use C3 code that's either in .o or as libraries. So for header gen underlying expressions and statements should never be generated. If you are looking at doing a C backend, then that is something that could be added, but it should probably be organized with the normal codegen flow.

lerno avatar Aug 10 '22 11:08 lerno

I can integrate what you did for the headers (types + function declarations + globals) into the main branch, but the expression and statement lowering is a huge work and should be moved to a separate codegen rather than the header gen.
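To make the scope concrete, a header limited to types and function declarations could look roughly like the sketch below, built from the struct and function names quoted earlier in this thread. The guard name `LEXER_C3_H` and the `extern "C"` wrapper are assumptions, and no globals are shown because none appear in the thread.

```c
#ifndef LEXER_C3_H /* hypothetical guard name */
#define LEXER_C3_H

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

/* Types */
typedef struct lexer_TokenInfo__ lexer_TokenInfo;
struct lexer_TokenInfo__
{
   size_t offset; /* C3 usize */
   size_t length; /* C3 usize */
   int32_t col;
   int32_t line;
};

/* Function declarations only; the bodies stay in the compiled C3 object. */
bool lexer_is_second_idnt(uint8_t c);
bool lexer_is_any_newline(uint8_t c);
bool lexer_is_eof(uint8_t c);

#ifdef __cplusplus
}
#endif

#endif /* LEXER_C3_H */
```

Nothing in it requires lowering an expression or statement, which is the distinction being drawn here between header generation and a C backend.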

lerno avatar Aug 10 '22 11:08 lerno

Right, ok so we're on the same page. I'm not exactly too familiar with the codegen stage of compilers, I've written quite a few lexers and parsers, some type checking. You can probably tell I'm more or less mirroring what you've got for llvm. I was somewhat falling back on just trying to convert the ast as is and then working out later what lowering and optimizations might look like.

The additional builtin types are the pain point for me at the moment, because defining the functions after structs would be easy enough, but it doesn't seem like outside structs and enums the other types are collected together anywhere. If for example there were a vec of decl* 's to substructs or the other types that'd be great.

Andersama avatar Aug 10 '22 19:08 Andersama

The functionality for producing headers is intended for creating static and dynamic libraries to be used with C/C++. So there should not be any need to emit statements, and there should be no need to do any lowering.

lerno avatar Aug 10 '22 20:08 lerno