[RFC] New Headergen Proposal

michaelrj-google · 2024-06-18T22:51:04.111Z

Written by my interns @aaryanshukla @RoseZhang

Objective

The new HeaderGen function should be able to generate header files based on command-line input, without having to depend on TableGen.

Background

The old header generator (libc-hdrgen aka “headergen”) is based on TableGen, which creates an awkward dependency on the rest of LLVM for our build system. By creating a new standalone headergen (codename: Regeneration), we can unblock Fuchsia and Google’s repositories that use LLVM-libc’s headers.

Design

Overview

Proposed Design

Repo for Example Implementation: GitHub - aaryanshukla/newheadergen

1. We use an intermediate representation consisting of classes to facilitate the generation of headers from any frontend.
2. For now, we have created classes for each attribute of the function header: macros, enumerations, types, functions, arguments, and includes.
3. Our design currently includes functions to convert YAML to IR to headers, and we will add Clang Extract API JSON to IR to headers. We acknowledge the benefits of the Clang Extract API and plan to implement that format later on.
4. Additionally, there will be cross-functionality that allows a YAML file to be parsed into the intermediate representation and then converted into JSON format, and vice versa.
5. There will be separate YAML files for different headers. In CMake, they will be concatenated into single files separated by standard (YAML will treat concatenated files as one file).
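
As a rough illustration of the class-based representation described above, here is a minimal Python sketch. The class names (`Function`, `HeaderFile`), the field names, the include-guard scheme, and the input schema are all hypothetical and are not taken from the example repository. The `spec` dict stands in for what a YAML loader would return.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Function:
    # Hypothetical fields describing one function declaration.
    return_type: str
    name: str
    arguments: List[str] = field(default_factory=list)

    def __str__(self) -> str:
        args = ", ".join(self.arguments) if self.arguments else "void"
        return f"{self.return_type} {self.name}({args});"


@dataclass
class HeaderFile:
    # Hypothetical container for everything emitted into one header.
    name: str
    includes: List[str] = field(default_factory=list)
    macros: List[str] = field(default_factory=list)
    functions: List[Function] = field(default_factory=list)

    def render(self) -> str:
        # The guard naming scheme here is an assumption for illustration.
        guard = "LLVM_LIBC_" + self.name.upper().replace(".", "_")
        body = [f"#include <{inc}>" for inc in self.includes]
        body += [f"#define {macro}" for macro in self.macros]
        body += [str(fn) for fn in self.functions]
        lines = [f"#ifndef {guard}", f"#define {guard}", "", *body, "",
                 f"#endif // {guard}"]
        return "\n".join(lines) + "\n"


def header_from_dict(data: dict) -> HeaderFile:
    # `data` mirrors what a YAML loader might produce (assumed schema).
    functions = [
        Function(f["return_type"], f["name"], f.get("arguments", []))
        for f in data.get("functions", [])
    ]
    return HeaderFile(
        name=data["header"],
        includes=data.get("includes", []),
        macros=data.get("macros", []),
        functions=functions,
    )


spec = {
    "header": "ctype.h",
    "functions": [
        {"return_type": "int", "name": "isalpha", "arguments": ["int"]},
    ],
}
print(header_from_dict(spec).render())
```

Keeping the loader (`header_from_dict`) separate from the classes is one way a second frontend could later populate the same classes without touching the rendering code.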

[Figure: New HeaderGen Design Doc]

Detailed design

GitHub - aaryanshukla/newheadergen

Alternatives considered

[Figure: File Format]

[Figure: File Organization]

[Figure: Header Generation File]

Quality attributes

Testability

Ensure that headers produced by the new HeaderGen are equivalent to those produced by the old headergen.
We would also rework CMake to ensure all tests in LLVM still pass.
Add unit tests. Each test would create a template of the expected function header and compare the generated header against it. These are mainly for the future, in case people want to add a header or change HeaderGen.
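
A unit test of the kind described might look like the sketch below. `generate_declaration` is a hypothetical stand-in for the generator under test, not a function from the example repository. The expected strings play the role of the "template".

```python
import unittest


def generate_declaration(return_type, name, arguments):
    # Hypothetical stand-in for the real generator being tested.
    args = ", ".join(arguments) if arguments else "void"
    return f"{return_type} {name}({args});"


class DeclarationTest(unittest.TestCase):
    def test_function_with_arguments(self):
        expected = "int isalpha(int);"
        self.assertEqual(generate_declaration("int", "isalpha", ["int"]), expected)

    def test_function_without_arguments(self):
        # An empty argument list is rendered as `void` in C.
        expected = "void abort(void);"
        self.assertEqual(generate_declaration("void", "abort", []), expected)


# Run the tests programmatically so the script does not exit on completion.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DeclarationTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```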

Project management

New HeaderGen is Aaryan and Rose’s STEP intern project. The last day of the internship is Friday, August 9th. Ideally, the new HeaderGen will be released a week prior.

Documentation plan

We will update existing docs on how to use the new headergen system and inform upstream about the new changes since it will impact the rest of LLVM.

Launch plans

Our project will replace the old headergen system, so all the documents currently in headergen will be replaced. The impact is removing the dependency on TableGen so that Fuchsia and Google’s repositories have fewer dependencies to worry about and are able to create standalone builds.

Launch will hopefully happen towards the end of July.

Operations

Rollback strategy

If there are errors with the new HeaderGen, we will roll back to the old headergen + TableGen while we make fixes.

nickdesaulniers · 2024-06-18T22:57:32.021Z

Breaking the dependency on tablegen will simplify constructing an entirely LLVM-based toolchain; we would no longer need to build tablegen for libc, then rebuild tablegen for llvm & clang against the new libc.

The use of the term Intermediate Representation is a bit much; really this is simply “in memory representation.” Sounds like we deserialize yaml and reserialize it as C code (header files).

Why do we need to support JSON?

aaryanshukla · 2024-06-18T23:45:05.417Z

JSON would only be needed if we used the Clang Extract API, which is already built into LLVM. For our needs, there are some issues with the JSON that the Clang Extract API emits, such as not including macros and enumerations. Once we have a successful implementation from YAML to C headers, we can start to fix the issues with the Clang Extract API.

nickdesaulniers · 2024-06-20T16:33:37.205Z

Perhaps to plan for the future, it may be nice to use field names similar to those that clang extract API uses, even if we just support YAML and not JSON out of the box. Then, folks may use clang extract API to get JSON, and use some out of the box JSON to YAML converter to get data roughly in the format that we consume.

RoseZhang · 2024-06-20T17:32:29.272Z

Using similar field names is certainly doable, and we would benefit from the consistency. But users shouldn’t have to use an out of the box converter, since we would be able to output both JSON and YAML and convert from one to the other. Since the representation would be the same, there would just be separate scripts for generating JSON and YAML.
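
To make the "same representation, separate serializers" idea concrete, here is a small Python sketch using only the standard library. The `Function` class and its fields are hypothetical. Only the JSON direction is shown; a YAML writer would consume the same `asdict` output through whatever YAML library is chosen, which is an assumption and not shown here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Function:
    # Hypothetical fields; the real schema may differ.
    name: str
    return_type: str
    arguments: List[str] = field(default_factory=list)


def to_json(functions: List[Function]) -> str:
    # Serialize the shared in-memory representation as JSON text.
    return json.dumps([asdict(fn) for fn in functions], indent=2)


def from_json(text: str) -> List[Function]:
    # Rebuild the same classes from JSON text.
    return [Function(**entry) for entry in json.loads(text)]


original = [Function("isalpha", "int", ["int"])]
round_tripped = from_json(to_json(original))
print(round_tripped == original)
```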

ilovepi · 2024-06-21T15:19:59.164Z

Thanks for the RFC. We are all surely glad to see a resolution to the odd tablegen dependency with libc. I have a few comments and questions about the proposal.

Can you clarify what you mean by “intermediate representation consisting of classes”? Based on Nick’s comment I’m guessing this is an implementation detail about how you represent this in C++. Is that accurate? You may want to consider rewording that part to better convey your point.

Is your point 4 stating that headergen will take either JSON or YAML input, and can either output the headers or an equivalent config file in JSON or YAML? This is a kind of surprising functionality IMO. Can you provide a bit of explanation about why it is necessary to convert between the formats?

I think maybe the discourse formatting may have truncated some of point 5, since it seems to be missing some text. I assume you listed the separator and it was formatted away. I’m also slightly concerned about the “combine in CMake” bits. Can you provide some more rationale for that decision?

My observation on JSON vs YAML in LLVM is that LLVM as a whole has been moving more in the direction of using JSON over YAML. We’re seeing it crop up in more and more tools without YAML support.

But JSON is valid YAML, so I’m not sure there’s a downside to prioritizing JSON, or in only supporting it.

aaryanshukla · 2024-06-21T17:34:24.506Z

Regarding the intermediate representation of classes, the wording is a little unclear. What we mean is that we deserialize YAML or JSON input into an intermediate class-based representation in Python. This helps us manipulate and transform the data more effectively before generating the C code. This is mainly an implementation detail about how we handle this internally.

In point 4, we aim to provide flexibility in the format of the input. Users can choose to provide their input in either JSON or YAML. We have developed a mechanism to convert between JSON and YAML and to handle both formats seamlessly. The reason for this flexibility is due to some issues with the JSON structure from Clang-Extract-API that don’t meet our needs. This conversion capability allows users to work in the format they find most convenient and provides a way to leverage existing tools and workflows.

In point 5, the “separator” refers to YAML’s ability to concatenate files using CMake. This feature is useful because it allows users to combine multiple YAML files during the CMake build process. This way, users can specify functions across different YAML files without needing to list each file explicitly in the CLI.

We acknowledge that LLVM has been moving towards using JSON in more tools, which makes JSON support essential. While JSON is valid YAML, prioritizing JSON can streamline compatibility with LLVM’s ecosystem. However, we encountered specific issues with the JSON format from Clang-Extract-API. Our immediate goal is to replace the old headergen with a functional version that supports YAML effectively. Once that is established, we plan to extend full JSON support to ensure comprehensive compatibility.

ilovepi · 2024-06-21T19:38:09.024Z

It sounds like a lot of the complexity here is related to limitations in Clang-Extract-API’s output. Perhaps that should be explicitly out of scope for the MVP, but stay as a long-term goal?

aaryanshukla · 2024-06-21T19:44:33.068Z

I agree, will modify that in terms of short-term and long-term goals.

nickdesaulniers · 2024-06-21T21:24:54.432Z

ilovepi:

My observation on JSON vs YAML in LLVM is that LLVM as a whole has been moving more in the direction of using JSON over YAML.

While JSON is ok as an interchange format, it is quite poor as a configuration language in my experience. YAML shines over JSON in:

YAML doesn’t need termination; JSON is not allowed to have trailing commas, which is annoying for maintaining git blame (for what that’s worth). I guess you could have preceding commas, but that’s a criminal offence.
JSON cannot have comments in it. That’s annoying when leaving breadcrumbs and additional info in config files, or banners about not modifying generated files that get checked in.
YAML can be joined together with cat and still be valid. This lets us break up config files into smaller pieces that can be organized logically (say by standard, such as the C standard, POSIX, GNU or BSD extensions, as we do today with the tablegen files in llvm-libc), then have tooling join together the YAML files on the fly. This is helpful when config files become large and unwieldy.
YAML anchors allow for aggressively deduplicating values.
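
The cat-style joining of YAML fragments could be sketched in Python roughly as follows. The file names and field names are made up for illustration, and the sketch assumes each fragment is a top-level YAML sequence, so that the concatenation reads as one longer sequence; other layouts may not join as cleanly.

```python
import tempfile
from pathlib import Path


def concatenate_fragments(directory: Path, pattern: str = "*.yaml") -> str:
    # Join fragments in sorted order, similar to `cat a.yaml b.yaml`,
    # but also ensure each fragment ends with a newline.
    parts = []
    for path in sorted(directory.glob(pattern)):
        text = path.read_text()
        if not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    return "".join(parts)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical per-standard fragments.
    (root / "posix.yaml").write_text("- name: read\n  standard: posix\n")
    (root / "stdc.yaml").write_text("- name: isalpha\n  standard: stdc")
    combined = concatenate_fragments(root)

print(combined)
```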

At the end of the day, there are web-based and command line based tooling for converting from one format to the other, so it doesn’t matter that much what we pick.

If we want to manually update these config files by hand, YAML is simply less painful. If we want to generate these config files via clang-extract-api exclusively, then JSON is probably better (though, since we can’t have comments, then we can’t put a banner along the lines of “autogenerated via $blah, do not hand modify” that other tools in tree do (such as llvm/utils/update*_test_checks.py)).

ilovepi · 2024-06-21T23:43:45.575Z

Those are fair points, and I suppose I hadn’t been thinking about how we’d typically structure a configuration file. To be clear, I’m not opposed to YAML. My comment was more to provide context on the direction I see the project heading. It’s clear that you have a rationale for that decision, but it would have been helpful to call out some of those points in the RFC.