XLS (Accelerated HW Synthesis) Lecture @ UCSC 2024

Posted by cdleary on 2024-04-08

What follows is a transcript (with some edits for clarity) of the XLS (Accelerated HW Synthesis) guest lecture from Professor Beamer's CSE 228A - Winter 2024 @ UC Santa Cruz.

Note that while the topic of this lecture is the XLS (Accelerated HW Synthesis) hardware development environment / toolkit this is primarily a course that teaches Chisel and in particular covers Programming Language (PL) aspects around Chisel's design, which framed some of the content and allusions for the talk.


Thanks Scott, thank you everybody for having me here today.

As Scott said, I previously worked on XLA, which is a linear algebra compiler; now I work on XLS -- the joke is: 24 more projects until we run out of names.


XLS is short for Accelerated Hardware Synthesis. The idea of this project is that we're kind of "up leveling" the hardware design process beyond what we might normally do with RTL (Register Transfer Language).

It's not exactly all the way to High Level Synthesis which you may have learned about or heard about -- we're going to talk about some of the differences there.

One kind of funny story: when I first gave an XLA talk publicly, one of the people in the audience listening to the talk ended up creating the ML framework called JAX, because they were like, "I know how I can put my research together with that thing that just got presented," so it's one of these funny phenomenon of: you never know who's going to do something really awesome with the tech that you're trying to present.


This is very much a team sport. I'm presenting the work of many people, we have all the contributors listed on github. We have a fully open source code base and issue tracker, all on github.com/google/xls -- we have documentation which is rendered from there, it's all Apache2 licensed.

And just a note: this is still early days. We basically run the gamut inside of Google of applying XLS and seeing what it can do and how well it works, and building it out from its kind of bread-and-butter facilities. But there are sharp edges, and we work with teams specifically so we can build it out while we make sure we can maintain quality and have them avoid the sharp edges.

So if you go to try to test it out and you find bugs please report them and help make it better, that would be amazing.


Alright, so we're going to start with "Why?" I remember I was auditing a grad course and the professor drew on the board a pyramid. At the top was like "Applications", in the middle was "Operating Systems & Compilers", and at the bottom was "Hardware". The implication was that you want to work on hardware because all the stuff at the top of the pyramid is all benefitting from the work you're doing on the bottom.

But the professor took us on a rollercoaster ride and said, but there's also really only like two places you could get jobs doing high performance microprocessor architecture and mostly you'd probably get a job on Itanium. And, you know, we're not all using that processor today... So it was kind of a bleak story. But I would claim that, since 20 years ago it's pretty different now. Anybody in the class want to say what they think is the main thing that's different between 20 years ago and now? Why is it less of a bleak story?

A: Moore's law has kind of died off.

Moore's law has kind of died off, exactly. And there's a really interesting hardware design game at the end of Moore's law. Does anybody know what a hyperscaler is? It's kind of an industry term.

A: It means very large cloud providers like Google, Amazon, etc.

Yeah, very large providers -- so there was kind of this evolutionary shift where we went from mainframe computers to minicomputer to microcomputers, and then people started filling warehouse-scale data centers full of microcomputers, right? So they were deploying at such scale they started calling these folks hyperscalers, and if you look at all the hyperscalers in industry, they're all starting to do custom hardware. It's an empirically observable phenomenon.

This was not so possible before before we were all on this treadmill, where CPU advances would get faster quicker than we could build custom hardware. But that is also going away, as part of the End of Moore's Law.


So how do we capitalize on this? What are we going to do in order to take advantage of this? Good tools and platforms are going to be necessary for this new era.

Before there were only a couple of design houses that needed to build hardware, and there were only a couple of ways you needed to do it, and a certain number of people that you knew was exactly the right number of people to build a certain piece of hardware. But as we start retargeting things more, reusing them across different use cases, as we start taking vertical slices from the application down to the hardware we're going to want to reuse and repurpose things in more places.

And this is sort of the purpose of this course, right? The idea of reusing and make iteration cycles faster, and bridging the gap between what we might think of as traditionally software-style activities and the building of hardware.


I got to work on the TPU, I'm very privileged to have done that, and I think that the real magic that I saw there was the hardware and the software people working across the aisle. So after working on the TPU, I was asking "What is really the key enabling technology that we need to get a lot more of that magic to happen?" To have a lot more of the hardware and software people working across the aisle, understanding shared artifacts and understanding cost models that our collaborators are thinking of on the other side of the fence. How do we start tearing down that fence?


I think another coincidental thing is that, if you look back at the history of hardware design, we've lost a lot of the attempts to really up-level what we do and how we do it, with what we might think of as the failure of traditional High Level Synthesis to win on designs and to win hearts and minds.

High Level Synthesis itself has not completely failed, [ed: people do use it successfully to do major and important things today]; but, for example, I can't easily go to a Hardware Engineer on an arbitrary project and start saying, "Let's start doing some of these things with High Level Synthesis," it's a difficult proposition in many cases.


So part of why we work on XLS is that we're trying to figure out: can we get it back? Can we start up-leveling and bridging that gap between software (and the way we do software things) and hardware (and the way we do hardware things).


And there was a previous time that something like this happened in history. I call it "the FORTRAN moment", but basically there was a point where we started writing things in FORTRAN, and it would do a register allocation for us. Right, this was one of the first higher-level languages? And then we would look at our register allocation that we spent three weeks on, and we'd say, "Hm, it actually made a better one..." I don't understand the one that it did, but by virtue of not understanding it I actually got a better result, faster.

So this idea of not really compromising on what we got in terms of the quality, but also getting sped up: this is really the key we need to unlock people wanting to do this up-leveling process.

So the idea here is that you write hardware in your traditional methodology, you compare to a version that was created with more automation: ideally the version with more automation has better Quality of Result (QoR), is more correct/featureful, and took less time.


There's really two pillars in how we might think about new tooling and automation improvements; one is methodology.

Methodology is things like velocity: how fast can I go; the OODA loop (orient, observe, decide, act). How testable is it, how reusable is it?

And the Quality of Result, is really at the end of the day, the Power/Performance/Area I was able to get out for my design and how good was that -- things like what throughput did it have, how many logic levels did it end up being, how did it end up optimizing against the process node that I'm targeting?


Q: What exactly do you mean by High Level Synthesis?

Yeah, so High Level Synthesis -- and we're going to talk about this a little more in subsequent slides, but it's good to have context early -- was in some ways the idea that we could take software almost exactly as software exists and turn it into hardware. There's subtleties there, that kind of were not explored as well, and that's part of what we're talking about today. But the idea that you could use a software style methodology and write a software looking artifact, and that it would automatically turn into gates and a synthesizable artifact is the basic idea of High Level Synthesis.


In trying to automate more and up-level people and get good things to happen, who are our users? We have segments of different users that we think about.

One is we want to take Hardware Engineers as they do things today, lift up the level of abstraction that they can work at, make them more productive.

The second one is taking Software Engineers, who think in terms of performance. They're software engineers who are optimizing assembly, even ones who look at micro-op ports and think about how quickly small snippets of assembly can go. These are people who are aligned to the idea of "how we can we make something go fast": what are the costs, what can the ultimate workload do on this piece of hardware that I'm trying to create.

And then the third category is people who might do modeling; for example in computer architecture classes you might model different components. That modeling actually could create synthesizable hardware artifacts, you could close the loop on, "Can this branch predictor I just tried out close timing at 2GHz on a real process node? How much space/area does it take?" You might be doing GEM5 modeling or something along these lines in school -- can I actually get a hardware artifact out of that as part of my process of modeling what it would be?

If you wrote it in say, C or C++, and used unbounded queues or something like this, you can't turn that into hardware. But maybe it's not a very big leap to write something that actually could be synthesized as hardware.

So again this is all in the interest of creating more shared artifacts and understanding. We're getting more productivity by platforming people, putting tools underneath them that then we can analyze their workload and do good things to it, and automation: how do we automatically do things that would be very tedious to do by hand?


This is the XLS stack diagram -- it's a little bit of an eye chart, try not to be overwhelmed, but the basic idea is that language things come in at the top. We have a Domain Specific Language (DSLX) that mimics Rust called DSLX. So think: "A Domain Specific Langauge called X".

That turns into our Intermediate Representation (IR), and you just had a lecture last time on what a hardware IR might look like, so that's basically a data structure that represents the program / represents the hardware you're trying to create.

We can optimize that, and we can also run it, on the host computer, as native code. We actually have a way to Just-In-Time (JIT) compile it into fast assembly code. Once we've optimized it we can also schedule it, mapping it into cycles. Again, going back to the idea of "how do we make software and hardware artifacts more similar," the notion of introducing the idea of cycles in a flexible and automatic way is part of what we do here in this flow. Scheduling is a big part of taking something that was not designed with a particular set of cycles in mind and introducing that automatically in a flexible way.

Then we generate, ultimately, Verilog. Verilog can be simulated or synthesized through backend tools. Backend tools then spit out a gate-level netlist. The gate level netlist we have ways of proving are equivalent to the original input program?


Q: What is the purpose of generating that x86?

That's for very fast functional simulation -- I'll show you at the end, say you're writing a floating point unit, we can run the whole 2^32 space in like 10 seconds, because we can go straight into native assembly code using that description that you made up at the top.

There's also frontends for C++ that certain groups inside of Google do use, though our preferred path for people is to use this Domain Specific Language because we think that there are challenges in C being directly used as a hardware description language, which we're going to talk about a little more.


Ok, so, it's not a CSE228A lecture without notebooks!

These notebooks have publicly available versions; bit.ly/learn-xls and bit.ly/xls-playground -- this is the "learn XLS" notebook.

What we have here is the first "Domain Specific Language X" (DSLX) code that you'll see. As I said, it's a language that's very similar to Rust. Rust is a nice choice because it's an expression-oriented immutable-focused dataflow-oriented language.

We don't have references or borrows here, so it's also simpler than normal Rust, and maybe more accessible to more people, but it basically acts as a function description language where we don't need to invent a new syntax for people to learn and the semantics are as close to Rust as we can get them.


This is a basic 8-bit adder, and what we're doing here is taking two inputs, which are 8-bits, spitting out an output which is 8-bit, and we're writing an inline unit test where we passed 41 and 1 and we get 42.

When you run this cell it runs according to this %%dslx "magic line" (as they call it in Jupyter/Colab) which says that the top entry point is this adder8 function. We want to make something that's one pipeline stage with no flops on the input or output -- and of course, that's just going to make combinational logic, since they're no flops on either side.


Our unit test passed ok. If I change this to 43, it should fail, and it does, it says "The program being interpreted failed" here in the result tab. So what happens on success? We get our Intermediate Representation which we can see in the "ir" tab. This is the top level function, it takes these two parameters in and spits this result out -- it has position correlation information so that we can track how things from the original program flowed through the system.


When you optimize it, you can see the result in the "opt" tab -- there's nothing in this program you can optimize directly, it's just an 8-bit add. The Verilog we spit out is in the "verilog" tab, it has a single pipe stage with no flops around it, as we requested, in our user-created module.


There's also the "schedule" tab -- the scheduler ends up doing more work in more interesting cases, but it shows here what the node was that came in, how long the delay was for this node (targeting this underlying process node), and what happened on the critical path for the design that you made. The "bom" tab shows the "Bill of Materials", which shows what objects are in the design that came out in th end, after optimization.


One thing I kind of snuck by you here -- we're using a Process Design Kit (PDK) and it's called ASAP7 -- this is not a "real"/manufacturable design kit, this is an approximation of a 7 nanometer (FinFet) PDK that's fully available in open source for collaborating on and researching in all sort of ways.


Q: Will the Bill of Materials (BoM) reflect anything specific to the PDK? E.g. will I see skywater130 specific objects in the Bill of Materials when I target sky130 as the PDK?

A: At this level we don't specialize for the underlying (PDK) target, so there will be no real change to the bill of materials targeting different process nodes, except that the delays for sky130 will be different from the delays in targeting ASAP7 -- the delays will end up producing different pipelines and different Verilog, and that could end up changing exactly what was in the "verilog" tab at the end, but you won't see a sky130-specific thing listed in the bill of materials.


Q: Is the assert_eq a test-only construct or can it be used like SV assertions?

A: There is a way to fail a design with an asynchronous failure, but this assert_eq is specifically for testing purposes where it can spit out a nice error message. The way that we do it inside of a function is with a construct called fail! -- this is just more evocative of the fact that the hardware can potentially fail when synthesized. That turns into something that's like a SV assertion in the resulting hardware.


Moving on to the next example, we're doing an 8-bit adder with a carry output. In the DSL, like in Rust, the values have to be the same width when they're added together, so we're turning them into 9-bit values, adding them together, slicing out the lower 8 bits using a slice syntax, and then grabbing the upper bit. This is actually showing that there's two kinds of slice syntax: one is slicing from LSb to MSb with exclusion on the MSb value, and the other one is "at a start position, give me something of this width (with this type)"; i.e. at bit 8, give me a 1-bit (unsigned) value.

That's returning us the sum and the carry bit as a result. This is just to show you various feature of how the XLS stack all works.


We also have a standard library. The standard library, as one example, gives you facilities for doing widening arithmetic operations. Naturally in languages like Chisel, I believe, when you add two values then you get the expanded width, right? When you add two 8-bit values there's naturally a carry bit that comes out. That's not how Rust semantics work, but we offer a standard library function that takes in two 8-bit values and produces a 9-bit value out if that's your preferred semantics, and then you can slice out of that result just the same.


Now we're introducing the notion of a parameterized function -- this is parameterized on bitwidth N. Instead of what you'd do in a generative form, e.g. you might take in an arbitrary type with a require statement, what we do here is explicitly parameterize the design on certain values. Here we take in two values with width N, then we add the values together, and we slice up to the Nth bit, and then we slice out the N+1th bit as the carry output. Does this make sense as the generalization of what we did in the previous cell?

Now to test it, we use a module-level constant, and we do a unit test with both a u4 and a u8 -- both of these instantiate the parametric design we made.


Now in this cell we're also doing a multiply -- if you take two values in of size N, it produces a multiple result of size N+N -- that's the maximum product size. Normally in software we slice out the lower bits and we use those -- you can also do that with the multiply operator * as in Rust.


Here's where it starts to get a little more interesting. Now we're actually going to specify the clock period where we want to chop the pipeline stage. Show of hands, in your project designs have you needed to figure out where you're going to put a pipeline stage? Like, "I need to put a register here, I can't just put arbitrary amounts of functionality". Maybe just one or two designs that are timing sensitive in this class. Timing and timing-closure is a big part of the hardware design process: you need to figure out, "does this close at 1GHz?", "does this close at 2GHz?" If your design doesn't close at a particular frequency, your design doesn't work: you can't tape it out, you can't get any hardware back.

So timing closure and automatically figuring out what fits into a clock cycle is a big part of the traditional hardware design process. What's nice here is that XLS will be able to figure out that I need to do this particular computation in two stages.

Aside: this is not my favorite view (in the Colab) for the schedule came out -- we actually have an IR visualizer too, but this is one we can show easily in a table.

This is basically showing that the product op, which we saw in the code, is taking the most time on the critical path, and we've automatically scheduled it into two pipeline stages, one which just contains the multiply op and another that contains the rest of the operations.

Does it make sense that scheduling things into cycles is a big part of the hardware design process?


I'm going to skip ahead a bit in the colab. So far we've seen just functions -- we've called functions and synthesized them and scheduled them. We actually have a way to do things also "in time". If you just have a function, it can't really say what it does as time progresses.

This is called a Communicating Sequential Process. This process basically takes a function, executes it in time, and marches forward. This process is going to use the function that we previously described, this multiplier-adder, and it's going to talk over certain channels. Channels are the moral equivalents of ports in Verilog.

Channels are analogous to ports in Verilog (and Chisel). We have explicit state initialization. We have configuration of how the process wires up to other communicating processes. We have a function called next that iterates this process forward in time.

Back to our communicating process example, we've taken our multiply accumulate function and turned it into something that interacts in time. First it will receive a receive a value from one channel; then it'll receive a value from another channel; then we're going to do a multiply-addition on the values we got; we're going to use the accumulator value that we're "carrying forward" in time, and we're going to send the result to the outside world.


As a result we've turned our function into something that can work in time, and the Verilog comes out for it, with various values doing ready/valid signaling.

Like in Chisel, there's ways to control how a channel works, what it consumes, what it produces, with the associated ready/valid signaling. We also have pluggable reset strategies, as well as the clock frequency target that we're working with.


Going down to one of the last examples, we're going to take this same multiply-accumulate communicating process and we're going to call this function, but we're also going to process a command that comes in -- so you can see how this one can perform a multipliplication, it can perform a multiply-accumulate, or it can reset its internal accumulator value.

You get this Command which is a struct -- it has values inside, potentially for the operands, you take in a command and you spit out the outputs. We use a match statement which is the equivalent of a Chisel when in order to perform an operation according ot the command. This is going to create a priority encoder [ed: technically in this case it's one hot so it will create a one-hot-selector] that says which of the results is going to flow into the resulting state value.


I think one of the things you'll notice if you've been working in Chisel a lot, as you've been doing in this class, one of the main differences between an environment like this and a Chisel-oriented environment, is that you have a first class language that you're interopping with, instead of a "layer on top that's producing a layer below". Something like a when function which is part of the Chisel DSL that's on top of Scala, is actually a first-class match statement in this environment.


Q: How is state declared / what is the type of it?

A: State is the second parameter to the next function. [ed: You have to produce the same type as the result of the next function as the type of the state, i.e. next does send/recv side effects and computes the next state value.]


Another part that I've glossed over here is the token. Imagine that you're making a process and everything inside of the process is concurrent. So again, a process is kind of like a "thread" or an "actor". Has anybody heard of actor model before, show of hands? A couple of people. An actor is basically just a thread of control that takes messages and produces messages to talk to the outside world. What the token does, is it sequences what has to happen in a particular sequence for the I/O operations.

So let's imagine I needed to receive something from ${RIGHT_HAND_SIDE_OF_ROOM} before I receive something from ${LEFT_HAND_SIDE_OF_ROOM} and it had to happen in that order [ed: maybe the person on the right hand side of the room would then somehow help help out the person of the left hand side of the room, which meant I needed to make sure it happened in that order]. I would thread the token first through the receive from ${RIGHT_HAND_SIDE_OF_ROOM} to the receive from ${LEFT_HAND_SIDE_OF_ROOM} and then I'd perform my operation.

If those could happen simultaneously where I don't know which "side of the room" would send to me first, I'd use the same token with fanout to both of the receives, get the values back, and then do my operation. This is basically the specification of does it have to happen in a particular sequence, or can it happen concurrently.

What's nice is that everything happens concurrently, unless otherwise specified, as hardware is designed to be.

This is one of the difference from C programming languages, as we'll see in a few slides.


There's also in this same notebook some other examples at the bottom -- people really like to see little counters that are sending out the counter value. This is a very simple example based on the same principles we saw, we could the value up and we send it out. You have a way to test that [sequential process] inline as well.


Then there's a nice beefy example here at the bottom, which is a very big leap in functionality, but it shows the parameterizability of a single communicating process like this. This is a variable length integer decoder. One of the things we use a lot at Google are protocol buffers, and protocol buffers have a way to compress the integers that are present inside of them, basically by using a stop bit, kind of like a byte-level decoding.

There are a lot of ways, that if you had a buffer full of these things, that you could decode them. You could have lookahead [e.g. try to decode N varints at once], you could not-lookahead, you could do them sequentially in time. All of these tradeoffs can be controlled through a single specification at this level, while flexing with cycles and time. So maybe you want to do it over multiple cycles, maybe you want to do something guaranteed within one cycle, maybe you want to pipeline it. All of those choices are within your control when you specify a communicating process like this and then specify how it gets mapped into cycles.

Does that roughly make sense? That these choices about how you might do things in time then can have the flexibility underneath?


And we didn't need to write anything explicitly generative, we didn't have to pre-plan how we were going to do any of this. We specified the dataflow computation for how it was going to happen, and then we work with the control that we have for mapping things into time, via clock period and throughput, for what we want to have as the per-cycle behavior.


So at the end here we have this thing called "worst case throughput" -- what "worst case throughput" says is: if "worst case throughput" is 1 you have to be able to accept a new example every clock cycle (or backpressure) -- if your "worst case throughput" is 4, you actually can accept a new value up to every 4 clock cycles. The compiler has to (otherwise) tell you, "I couldn't accept a value faster than every 4 clock cycles, therefore I can't give you a valid design [that meets your specification]". So that's what you're telling the compiler has to happen in this case.


Sometimes this is also called "initiation interval", but "initiation interval" is a bit of a static concept -- when dynamic things happen, e.g. if I didn't receive an event, is my "initiation interval" still 4 in that case? I didn't get this event that I needed to make forward progress. So we actually call it "worst case throughput" as a bound for "less than full throughput" designs.


So what did we see? We saw functions that take values and produce values, that are parameterized on certain explicit parameters. We saw that we can have safety in various forms through the type system; e.g. we said that things have to be a particular size to be added or multiplied, and these kinds of properties. So this is a very safe environment compared to the arbitrary casting and extending that might happen in certain HDL style environments [e.g. Verilog].


One of the other things that we didn't touch on as deeply is that in XLS there are real libraries that can now work across cycles. Normally if you're using a library or a function in Verilog, it needs to all happen within a single cycle; there's no composable unit of computation that can happen across the cycle boundary. In a thing like a generative language, we have to pre-plan for all the ways in which a library might be reusable (i.e. used to cross cycle boundaries). But if we have something like an AES computation -- this is just some cryptographic function -- these could all end up taking multiple cycles if they wanted to, or getting squashed into the same cycle if they wanted to, and the user doesn't need to pre-plan for any of that composition or how things might end up.


For people who are used to hardware, there's a basic premise, which is that this communicating process is like an "always-block on steroids". You've probably seen either through working with Chisel or looking at the output of Chisel that we get always_ff @ outputs, and those are things that are happening on a per-cycle basis. Those always-blocks are pre-planned for what goes in the cycle. But what having tooling "underneath" gives you is the ability to either smash those together, peel them apart, work on how quickly they can accept values, in an automated way, through the underlying tool.


At Google we use this build system called Bazel. All of the "how do things get physically incarnated" are all specified declaratively in this external file. These can all be specified on the command line as well, but this is where we consolidate all of our configuration information.


So we can pass it a target process node like skywater130 as we were talking about before. We pass it a reset strategy for how our resets works, we pass it "how are we going to do ready/valid signaling" -- is a channel ready/valid signaled or is a direct wire? All these kinds of properties.

We say what kind of pipeline we want to come out of the design, or whether we want something that's more like a Finite State Machine (FSM).


We also need escape hatches. Any time you try to increase the level of abstraction you need escape hatches for people know to do what they knew how to do before. So we have what we call Foreign Function Interfaces (FFI) which is familiar from software -- it's a way to call something external to the system from within the system. So here we see that we're specifying a Verilog external module, and we're also giving a behavioral specification, so that if we compile this to native code it can run as fast as native code, and then we don't need to go all the way down to the RTL level specification.

We don't have it now, but I expect that we're going to grow a system that tests that these two modules -- the one specified at the XLS level for fast simulation and the one specified down at the Verilog module -- produce identical results across a wide range of inputs.


I think part of the benefit I can provide here is giving a few lessons learned from the past. I wasn't there for the early wave of High Level Synthesis work, but I talked to a bunch of people who were there and a lot of the way that our team has been trying to build XLS has been informed by this, so I'm going to try to convey it here.

HLS kind of suffered from what I think were mismanaged expectations. There was this notion that you could just take C programmers off the street and have them go write some hardware by just typing in C code and the magical compiler would figure out how to turn it into the best possible hardware. I think this was bad for its adoption and people's actual understanding of what it could provide.

This is kind of the current that in some ways I feel we're working against. We just need to figure out how to improve the abstractions that people are already working with. Not necessarily take arbitrary C programs and figure out how to map them all down. Does that distinction make sense to everybody?


Another thing that happened in this "taking C programs and mapping them down for arbitrary C programs" is we took away key areas of control; ways that people would think about doing their job in RTL were not as clear or understandable or present in the environment they're working in, when they're using, for example, C as a hardware description language. How do I create an always block? How do I do a particular instantiation of a Verilog module? These don't have intuitive answers at least, right?

Part of the problem here is that we were taking away the control and turning it into a "you have to go figure out what the tool did" sort of problem. The tools also didn't necessarily offer good techniques for figuring out what happened inside. These are the things we really need to fix as we start to uplevel the abstraction beyond the cycle-by-cycle basis.


The pitfall, in summary from my perspective, is that we over-indexed on C and C++ as a hardware description language. Part of the reason they did that makes sense: if you have a native compiler, like GCC, that can turn a description into fast native code, then you might want to use that for your fast verification. Nowadays we actually have compilers as libraries: that path we saw where we sent XLS IR through LLVM, that's using LLVM as a library. That really unburdens you from the fact you needed to make yourself source-code-compatible with the C compiler input. So now there's a whole range of things that we can explore with fast native simulation.

With C you have no Hardware Description Language Features and arguably you have bad defaults, because as we've learned in this course, hardware is dataflow, right? There's not like a giant global memory that all these dataflow things are all accessing at one time. There's all these little objects that are passing messages to each other over wires. So this is dataflow computation not von-Neummann computation which assumes that giant global memory.


Again, the flawed expectation that lots of people believe is that with traditional HLS you could take arbitrary software and just give it to the HLS compiler and get hardware out. That works very poorly, if it works at all.


What we think we've introduced here is something called "Mid Level Synthesis" (MLS). The High Level Synthesis style of productivity but with the RTL-level of control and all the facilities for doing that. A lot of things that people care about are: what RAMs are going to be used and instantiated? How are cycles going to map on to things arriving or departing from this block? All these things should be understandable and specifiable, and that's really what we focus on in the way that we build XLS.

When you have this cross-cycle dataflow specification, you can transform it in lots of ways, but we also have to have the control there for people to reach for.


One of the funny things here is, if you remember the segments I was talking about earlier: there's hardware engineers and software engineers; part of the idea is getting them to look at this shared artifact and seeing what they need to see in order to make good progress and get their job done. For the hardware engineer you need them to be able to see the always block and the RAM instantiations, these sorts of things; for the software engineer, you need them to be able to see the message passing actors that are just kind of another software methodology, right?

And so this is called a gestalt shift -- you look at his object, and I tell you "look at it and try to see the rabbit" vs "look at it and try to see the duck". Maybe we can all make progress even though we're individually seeing rabbits and ducks, and that's kind of the idea: how do we create that thing that lets us all make progress the way that we're seeing the world and meet in the middle in some productive way.


Another cool thing that comes out of this is that, now that you've specified something and it crosses cycles, we can automate the feedback loop. It sounds like some of your projects ended up having timing closure questions -- timing closure is a really big part of what happens in the overall hardware design cycle. Being able to take the timing report that comes out of your backend tools -- this is OpenROAD, which I know Professor Beamer mentioned before. Yosys and OpenROAD can produce timing reports for reproducible research on these open process nodes like Skywater 130.

We had a paper at DATE'24 (Subgraph Extraction-based Feedback-guided Iterative Scheduling for HLS) that showed iterating on the results where XLS was predicting how long they would take: closing the loop and figuring out "oh, actually I can minimize area based on refinements of my original prediction".

Say originally you thought the multiply was going to take 500ps, it ended up only taking 400ps, now there's a way to figure that out from the backend tools and what that's providing to you in its timing report. This is a really exciting methodology because when you know exactly how long things take, you can push operations from one cycle to another in a pipeline to minimize the amount of registers that you need. Register are power, registers are area, registers are a really big part of the design process, and not a lot of people have a lot of patience to iterate 27 times on their timing report to actually get the best possible result. When you have automation tools underneath that can refine your cycle mapping strategy and the timing, you actually get this as part of the process.


Another "cartoon diagram" just to make sure that this point got across, but I see a lot of head nods so I won't belabor it. If I express the dataflow computation on the left, I can pipeline it -- insert these green bars -- this gives me a three stage pipeline, here. Or I can roll it up in time -- if I say "I'm not going to accept an example every clock, maybe I'm going to accept an example every three clocks" I can roll it up to that version on the right.

This is available without totally reworking your design -- if you were working in RTL you'd have to literally rip it all up and replace it, and you couldn't use your original description, and as the algorithm gets complicated that becomes a ton of work.


To be a little bit provacative... Everybody in this class has been working in Chisel. Chisel is an amazing step forward in the space of methdology. I kind of think of these things as "advanced methodologies", and I think that's the direction we all need to head, and we can't be like crabs in a barrel arguing over "what exactly is the best little distinction here and there". But I do think that it's probably technically true that parameterization unlocks the most power when you can actually flex your design against cycles. And as you're trying to close timing at a particular frequency, you'll probably observe this quite acutely. Because you'll have to rip and replace things, or you'll have to pre-plan for where all the cuts could be with respect to time.

If you have a two-level language like the Chisel environment -- which has Scala producing Chisel -- then you actually have to pre-plan for the ways in which your design may need to change its generation. Right? That's kind of the trade-off. You put in a conditional, or you put in some kind of calculation for how much stuff you might think fits. By contrast, this is all built in to a system where it does some amount of cycle mapping for you, and this really lets you compose and reuse libraries in that cross-cycle universe, which is where a lot of things do end up living.


Hardware Generation Languages, sometimes called HGLs, you have to kind of pre-plan. The software analogy here is to the idea of "zero cost abstractions". Can the compiler basically delete ready/valid semantics that are happening between blocks? Can it re-time things across pipe stages as you move from one clock domain [ed: i.e. old target frequency] to another clock domain [ed: i.e. new target frequency]? Or one process node to another process node? These are all questions of interest.

So you can actually have IP that's reusable between different projects that are on different process nodes and in different clock domains [i.e. target frequencies] which we actually do have at Google.

Again, most important is that we make progress on methodology! All advanced methodologies in my book are good.


This was the varint streaming decoder that we saw previously, and this is an example showing we have a nice parameter space, that we can, if we want to, put machine learning on top of.

We have this nice fully-specified parameter space, and what we can do is put a tuner on top, which proposes a set of parameters for the block, does the XLS instantiation across cycles -- that gives us native simulation to figure out how fast it ended up going. For example, if it's a dynamic computation that doesn't take a fixed number of cycles, you need to play lots of different examples against it in order to determine performance. You do synthesis to get your cost estimate out [e.g. area], and that all feeds the auto-tuner.

There's a system called Vizier which is publicly accessible and part of the cloud offering, and that gives us here is a "throughput vs design area" plot. And if you look hard you'll see the "2D pareto curve" that we talk about a lot in this class, where there are certain numbers that are "high area and high throughput" and certain numbers that are "low area and low throughput", and on this kind of convex frontier over here we get to see what are the best possible designs? The fact you can flex across cycles means that you can do optimization natively with generation "underneath the hood" on this multi-cycle design.

Q: You mentioned ML previously?

A: This is a Machine Learning based tuner. It's taking all these parameters, and that's it's parameter space, and then you're asking it to select parameters, and then evaluating what was the throughput, and evaluating what was the cost after synthesis, and that's the plot that we're making over here.


Professor Renau's lab at UCSC was in this efabless contest and actually generated an XLS block that was a multi-cycle block by using a conversational approach with ChatGPT. Here they're giving the specification of what it's looking to do [as an instruction] -- it spits out the Rust-looking code. Then we ask if it can write a test and it spits out a test, which is really cool. I think that the fact these languages are more similar to software languages that have a large training corpus actually helps it bootstrap its ability and understanding in the space.


LLMs have also advanced from the time that they did this work. Now the latest Gemini Ultra model you can feed the whole language reference manual in one context window; i.e. it can understand the whole language reference manual in one "bite" which probably means we can get better results if we retry the experiment today.

This is the kind of cool thing that comes from "you don't have to get the LLM to understand how to do cycle timing properly", you need it to generate correct code, and then explore the space of possibilities of how it maps to cycles, you don't need it to do correct cycle mapping at the Verilog level out of the box.


There's a bunch of things we're actively trying to build. As I said, we're employing XLS for the bread-and-butter facilities that it's good at on products inside of Google, it's kind of "running the gamut" inside the company, but we're also constantly expanding the frontier of things we think it can be good at.


Our Historical focus is to take really complicated things that might be leaf-level Verilog modules, like complex algorithms, or things that are hard to close timing on, and doing that automatically. But we're trying to go up the stack towards building bigger and bigger components of the chip entirely without leaving XLS. Because you can imagine, your design and interactive feedback flow -- the more you can stay in this world the more you can get the nice benefits, and then when you stage out to Verilog then people are doing things more "on the boundary" of what you've created.


Also, people are always looking for what are some wide QoR comparisons vs things that people have hand-written, and that takes time. So for things that are big hairy blocks that no-one has time to do, we can easily kind of swoop in and say "here, we were able to do it very quickly!" But people are saying "take the time to bake off against something someone had written," and we're doing more of that too.


Some of the things we're trying to do now as part of the frontier and the roadmap.

Sharing resources better: when you say, "I don't need something to happen every cycle", you can start saving the adders and multipliers that you use in cycle 0 with the ones that you use in cycle 1, and this kind of sharing is all possible "undernath the hood" so that you can save on area and save on the functional units you create in those modes.

Dynamic throughput in different states: sometimes you'll say, when I get this command I'll do this thing and it takes 2 cycles, when I get this other command I'll do this other thing and it takes 3 cycles -- having that dynamism automatically show up and present on the ready/valid interface is part of what we're working on now as well.

Automatically arbitrating RAMs: this is a big part of a lot of designs, because we basically want to say "different parts of the computation might want to use a RAM", and we might know that these are mutually exclusive things as a user, but the compiler may not be able to prove it, or we may not want to wait the amount of time it takes for the compiler to prove it -- can we just stick an entity in there that arbitrates the RAM and shows "yes at runtime you never used the same RAM from the two different places".

As I was saying earlier, zero-cost abstraction means smashing the always blocks together. So we make one computation and another computation, can we inline them for zero cost so we don't have to do ready/valid across the boundaries. Although ready/valid is really nice for latency-insensitive composition, where we can get reuse [of our designed hardware components], people worry about the costs of having ready/valid signaling everywhere, because it's like having little queues all around. So the more you can smash things together and fully eliminate the cost by using the compiler, the more accepting people will be of latency-insensitive design description.

We also have lots of cool experiments with auto Design Verification (DV) -- I'm not going to be able to get into that too much due to time, but basically there's "fuzzers" from software, and we can point them at XLS hardware artifacts, because XLS works like software. That's really great at finding bugs in random corners of the design more automatically.


We also have Foreign Function Interfaces (FFI) for instantiating combinational Verilog modules, like we saw with the divide unit. We should be able to do that for an entire communicating process, we should be able to say "this Verilog module acts like a communicating process shaped like ${THIS}" and we don't have that yet.


We're also working on an ECO (i.e. Engineering Change Order) flow, so that we can minimize the change to Verilog that we emit when you change something at the XLS level. This is very important for people who are going through the Physical Design (PD) process: at certain points in the design cycle they expect things not to change very much, and that they can basically freeze at the Verilog input level, and that any subsequent Verilog changes would be very minor. So, if your tool can only make large changes, that can be an impediment to getting people to use higher level techniques.


Since I think we have 5-ish more minutes I'll try to race through some more goodies.

All it takes to make that really fast test bench I was talking about that runs at blazing fast speed is these 30-ish lines. You basically create a builder, you say "this is the thing I want to run", "this is the thing I want to compare it to". You say "I'm gonna run that with num_samples=2**32" and it runs at 337 mega-samples per second, with 7-ish seconds remaining, it takes about 100nsec per sample, and it completes in about 12.5 seconds to do a 2^32 space.

So this is really the power of native compilation -- you take the original design at the function level, you map it into code, and it's as if you wrote C++. So everything you wrote here is effectively running through the same compiler, and at the same or possibly even better speed (because of exactly matched semantics) vs C++ that did the same thing.


Another kind of fun thing is quickchecks. Has anybody heard of quickcheck before? Quickcheck is originating from I think Haskell ecosystem and John Hughes -- basically you write a function and you have some generative thing that produces inputs to the function to check it over a whole bunch of values.

Here's an example where we're checking the "one hot predicate" -- if a value obeys this, it's one hot. So we run the one_hot built-in function on an arbitrary value and checking it obeys this property.

Here's another example where we take a 4-bit value, convert it to a boolean array, reverse, it, and then convert it from a boolean array to a bits-value, but with the opposite bit-order. These are the kinds of properties you can come up with when you write a function to say "if this was true for all inputs, then everything should be good".

You can run the DSL interpreter, which is our normal mode for fast-feedback operation, and that'll run a thousand instances and say "I didn't find a counterexample there". But you can also take that same quickcheck and you can prove it true for all inputs. So you can say "over the entire space of u3s or u4s this was true for all possibilities". And you can prove this because we have a way to turn the XLS IR, the program-level description, into something that a SAT solver knows how to chew on. All of the semantics are fully defined so we can enqueue any program into that SAT solver.


This is the IR visualizer that I like more than the table view we saw before. What's cool about this is you can hover over individual nodes and zoom in on them. First you say "show me the critical path" and then it pulls up this graph. I have an interactive demo but I wasn't able to get it running before we started the class, I'll be able to show folks after class. You're able to pull up your function, see what the critical path is, and then use this to reduce your critical path and close on timing, and eliminate unnecessary parts of computation, or figure out for a given process node how long you're expecting something to take.

These nice tools are also benefits of being kind of "internal" to the hardware emission process and how the program is represented.


There's also really nice things now called Language Servers, which can plug into various editors -- the DSL (DSLX) has a language server that just requires a few lines of code to plug in to your editor. What nice is that then, as you type, you get error messages and warning messages, and fast feedback for that iterative style development process we like in modern code development style. There's also things like autoformatters as goodies for this domain specific language.


That's pretty much all I had, let's open for longer form Q&A.


Q: Are there open source algorithms for converting the IR into the SAT solver format?

A: All of the XLS IR semantics are online -- really it's a question of defining for your IR what precisely is every operation doing. For things like and/or/xor there's a direct correspondence to what a SMT solver interface will do. It will actually give you a builder where you can build up those particular things. All the code for this is open source, everything I showed on the slides is all open source, so you can go look at the translator that walks the XLS program description and then builds the corresponding input for the SMT solver. Only rarely do you have to kind of break things down into a bit level, usually SMT theorems are high level enough that you can directly enqueue things from an IR like ours.


While I'm waiting for questions I'll also show -- once you push the result to Verilog through XLS -- in the "XLS playground" (which was the second Colab notebook link), you can synthesize this and place-and-route (P&R) it with the open source tools Yosys and OpenROAD -- Yosys does synthesis, and OpenROAD does place-and-route. We can hit play on this cell, and you get very fast feedback from this as well, it's usually order 20-30 seconds is all, and this is running in a colab environment, so it's not a giant machine or anything. So you can get fast and interesting results from doing ASIC-style synthesis, even in a colab environment using the open tool set today, which is great for reproducible research and collaboration opportunities, we collaborate with people in this flow.


Q: Are there examples of using the XLS with memories?

A: Yeah, I mentioned the RAM facilities we have, if you go to the xls/examples directory there's a ram.x file as a canonical example of using a RAM -- it's been designed so it can work with different portedness of RAM. Potentially you can select it, and potentially you can ask "what should I select for this particular computation". But a lot of that comes down to, if you e.g. have two sends you want to have happen on the same cycle, you're going to need two ports to service that if it's going to happen every single clock -- a lot of that's determined by your communication pattern.


Q: Earlier on when you had delay numbers, was that model driven or was that actually doing synthesis, how was that done?

A: In the documentation there's also more on this which I'll pull up. What we do is we do a characterization process, which is just a script, and for any open PDK the script runs through OpenROAD. What it does is it sweeps out a whole bunch of points for an operation being synthesized for a particular process node, through a particular synthesis tool. This is effectively what an RTL designer kind of does in their head when they're mapping out a pipeline on a whiteboard, say. They're like, eh I think this adder is going to take this many picoseconds (or this many logic levels, which is some approximation for picoseconds), and then ultimately they're figuring out where they're going to try to make the cut. But if they mis-estimate, it's difficult to go back and revisit, because you've already drawn the pipe stages in a particular place or particular way. So the characterization is also removing some of the uncertainty that the designer might have from process delay estimation. It's doing a simple linear combination, there's not currently wire loads or things in there -- there's room to improve the delay model, which we are also working on, but this is mostly better than what somebody would do in their head I would claim -- what they do in their head is kind of a rough approximation of what we do at least mathematically.


Q: Followup to two accesses per cycle needing two ports...

A: We don't do RAM inference in the traditional sense, we most expect that people will have picked their RAMs up front and then try to fit for the constraints of talking to that RAM, and that may end up backpressuring as a result; e.g. if I have two accesses per cycle and I only have a single ported RAM, there's no choice [ed: without generating more complex and possibly surprising store-load queue hardware]. As you describe the communication pattern and if it can backpressure you this is what ends up happening. One of the things that traditional HLS tried to do was to do all RAM inferencing and instantiation under the hood, but that was one of the things that people I think feel really took control away. People like to centrally instantiate RAMs and plumb them down as a pre-phase of design and planning. So this enables that to happen -- you could put a convenience layer on top that tried to infer some of these things and picked something to instantiate, but that doesn't come "out of the box" in any way.

Q: People not in the team using the language giving feedback? How did you change it based on feedback?

A: Constantly evolving it based on feedback from people not-on-our-team using it. Best feedback, in a way, is that it's a "small but surprisingly rich for what people want to do" language. It doesn't have traits like Rust has traits today, which is unfortunate because there's a lot more cool generic things you could do if we did have say, numerical traits, you could swap out floats for fixed point and similar, as has been talked about in this course. Because we're building XLS iteratively we're often replacing what RTL or chunks of RTL has been designed to do on the leaf level, and so our expressive power is pretty good most of the time compared to what we're replacing or substituting for. So it's going to continue to build out over time, and we do get feedback from people not on our team on things they'd like to see better, or things they'd like to have as features. Parameterized modules come from protobufs right now -- you can import a config for your design and people are asking if we can have higher order modules and things like this. Some interesting questions to be figured out there, like how you're expected to parameterize more universally, right now you can parameterize on a struct full of options.


Q: [ed: about trading off accelerator landing zones vs general purpose CPUs.]

A: One of the interesting things in the history of HLS -- one of the basic premises was, if we write something that's kind of like software, we could choose to either put it on the software side (e.g. as embedded code), like in firmware, or we could choose to put it in an embedded hardware block. And you can make that choice in more of a late bound way if you use "the same" or "similar" description on both sides, right? And so the choice between which one happens is largely going to depend on: do you have the ability to add a new functional unit to the thing you're working in? If you don't control the core, you can't put a new functional unit in the core. Do I have the ability to hide the latency of offloading to some kind of peripheral device, if I have the ability to add a peripheral device? This notion of "what is the overhead of getting out to a thing that would do acceleration and getting back" vs "what is the cost of doing it in the general purpose core". That's the main consideration that usually comes to mind. You have to do empirical study and evaluation for what workloads you get, what the cost of offloading is, what's the context switch overhead, can I hide the work I'm doing when I offload it. These are all parts of "when do I make an accelerator block" considerations.

tags: hardware