cpu-design fpga systemverilog microarchitecture performance

Nearly Doubling My CPU's Clock Speed by Removing Complexity

How simplifying my instruction fetch path improved Fmax from ~25 MHz to ~45 MHz on my custom dual-issue CPU.

Over the past few weeks, I’ve been grinding on timing closure for my custom dual-issue CPU core (written in SystemVerilog for a Lattice FPGA).
Today, something huge happened: I improved the max clock frequency from ~25.26 MHz to 44.92 MHz.

And the wild part?

I did it by removing logic, not adding more clever tricks.


What Changed?

My instruction fetch stage originally had a pretty complex alignment system.
It was designed to handle instruction boundaries and speculative fetch behavior, but after reviewing the ISA and memory model, I realized:

I didn’t need alignment logic at all.

It was dead weight sitting directly on the critical path.

So I ripped it out.

The new fetch path is dramatically simpler, cleaner, and easier to pipeline.


The Timing Report Before

Here’s what the project was hitting earlier:

  • Max frequency: ~25.26 MHz
  • Critical path: Fetch → Alignment → Decode

The alignment logic had multiple nested layers, case trees, and shifting/merging operations.
On an FPGA’s LUT fabric, every one of those layers adds LUT and routing delay in series, which is basically a death sentence for timing.
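
To put a rough number on why wide shift/merge logic hurts, here is a minimal sketch of a lower bound on logic depth: any signal that depends on N input bits needs at least ceil(log_k N) levels of k-input LUTs. The function name, the choice of k = 4, and the example input counts are all assumptions for illustration, not measurements from the actual fetch stage.

```python
def min_lut_levels(n_inputs, lut_width=4):
    """Lower bound on LUT levels for one output bit that depends on n_inputs bits.

    Assumes a fabric of lut_width-input LUTs (4 is an assumption here).
    Real designs are usually deeper, and routing delay comes on top.
    """
    if n_inputs < 1 or lut_width < 2:
        raise ValueError("need n_inputs >= 1 and lut_width >= 2")
    levels = 0
    remaining = n_inputs
    while remaining > 1:
        # Each level can merge at most lut_width signals into one.
        remaining = -(-remaining // lut_width)  # ceiling division
        levels += 1
    return levels


# Hypothetical input counts, chosen only to show how depth grows.
for n in (4, 16, 37, 64):
    print(f"{n:3d} inputs -> at least {min_lut_levels(n)} LUT level(s)")
```

The bound is per stage, so a chain like alignment followed by decode stacks these depths in series. That is why removing a whole stage from the path can matter more than shaving a level off any one of them.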


The Timing Report After

After removing the alignment machinery, here’s what I saw:

ERROR: Max frequency for clock '$glbnet$clk$TRELLIS_IO_IN': 44.92 MHz (FAIL at 50.00 MHz)

That is roughly a 1.78x improvement (25.26 MHz to 44.92 MHz) while staying in the same 5-stage pipeline structure.

The build still misses the 50 MHz constraint shown in that report (hence the ERROR), but this is the closest I’ve ever been to hitting my target of 200 MHz on this core.
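
The frequencies are easier to reason about as clock periods, since that is the budget the critical path has to fit into. Here is a small sketch that parses the report line above and does the arithmetic. The regex and helper names are my own, and the pattern assumes the report always follows the exact shape of that one line.

```python
import re

# The report line quoted above, plus the earlier Fmax figure from this post.
LINE = "ERROR: Max frequency for clock '$glbnet$clk$TRELLIS_IO_IN': 44.92 MHz (FAIL at 50.00 MHz)"
BEFORE_MHZ = 25.26

# Assumed line shape: "...clock '<name>': <fmax> MHz (PASS|FAIL at <target> MHz)"
PATTERN = re.compile(
    r"Max frequency for clock '(?P<clock>[^']+)': "
    r"(?P<fmax>[\d.]+) MHz \((?P<status>PASS|FAIL) at (?P<target>[\d.]+) MHz\)"
)


def parse_fmax(line):
    """Extract clock name, achieved Fmax, status and constraint from one line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not an Fmax report line")
    return {
        "clock": m["clock"],
        "fmax_mhz": float(m["fmax"]),
        "status": m["status"],
        "target_mhz": float(m["target"]),
    }


def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


report = parse_fmax(LINE)
speedup = report["fmax_mhz"] / BEFORE_MHZ
shortfall_ns = period_ns(report["fmax_mhz"]) - period_ns(report["target_mhz"])

print(f"speedup: {speedup:.2f}x")
print(f"critical path: {period_ns(BEFORE_MHZ):.2f} ns -> {period_ns(report['fmax_mhz']):.2f} ns")
print(f"over the {report['target_mhz']:.0f} MHz budget by {shortfall_ns:.2f} ns")
```

In period terms, the critical path went from about 39.6 ns to about 22.3 ns, which leaves a bit over 2 ns to find before the 20 ns budget of a 50 MHz clock is met.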


What I Learned

This was a powerful reminder:

  • Performance often comes from simplification, not complexity.
  • Long combinational chains are the true enemy of Fmax.
  • FPGA design rewards clarity over cleverness.
  • Sometimes stepping back and questioning assumptions pays off more than optimizing what’s already there.

As a 16-year-old building a dual-issue CPU from scratch, seeing a jump like this is insanely motivating.


What’s Next?

Now that fetch is no longer dominating the critical path, I’m turning my attention to:

  • The decode → issue boundary
  • Hazard logic delays
  • Register file fanout
  • Better pipelining between ID and EX

Screenshots, timing plots, and waveform captures coming soon.


If you have suggestions for improving Fmax further, especially around dual-issue decode stages, feel free to reach out.