Nearly Doubling My CPU's Clock Speed by Removing Complexity
How simplifying my instruction fetch path improved Fmax from ~25 MHz to ~45 MHz on my custom dual-issue CPU.
Over the past few weeks, I’ve been grinding on timing closure for my custom dual-issue CPU core (written in SystemVerilog for a Lattice FPGA).
Today, something huge happened: I improved the max clock frequency from ~25.26 MHz to 44.92 MHz.
And the wild part?
I did it by removing logic, not adding more clever tricks.
What Changed?
My instruction fetch stage originally had a pretty complex alignment system.
It was designed to handle instruction boundaries and speculative fetch behavior, but after reviewing the ISA and memory model, I realized:
I didn’t need alignment logic at all.
It was dead weight sitting directly on the critical path.
So I ripped it out.
The new fetch path is dramatically simpler, cleaner, and easier to pipeline.
The Timing Report Before
Here’s what the project was hitting earlier:
- Max frequency: ~25.26 MHz
- Critical path: Fetch → Alignment → Decode
The alignment logic had multiple nested layers, case trees, and shifting/merging operations.
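On an FPGA LUT fabric, that’s basically a death sentence for timing: every extra layer turns into more LUT levels and more routing between two registers, and all of that delay has to fit inside a single clock period.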
The Timing Report After
After removing the alignment machinery, here’s what I saw:
ERROR: Max frequency for clock '$glbnet$clk$TRELLIS_IO_IN': 44.92 MHz (FAIL at 50.00 MHz)
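Yes, that’s still an ERROR: the design misses the 50 MHz constraint. But going from 25.26 MHz to 44.92 MHz is about a 1.78x jump, so I almost doubled the frequency while keeping the same 5-stage pipeline structure.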
This is the closest I’ve ever been to hitting my target of 50 MHz on this core.
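To make those numbers more concrete, here is a small Python sketch that converts the reported frequencies into clock periods. The three frequencies come straight from the timing reports above; the helper name `period_ns` and the variable names are made up for illustration.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


before_mhz = 25.26  # Fmax with the alignment logic in place
after_mhz = 44.92   # Fmax after removing it
target_mhz = 50.00  # constraint shown in the timing report

# How much faster the clock can run now.
speedup = after_mhz / before_mhz

# How much shorter the worst-case path got.
saved_ns = period_ns(before_mhz) - period_ns(after_mhz)

# How much more delay has to go before the 50 MHz constraint passes.
remaining_ns = period_ns(after_mhz) - period_ns(target_mhz)

print(f"speedup: {speedup:.2f}x")
print(f"worst path: {period_ns(before_mhz):.2f} ns -> {period_ns(after_mhz):.2f} ns")
print(f"saved: {saved_ns:.2f} ns, still to cut: {remaining_ns:.2f} ns")
```

In period terms, the worst-case path shrank from roughly 39.6 ns to roughly 22.3 ns, and it needs to lose a bit over 2 ns more to fit the 20 ns period that 50 MHz requires.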
What I Learned
This was a powerful reminder:
- Performance often comes from simplification, not complexity.
- Long combinational chains are the true enemy of Fmax.
- FPGA design rewards clarity over cleverness.
- Sometimes stepping back and questioning assumptions pays off more than optimizing what’s already there.
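The point about long combinational chains can be shown with a toy model: Fmax is set by the slowest register-to-register path, so shortening any other path changes nothing. This is only a sketch. The per-stage delays below are hypothetical numbers picked for illustration, not measurements from my core, and `fmax_mhz` is a made-up helper.

```python
def fmax_mhz(path_delays_ns) -> float:
    """Fmax in MHz, limited by the slowest register-to-register path."""
    return 1000.0 / max(path_delays_ns)


# Hypothetical path delays in ns (NOT taken from the real timing report).
with_alignment = {"fetch+align": 40.0, "decode": 18.0, "execute": 22.0}

# Same design with the alignment logic gone: only the fetch path shrinks.
without_alignment = {"fetch": 12.0, "decode": 18.0, "execute": 22.0}

fmax_before = fmax_mhz(with_alignment.values())
fmax_after = fmax_mhz(without_alignment.values())

# The limiting path moves to whichever stage is slowest now.
new_critical = max(without_alignment, key=without_alignment.get)

print(f"before: {fmax_before:.1f} MHz, after: {fmax_after:.1f} MHz")
print(f"new critical path: {new_critical}")
```

In this made-up example the fetch path drops well below the others, so the limit moves to a different stage. That is the same pattern I’m seeing in the real design: once one path is fixed, the next slowest one takes over.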
As a 16-year-old building a dual-issue CPU from scratch, seeing a jump like this is insanely motivating.
What’s Next?
Now that fetch is no longer dominating the critical path, I’m turning my attention to:
- The decode → issue boundary
- Hazard logic delays
- Register file fanout
- Better pipelining between ID and EX
Screenshots, timing plots, and waveform captures coming soon.
If you have suggestions for improving Fmax further, especially around dual-issue decode stages, feel free to reach out.