Thursday, March 07, 2013

Extreme debugging - a tale of microcode and an oven

It's been quite a while since I debugged a computer program. Too long. Although I miss coding, the thing I miss more is the process of finding and fixing bugs in the code. Especially the really hard-to-track-down bugs that have you tearing your hair out - convinced your code cannot possibly be wrong, that something else must be the problem. But then when you track down that impossible bug, it becomes so obvious.

I wanted to write here about the most fun I've ever had debugging code. And also the most bizarre, since fixing the bugs required the use of an oven. Yes, an oven. It turned out the bugs were temperature dependent.

But first some background. The year is 1986. I'm the co-founder of a university spin-out company in Hull, England, called Metaforth Ltd. The company was set up to commercialise a stack-based computer architecture that runs the language Forth natively. In other words, Forth is the equivalent of the CPU's assembly language. Our first product was a 16-bit industrial processor which we called the MF1600. It was a 2-card module, designed to plug into the (then) industry-standard VME bus. One of the cards was the Central Processing Unit (CPU) - not a microprocessor, but a set of discrete components built from fast Transistor-Transistor Logic (TTL) devices. The other card provided memory, input-output interfaces, and the logic needed to interface with the VME bus.
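
For readers who have never met Forth: it's a stack language, where a program is a sequence of words that push, pop and combine values on a stack. A tiny flavour, runnable on any standard Forth:

2 3 + .   \ push 2, push 3, add them, print the result: 5

On the MF1600, simple words like + were, in effect, single CPU instructions - that's what running Forth natively means.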

The MF1600 was fast. It ran Forth at 6.6 Million Forth Instructions Per Second (MIPS). Sluggish of course by today's standards, but in 1986 6.6 MIPS was faster than any microprocessor. PCs of the day were powered by the state-of-the-art Intel 286, with a clock frequency of 6 MHz, managing around 0.9 assembler MIPS. And because Forth instructions are higher level than assembler, the speed differential was greater still when doing real work.

Ok, now to the epic debugging...

One of our customers reported that during extended tests in an industrial rack the MF1600 was mysteriously crashing. And crashing in a way we'd not experienced before when running tried and tested code. One of their engineers noted that their test rack was running very hot, almost certainly exceeding the MF1600's upper temperature limit of 55°C. Out of spec maybe, but still not good.

So we knew the problem was temperature-related. Now any experienced electronics engineer will know that electrical signals take time to get from one place to another. It's called propagation delay, and these delays are normally measured in billionths of a second (nanoseconds). Propagation delays tend to increase with temperature. Like any CPU, our MF1600 relies on signals getting to the right place at the right time, and if several signals have to reach the same place at the same time, then even a small extra delay in one of them can cause major problems.

On most CPUs, when each basic instruction is executed, a tiny program inside the CPU actually does the work of that instruction. Those tiny programs are called microcode. Here is a blog post from several years ago where I explain what microcode is. Microcode is magic stuff - it's the place where software and hardware meet. Just like any program, microcode has to be written and debugged, but uniquely, when you write microcode you have to take account of how long it takes to process and route signals and data across the CPU: 100 ns from A to B; 120 ns from C to D, and so on. So if the timing in any microcode is tight (i.e. it only just allows for the normal delay and leaves no margin of error), that microcode program could crash at elevated temperatures.
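
To make those timing budgets concrete, here's a toy sketch in Forth. The numbers, and the word SLACK, are made up for illustration - the real figures came from the CPU's delay chart:

\ Toy illustration only: a microcode step is safe if the worst-case
\ path delay fits inside that step's time budget, with room to spare.
: SLACK ( path-ns budget-ns -- slack-ns ) SWAP - ;

120 150 SLACK .   \ 30 ns of slack at room temperature: fine
148 150 SLACK .   \ heat stretches the path: only 2 ns left - 'tight'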

So, we reckoned we had one, or possibly several, microcode programs in the MF1600 CPU with 'tight' timing. The question was, how to find them.

The MF1600 CPU had around 86 (Forth) instructions, and the timing bugs could be in any of them. Now testing microcode is very difficult, and the nature of this bug made testing even harder. A timing problem at elevated temperatures means that testing the microcode by single-stepping the CPU clock and tracing the signals through the CPU with a logic analyser wouldn't help at all. We needed a way to efficiently identify the buggy instructions; we could worry about debugging them later. What we wanted was a way to exercise single instructions, one by one, on a running system at high temperature.

Then we remembered that we didn't need all 86 instructions to run the computer. Most of them can be emulated by putting together a set of simpler instructions. So a strategy formed: (1) write a set of tiny Forth programs that replace as many of the CPU instructions as possible, (2) recompile the operating system, then (3) hope that the CPU runs ok at high temperature. If it does, then (4) run the CPU in an oven and, one by one, test the replaced instructions.

Actually it didn't take long to do steps (1) and (2), because Forth programs already existed to express the more complex instructions as sets of simpler ones - many Forth implementations on conventional microprocessors were built exactly that way. In the end we had a minimal set of about 24 instructions. So, with the operating system recompiled and installed, we put the CPU into the oven and switched on the heat. The system ran perfectly (though a little slower than usual), and continued to run well above the temperature at which it had previously crashed. A real stroke of luck.

Here's an example: MIN, a simple Forth instruction that replaces the top two values on the stack with the smaller of them, expressed as a Forth program:
: MIN  ( a b -- min )  OVER OVER > IF SWAP THEN DROP ;
(From my 1983 book The Complete Forth).
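
For example, at the keyboard:

3 7 MIN .   \ prints 3
7 3 MIN .   \ prints 3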

From then on it was relatively easy to run small test programs to exercise the other 62 instructions (which were of course still there in the CPU - just no longer used by the operating system). A couple of days' work and we had found the 2 rogue instructions that were crashing at temperature. They were, as you might have expected, rather complex instructions. One was (LOOP), the instruction that does the run-time work of Forth do loops.
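
A sketch of the sort of soak test involved - not the original code, and HAMMER-LOOP is a name invented here for illustration:

\ Hammer the CPU's (LOOP) instruction as hard as possible: LOOP
\ compiles the run-time word (LOOP), so this executes it endlessly.
: HAMMER-LOOP ( -- )
   BEGIN
      10000 0 DO LOOP   \ ten thousand trips round an empty do loop
   AGAIN ;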

Debugging those instructions then simply required studying the microcode and the big chart with all the CPU delay times, over several pots of coffee. We knew (or strongly suspected) that what we were looking for were timing problems, called race hazards, where the data from one part of the CPU just doesn't have time to get to another part in time to be used for the next step of the microcode program. Having identified the suspect timing, I then re-wrote the microcode for those instructions to leave a bit more time - by adding one clock cycle (50 ns) to each instruction.

Then came the moment of truth: reverting to the old, non-patched operating system, back in the oven, cranking up the temperature while the CPU ran test programs specifically designed to stress those particular instructions. Yes! The system didn't crash at all, over several days of running at temperature. I recall pushing the temperature above 100°C. Components on the CPU circuit board were melting, but still it didn't crash.

So that's how we debugged code with an oven.

25 comments:

  1. Great story. Thanks for this.

    I remember *Complete Forth*, and I loved the idea of the language, but the Forth implementation I got for the ZX Spectrum was really quite painful to use, because everything concentrated on the language itself, not on operability. It wasn't very usable - or else the instructions weren't good enough for me to pick it up.

    1. Thanks pjt - glad you enjoyed. Yes, Forth was (is) more programmer-friendly than user-friendly, and in most Forth implementations the user interface was just a command line. It was a radical concept - the language interpreter/compiler *is* the operating system - but, as you say, not user friendly.

  2. Thank you for such a great article. Fascinating. I have so many good memories of FORTH; I was spoiled, and never really worked well with more elaborate object-oriented languages. Amazing - so glad I found your blog.

    1. Thanks Constantino - very glad you enjoyed my article :)

  3. Thank you for a really excellent article.

    I'm sure I would have loved Forth, as "thinking in stacks" came pretty naturally after using and programming HP RPN calculators, but the lack of a good IDE on my system at the time made it a pain, so it was too brief an encounter.
    Your description of microcode as the magic that connects hardware and software is spot on. Learning about microcode and writing a minimal set of instructions was an eye-opener for me, and necessary to really understand how a computer works.
    Thank you again!

  4. Hi. Old FORTHy here. FORTH has been my embedded debug tool of choice since about that same time; I still have it embedded in my current C code. Very sharp tool - a double-edged sword with no handle - the concomitant of power. Though I'm moving to Lua now, since I need other people working with me.

    I have a practical question for your then-self:

    Why did you spend all that effort on building your own chip when the Harris chip was available at that time? Maybe not quite as fast in clock cycle, but it encoded up to 3 words per instruction. We had it running a real-time rendered valley scene with sunrise-through-sunset shading changes, with the video bit-banged in code. As its wallpaper.

    1. Thanks David. Yes, I've not programmed in Forth for 20 years - but have very fond memories of the language.

      Re your question: I'm pretty sure the Harris Forth chip wasn't available until 1988. Harris licensed the Novix chip - the brainchild of Forth inventor Charles Moore. I recall the Novix being talked about in 1986, but only as a prototype. We had been working on our MF1600 since 1984 (earlier, if you count the bench prototype built in the Hull University electronics department).

    2. Actually, you're right, it was the Novix at first. I was involved (peripherally) in the reference design board in 1985. Which makes a lot more sense. Ancient memory, timeline not matching up perfectly at that distance, or tape dropouts, I guess :)

  5. Great story. If someone made a TV series about engineers solving problems like this, I'd watch it. Much more interesting than a cop or lawyer show.

  6. I dealt with a similar problem a couple of weeks back, except that the chip refused to respond unless heated to 100°C or so. It turned out that during the soldering process (we were using that board to learn how to solder BGA parts) the part had been overheated and had warped ever so slightly, and only took on the right shape when hot.

    We decided to use leaded solder balls after that.

  7. Great story. It is amazing the methods we use for debugging when we really want to solve a problem: spark (EMI) generators, heating and cooling. I love doing test setups to recreate that "one in a million" bug that just so happens to have occurred at a very important customer's site.

  8. Wow - I remember you guys -- a few of us were seriously into FORTH, what with the Jupiter Ace and so on. When we heard about a chip running it natively, we were excited beyond imagination - it's so nice to hear a story like this from back in the day...

  9. I actually built a signal processing system using AMD bit-slice processors, and had to write quite a bit of microcode. Great story! (This was to get my PhD at CMU's Robotics Institute, by the way.)

    I have one for you. Recently, I was using an "ethernet over AC wiring" system at home. It allowed me to have a network across several rooms without relying on WiFi for communication. My wife and I remodeled a small area, and ended up adding a dimmer switch for new lights. That area has some electronic equipment.

    The electronic equipment would work fine during the day, then just stop altogether at night. Eventually I realized that, to look at what was going on at night, I had to turn on the lights. Lo and behold, the dimmer was interfering with the AC ethernet system! I was finally able to take a laptop, hook it up to an Internet speed test site, and move the speed up or down using the dimmer!

    The solution was to replace all the dimming circuitry with DC instead of AC switches and lights.

    I love debugging. All the best. Rafael.

    1. Light dimmers, yeah. Nice mood effects, but they have an impact on your mood with all the faults they cause. Not surprised you discovered this. They are also very unreliable - they often fail when a lamp goes bang. And the newer ones, which are supposed to reduce interference, are complete rubbish.

    2. I also love debugging. Fortunately so. My first job was debugging room-sized discrete-component computers (Elliott 4130) as they rolled off the production line - all hand-wired, even. To my surprise I was a natural.

      It was the 2901 (IIRC?) that brought me to LA in the first place, in 1975. We had to replace the Data General Nova 1200 with our own design because DG suddenly moved delivery dates out 12 months, and the 290x reference design implemented the 1200's microcode. Did the design, then everything changed and the project was dropped. I went on to do a lot of Z80 datacomms products, including the first explicitly named Stat Mux.

      Can you spell "critical section"? Apparently, at that young age, I could not - not perfectly. Mostly, but not perfectly. Three years after our product launched we got a bug report from a warehouse that was losing 6 or so pick tickets a day. I spent two days sitting on the floor there with an HP logic analyzer (a 1615 - remember it? Still have one in a closet somewhere) hooked up to the CPU busses before I found it. Two lines of assembler needed protecting. The window was so tiny it had never been hit before. But this particular use involved sending tickets to the printers continuously, all such messages were the same length, and this resulted in a resonance effect such that the offending interrupt wandered around within a few tens of microseconds of the open window.

      Which story, btw, is why I am a fervent believer in debugging real-time systems by observing them behaving, rather than using debuggers and breakpoints, which only leave you a post-mortem. Sadly, today's system busses bear little resemblance to the actual code execution path, so logic analyzers are no longer useful this way. Which is why I moved on to embedding FORTH in my systems: I could explore what needed to be watched without having to stop and restart the system. And FORTH is exceedingly lightweight in both footprint and execution, yet one of the most flexible things at that level. Just obscure.

  10. Forth reminds me of my youth, when I did programming and learned a lot of things that I later didn't need any more. It was a happy time, being able to count machine cycles...

  11. Did making (loop) and other complex branch constructs single instructions really save that much?

  12. I'm not at all familiar with Forth or what it means for it to be stack-based. (I thought the stack was important in most languages.)
    But I do love a good debugging story, and this was an interesting scenario. Thanks.

  13. ...and why was the test rack running so hot? :) Just kidding - great story. Whenever you think you've got it rough, just imagine Charles Babbage doing arithmetic on a mechanical computer the size of your bathroom.

  14. One time I was trying to get an optical fiber network to work right. Every so often, around 4 pm, it wouldn't. It took a while to track down - it varied from box to box, but not by time of day. Finally I realized it was the red dust covers over the second relay connector set. It was a nice early spring in Redondo Beach and we tended to have the lab door open for the evening breeze, and the sun was shining through the covers and putting a DC offset on the pipes. We switched to all-black covers and the problem went away.

    Lots of stories. Lots.

  15. Thank you for sharing this great story! I think it is through such debugging sessions, trying to find such non-trivial bugs, that one gets a much "deeper" understanding of the entire system.

  16. Great article Alan! Inspired me to blog about another odd debugging moment: [http://cc-logic.com/blog/2013/4/4/extreme-debugging-with-a-rental-car]

  17. (Ah, are comments only visible after being approved? If so, the last three from 2013 shouldn't have been - they're comment link spam.)
