Tech Talk
Software Debugging
Doug Bell
Principal Engineer
HARMAN Luxury Audio Group
In the software world, there are numerous techniques that can be used to solve problems. Issues can originate from timing, logic, code implementation, initialization, sequencing, typos, hardware design errors, code design, and more.
Code in the Mark Levinson world is “embedded software” – software not run on a PC but on specialized hardware with various peripherals.
Techniques – Basic
Not all debugging techniques are straightforward or efficient. There are various hardware and software debuggers that can be employed, though often at a cost in money and time, especially with complex systems. One method is to connect a debugger from a PC to a standard connection called JTAG (named after the Joint Test Action Group, which codified it) on a circuit board. Such a debugger connects a PC (used to develop code) to a circuit board that is designed with a JTAG header. It allows a software developer to see the values of variables and to monitor code flow as it happens; it further allows control of the code through the ability to stop, modify, and resume. It is a tool, however, not magic: the user must know what to do with it.
Sometimes hardware can provide test points that allow hardware events to be correlated with software events, which increases observability. Some hardware – Field Programmable Gate Array chips, AKA FPGAs, for example – is configurable so that test points can be added after the board is manufactured. One can look at test points with an oscilloscope or a logic analyzer for some indication of what is going on in software.
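As a minimal sketch of how software can drive such a test point, the code below raises a spare output pin on entry to a routine and lowers it on exit, so the pulse seen on an oscilloscope marks when the routine ran and for how long. Everything named here is an assumption for illustration: test_point_reg stands in for a memory-mapped GPIO output register (an ordinary variable is used so the sketch compiles on a PC), and TP1_MASK and process_sample are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped GPIO output register.
 * On a real board this would be a fixed address taken from the
 * chip's documentation; a plain variable is used here instead. */
static volatile uint32_t test_point_reg = 0;

#define TP1_MASK (1u << 0)  /* hypothetical: bit 0 drives test point 1 */

static inline void tp1_high(void) { test_point_reg |= TP1_MASK; }
static inline void tp1_low(void)  { test_point_reg &= ~TP1_MASK; }

/* Hypothetical routine under investigation. */
static uint32_t process_sample(uint32_t sample)
{
    return sample * 2u;
}

/* Raise the test point before the call and lower it afterwards, so the
 * pulse width on the pin corresponds to the time spent in
 * process_sample(). */
uint32_t process_sample_traced(uint32_t sample)
{
    tp1_high();
    uint32_t result = process_sample(sample);
    tp1_low();
    return result;
}
```

Toggling a pin costs only a few instructions, so it tends to disturb timing far less than printing a message would.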
A standard debug method is to print messages to some output in order to see what is happening as the code runs. If there are a lot of messages happening quickly, it helps to have these written to a log file for later analysis. Perfect, right? No. Catching the important information and interpreting it can be challenges in themselves. Even if the developer spots some interesting message, he or she might need to know the context in which it occurred.
Further, it can be time consuming to add such messages, rebuild the code, get the code from the PC onto the system, start up the system, try to initiate the issue, and await some event. Code can be implemented so that various messages can be enabled or disabled “on the fly” without rebuilding, as is done in Mark Levinson products. Deeper investigations can still require new messages to be added to the code.
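One common way to make messages selectable at run time is to give each message a category bit and to keep a mask of enabled categories. The sketch below is an illustration under stated assumptions, not the method used in any particular product: the category names, dbg_enable, dbg_disable, dbg_printf, and the DBG macro are all hypothetical. The macro also prefixes the source file and line, which supplies some of the context a bare message lacks.

```c
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical message categories; one bit each so they can be combined. */
enum {
    DBG_AUDIO   = 1u << 0,
    DBG_NETWORK = 1u << 1,
    DBG_POWER   = 1u << 2
};

/* Categories currently enabled. Everything is off by default. */
static uint32_t dbg_enabled_mask = 0;

void dbg_enable(uint32_t categories)  { dbg_enabled_mask |= categories; }
void dbg_disable(uint32_t categories) { dbg_enabled_mask &= ~categories; }

/* Print the message only if its category is enabled.
 * Returns 1 if the message was printed, 0 if it was suppressed. */
int dbg_printf(uint32_t category, const char *fmt, ...)
{
    if ((dbg_enabled_mask & category) == 0) {
        return 0;
    }
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    return 1;
}

/* Convenience wrapper that prefixes the source location for context. */
#define DBG(category, ...) \
    (dbg_printf((category), "%s:%d: ", __FILE__, __LINE__) && \
     dbg_printf((category), __VA_ARGS__))
```

In a real system the calls to dbg_enable and dbg_disable would typically be triggered by a command received over a console or other control interface, so that a category can be switched on only while the issue is being reproduced.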
Where to Start
Confirming assumptions is good practice before diving into a problem. Increasing knowledge about a system is a good starting point – read and understand relevant code to a degree that is sensible. It often proves useful to address any anomalies seen rather than dismissing them as irrelevant; they are frequently symptoms of some underlying issue and can lead to a solution.
Often it can be useful to reduce the original system to a smaller or minimal system, perhaps with standard configuration, and see if the issue persists. Understanding the system (including how the hardware should behave) is an important component of problem solving but it can be impractical to understand all aspects of complex systems, so isolating an issue can be critical. Speaking of isolation…it can be useful to decouple events in a system by changing timing. Delays can be added or modified to change the relation in time of one function to another.
Another method is to compare some type of working system to the failing one. However, one issue here is that the systems might differ sufficiently that a comparison isn’t sensible.
Fun with Debugging
Applying some appropriate stimulus – hardware or software – to the system, or making some code change, is typical too. Creating such a stimulus is often not trivial; applying it properly and at the right time can also be tricky. It is normally best applied with some explicit intention.
Most often in making code changes it is best to change one item at a time, though there can be a balancing act between modifying code and running tests. It can be crucial to maintain specific system configurations and to be able to recall which setup produced which output.
Intermittent Bugs
One particularly difficult challenge is intermittent bugs. Sometimes these involve complex systems that are difficult even to set up, let alone to coax into reproducing a rare issue. Engineers have likely heard “it worked on my system.” Here particular attention must be paid to ensuring that the systems being compared are similar. One technique is to make the problem worse by various means so that it becomes easier to detect. Sometimes it comes down to letting a system run overnight while capturing a log to be examined later. May your bugs be big enough to find!
And Then There Were Threads
Some systems have multiple “threads” or “tasks”: sequences of operations that largely run independently of each other. Issues arise in the coordination of threads, for example when two threads access the same data at the same time. Properly designed code uses synchronization methods to avoid such issues. Improperly designed code can lead to issues that are seriously difficult to identify, reproduce, and find.
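The text does not name specific methods; one widely used one is a mutex, which lets only one thread at a time touch shared data. The sketch below assumes POSIX threads so that it can run on a PC (an embedded system would more likely use the equivalent primitive of its own operating system), and the names shared_counter, counter_lock, worker, and run_two_workers are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define INCREMENTS_PER_THREAD 100000

/* Data shared by both threads, and the lock that protects it. */
static long shared_counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the shared counter many times. The lock makes
 * every read-modify-write of the counter happen as one unit. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&counter_lock);
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Run two workers to completion and return the final count,
 * or -1 if a thread could not be created. */
long run_two_workers(void)
{
    pthread_t a, b;
    shared_counter = 0;
    if (pthread_create(&a, NULL, worker, NULL) != 0) {
        return -1;
    }
    if (pthread_create(&b, NULL, worker, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

Without the lock, the two increments can interleave and some updates can be lost, and the loss may appear only occasionally. That is the kind of intermittent, hard-to-reproduce symptom that makes threading bugs difficult.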
Observations
Observations at a system level are a good starting point to hint at what is going on in code. For example, say a command like “standby” to some audio product fails. Did it fail only over the web interface but not over RS232? Parsing log files can be an effective way to arrive at a solution, but separating the wheat from the chaff can be some work. Logs can indicate timing, which can be key. Messages can indicate code flow too.
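Some of that separating can be automated. As a rough sketch, the function below counts the log lines that contain two given keywords, which is enough to answer a question like "how many failures came in over the web interface?" The function name count_matching_lines and the log format in the usage note are hypothetical; a real log would need its own keywords.

```c
#include <stddef.h>
#include <string.h>

/* Count the lines in a log (one text buffer, lines separated by '\n')
 * that contain both keywords. Lines longer than the local buffer are
 * truncated before searching. */
size_t count_matching_lines(const char *log, const char *first,
                            const char *second)
{
    size_t matches = 0;
    char line[256];

    while (*log != '\0') {
        /* Length of the current line, not counting the newline. */
        size_t len = strcspn(log, "\n");
        size_t copy = len < sizeof line - 1 ? len : sizeof line - 1;

        memcpy(line, log, copy);
        line[copy] = '\0';

        if (strstr(line, first) != NULL && strstr(line, second) != NULL) {
            matches++;
        }

        /* Advance past the line and its newline, if present. */
        log += len;
        if (*log == '\n') {
            log++;
        }
    }
    return matches;
}
```

Given a made-up log with lines such as "12.100 [web] standby FAILED" and "12.300 [rs232] standby ok", calling count_matching_lines(log, "[web]", "FAILED") and then count_matching_lines(log, "[rs232]", "FAILED") would show at a glance whether the failure is tied to one interface.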
Still Can’t Figure it Out?
Sometimes there is no reasonable method that will definitely converge on a solution. Experience and intuition can steer debugging here. It can be useful to list symptoms, what is known, anomalies, expected vs observed behaviors, and to just step back and brainstorm on what is involved with the issue and consider tests to try.
Is it Fixed?
Ideally one can find some obvious flaw in the code, fix it, see the symptom disappear, and happily move on. But for non-trivial code changes, were there ancillary effects of the change? Was any modification significant enough that substantial testing is required? If so, “regression testing” is done. Typically this is a set of tests that encompasses most of the functionality of a unit. It should confirm that a change didn’t unexpectedly affect proper operation. Good code design can help keep impacts localized.
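At a very small scale, a regression test can be a table of inputs and expected outputs that is rerun after every change. The sketch below is only an illustration: clamp_volume is a hypothetical function under test, and the cases are invented for it. A real regression suite would cover far more of a unit's functionality.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical function under test: limit a volume request to 0..100. */
int clamp_volume(int requested)
{
    if (requested < 0) {
        return 0;
    }
    if (requested > 100) {
        return 100;
    }
    return requested;
}

/* One regression case: an input and the output it must produce. */
struct volume_case {
    int input;
    int expected;
};

/* Cases chosen around the boundaries, where mistakes are most likely. */
static const struct volume_case volume_cases[] = {
    {  -5,   0 },
    {   0,   0 },
    {  50,  50 },
    { 100, 100 },
    { 101, 100 },
};

/* Run every case, report mismatches, and return the number of failures. */
int run_volume_regression(void)
{
    int failures = 0;
    size_t count = sizeof volume_cases / sizeof volume_cases[0];

    for (size_t i = 0; i < count; i++) {
        int actual = clamp_volume(volume_cases[i].input);
        if (actual != volume_cases[i].expected) {
            printf("FAIL: clamp_volume(%d) = %d, expected %d\n",
                   volume_cases[i].input, actual,
                   volume_cases[i].expected);
            failures++;
        }
    }
    return failures;
}
```

Keeping the cases in a table makes it cheap to add a new row whenever a bug is fixed, so the same bug is caught if it ever returns.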
It can happen that a potential fix doesn’t properly address the issue. Unless the fix was trivial, testing under various conditions (possibly worst-case) might reveal unexpected results.
Ideal Code
Of course, it is best to write or obtain code that is solid, testable, maintainable and documented in the first place. A huge topic for another day.