Guidelines for Creating Robust Embedded Systems
Part 1 - Introduction
By Bob Japenga
This white paper is intended to provide a brief overview of
lessons learned in creating robust systems over the past 35 years of embedded
systems development (starting with the Intel 4040). It is divided into a number of parts – each
addressing a different issue. This part
introduces the topic; each subsequent
part will address one issue in depth.
Other white papers concentrate on different issues, such as
general software architectural approaches,
in order to create robust embedded systems.
These are of great value and an important part of creating robust
systems. This series of papers will
concentrate on specific lessons we have learned the hard way by creating
non-robust embedded systems when a bulletproof system was required. And we write them down because “wisdom
leaks.” We need to review these
periodically because these pearls of wisdom often get forgotten in the heat of
the development cycle.
We all know how to create non-robust systems (embedded or
otherwise). Not paying attention to
details, working with fuzzy specifications, blaming the unreliability of the hardware
(isn’t it always their problem?!), making changes at the last minute, and letting
the complexity exceed the experience base of the team are just a few of the
ways we can create these systems.
But over the years there have been a few things that we have
learned that have helped us create systems that run 24/7, fifty-two weeks a
year. Here is the list (in no particular order) that we will attempt to flesh
out in this paper. Each part of this
white paper will address one of these guidelines:
- Design for and test that there are no memory leaks.
- Create systems where out-of-bounds memory references are caught and allow the
system to recover.
- Check the availability of I/O before it is accessed.
- Gracefully handle out-of-memory conditions.
- Gracefully handle out-of-disk conditions.
- Put a watchdog on all of the critical tasks to verify that they are all
working properly, and take appropriate action when they are not. Watchdogs serviced by a simple interrupt
handler are not of much value, because a critical task or thread could have
crashed while the watchdog continues to be serviced.
- Check for and verify data integrity – don’t assume that the data written is
always going to be the data read.
- Build in data redundancy in critical areas.
- Provide liberal error logging that includes pruning of logged data.
- Create systems that work even after failures have occurred. For example, if an
error forces tripping the watchdog, that same error should not trip the
watchdog 50 times in 50 minutes.
- Create systems that know when to throw in the towel and not go into endless loops
(“You got to know when to hold ’em,
know when to fold ’em”).
- Don’t assume that your system will power up every time – test it through thousands
of power cycles.
- Design your system so that it can be powered down at any instruction and not
become non-functional (i.e., it may
lose some data but not become non-recoverable).
- Use pre-emptive scheduling if at all possible.
- Avoid sharing variables across threads or tasks.
If you must do this, make sure that the variables are read or
written in a single assembly language instruction (i.e., atomically).
- Test your system beyond the boundaries of normal operation (stress testing).
- Design your system with built-in spare time, spare memory, and spare disk
space. Measure these “spares” under
carefully planned stress testing.
- Design your system so that typically unattainable boundaries can still be tested.
- Be fully aware of your stack utilization requirements, and measure stack
utilization during carefully planned stress testing.
- When writing driver code, pay careful attention to volatile hardware
registers. With memory-mapped I/O and
programming in C, use the “volatile” qualifier for all hardware registers
that can be read.
- Use systems for which you can get all of the source code, and provide
automatic version generation to allow for field verification
of the version.
- Keep a design log of all areas of the design that will be difficult if not
impossible to test, and create a special test plan for each area
(analysis, simulation, module testing, etc.).
- Follow some software engineering methodology.
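To make the task-level watchdog guideline concrete, here is a minimal C sketch. It is an illustration, not code from this paper: the task count, the check-in flags, and the kick callback are hypothetical stand-ins for a real RTOS and watchdog peripheral. The key point from the guideline is preserved: the hardware watchdog is only kicked when every critical task has proven it is alive, never blindly from a timer interrupt.

```c
#include <stdbool.h>

/* Hypothetical number of critical tasks for this sketch. */
#define NUM_CRITICAL_TASKS 3

static volatile bool task_alive[NUM_CRITICAL_TASKS];

/* Each critical task calls this from its main loop to prove it is
   still making progress. */
void watchdog_checkin(int task_id)
{
    if (task_id >= 0 && task_id < NUM_CRITICAL_TASKS)
        task_alive[task_id] = true;
}

/* Placeholder for the real hardware kick; on target this would write
   the watchdog reload register. */
void watchdog_noop_kick(void)
{
}

/* Called periodically from a monitor task (NOT a bare timer ISR).
   The hardware watchdog is kicked only if every critical task has
   checked in since the last sweep; otherwise the kick is withheld,
   the watchdog times out, and the system resets. */
bool watchdog_sweep(void (*kick_hardware)(void))
{
    for (int i = 0; i < NUM_CRITICAL_TASKS; i++)
        if (!task_alive[i])
            return false;        /* a task is stuck: let the dog bite */

    for (int i = 0; i < NUM_CRITICAL_TASKS; i++)
        task_alive[i] = false;   /* require fresh check-ins next cycle */

    kick_hardware();
    return true;
}
```

If a critical thread crashes, its flag stays false, `watchdog_sweep` stops kicking, and the hardware watchdog forces a recovery reset – exactly the failure a naive ISR-serviced watchdog would hide.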
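The data-integrity guideline ("don't assume the data written is the data read") usually means storing a checksum alongside every critical record and verifying it on every read. The paper does not prescribe an algorithm; the Fletcher-16 checksum and the record layout below are just one easy, MCU-friendly choice used for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fletcher-16 checksum: cheap on small MCUs and catches all
   single-byte corruptions. */
static uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + data[i]) % 255u);
        sum2 = (uint16_t)((sum2 + sum1) % 255u);
    }
    return (uint16_t)((sum2 << 8) | sum1);
}

/* Hypothetical persistent record: payload plus its checksum. */
typedef struct {
    uint8_t  payload[16];
    uint16_t checksum;
} record_t;

/* Store data together with the checksum of what was written. */
void record_write(record_t *r, const uint8_t *src, size_t n)
{
    memset(r->payload, 0, sizeof r->payload);
    memcpy(r->payload, src, n < sizeof r->payload ? n : sizeof r->payload);
    r->checksum = fletcher16(r->payload, sizeof r->payload);
}

/* Returns false if the stored data no longer matches its checksum,
   i.e. the data read is not the data that was written. */
bool record_read_ok(const record_t *r)
{
    return fletcher16(r->payload, sizeof r->payload) == r->checksum;
}
```

Combined with the redundancy guideline, a corrupted record detected this way can be replaced from a second copy instead of silently propagating bad data.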
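For the shared-variable guideline, C11's <stdatomic.h> (where the toolchain supports it) is one portable way to guarantee that a value shared between an ISR and a task is read and written indivisibly; older C toolchains typically fall back on `sig_atomic_t` plus briefly disabled interrupts. The UART-overrun scenario and function names below are made up for illustration.

```c
#include <stdatomic.h>

/* Event counter shared between an ISR (producer) and a background
   task (consumer).  atomic_uint guarantees each increment and the
   read-and-clear below are indivisible, so no update is lost and no
   torn value is ever observed. */
static atomic_uint rx_overruns;

/* Called from the (hypothetical) UART ISR on a receive overrun. */
void uart_isr_note_overrun(void)
{
    atomic_fetch_add(&rx_overruns, 1u);
}

/* Called from the logging task: fetch the count and reset it in one
   indivisible operation, so overruns arriving in between the read
   and the clear cannot be lost. */
unsigned overruns_read_and_clear(void)
{
    return atomic_exchange(&rx_overruns, 0u);
}
```

The non-atomic version of `overruns_read_and_clear` (read, then separately write zero) is exactly the kind of multi-instruction sequence the guideline warns against: an interrupt between the two steps silently drops events.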
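The “volatile” guideline looks like this in practice. The peripheral address in the comment is invented for illustration; to keep the sketch testable on a host, the "register" is backed by an ordinary variable instead of real hardware. The point is that `volatile` forces the compiler to re-read the register on every poll instead of caching the first value and spinning forever.

```c
#include <stdint.h>

/* On real hardware this would be a fixed peripheral address, e.g.
 *   #define UART_STATUS (*(volatile uint32_t *)0x4000C018u)
 * (address made up for illustration).  For a host-testable sketch we
 * back the "register" with an ordinary variable instead. */
static uint32_t fake_status_backing;
#define UART_STATUS (*(volatile uint32_t *)&fake_status_backing)

#define TX_READY 0x01u

/* Polls the status register until TX_READY is set, with a bounded
   retry count so the loop cannot spin forever (another guideline
   above).  Returns 0 on success, -1 on timeout. */
int uart_wait_tx_ready(uint32_t max_polls)
{
    while (max_polls--) {
        /* Because UART_STATUS is volatile, this read is re-issued on
           every iteration rather than hoisted out of the loop. */
        if (UART_STATUS & TX_READY)
            return 0;
    }
    return -1;
}
```

Without the `volatile` qualifier the optimizer is entitled to read the status once and turn this into either an immediate return or an unbounded spin, depending on the first value it sees.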
In Part 2 we will look at Memory Leaks.
An excellent example of such a methodology (albeit a bit over-the-top) is available as MISRA-C:2004, Guidelines for the use of the C
language in critical systems. See http://www.misra.org.uk/