Guidelines for Creating Robust Embedded Systems
Part 1 - Introduction
By Bob Japenga
Scope
This white paper is intended to provide a brief overview of
lessons learned in creating robust systems over the past 35 years of embedded
systems development (starting with the Intel 4040). It is divided into a number of parts,
each addressing a different issue. This part introduces the topic. Each subsequent
part will address one issue in depth.
Other white papers concentrate on different issues, such as
coding standards[1]
or general software architecture[2],
as the means of creating robust embedded systems.
These are of great value and an important part of creating robust
systems. This series of papers will
concentrate on specific lessons we have learned the hard way by creating
non-robust embedded systems when a bulletproof system was required. We write them down because “wisdom
leaks.” We need to review these
periodically because these pearls of wisdom often get forgotten in the heat of
the development cycle.
Summary
We all know how to create non-robust systems (embedded or
otherwise). Inattention to detail, fuzzy specifications, unreliable hardware
(isn’t it always their problem?!), last-minute changes, and complexity that
exceeds the experience base of the team are just a few of the
ways we can create these systems.
But over the years there have been a few things that we have
learned that have helped us create systems that run 24/7 fifty-two weeks a
year. Here is the list (in no particular order) that we will attempt to flesh
out in this paper. Each part of this
white paper will address one of these guidelines:
- Design for and test that there are no memory leaks.
- Design systems where out-of-bounds memory references are caught and the
  system is allowed to recover.
- Limit which parts of the system are able to access I/O.
- Gracefully handle out-of-memory conditions.
- Gracefully handle out-of-disk-space conditions.
- Provide a watchdog on all of the critical tasks to verify that they are all
  working properly, and take appropriate action when they are not. Watchdogs
  serviced by a simple interrupt handler are not of much value because a
  critical task or thread could have crashed while the watchdog continues to
  be serviced.
- Design for and verify data integrity. Don’t assume that the data written is
  always going to be the data read.
- Design in data redundancy in critical areas.
- Provide liberal error logging that includes pruning of logged data.
- Design systems that work even after failures have occurred. For example, if an
  error forces tripping the watchdog, that same error should not trip the
  watchdog 50 times in 50 minutes.
- Design systems that know when to throw in the towel and not go into endless
  loops (“You got to know when to hold em, know when to fold em”).
- Don’t assume that your system will power up every time. Test it through
  thousands of power cycles.
- Design your system so that it can be powered down at every instruction and not
  become non-functional (i.e., it may lose some data but must not become
  non-recoverable).
- Avoid pre-emptive scheduling if at all possible.
- Avoid sharing variables across threads or tasks. If you must do this, make
  sure that the variables are read or written in a single-cycle assembly
  language instruction.
- Test your system beyond the boundaries of normal operation (stress testing).
- Design your system with built-in spare time, spare memory and spare disk
  space. Measure these “spares” under carefully planned stress testing.
- Design your system so that typically unattainable boundaries can still be tested.
- Be fully aware of your stack utilization requirements and measure stack
  utilization during carefully planned stress testing.
- When writing driver code, pay careful attention to volatile hardware
  registers. With memory-mapped I/O and programming in C, use the “volatile”
  qualifier for all hardware registers that can be read.
- Design systems for which you can get all of the source code and the means to
  build them.
- Provide automatically generated version identification to allow for field
  verification of the version.
- Create a design log of all areas of the design that will be difficult if not
  impossible to test, and create a special test plan for each area
  (analysis, simulation, module testing, etc.).
- Choose a software engineering methodology that includes:
  - Design reviews
  - Code reviews
  - Configuration control
  - Configurable tools
  - Bug tracking
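
As one illustration of the watchdog guideline in the list above, the following C sketch has each critical task check in with a monitor, and the monitor services the hardware watchdog only when every task has reported since the last poll. This is a minimal sketch under assumptions, not a definitive implementation: the task names and the functions watchdog_checkin, watchdog_monitor_poll and hw_watchdog_kick are hypothetical, and the hardware kick is replaced by a counter so the code can run on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical critical tasks; the names are illustrative only. */
enum { TASK_SENSOR, TASK_CONTROL, TASK_COMMS, NUM_TASKS };

#define ALL_TASKS_MASK ((1u << NUM_TASKS) - 1u)

static volatile uint32_t checkin_flags; /* one bit per critical task */
static unsigned hw_kicks;               /* stand-in for the hardware watchdog */

/* Assumed hardware hook. On a real target this would write the watchdog
 * reload register; here it only counts how often it was called. */
static void hw_watchdog_kick(void)
{
    hw_kicks++;
}

/* Each critical task calls this once per healthy pass of its main loop.
 * The read-modify-write below is NOT atomic; with pre-emptive scheduling
 * it would need a critical section or an atomic operation. */
void watchdog_checkin(unsigned task_id)
{
    if (task_id < NUM_TASKS) {
        checkin_flags |= (1u << task_id);
    }
}

/* Called periodically by a monitor task (not by a bare timer interrupt).
 * Services the hardware watchdog only if every task has checked in. */
bool watchdog_monitor_poll(void)
{
    if ((checkin_flags & ALL_TASKS_MASK) == ALL_TASKS_MASK) {
        checkin_flags = 0u;  /* every task must report again next period */
        hw_watchdog_kick();
        return true;
    }
    return false;            /* a task is silent: let the watchdog expire */
}
```

If any one task stops calling watchdog_checkin, the monitor stops servicing the hardware watchdog and the system resets, which is the behavior a watchdog serviced from a simple interrupt handler cannot provide.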
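
The data integrity guideline in the list above can be sketched in C by storing a check value alongside each record and verifying it on every read. This is only one possible approach under stated assumptions: the record layout and the functions crc16_ccitt, record_seal and record_is_valid are hypothetical, and the check used here is a bitwise CRC-16 with polynomial 0x1021 and initial value 0xFFFF, chosen for brevity rather than speed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 (polynomial 0x1021, initial value 0xFFFF, no reflection). */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((uint16_t)(crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Hypothetical stored record: payload plus a check value over the payload. */
typedef struct {
    uint8_t  payload[16];
    uint16_t crc;
} record_t;

/* Compute and store the check value just before the record is written. */
void record_seal(record_t *r)
{
    r->crc = crc16_ccitt(r->payload, sizeof r->payload);
}

/* Recompute the check value after the record is read back. */
bool record_is_valid(const record_t *r)
{
    return r->crc == crc16_ccitt(r->payload, sizeof r->payload);
}
```

A caller would seal a record before writing it to storage and call record_is_valid after reading it back, falling back to a redundant copy or a default value when the check fails.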
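
The guideline on volatile hardware registers in the list above can be illustrated with a short C sketch of polling a status register. The register name, bit mask and function uart_wait_tx_ready are hypothetical, and an ordinary variable stands in for the memory-mapped register so the code can run on a host. On a real target the macro would instead dereference a volatile pointer to an address taken from the part's datasheet.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped status register. On real hardware this
 * would be something like
 *     #define UART_STATUS (*(volatile uint32_t *)ADDRESS_FROM_DATASHEET)
 * The volatile qualifier forces a fresh read on every access, so the
 * compiler cannot hoist the read out of the polling loop. */
static volatile uint32_t fake_uart_status;
#define UART_STATUS   fake_uart_status
#define UART_TX_READY 0x01u   /* hypothetical "transmitter ready" bit */

/* Poll the status register until the ready bit is set, giving up after
 * max_polls reads instead of looping forever.
 * Returns 1 if the bit was seen, 0 if the poll budget ran out. */
int uart_wait_tx_ready(uint32_t max_polls)
{
    while (max_polls-- > 0u) {
        if (UART_STATUS & UART_TX_READY) {
            return 1;
        }
    }
    return 0;
}
```

The bounded poll count also reflects the guideline about knowing when to throw in the towel: if the hardware never reports ready, the caller gets an error to handle rather than an endless loop.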
In Part 2, we will look at memory leaks.