Help! Why are my embedded devices failing?


When devices fail, the causes can be numerous. In conversations with the embedded OEMs we work with, one issue comes up with almost every manufacturer: the cost of diagnosing and fixing the causes of field failure. That cost delays time-to-market and pulls resources away from development and into field diagnostics and post-mortem analysis.

Things get especially ugly when the non-volatile memory is at the center of the problem. Data may be impossible to retrieve, or it may be partly corrupted and in need of repair, and any repair carries the risk of introducing interpretation errors.

This issue is especially relevant for the following reasons:

1. The need for defect prevention during field operations

Protecting critical data demands a high degree of reliability: devices must not fail. To make devices wear-fail-safe, manufacturers have to run extensive tests across a range of user scenarios to guard against edge cases. Analyzing the test results can be daunting because of the many interfaces between the hardware, software, and application layers. These interactions therefore need to be tracked continuously, so that when a failure occurs, any deviation in them can be discovered and corrected.
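One way to picture this kind of tracking is to record the interactions from a known-good run and compare them with what a failing device recorded. The following is a minimal sketch only: the trace format (tuples of layer, operation, and result) and the function name `first_divergence` are assumptions made for illustration, not part of any particular product.

```python
from itertools import zip_longest


def first_divergence(baseline, observed):
    """Return (index, expected, actual) for the first differing event, or None.

    Both arguments are sequences of recorded interactions. A trace that
    ends early shows up as a divergence against None.
    """
    for index, (expected, actual) in enumerate(zip_longest(baseline, observed)):
        if expected != actual:
            return index, expected, actual
    return None


# Hypothetical traces: (layer, operation, result)
baseline = [
    ("app", "save", "ok"),
    ("fs", "write", "ok"),
    ("flash", "program", "ok"),
]
observed = [
    ("app", "save", "ok"),
    ("fs", "write", "ok"),
    ("flash", "program", "ecc_error"),
]

print(first_divergence(baseline, observed))
```

In this made-up example the two runs agree until the flash layer, which narrows down where to look first.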

2. Vulnerability of device to wear-related failures

As flash media continue to increase in density and complexity, they are also becoming more vulnerable to wear-related failures. Shrinking lithography brings increased ECC requirements and a move to more bits per cell. It also brings the concern that what was written to storage may not be what is read back. Most applications, however, assume that data written to the file system will be completely accurate when read back. If the application does not fully validate the data it reads, errors in that data can cause it to fail, hang, or simply misbehave. Preventing device failures caused by data corruption therefore requires checks that validate the data read against the data written.
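A common application-level way to check data on read is to store a checksum alongside each record and verify it before using the data. The sketch below assumes a simple record layout (payload followed by a 4-byte little-endian CRC-32); the layout and the names `frame` and `unframe` are illustrative assumptions, and a real device might use a different checksum or rely on the file system for this.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Return the payload with a little-endian CRC-32 appended."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def unframe(record: bytes) -> bytes:
    """Return the payload if its stored CRC-32 matches, else raise ValueError."""
    if len(record) < 4:
        raise ValueError("record too short to contain a checksum")
    payload = record[:-4]
    (stored_crc,) = struct.unpack("<I", record[-4:])
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("CRC mismatch: data read differs from data written")
    return payload


record = frame(b"odometer=120034")

# Simulate a single flipped bit in the stored record.
corrupted = bytearray(record)
corrupted[0] ^= 0x01

try:
    unframe(bytes(corrupted))
except ValueError as err:
    print("rejected:", err)
```

With a check like this, the application can detect the corruption and refuse the record instead of acting on bad data. Detection is only half the job; what to do next (retry, fall back to a previous copy, report the fault) is a separate design decision.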

3. Complexity of hardware and software integration

The complex nature of hardware and software integration within embedded devices makes finding the cause of failures a painstaking job, one that requires coordination between several hardware and software vendors. For this reason, it often takes OEMs days to investigate causes at the file system layer alone. Problems below that layer can entail more extensive testing and involve multiple vendors. Log messages can help manufacturers pinpoint the location of failure so that the correct vendor can be notified.

This ability to pinpoint the cause of failure is especially helpful when an OEM is:

  • Troubleshooting during the manufacturing and testing process to make sure that their devices do not fail for the given user scenarios.
  • Doing post-mortem analysis on parts returned from their customers, in order to understand the reasons for failures, and possible solutions.
  • Required to maintain a log of interactions between the various parts of the device, for future assistance with failure prevention or optimization.
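To show how log messages can point to the right vendor, here is a small sketch in which every log entry is tagged with the layer that produced it. The layer names, the `LogEntry` fields, and the rule of treating the lowest layer that reported an error as the most likely origin are all assumptions made for this example; real stacks and real triage rules will differ.

```python
from dataclasses import dataclass

# Hypothetical stack, ordered from top (application) to bottom (media).
LAYERS = ["application", "file_system", "block_driver", "flash"]


@dataclass
class LogEntry:
    seq: int       # order in which the entry was recorded
    layer: str     # one of LAYERS
    level: str     # "INFO" or "ERROR"
    message: str


def deepest_failing_layer(entries):
    """Return the lowest layer that logged an ERROR, or None if there is none.

    Heuristic: errors tend to propagate upward, so the lowest layer that
    reported one is a reasonable place to start the investigation.
    """
    errors = [entry for entry in entries if entry.level == "ERROR"]
    if not errors:
        return None
    return max(errors, key=lambda entry: LAYERS.index(entry.layer)).layer


log = [
    LogEntry(1, "application", "INFO", "saving settings"),
    LogEntry(2, "flash", "ERROR", "uncorrectable ECC error"),
    LogEntry(3, "file_system", "ERROR", "read failed"),
    LogEntry(4, "application", "ERROR", "could not load settings"),
]

print(deepest_failing_layer(log))
```

Here three layers report errors, but the flash layer is the lowest, so that is where the investigation would start.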

Identifying the causes and costs of field failure is one thing, but what solutions can OEMs turn to in order to prevent these issues in the first place?

Fighting field failure with transactional file systems

Thankfully, various file system solutions exist for safeguarding critical data. FAT remains a simple and robust option with decent performance. Unfortunately, it cannot always provide the degree of data protection or performance that is needed. In safety-critical automotive, aerospace, and industrial applications, basic file systems like FAT often fall short of the required performance and reliability.

Transactional file systems like Tuxera’s Reliance Edge offer a level of reliability, control, and performance for data that is simply too vital to be lost or corrupted. One of the key features of Reliance Edge is that it never overwrites live data, ensuring a backup version of that data remains safe and sound. This helps preserve user data in the event of power loss.
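The idea of never overwriting live data can be illustrated at the application level. The sketch below is not how Reliance Edge is implemented and does not use its API. It is an analogous, well-known pattern on a general-purpose OS: write the new version to a temporary file, flush it to storage, and only then switch over with an atomic rename. The function name `atomic_write` is an assumption made for this example.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` without overwriting it in place.

    The old contents stay intact until the new contents are fully written
    and flushed. Making the rename itself durable may also require syncing
    the containing directory, which is omitted here for brevity.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the new version to storage
        os.replace(tmp_path, path)       # atomic switch to the new version
    except BaseException:
        # On any failure, discard the partial new version; the old one is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "config.bin")
    atomic_write(target, b"version 1")
    atomic_write(target, b"version 2")
    with open(target, "rb") as f:
        print(f.read())
```

If power is lost before the rename, the previous version is still in place. A transactional file system applies the same principle below the application, so individual programs do not each have to get this right.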

Final thoughts

Correctly finding and identifying the cause of field failures is the first step in tackling them. The next step is choosing the right solution – one that’s optimized to secure your critical data specifically in case of field failure and power loss.

Gevorg Melikdjanjan

Security | Reliability | Data Solutions
