Illustration by Boyoun Kim

Dendrites and Wrongs–PCBA Failure Analysis at the Factory (or Why We Scrapped Thousands of Boards)

You’ve spent the past six months designing the smallest, cheapest, lowest power product in its class. The prototypes perform beautifully. Its little coin cell battery provides enough juice for hundreds of thousands of uses. The MSP430 is the optimal microcontroller choice. You’ve squeezed the firmware into almost every byte of the 2k of Flash. The whole thing costs a couple of bucks in volume. The system is optimized and ready for production! 

You’re off to China to have it made. One trip, two trips–the frequent flier miles accrue. Everything is going well. You watch the first hundred units come off the production line. You test them all yourself. They are working well! You come back for DVT, then PVT. Your contract manufacturer is cranking them out, a few thousand every day.

As part of reliability testing, you arrange an accelerated life test. One of the tests involves a high temperature, high humidity, 2-day soak–like spending a couple of days in DC in mid-July. The test units come out of the soak, and you test one…it doesn’t work. You open it up and measure the battery voltage. It’s dead. The needle screeches off the record. 

All units are tested on the assembly line to ensure sleep mode power consumption is within range: for that tiny coin-cell battery to last the required two years, the sleep current must be less than 1 microamp. All the batteries are also tested to make sure they are at full capacity when installed. So what is going on!?
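
As a rough check on that budget, here is a minimal Python sketch of the arithmetic. The only inputs are the two figures above (1 microamp, two years); the function name is made up for illustration, and the result ignores battery self-discharge and any current drawn while the product is awake.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def charge_used_mah(current_ua: float, years: float) -> float:
    """Charge drawn, in mAh, by a constant current (in microamps) over some years."""
    current_ma = current_ua / 1000.0  # microamps -> milliamps
    return current_ma * years * HOURS_PER_YEAR

budget = charge_used_mah(1.0, 2.0)
print(f"{budget:.2f} mAh")  # 17.52 mAh
```

Sleep alone uses roughly 17.5 mAh over two years, and that has to fit inside the cell's capacity alongside everything the product does while awake.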

You frantically probe the board to look for issues. Wait a minute–the microcontroller supply bypass capacitor C1 should be an open circuit. But it’s measuring less than 100 ohms; that looks more like a resistor than a capacitor. 100 ohms will drain that battery fast. 
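
To put a number on "fast", here is a back-of-the-envelope sketch using Ohm's law. It assumes a nominal 3 V cell and, purely for illustration, a 40 mAh capacity. The real capacity of the cell is not given here, and a coin cell's voltage would sag badly under a load like this, so treat the result as an order-of-magnitude estimate only.

```python
def leak_current_ma(volts: float, ohms: float) -> float:
    """Current through a resistive leak, in milliamps (Ohm's law: I = V / R)."""
    return volts / ohms * 1000.0

def hours_to_drain(capacity_mah: float, current_ma: float) -> float:
    """Idealized time, in hours, to empty a cell at a constant current."""
    return capacity_mah / current_ma

ASSUMED_CAPACITY_MAH = 40.0  # hypothetical value, not taken from the article

leak = leak_current_ma(3.0, 100.0)
print(f"leak current: {leak:.0f} mA")  # leak current: 30 mA
print(f"time to drain: {hours_to_drain(ASSUMED_CAPACITY_MAH, leak):.1f} h")
```

Thirty milliamps is about 30,000 times the 1 microamp sleep budget, so under these assumptions the cell goes flat in hours rather than years.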

You desolder C1, remove it from the board and measure its resistance again. It’s an open circuit, as it should be. Hrm. You measure between the pads where the capacitor used to be on the board–100 ohms again: wrong. You clean the board with a Q-Tip and some alcohol, and measure again. An open circuit. Cleaning the board under that capacitor fixed the problem. Somehow current was flowing through something on the board. Ruh-roh. 

It’s time to call in the FA (failure analysis) pros: Exponent. They did the failure analysis on the World Trade Center collapse and the Challenger disaster, so they should be able to handle your little capacitor problem. They recommend taking SEM (scanning electron microscope) and x-ray images, and performing ion chromatography to get to the root cause of the short.

You send them some boards, and wait. The report arrives.

Here’s what the x-ray looks like (while not relevant to your problem, you think it’s pretty cool that you can see the wire bonds inside the microcontroller IC):

x-ray

Here’s what the SEM image looks like:

sem

And finally, the ion chromatography results:

chromatography

The SEM and chromatography results are worrying. There should be no flux residue on the board, and Cl (chlorine) and S (sulfur) shouldn’t be there either. What are they doing on your board? Both elements are found in certain solder fluxes, but the flux should be completely cleaned off the board as part of the assembly process.

Here’s a trio that can spell disaster: 1. A DC voltage, 2. Humidity, 3. Contamination. Guess there’s a fourth element as well: time. Mix these up in the right way, and you will get current to flow. Apparently your boards had the magic (magically terrible) trio. It’s called electrochemical migration or dendritic growth. Dendrites. You hadn’t heard about those since high school chemistry. Now they are back to haunt you.  

The capacitor in question, C1, is a very small surface mount part (an 0402). Its two metal pads are very close together, with just 0.4 mm of space between the exposed metal. Once the units are assembled, there is a 3V DC voltage (potential) between those pads (DC voltage–trio factor #1). That voltage creates an electric field, which can start to alter the chemical structure of any amenable material on the surface of the board (in this case flux residue, the contamination–trio factor #3), especially when water is present (humidity–trio factor #2). Over time, ions start to migrate and electrochemical processes slowly begin. It’s a cascading effect–once a tiny amount of current starts to flow, it creates a path for more current, like a trickle of water eventually forming a river.
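
For a feel of how strong the field between those pads is, here is a minimal sketch that simply divides the voltage by the gap. It treats the pads as ideal parallel plates, which is a simplification (the real field is not uniform), and the function name is made up for illustration.

```python
def average_field_v_per_mm(volts: float, gap_mm: float) -> float:
    """Average electric field across a gap, in volts per millimeter (E = V / d)."""
    return volts / gap_mm

field = average_field_v_per_mm(3.0, 0.4)
print(f"{field:.1f} V/mm")  # 7.5 V/mm
```

The same 3 V across a wider gap gives a proportionally weaker field, which is one reason tightly spaced parts like an 0402 are more exposed to this failure mode.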

There have been 300,000 units already made. You have the accelerated life test run on all of them, and half of them fail. Your CM (contract manufacturer) is on the hook for the cost of the failed units, but you’ve spent a lot of time and energy tracking down the failure. 

You reflect on what you’ve learned: 

  • No part of the board fabrication and assembly process should be ignored.
  • Specify what flux to use, especially in low-power applications with small surface mount components.
  • If you’re not using a no-clean flux, specify how to clean. 
  • Prioritize accelerated life testing earlier in development for super low power applications.

In the end, this turns out to be a relatively minor blip on the production radar screen. You hire a new CM, validate firsthand that they are cleaning the boards, and rerun the accelerated life tests. No failures. The product ships to millions of users over many years. People love them, and you see them being used all over the place.

Still, you will always remember the dendrites.