Inside OCZ's Factory: How SSDs Are Made
by Kristian Vättö on May 20, 2015 8:30 AM EST

At CES I had the opportunity to sit down with OCZ's CEO, Ralph Schmitt, to discuss the state of OCZ after Toshiba acquired the company in late 2013. We talked about how the company has changed and evolved under the new ownership and how Toshiba has brought in some much needed NAND supply and expertise. In the article we posted summarizing that discussion I also mentioned that I would be taking a closer look at OCZ's manufacturing and validation in the coming months, and today's article focuses on that.
OCZ flew me out to Taiwan to get an in-person look at the factory. My hosts and guides throughout the trip were Jason Ruppert, Senior Vice President of Operations, and Jim Van Patten, Vice President of World Wide Quality, and a big thanks goes to both of them. For some background on why they were the right people to speak to: Mr. Ruppert has been with OCZ since March 2012, and before joining OCZ he was the Vice President of Manufacturing Operations & Engineering at Harmonic Inc, which focuses on video delivery infrastructure. Mr. Van Patten was actually Mr. Ruppert's first hire and joined the company in May 2012 from Logitech, where he was the Vice President of World Wide Quality Assurance. Mr. Ruppert holds a Master's degree in Systems Engineering from North Carolina State University (for the curious, Anand did his BS in Computer Engineering at the same university), while Mr. Van Patten received his Ph.D. in Instructional Design & Evaluation from Syracuse University. It goes without saying that both have extensive knowledge and experience within their operational areas, making them the best people to guide me through the manufacturing and validation process.
The Development Process of an SSD
Before we move on to the actual factory tour and see how SSDs are made, let's outline the development process first. After all, a product must be designed and developed before it can be manufactured and there are some items that show up in both development and manufacturing processes.
As with any product, the development process starts from an idea, which can be practically anything (completely new model, refresh of existing model with new NAND, higher capacity model, different form factor, new software etc.). In phase zero the idea is shaped to become a concept and usually results in a short (1-2 pages) document that describes the opportunity presented by the product.
Once the concept is clear, the marketing and engineering teams give their initial feedback. Both are very important because a product must be marketable, but at the same time it needs to be viable to execute from an engineering standpoint. Normally phase one takes about three weeks and results in a more in-depth description of the opportunity; if the project fails at this point, it is either scrapped or moved back to the concept stage.
In phase two, OCZ starts to commit more significant resources to the project. The first two phases merely outline the concept and determine its opportunity, so phase two begins the actual planning of the product. The two key documents that are finalized in this phase are the marketing requirements and engineering response documents, but each functional group (e.g. quality and supply chain) also delivers its support plan. Basically, the purpose of phase two is to construct a comprehensive project plan that covers all aspects and teams involved in the product, including the budget.
Phase two is probably the most critical phase because the project plan and budget are used to decide whether OCZ puts hundreds of thousands (or even millions) of dollars behind the product, so all documents must be carefully prepared and evaluated in order to make the best decision for the company. The length of phase two depends on the complexity of the product and which teams need to be involved, but it typically takes from one to three months to build the final plan and budget.
If the project is funded, OCZ moves to phase three, which is where most of the engineering work is done. OCZ of course wants to keep the exact details of this phase close to its chest, but the ultimate goal is to build the first working prototypes, so the project can move to testing the prototypes. While the engineers are busy with their work, the remaining teams work on their own functions and prepare to manufacture the pilot samples (this includes tasks such as qualifying suppliers, securing long lead time parts, developing preliminary spec sheets and marketing materials). The length of the design and implementation phase depends greatly on the product, but even a drive that uses an existing controller can spend up to a year in this phase. The development of a totally new controller like the JetExpress is obviously a multi-year project given the sheer amount of engineering work required.
The Validation Phase
Phase four, which is validation and qualification, essentially consists of three main parts: the Engineering Verification Test (EVT), the Design Verification Test (DVT) and the Production Verification Test (PVT). EVT is run on the first engineering prototypes and verifies that the drive works in real life as it was designed to work on paper. The test suite is relatively straightforward and covers aspects such as power levels, signals and interface timings to ensure that the prototype works as planned. There is also some preliminary performance testing in the EVT phase, but because the firmware is usually far from final, the results almost never illustrate the performance of the final product.
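To give a rough idea of what an EVT-style check might look like in practice, here is a minimal Python sketch. It is entirely hypothetical: the parameter names, limits and measurement values are illustrative assumptions, not OCZ's actual test plan. The idea is simply to compare measured electrical and timing values against their design limits and flag anything out of spec.

```python
# Hypothetical EVT-style limit check: compare measured values against design limits.
# All parameter names and limits are illustrative, not taken from any real test plan.

# Design limits: parameter -> (minimum, maximum)
DESIGN_LIMITS = {
    "idle_power_w":         (0.0, 1.2),   # idle power draw in watts
    "active_power_w":       (0.0, 4.5),   # active power draw in watts
    "sata_link_speed_gbps": (6.0, 6.0),   # negotiated interface speed
    "signal_rise_time_ns":  (0.0, 0.15),  # interface signal rise time
}

def check_prototype(measurements: dict) -> list:
    """Return a list of (parameter, value, limits) tuples for every out-of-spec reading."""
    failures = []
    for param, (lo, hi) in DESIGN_LIMITS.items():
        value = measurements.get(param)
        if value is None or not (lo <= value <= hi):
            failures.append((param, value, (lo, hi)))
    return failures

# Example readings from a single engineering prototype (made-up numbers)
sample = {
    "idle_power_w": 0.9,
    "active_power_w": 4.9,          # out of spec -> flagged
    "sata_link_speed_gbps": 6.0,
    "signal_rise_time_ns": 0.12,
}

for param, value, limits in check_prototype(sample):
    print(f"FAIL: {param} = {value} (allowed {limits[0]}..{limits[1]})")
```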
DVT is further broken down into two areas: normal DVT and quality/reliability. Normal DVT has a broader set of tests than EVT, and more variables (e.g. power, temperature and host variations) are added to the mix to ensure that the drive operates and performs as intended in a variety of environments. Each test is also run on at least four samples, whereas the initial EVT testing is usually performed on just one or two samples. I'm not going to list and describe every individual test here because the DVT phase consists of dozens of different tests, but there are product compliance, data retention, power loss and die failure tests to mention a few, along with thorough performance testing to evaluate the firmware.
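As a rough illustration of how those extra variables multiply the amount of testing, the following sketch builds a test matrix as the cross product of a few environmental conditions and runs each combination on several samples. The specific voltages, temperatures, host types and sample count are assumptions made up for illustration, not OCZ's actual matrix.

```python
# Illustrative DVT test matrix: cross the environmental variables and run each
# combination on several drive samples. All conditions and counts are made up.
from itertools import product

supply_voltages_v = [4.75, 5.0, 5.25]                 # low / nominal / high supply rail
temperatures_c    = [0, 25, 70]                       # cold / ambient / hot
hosts             = ["desktop", "laptop", "server"]   # different host platforms
samples_per_case  = 4                                 # each case runs on at least four drives

test_matrix = list(product(supply_voltages_v, temperatures_c, hosts))

print(f"{len(test_matrix)} condition combinations, "
      f"{len(test_matrix) * samples_per_case} drive-runs in total")

for voltage, temp, host in test_matrix:
    for sample_id in range(samples_per_case):
        # A real harness would execute the functional and performance suite for one
        # drive under one condition here; this placeholder only shows the structure.
        pass
```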
The reason DVT is split into two parts is that EVT and normal DVT are both performed by the engineering team (which also designed the drive), and that can create conflicts of interest during the validation process (in the end, human beings tend to be blind to their own faults and mistakes). Most of the DVT tests are rerun as reliability tests, but in this case the tests are performed by the independent quality team that is led by Mr. Van Patten. The number of samples is also considerably higher and each test is run for a longer duration to verify the reliability of the design. Again, the full list of tests is several pages long, but aspects such as durability against vibration, shock and low/high temperatures are tested in addition to the normal DVT tests. Basically every spec that is mentioned in the spec sheet is tested in this phase, including all standard JEDEC tests and certifications.
The first level of PVT tests is also run later during phase four and focuses on the reliability and repeatability of the manufacturing process. Basically, the purpose of the PVT tests is to ensure that every drive coming out of the mass production line will be of the same quality; this is done by examining drives from the production line using dye-and-pry and X-ray inspection of the PCB to catch any defects caused by the soldering process. The other PVT tests evaluate the readiness of the factory's quality system (incoming and in-process quality inspection, final quality control and out-of-box inspection) to make sure that all quality control stages are capable of separating good units from bad, and that no defective products get through to customers. Ongoing Reliability Testing (ORT) is also set up to test a few drives from every production run to guarantee that nothing changes over time.
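To make the ORT idea a little more concrete, here is a small sketch of pulling a handful of drives from every production lot and gating the lot on the results. The sample size, test names and pass criterion are assumptions for illustration only; OCZ's real sampling plan was not disclosed.

```python
# Illustrative ongoing reliability testing (ORT) sampling: pull a few drives from
# every production lot and track the results over time. Sample size and the pass
# criterion are assumptions, not a real sampling plan.
import random

ORT_SAMPLES_PER_LOT = 3  # drives pulled from each production run

def sample_lot(lot_serial_numbers: list) -> list:
    """Randomly pick drives from a finished production lot for ORT."""
    return random.sample(lot_serial_numbers, min(ORT_SAMPLES_PER_LOT, len(lot_serial_numbers)))

def ort_pass(results: dict) -> bool:
    """A lot's ORT passes only if every sampled drive passes every test."""
    return all(all(tests.values()) for tests in results.values())

# Example: a lot of 1,000 drives, identified by serial number
lot = [f"SN{100000 + i}" for i in range(1000)]
picked = sample_lot(lot)
# Pretend every sampled drive passed its extended stress tests
results = {sn: {"extended_write_stress": True, "thermal_cycling": True} for sn in picked}
print(picked, "->", "PASS" if ort_pass(results) else "FAIL")
```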
The total length of the validation phase varies greatly. It can be as short as two months if the design is relatively simple and similar to previous ones, but it can easily take over six months for more complex designs. Usually OCZ creates two to five sets of engineering samples during validation as issues are found and fixed, but there isn't really any preset duration for validation -- it always depends on what is found during verification and how significant the required modifications are. Ultimately a drive cannot move to the next phase until it passes all quality and reliability tests, so setting a strict deadline would be a bad idea (for the company and for consumers) to begin with.
Entering Production
In phase five the drive moves from engineering and verification to operations (i.e. manufacturing), which usually takes 3-6 weeks to complete. Final PVT tests are conducted to ensure that the manufacturing quality meets the specifications and that the necessary tests are in place to catch any changes or errors in production. Other teams also finish up their tasks to be ready for the launch, and this is the point when OCZ contacts us and other media about an upcoming product launch and sends out review samples (i.e. the samples we get are typically manufacturing pilots, as mass production hasn't begun yet).
When the manufacturing side is ready to start putting out the new drive, a public announcement of the product is made, and mass production as well as shipments to customers begin. As part of this visit, we had an inside look into the mass production side of the equation.
Comments
caleblloyd - Tuesday, May 19, 2015
Pagination links are broken, on mobile at least... Can't navigate to page 2 to see the factory :(

Kristian Vättö - Wednesday, May 20, 2015
On my end everything seems to work fine (even on mobile). What happens if you try to access the second page directly? http://www.anandtech.com/show/9218/ocz-fab-tour/2
close - Wednesday, May 20, 2015
I have to ask, as some things look surprising to me:

1) So every new SSD already has 8 times its capacity of data written to it? Or is it just QC and batch testing?
2) I always imagined the FW write process as being automated. But this looks like a lot of manual work to connect each drive by hand and write the FW. Again, is this the standard process or only during the initial testing phases?
close - Wednesday, May 20, 2015
And on the same note, I always assumed the labeling process is automated. Either they have really low volume or labor is THAT cheap.

menting - Wednesday, May 20, 2015
I'm not the official answer, but it should already have 8 times the capacity of data written in, and then the firmware should zero out the counts.

close - Wednesday, May 20, 2015
What I'm not sure about is whether this happens to all drives or only to selected drives, assuming that if a few drives are OK the whole batch must be. Also, the testing is done after writing the FW. Is the FW "pre-configured" to ignore the first 8 writes per LBA, or do they go through connecting them to PCs all over again to reset the written data counter?

dreamslacker - Wednesday, May 20, 2015
They would do it for every SSD. The actual usable capacity of the modules isn't fixed or a known quantity until you actually test every cell. During this phase, you will also know which cells are 'bad' and whether to discard/repair the SSD if the remaining usable cell count is lower than the set limits.

The usable cells will then be mapped into the table so the controller knows which cells to avoid using.
This procedure is done on mechanical disk drives too since the actual platter capacity isn't a fixed number either.
As for the write or test process, it depends on the volume and the manufacturer. If volumes are high enough, you might not even have workers handling the F/W write or test process; a fully automated robotic arm and conveyor belt system would handle the drives and label them accordingly, leaving the workers to just package the drives.
MikhailT - Wednesday, May 20, 2015
1. Correct, this is what is known as the "burn-in" period. You have to write to every single NAND die (or even hard drive platter) a few times to make sure it is working. Many companies burn in computers as well: they finish building one and then run a custom automated tool to benchmark it severely for several hours before they can ship it to you.

Think about electronics in general: 90% of defects (in my experience and that of others I've talked to) are usually found within the first few days of use. That's usually a sign that the company did not properly burn in and test the device before shipping it to you.
2. It depends on the experience of the company. It costs a lot of money to start automating this stuff (the machines are expensive and you have to hire people to figure these things out), and it would actually be cheaper initially to do it by hand while you have less volume to work with. As you get more money from your business revenue and volume starts to ramp, you then hire a few folks to figure out how to automate things, and if it is cheaper and worth it, you invest hundreds of thousands or millions of dollars to buy that equipment. That's why the first page of the article talks about committing funds on the order of hundreds of thousands or millions of dollars before the design phase begins.
close - Thursday, May 21, 2015
I assumed this is done before assembling the product. So you bin the chips, check them for errors, etc. before you solder them to a PCB. This way, even if you're not the manufacturer of the NAND, you still get to differentiate between chips and put the better ones in better products.

If you do the burn-in and defect checking AFTER the chips are soldered, you're basically guaranteeing that any defects will have to be remedied at extra cost.
Kristian Vättö - Thursday, May 21, 2015
NAND binning is usually done by the NAND manufacturer or packager, but there may (or actually will) still be bad blocks. The purpose of run-in testing is to identify the bad blocks so that the controller won't use them for storage as that could potentially lead to performance issues or data loss.
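As a very simplified illustration of what this run-in scan accomplishes, here is a rough sketch: write a known pattern to every block, read it back, and record any block that fails verification so it is never used for user data. The block count, test pattern and the simulated failures are all hypothetical; real drives do this in firmware against the raw NAND, not from host-side Python.

```python
# Simplified illustration of run-in bad-block scanning. Block count, pattern and the
# simulated failures are hypothetical; real controllers do this in firmware.

BLOCK_COUNT = 4096
PATTERN = b"\xA5" * 16           # small per-block test pattern, for illustration only
SIMULATED_BAD = {17, 803, 2950}  # blocks we pretend fail, to keep the example self-contained

def write_block(index: int, data: bytes) -> None:
    pass  # stand-in for a raw NAND block write

def read_block(index: int) -> bytes:
    # Stand-in for a raw NAND block read; corrupt the data for the simulated bad blocks.
    return b"\x00" * len(PATTERN) if index in SIMULATED_BAD else PATTERN

def run_in_scan() -> set:
    """Write and verify every block; return the indices that fail read-back."""
    bad_blocks = set()
    for block in range(BLOCK_COUNT):
        write_block(block, PATTERN)
        if read_block(block) != PATTERN:
            bad_blocks.add(block)
    return bad_blocks

# The resulting bad-block list is what the mapping layer would exclude from use.
print("Bad blocks to exclude from the mapping table:", sorted(run_in_scan()))
```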