
Heading Towards Big Data: Building a Better Data Warehouse for More Data, More Speed, and More Users

Raymond Gardiner Goss ([email protected]) and Kousikan Veeramuthu ([email protected])
Manufacturing Technology, GLOBALFOUNDRIES, Malta, NY, USA

978-1-4673-5007-5/13/$31.00 ©2013 IEEE — ASMC 2013

Abstract—As a new company, GLOBALFOUNDRIES is aggressively agile, looking at ways not just to mimic existing semiconductor manufacturing data management but to leverage new technologies and advances in data management without sacrificing performance or scalability. As a global technology company that relies on the understanding of data, it is important to centralize the visibility and control of this information, bringing it to the engineers and customers as they need it.

Currently, the factories employ best-practice data architectures combined with business intelligence analysis and reporting tools. However, the expected growth in data over the next several years, and the need to deliver more complex data integration for analysis, will easily stress the traditional tools beyond the limits of the traditional data infrastructure. Manufacturing systems vendors need to offer new solutions, based on Big Data concepts and working well with other vendors' offerings, to reach the new level of information processing.

This paper describes the various states of handling the increasing complexity and volumes today and the challenges ahead.

Keywords—Data Warehousing, Real-Time, Analysis, Reporting, Scaling, Big Data

I. INTRODUCTION

Not long ago, the price of gasoline was lower and we chose cars based on features we desired, like roof racks, cargo space, sporty looks, and prestige, but the world is changing. The cost of fuel has increased and we are more aware of environmental concerns. We are switching to vehicles with different engines that go much further with less energy, but we still expect all the new features of built-in GPS, backup cameras, and keyless ignition. The move to Big Data will be a similar paradigm shift. The principle of the analysis is the same, but the engines and the amount of data are changing.

Various industries have different problems, but most will have Big Data needs. When we first moved to the semiconductor manufacturing industry, we noticed that the transaction volume was a fraction of what we had experienced in the telecom world, where data were optimized into compressed bytes and streamed over raw sockets, and switch responses were expected in milliseconds. For instance, in an 800-number routing scheme, we had less than 250 ms to look up the phone number, determine the caller's income level, and specify to which agent to route the call, or the switch would time out. When switches were overwhelmed with data, they would drop packets, and algorithms had to infer states based on the most probable current state. Other industries, such as social media, are challenged more by unstructured data and need tools to turn text messages and photos into useful information for search engines and marketing purposes. The challenge in the semiconductor world is the size of the data. Speed becomes a secondary problem because so many sources need to be joined together in a timely manner. Large recipes and complex output from the test floor, combined now with more Interface-A trace data, amass terabytes each month that must be handled for real-time SPC, APC, and command-and-control scenarios as well as for offline yield analyses. Users now require real-time access to data from a much larger pool of sources.

In this paper, we will show where we are and where we are heading to manage the increasing need to handle larger amounts of data, with faster as well as secure access, for more users.

II. TRADITIONAL SOLUTION, GROWTH AND BIG DATA

A. Many Types of Big Data

In the past year, "Big Data" has been gaining buzz. It isn't uncommon to hear someone say, "we will scale with a Big Data solution", " does it just fine", or "vendor X must already have a Big Data solution in the plans." However, there are different Big Data problems and solutions, and not all apply or can be used at once. We first need to define the term.

Big Data is the territory where our existing traditional relational database and file system processing capacities are exceeded in transactional volume, velocity of response, and the quantity or variety of data. The data are too big, move too fast, or don't fit the strictures of RDBMS architectures. Scaling also becomes a problem. To gain value from these data, we must choose alternative ways to process them.

B. Complexity Example

Big Data covers a range of situations, all with the common theme of "more": more variety, more quantity, more users, more speed, more complexity. There are currently different Big Data solution approaches to each of these. Let us start with an example of determining root cause and correlation using a new variety of data.

In one fab, there was a challenge with reticle hazing. It wasn't hard to determine the culprit of the haze by sending a reticle off to the lab, but other details were not as easy. From a few of the facility's air quality sensors, it could be seen that there were traces of an oxidizing agent in the air, and the fab had started the practice of inspecting the reticles after every 200 wafers. While the risk to the wafers was mitigated, the transport counts became high and potentially created more exposure to contamination while in the Automated Material Handling System (AMHS). Where were the reticles picking up the haze? Was it from outgassing in the tool, or while they were in transit to the inspection tools or the stocker? To solve the problem, we needed temporal data from the process, metrology, and inspection tools, the MES, facilities, the MCS, and the AMHS to be brought together in one place for analysis. Up until this point, the data warehouse had not contained all such sources. From problems like this, we needed a solution, which resulted in the creation of the General Engineering and Manufacturing Data warehouse (GEM-D).

Data performance is becoming equally ripe for improvement. In our factories, command and control continually leverages more data sources and visualization of real-time data to make decisions. For instance, before wafers ship, a fab-out inspection occurs comparing all experiments, incidents, quality reviews, and prior holds from SPC events. All these data are made available not only to systems but also to the users so they can rectify any outstanding issues. The real-time systems need to support nearly instant responses. At the same time, we have data retention and archiving requirements to keep much of these data online for some time and then in a dearchivable state.

C. Ad Hoc Organic Data Organization and Growth

Nearly every engineering university graduate has had some programming experience and understands how to use a database. Even though we have clear requirements for architectural review of new system connections and introductions, a large percentage of connections are created either by asking for a user account and then building a stand-alone application around the data usage, or by first creating independent MS Access, PHP, or Perl applications that then "urgently" need to be connected to factory systems to solve some pressing need.

In Fig. 1, a standard factory system with a well-organized SOA architecture is shown, with the introduction of ad hoc data consumers and generators. These ad hoc systems could eventually be migrated into, or become, new core systems, but they serve as a reminder that our systems today have a long way to go to offer universal and consistent data integration.

D. Yesterday's Technology Today

Traditionally, GLOBALFOUNDRIES comprised systems from Chartered Ltd and AMD that were focused on self-contained processes and reports. Analyses were limited either to information placed in a data warehouse using the older batched Extract, Transform and Load (ETL) paradigm, which could have a lag of hours, or to specialized reports generated from separate run-time systems. The data warehouse, for example, had data from the SiView MES, inline SPC results, and engineering data from WET and SORT, but lacked data from advanced reticle handling and preventive maintenance activities. Direct queries to quality systems were often performed as a side activity and were not well correlated with the other data. The data warehouse in a single fab housed 40+ terabytes and yet did not house any of the newer Interface-A tool data. Similarly, reporting focused on MES WIP status and lacked access to data other than through some random run-time systems that were supposed to be used for decision services. The staff used tools like APF RTD reports, which are familiar to dispatch writers but not well suited for analysis, or traditional scheduled static reports using applications like SAP BusinessObjects. Both the data warehouse and WIP reporting did not stand up to the demands of more voluminous and real-time data. The number of systems that have relevant data for each use case continues to increase.
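The reticle-haze hunt in Section II.B hinged on aligning timestamped records from tools, the AMHS, and facilities sensors. A minimal sketch of such a temporal join, using pandas `merge_asof`; the column names and readings are hypothetical, not the actual GEM-D schema:

```python
import pandas as pd

# Hypothetical timestamped records from two sources (not the real GEM-D schema).
moves = pd.DataFrame({
    "ts": pd.to_datetime(["2013-01-05 08:00", "2013-01-05 09:30", "2013-01-05 11:00"]),
    "reticle_id": ["R17", "R17", "R17"],
    "location": ["litho_tool_A", "stocker_3", "inspection_B"],
})
air_quality = pd.DataFrame({
    "ts": pd.to_datetime(["2013-01-05 07:55", "2013-01-05 09:20", "2013-01-05 10:50"]),
    "oxidizer_ppb": [0.2, 4.1, 0.3],
})

# For each reticle move, attach the most recent air-quality reading, so
# oxidizer exposure can be correlated with where the reticle was at the time.
joined = pd.merge_asof(moves.sort_values("ts"),
                       air_quality.sort_values("ts"),
                       on="ts", direction="backward")
suspect = joined[joined["oxidizer_ppb"] > 1.0]
print(suspect[["ts", "location", "oxidizer_ppb"]])
```

In this toy data, only the move through `stocker_3` coincides with an elevated reading, which is exactly the kind of correlation that required many sources in one warehouse.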

[Fig. 1 is a block diagram: factory systems (MES SiView UI, SPC, FDC setup, decision framework, other fabs/corporate systems, services and integration) communicate over an MQ message bus, with replication into the Data Warehouse (GEM-D); EI, dispatch and scheduling, APC, RMS, CMMS, and eTEST attach to the bus, and business analysis and reporting sit on the warehouse. Ad hoc applications attach at multiple points.]

Fig. 1. Factory Systems with Ad Hoc applications
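The message-bus integration shown in Fig. 1 lets core systems publish events once while any number of consumers, including ad hoc ones, subscribe without point-to-point connections. A toy publish/subscribe sketch using only the standard library; the event fields are invented for illustration:

```python
import queue

# Toy pub/sub: each subscriber gets its own queue; publishing fans an
# event out to every subscriber, core and ad hoc alike.
subscribers = []

def subscribe():
    q = queue.Queue()
    subscribers.append(q)
    return q

def publish(event):
    for q in subscribers:
        q.put(event)

spc_feed = subscribe()      # a core SPC consumer
adhoc_feed = subscribe()    # an ad hoc engineering script

publish({"type": "lot_complete", "lot": "L42", "route": "R1"})
assert spc_feed.get_nowait()["lot"] == "L42"
assert adhoc_feed.get_nowait()["lot"] == "L42"
```

The decoupling is the point: the publisher does not know, and does not need to know, how many ad hoc consumers have attached.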

The landscape has now changed. GLOBALFOUNDRIES is focused on gathering data in real time, with less than 10-second latencies, in a new General Engineering and Manufacturing Data Warehouse (GEM-D, see Fig. 2), taking advantage of Oracle GoldenGate feeds that have minimal impact on run-time systems and provide the data in easily joinable schemas. These feeds were previously used only for decision services but are now scaling across the enterprise. New compression techniques are also utilized to reduce storage volumes. GEM-D offers one-stop shopping for users, with specific organized data marts built on top. New business intelligence tools empower users to sort and sift through the newly accumulated data with in-memory associative technology, in a compressed form with associations defined between data items. However, we are just beginning to enter the Big Data world. As masking layers increase, transistors shrink, tool data grow, and wafer sizes move to 450 mm, there are challenges ahead for such data-centric companies.

III. WHY DO WE NEED BIG DATA CAPABILITIES?

Big Data analytics can reveal insights previously hidden by data that were too costly to process, such as sensor logs and wafer maps in conjunction with other factory information. Being able to process every required item in a reasonable amount of time could increase factory throughput, skip sampling operations, ensure tool performance between maintenance cycles, and promote an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports. A large amount of data already sits unused in the fab archives because there is no cost-effective way to process it and extract value from it. Some of the Big Data use cases are explained below.

A. Transparency

Making Big Data more easily accessible to relevant stakeholders in a timely way can create tremendous value. Data are often compartmentalized within a single group in the fab. Several departments have their own IT systems and unfortunately often store and maintain redundant data. We need a free interchange of data among the different department systems. Here at GLOBALFOUNDRIES, many teams were trying to gain access to the business process request information and wanted to use it for automation, but that infrastructure only supported access to a copy refreshed every six hours; real-time information for decisions was impossible. Integrating data from R&D, engineering, and manufacturing units within the same fab and between fabs can also significantly reduce wasteful redundancy, greatly improve turnaround time for resolving issues, and accelerate time to market.

B. Experimental Analysis

Experimental analysis is critical in many areas of our fabs. The ability to process huge amounts of data will open up the possibility of conducting new tests and analyses that were unimaginable earlier. This will have a dramatic effect in areas like R&D and will help achieve faster time to market.

C. Automated Algorithms

Big Data can feed advanced analytics and algorithms to vastly improve the decision-making process and identify valuable insights that were previously hidden or not easily available. Fabs can adjust production lines automatically to optimize efficiency, reduce waste, and avoid dangerous conditions. At GLOBALFOUNDRIES, we are already using controlled experiments to make better decisions by embedding real-time, highly granular data from networked sensors in the supply chain and production processes. Automating the analysis of the data reported by sensors embedded in complex products, combined with the tool owners' input, enables manufacturers to create a proactive smart preventive maintenance service. Service personnel can perform PM operations before an equipment failure occurs that might cause costly fab disruptions. This also enables the fab to adopt new business models, such as leasing portions of fab space to specific customers based on the sensor data.

D. Virtual Factory

Taking product development and historical data together with real-time inputs from the MES, fabs can apply advanced computational methods to create a digital model of the entire manufacturing process. This model can be used to design and simulate the most efficient production system. Applications of the virtual factory include: 1) validation of the designed production concept; 2) process verification before the start of production; 3) optimization of production equipment allocation; 4) bottleneck and collision analysis; 5) better utilization of existing resources; and 6) elimination of errors in the production line. Fab engineers can leverage the power of Big Data to simulate these operations with millions of different combinations to optimally schedule and dispatch WIP.

IV. BIG DATA SOLUTIONS

As mentioned, there are several types of Big Data problems to be solved. The computer science industry offers a few models. Described here are the main variants and their applicable uses. A summary pros and cons list is provided in Table I, and a corresponding radar chart, which includes a traditional RDBMS similar to GEM-D, is shown in Fig. 3. The desirable solution would be one that scores high on all or most of the axes.

A. The Data Appliance

Data appliance offerings include Oracle Exadata, IBM Netezza, and Teradata's platforms. These solutions offer a complete, closed solution with optimized hardware accelerators that scale according to rack space. These appliances offer access to data via traditional SQL, but indexing and queries are optimized by proprietary architectures in which query distribution and select statements are offloaded from the CPU to specialized chips.
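The appliance idea of offloading selection to the storage layer, described in Section IV.A, can be sketched in miniature: compare shipping every row to the database CPU for filtering against applying the predicate at the storage side and shipping only the matches. The row contents and numbers are invented for illustration:

```python
# Toy sketch of predicate offload: the same query answered two ways,
# counting how many rows cross the "interconnect" in each case.
rows = [{"lot": f"L{i}", "yield_pct": 80 + (i * 7) % 20} for i in range(1000)]

def scan_no_offload(rows, predicate):
    # All rows are shipped; the database CPU filters them afterwards.
    shipped = list(rows)
    return [r for r in shipped if predicate(r)], len(shipped)

def scan_with_offload(rows, predicate):
    # The storage cell applies the predicate; only matches are shipped.
    shipped = [r for r in rows if predicate(r)]
    return shipped, len(shipped)

def high_yield(r):
    return r["yield_pct"] >= 95

res1, moved1 = scan_no_offload(rows, high_yield)
res2, moved2 = scan_with_offload(rows, high_yield)
assert res1 == res2            # identical answer either way
print(moved1, moved2)          # far fewer rows move when offloaded
```

The answer is identical in both cases; what the specialized hardware buys is the reduction in data movement.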


[Fig. 2 is a block diagram contrasting the current state with future-state additions and considerations: factory source systems, plus untapped sources (EI logs, bitmaps, system logs, test data feeds), feed real-time data into the GEM-D staging layer and the GEM-D logical layer; additions under consideration include integrated identity management; compression, in-memory analysis, and hardware solutions under evaluation; real-time and automated analysis; new BI reports; and new tools and analytics.]

Fig. 2. The GEM-D Model
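The staging layer in Fig. 2 is fed by low-latency replication rather than the hours-lag batched ETL of the older warehouse. One common ingredient of such feeds is an incremental load driven by a high-water mark; a toy sketch, with hypothetical row and column names that are not GEM-D's actual schema:

```python
# Toy incremental load: pull only rows changed since the last high-water
# mark, instead of re-extracting the whole table each batch window.
source_rows = [
    {"id": 1, "updated": 100, "value": "a"},
    {"id": 2, "updated": 205, "value": "b"},
    {"id": 3, "updated": 310, "value": "c"},
]

def incremental_load(rows, high_water_mark):
    """Return rows changed since the last load, plus the advanced mark."""
    fresh = [r for r in rows if r["updated"] > high_water_mark]
    new_mark = max((r["updated"] for r in fresh), default=high_water_mark)
    return fresh, new_mark

fresh, mark = incremental_load(source_rows, high_water_mark=200)
print(len(fresh), mark)  # only the two newer rows move; the mark advances
```

Log-based capture, as GoldenGate performs it, avoids even the source-table scan, but the contract is the same: ship only what changed since the last load.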

B. The Hadoop Derivative

In order to scale to petabytes of unstructured data (data not modeled in tables with well-defined rows and columns), Google introduced the MapReduce paradigm, which was later made available in the Hadoop open source project. This has been extended with tools like HBase, Pig, and Hive, which improve usability and reduce the complexity of MapReduce coding. Using commodity hardware as a foundation, Hadoop provides a layer of software that spans the entire grid, turning it into a single system. Hadoop-based solutions are provided by companies like Cloudera, Hortonworks, and MapR.

C. Massive In-Memory Database

While the previous two models offer ways to scale to larger amounts of data, a new model attempts to make larger amounts of data available in real time by placing the database in a highly available cluster of servers that keep all the data in memory. This eliminates the need for indexing, and I/O on traditional drive storage is eliminated entirely. Some systems, such as SAP HANA, also flush data to disk for recovery scenarios. It is by far the fastest solution for critical structured data, but it comes with a price tag.

D. Solid State Disk (SSD)

There is yet another improvement, applicable to any of the other Big Data solutions or even to the traditional model, that addresses speed of access: the hard drive storage itself can be moved to solid state. These "flash drives" can be applied to all or just a subset of the data. SSD-based solutions can also be used with Hadoop systems to address Big Data problems. Companies like Fusion-io, NetApp, and EMC provide SSD-based solutions.

Fig. 3. Comparison chart of Big Data technologies

V. WHERE DO WE GO FROM HERE?

The path to the next level of data management is not yet defined. Reporting and analysis have already moved beyond the single system or data set. Factory data volumes are expanding rapidly. The reader can see from Fig. 3 that there are significant trade-offs among these solutions: the appliance covers a subset of the traditional RDBMS, whereas the Hadoop paradigm covers new territory.

A. A Call for Action

We appeal to our vendors to work together to leverage Big Data strategies within their product offerings. Several things can be done to help, including:

• Provide schemas that are portable and not locked into specific RDBMS providers.

• Provide Map-Reduce stubs that can be grouped together with other such routines. This applies to tool vendors' log files, test file output, and otherwise untapped data.

• Leverage SOA architectures using interchangeable message buses for communication.

• Work together to offer a standard nomenclature of objects to better tie system data together.

• Expect to work with new analytic tools for analysis and reporting that do not require proprietary reporting platforms, perhaps providing components or models for BI solutions.

B. Big Data Environment

We are looking to give vendors the opportunity to test their Big Data platforms in our lab. Our state-of-the-art test environment has been a place where vendors introduce hardware and software solutions and see how they work together with other factory applications.

VI. CONCLUSION

We are entering a new realm of data management. Solutions will perhaps take several forms. As the complexity of our needs scales, we need our suppliers to move away from stand-alone proprietary infrastructure and reporting and to offer plug-and-play components.

ACKNOWLEDGMENT

The authors would like to thank the Data Integration and IT DBA teams at GLOBALFOUNDRIES for deploying GEM-D and making the current solutions a reality.

TABLE I. TECHNOLOGY PROS AND CONS

Appliance RDBMS
  Pros:
  • Very mature; long history of successful installations.
  • Can be used by multiple user types, from business users using reporting tools, through SQL novices, to expert users.
  • Vendors like Teradata support data volumes in petabytes.
  • Custom hardware accelerates queries and transformations.
  Cons:
  • Fault tolerance is only at the transaction level; cannot survive node failure.
  • Supports only structured data.
  • Homogeneous hardware: all nodes in the installation must be the same.
  • Disk-based data storage; limited real-time analysis capabilities compared to in-memory technologies.
  • OLTP and OLAP layers are separate.
  • More expensive compared to open source solutions.

Hadoop
  Pros:
  • Data volumes in petabytes.
  • Open source.
  • Commodity hardware; inexpensive.
  • Fault tolerant; designed to survive multiple node failures.
  • Supports both structured and unstructured data; data can be operated on in native format.
  • Heterogeneous hardware: the nodes in the installation can differ.
  Cons:
  • Currently has only batch processing capabilities; real-time data processing is still in the works.
  • Requires programming skills to work with; smaller pool of individuals capable of performing these tasks.
  • Relatively new technology; the source is continuously under development, with new features being added.

In-Memory DB
  Pros:
  • In-memory, real-time analysis.
  • Single foundation for OLTP and OLAP.
  • No need for indexing.
  Cons:
  • Data volume in terabytes; doesn't currently scale to petabyte size.
  • Very expensive; enterprise-level toolset with proprietary hardware and software.
  • Limited fault tolerance capabilities.

SSD
  Pros:
  • Very high performance.
  • No spinning disks.
  • Consumes less power and space.
  • Leverages existing software architectures and database systems, improving virtually any disk-based solution.
  Cons:
  • Very expensive; about ten times the cost of a standard hard drive.
  • Limited number of writes, limiting the life of the device.
  • Cannot scale out.
  • Relatively less reliable.
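Table I's Hadoop row centers on the MapReduce model introduced in Section IV.B. The paradigm can be illustrated without a cluster: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A pure-Python sketch with invented equipment log lines (the field layout is hypothetical):

```python
from collections import defaultdict

# Hypothetical equipment log lines; field layout invented for illustration.
logs = [
    "2013-03-01 TOOL_A ALARM",
    "2013-03-01 TOOL_B OK",
    "2013-03-02 TOOL_A ALARM",
    "2013-03-02 TOOL_A OK",
]

# Map: emit one (tool, 1) pair per alarm line.
def map_phase(line):
    _, tool, event = line.split()
    if event == "ALARM":
        yield (tool, 1)

# Shuffle: group values by key (what Hadoop does between map and reduce).
groups = defaultdict(list)
for line in logs:
    for key, value in map_phase(line):
        groups[key].append(value)

# Reduce: aggregate each key's values.
alarm_counts = {tool: sum(values) for tool, values in groups.items()}
print(alarm_counts)  # {'TOOL_A': 2}
```

On a real cluster, map and reduce run in parallel across nodes and the shuffle moves data over the network, which is what lets the same three-step shape scale to the petabyte log volumes in the table.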

