In this blog, I will give you an overview of the steps involved in adopting and deploying Big Data ML (Machine Learning) at your organization. First, a bit of history.
When I worked at Intel 23 years ago, we deployed simulated annealing to optimize transistor circuit sizes for IO (Input/Output) cells in a standard cell library, targeting the fastest propagation delay (low-to-high and high-to-low) at minimum noise (inductive voltage due to power supply and ground). An R implementation of the technique we applied then is available now in the optim package. There, you will find a matrix example similar to the delay and noise constraint equations we used at the time.
Algebraically, both delay and noise are functions of circuit sizes for a given process and operating condition. If you want to know more about how they are related, please refer to the presentation. For how noise is related to circuit sizing, please refer to the article. Before we deployed the simulated annealing based optimization method, thus automating circuit design to a large extent, we would manually design each circuit, simulate it under different operating conditions, and make sure the highest speed was achieved with the least noise. As chip manufacturing process technology evolved, along with the need for new features on the chip, the library size grew, much as data has since grown into big data. That is when Intel R&D decided to automate the circuit sizing process.

Under the new framework, circuit delay and noise models were built with initial size sets from past experience. These equations, i.e. constraints, were predictive models. Real simulated delay and noise measurements were taken to verify the models, and we tweaked the models based on that feedback. If the design required a bigger circuit for faster speed, we would plug the target values into the model and derive approximate circuit sizes. That cut down design time considerably.
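To connect that history to tools available today, here is a minimal sketch in R of the same idea, using optim with method = "SANN" (simulated annealing in base R). The delay and noise functions below are invented stand-ins for the fitted constraint equations we used, and the noise budget and penalty weight are arbitrary assumptions for illustration.

```r
# Toy stand-ins for the fitted constraint models: delay falls and noise
# rises as transistor widths grow. These are NOT the original Intel models.
delay <- function(w) sum(1 / w) + 0.05 * sum(w)
noise <- function(w) 0.1 * sum(w^2)

# Penalized objective: minimize delay subject to a (hypothetical) noise budget.
objective <- function(x) {
  w <- exp(x)                               # optimize log-widths to keep w > 0
  delay(w) + 10 * max(0, noise(w) - 5)      # penalty when noise exceeds budget
}

set.seed(1)
fit <- optim(par = log(c(1, 1, 1)),         # initial widths of three devices
             fn = objective,
             method = "SANN",               # simulated annealing in base R
             control = list(maxit = 20000))
exp(fit$par)                                # approximate "circuit sizes"
```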
The simulated annealing technique was adopted for cell library characterization, and it saved man-months in library development, making Intel truly agile. Today, we hear a lot about big data and machine learning. I started comparing what we did 23 years ago with what is done now, and I see a striking similarity. That is the theme of this blog.
ML Steps and Tasks
Let us now discuss the steps and processes of machine learning adoption and deployment.
Steps
From my past experience, and from what I have found researching Big Data machine learning platforms at Google, Netflix and others, machine learning adoption and deployment is a seven-step process (a minimal end-to-end sketch in R follows the list):
1. Data selection - Data may need to be cleaned and preprocessed.
2. Feature selection - The size (dimensionality) of the data can be large; document classification, for example, may involve too many words. Use features that are easier to extract and less sensitive to noise. Divide the dataset into a training set and a testing set.
3. Model selection - A lot of guessing goes on here. Based on experience and domain expertise, select the model (or model set) and the error function. Select the simplest model first, then try another class of model if needed. We need to avoid over-fitting.
4. Learning - Train the learner or model. Find the parameter values by minimizing the error function.
5. Evaluation - The learner is evaluated on the testing dataset. You may need to select another model, or switch to a different set of features, if the error is not acceptable.
6. Application - Apply the model. For example, perform prediction on new, unseen data using the learned model.
7. Production - Deploy the model in production and tune it over time as the business model and stakeholders evolve.
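To make these steps concrete, here is a minimal end-to-end sketch in R. The iris dataset, the 70/30 split, and multinomial logistic regression are illustrative choices only, not a prescription.

```r
set.seed(42)

# Steps 1-2: data and feature selection, plus a train/test split.
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Steps 3-4: pick a simple model first and train it.
library(nnet)                              # ships with R
model <- multinom(Species ~ ., data = train, trace = FALSE)

# Step 5: evaluate on the held-out test set.
pred <- predict(model, newdata = test)
mean(pred == test$Species)                 # accuracy

# Step 6: apply the learned model to new, unseen data.
new_obs <- data.frame(Sepal.Length = 5.9, Sepal.Width = 3.0,
                      Petal.Length = 4.2, Petal.Width = 1.5)
predict(model, newdata = new_obs)
```

Step 7, production, is about wrapping a pipeline like this in deployment, monitoring and periodic retraining, which the rest of this blog touches on.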
Tasks
If the application of machine learning is crucial to your business success, as it was for what we implemented at Intel for standard cell library characterization, the organization needs to evolve and adapt. Stakeholders have to concur and work toward making it a success; every party benefits in the end. Let me list the tasks that need to be in practice. The ML team should be proficient in at least the following:
- Different learning problems and their solutions
- Choosing the right model for a learning problem
- Finding good parameter values for the models
- Selecting good features or parameters as input to a model
- Evaluating a machine learner
- Sharing results and collecting feedback
- Continuously improving model and business results
It is absolutely a team task, and it is agile. The team that collaborates best gets the best results and contributes to the bottom line of a business that thrives on the power of ML. There is no doubt about it.
ML Platforms, Packages and Model Building
As you may know by now, ML adoption and deployment is a practice built on platforms. The platforms embody packages of modules, and together they help us build the model.
ML Platform
Google, Netflix and other successful ML practitioners deploy ML platforms and packages. Their platform and package patterns are proprietary, and they also have successful teams in place. Why is a platform crucial?
We need a platform that the team can rely on. As discussed in my previous blog, a recommendation engine platform is also an ML platform: it has built-in components that the team can rely on daily. Are there open source platforms? In that regard, the Kiji framework is an HBase/Cassandra-based ML framework. Before Kiji, machine learning algorithms were ported to run on the Hadoop platform, thanks to Mahout. Hadoop distribution vendors like Cloudera, MapR and Pivotal also offer ML platforms.
At the high end, Google has deployed Sibyl, a proven ML platform.
ML Packages
We know that proper modeling is fundamental to the success of an ML platform and deployment. I have already covered modeling algorithms in my blog on Basics of Machine Learning. While there are commercial packages like MATLAB, SAS and SPSS from various vendors, there is the open source R caret package for anyone to learn the modeling, evaluation and tuning aspects of ML. This package has several functions that attempt to streamline the model building, evaluation and selection process. There are other open source packages, such as scikit-learn (Python) and MLlib (Scala/Spark), available for your consideration, but I have not seen one comparable to caret yet.
In caret, for example, the train function can be used to evaluate, using resampling, the effect of model tuning parameters on performance. You then choose the "optimal" model across these parameters, while measuring model performance on a training set. This is what we did at Intel 23 years ago; we did a few steps manually, but the process was essentially the same.
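As a minimal sketch of that idea (the iris data and the random forest method are illustrative assumptions, not the only options):

```r
library(caret)
set.seed(42)

# Evaluate candidate tuning parameter values via 5-fold cross-validation.
fit <- train(Species ~ ., data = iris,
             method     = "rf",                          # random forest
             tuneLength = 3,                             # try 3 parameter values
             trControl  = trainControl(method = "cv", number = 5))

print(fit)      # resampled performance profile across tuning values
fit$bestTune    # the "optimal" tuning parameter chosen by train
```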
ML Model Building
How do we build a model? First, a specific model must be chosen. Then, tuning of the model parameters comes into play. Let us see how we do it using the caret package.
The first step in tuning the model is to choose a set of parameters to evaluate. For example, if fitting a Partial Least Squares (PLS) model, the number of PLS components to evaluate must be specified. This is the same as fitting approximate sizes to the circuit delay model equation, as we did at Intel.
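A hedged sketch, again on illustrative data: in caret you pass the candidate component counts through tuneGrid, and the range 1:4 below is an arbitrary choice.

```r
library(caret)
set.seed(42)

# Specify exactly which numbers of PLS components train should evaluate.
plsFit <- train(Species ~ ., data = iris,
                method     = "pls",
                preProcess = c("center", "scale"),        # PLS prefers scaled inputs
                tuneGrid   = data.frame(ncomp = 1:4),     # candidate component counts
                trControl  = trainControl(method = "cv", number = 5))

plsFit$bestTune   # the number of components chosen by resampling
```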
Here is the pseudo-algorithm:

```
Define sets of model parameters to evaluate
For each model parameter set
    For each resampled set of data
        Hold out some samples as test data
        Prepare the remaining data for analysis
        Fit the model
        Evaluate the model on the held-out (test) data
    Done
    Average performance across the hold-outs
Done
Select the best parameter set for the given model
```
Currently, 150+ such models are available in caret, each with a list of tuning parameters that can potentially be optimized; user-defined models can also be created. Once the model and tuning parameter values have been defined, the type of resampling should also be specified. After resampling, the train function produces a profile of performance measures to guide the user as to which tuning parameter values should be chosen. By default, train automatically chooses the tuning parameters associated with the best value, although different selection algorithms can be used. Please refer to the caret documentation on the training process for more information.
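For example, both the resampling type and the selection rule can be set in trainControl. The repeated cross-validation settings and the one-standard-error selection rule below are illustrative choices, not defaults you must use.

```r
library(caret)
set.seed(42)

# Choose the resampling type and the rule used to pick the final parameters.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     selectionFunction = "oneSE")   # vs. the default "best"

fit <- train(Species ~ ., data = iris,
             method    = "pls",
             tuneGrid  = data.frame(ncomp = 1:4),
             trControl = ctrl)

plot(fit)   # performance profile to guide the choice of tuning values
```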
Conclusion
In this blog, we covered how my own experience of circuit design automation at Intel still applies today, in the form of the tasks and processes involved in building and tuning Big Data ML models on platforms using packages. Developing a tuned ML model using one of the packages listed above is a fundamental task, but there is a lot that goes on before and after the building process to make ML deployment truly successful in an organization.
It all depends on your business objective, stakeholders and the acceptable latency to serve the business need; most importantly, it depends on matching infrastructure and resources to handle the compute/IO/storage/network demand if the model needs to be deployed in production at scale.
I will be speaking at Silicon Valley Code Camp 2014. If you are in the San Francisco Bay Area, please attend the session. It is FREE. Please find the session details at Developing Real Time Recommendation Engine.