- Data Science PyData
- Alexander Kagoshima: A Data Science Operationalization Framework
Speaker introduction
My name is Alex Kagoshima and I work as a data scientist at Pivotal. Pivotal has a couple of big data platforms, an open-source PaaS platform, and app development and data science practices. I don't work on internal data; I help customers realize use cases to get value out of their data, developing models on top of their data to show what we can do with big data.
Data Science Engagements Today
Customers have a lot of siloed data sources and siloed systems; their landscape is often complex. We pull their data and put it into some kind of distributed big data platform, usually Hadoop. Within the big data platform, I do the data extraction and develop models on top of it, analyzing historical data and gathering intelligence. Then I present the results of my models to show what they can do for the business.
Big data frameworks like Spark are very good at taking in data and producing results, but there is no principled way of getting those results back into the customers' existing systems. This is a big problem that we as data scientists should really work on, otherwise we are stuck in the hype cycle.
Data science engagements today - 2
-scripting instead of production-ready models
-training required on how to run the created scripts
-long setup times, especially on premise
-no good way of encapsulating our work for others to reuse
Try it differently
We need an easy way to attach our models to existing legacy systems. As data scientists, we might choose the fancy algorithm over the simple one, but the simple one is easier to operationalize. I leveraged Cloud Foundry to spin up environments quickly.
DS Operationalization Framework Prototype
I created a small framework; think of it as exposing an API. I used Flask as the server framework: data is sent in as JSON and results come back as JSON. Redis serves as the model and data storage backend.
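As a minimal sketch of what such an API could look like (the route names, payload fields, and the plain dict standing in for Redis are my assumptions, not the actual prototype's interface):

```python
# Minimal sketch of a JSON model API in the spirit of the prototype.
# Routes and payload fields are illustrative assumptions; a plain dict
# stands in for the Redis backend.
from flask import Flask, jsonify, request

app = Flask(__name__)
store = {}  # stand-in for Redis


@app.route("/model/<name>", methods=["PUT"])
def create_model(name):
    # A freshly created model has no parameters and no data yet.
    store[name] = {"w": 0.0, "b": 0.0, "points": []}
    return jsonify({"created": name})


@app.route("/model/<name>/data", methods=["POST"])
def ingest(name):
    model = store[name]
    point = request.get_json()
    model["points"].append((point["x"], point["y"]))
    # Refit with ordinary least squares once two or more points exist.
    xs = [x for x, _ in model["points"]]
    ys = [y for _, y in model["points"]]
    if len(xs) >= 2:
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        if var > 0:
            model["w"] = sum(
                (x - mx) * (y - my) for x, y in model["points"]
            ) / var
            model["b"] = my - model["w"] * mx
    return jsonify({"n_points": len(xs)})


@app.route("/model/<name>/score", methods=["POST"])
def score(name):
    model = store[name]
    x = request.get_json()["x"]
    return jsonify({"y": model["w"] * x + model["b"]})
```

A real deployment would replace the dict with Redis reads and writes, serializing the model parameters before storing them.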
I implemented two models to show how it works, linear regression and online linear regression, and the whole thing is deployable to Cloud Foundry.
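An online linear regression of the kind mentioned can be sketched with a per-point stochastic gradient update; the learning rate and the single-feature setup here are illustrative assumptions, not the talk's implementation.

```python
# Hedged sketch of an online (streaming) linear regression: the model is
# updated one (x, y) point at a time, so it can keep retraining as new
# data points arrive and adapt when the underlying behavior drifts.
class OnlineLinearRegression:
    def __init__(self, lr=0.01):
        self.w = 0.0  # slope
        self.b = 0.0  # intercept
        self.lr = lr  # learning rate (illustrative choice)

    def partial_fit(self, x, y):
        # Gradient step on the squared error of this single point.
        err = (self.w * x + self.b) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err

    def predict(self, x):
        return self.w * x + self.b
```

Feeding it a stream of points drawn from a line drives the weights toward that line, without ever holding the full dataset in memory.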
What is Cloud Foundry?
Think of it as Heroku, except that you can set up your own. This is really good for enterprises, since they often don't want to use the public cloud. It is open source.
Framework prototype
You have an API you can talk to. You send a JSON request to create a model; in the backend, the framework creates the model and stores it in Redis. You can then send data to the model as JSON; the API ingests the data and kicks off retraining. There are also some visualization tools. The idea is to have an encapsulated model. We can do a demo with this.
Demo
To create a model, do a curl request with the data. Then we send data points to the model; after 10 data points, the model has been trained. I also created a visualization, which looks a little better: it shows the last 100 data points sent in, and the blue line is the trained model as it learns. When the underlying behavior changes, the model adapts to that.
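The "trained after 10 data points" behavior from the demo can be sketched as a small buffer that kicks off retraining once a threshold is reached; the class name and threshold handling are my assumptions about the prototype.

```python
# Sketch of the retrain-after-N-points behavior shown in the demo.
# The buffer collects incoming points and triggers a full retrain of the
# wrapped model once enough data has arrived.
class RetrainingBuffer:
    def __init__(self, model, batch_size=10):
        self.model = model
        self.batch_size = batch_size
        self.batch = []

    def add_point(self, x, y):
        """Store one data point; retrain and report True when full."""
        self.batch.append((x, y))
        if len(self.batch) >= self.batch_size:
            self.model.train(self.batch)
            self.batch = []
            return True
        return False
```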
Advantages of this approach
-an API provides a simple way to expose models to other software
-the framework can free the data scientist from tedious work: keeping track of results for different modelling runs and of model performance after retraining
-new model instances can be created dynamically via the API, e.g. for different classes in the data or different versions of current models
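Dynamic model creation per class or per version could be as simple as a keyed registry; this is an illustrative sketch, not the prototype's code.

```python
# Sketch of dynamically creating model instances on demand, e.g. one model
# per class label or per model version requested through the API.
model_registry = {}


def get_or_create_model(name, factory):
    """Return the model stored under `name`, creating it on first request."""
    if name not in model_registry:
        model_registry[name] = factory()
    return model_registry[name]
```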
Framework prototype - model
The model is a class that needs to implement three functions: train(), score() and get_parameters().
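That three-function contract can be sketched as an abstract base class; everything beyond the three method names (the signatures and the toy implementation) is an assumption for illustration.

```python
# The three-method model contract, sketched as an abstract base class.
from abc import ABC, abstractmethod


class Model(ABC):
    @abstractmethod
    def train(self, data):
        """Fit (or refit) the model on a list of (x, y) pairs."""

    @abstractmethod
    def score(self, x):
        """Return a prediction for input x."""

    @abstractmethod
    def get_parameters(self):
        """Return the learned parameters, e.g. for storage in Redis."""


class MeanModel(Model):
    """Trivial conforming model: always predicts the mean of seen targets."""

    def __init__(self):
        self.mean = 0.0

    def train(self, data):
        ys = [y for _, y in data]
        self.mean = sum(ys) / len(ys) if ys else 0.0

    def score(self, x):
        return self.mean

    def get_parameters(self):
        return {"mean": self.mean}
```

Any class implementing the three methods can be plugged into the framework the same way, which is what makes the models swappable behind the API.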
Data Science in the Future?
You can expand a framework like this to work in the enterprise landscape. You need additional modules for data ingest, cleaning, transformation and aggregation. You also need a proper big data store; Redis is not that scalable. The nice thing about a PaaS is that you can also get some kind of data exploration module in there.
Open questions
-deploy the trainer and the scorer on the same instance or on different ones?
-where and how can we use the automatic scaling abilities of a PaaS?
-automatically spin up multiple instances for multiple classes in the data?
-should the framework be able to do feature transformation/feature engineering?
-what about feature generators and models that only work in batches?
-if real-time data for scoring is sent to the framework, how can a label be sent for that data later?
Similar projects
-Velox, from the people who developed Spark
-Google Prediction API
-PredictionIO, the most similar to what I did
Questions
Q: Can you give a real-life example of how we can use it?
A: I recently worked on an engagement detecting malware communication in proxy logs. We analyzed the domain names, and we could use the framework for that.
Q: In the API, you were sending the label directly. Would it not make sense to decouple the data and the model?
A: Yes, you can do that. I invite you to submit a pull request.

Video outline created using VideoJots.