Videos about data science from PyData Conferences

Alexander Kagoshima: A Data Science Operationalization Framework

5/29/2015 [00:31:49]

In many of our Data Science customer engagements at Pivotal, the question comes up of how to put the developed Data Science models into production. Usually, the code produced by the Data Scientist is a collection of scripts that go from data loading over data cleansing to feature extraction and then model training. There is rarely much thought put into how the resulting model can be used by other pieces of software, and failing to encapsulate the Data Scientist's work for others to re-use is generally bad practice.

What we as Data Scientists want is to create models that drive automated decision-making, but there is clearly a mismatch with the above way of going about Big Data projects. Considering these challenges, we created a small prototype for a Data Science operationalization framework. It allows the Data Scientist to implement a model which the framework exposes as a REST API for easy access by software developers.

The difference from other predictive APIs is that this framework allows for automatic periodic retraining of the implemented model on incoming streaming data, and is able to free the Data Scientist from some tedious work, like keeping track of results for different modelling and feature engineering approaches, basic visualization of model performance, and the creation of multiple model instances for different data streams. It is written by practicing Data Scientists for Data Scientists.

Moreover, the framework will be released this year under an Open Source license. Unlike other predictive APIs, which host only one instance for Data Scientists to push their models to, this lets Data Scientists completely control their own model codebase. In addition, it is deployable on Cloud Foundry and Heroku and can thus use some features of a PaaS, which means less work in thinking about how to deploy and scale a model in production.
The model is implemented in Python and uses Flask to expose the REST API; the current prototype uses Redis as backend storage for the trained models. Models can either be custom-written or use existing Python ML libraries like scikit-learn. The framework is currently geared towards online learning, but it is possible to hook it up to a Spark backend to realize model training in batch on large datasets. Alexander Kagoshima
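The abstract describes the core pattern: a model behind a Flask REST API that can be retrained incrementally on streaming data. The following is a minimal sketch of that pattern, not the framework's actual code; the endpoint names, JSON payloads, and the choice of `SGDRegressor` are assumptions for illustration.

```python
# Sketch: a scikit-learn model exposed over a Flask REST API, with an
# endpoint that updates the model incrementally (online learning) as
# new streaming observations arrive. Redis persistence is omitted.
from flask import Flask, jsonify, request
from sklearn.linear_model import SGDRegressor
import numpy as np

app = Flask(__name__)
model = SGDRegressor()

@app.route("/predict", methods=["POST"])
def predict():
    x = np.array(request.json["features"]).reshape(1, -1)
    return jsonify(prediction=float(model.predict(x)[0]))

@app.route("/train", methods=["POST"])
def train():
    # partial_fit updates the model in place - no full retrain needed
    x = np.array(request.json["features"]).reshape(1, -1)
    y = np.array([request.json["target"]])
    model.partial_fit(x, y)
    return jsonify(status="updated")

# app.run() would serve this locally; on Cloud Foundry or Heroku the
# PaaS supplies the process management and scaling.
```

The periodic-retraining idea then reduces to a scheduler posting fresh stream batches to `/train`.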

View Outline

Peadar Coyle: Probabilistic Programming in Sports Analytics

5/30/2015 [00:23:43]

Probabilistic Programming and Bayesian Methods are called by some a new paradigm. There are numerous interesting applications, such as in Quantitative Finance. I'll discuss what probabilistic programming is, why you should care, and how to use PyMC and PyMC3 from Python to implement these methods. I'll be applying these methods to the problem of 'rugby sports analytics', in particular how to model the winning team in the recent Six Nations in Rugby. I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert. Peadar Coyle
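The talk builds its model in PyMC/PyMC3; as a dependency-free sketch of the underlying Bayesian idea, the toy example below infers a team's scoring rate from a handful of match scores via a discretised posterior. The scores, the Poisson model, and the flat prior are illustrative assumptions, not the talk's actual Six Nations model.

```python
# Toy Bayesian inference: posterior over a team's Poisson scoring rate,
# computed by brute force on a grid (what PyMC does far more generally
# with MCMC samplers).
import math

scores = [16, 23, 19, 29, 20]             # hypothetical points per match
rates = [r / 10 for r in range(50, 400)]  # candidate rates: 5.0 .. 39.9

def log_likelihood(rate, data):
    # Poisson log-likelihood, dropping the data-only constant term
    return sum(y * math.log(rate) - rate for y in data)

# Flat prior over the grid, so the posterior is proportional to the likelihood
logs = [log_likelihood(r, scores) for r in rates]
m = max(logs)                              # subtract max for numerical stability
weights = [math.exp(l - m) for l in logs]
total = sum(weights)
posterior = [w / total for w in weights]

# Posterior mean scoring rate, close to the sample mean of 21.4
posterior_mean = sum(r * p for r, p in zip(rates, posterior))
print(round(posterior_mean, 1))
```

In PyMC3 the same model is a few lines of declarative code, and the sampler replaces the explicit grid.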

View Outline

Brian Carter: Lifecycle of Web Text Mining: Scrape to Sense

5/30/2015 [00:27:54]

Pillreports.net is an online database of reviews of Ecstasy pills. In consumer theory, illicit drugs are experience goods, in that the contents are not known until the time of consumption. Websites like Pillreports.net may be viewed as an attempt to bridge that gap, as well as to highlight instances where a particular pill is producing undesirable effects. This talk will present the experiences and insights from a text mining project using data scraped from the Pillreports.net site. The setup and its benefits, including the ease of using the BeautifulSoup package and pymongo to store the data in MongoDB, will be outlined. A brief overview of some interesting parts of data cleansing will be detailed. Insights and understanding of the data gained from applying classification and clustering techniques will be outlined; in particular, visualizations of decision boundaries in classification using the most important variables, and visualizations of PCA projections to illustrate cluster separation. The talk will be presented in the IPython notebook, and all relevant datasets and code will be supplied. Python packages used: bs4, matplotlib, nltk, numpy, pandas, re, seaborn, sklearn, scipy, urllib2. Brian Carter
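The scrape-and-store step the abstract mentions can be sketched as below. The HTML structure and field names are invented for illustration; the real report pages will differ, and the MongoDB insert is shown only as a comment since it needs a running server.

```python
# Sketch of the scrape step: parse one report's fields out of HTML with
# BeautifulSoup into a dict, the shape pymongo would then insert into MongoDB.
from bs4 import BeautifulSoup

html = """
<div class="report">
  <span class="name">Blue Star</span>
  <span class="warning">yes</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
record = {
    "name": soup.find("span", class_="name").get_text(strip=True),
    "warning": soup.find("span", class_="warning").get_text(strip=True),
}
# With MongoDB running: MongoClient().pillreports.reports.insert_one(record)
print(record)
```

Storing each report as one document is what makes MongoDB a comfortable fit here: the schema can vary from report to report without migrations.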

View Outline

Paul Balzer: Running, walking, sitting or biking? - Motion prediction with acceleration and rotation

5/29/2015 [00:22:20]

A lot of devices can measure acceleration and rotation rates. With the right features, Machine Learning can predict whether you are sitting, running, walking or riding a bike. This talk will show you how to calculate features with Pandas and set up a real-time classifier with scikit-learn, including a hardware demo. Paul Balzer
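A common shape for this pipeline is windowed summary statistics over the raw sensor stream as features for a classifier. The sketch below follows that idea with synthetic signals; the window size, the two activities, and the choice of a decision tree are assumptions for illustration, not the talk's exact setup.

```python
# Sketch: compute per-window features (mean, std) of an acceleration
# signal with pandas, then train a scikit-learn classifier on them.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def windows(signal, label, size=50):
    """Split a 1-D acceleration signal into windows of summary features."""
    series = pd.Series(signal, name="acc")
    feats = []
    for start in range(0, len(series) - size + 1, size):
        w = series.iloc[start:start + size]
        feats.append({"mean": w.mean(), "std": w.std(), "label": label})
    return pd.DataFrame(feats)

# Synthetic signals: sitting is quiet, running is high-variance
sitting = windows(rng.normal(0.0, 0.05, 1000), "sitting")
running = windows(rng.normal(0.0, 1.5, 1000), "running")
data = pd.concat([sitting, running], ignore_index=True)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(data[["mean", "std"]], data["label"])

# A new high-variance window should land in the "running" class
query = pd.DataFrame([[0.0, 1.4]], columns=["mean", "std"])
print(clf.predict(query)[0])
```

For the real-time part, the same `windows` logic runs on a sliding buffer of the latest samples and feeds `clf.predict` each time the buffer fills.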

View Outline

Radim Řehůřek: Faster than Google? Optimization lessons in Python

7/27/2014 [00:27:30]

View slides for this presentation here: http://www.slideshare.net/PyData/radim-ehek

PyData Berlin 2014. Lessons from translating Google's deep learning algorithm into Python. Can a Python port compete with Google's tightly optimized C code? Spoiler: making use of Python and its vibrant ecosystem (generators, NumPy, Cython...), the optimized Python port is cleaner, more readable and clocks in, somewhat astonishingly, 4x faster than Google's C. This is 12,000x faster than a naive, pure Python implementation and 100x faster than an optimized NumPy implementation. The talk will go over what went well (data streaming to process humongous datasets, parallelization and avoiding the GIL with Cython, plugging into BLAS) as well as trouble along the way (BLAS idiosyncrasies, Cython issues, dead ends). The quest is also documented on my blog.
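One of the lessons listed above, streaming data to process humongous datasets, rests on Python generators: the corpus is consumed lazily, one sentence at a time, in constant memory. A minimal pure-Python sketch of that pattern (the corpus here is made up):

```python
# Generator-based corpus streaming: each sentence is tokenised and yielded
# on demand, so an arbitrarily large corpus never has to fit in memory.
def stream_sentences(lines):
    """Yield tokenised sentences one at a time - O(1) memory."""
    for line in lines:
        tokens = line.lower().split()
        if tokens:                 # skip blank lines
            yield tokens

corpus = ["The quick brown fox", "", "jumps over the lazy dog"]
for sentence in stream_sentences(corpus):
    print(sentence)
```

In practice `lines` would be a lazily read file (or many files), and the training loop simply iterates over the generator, restarting it for each pass.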

View Outline

Jose Luis Lopez Pino: Lessons learned from applying PyData to our marketing organization

5/30/2015 [00:32:21]

For all e-commerce sites, marketing is a big part of the business, and marketing efficiency and effectiveness are critical to their success. Companies must make many data-driven decisions in order to reach customers that their competitors don't, maximize the revenue of each click, decide wisely which costs to cut, enter new markets, etc.

GetYourGuide has been working for more than two years on building a marketing intelligence system that allows us to grow our marketing efforts in the travel market without building a huge team or buying extremely expensive tools. All the decisions are supported by a dedicated system running on the PyData stack that allows marketers to extract valuable insights from data and performs critical marketing tasks: keyword mining, campaign automation, predictive modeling, omni-channel marketing data integration, customer segmentation, pattern mining from click data, etc.

As a result, we were able to triple our marketing efforts, launch campaigns in 13 markets and automate 75% of our work in the last 8 months alone. But this is not the end of our journey: GetYourGuide is building a Data Science team to understand travelers' needs and wants and make our customers' trips amazing. Jose Luis Lopez Pino
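Customer segmentation is one of the PyData-stack tasks listed above. As a minimal sketch of what that can look like with scikit-learn, the toy example below clusters customers on two made-up features; the features, the data, and the cluster count are illustrative assumptions, not GetYourGuide's setup.

```python
# Toy customer segmentation with k-means: two obvious spend/engagement
# segments in synthetic data, recovered as two cluster labels.
import numpy as np
from sklearn.cluster import KMeans

# rows: [monthly spend in EUR, ad clicks]
customers = np.array([
    [10, 2], [12, 3], [11, 1],        # low-spend segment
    [200, 40], [210, 35], [190, 42],  # high-spend segment
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
labels = kmeans.labels_
# all low-spend customers share one label, all high-spend the other
print(labels)
```

In a real pipeline the features would come from the integrated marketing data (campaign costs, click patterns, bookings), and choosing the number of segments becomes part of the modeling work.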

View Outline