
My name is Brian Carter and I work as a data scientist at IBM. All IPython notebooks are up on my GitHub account. Previously I worked for the Irish Police Force, and this presentation is an offshoot of a project I worked on there. The web scrape focuses on Pillreports, an online database of reviews of Ecstasy pills. Review sites for experience goods can be viewed as a bridge across the knowledge gap between what people think they are buying and what they actually experience. Was it possible to identify a pill that was producing undesirable effects?
Example of review
People fill in a description field. There are two further fields: consumed and warning.
Example: Yellow Instagram

Scraping the data

Data Exploration


1. Scraping the data
Use urllib2 and BeautifulSoup
Pillreports is a simple website.

I set up a database in MongoDB.

Two collections in MongoDB: the reports and the comments.
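A minimal sketch of the scraping step, assuming a hypothetical page layout and field names. The talk used urllib2, which is Python 2 only; `urllib.request` is the Python 3 equivalent. A static HTML snippet stands in for a fetched report page so the example is self-contained:

```python
from bs4 import BeautifulSoup

# In the talk, urllib2 fetched each report page; the Python 3
# equivalent would be:
#   from urllib.request import urlopen
#   html = urlopen("https://pillreports.net/...").read()
# Here a static snippet stands in for a fetched page (layout invented).
html = """
<table>
  <tr><th>Name</th><td>Yellow Instagram</td></tr>
  <tr><th>Warning</th><td>yes</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Build a report document from each header/value pair.
report = {th.get_text(): th.find_next("td").get_text()
          for th in soup.find_all("th")}
print(report)  # {'Name': 'Yellow Instagram', 'Warning': 'yes'}

# The talk stored reports and comments in two MongoDB collections,
# roughly (requires a running mongod and pymongo):
#   from pymongo import MongoClient
#   db = MongoClient().pillreports
#   db.reports.insert_one(report)
```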

2. Cleaning the data
The usual stuff: renaming columns, cleaning up dates, geocoding, etc.
langdetect is a Python port of Google's Java language-detection library.
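The cleaning steps might look roughly like this in pandas; the column names and values here are hypothetical, and the langdetect call is shown only as a comment since it needs the third-party package:

```python
import pandas as pd

# Hypothetical raw frame with the kind of messy columns a scrape produces.
df = pd.DataFrame({
    "Date_Submitted": ["2014-01-05", "2014-02-11"],
    "Description": ["Yellow pill, bitter taste", "Pastilla amarilla"],
})

# Rename columns and parse dates into proper datetime values.
df = df.rename(columns={"Date_Submitted": "date",
                        "Description": "description"})
df["date"] = pd.to_datetime(df["date"])

# Language detection with langdetect (third-party, not run here):
#   from langdetect import detect
#   df["lang"] = df["description"].apply(detect)
print(df.dtypes)
```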

3. Visualization data exploration
Doing visualization in Python was a steep learning curve for me; it takes a while to get the data to the point where you can visualize it.
What is the breakdown between reports with a warning and those without one?
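One way to answer that question, sketched on a toy stand-in for the scraped reports (the Agg backend lets it run headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the scraped reports; real data came from MongoDB.
df = pd.DataFrame({"warning": ["yes", "no", "no", "no", "yes"]})
counts = df["warning"].value_counts()

# Bar chart of the warning breakdown.
counts.plot(kind="bar")
plt.title("Reports with and without a warning")
plt.savefig("warning_breakdown.png")
print(counts.to_dict())  # {'no': 3, 'yes': 2}
```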

4a. Classification
A very simple naive Bayes model using the description field as input and the warning label as the target.
Visualize the top predictors by their chi-square statistic.

Breakdown of chi scores.

Stop words get high chi scores because they appear so frequently.

Zooming in on the plot surfaces some domain knowledge.
Some of the words associated with warning = yes are taste, sour, etc.
4b. PCA and clustering
See if there is anything interesting in the user report field. I created a binary occurrence vector representation of stemmed words and normalized it with tf-idf, then made a scatter plot and checked for interesting patterns.
The last thing to do when looking at text data is the obligatory word cloud.
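The PCA step might be sketched as follows. `TfidfVectorizer(binary=True)` approximates the binary-occurrence, tf-idf-normalized representation described above (stemming is skipped here), and the reports are invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented user reports standing in for the scraped field.
reports = ["felt sick and dizzy all night",
           "very clean smooth high",
           "sick dizzy bad headache",
           "clean happy smooth experience"]

# Binary term occurrences, tf-idf weighted.
vec = TfidfVectorizer(binary=True)
X = vec.fit_transform(reports)

# Project to 2D and scatter-plot to look for clusters.
coords = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1])
plt.savefig("reports_pca.png")
print(coords.shape)  # (4, 2)
```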

Were you able to make any suggestions about the provenance of these drugs? When I started the project I was in the police force; later I moved to a different job but kept it going as a side project. You could perhaps investigate the images and try to group them.
Were you actually processing data with NLTK and pushing it into sklearn? How did you do it?
I didn't do any processing in NLTK.
Is it possible to use other features as predictors - size, shape, image, color?
In terms of classification, the presentation used the description field. When I added more features, it didn't have any impact.
I have heard about Scrapy. Why did you go with urllib2?
I am not familiar with Scrapy.
Video outline created using VideoJots.