- Data Science PyData
- Brian Carter: Lifecycle of Web Text Mining: Scrape to Sense
Click on text below to jump to the corresponding location in the video (except iOS)
My name is Brian Carter and I work as a data scientist at IBM. All the IPython notebooks are up on my GitHub account. Previously I worked for the Irish Police Force, and this presentation is an offshoot of a project I worked on there. The web scrape focuses on pillreports.net, an online database of reviews of Ecstasy pills. Review sites for experience goods can be seen as a bridge across the knowledge gap between what people think they are buying and what they actually experience. The question: was it possible to identify a pill that was producing undesirable effects?
Example of review
People put in a free-text description of the pill. There are two fields of interest: consumed and warning.
Example: Yellow Instagram
Outline
Scraping the data; cleaning the data; visualization and data exploration; classification; PCA and clustering.
1. Scraping the data
Used urllib2 and BeautifulSoup.
pillreports.net is a simple website.
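The fetch-and-parse step can be sketched roughly as below. This is a minimal sketch, not the talk's actual code: the talk used Python 2's urllib2, while this uses Python 3's urllib.request, and the two-column table layout parsed here is an assumption about pillreports.net's report pages, illustrated on an inline HTML snippet.

```python
from urllib.request import urlopen  # the talk used Python 2's urllib2
from bs4 import BeautifulSoup

def fetch_page(url):
    """Download one report page (network access required)."""
    with urlopen(url) as resp:
        return resp.read()

def parse_report(html):
    """Pull field name/value pairs out of a two-column report table.
    The table structure is a guess at the site's layout."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:
            key = cells[0].get_text(strip=True).rstrip(":")
            fields[key] = cells[1].get_text(strip=True)
    return fields

# Inline snippet standing in for a downloaded report page.
sample = """
<table>
  <tr><td>Name:</td><td>Yellow Instagram</td></tr>
  <tr><td>Warning:</td><td>yes</td></tr>
</table>
"""
print(parse_report(sample))  # → {'Name': 'Yellow Instagram', 'Warning': 'yes'}
```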
I set up a MongoDB database with two collections: the reports and the actual comments.
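The two-collection layout might look like the sketch below. The field names and the `pillreports` database name are my own invention, not the talk's, and the `store` helper assumes a local `mongod` instance is running.

```python
# Example documents for the two collections (field names are invented).
report_doc = {
    "url": "<report url>",
    "name": "Yellow Instagram",
    "warning": "yes",
    "description": "<free-text description>",
}
comment_doc = {"report_url": "<report url>", "text": "<comment text>"}

def store(report, comments):
    """Insert one scraped report and its comments.
    Assumes a local MongoDB instance; import is deferred so the
    sketch loads even without pymongo installed."""
    from pymongo import MongoClient
    db = MongoClient()["pillreports"]
    db.reports.insert_one(report)
    if comments:
        db.comments.insert_many(comments)
```

Keeping reports and comments in separate collections mirrors the page structure: one report can have many comments, and the comment documents point back to their report's URL.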
2. Cleaning the data
The usual stuff: renaming columns, cleaning up dates, geocoding, etc.
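The column-renaming and date-cleaning steps are standard pandas; a minimal sketch on a toy frame (the column names and date format here are assumptions, not the site's real schema):

```python
import pandas as pd

# Toy frame standing in for the scraped reports.
df = pd.DataFrame({
    "Date Submitted": ["March 1, 2014", "April 2, 2014"],
    "Warning": ["yes", "no"],
})

# Rename columns to something easier to work with.
df = df.rename(columns={"Date Submitted": "date", "Warning": "warning"})

# Parse the date strings into proper datetimes.
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y")
print(df.dtypes)
```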
langdetect is a Python port of a Google Java library for language detection.
3. Visualization / data exploration
Doing visualization in Python was a steep learning curve for me. It takes a bit longer to get to the point where you can visualize.
What is the breakdown between the number of reports with a warning and those without one?
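A breakdown like this is a one-bar-chart job; a sketch with made-up labels in place of the scraped warning field:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy warning labels standing in for the real scraped field.
warnings = pd.Series(["yes", "no", "no", "no", "yes"])
counts = warnings.value_counts()

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xlabel("warning")
ax.set_ylabel("number of reports")
fig.savefig("warning_breakdown.png")
print(counts.to_dict())  # → {'no': 3, 'yes': 2}
```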
4a. Classification
A very simple Naive Bayes model using the description field, with the warning label as the target.
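With scikit-learn this pipeline is short; a minimal sketch on an invented four-document training set (the real model was trained on the scraped descriptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: description text -> warning label.
texts = [
    "came on fast, very sour taste, felt sick",
    "clean high, no side effects",
    "bitter taste, headache and nausea afterwards",
    "smooth experience, good mood all night",
]
labels = ["yes", "no", "yes", "no"]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["sour taste and nausea"]))  # → ['yes']
```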
Visualize the top predictors based on their chi-square statistic
Breakdown of the chi-square scores: stop words get high scores because they occur so frequently.
A zoomed-in plot gives insight into domain knowledge: some of the words associated with warning = yes are "taste", "sour", etc.
4b. PCA and clustering
See if there is anything interesting in the user report field. I created a binary occurrence vector representation of stemmed words and normalized it with tf-idf, then created a scatter plot and checked for interesting patterns.
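A rough sketch of that pipeline, with two departures from the talk: stemming is skipped, and `TruncatedSVD` stands in for PCA since it works directly on the sparse tf-idf matrix. The four report texts are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

reports = [  # toy user-report texts
    "sour taste came up fast strong visuals",
    "smooth clean high danced all night",
    "felt sick sour chemical taste",
    "great night clean buzz",
]

# binary=True keeps occurrence (0/1) rather than raw counts
# before the tf-idf weighting is applied.
X = TfidfVectorizer(binary=True).fit_transform(reports)

# Project to 2-D for a scatter plot; TruncatedSVD plays the
# role of PCA on the sparse matrix.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Cluster the reports in the tf-idf space.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(coords.shape, clusters)
```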
The last thing when looking at text data is the obligatory word cloud.
Questions
Were you able to make any suggestions about the provenance of these drugs?
When I started the project I was in the police force, but I later moved on to a different job and continued it as a fun project. You could perhaps investigate the images and try to group them.
Were you actually processing data with NLTK and pushing it into sklearn? How did you do it?
I didn't do any processing in NLTK.
Is it possible to use other features as predictors: size, shape, image, color?
For classification, the presentation used the description field. When I fed in more features it didn't have any impact.
I have heard about Scrapy. Why did you go with urllib?
I am not familiar with Scrapy.