Beyond the tutorial: Text Classification with sklearn

The goal of this article is to show how you can go beyond the simple pipeline presented in the sklearn tutorial. I assume you have read and coded the tutorial. I start by reviewing the code, recalling how it works and pointing out a few problems you might encounter when you want to develop a more complex pipeline to process text data.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories, shuffle=True,
                                  random_state=42)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

twenty_test = fetch_20newsgroups(subset='test',
                                 categories=categories, shuffle=True,
                                 random_state=42)
predicted = text_clf.predict(twenty_test.data)
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

           avg / total       0.88      0.83      0.84      1502

For now, we will ignore the part about parameter tuning using grid search and focus on the pipeline itself. We can identify four steps:

  1. Fetch the data to be used by the classifier.
  2. Build a pipeline that will process the data.
  3. Train a model with the pipeline.
  4. Test and evaluate the pipeline.

These steps are fairly generic when working with text data. What will change depending on your task is mostly the first and second steps. In the first step, you may need to rework and adapt your data a little so that it fits the format required by the pipeline, that is, an array of documents.

The real work happens in the pipeline definition. In this simple example, the following happens:

  1. CountVectorizer tokenizes your text (converts a raw string into an array of tokens) and counts the words in it. For each document, the result is a vector of numbers, each corresponding to the count of one word in the vocabulary.
  2. TfidfTransformer rescales these raw counts with the tf-idf formula.
  3. Finally, a Naive Bayes classifier uses those vectors to predict the class of each document.
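To make the first two steps concrete, here is a minimal sketch that runs them by hand on a toy corpus (the two example sentences are my own, not part of the tutorial data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat",
        "the dog ate my homework"]

# Step 1: tokenize and count -- one row per document,
# one column per distinct word in the vocabulary.
vect = CountVectorizer()
counts = vect.fit_transform(docs)
print(counts.shape)  # (2, 9): 2 documents, 9 distinct words

# Step 2: rescale the raw counts with the tf-idf formula;
# the shape is unchanged, only the values are reweighted.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)   # (2, 9)
```

The matrices are sparse, which is what lets this approach scale to vocabularies of tens of thousands of words.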

Each step uses the information (vector of data) produced by the previous step. That means the initial information is lost: by the time we reach the TfidfTransformer, we no longer have the text documents themselves, only the count matrix.
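One way to see exactly what a given step receives is to slice the pipeline (supported in scikit-learn 0.21 and later): everything before the final estimator can be used as a transformer on its own. A small sketch on a toy corpus of my own:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

docs = ["the cat sat on the mat",
        "the dog ate my homework"]
labels = [0, 1]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
pipe.fit(docs, labels)

# pipe[:-1] drops the final classifier; transforming with it shows
# the tf-idf matrix the classifier actually receives -- no trace of
# the original strings remains at this point.
X = pipe[:-1].transform(docs)
print(X.shape)  # (2, 9)
```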

But what happens if, for example, you want to add new features based on the text data instead of refining the existing vectors? For this, you'll need a special kind of pipeline called FeatureUnion. Let's say we want to use both the frequency and the tf-idf of words. Here is how we can do it.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories, shuffle=True,
                                  random_state=42)

text_clf = Pipeline([
    ('text_features', FeatureUnion(
        [('vect', CountVectorizer()),
         ('tfidf', TfidfVectorizer())])),
    ('clf', MultinomialNB())])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

twenty_test = fetch_20newsgroups(subset='test',
                                 categories=categories, shuffle=True,
                                 random_state=42)
predicted = text_clf.predict(twenty_test.data)
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.93      0.87      0.90       319
         comp.graphics       0.96      0.94      0.95       389
               sci.med       0.96      0.91      0.93       396
soc.religion.christian       0.88      0.97      0.92       398

           avg / total       0.93      0.93      0.93      1502

Using FeatureUnion is quite simple. Each of the steps inside the FeatureUnion receives the same input, and the output of each step is combined into one feature space that is fed to the next step, here the classifier. As we can see, combining both features increases the performance of our classifier.
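The combination can be seen directly in the shape of the transformed data. A small sketch on a toy corpus (my own), where each sub-transformer produces 9 columns:

```python
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework"]

union = FeatureUnion([('vect', CountVectorizer()),
                      ('tfidf', TfidfVectorizer())])

# Both transformers receive the same raw documents; their outputs
# are stacked side by side into one feature matrix.
X = union.fit_transform(docs)
print(X.shape)  # (2, 18): 9 count columns + 9 tf-idf columns
```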

Instead of a single feature extractor, a step inside a FeatureUnion can be a pipeline as well. Here is a pipeline strictly equivalent to the previous one:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories, shuffle=True,
                                  random_state=42)

text_clf = Pipeline([
    ('text_features', FeatureUnion(
        [('vect', CountVectorizer()),
         ('tfidf', Pipeline([('count', CountVectorizer()),
                             ('tfidf', TfidfTransformer())]))])),
    ('clf', MultinomialNB())])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

twenty_test = fetch_20newsgroups(subset='test',
                                 categories=categories, shuffle=True,
                                 random_state=42)
predicted = text_clf.predict(twenty_test.data)
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
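FeatureUnion becomes really useful once one of its branches computes something that is not derived from word counts at all. As an illustration (the class name TextLengthExtractor and the toy data below are my own, not part of the tutorial), a custom transformer only needs fit and transform methods to slot into the union alongside the standard vectorizers:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """One feature per document: its length in characters."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # FeatureUnion expects a 2-D array: one row per document.
        return np.array([[len(doc)] for doc in X], dtype=float)

docs = ["god religion faith",
        "pixel rendering graphics card"]
labels = [0, 1]

text_clf = Pipeline([
    ('features', FeatureUnion(
        [('tfidf', TfidfVectorizer()),
         ('length', TextLengthExtractor())])),
    ('clf', MultinomialNB())])

text_clf.fit(docs, labels)
print(text_clf.predict(docs).shape)  # one prediction per document
```

Note that MultinomialNB expects non-negative features, which both branches produce here; a custom feature with negative values would call for a different classifier.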
