Beyond the tutorial: Text Classification with sklearn
The goal of this article is to show how you can go beyond the simple pipeline presented in the sklearn tutorial. I assume you've read and coded the tutorial. I start by reviewing the code, recalling how it works, and pointing out a few problems you might encounter when you want to develop a more complex pipeline to process text data.
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
predicted = text_clf.predict(twenty_test.data)
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
```
```
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

           avg / total       0.88      0.83      0.84      1502
```
For now, we will ignore the part about parameter tuning using grid search and focus on the pipeline itself. We can identify four steps:
- Fetch the data to be used by the classifier.
- Build a pipeline that will process the data.
- Train a model with the pipeline.
- Test and evaluate the pipeline.
These steps are fairly generic when working with text data. What will change depending on your task is mostly the first and second steps. In the first step, you may need to rework and adapt your data a little so that it fits the format required by the pipeline, that is, an array of documents.
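For instance, if your own corpus lives on disk as one text file per document, grouped into one folder per class, `sklearn.datasets.load_files` returns it in exactly this format. Here is a minimal sketch, assuming a hypothetical `my_corpus/` directory:

```python
from sklearn.datasets import load_files

# Hypothetical layout on disk: my_corpus/<class_name>/<document>.txt
corpus = load_files('my_corpus/', encoding='utf-8')

# corpus.data is the array of raw documents the pipeline expects,
# corpus.target the matching array of class ids.
print(len(corpus.data), corpus.target_names)
```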
The real deal happens in the pipeline definition. In this simple example, the following happens:
- `CountVectorizer` tokenizes your text (converts a raw string into an array of tokens) and counts the words in it. For each document, the result is a vector of numbers, each corresponding to the number of times a word was observed.
- `TfidfTransformer` converts these raw counts using the tf-idf formula.
- Finally, a Naive Bayes classifier uses the resulting vectors to predict the class of each document.
Each step uses the information (a vector of data) produced by the previous step. That means the initial information is lost: by the time we reach the `TfidfTransformer`, we do not have the text documents anymore. The small sketch below makes this data flow concrete.
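Here is a minimal sketch (the toy corpus is made up) that runs the first two steps by hand:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog barked at the cat"]  # toy corpus

counts = CountVectorizer().fit_transform(docs)    # raw strings -> word counts
tfidf = TfidfTransformer().fit_transform(counts)  # word counts -> tf-idf weights

# By this stage we only hold a sparse matrix of shape
# (n_documents, n_vocabulary); the original strings are gone
# as far as downstream steps are concerned.
print(type(tfidf), tfidf.shape)
```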
But what happens if, for example, you want to add new features based on the raw text rather than just refining the existing vectors?
For this, you'll need to use a special pipeline called `FeatureUnion`.
Let's say we want to use both the raw frequency and the tf-idf of words. Here is how we do it.
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

text_clf = Pipeline([
    ('text features', FeatureUnion([('vect', CountVectorizer()),
                                    ('tfidf', TfidfVectorizer())])),
    ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
predicted = text_clf.predict(twenty_test.data)
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
```
```
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.87      0.90       319
         comp.graphics       0.96      0.94      0.95       389
               sci.med       0.96      0.91      0.93       396
soc.religion.christian       0.88      0.97      0.92       398

           avg / total       0.93      0.93      0.93      1502
```
Using `FeatureUnion` is quite simple. Each step inside the `FeatureUnion` receives the same input, and the outputs of all steps are combined into one feature space that is fed to the next step, here the classifier.
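This also answers the question above: since every branch of the union receives the raw documents, one branch can compute entirely new features from the text. Here is a minimal sketch, assuming a hypothetical `TextStats` transformer that emits document length as a single extra feature:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

class TextStats(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: document length as one numeric feature."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per document, one column holding its length in characters.
        return np.array([[len(doc)] for doc in X])

text_clf = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer()),  # receives the raw documents
        ('stats', TextStats()),        # receives the same raw documents
    ])),
    ('clf', MultinomialNB())])
```

Because `FeatureUnion` simply stacks the outputs side by side, the classifier sees the tf-idf columns and the length column as one feature matrix.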
As we can see, combining both features increases the performance of our classifier.
A step inside a `FeatureUnion` does not have to be a single transformer; it can be a pipeline as well. Here is a pipeline strictly equivalent to the previous one:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

text_clf = Pipeline([
    ('text features', FeatureUnion([
        ('vect', CountVectorizer()),
        ('tfidf', Pipeline([('count', CountVectorizer()),
                            ('tfidf', TfidfTransformer())]))])),
    ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
predicted = text_clf.predict(twenty_test.data)
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
```