case example 4 - N-grams and a complex pipeline

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('TrainingData.csv', index_col=0)
df.shape
Out[1]:
(400277, 25)
In [11]:
from time import time

text preprocessing

● NLP tricks for text data

● Tokenize on punctuation to avoid hyphens, underscores, etc.

● Include unigrams and bi-grams in the model to capture important information involving multiple tokens - e.g., ‘middle school’

vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 2))

N-gram range in scikit-learn

In order to look for n-gram relationships at multiple scales, you will use the ngram_range parameter.
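As a quick illustration (not part of the original notebook), fitting such a vectorizer on a single made-up string shows how both unigrams and bigrams are captured; the token pattern is the one used throughout this section:

from sklearn.feature_extraction.text import CountVectorizer

# alphanumeric tokens that are followed by whitespace
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 2))
vec.fit(['middle school teacher supplies '])  # trailing space so the last token matches the lookahead
print(sorted(vec.vocabulary_))
# ['middle', 'middle school', 'school', 'school teacher',
#  'supplies', 'teacher', 'teacher supplies']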

Special functions: You'll notice a couple of new steps provided in the pipeline in this and many of the remaining exercises. Specifically, the dim_red step following the vectorizer step, and the scale step preceding the clf (classification) step.

To make sure the models perform as the expert competition winner intended, we have to apply a dimensionality reduction technique, which is what the dim_red step does, and we have to scale the features to lie between -1 and 1, which is what the scale step does.

The dim_red step uses a scikit-learn function called SelectKBest(), applying something called the chi-squared test to select the K "best" features. The scale step uses a scikit-learn function called MaxAbsScaler() in order to squash the relevant features into the interval -1 to 1.

You won't need to do anything extra with these functions here; just complete the vectorizing pipeline steps below. However, notice how easy it is to add more processing steps to the pipeline!
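As a standalone illustration on toy data (not the budget dataset; the array values below are made up), here is how SelectKBest with the chi-squared test and MaxAbsScaler behave:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MaxAbsScaler

X_toy = np.array([[1, 0, 3],
                  [0, 2, 1],
                  [4, 0, 2],
                  [0, 3, 0]])
y_toy = np.array([1, 0, 1, 0])

# keep the 2 features with the highest chi-squared score with respect to y_toy
X_kbest = SelectKBest(chi2, k=2).fit_transform(X_toy, y_toy)

# divide each remaining column by its maximum absolute value, so values lie in [-1, 1]
X_scaled = MaxAbsScaler().fit_transform(X_kbest)
print(X_scaled)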

In [16]:
NUMERIC_COLUMNS = ['Total','FTE']
LABELS = ['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type','Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status']

NON_LABELS = [c for c in df.columns if c not in LABELS]
NON_LABELS
Out[16]:
['Object_Description',
 'Text_2',
 'SubFund_Description',
 'Job_Title_Description',
 'Text_3',
 'Text_4',
 'Sub_Object_Description',
 'Location_Description',
 'FTE',
 'Function_Description',
 'Facility_or_Department',
 'Position_Extra',
 'Total',
 'Program_Description',
 'Fund_Description',
 'Text_1']
In [6]:
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.
        
        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # replace nans with blanks
    text_data.fillna("", inplace=True)
    
    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)
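A quick sanity check on a toy frame (illustrative only; the column names mimic the dataset) shows what combine_text_columns returns: numeric columns are dropped, NaNs become empty strings, and the remaining text columns are joined with spaces.

import numpy as np

toy = pd.DataFrame([['Supplies', 'Teacher', 1.0, 100.0],
                    [np.nan, 'Bus Driver', 0.5, 200.0]],
                   columns=['Object_Description', 'Job_Title_Description', 'FTE', 'Total'])
print(combine_text_columns(toy))
# row 0 -> 'Supplies Teacher'
# row 1 -> ' Bus Driver'  (the NaN became an empty string, leaving a leading space)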
In [31]:
from sklearn.linear_model import SGDClassifier


# Import pipeline
from sklearn.pipeline import Pipeline

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Import other preprocessing modules
from sklearn.preprocessing import Imputer
from sklearn.feature_selection import chi2, SelectKBest

# Select 300 best features
chi_k = 300

# Import functional utilities
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.pipeline import FeatureUnion

# Perform preprocessing
get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

text_fillna = FunctionTransformer(lambda x: x.fillna('No class type'), validate=False)

# Instantiate pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('fillna', text_fillna),
                    # include unigrams and bigrams
                    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(SGDClassifier()))
    ])
In [32]:
pl
Out[32]:
Pipeline(steps=[('union', FeatureUnion(n_jobs=1,
       transformer_list=[('numeric_features', Pipeline(steps=[('selector', FunctionTransformer(accept_sparse=False,
          func=<function <lambda> at 0x0000000035236C18>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y=False, validate=Fa...r_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False),
          n_jobs=1))])
In [33]:
import numpy as np
dummy_labels = pd.get_dummies(df[LABELS])
dummy_labels.shape
Out[33]:
(400277, 104)
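get_dummies binarizes each of the 9 categorical label columns into one indicator column per category value, which is how 9 label columns become 104 binary columns. A toy example (the category values here are made up, not the real label categories):

toy_labels = pd.DataFrame({'Function': ['Teaching', 'Transportation'],
                           'Pre_K': ['NO LABEL', 'Non PreK']})
print(pd.get_dummies(toy_labels))
#    Function_Teaching  Function_Transportation  Pre_K_NO LABEL  Pre_K_Non PreK
# 0                  1                        0               1               0
# 1                  0                        1               0               1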
In [34]:
from multi_split import multilabel_train_test_split
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS], dummy_labels, 0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[34]:
((320222, 16), (80055, 16), (320222, 104), (80055, 104))
In [43]:
start = time()
# Fit to the training data
pl.fit(X_train, y_train)
print('used: {:.2f}s'.format(time() - start))

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)

Adding interaction features with scikit-learn

In [36]:
from sklearn.preprocessing import PolynomialFeatures
In [37]:
interaction = PolynomialFeatures(degree=2,
                                 interaction_only=True,
                                 include_bias=False)
In [39]:
# PolynomialFeatures?

● Bias term allows model to have non-zero y value when x value is zero
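A small illustration (made-up numbers) of what interaction_only=True and include_bias=False produce for two input features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_toy = np.array([[2.0, 3.0],
                  [1.0, 5.0]])
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(interaction.fit_transform(X_toy))
# columns are x1, x2, x1*x2 (no squared terms, no bias column)
# [[ 2.  3.  6.]
#  [ 1.  5.  5.]]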

Implementing the hashing trick in scikit-learn

HashingVectorizer acts just like CountVectorizer in that it can accept token_pattern and ngram_range parameters.

The important difference is that it creates hash values from the text, so that we get all the computational advantages of hashing!

Learning from the expert: hashing trick

● Adding new features may cause enormous increase in array size

● Hashing is a way of increasing memory efficiency

● e.g., each token in 'PETRO VEND FUEL AND FLUIDS' is mapped to a hash value such as 2954, 9384, 4569, 1197, 8947

● Hash function limits possible outputs, fixing array size

When to use the hashing trick

● Want to make array of features as small as possible

● Dimensionality reduction

● Particularly useful on large datasets

● e.g., lots of text data!

In [41]:
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(norm=None,
                        non_negative=True,
                        token_pattern=TOKENS_ALPHANUMERIC,
                        ngram_range=(1, 2))
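To see the fixed output size (illustrative; the second string is made up), transform a couple of documents with the vec defined above. HashingVectorizer stores no vocabulary; every document is hashed into the same fixed number of columns.

hashed = vec.fit_transform(['PETRO VEND FUEL AND FLUIDS ',
                            'GENERAL SUPPLIES '])  # trailing spaces so the lookahead in TOKENS_ALPHANUMERIC matches the last token
print(hashed.shape)
# (2, 1048576): n_features defaults to 2**20, no matter how many distinct tokens appear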

the winning model

In [1]:
# Import the hashing vectorizer
from sklearn.feature_extraction.text import HashingVectorizer

# Instantiate the winning model pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                     non_negative=True, norm=None, binary=False,
                                                     ngram_range=(1, 2))),
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree=2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])
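Note that SparseInteractions is not a scikit-learn class: it is a custom transformer from the competition winner's solution that adds interaction terms (like PolynomialFeatures) while keeping the feature matrix sparse. A minimal sketch of such a transformer, written here from scratch for degree=2 only as an assumption rather than the winner's exact code, could look like this:

from itertools import combinations
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

class SparseInteractions(BaseEstimator, TransformerMixin):
    """Append column-pair products to a sparse matrix (degree=2 sketch)."""

    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = sparse.csc_matrix(X)
        new_cols = []
        # one new column per pair of original columns: x_i * x_j
        for i, j in combinations(range(X.shape[1]), 2):
            new_cols.append(X[:, i].multiply(X[:, j]))
        return sparse.hstack([X] + new_cols).tocsr()

A fuller implementation would also handle higher degrees and avoid appending all-zero interaction columns, but the idea is the same: compute products of feature pairs without ever densifying the matrix.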