Case example 2 - model exploration and NLP

Splitting the multi-class dataset

● Recall: Train-test split

● Will not work here

● May end up with labels in the test set that never appear in the training set

● Solution: StratifiedShuffleSplit

● Only works with a single target variable

● We have many target variables

● multilabel_train_test_split()

https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/data/multilabel.py

multilabel_sample() is defined at the end of this page; the GitHub source is linked above.
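
For contrast, here is a minimal sketch (not from the original notebook) of StratifiedShuffleSplit on a single target, using the sklearn.model_selection API (0.18+):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)  # one target column: stratification works

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
# each 8/2 split preserves the 50/50 class balance; with 104 dummy label
# columns there is no single y to stratify on, hence multilabel_train_test_split()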

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from time import time
%matplotlib inline

df = pd.read_csv('TrainingData.csv', index_col=0)
df.shape
Out[2]:
(400277, 25)
In [9]:
# Collect the names of the non-object (i.e. numeric) columns
num_col = []
for i,j in zip(df.dtypes,df): 
    if i !='object':
        num_col.append(j)
num_col
Out[9]:
['FTE', 'Total']
In [18]:
df[num_col].isnull().sum()
Out[18]:
FTE      274206
Total      4555
dtype: int64
In [19]:
# Fill missing values with an out-of-range sentinel so missingness stays visible
numeric_data_only = df[num_col].fillna(-1000)
numeric_data_only.isnull().sum()
Out[19]:
FTE      0
Total    0
dtype: int64
In [21]:
type(numeric_data_only)
Out[21]:
pandas.core.frame.DataFrame

Dummy variables

In [22]:
LABELS = ['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']
In [24]:
df[LABELS].shape
Out[24]:
(400277, 9)
In [25]:
# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])
label_dummies.shape
Out[25]:
(400277, 104)
In [26]:
label_dummies.head(3)
Out[26]:
Function_Aides Compensation Function_Career & Academic Counseling Function_Communications Function_Curriculum Development Function_Data Processing & Information Services Function_Development & Fundraising Function_Enrichment Function_Extended Time & Tutoring Function_Facilities & Maintenance Function_Facilities Planning ... Object_Type_Rent/Utilities Object_Type_Substitute Compensation Object_Type_Supplies/Materials Object_Type_Travel & Conferences Pre_K_NO_LABEL Pre_K_Non PreK Pre_K_PreK Operating_Status_Non-Operating Operating_Status_Operating, Not PreK-12 Operating_Status_PreK-12 Operating
134338 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
206341 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
326408 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0

3 rows × 104 columns

In [29]:
# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2, 
                                                               seed=123)
X_train.shape, y_train.shape, X_test.shape,y_test.shape
Out[29]:
((320222, 2), (320222, 104), (80055, 2), (80055, 104))
In [32]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Data columns (total 2 columns):
FTE      320222 non-null float64
Total    320222 non-null float64
dtypes: float64(2)
memory usage: 7.3 MB
In [31]:
y_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: float64(104)
memory usage: 256.5 MB

Training the model

In [56]:
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

from sklearn.metrics import accuracy_score
In [35]:
# Instantiate the classifier: clf
# OneVsRestClassifier fits one independent binary LogisticRegression per label column
clf = OneVsRestClassifier(LogisticRegression())
clf
Out[35]:
OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1)
In [52]:
start = time()
clf.fit(X_train, y_train)
print 'Used {:.2f}s'.format(time()-start)
Used 193.80s

Predicting

● If .predict() was used instead:

● Output would be 0 or 1

● Log loss penalizes being confident and wrong (illustrated below)

● Worse performance compared to .predict_proba()
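
A minimal illustration (not from the original notebook) of how log loss punishes confident wrong answers, using sklearn.metrics.log_loss:

from sklearn.metrics import log_loss

y_true = [1, 0]
print(log_loss(y_true, [0.9, 0.1]))    # confident and right:  ~0.105
print(log_loss(y_true, [0.5, 0.5]))    # hedged:               ~0.693
print(log_loss(y_true, [0.01, 0.99]))  # confident and wrong:  ~4.605

Hard 0/1 output from .predict() behaves like the last case whenever the model is wrong, which is why .predict_proba() scores better under log loss.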

In [53]:
test_pred = clf.predict_proba(X_test)
In [64]:
test_pred.shape
Out[64]:
(80055L, 104L)
In [71]:
# Format predictions in DataFrame: prediction_df
# (prefix_sep='---' keeps the label name clearly separated from the category value)
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS],prefix_sep='---').columns,
                             index=X_test.index,
                             data=test_pred)
In [77]:
prediction_df.head()
Out[77]:
Function---Aides Compensation Function---Career & Academic Counseling Function---Communications Function---Curriculum Development Function---Data Processing & Information Services Function---Development & Fundraising Function---Enrichment Function---Extended Time & Tutoring Function---Facilities & Maintenance Function---Facilities Planning ... Object_Type---Rent/Utilities Object_Type---Substitute Compensation Object_Type---Supplies/Materials Object_Type---Travel & Conferences Pre_K---NO_LABEL Pre_K---Non PreK Pre_K---PreK Operating_Status---Non-Operating Operating_Status---Operating, Not PreK-12 Operating_Status---PreK-12 Operating
206341 0.035848 0.006466 0.000830 0.023919 0.008916 0.000173 0.032078 0.024406 0.052102 0.000048 ... 0.010728 0.036952 0.116162 0.017361 0.831233 0.141041 0.027751 0.169607 0.019930 0.810552
275539 0.035883 0.006465 0.000830 0.023923 0.008915 0.000173 0.032085 0.024410 0.052119 0.000048 ... 0.010726 0.037619 0.116387 0.017363 0.831181 0.141101 0.027762 0.169575 0.019932 0.810608
330504 0.035886 0.006465 0.000830 0.023923 0.008915 0.000173 0.032085 0.024411 0.052120 0.000048 ... 0.010726 0.037663 0.116402 0.017364 0.831177 0.141105 0.027763 0.169573 0.019932 0.810612
18698 0.035885 0.006465 0.000830 0.023923 0.008915 0.000173 0.032085 0.024410 0.052120 0.000048 ... 0.010726 0.037647 0.116397 0.017364 0.831178 0.141103 0.027763 0.169574 0.019932 0.810610
291539 0.122566 0.009044 0.001548 0.028673 0.016024 0.018099 0.043998 0.031789 0.114389 0.017294 ... 0.005603 0.175223 0.139555 0.016081 0.500542 0.473756 0.099110 0.095670 0.051075 0.928683

5 rows × 104 columns

NLP

Representing text numerically

● Bag-of-words

● Simple way to represent text in machine learning

● Discards information about grammar and word order

● Computes frequency of occurrence

Scikit-learn tools for bag-of-words

● CountVectorizer()

● Tokenizes all the strings

● Builds a ‘vocabulary’

● Counts the occurrences of each token in the vocabulary (toy example below)
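
A toy example (not from the original notebook) showing all three steps on a two-document corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['red red blue', 'blue green']
vec = CountVectorizer()           # default pattern tokenizes on word characters
counts = vec.fit_transform(docs)  # tokenize, build vocabulary, count

print(vec.get_feature_names())    # ['blue', 'green', 'red']
print(counts.toarray())           # [[1 0 2]
                                  #  [1 1 0]]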

In [78]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: TOKENS_ALPHANUMERIC
# (one or more alphanumeric characters, with a lookahead that requires trailing
# whitespace -- a token at the very end of a string will not match)
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
In [88]:
df.Position_Extra.head()
Out[88]:
134338                 KINDERGARTEN 
206341                  UNDESIGNATED
326408                       TEACHER
364634    PROFESSIONAL-INSTRUCTIONAL
47683     PROFESSIONAL-INSTRUCTIONAL
Name: Position_Extra, dtype: object
In [94]:
df.Position_Extra.nunique()
Out[94]:
581
In [83]:
df.Position_Extra.isnull().sum()
Out[83]:
135513
In [89]:
# Fill missing values in df.Position_Extra
df.Position_Extra.fillna('', inplace=True)
In [91]:
# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)
vec_alphanumeric
Out[91]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='[A-Za-z0-9]+(?=\\s+)',
        tokenizer=None, vocabulary=None)
In [92]:
# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)
Out[92]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='[A-Za-z0-9]+(?=\\s+)',
        tokenizer=None, vocabulary=None)
In [93]:
# Print the number of tokens and first 15 tokens
msg = "There are {} tokens in Position_Extra if we split on non-alphanumeric characters"

print(msg.format(len(vec_alphanumeric.get_feature_names())))
print(vec_alphanumeric.get_feature_names()[:15])
There are 385 tokens in Position_Extra if we split on non-alphanumeric characters
[u'1st', u'2nd', u'3rd', u'4th', u'56', u'5th', u'9th', u'a', u'ab', u'accountability', u'adaptive', u'addit', u'additional', u'adm', u'admin']

Combining text columns for tokenization

  • Use combine_text_columns to convert all training text data in your DataFrame to a single vector that can be passed to the vectorizer object and turned into a bag-of-words with the .fit_transform() method; a sketch of this helper follows.
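
A sketch of what combine_text_columns might look like, reconstructed from the description above; it reuses num_col and LABELS defined earlier, and the course's actual helper may differ in details:

def combine_text_columns(data_frame, to_drop=num_col + LABELS):
    """ Converts all text columns in each row of data_frame
        to a single space-separated string. """
    # drop non-text columns that are present in this DataFrame
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    # replace NaNs with empty strings
    text_data.fillna('', inplace=True)
    # join all text items in a row, separated by spaces
    return text_data.apply(lambda x: " ".join(x), axis=1)

The exploratory cells below (set intersection, ' '.join, and the sum-over-axis checks) test exactly the pieces this helper is built from.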
In [118]:
df[['Use','Sharing','Reporting']].head(3)
Out[118]:
Use Sharing Reporting
134338 Instruction School Reported School
206341 NO_LABEL NO_LABEL NO_LABEL
326408 Instruction School Reported School
In [120]:
text_data = df[['Use','Sharing','Reporting']]
# Replace NaNs with blanks (text_data is a slice of df, so pandas may warn here)
text_data.fillna("", inplace=True)
In [125]:
text_data.head(5)
Out[125]:
Use Sharing Reporting
134338 Instruction School Reported School
206341 NO_LABEL NO_LABEL NO_LABEL
326408 Instruction School Reported School
364634 Instruction School Reported School
47683 Instruction School Reported School
In [124]:
# Join all text items in a row, with a space in between
text_data.apply(lambda x: " ".join(x), axis=1).head()
Out[124]:
134338    Instruction School Reported School
206341            NO_LABEL NO_LABEL NO_LABEL
326408    Instruction School Reported School
364634    Instruction School Reported School
47683     Instruction School Reported School
dtype: object
In [128]:
text_combined = text_data.apply(lambda x: " ".join(x), axis=1)
In [100]:
# set intersection, as used for to_drop in combine_text_columns
set((1,2,3)) & set((2,3,4))
Out[100]:
{2, 3}
In [102]:
# str.join concatenates list items with the separator in between
' '.join(['4','7'])
Out[102]:
'4 7'
In [112]:
# axis convention check: sum(0) collapses rows, sum(1) collapses columns
ttt = np.arange(18).reshape(3,6)
ttt
Out[112]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])
In [113]:
ttt.sum(0)
Out[113]:
array([18, 21, 24, 27, 30, 33])
In [114]:
ttt.sum(1)
Out[114]:
array([15, 51, 87])
In [133]:
text_combined.head()
Out[133]:
134338    Instruction School Reported School
206341            NO_LABEL NO_LABEL NO_LABEL
326408    Instruction School Reported School
364634    Instruction School Reported School
47683     Instruction School Reported School
dtype: object
In [129]:
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the basic token pattern
TOKENS_BASIC = '\\S+(?=\\s+)'

# Create the alphanumeric token pattern
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate basic CountVectorizer: vec_basic
vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)

# Instantiate alphanumeric CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Create the text vector
text_vector = text_combined

# Fit and transform vec_basic
vec_basic.fit_transform(text_vector)

# Print number of tokens of vec_basic
print("There are {} tokens in the dataset".format(len(vec_basic.get_feature_names())))

# Fit and transform vec_alphanumeric
vec_alphanumeric.fit_transform(text_vector)

# Print number of tokens of vec_alphanumeric
print("There are {} alpha-numeric tokens in the dataset".format(len(vec_alphanumeric.get_feature_names())))
There are 20 tokens in the dataset
There are 19 alpha-numeric tokens in the dataset
In [135]:
print vec_basic.get_feature_names()
[u'&', u'budget', u'budgets', u'business', u'central', u'enrichment', u'instruction', u'ispd', u'leadership', u'management', u'no_label', u'o&m', u'on', u'pupil', u'reported', u'school', u'services', u'set-aside', u'shared', u'untracked']
In [136]:
print vec_alphanumeric.get_feature_names()
[u'aside', u'budget', u'budgets', u'business', u'central', u'enrichment', u'instruction', u'ispd', u'label', u'leadership', u'm', u'management', u'on', u'pupil', u'reported', u'school', u'services', u'shared', u'untracked']
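
Comparing the two vocabularies: the basic pattern keeps '&', 'no_label', 'o&m' and 'set-aside' whole, while the alphanumeric pattern only matches alphanumeric runs immediately followed by whitespace, so only the trailing pieces ('label', 'm', 'aside') survive and the leading pieces ('no', 'o', 'set') are dropped along with '&'.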

multilabel_sample

In [1]:
import numpy as np
import pandas as pd
from warnings import warn  # warn() is used below when `size` is too small

def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if size =< 1.
        The sample is guaranteed to have > `min_count` of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    # note: np.random.randint(1) always returns 0, so the unseeded fallback is deterministic
    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices,
                                   size=sample_count,
                                   replace=False)

    return np.concatenate([sample_idxs, remaining_sampled])


def multilabel_sample_dataframe(df, labels, size, min_count=5, seed=None):
    """ Takes a dataframe `df` and returns a sample of size `size` where all
        classes in the binary matrix `labels` are represented at
        least `min_count` times.
    """
    idxs = multilabel_sample(labels, size=size, min_count=min_count, seed=seed)
    return df.loc[idxs]


def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    train_set_idxs = np.setdiff1d(index, test_set_idxs)

    test_set_mask = index.isin(test_set_idxs)
    train_set_mask = ~test_set_mask

    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])