Case example 2 - model exploration and NLP

Splitting the multi-class dataset

● Recall: Train-test split

● Will not work here

● May end up with labels in the test set that never appear in the training set

● Solution: StratifiedShuffleSplit

● Only works with a single target variable

● We have many target variables

● multilabel_train_test_split()

https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/data/multilabel.py

multilabel_sample() is defined at the end of this page; the GitHub source is linked above.
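
For contrast, here is a minimal sketch (not from the original notebook) of StratifiedShuffleSplit on a single target, using the sklearn.model_selection API (0.18+):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)  # one target column: stratification works

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
# each 8/2 split preserves the 50/50 class balance; with 104 dummy label
# columns there is no single y to stratify on, hence multilabel_train_test_split()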

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from time import time
%matplotlib inline

df = pd.read_csv('TrainingData.csv', index_col=0)
df.shape
Out[2]:
(400277, 25)
In [9]:
# Collect the names of the non-object (i.e. numeric) columns
num_col = []
for i,j in zip(df.dtypes,df): 
    if i !='object':
        num_col.append(j)
num_col
Out[9]:
['FTE', 'Total']
In [18]:
df[num_col].isnull().sum()
Out[18]:
FTE      274206
Total      4555
dtype: int64
In [19]:
# Fill missing values with an out-of-range sentinel so missingness stays visible
numeric_data_only = df[num_col].fillna(-1000)
numeric_data_only.isnull().sum()
Out[19]:
FTE      0
Total    0
dtype: int64
In [21]:
type(numeric_data_only)
Out[21]:
pandas.core.frame.DataFrame

Dummy variables

In [22]:
LABELS = ['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']
In [24]:
df[LABELS].shape
Out[24]:
(400277, 9)
In [25]:
# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])
label_dummies.shape
Out[25]:
(400277, 104)
In [26]:
label_dummies.head(3)
Out[26]:
Function_Aides Compensation Function_Career & Academic Counseling Function_Communications Function_Curriculum Development Function_Data Processing & Information Services Function_Development & Fundraising Function_Enrichment Function_Extended Time & Tutoring Function_Facilities & Maintenance Function_Facilities Planning ... Object_Type_Rent/Utilities Object_Type_Substitute Compensation Object_Type_Supplies/Materials Object_Type_Travel & Conferences Pre_K_NO_LABEL Pre_K_Non PreK Pre_K_PreK Operating_Status_Non-Operating Operating_Status_Operating, Not PreK-12 Operating_Status_PreK-12 Operating
134338 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
206341 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
326408 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0

3 rows × 104 columns

In [29]:
# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2, 
                                                               seed=123)
X_train.shape, y_train.shape, X_test.shape,y_test.shape
Out[29]:
((320222, 2), (320222, 104), (80055, 2), (80055, 104))
In [32]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Data columns (total 2 columns):
FTE      320222 non-null float64
Total    320222 non-null float64
dtypes: float64(2)
memory usage: 7.3 MB
In [31]:
y_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: float64(104)
memory usage: 256.5 MB

Training the model

In [56]:
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

from sklearn.metrics import accuracy_score
In [35]:
# Instantiate the classifier: clf
# OneVsRestClassifier fits one independent binary LogisticRegression per label column
clf = OneVsRestClassifier(LogisticRegression())
clf
Out[35]:
OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1)
In [52]:
start = time()
clf.fit(X_train, y_train)
print 'Used {:.2f}s'.format(time()-start)
Used 193.80s

Predicting

● If .predict() was used instead:

● Output would be 0 or 1

● Log loss penalizes being confident and wrong (illustrated below)

● Worse performance compared to .predict_proba()
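
A minimal illustration (not from the original notebook) of how log loss punishes confident wrong answers, using sklearn.metrics.log_loss:

from sklearn.metrics import log_loss

y_true = [1, 0]
print(log_loss(y_true, [0.9, 0.1]))    # confident and right:  ~0.105
print(log_loss(y_true, [0.5, 0.5]))    # hedged:               ~0.693
print(log_loss(y_true, [0.01, 0.99]))  # confident and wrong:  ~4.605

Hard 0/1 output from .predict() behaves like the last case whenever the model is wrong, which is why .predict_proba() scores better under log loss.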

In [53]:
test_pred = clf.predict_proba(X_test)
In [64]:
test_pred.shape
Out[64]:
(80055L, 104L)
In [71]:
# Format predictions in DataFrame: prediction_df
# (prefix_sep='---' keeps the label name clearly separated from the category value)
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS],prefix_sep='---').columns,
                             index=X_test.index,
                             data=test_pred)
In [77]:
prediction_df.head()
Out[77]:
Function---Aides Compensation Function---Career & Academic Counseling Function---Communications Function---Curriculum Development Function---Data Processing & Information Services Function---Development & Fundraising Function---Enrichment Function---Extended Time & Tutoring Function---Facilities & Maintenance Function---Facilities Planning ... Object_Type---Rent/Utilities Object_Type---Substitute Compensation Object_Type---Supplies/Materials Object_Type---Travel & Conferences Pre_K---NO_LABEL Pre_K---Non PreK Pre_K---PreK Operating_Status---Non-Operating Operating_Status---Operating, Not PreK-12 Operating_Status---PreK-12 Operating
206341 0.035848 0.006466 0.000830 0.023919 0.008916 0.000173 0.032078 0.024406 0.052102 0.000048 ... 0.010728 0.036952 0.116162 0.017361 0.831233 0.141041 0.027751 0.169607 0.019930 0.810552
275539 0.035883 0.006465 0.000830 0.023923 0.008915 0.000173 0.032085 0.024410 0.052119 0.000048 ... 0.010726 0.037619 0.116387 0.017363 0.831181 0.141101 0.027762 0.169575 0.019932 0.810608
330504 0.035886 0.006465 0.000830 0.023923 0.008915 0.000173 0.032085 0.024411 0.052120 0.000048 ... 0.010726 0.037663 0.116402 0.017364 0.831177 0.141105 0.027763 0.169573 0.019932 0.810612
18698 0.035885 0.006465 0.000830 0.023923 0.008915 0.000173 0.032085 0.024410 0.052120 0.000048 ... 0.010726 0.037647 0.116397 0.017364 0.831178 0.141103 0.027763 0.169574 0.019932 0.810610
291539 0.122566 0.009044 0.001548 0.028673 0.016024 0.018099 0.043998 0.031789 0.114389 0.017294 ... 0.005603 0.175223 0.139555 0.016081 0.500542 0.473756 0.099110 0.095670 0.051075 0.928683

5 rows × 104 columns

NLP

Representing text numerically

● Bag-of-words

● Simple way to represent text in machine learning

● Discards information about grammar and word order

● Computes frequency of occurrence

Scikit-learn tools for bag-of-words

● CountVectorizer()

● Tokenizes all the strings

● Builds a ‘vocabulary’

● Counts the occurrences of each token in the vocabulary (toy example below)
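
A toy example (not from the original notebook) showing all three steps on a two-document corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['red red blue', 'blue green']
vec = CountVectorizer()           # default pattern tokenizes on word characters
counts = vec.fit_transform(docs)  # tokenize, build vocabulary, count

print(vec.get_feature_names())    # ['blue', 'green', 'red']
print(counts.toarray())           # [[1 0 2]
                                  #  [1 1 0]]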

In [78]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: TOKENS_ALPHANUMERIC
# (one or more alphanumeric characters, with a lookahead that requires trailing
# whitespace -- a token at the very end of a string will not match)
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
In [88]:
df.Position_Extra.head()
Out[88]:
134338                 KINDERGARTEN 
206341                  UNDESIGNATED
326408                       TEACHER
364634    PROFESSIONAL-INSTRUCTIONAL
47683     PROFESSIONAL-INSTRUCTIONAL
Name: Position_Extra, dtype: object
In [94]:
df.Position_Extra.nunique()
Out[94]:
581
In [83]:
df.Position_Extra.isnull().sum()
Out[83]:
135513
In [89]:
# Fill missing values in df.Position_Extra
df.Position_Extra.fillna('', inplace=True)
In [91]:
# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)
vec_alphanumeric
Out[91]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='[A-Za-z0-9]+(?=\\s+)',
        tokenizer=None, vocabulary=None)
In [92]:
# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)
Out[92]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='[A-Za-z0-9]+(?=\\s+)',
        tokenizer=None, vocabulary=None)
In [93]:
# Print the number of tokens and first 15 tokens
msg = "There are {} tokens in Position_Extra if we split on non-alphanumeric characters"

print(msg.format(len(vec_alphanumeric.get_feature_names())))
print(vec_alphanumeric.get_feature_names()[:15])
There are 385 tokens in Position_Extra if we split on non-alphanumeric characters
[u'1st', u'2nd', u'3rd', u'4th', u'56', u'5th', u'9th', u'a', u'ab', u'accountability', u'adaptive', u'addit', u'additional', u'adm', u'admin']

Combining text columns for tokenization

  • Use combine_text_columns to convert all training text data in your DataFrame to a single vector that can be passed to the vectorizer object and turned into a bag-of-words with the .fit_transform() method; a sketch of this helper follows.
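
A sketch of what combine_text_columns might look like, reconstructed from the description above; it reuses num_col and LABELS defined earlier, and the course's actual helper may differ in details:

def combine_text_columns(data_frame, to_drop=num_col + LABELS):
    """ Converts all text columns in each row of data_frame
        to a single space-separated string. """
    # drop non-text columns that are present in this DataFrame
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    # replace NaNs with empty strings
    text_data.fillna('', inplace=True)
    # join all text items in a row, separated by spaces
    return text_data.apply(lambda x: " ".join(x), axis=1)

The exploratory cells below (set intersection, ' '.join, and the sum-over-axis checks) test exactly the pieces this helper is built from.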
In [118]:
df[['Use','Sharing','Reporting']].head(3)
Out[118]:
Use Sharing Reporting
134338 Instruction School Reported School
206341 NO_LABEL NO_LABEL NO_LABEL
326408 Instruction School Reported School
In [120]:
text_data = df[['Use','Sharing','Reporting']]
# Replace NaNs with blanks (text_data is a slice of df, so pandas may warn here)
text_data.fillna("", inplace=True)
In [125]:
text_data.head(5)
Out[125]:
Use Sharing Reporting
134338 Instruction School Reported School
206341 NO_LABEL NO_LABEL NO_LABEL
326408 Instruction School Reported School
364634 Instruction School Reported School
47683 Instruction School Reported School
In [124]:
# Join all text items in a row, with a space in between
text_data.apply(lambda x: " ".join(x), axis=1).head()
Out[124]:
134338    Instruction School Reported School
206341            NO_LABEL NO_LABEL NO_LABEL
326408    Instruction School Reported School
364634    Instruction School Reported School
47683     Instruction School Reported School
dtype: object
In [128]:
text_combined = text_data.apply(lambda x: " ".join(x), axis=1)
In [100]:
# set intersection, as used for to_drop in combine_text_columns
set((1,2,3)) & set((2,3,4))
Out[100]:
{2, 3}
In [102]:
# str.join concatenates list items with the separator in between
' '.join(['4','7'])
Out[102]:
'4 7'
In [112]:
# axis convention check: sum(0) collapses rows, sum(1) collapses columns
ttt = np.arange(18).reshape(3,6)
ttt
Out[112]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])
In [113]:
ttt.sum(0)
Out[113]:
array([18, 21, 24, 27, 30, 33])
In [114]:
ttt.sum(1)
Out[114]:
array([15, 51, 87])
In [133]:
text_combined.head()
Out[133]:
134338    Instruction School Reported School
206341            NO_LABEL NO_LABEL NO_LABEL
326408    Instruction School Reported School
364634    Instruction School Reported School
47683     Instruction School Reported School
dtype: object
In [129]:
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the basic token pattern
TOKENS_BASIC = '\\S+(?=\\s+)'

# Create the alphanumeric token pattern
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate basic CountVectorizer: vec_basic
vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)

# Instantiate alphanumeric CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Create the text vector
text_vector = text_combined

# Fit and transform vec_basic
vec_basic.fit_transform(text_vector)

# Print number of tokens of vec_basic
print("There are {} tokens in the dataset".format(len(vec_basic.get_feature_names())))

# Fit and transform vec_alphanumeric
vec_alphanumeric.fit_transform(text_vector)

# Print number of tokens of vec_alphanumeric
print("There are {} alpha-numeric tokens in the dataset".format(len(vec_alphanumeric.get_feature_names())))
There are 20 tokens in the dataset
There are 19 alpha-numeric tokens in the dataset
In [135]:
print vec_basic.get_feature_names()
[u'&', u'budget', u'budgets', u'business', u'central', u'enrichment', u'instruction', u'ispd', u'leadership', u'management', u'no_label', u'o&m', u'on', u'pupil', u'reported', u'school', u'services', u'set-aside', u'shared', u'untracked']
In [136]:
print vec_alphanumeric.get_feature_names()
[u'aside', u'budget', u'budgets', u'business', u'central', u'enrichment', u'instruction', u'ispd', u'label', u'leadership', u'm', u'management', u'on', u'pupil', u'reported', u'school', u'services', u'shared', u'untracked']
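
Comparing the two vocabularies: the basic pattern keeps '&', 'no_label', 'o&m' and 'set-aside' whole, while the alphanumeric pattern only matches alphanumeric runs immediately followed by whitespace, so only the trailing pieces ('label', 'm', 'aside') survive and the leading pieces ('no', 'o', 'set') are dropped along with '&'.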

multilabel_sample

In [1]:
import numpy as np
import pandas as pd
from warnings import warn  # warn() is used below when `size` is too small

def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if size =< 1.
        The sample is guaranteed to have > `min_count` of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    # note: np.random.randint(1) always returns 0, so the unseeded fallback is deterministic
    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices,
                                   size=sample_count,
                                   replace=False)

    return np.concatenate([sample_idxs, remaining_sampled])


def multilabel_sample_dataframe(df, labels, size, min_count=5, seed=None):
    """ Takes a dataframe `df` and returns a sample of size `size` where all
        classes in the binary matrix `labels` are represented at
        least `min_count` times.
    """
    idxs = multilabel_sample(labels, size=size, min_count=min_count, seed=seed)
    return df.loc[idxs]


def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    train_set_idxs = np.setdiff1d(index, test_set_idxs)

    test_set_mask = index.isin(test_set_idxs)
    train_set_mask = ~test_set_mask

    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])