Manipulating DataFrames with pandas

  • Extracting, filtering, and transforming data from DataFrames
  • Advanced indexing with multiple levels
  • Tidying, rearranging and restructuring your data
  • Pivoting, melting, and stacking DataFrames
  • Identifying and splitting DataFrames by groups

Indexing and column rearrangement

In [1]:
# Import pandas
import pandas as pd

# Read in the file and parse the datetime index: df
df = pd.read_csv('tweets.csv', index_col='created_at', parse_dates=True)

# Create a separate dataframe with the columns ['entities', 'favorited', 'text']: results
results = df[['entities', 'favorited', 'text']]

# Print the first three rows of results
results.head(3)
Out[1]:
entities favorited text
created_at
2017-02-10 05:24:51 {u'user_mentions': [{u'indices': [3, 16], u'sc... False RT @MeckeringBoy: Does Pauline secretly wear h...
2017-02-10 05:24:57 {u'user_mentions': [{u'indices': [3, 15], u'sc... False RT @lordaedonis: if you a grown man afraid of ...
2017-02-10 05:25:38 {u'user_mentions': [{u'indices': [3, 19], u'sc... False RT @alaskantexanQCT: I'm not "White" but still...
In [2]:
df.index
Out[2]:
DatetimeIndex(['2017-02-10 05:24:51', '2017-02-10 05:24:57',
               '2017-02-10 05:25:38', '2017-02-10 05:25:42',
               '2017-02-10 05:25:48', '2017-02-10 05:25:53',
               '2017-02-10 05:25:58', '2017-02-10 05:26:12',
               '2017-02-10 05:26:27', '2017-02-10 05:26:30',
               ...
               '2017-02-10 04:27:03', '2017-02-10 04:27:18',
               '2017-02-10 04:27:31', '2017-02-10 04:27:34',
               '2017-02-10 04:27:36', '2017-02-10 04:27:39',
               '2017-02-10 04:27:41', '2017-02-10 04:27:54',
               '2017-02-10 04:28:25', '2017-02-10 04:28:36'],
              dtype='datetime64[ns]', name=u'created_at', length=615, freq=None)

Slicing and indexing a Series

  • .loc[] with a single slice (no comma) slices along the index (rows)
In [3]:
df.loc['02-10-17 05:24':'02-10-17 05:24']
Out[3]:
contributors coordinates display_text_range entities extended_entities extended_tweet favorite_count favorited filter_level geo ... quoted_status_id quoted_status_id_str retweet_count retweeted retweeted_status source text timestamp_ms truncated user
created_at
2017-02-10 05:24:51 NaN NaN NaN {u'user_mentions': [{u'indices': [3, 16], u'sc... NaN NaN 0 False low NaN ... 8.299017e+17 8.299017e+17 0 False {u'contributors': None, u'truncated': False, u... <a href="http://twitter.com/#!/download/ipad" ... RT @MeckeringBoy: Does Pauline secretly wear h... 1486704291616 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:24:57 NaN NaN NaN {u'user_mentions': [{u'indices': [3, 15], u'sc... NaN NaN 0 False low NaN ... NaN NaN 0 False {u'contributors': None, u'truncated': False, u... <a href="http://twitter.com/download/iphone" r... RT @lordaedonis: if you a grown man afraid of ... 1486704297304 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:24:00 NaN NaN [0, 5] {u'user_mentions': [], u'symbols': [], u'hasht... NaN NaN 0 False low NaN ... 8.299058e+17 8.299058e+17 0 False NaN <a href="http://twitter.com/download/iphone" r... Pussy https://t.co/c2Az0vCuGq 1486704240067 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:24:00 NaN NaN NaN {u'user_mentions': [{u'indices': [3, 19], u'sc... NaN NaN 0 False low NaN ... 8.298700e+17 8.298700e+17 0 False {u'contributors': None, u'truncated': True, u'... <a href="http://twitter.com/download/android" ... RT @alaskantexanQCT: I'm not "White" but still... 1486704240511 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:24:51 NaN NaN NaN {u'user_mentions': [{u'indices': [3, 16], u'sc... NaN NaN 0 False low NaN ... 8.299017e+17 8.299017e+17 0 False {u'contributors': None, u'truncated': False, u... <a href="http://twitter.com/#!/download/ipad" ... RT @MeckeringBoy: Does Pauline secretly wear h... 1486704291616 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:24:57 NaN NaN NaN {u'user_mentions': [{u'indices': [3, 15], u'sc... NaN NaN 0 False low NaN ... NaN NaN 0 False {u'contributors': None, u'truncated': False, u... <a href="http://twitter.com/download/iphone" r... RT @lordaedonis: if you a grown man afraid of ... 1486704297304 False {u'follow_request_sent': None, u'profile_use_b...

6 rows × 32 columns

A slice's third component (start:stop:step) sets the stride; a negative step reverses the row order

In [4]:
df.loc['02-10-17 05:24':'02-10-17 05:24':-2]
Out[4]:
contributors coordinates display_text_range entities extended_entities extended_tweet favorite_count favorited filter_level geo ... quoted_status_id quoted_status_id_str retweet_count retweeted retweeted_status source text timestamp_ms truncated user
created_at
2017-02-10 05:24:57 NaN NaN NaN {u'user_mentions': [{u'indices': [3, 15], u'sc... NaN NaN 0 False low NaN ... NaN NaN 0 False {u'contributors': None, u'truncated': False, u... <a href="http://twitter.com/download/iphone" r... RT @lordaedonis: if you a grown man afraid of ... 1486704297304 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:24:00 NaN NaN NaN {u'user_mentions': [{u'indices': [3, 19], u'sc... NaN NaN 0 False low NaN ... 8.298700e+17 8.298700e+17 0 False {u'contributors': None, u'truncated': True, u'... <a href="http://twitter.com/download/android" ... RT @alaskantexanQCT: I'm not "White" but still... 1486704240511 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:24:57 NaN NaN NaN {u'user_mentions': [{u'indices': [3, 15], u'sc... NaN NaN 0 False low NaN ... NaN NaN 0 False {u'contributors': None, u'truncated': False, u... <a href="http://twitter.com/download/iphone" r... RT @lordaedonis: if you a grown man afraid of ... 1486704297304 False {u'follow_request_sent': None, u'profile_use_b...

3 rows × 32 columns

In [5]:
df.iloc[1:7:2,4:14:3]
Out[5]:
extended_entities favorited id in_reply_to_status_id
created_at
2017-02-10 05:24:57 NaN False 829924091456950272 NaN
2017-02-10 05:25:42 NaN False 829924279940558850 NaN
2017-02-10 05:25:53 NaN False 829924328879771650 8.298568e+17
In [6]:
[df.entities]==[df['entities']]
Out[6]:
True
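Attribute access (`df.entities`) and bracket access (`df['entities']`) select the same column; the idiomatic way to verify that is `.equals()`. Note that attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. A minimal sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({'entities': [1, 2], 'text count': ['a', 'b']})

# Attribute and bracket access return the same column
print(df.entities.equals(df['entities']))  # True

# Column names with spaces (or that shadow DataFrame methods)
# require bracket access:
print(df['text count'].tolist())  # ['a', 'b']
```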
In [7]:
df['text'][2:7]
Out[7]:
created_at
2017-02-10 05:25:38    RT @alaskantexanQCT: I'm not "White" but still...
2017-02-10 05:25:42    We dont discriminate like Libs, divede by sex,...
2017-02-10 05:25:48    @TMattis7 @babsmarshall1 @Randazzoj @realDonal...
2017-02-10 05:25:53    @Trumpliwood @heifer109 she's just mad because...
2017-02-10 05:25:58    RT @lordaedonis: if you a grown man afraid of ...
Name: text, dtype: object

Using .loc[]

In [8]:
df.ix[3:7,'entities':'favorited']
Out[8]:
entities extended_entities extended_tweet favorite_count favorited
created_at
2017-02-10 05:25:42 {u'user_mentions': [], u'symbols': [], u'hasht... NaN NaN 0 False
2017-02-10 05:25:48 {u'user_mentions': [{u'indices': [0, 9], u'scr... NaN NaN 0 False
2017-02-10 05:25:53 {u'user_mentions': [{u'indices': [0, 12], u'sc... NaN NaN 0 False
2017-02-10 05:25:58 {u'user_mentions': [{u'indices': [3, 15], u'sc... NaN NaN 0 False
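The `.ix[]` indexer used in In [8] mixed positional row slicing with label-based column slicing; `.ix` was deprecated and later removed from pandas. A hypothetical equivalent on a stand-in frame, selecting rows positionally first and then columns by label:

```python
import pandas as pd

# Stand-in frame with a datetime index (column names hypothetical)
df = pd.DataFrame({'a': range(6), 'b': range(6), 'c': range(6)},
                  index=pd.date_range('2017-02-10', periods=6, freq='min'))

# Old: df.ix[3:5, 'a':'b']  (positional rows, label columns)
# New: positional rows via .iloc, then label columns via .loc
out = df.iloc[3:5].loc[:, 'a':'b']
print(out.shape)  # (2, 2)
```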
In [9]:
df.loc[:,'entities':'favorited'].head()
Out[9]:
entities extended_entities extended_tweet favorite_count favorited
created_at
2017-02-10 05:24:51 {u'user_mentions': [{u'indices': [3, 16], u'sc... NaN NaN 0 False
2017-02-10 05:24:57 {u'user_mentions': [{u'indices': [3, 15], u'sc... NaN NaN 0 False
2017-02-10 05:25:38 {u'user_mentions': [{u'indices': [3, 19], u'sc... NaN NaN 0 False
2017-02-10 05:25:42 {u'user_mentions': [], u'symbols': [], u'hasht... NaN NaN 0 False
2017-02-10 05:25:48 {u'user_mentions': [{u'indices': [0, 9], u'scr... NaN NaN 0 False
In [10]:
df.loc['2-10-17 05:20','entities':'favorited']
Out[10]:
entities extended_entities extended_tweet favorite_count favorited
created_at
2017-02-10 05:20:01 {u'user_mentions': [], u'symbols': [], u'hasht... NaN NaN 0 False
2017-02-10 05:20:20 {u'user_mentions': [{u'indices': [3, 19], u'sc... NaN NaN 0 False
2017-02-10 05:20:31 {u'user_mentions': [{u'indices': [3, 15], u'sc... NaN NaN 0 False
2017-02-10 05:20:57 {u'user_mentions': [{u'indices': [3, 19], u'sc... NaN NaN 0 False

Instead of a slice, you can pass a list of labels

Subselecting DataFrames with lists

In [12]:
df.loc['2-10-17 05:24',['entities','text','favorited']]
Out[12]:
entities text favorited
created_at
2017-02-10 05:24:51 {u'user_mentions': [{u'indices': [3, 16], u'sc... RT @MeckeringBoy: Does Pauline secretly wear h... False
2017-02-10 05:24:57 {u'user_mentions': [{u'indices': [3, 15], u'sc... RT @lordaedonis: if you a grown man afraid of ... False
2017-02-10 05:24:00 {u'user_mentions': [], u'symbols': [], u'hasht... Pussy https://t.co/c2Az0vCuGq False
2017-02-10 05:24:00 {u'user_mentions': [{u'indices': [3, 19], u'sc... RT @alaskantexanQCT: I'm not "White" but still... False
2017-02-10 05:24:51 {u'user_mentions': [{u'indices': [3, 16], u'sc... RT @MeckeringBoy: Does Pauline secretly wear h... False
2017-02-10 05:24:57 {u'user_mentions': [{u'indices': [3, 15], u'sc... RT @lordaedonis: if you a grown man afraid of ... False

Using .iloc[]

In [13]:
df.iloc[7:11,11:14]
Out[13]:
id_str in_reply_to_screen_name in_reply_to_status_id
created_at
2017-02-10 05:26:12 829924408428945408 juliahosack 8.298853e+17
2017-02-10 05:26:27 829924471314141184 NaN NaN
2017-02-10 05:26:30 829924483867697154 warier163 8.299240e+17
2017-02-10 05:26:35 829924504683974657 NaN NaN
In [14]:
df.iloc[[3,74,233],[11,14,7]]
Out[14]:
id_str in_reply_to_status_id_str favorited
created_at
2017-02-10 05:25:42 829924279940558850 NaN False
2017-02-10 05:51:15 829930712815919104 NaN False
2017-02-10 05:12:36 829920984362684416 NaN False

Comparison: Series vs. DataFrame

Single brackets select a Series; a list of labels (double brackets) returns a DataFrame

In [15]:
df['text'].head()
Out[15]:
created_at
2017-02-10 05:24:51    RT @MeckeringBoy: Does Pauline secretly wear h...
2017-02-10 05:24:57    RT @lordaedonis: if you a grown man afraid of ...
2017-02-10 05:25:38    RT @alaskantexanQCT: I'm not "White" but still...
2017-02-10 05:25:42    We dont discriminate like Libs, divede by sex,...
2017-02-10 05:25:48    @TMattis7 @babsmarshall1 @Randazzoj @realDonal...
Name: text, dtype: object
In [16]:
df[['text']].head()
Out[16]:
text
created_at
2017-02-10 05:24:51 RT @MeckeringBoy: Does Pauline secretly wear h...
2017-02-10 05:24:57 RT @lordaedonis: if you a grown man afraid of ...
2017-02-10 05:25:38 RT @alaskantexanQCT: I'm not "White" but still...
2017-02-10 05:25:42 We dont discriminate like Libs, divede by sex,...
2017-02-10 05:25:48 @TMattis7 @babsmarshall1 @Randazzoj @realDonal...
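The distinction can be confirmed by checking the types directly (a minimal sketch with a stand-in frame):

```python
import pandas as pd

# Hypothetical frame standing in for the tweets data
df = pd.DataFrame({'text': ['a', 'b'], 'favorited': [False, True]})

print(type(df['text']))    # single brackets -> pandas Series
print(type(df[['text']]))  # list of labels  -> pandas DataFrame
```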

Filtering DataFrames

Creating a Boolean Series

In [17]:
df['truncated']==1
Out[17]:
created_at
2017-02-10 05:24:51    False
2017-02-10 05:24:57    False
2017-02-10 05:25:38    False
2017-02-10 05:25:42    False
2017-02-10 05:25:48    False
2017-02-10 05:25:53    False
2017-02-10 05:25:58    False
2017-02-10 05:26:12    False
2017-02-10 05:26:27    False
2017-02-10 05:26:30    False
2017-02-10 05:26:35    False
2017-02-10 05:26:48    False
2017-02-10 05:27:56    False
2017-02-10 05:28:28    False
2017-02-10 05:28:28    False
2017-02-10 05:28:40    False
2017-02-10 05:28:55    False
2017-02-10 05:30:06    False
2017-02-10 05:30:18    False
2017-02-10 05:30:20    False
2017-02-10 05:30:53    False
2017-02-10 05:30:55     True
2017-02-10 05:31:41    False
2017-02-10 05:32:20    False
2017-02-10 05:32:23    False
2017-02-10 05:32:32    False
2017-02-10 05:34:11    False
2017-02-10 05:34:17    False
2017-02-10 05:36:07    False
2017-02-10 05:38:17     True
                       ...  
2017-02-10 04:23:54    False
2017-02-10 04:23:58    False
2017-02-10 04:24:06    False
2017-02-10 04:24:07    False
2017-02-10 04:24:17    False
2017-02-10 04:24:31    False
2017-02-10 04:24:32    False
2017-02-10 04:24:35    False
2017-02-10 04:24:48    False
2017-02-10 04:25:06    False
2017-02-10 04:25:09    False
2017-02-10 04:25:15    False
2017-02-10 04:25:23    False
2017-02-10 04:25:48    False
2017-02-10 04:25:50    False
2017-02-10 04:26:02    False
2017-02-10 04:26:13    False
2017-02-10 04:26:16     True
2017-02-10 04:26:23    False
2017-02-10 04:26:52    False
2017-02-10 04:27:03    False
2017-02-10 04:27:18    False
2017-02-10 04:27:31    False
2017-02-10 04:27:34    False
2017-02-10 04:27:36    False
2017-02-10 04:27:39    False
2017-02-10 04:27:41     True
2017-02-10 04:27:54    False
2017-02-10 04:28:25    False
2017-02-10 04:28:36    False
Name: truncated, dtype: bool
In [18]:
df[df['truncated']==1].head(3)
Out[18]:
contributors coordinates display_text_range entities extended_entities extended_tweet favorite_count favorited filter_level geo ... quoted_status_id quoted_status_id_str retweet_count retweeted retweeted_status source text timestamp_ms truncated user
created_at
2017-02-10 05:30:55 NaN NaN [0, 140] {u'user_mentions': [], u'symbols': [], u'hasht... NaN {u'display_text_range': [0, 130], u'entities':... 0 False low NaN ... 8.298905e+17 8.298905e+17 0 False NaN <a href="http://twitter.com/download/android" ... They're whole party is imploding. They're all ... 1486704655983 True {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:38:17 NaN NaN [0, 140] {u'user_mentions': [], u'symbols': [], u'hasht... NaN {u'display_text_range': [0, 139], u'entities':... 0 False low NaN ... 8.206567e+17 8.206567e+17 0 False NaN <a href="http://twitter.com" rel="nofollow">Tw... Just another spineless pussy, too afraid to st... 1486705097961 True {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:41:06 NaN NaN [0, 140] {u'user_mentions': [], u'symbols': [], u'hasht... NaN {u'display_text_range': [0, 136], u'entities':... 0 False low NaN ... 8.299271e+17 8.299271e+17 0 False NaN <a href="http://twitter.com" rel="nofollow">Tw... One swift move, and Trump proves he is both ig... 1486705266575 True {u'follow_request_sent': None, u'profile_use_b...

3 rows × 32 columns

Combining filters

&

|

In [35]:
df[(df['text'].str.len()>140) & (df['truncated']==1)].head(3)
Out[35]:
contributors coordinates display_text_range entities extended_entities extended_tweet favorite_count favorited filter_level geo ... quoted_status_id quoted_status_id_str retweet_count retweeted retweeted_status source text timestamp_ms truncated user
created_at
2017-02-10 05:30:55 NaN NaN [0, 140] NaN NaN {u'display_text_range': [0, 130], u'entities':... 0 False low NaN ... 8.298905e+17 8.298905e+17 0 False NaN <a href="http://twitter.com/download/android" ... They're whole party is imploding. They're all ... 1486704655983 True {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:38:17 NaN NaN [0, 140] NaN NaN {u'display_text_range': [0, 139], u'entities':... 0 False low NaN ... 8.206567e+17 8.206567e+17 0 False NaN <a href="http://twitter.com" rel="nofollow">Tw... Just another spineless pussy, too afraid to st... 1486705097961 True {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:41:06 NaN NaN [0, 140] NaN NaN {u'display_text_range': [0, 136], u'entities':... 0 False low NaN ... 8.299271e+17 8.299271e+17 0 False NaN <a href="http://twitter.com" rel="nofollow">Tw... One swift move, and Trump proves he is both ig... 1486705266575 True {u'follow_request_sent': None, u'profile_use_b...

3 rows × 32 columns

all()

any()

Select columns with all nonzeros

  • df.all() reports, per column, whether every value is truthy (nonzero and non-empty)
In [20]:
df.all()
Out[20]:
contributors                  True
coordinates                   True
display_text_range            True
entities                      True
extended_entities             True
extended_tweet                True
favorite_count               False
favorited                    False
filter_level                  True
geo                           True
id                            True
id_str                        True
in_reply_to_screen_name       True
in_reply_to_status_id         True
in_reply_to_status_id_str     True
in_reply_to_user_id           True
in_reply_to_user_id_str       True
is_quote_status              False
lang                          True
place                         True
possibly_sensitive           False
quoted_status                 True
quoted_status_id              True
quoted_status_id_str          True
retweet_count                False
retweeted                    False
retweeted_status              True
source                        True
text                          True
timestamp_ms                  True
truncated                    False
user                          True
dtype: bool

Select columns with any nonzeros

In [21]:
df.any()
Out[21]:
contributors                 False
coordinates                  False
display_text_range            True
entities                      True
extended_entities             True
extended_tweet                True
favorite_count               False
favorited                    False
filter_level                  True
geo                          False
id                            True
id_str                        True
in_reply_to_screen_name       True
in_reply_to_status_id         True
in_reply_to_status_id_str     True
in_reply_to_user_id           True
in_reply_to_user_id_str       True
is_quote_status               True
lang                          True
place                         True
possibly_sensitive            True
quoted_status                 True
quoted_status_id              True
quoted_status_id_str          True
retweet_count                False
retweeted                    False
retweeted_status              True
source                        True
text                          True
timestamp_ms                  True
truncated                     True
user                          True
dtype: bool

Select columns with NaNs

  • df.loc[:, df.isnull().any()] selects columns with at least one NaN; using .all() instead keeps only columns that are entirely NaN
In [22]:
df.loc[:, df.isnull().all()]
Out[22]:
contributors coordinates geo
created_at
2017-02-10 05:24:51 NaN NaN NaN
2017-02-10 05:24:57 NaN NaN NaN
2017-02-10 05:25:38 NaN NaN NaN
2017-02-10 05:25:42 NaN NaN NaN
2017-02-10 05:25:48 NaN NaN NaN
2017-02-10 05:25:53 NaN NaN NaN
2017-02-10 05:25:58 NaN NaN NaN
2017-02-10 05:26:12 NaN NaN NaN
2017-02-10 05:26:27 NaN NaN NaN
2017-02-10 05:26:30 NaN NaN NaN
2017-02-10 05:26:35 NaN NaN NaN
2017-02-10 05:26:48 NaN NaN NaN
2017-02-10 05:27:56 NaN NaN NaN
2017-02-10 05:28:28 NaN NaN NaN
2017-02-10 05:28:28 NaN NaN NaN
2017-02-10 05:28:40 NaN NaN NaN
2017-02-10 05:28:55 NaN NaN NaN
2017-02-10 05:30:06 NaN NaN NaN
2017-02-10 05:30:18 NaN NaN NaN
2017-02-10 05:30:20 NaN NaN NaN
2017-02-10 05:30:53 NaN NaN NaN
2017-02-10 05:30:55 NaN NaN NaN
2017-02-10 05:31:41 NaN NaN NaN
2017-02-10 05:32:20 NaN NaN NaN
2017-02-10 05:32:23 NaN NaN NaN
2017-02-10 05:32:32 NaN NaN NaN
2017-02-10 05:34:11 NaN NaN NaN
2017-02-10 05:34:17 NaN NaN NaN
2017-02-10 05:36:07 NaN NaN NaN
2017-02-10 05:38:17 NaN NaN NaN
... ... ... ...
2017-02-10 04:23:54 NaN NaN NaN
2017-02-10 04:23:58 NaN NaN NaN
2017-02-10 04:24:06 NaN NaN NaN
2017-02-10 04:24:07 NaN NaN NaN
2017-02-10 04:24:17 NaN NaN NaN
2017-02-10 04:24:31 NaN NaN NaN
2017-02-10 04:24:32 NaN NaN NaN
2017-02-10 04:24:35 NaN NaN NaN
2017-02-10 04:24:48 NaN NaN NaN
2017-02-10 04:25:06 NaN NaN NaN
2017-02-10 04:25:09 NaN NaN NaN
2017-02-10 04:25:15 NaN NaN NaN
2017-02-10 04:25:23 NaN NaN NaN
2017-02-10 04:25:48 NaN NaN NaN
2017-02-10 04:25:50 NaN NaN NaN
2017-02-10 04:26:02 NaN NaN NaN
2017-02-10 04:26:13 NaN NaN NaN
2017-02-10 04:26:16 NaN NaN NaN
2017-02-10 04:26:23 NaN NaN NaN
2017-02-10 04:26:52 NaN NaN NaN
2017-02-10 04:27:03 NaN NaN NaN
2017-02-10 04:27:18 NaN NaN NaN
2017-02-10 04:27:31 NaN NaN NaN
2017-02-10 04:27:34 NaN NaN NaN
2017-02-10 04:27:36 NaN NaN NaN
2017-02-10 04:27:39 NaN NaN NaN
2017-02-10 04:27:41 NaN NaN NaN
2017-02-10 04:27:54 NaN NaN NaN
2017-02-10 04:28:25 NaN NaN NaN
2017-02-10 04:28:36 NaN NaN NaN

615 rows × 3 columns
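The `isnull()`-based column selection can be sanity-checked on a toy frame (column names hypothetical):

```python
import numpy as np
import pandas as pd

# 'geo' is entirely NaN, 'place' is partially NaN, 'id' has no NaNs
df = pd.DataFrame({'id': [1, 2, 3],
                   'place': ['NY', np.nan, 'LA'],
                   'geo': [np.nan, np.nan, np.nan]})

any_nan = df.loc[:, df.isnull().any()]  # columns with at least one NaN
all_nan = df.loc[:, df.isnull().all()]  # columns that are entirely NaN

print(list(any_nan.columns))  # ['place', 'geo']
print(list(all_nan.columns))  # ['geo']
```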

In [51]:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 615 entries, 2017-02-10 05:24:51 to 2017-02-10 04:28:36
Data columns (total 32 columns):
contributors                 0 non-null float64
coordinates                  0 non-null float64
display_text_range           161 non-null object
entities                     0 non-null float64
extended_entities            4 non-null object
extended_tweet               32 non-null object
favorite_count               615 non-null int64
favorited                    615 non-null bool
filter_level                 615 non-null object
geo                          0 non-null float64
id                           615 non-null int64
id_str                       615 non-null int64
in_reply_to_screen_name      83 non-null object
in_reply_to_status_id        67 non-null float64
in_reply_to_status_id_str    67 non-null float64
in_reply_to_user_id          83 non-null float64
in_reply_to_user_id_str      83 non-null float64
is_quote_status              615 non-null bool
lang                         615 non-null object
place                        12 non-null object
possibly_sensitive           135 non-null object
quoted_status                281 non-null object
quoted_status_id             281 non-null float64
quoted_status_id_str         281 non-null float64
retweet_count                615 non-null int64
retweeted                    615 non-null bool
retweeted_status             405 non-null object
source                       615 non-null object
text                         615 non-null object
timestamp_ms                 615 non-null int64
truncated                    615 non-null bool
user                         615 non-null object
dtypes: bool(4), float64(10), int64(5), object(13)
memory usage: 141.7+ KB

Drop ROWS with NaN values

In [23]:
df.shape
Out[23]:
(615, 32)
In [24]:
df.dropna(how='all').shape
Out[24]:
(615, 32)
In [25]:
df.dropna(how='any').shape
Out[25]:
(0, 32)

Use the thresh=20 keyword argument to keep only rows that have at least 20 non-null values; with 32 columns, rows missing more than 12 values are dropped.

In [38]:
df.dropna(thresh=20).shape
Out[38]:
(102, 32)

Drop COLUMNS with NaN values

Pass axis=1 (or axis='columns') to apply the rule along columns

In [26]:
df.dropna(how='all',axis=1).shape
Out[26]:
(615, 29)
In [40]:
df.dropna(how='any',axis=1).shape
Out[40]:
(615, 14)

Use the thresh=100 keyword argument to keep only columns that have at least 100 non-null values.

In [50]:
df.dropna(how='any',axis='columns',thresh=100).shape
Out[50]:
(615, 20)
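The thresh semantics (keep rows or columns with at least that many non-null values) can be verified on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [np.nan, np.nan, 6],
                   'c': [7, 8, 9]})

# thresh=2 keeps rows with at least 2 non-null values:
# row 1 has only one, so it is dropped
print(df.dropna(thresh=2).shape)          # (2, 3)

# the same rule along columns: 'b' has only one non-null value
print(df.dropna(thresh=2, axis=1).shape)  # (3, 2)
```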

Assign NaN values

In [28]:
import numpy as np

df['entities'] = np.nan  # attribute assignment (df.entities = ...) also works for an existing column
In [29]:
df.head(3)
Out[29]:
contributors coordinates display_text_range entities extended_entities extended_tweet favorite_count favorited filter_level geo ... quoted_status_id quoted_status_id_str retweet_count retweeted retweeted_status source text timestamp_ms truncated user
created_at
2017-02-10 05:24:51 NaN NaN NaN NaN NaN NaN 0 False low NaN ... 8.299017e+17 8.299017e+17 0 False {u'contributors': None, u'truncated': False, u... <a href="http://twitter.com/#!/download/ipad" ... RT @MeckeringBoy: Does Pauline secretly wear h... 1486704291616 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:24:57 NaN NaN NaN NaN NaN NaN 0 False low NaN ... NaN NaN 0 False {u'contributors': None, u'truncated': False, u... <a href="http://twitter.com/download/iphone" r... RT @lordaedonis: if you a grown man afraid of ... 1486704297304 False {u'follow_request_sent': None, u'profile_use_b...
2017-02-10 05:25:38 NaN NaN NaN NaN NaN NaN 0 False low NaN ... 8.298700e+17 8.298700e+17 0 False {u'contributors': None, u'truncated': True, u'... <a href="http://twitter.com/download/iphone" r... RT @alaskantexanQCT: I'm not "White" but still... 1486704338285 False {u'follow_request_sent': None, u'profile_use_b...

3 rows × 32 columns

In [30]:
df.loc[df['text'].str.len()>144,'text']
Out[30]:
created_at
2017-02-10 05:32:20    RT @kimthecynic: Grab 'em by the pussy-bandet ...
2017-02-10 06:02:52    RT @alii_babygirl: Horny😘 #sex #horny #snapch...
2017-02-10 06:04:56    RT @alii_babygirl: This is perfect😍 #sex #hor...
2017-02-10 05:32:20    RT @kimthecynic: Grab 'em by the pussy-bandet ...
2017-02-10 06:02:52    RT @alii_babygirl: Horny😘 #sex #horny #snapch...
2017-02-10 06:04:56    RT @alii_babygirl: This is perfect😍 #sex #hor...
2017-02-10 03:29:45    #Hannity,if that was Hillary R Obama,yall wuld...
2017-02-10 03:34:52    Gotta support u on that my friend!  Should be ...
2017-02-10 04:08:06    omg its a giant commy liberal entitled pussy w...
2017-02-10 04:21:52    RT @Alanna_T7: If I add any more to this it'll...
Name: text, dtype: object
In [31]:
(df['text'].str.len()>144).sum()
Out[31]:
10

Transforming DataFrames

  • broadcasting
  • built-in vectorized methods
  • apply() on a DataFrame
In [54]:
test = df[['timestamp_ms']].head()
In [55]:
test
Out[55]:
timestamp_ms
created_at
2017-02-10 05:24:51 1486704291616
2017-02-10 05:24:57 1486704297304
2017-02-10 05:25:38 1486704338285
2017-02-10 05:25:42 1486704342242
2017-02-10 05:25:48 1486704348065
In [56]:
test.timestamp_ms.floordiv(7**11)
Out[56]:
created_at
2017-02-10 05:24:51    751
2017-02-10 05:24:57    751
2017-02-10 05:25:38    751
2017-02-10 05:25:42    751
2017-02-10 05:25:48    751
Name: timestamp_ms, dtype: int64
In [58]:
test['timestamp_ms'].apply(lambda x: x//7**11)
Out[58]:
created_at
2017-02-10 05:24:51    751
2017-02-10 05:24:57    751
2017-02-10 05:25:38    751
2017-02-10 05:25:42    751
2017-02-10 05:25:48    751
Name: timestamp_ms, dtype: int64
In [59]:
test.timestamp_ms.floordiv(7**11) == test['timestamp_ms'].apply(lambda x: x//7**11)
Out[59]:
created_at
2017-02-10 05:24:51    True
2017-02-10 05:24:57    True
2017-02-10 05:25:38    True
2017-02-10 05:25:42    True
2017-02-10 05:25:48    True
Name: timestamp_ms, dtype: bool

Use string methods on a DataFrame via .str

In [64]:
text = df[['text']].head(10)
text
Out[64]:
text
created_at
2017-02-10 05:24:51 RT @MeckeringBoy: Does Pauline secretly wear h...
2017-02-10 05:24:57 RT @lordaedonis: if you a grown man afraid of ...
2017-02-10 05:25:38 RT @alaskantexanQCT: I'm not "White" but still...
2017-02-10 05:25:42 We dont discriminate like Libs, divede by sex,...
2017-02-10 05:25:48 @TMattis7 @babsmarshall1 @Randazzoj @realDonal...
2017-02-10 05:25:53 @Trumpliwood @heifer109 she's just mad because...
2017-02-10 05:25:58 RT @lordaedonis: if you a grown man afraid of ...
2017-02-10 05:26:12 @juliahosack @Marina_Sirtis your the problem ...
2017-02-10 05:26:27 RT @alaskantexanQCT: I'm not "White" but still...
2017-02-10 05:26:30 @warier163 @StefanMolyneux From who? Obviously...
In [69]:
text['text'] = text['text'].str.upper()
text
Out[69]:
text
created_at
2017-02-10 05:24:51 RT @MECKERINGBOY: DOES PAULINE SECRETLY WEAR H...
2017-02-10 05:24:57 RT @LORDAEDONIS: IF YOU A GROWN MAN AFRAID OF ...
2017-02-10 05:25:38 RT @ALASKANTEXANQCT: I'M NOT "WHITE" BUT STILL...
2017-02-10 05:25:42 WE DONT DISCRIMINATE LIKE LIBS, DIVEDE BY SEX,...
2017-02-10 05:25:48 @TMATTIS7 @BABSMARSHALL1 @RANDAZZOJ @REALDONAL...
2017-02-10 05:25:53 @TRUMPLIWOOD @HEIFER109 SHE'S JUST MAD BECAUSE...
2017-02-10 05:25:58 RT @LORDAEDONIS: IF YOU A GROWN MAN AFRAID OF ...
2017-02-10 05:26:12 @JULIAHOSACK @MARINA_SIRTIS YOUR THE PROBLEM ...
2017-02-10 05:26:27 RT @ALASKANTEXANQCT: I'M NOT "WHITE" BUT STILL...
2017-02-10 05:26:30 @WARIER163 @STEFANMOLYNEUX FROM WHO? OBVIOUSLY...

For the index

  • an Index has no .apply() method
  • use the .map() method instead
  • example: use .map(str) to convert the index from DatetimeIndex to strings
In [86]:
text.index
Out[86]:
DatetimeIndex(['2017-02-10 05:24:51', '2017-02-10 05:24:57',
               '2017-02-10 05:25:38', '2017-02-10 05:25:42',
               '2017-02-10 05:25:48', '2017-02-10 05:25:53',
               '2017-02-10 05:25:58', '2017-02-10 05:26:12',
               '2017-02-10 05:26:27', '2017-02-10 05:26:30'],
              dtype='datetime64[ns]', name=u'created_at', freq=None)
In [85]:
text.index.map(str)
Out[85]:
array(['2017-02-10 05:24:51', '2017-02-10 05:24:57', '2017-02-10 05:25:38',
       '2017-02-10 05:25:42', '2017-02-10 05:25:48', '2017-02-10 05:25:53',
       '2017-02-10 05:25:58', '2017-02-10 05:26:12', '2017-02-10 05:26:27',
       '2017-02-10 05:26:30'], dtype=object)
In [81]:
text.index.weekday_name
Out[81]:
array(['Friday', 'Friday', 'Friday', 'Friday', 'Friday', 'Friday',
       'Friday', 'Friday', 'Friday', 'Friday'], dtype=object)
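Note: the weekday_name attribute shown above was deprecated and removed in newer pandas versions; the current equivalent is the .day_name() method:

```python
import pandas as pd

idx = pd.DatetimeIndex(['2017-02-10 05:24:51', '2017-02-10 05:24:57'])

# .day_name() replaces the removed .weekday_name attribute
print(idx.day_name().tolist())  # ['Friday', 'Friday']
```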

apply() on DataFrame with multiple columns

In [91]:
df[['timestamp_ms','quoted_status_id']].apply(lambda x: x%7).head()
Out[91]:
timestamp_ms quoted_status_id
created_at
2017-02-10 05:24:51 5 4.0
2017-02-10 05:24:57 2 NaN
2017-02-10 05:25:38 5 2.0
2017-02-10 05:25:42 0 4.0
2017-02-10 05:25:48 6 NaN

Using .map() with a dictionary

The .map() method is used to transform values according to a Python dictionary look-up.
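A minimal sketch of the dictionary look-up on a stand-in Series:

```python
import pandas as pd

s = pd.Series([True, False, True])
lookup = {True: 'truth', False: 'lie'}

# Each value is replaced by its entry in the dictionary
print(s.map(lookup).tolist())  # ['truth', 'lie', 'truth']
```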

In [93]:
pre = {True: 'truth', False: 'lie'}
In [99]:
df['truncated'].map(pre).head(30)
Out[99]:
created_at
2017-02-10 05:24:51      lie
2017-02-10 05:24:57      lie
2017-02-10 05:25:38      lie
2017-02-10 05:25:42      lie
2017-02-10 05:25:48      lie
2017-02-10 05:25:53      lie
2017-02-10 05:25:58      lie
2017-02-10 05:26:12      lie
2017-02-10 05:26:27      lie
2017-02-10 05:26:30      lie
2017-02-10 05:26:35      lie
2017-02-10 05:26:48      lie
2017-02-10 05:27:56      lie
2017-02-10 05:28:28      lie
2017-02-10 05:28:28      lie
2017-02-10 05:28:40      lie
2017-02-10 05:28:55      lie
2017-02-10 05:30:06      lie
2017-02-10 05:30:18      lie
2017-02-10 05:30:20      lie
2017-02-10 05:30:53      lie
2017-02-10 05:30:55    truth
2017-02-10 05:31:41      lie
2017-02-10 05:32:20      lie
2017-02-10 05:32:23      lie
2017-02-10 05:32:32      lie
2017-02-10 05:34:11      lie
2017-02-10 05:34:17      lie
2017-02-10 05:36:07      lie
2017-02-10 05:38:17    truth
Name: truncated, dtype: object

Using vectorized functions

When performance is paramount, avoid .apply() and .map():

  • those constructs run Python-level for-loops over the data stored in a pandas Series or DataFrame.
  • vectorized functions loop over the data at the speed of compiled code (C, Fortran, etc.)!
  • NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions, or UFuncs, in NumPy), and you can even write your own.
  • In statistics, the z-score is the number of standard deviations by which an observation lies above the mean; a negative z-score means the observation is below the mean.
  • Instead of using .apply() as in the earlier exercises, the zscore UFunc takes a pandas Series as input and returns a NumPy array, whose values you can assign to a new column in the DataFrame.
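A sketch of that pattern with scipy.stats.zscore, assuming SciPy is available (the column names are hypothetical):

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical numeric column standing in for real data
df = pd.DataFrame({'retweet_count': [0, 2, 5, 1, 12]})

# zscore accepts a pandas Series and returns a NumPy array,
# which can be assigned back as a new DataFrame column
df['retweet_z'] = zscore(df['retweet_count'])

# The resulting column has mean 0 and population standard deviation 1
print(df['retweet_z'].tolist())
```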
In [2]:
import pandas as pd

f = pd.read_csv('census.csv', chunksize=10000)
df2 = next(f)  # f.next() works only in Python 2
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3193 entries, 0 to 3192
Data columns (total 100 columns):
SUMLEV                   3193 non-null int64
REGION                   3193 non-null int64
DIVISION                 3193 non-null int64
STATE                    3193 non-null int64
COUNTY                   3193 non-null int64
STNAME                   3193 non-null object
CTYNAME                  3193 non-null object
CENSUS2010POP            3193 non-null int64
ESTIMATESBASE2010        3193 non-null int64
POPESTIMATE2010          3193 non-null int64
POPESTIMATE2011          3193 non-null int64
POPESTIMATE2012          3193 non-null int64
POPESTIMATE2013          3193 non-null int64
POPESTIMATE2014          3193 non-null int64
POPESTIMATE2015          3193 non-null int64
NPOPCHG_2010             3193 non-null int64
NPOPCHG_2011             3193 non-null int64
NPOPCHG_2012             3193 non-null int64
NPOPCHG_2013             3193 non-null int64
NPOPCHG_2014             3193 non-null int64
NPOPCHG_2015             3193 non-null int64
BIRTHS2010               3193 non-null int64
BIRTHS2011               3193 non-null int64
BIRTHS2012               3193 non-null int64
BIRTHS2013               3193 non-null int64
BIRTHS2014               3193 non-null int64
BIRTHS2015               3193 non-null int64
DEATHS2010               3193 non-null int64
DEATHS2011               3193 non-null int64
DEATHS2012               3193 non-null int64
DEATHS2013               3193 non-null int64
DEATHS2014               3193 non-null int64
DEATHS2015               3193 non-null int64
NATURALINC2010           3193 non-null int64
NATURALINC2011           3193 non-null int64
NATURALINC2012           3193 non-null int64
NATURALINC2013           3193 non-null int64
NATURALINC2014           3193 non-null int64
NATURALINC2015           3193 non-null int64
INTERNATIONALMIG2010     3193 non-null int64
INTERNATIONALMIG2011     3193 non-null int64
INTERNATIONALMIG2012     3193 non-null int64
INTERNATIONALMIG2013     3193 non-null int64
INTERNATIONALMIG2014     3193 non-null int64
INTERNATIONALMIG2015     3193 non-null int64
DOMESTICMIG2010          3193 non-null int64
DOMESTICMIG2011          3193 non-null int64
DOMESTICMIG2012          3193 non-null int64
DOMESTICMIG2013          3193 non-null int64
DOMESTICMIG2014          3193 non-null int64
DOMESTICMIG2015          3193 non-null int64
NETMIG2010               3193 non-null int64
NETMIG2011               3193 non-null int64
NETMIG2012               3193 non-null int64
NETMIG2013               3193 non-null int64
NETMIG2014               3193 non-null int64
NETMIG2015               3193 non-null int64
RESIDUAL2010             3193 non-null int64
RESIDUAL2011             3193 non-null int64
RESIDUAL2012             3193 non-null int64
RESIDUAL2013             3193 non-null int64
RESIDUAL2014             3193 non-null int64
RESIDUAL2015             3193 non-null int64
GQESTIMATESBASE2010      3193 non-null int64
GQESTIMATES2010          3193 non-null int64
GQESTIMATES2011          3193 non-null int64
GQESTIMATES2012          3193 non-null int64
GQESTIMATES2013          3193 non-null int64
GQESTIMATES2014          3193 non-null int64
GQESTIMATES2015          3193 non-null int64
RBIRTH2011               3193 non-null float64
RBIRTH2012               3193 non-null float64
RBIRTH2013               3193 non-null float64
RBIRTH2014               3193 non-null float64
RBIRTH2015               3193 non-null float64
RDEATH2011               3193 non-null float64
RDEATH2012               3193 non-null float64
RDEATH2013               3193 non-null float64
RDEATH2014               3193 non-null float64
RDEATH2015               3193 non-null float64
RNATURALINC2011          3193 non-null float64
RNATURALINC2012          3193 non-null float64
RNATURALINC2013          3193 non-null float64
RNATURALINC2014          3193 non-null float64
RNATURALINC2015          3193 non-null float64
RINTERNATIONALMIG2011    3193 non-null float64
RINTERNATIONALMIG2012    3193 non-null float64
RINTERNATIONALMIG2013    3193 non-null float64
RINTERNATIONALMIG2014    3193 non-null float64
RINTERNATIONALMIG2015    3193 non-null float64
RDOMESTICMIG2011         3193 non-null float64
RDOMESTICMIG2012         3193 non-null float64
RDOMESTICMIG2013         3193 non-null float64
RDOMESTICMIG2014         3193 non-null float64
RDOMESTICMIG2015         3193 non-null float64
RNETMIG2011              3193 non-null float64
RNETMIG2012              3193 non-null float64
RNETMIG2013              3193 non-null float64
RNETMIG2014              3193 non-null float64
RNETMIG2015              3193 non-null float64
dtypes: float64(30), int64(68), object(2)
memory usage: 2.4+ MB
In [3]:
# Keep only the two population columns: df3
df3 = df2[['CENSUS2010POP', 'POPESTIMATE2015']]
df3.head()
Out[3]:
CENSUS2010POP POPESTIMATE2015
0 4779736 4858979
1 54571 55347
2 182265 203709
3 27457 26489
4 22915 22583
In [27]:
# Import zscore from scipy.stats
from scipy.stats import zscore
import numpy as np

# Call zscore with df3['CENSUS2010POP'] as input: turnout_zscore
turnout_zscore = zscore(df3['CENSUS2010POP'])

# Print the type of turnout_zscore
print(type(turnout_zscore))

# Assign turnout_zscore to a new column: df3['turnout_zscore']
df3['turnout_zscore'] = turnout_zscore

# Print the output of df3.head()
print(df3.head())
<type 'numpy.ndarray'>
   CENSUS2010POP  POPESTIMATE2015  turnout_zscore
0        4779736          4858979        3.899899
1          54571            55347       -0.118039
2         182265           203709       -0.009457
3          27457            26489       -0.141095
4          22915            22583       -0.144957
C:\Program Files\Anaconda2\lib\site-packages\ipykernel\__main__.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
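The SettingWithCopyWarning above appears because df3 was built by selecting columns from df2, so pandas cannot tell whether it is a view or an independent copy. A minimal sketch of the usual fix (toy data, hypothetical column values), taking an explicit .copy():

```python
import pandas as pd

df2 = pd.DataFrame({'CENSUS2010POP': [4779736, 54571],
                    'POPESTIMATE2015': [4858979, 55347]})

# An explicit copy makes df3 independent of df2, so assigning a
# new column no longer triggers SettingWithCopyWarning
df3 = df2[['CENSUS2010POP', 'POPESTIMATE2015']].copy()
df3['turnout_zscore'] = 0.0

print(df3.columns.tolist())
```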
In [8]:
df3.head()
Out[8]:
CENSUS2010POP POPESTIMATE2015 turnout_zscore
0 4779736 4858979 3.899899
1 54571 55347 -0.118039
2 182265 203709 -0.009457
3 27457 26489 -0.141095
4 22915 22583 -0.144957
In [25]:
%%timeit
turnout_zscore = zscore(df3['CENSUS2010POP'])
10000 loops, best of 3: 49.4 µs per loop
In [33]:
%%timeit
a = df3['CENSUS2010POP'].std()
10000 loops, best of 3: 24.9 µs per loop
In [34]:
%%timeit
b = np.std(df3['CENSUS2010POP'])
10000 loops, best of 3: 26.7 µs per loop
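One caveat when comparing the two timings above: Series.std() and np.std() do not compute the same statistic by default, so the comparison is not quite apples to apples. A small sketch of the difference (toy values):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas defaults to the sample standard deviation (ddof=1),
# NumPy to the population standard deviation (ddof=0)
print(s.std())    # ddof=1
print(np.std(s))  # ddof=0

# They agree once ddof is made explicit
print(np.isclose(s.std(ddof=0), np.std(s)))  # True
```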

Sorting

In [39]:
df3.sort_values('turnout_zscore', ascending=False)
Out[39]:
CENSUS2010POP POPESTIMATE2015 turnout_zscore
190 37253956 39144818 31.513627
2566 25145561 27469114 21.217524
1860 19378102 19795791 16.313294
329 18801310 20271272 15.822832
608 12830632 12859995 10.745799
2282 12702379 12802503 10.636742
2078 11536504 11613423 9.645366
1253 9883640 9922576 8.239890
209 9818605 10170292 8.184589
397 9687653 10214860 8.073237
1923 9535483 10042802 7.943843
1804 8791894 8958013 7.311548
2866 8001024 8382993 6.639050
3000 6724540 7170351 5.553620
1238 6547629 6794422 5.403188
711 6483802 6619680 5.348914
98 6392017 6828065 5.270867
2470 6346105 6600299 5.231826
1508 5988927 6083672 4.928108
1213 5773552 6006401 4.744969
3096 5686986 5771337 4.671359
1337 5303925 5489594 4.345632
624 5194675 5238216 4.252733
249 5029196 5456574 4.112022
0 4779736 4858979 3.899899
2356 4625364 4896146 3.768633
1131 4533372 4670724 3.690409
1010 4339367 4425092 3.525441
2667 4092459 4538028 3.315489
2245 3831074 4028977 3.093226
... ... ... ...
2763 929 916 -0.163652
1643 884 827 -0.163691
277 843 774 -0.163726
1733 824 804 -0.163742
1773 818 750 -0.163747
2698 808 764 -0.163755
1780 783 829 -0.163777
2028 783 936 -0.163777
1738 763 777 -0.163794
1727 736 732 -0.163817
2068 727 767 -0.163824
1676 718 697 -0.163832
290 712 726 -0.163837
2722 707 820 -0.163841
306 699 701 -0.163848
1838 695 698 -0.163851
1685 690 788 -0.163856
96 662 613 -0.163879
1767 647 684 -0.163892
2583 641 648 -0.163897
1739 632 585 -0.163905
1719 614 641 -0.163920
1740 539 475 -0.163984
1659 494 475 -0.164022
1686 478 487 -0.164036
1684 460 456 -0.164051
2697 416 407 -0.164089
2701 286 282 -0.164199
560 90 89 -0.164366
2717 82 112 -0.164373

3193 rows × 3 columns
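When only the top few rows of a ranking like the one above are needed, nlargest can be a cheaper alternative to sorting the whole frame. A small sketch with toy values (not the real census rows):

```python
import pandas as pd

df3 = pd.DataFrame({'CENSUS2010POP': [4779736, 54571, 182265],
                    'turnout_zscore': [3.90, -0.12, -0.01]})

# nlargest keeps only the top-n rows instead of sorting all of them
top = df3.nlargest(2, 'turnout_zscore')

print(top['CENSUS2010POP'].tolist())  # [4779736, 182265]
```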

In [ ]: