Iterators and loading files in chunks¶

Iterators vs Iterables¶

An iterable is an object that can return an iterator¶

Examples: lists, strings, dictionaries, file connections

An iterable has an associated __iter__() method

Applying iter() to an iterable creates an iterator

An iterator is an object that keeps state and produces the next value when you call next() on it¶

When no values remain, next() raises StopIteration

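# A list is an iterable; iter() returns an iterator over it
a = [1, 2, 3, 4]
b = iter([1, 2, 3, 4])
c = iter([5, 6, 7, 8])

print a                                    # the list itself
print b                                    # the iterator object, not its values
print next(b), next(b), next(b), next(b)   # consume the iterator one value at a time
print list(c)                              # consume the whole iterator at once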

[1, 2, 3, 4]
<listiterator object at 0x00000000044B5A90>
1 2 3 4
[5, 6, 7, 8]
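
To make "keeps state" concrete, here is a minimal sketch of roughly what a for loop does behind the scenes: call iter() once, then call next() until StopIteration is raised. It is written for Python 3 (the cells above use Python 2 print statements), and the helper name manual_for is hypothetical, introduced only for illustration.

```python
def manual_for(iterable):
    """Collect values the way a for loop would: iter() once, then next() until StopIteration."""
    collected = []
    iterator = iter(iterable)        # ask the iterable for an iterator
    while True:
        try:
            value = next(iterator)   # advance the iterator's internal state
        except StopIteration:        # raised when no values remain
            break
        collected.append(value)
    return collected


letters = manual_for("abc")
print(letters)

# An exhausted iterator stays exhausted; only a fresh iterator starts over.
it = iter([1, 2])
first_pass = list(it)
second_pass = list(it)
print(first_pass, second_pass)
```

The second list(it) call comes back empty because the iterator has already been consumed, which is the main practical difference between an iterator and the iterable it came from.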

Iterating over iterables¶

Note: this example does NOT work in Python 2, the version used in this notebook.
In Python 3, range() doesn't actually create a list; instead, it creates a range object whose iterator produces the values one at a time until it reaches the limit.
- If range() created the actual list, calling it with 10 ** 100 would fail, because a list that long cannot fit in any computer's memory. The value 10 ** 100 is called a googol: a 1 followed by a hundred 0s.
  - In Python 3, calling range() with 10 ** 100 doesn't pre-create the list.
- In Python 2, range() does try to build the list, which is why the cell below raises OverflowError.

# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-15-57ef632b6db1> in <module>()
      1 # Create an iterator for range(10 ** 100): googol
----> 2 googol = iter(range(10 ** 100))
      3 

OverflowError: range() result has too many items
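
For comparison, a short sketch of the same idea under Python 3, where range() is lazy. This assumes a Python 3 interpreter; the variable names googol_range and first_three are only illustrative.

```python
# Assumes Python 3, where range() returns a lazy range object.
googol_range = range(10 ** 100)    # no list is built here
googol = iter(googol_range)

# Pull just the first three values; the rest are never computed.
first_three = [next(googol) for _ in range(3)]
print(first_three)

# Membership of an int is checked arithmetically, without iterating.
print(10 ** 99 in googol_range)
```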

Iterating over dictionaries¶

a = {1: 9, 'what': 'why?'}

# items() yields (key, value) pairs
for key, value in a.items():
    print key, value

1 9
what why?
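
A dictionary can be iterated in three ways: over its keys, its values, or its key-value pairs. The sketch below assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
a = {1: 9, 'what': 'why?'}

keys = list(a)               # iterating a dict directly yields its keys
values = list(a.values())    # only the values
pairs = list(a.items())      # (key, value) tuples

for key, value in a.items():
    print(key, value)
```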

Iterating over file connections¶

f = open('university_towns.txt')
type(f)

file

iter(f)

<open file 'university_towns.txt', mode 'r' at 0x00000000041F9F60>

iter(f)==f

True

next(f)

'Florence (University of North Alabama)\n'

next(iter(f))

'Jacksonville (Jacksonville State University)[2]\n'
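
The cells above show that iter(f) returns f itself, so a file object is its own iterator and next(iter(f)) simply continues where next(f) left off. The following Python 3 sketch reproduces that on a small temporary file; the file content is a made-up stand-in, not university_towns.txt.

```python
import os
import tempfile

# Write a small stand-in file (hypothetical content for this sketch).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first line\nsecond line\nthird line\n')
    path = tmp.name

# The with block closes the file even if an error occurs.
with open(path) as f:
    same_object = iter(f) is f    # a file object is its own iterator
    first = next(f)               # reads one line
    second = next(iter(f))        # same iterator, so this is line two
    rest = list(f)                # whatever is left

os.remove(path)
print(same_object, repr(first), repr(second), rest)
```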

Using enumerate¶

enumerate() returns an enumerate object that produces a sequence of tuples, and each of the tuples is an index-value pair.¶

# Create a list of strings: mutants
mutants = ['charles xavier',  'bobby drake', 'kurt wagner',  'max eisenhardt',  'kitty pride']
# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))
# Print the list of tuples
print(mutant_list)
print 
# Unpack and print the tuple pairs
for index1, value1 in enumerate(mutants):
    print(index1, value1)

print "\nChange the start index\n"
for index2, value2 in enumerate(mutants, start=3):
    print(index2, value2)

[(0, 'charles xavier'), (1, 'bobby drake'), (2, 'kurt wagner'), (3, 'max eisenhardt'), (4, 'kitty pride')]

(0, 'charles xavier')
(1, 'bobby drake')
(2, 'kurt wagner')
(3, 'max eisenhardt')
(4, 'kitty pride')

Change the start index

(3, 'charles xavier')
(4, 'bobby drake')
(5, 'kurt wagner')
(6, 'max eisenhardt')
(7, 'kitty pride')
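
An enumerate object is itself an iterator, so it can be consumed only once. A minimal Python 3 sketch, using a shortened copy of the mutants list:

```python
mutants = ['charles xavier', 'bobby drake', 'kurt wagner']

pairs = enumerate(mutants, start=1)
is_iterator = iter(pairs) is pairs    # an enumerate object is its own iterator

# Unpack each (index, value) tuple while building a numbered list.
numbered = ['{}. {}'.format(i, name) for i, name in pairs]
print(numbered)

leftover = list(pairs)    # already consumed by the comprehension above
print(leftover)
```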

Using zip¶

zip() takes any number of iterables and returns a zip object, which is an iterator of tuples (Python 3)¶

To print the values of a zip object, convert it into a list first, or unpack it in a loop.
Printing a zip object directly shows only the object itself, not its values.

In Python 2, zip() returns a list¶

Docstring: zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
Return a list of tuples, where each tuple contains the i-th element from each of the argument sequences. The returned list is truncated in length to the length of the shortest argument sequence.

aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy','thermokinesis','teleportation','magnetokinesis','intangibility']

# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))

# Print the list of tuples
print(mutant_data)

print 
# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)

# Print the zip object
print(type(mutant_zip))

# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip:
    print(value1, value2, value3)

[('charles xavier', 'prof x', 'telepathy'), ('bobby drake', 'iceman', 'thermokinesis'), ('kurt wagner', 'nightcrawler', 'teleportation'), ('max eisenhardt', 'magneto', 'magnetokinesis'), ('kitty pride', 'shadowcat', 'intangibility')]

<type 'list'>
('charles xavier', 'prof x', 'telepathy')
('bobby drake', 'iceman', 'thermokinesis')
('kurt wagner', 'nightcrawler', 'teleportation')
('max eisenhardt', 'magneto', 'magnetokinesis')
('kitty pride', 'shadowcat', 'intangibility')
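
The output above shows type 'list' because the notebook runs Python 2. Under Python 3 the same call returns a one-shot iterator, as this sketch illustrates. The shortened lists are stand-ins for the ones above, with aliases deliberately one item shorter to show truncation.

```python
mutants = ['charles xavier', 'bobby drake', 'kurt wagner']
aliases = ['prof x', 'iceman']    # deliberately one item shorter

z = zip(mutants, aliases)
pairs = list(z)    # stops at the shortest input
again = list(z)    # the zip object is now exhausted

# zip(*pairs) reverses the operation ("unzip").
names, nicknames = zip(*pairs)
print(pairs, again, names, nicknames)
```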

Loading data in chunks¶

A dataset can be too large to hold in memory.
Solution: load the data in chunks!

Pandas function: read_csv()
- Specify the chunk size with the chunksize argument

import pandas as pd
from time import time

start = time()

df = pd.read_csv('kamcord_data.csv')

print 'used {:.2f} s'.format(time()-start)

print df.shape
df.head(1)

used 0.40 s
(357404, 6)

Exploring chunksize and iterator¶

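# chunksize=4 returns a reader that yields DataFrames of 4 rows each
a = pd.read_csv('kamcord_data.csv', chunksize=4)
# iterator=True also returns a reader; pull rows from it with get_chunk()
b = pd.read_csv('kamcord_data.csv', iterator=True)

a.next()        # first chunk (Python 2 spelling of next(a))

x = a.next()    # second chunk
y = a.next()    # third chunk

# DataFrame.append was removed in pandas 2.0; pd.concat is the lasting equivalent
y.append(x, ignore_index=True)

pd.concat([x, y], ignore_index=True)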

1st way of loading data in chunks¶

start = time()

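c = 0  # number of chunks read so far
for chunk in pd.read_csv('kamcord_data.csv', chunksize=50000):
    if c == 0:
        # The first chunk becomes the initial DataFrame
        df = chunk
    else:
        # append() copies the accumulated frame on every pass
        # (it was removed in pandas 2.0; the 2nd way uses pd.concat instead)
        df = df.append(chunk, ignore_index=True)
    c += 1
print c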

print 'used {:.2f} s'.format(time()-start)

print df.shape
df.head(1)

8
used 0.48 s
(357404, 6)

2nd way of loading data in chunks¶

start = time()

want = []

# Collect every chunk first, then concatenate once at the end
for chunk in pd.read_csv('kamcord_data.csv', chunksize=50000):
    want.append(chunk)

print len(want)

df=pd.concat(want, ignore_index=True)

print 'used {:.2f} s'.format(time()-start)

print df.shape
df.head(1)

8
used 0.43 s
(357404, 6)

3rd way of loading data in chunks¶

start = time()

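want = []

# iterator=True returns a reader without a fixed chunk size
f = pd.read_csv('kamcord_data.csv', iterator=True)

go = True
while go:
    try:
        # Ask for up to 50000 rows at a time
        want.append(f.get_chunk(50000))
    except StopIteration as e:
        # get_chunk() raises StopIteration once the file is exhausted
        print type(e)
        go = False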
    
print len(want)

df=pd.concat(want, ignore_index=True)

print 'used {:.2f} s'.format(time()-start)

print df.shape
df.head(1)

<type 'exceptions.StopIteration'>
8
used 0.43 s
(357404, 6)
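
The collect-then-concatenate pattern from the 2nd way can be tried without the kamcord_data.csv file. The sketch below assumes Python 3 with pandas installed and uses a small in-memory CSV as a hypothetical stand-in for a large file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file: a header plus 10 data rows.
csv_text = "user,event\n" + "".join("u{},open\n".format(i) for i in range(10))

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunks.append(chunk)    # each chunk is a DataFrame of at most 4 rows

sizes = [len(chunk) for chunk in chunks]
df = pd.concat(chunks, ignore_index=True)    # one concatenation at the end
print(sizes, df.shape)
```

Concatenating once at the end avoids re-copying the accumulated rows on every iteration, which may be part of why the 2nd and 3rd ways time slightly faster than the 1st above.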

Processing large amounts of Twitter data by chunks¶

import pandas as pd

# Import package
import json

# Initialize empty list to store tweets: tweets_data
tweets_data = []

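# Open connection to file; the with block closes it when done
with open('tweets.txt', 'r') as h:
    # Read in tweets and store in list: tweets_data
    for line in h:
        try:
            print 'O',    # marks each line read
            tmp = json.loads(line)
            tweets_data.append(tmp)
        except ValueError:
            # json.loads raises ValueError on a line that is not valid JSON
            print 'X',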


t_df = pd.DataFrame(tweets_data)
print 
print t_df.shape

t_df.to_csv('tweets.csv',index=False, encoding= 'utf-8')
t_df.head(1)

O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O X O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
(615, 33)
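
The single X in the output above marks a line that failed to parse. A minimal Python 3 sketch of the same skip-bad-lines pattern follows; the three sample lines are made up for illustration and are not taken from tweets.txt.

```python
import json

# Hypothetical stand-in for tweets.txt: one JSON object per line, plus one bad line.
lines = ['{"lang": "en", "text": "hello"}\n',
         'not json\n',
         '{"lang": "fr", "text": "bonjour"}\n']

tweets_data = []
bad_lines = 0
for line in lines:
    try:
        tweets_data.append(json.loads(line))
    except ValueError:    # json.JSONDecodeError is a subclass of ValueError
        bad_lines += 1

print(len(tweets_data), bad_lines)
```

Catching ValueError rather than using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being silently swallowed.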

Processing large amounts of data by chunks¶

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):

    # Iterate over the column in dataframe
    for entry in chunk['lang']:
        # Test membership on the dict itself; .keys() builds a list in Python 2
        if entry in counts_dict:
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)

{'fr': 1, 'en': 597, 'und': 14, 'sv': 2, 'es': 1}

Extracting information for large amounts of Twitter data¶

Wrap the chunk-by-chunk counting logic in a reusable function.

# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Iterate over the column in dataframe
        for entry in chunk[colname]:
            # Test membership on the dict itself; .keys() builds a list in Python 2
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')

# Print result_counts
print(result_counts)

{'fr': 1, 'en': 597, 'und': 14, 'sv': 2, 'es': 1}
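
The same chunked counting can be sketched without pandas. This is an alternative technique, not the notebook's method: it uses csv.DictReader, itertools.islice and collections.Counter from the standard library. The function name count_entries_stdlib and the sample data are hypothetical, and the code assumes Python 3.

```python
import csv
import io
from collections import Counter
from itertools import islice


def count_entries_stdlib(lines, c_size, colname):
    """Count values in one column, reading c_size rows at a time (hypothetical helper)."""
    reader = csv.DictReader(lines)
    counts = Counter()
    while True:
        chunk = list(islice(reader, c_size))    # next c_size rows, fewer at the end
        if not chunk:
            break                               # reader exhausted
        counts.update(row[colname] for row in chunk)
    return dict(counts)


# Hypothetical stand-in data, not the tweets.csv used above.
sample = io.StringIO("lang,text\nen,a\nen,b\nfr,c\nund,d\nen,e\n")
result = count_entries_stdlib(sample, 2, 'lang')
print(result)
```

Only c_size rows are held in memory at any time, which is the same property that chunksize gives read_csv().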

Output of a.next() in the exploration cells (first chunk of 4 rows):
	Unnamed: 0	user_id	event_name	event_time	os_name	app_version
0	12	d078c3a4-9a80-4b12-9ca7-95873799f4be	APP_CLOSED	2016-09-18 14:11:29	ios	6.4.1
1	13	a1ac31cb-6d06-401a-a33f-66f91abf1550	APP_CLOSED	2016-09-27 16:22:06	ios	6.4.1
2	14	48a70e65-205e-4ab9-9232-3bafa6fb9496	APP_CLOSED	2016-09-19 14:45:08	android	2.5.1
3	15	e8330f1a-eac6-4add-89a1-f3545b8189e7	SHOT_RECORDED	2016-09-18 12:52:17	android	2.5.1

Output of y.append(x, ignore_index=True) in the exploration cells:
	Unnamed: 0	user_id	event_name	event_time	os_name	app_version
0	20	d04b2d7a-d847-4790-b8ec-a975e7ba56a4	APP_CLOSED	2016-09-19 04:23:31	android	2.5.1
1	21	8dc251b8-03b6-4671-8780-389cd3bc3004	APP_CLOSED	2016-09-11 14:01:04	ios	6.4.1
2	22	e97f8a1a-bdcd-4d38-ac73-63d2b0105395	APP_CLOSED	2016-09-16 19:08:45	android	2.5.1
3	23	beb48c53-d807-4e1a-b1b6-cc20eebf679c	SHOT_RECORDED	2016-09-11 06:30:35	android	2.5.1
4	16	e1f9a1cd-605d-4b94-9dfb-a011f9ec2e0d	APP_OPEN	2016-09-25 21:17:22	ios	6.4.1
5	17	95b9becf-fa38-4c4a-b265-8bf2594b911a	APP_OPEN	2016-09-24 16:58:35	android	2.6
6	18	19836371-f0f0-4db0-b027-a3fa2d0dbf35	SHOT_RECORDED	2016-09-23 12:15:03	ios	6.4.1
7	19	c39eeee3-6605-4970-95b8-0ddb21c81589	SHOT_RECORDED	2016-09-24 04:26:03	android	2.6

Output of pd.concat([x, y], ignore_index=True) in the exploration cells:
	Unnamed: 0	user_id	event_name	event_time	os_name	app_version
0	16	e1f9a1cd-605d-4b94-9dfb-a011f9ec2e0d	APP_OPEN	2016-09-25 21:17:22	ios	6.4.1
1	17	95b9becf-fa38-4c4a-b265-8bf2594b911a	APP_OPEN	2016-09-24 16:58:35	android	2.6
2	18	19836371-f0f0-4db0-b027-a3fa2d0dbf35	SHOT_RECORDED	2016-09-23 12:15:03	ios	6.4.1
3	19	c39eeee3-6605-4970-95b8-0ddb21c81589	SHOT_RECORDED	2016-09-24 04:26:03	android	2.6
4	20	d04b2d7a-d847-4790-b8ec-a975e7ba56a4	APP_CLOSED	2016-09-19 04:23:31	android	2.5.1
5	21	8dc251b8-03b6-4671-8780-389cd3bc3004	APP_CLOSED	2016-09-11 14:01:04	ios	6.4.1
6	22	e97f8a1a-bdcd-4d38-ac73-63d2b0105395	APP_CLOSED	2016-09-16 19:08:45	android	2.5.1
7	23	beb48c53-d807-4e1a-b1b6-cc20eebf679c	SHOT_RECORDED	2016-09-11 06:30:35	android	2.5.1

Data Science Notebook

Python iterators, loading data in chunks with Pandas
