loading data case using generators and chunks example not using Pandas

loading data case using generators and chunks

example not using Pandas

This is a study note summary of some courses from DataCamp 🙂

dataset: World Development Indicators

World bank data

  • Data on world economies for over half a century
    • Indicators
      • Population
      • Electricity consumption
      • CO2 emissions
      • Literacy rates
      • Unemployment
In [2]:
import pandas as pd
!dir
 Volume in drive D is VM
 Volume Serial Number is 023B-9433

 Directory of D:\Dropbox\datacamp\toolbox 2

02/11/2017  04:05 PM    <DIR>          .
02/11/2017  04:05 PM    <DIR>          ..
02/11/2017  03:58 PM    <DIR>          .ipynb_checkpoints
02/11/2017  04:05 PM             1,098 bank case.ipynb
02/11/2017  02:34 AM           303,160 ch1_slides.pdf
02/11/2017  03:28 PM           421,057 ch2_slides.pdf
02/11/2017  04:03 PM           161,340 ch3_slides.pdf
11/22/2016  11:39 AM        30,959,939 kamcord_data.csv
02/11/2017  03:56 PM            11,670 list.ipynb
02/11/2017  03:24 PM            40,789 tool2.ipynb
02/11/2017  03:21 PM         3,909,340 tweets.csv
02/10/2017  01:17 AM         4,092,402 tweets.txt
10/02/2016  01:30 PM            26,623 university_towns.txt
12/21/2016  02:34 PM           139,169 WDI_Country.csv
12/28/2016  03:29 PM           743,579 WDI_CS_Notes.csv
02/01/2017  03:20 PM       187,559,624 WDI_Data.csv
02/01/2017  03:19 PM            36,410 WDI_Description.csv
12/28/2016  03:28 PM        57,084,594 WDI_Footnotes.csv
12/21/2016  02:34 PM         3,512,261 WDI_Series.csv
12/21/2016  02:34 PM            32,781 WDI_ST_Notes.csv
              17 File(s)    289,035,836 bytes
               3 Dir(s)  35,920,375,808 bytes free
In [43]:
f = pd.read_csv('WDI_Data.csv',chunksize=10000)
df = f.next()
df.shape
Out[43]:
(10000, 61)
In [44]:
print df.columns
df=df.iloc[:,:5].dropna()
Index([u'Country Name', u'Country Code', u'Indicator Name', u'Indicator Code',
       u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
       u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
       u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
       u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
       u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
       u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
       u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
       u'2016'],
      dtype='object')
In [45]:
df.shape
Out[45]:
(725, 5)
In [48]:
df[df['Indicator Code']=='SP.ADO.TFRT']
Out[48]:
Country Name Country Code Indicator Name Indicator Code 1960
48 Arab World ARB Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 133.555013
1500 Caribbean small states CSS Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 162.871212
2952 Central Europe and the Baltics CEB Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 46.716752
4404 Early-demographic dividend EAR Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 116.406607
5856 East Asia & Pacific EAS Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 66.015974
7308 East Asia & Pacific (excluding high income) EAP Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 75.043631
8760 East Asia & Pacific (IDA & IBRD countries) TEA Adolescent fertility rate (births per 1,000 wo... SP.ADO.TFRT 76.409849
In [55]:
content = df[df['Indicator Code']=='SP.ADO.TFRT'].iloc[0,]
In [65]:
row = list(content.values)
row
Out[65]:
['Arab World',
 'ARB',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'SP.ADO.TFRT',
 133.55501327768999]
In [64]:
names = ['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value']

Dictionaries for data science

In [66]:
# Zip lists: zipped_lists
zipped_lists = zip(names, row)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)
{'CountryName': 'Arab World', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'CountryCode': 'ARB', 'Year': 133.55501327768999}

Writing a function

In [68]:
# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(names, row)

# Print rs_fxn
print(rs_fxn)
{'CountryName': 'Arab World', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'CountryCode': 'ARB', 'Year': 133.55501327768999}

Using a list comprehension

In [81]:
# Print the first two lists in row_lists
print(df.iloc[0,:])
print 
print(df.iloc[1,:])
print 
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(names, sublist) for sublist in df.values]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])
Country Name                                             Arab World
Country Code                                                    ARB
Indicator Name    Adolescent fertility rate (births per 1,000 wo...
Indicator Code                                          SP.ADO.TFRT
1960                                                        133.555
Name: 48, dtype: object

Country Name                                             Arab World
Country Code                                                    ARB
Indicator Name    Age dependency ratio (% of working-age populat...
Indicator Code                                          SP.POP.DPND
1960                                                        87.7992
Name: 55, dtype: object

{'CountryName': 'Arab World', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'CountryCode': 'ARB', 'Year': 133.55501327769}
{'CountryName': 'Arab World', 'IndicatorName': 'Age dependency ratio (% of working-age population)', 'IndicatorCode': 'SP.POP.DPND', 'CountryCode': 'ARB', 'Year': 87.79923459912621}

Turning this all into a DataFrame

In [86]:
# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(names, sublist) for sublist in df.values]

# Turn list of dicts into a dataframe: df
df2 = pd.DataFrame(list_of_dicts)

print df2.shape
# Print the head of the dataframe
df2.head()
(725, 5)
Out[86]:
CountryCode CountryName IndicatorCode IndicatorName Year
0 ARB Arab World SP.ADO.TFRT Adolescent fertility rate (births per 1,000 wo... 133.555013
1 ARB Arab World SP.POP.DPND Age dependency ratio (% of working-age populat... 87.799235
2 ARB Arab World SP.POP.DPND.OL Age dependency ratio, old (% of working-age po... 6.635328
3 ARB Arab World SP.POP.DPND.YG Age dependency ratio, young (% of working-age ... 81.024250
4 ARB Arab World ER.FSH.AQUA.MT Aquaculture production (metric tons) 4600.000000

Using Python generators for streaming data

Processing data in chunks

example not using Pandas

  • with open(path) as name:
    do sth
In [90]:
# Open a connection to the file
with open('WDI_Data.csv') as f:

    # Skip the column names
    f.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(0, 1000):

        # Split the current line into a list: line
        line = f.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)
{'Arab World': 1000}

In the previous exercise, you processed a file line by line for a given number of lines. What if, however, we want to to do this for the entire file?

In this case, it would be useful to use generators. Generators allow users to lazily evaluate data.

  • This concept of lazy evaluation is useful when you have to deal with very large datasets because it lets you generate values in an efficient manner by yielding only chunks of data at a time instead of the whole thing at once.

define a generator function read_large_file() that produces a generator object which yields a single line from a file each time next() is called on it.

In [92]:
# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data

# Open a connection to the file
with open('WDI_Data.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))
Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016

Arab World,ARB,"2005 PPP conversion factor, GDP (LCU per international $)",PA.NUS.PPP.05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Arab World,ARB,"2005 PPP conversion factor, private consumption (LCU per international $)",PA.NUS.PRVT.PP.05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

  • You've just created a generator function that you can use to help you process large files.
  • You will process the file line by line, to create a dictionary of the counts of how many times each country appears in a column in the dataset.
  • you'll process the entire dataset!
In [93]:
# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Open a connection to the file
with open('WDI_Data.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print            
print(counts_dict)
{'Canada': 1452, 'Sao Tome and Principe': 1452, 'Turkmenistan': 1452, 'Lao PDR': 1452, 'Arab World': 1452, 'Lithuania': 1452, 'Cambodia': 1452, 'Switzerland': 1452, 'Ethiopia': 1452, 'Saudi Arabia': 1452, 'OECD members': 1452, 'Swaziland': 1452, 'South Asia': 1452, 'Argentina': 1452, 'Bolivia': 1452, 'Cameroon': 1452, 'Burkina Faso': 1452, 'Bahrain': 1452, 'Middle East & North Africa (IDA & IBRD countries)': 1452, 'Rwanda': 1452, 'South Asia (IDA & IBRD)': 1452, '"Egypt': 1452, 'Japan': 1452, 'Channel Islands': 1452, 'American Samoa': 1452, 'Northern Mariana Islands': 1452, 'Slovenia': 1452, 'East Asia & Pacific (IDA & IBRD countries)': 1452, 'IDA total': 1452, 'Bosnia and Herzegovina': 1452, 'Guinea': 1452, 'Russian Federation': 1452, 'World': 1452, 'St. Lucia': 1452, 'Dominica': 1452, 'Liberia': 1452, 'Maldives': 1452, 'Pakistan': 1452, 'Virgin Islands (U.S.)': 1452, 'Oman': 1452, 'Tanzania': 1452, 'Early-demographic dividend': 1452, 'Cabo Verde': 1452, 'Mauritania': 1452, 'Greenland': 1452, 'Gabon': 1452, 'Monaco': 1452, 'New Zealand': 1452, 'Spain': 1452, 'European Union': 1452, '"Venezuela': 1452, 'Jamaica': 1452, 'Albania': 1452, 'Samoa': 1452, 'Slovak Republic': 1452, 'Kazakhstan': 1452, 'Guam': 1452, 'Uruguay': 1452, 'India': 1452, 'Azerbaijan': 1452, 'Lesotho': 1452, 'Middle East & North Africa': 1452, 'Europe & Central Asia (IDA & IBRD countries)': 1452, 'United Arab Emirates': 1452, 'Latin America & Caribbean': 1452, 'Aruba': 1452, 'Upper middle income': 1452, 'Tajikistan': 1452, 'Pacific island small states': 1452, 'Turkey': 1452, 'Afghanistan': 1452, 'Bangladesh': 1452, 'East Asia & Pacific': 1452, 'Solomon Islands': 1452, 'Turks and Caicos Islands': 1452, 'Palau': 1452, 'San Marino': 1452, 'French Polynesia': 1452, 'France': 1452, 'Syrian Arab Republic': 1452, 'Bermuda': 1452, 'Somalia': 1452, 'Peru': 1452, 'Vanuatu': 1452, 'Nauru': 1452, 'Seychelles': 1452, 'Late-demographic dividend': 1452, "Cote d'Ivoire": 1452, 'West Bank and Gaza': 1452, 'Benin': 1452, 'Other small states': 1452, '"Gambia': 1452, 'Cuba': 1452, 'Montenegro': 1452, 'Low & middle income': 1452, 'Togo': 1452, 'China': 1452, 'Armenia': 1452, 'Jordan': 1452, 'Timor-Leste': 1452, 'Dominican Republic': 1452, '"Hong Kong SAR': 1452, 'Ukraine': 1452, 'Ghana': 1452, 'Tonga': 1452, 'Finland': 1452, 'Colombia': 1452, 'Libya': 1452, 'Cayman Islands': 1452, 'Central African Republic': 1452, 'North America': 1452, 'Liechtenstein': 1452, 'Belarus': 1452, 'British Virgin Islands': 1452, 'Kenya': 1452, 'Sweden': 1452, 'Poland': 1452, 'Bulgaria': 1452, 'Mauritius': 1452, 'Romania': 1452, 'Angola': 1452, 'Central Europe and the Baltics': 1452, 'Chad': 1452, 'South Africa': 1452, 'St. Vincent and the Grenadines': 1452, 'Cyprus': 1452, 'Caribbean small states': 1452, 'Brunei Darussalam': 1452, 'Qatar': 1452, 'Pre-demographic dividend': 1452, 'Middle income': 1452, 'Austria': 1452, 'Vietnam': 1452, 'Mozambique': 1452, 'Uganda': 1452, 'Kyrgyz Republic': 1452, 'Hungary': 1452, 'Niger': 1452, 'Isle of Man': 1452, 'United States': 1452, 'Brazil': 1452, 'Sub-Saharan Africa (IDA & IBRD countries)': 1452, '"Macao SAR': 1452, 'Faroe Islands': 1452, 'Europe & Central Asia (excluding high income)': 1452, 'Panama': 1452, 'Mali': 1452, 'Costa Rica': 1452, 'Luxembourg': 1452, 'St. Kitts and Nevis': 1452, 'Andorra': 1452, 'Norway': 1452, 'Euro area': 1452, 'Gibraltar': 1452, 'Ireland': 1452, 'Italy': 1452, 'Nigeria': 1452, 'Lower middle income': 1452, 'Ecuador': 1452, 'IDA & IBRD total': 1452, 'Australia': 1452, 'Algeria': 1452, 'El Salvador': 1452, 'Tuvalu': 1452, 'IDA only': 1452, 'Guatemala': 1452, 'Czech Republic': 1452, 'Sub-Saharan Africa': 1452, 'Middle East & North Africa (excluding high income)': 1452, 'Chile': 1452, 'Marshall Islands': 1452, 'Belgium': 1452, 'Kiribati': 1452, 'Haiti': 1452, 'Belize': 1452, 'Fragile and conflict affected situations': 1452, 'Sierra Leone': 1452, 'Georgia': 1452, '"Yemen': 1452, 'Denmark': 1452, 'Post-demographic dividend': 1452, 'Puerto Rico': 1452, 'Moldova': 1452, 'Morocco': 1452, 'Croatia': 1452, 'Mongolia': 1452, 'Guinea-Bissau': 1452, 'Thailand': 1452, 'Namibia': 1452, 'Grenada': 1452, 'Latin America & Caribbean (excluding high income)': 1452, 'Iraq': 1452, 'Portugal': 1452, 'Estonia': 1452, 'Kosovo': 1452, 'Mexico': 1452, 'Lebanon': 1452, '"Congo': 2904, 'Uzbekistan': 1452, 'Djibouti': 1452, 'Country Name': 1, 'Antigua and Barbuda': 1452, 'Low income': 1452, 'High income': 1452, 'Burundi': 1452, 'Least developed countries: UN classification': 1452, 'IDA blend': 1452, 'Barbados': 1452, 'Madagascar': 1452, 'Sub-Saharan Africa (excluding high income)': 1452, 'Curacao': 1452, 'Bhutan': 1452, 'Sudan': 1452, 'Nepal': 1452, 'Malta': 1452, '"Micronesia': 1452, 'Netherlands': 1452, '"Bahamas': 1452, '"Macedonia': 1452, 'Kuwait': 1452, 'Europe & Central Asia': 1452, 'United Kingdom': 1452, 'Israel': 1452, 'Indonesia': 1452, 'Malaysia': 1452, 'Iceland': 1452, 'Zambia': 1452, 'Senegal': 1452, 'Papua New Guinea': 1452, 'Malawi': 1452, 'Suriname': 1452, 'Trinidad and Tobago': 1452, 'Zimbabwe': 1452, 'Germany': 1452, 'St. Martin (French part)': 1452, 'East Asia & Pacific (excluding high income)': 1452, 'Philippines': 1452, '"Iran': 1452, 'Eritrea': 1452, 'Small states': 1452, 'New Caledonia': 1452, 'Sri Lanka': 1452, 'Not classified': 1452, 'Latvia': 1452, 'South Sudan': 1452, '"Korea': 2904, 'Guyana': 1452, 'IBRD only': 1452, 'Honduras': 1452, 'Myanmar': 1452, 'Equatorial Guinea': 1452, 'Tunisia': 1452, 'Nicaragua': 1452, 'Singapore': 1452, 'Serbia': 1452, 'Comoros': 1452, 'Latin America & the Caribbean (IDA & IBRD countries)': 1452, 'Sint Maarten (Dutch part)': 1452, 'Greece': 1452, 'Paraguay': 1452, 'Fiji': 1452, 'Botswana': 1452, 'Heavily indebted poor countries (HIPC)': 1452}

Writing an iterator to load data in chunks

In [ ]:
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('WDI_Data.csv', chunksize=1000)

# Get the first dataframe chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out the head of the dataframe
print(df_urb_pop.head())

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['Country Code'] == 'CEB']

# Zip dataframe columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], 
            df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)
In [ ]:
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('WDI_Data.csv', chunksize=1000)

# Get the first dataframe chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

# Zip dataframe columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], 
            df_pop_ceb['Urban population (% of total)'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Use list comprehension to create new dataframe column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1]) for tup in pops_list]

# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()
In [ ]:
# Define plot_pop()
def plot_pop(filename, country_code):

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty dataframe: data
    data = pd.DataFrame()
    
    # Iterate over each dataframe chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]

        # Zip dataframe columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                    df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comprehension to create new dataframe column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1]) for tup in pops_list]
    
        # Append dataframe chunk to data: data
        data = data.append(df_pop_ceb)

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()

# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')

# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')