R odds and ends: R basics
Description | R | Comments | Python |
---|---|---|---|
examine type | class() | | type() |
vector | c(...) | one-dimensional array, single data type | |
name a vector | names(vect) <- c(...) | | |
slice a vector | vect[3], vect[c(3,5,6)] | indexing starts at 1, versus 0 in Python | |
integer sequence | 2:5 | includes 5; Python's range() excludes the end | |
slice with a range | vect[3:5] | | |
use names as index | vect[c('name1','name2',...)] | | |
calculate average | mean() | in Python, import a library first | np.mean() |
vector comparison | c(2,3,4,5) > 3 | in Python, use NumPy or pandas | |
logical selection | vect[vect > n], vect[logical_vect] | in Python, pandas boolean indexing is common | |
matrix | matrix(), matrix(1:9, byrow = TRUE, nrow = 3) | two-dimensional, single data type | np.matrix() |
name a matrix | rownames(my_matrix) <- row_names_vector, colnames(my_matrix) <- col_names_vector | or dimnames = list(row_names, col_names) at creation | |
sum of values of each row | rowSums(some_matrix) | | ndarray.sum(axis=1), df.sum(axis=1) |
add column(s) to a matrix | bigger <- cbind(matrix1, matrix2, ...) | | pd.concat([df1, df2], axis=1) |
add row(s) to a matrix | rbind(matrix1, matrix2, ...) | | pd.concat([df1, df2], axis=0), df1.append(df2) |
sum of values of each column | colSums(some_matrix) | | ndarray.sum(axis=0), df.sum(axis=0) |
slice a matrix | my_matrix[row, col], my_matrix[1,2], my_matrix[1:3,2:4], my_matrix[, 1], my_matrix[2, ] | | |
factors | factor() | categorical variable | |
convert vector to factor | my_factor <- factor(my_vector) | | |
ordered / unordered | temp_vector <- c("High", "Low", "High", "Low", "Medium"); factor_temp_vector <- factor(temp_vector, order = TRUE, levels = c("Low", "Medium", "High")) | unordered = nominal categorical variable; ordered = ordinal categorical variable | s = pd.Series(["a","b","c","a"], dtype="category"); raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"], ordered=False) |
factor levels | levels(), levels(factor_vector) <- c("name1", "name2", ...) | | |
summary() | summary(my_var) | | df.describe(), Series.value_counts() |
ordered | factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c('slow', 'fast', 'insane')) | ordered factors can be compared | |
data frame | head(df), tail(df) | each column holds a single data type; different columns may differ | |
examine structure of a data frame | str(df) | | |
create a data frame | data.frame(vectors) | | |
slicing | df[rows, columns]; df[2, ] is entire row 2; df[, 3] is entire column 3 | | |
slice by name | df[2:5, 'name'], df['name', ], df[, 'name'] | | |
subset() | subset(planets_df, diameter < 1) is equivalent to planets_df[planets_df[, 'diameter'] < 1, ] | | |
sorting | order() returns the ranking indices, not the values; values: a[order(a)] | | |
sort a data frame | indexes <- order(df$column3); df[indexes, ] | | |
list | my_list <- list(comp1, comp2, ...) | | |
create a named list | my_list <- list(name1 = your_comp1, name2 = your_comp2) | | |
same as above | my_list <- list(your_comp1, your_comp2); names(my_list) <- c("name1", "name2") | | |
select elements from a list | shining_list[["reviews"]] is the same as shining_list$reviews; my_list[[2]][1] | | |
add data to a list | ext_list <- c(my_list, my_val) | | |
comparison operators | & (and), \| (or), ! (not) | the doubled forms && and \|\| compare only the first elements | |
if syntax in R | if (condition) {do sth} else if (condition) {do sth} else {do sth} | | |
read data | read.table, read.delim, read.csv, read.csv2 | hotdogs2 <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"), colClasses = c("factor", "NULL", "numeric")) | |
check environment | environment(func) | | |
anonymous function | (function(x){x + 1})(2) gives 3 | wrap the function in parentheses to call it inline | |
mean() | mean(c(1:9, NA), trim = 0.1, na.rm = TRUE) | trim drops a fraction of values from each end (trims outliers) | |
environment | f <- function() x; x <- 99; f() gives [1] 99 | lexical scoping: x is looked up in the environment where f was defined | |
exists() | a <- 5; exists("a") gives TRUE | | |
vector properties | typeof(), length() | | |
null values in R | NULL (absence of the entire vector), NA (a missing value inside a vector) | | |
check for NA | is.na() | | |
sequence | seq(1, 10), 1:10 | | |
merge vectors | c(vector1, vector2, single_value, ...) | | |
paste() / paste0() | paste() defaults to sep = " "; paste0() uses sep = "" | | " ".join(list) |
paste0("year_", 1:5) | [1] "year_1" "year_2" "year_3" "year_4" "year_5" | | |
plotting | hist(one_dim_data), hist(df$column), boxplot(multi_dim_data), boxplot(df) | | |
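As a companion to the Python column above, here is a minimal runnable sketch of a few NumPy/pandas counterparts; the data is made up purely for illustration:

import numpy as np
import pandas as pd
# Logical selection on a vector (R: vect[vect > 3])
vect = np.array([2, 3, 4, 5])
print(vect[vect > 3])  # [4 5]
# Named elements via a Series (R: names(vect) <- c(...); vect['b'])
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s['b'])  # 2
# Ordered categorical, the pandas counterpart of an ordered factor
speed = pd.Categorical(['slow', 'insane', 'fast'], categories=['slow', 'fast', 'insane'], ordered=True)
print(speed.min(), speed.max())  # slow insane
# Row/column sums and a summary (R: rowSums(), colSums(), summary())
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
print(df.sum(axis=1))  # row sums
print(df.sum(axis=0))  # column sums
print(df.describe())   # rough analogue of summary()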
Loading data using generators and chunks (example not using pandas)
This is a study note summary of some courses from DataCamp 🙂
import pandas as pd
!dir  # list files in the working directory (Jupyter shell command)
# Read the CSV lazily, 10,000 rows per chunk
f = pd.read_csv('WDI_Data.csv', chunksize=10000)
df = next(f)  # grab the first chunk
df.shape
print(df.columns)
# Keep the first five columns and drop rows with missing values
df = df.iloc[:, :5].dropna()
df.shape
df[df['Indicator Code'] == 'SP.ADO.TFRT']
# Take the first matching row as a Series
content = df[df['Indicator Code'] == 'SP.ADO.TFRT'].iloc[0]
row = list(content.values)
row
names = ['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value']
Dictionaries for data science
# Zip lists: zipped_lists
zipped_lists = zip(names, row)
# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)
# Print the dictionary
print(rs_dict)
Writing a function
# Define lists2dict()
def lists2dict(list1, list2):
"""Return a dictionary where list1 provides
the keys and list2 provides the values."""
# Zip lists: zipped_lists
zipped_lists = zip(list1, list2)
# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)
# Return the dictionary
return rs_dict
# Call lists2dict: rs_fxn
rs_fxn = lists2dict(names, row)
# Print rs_fxn
print(rs_fxn)
Using a list comprehension
# Print the first two lists in row_lists
print(df.iloc[0, :])
print()
print(df.iloc[1, :])
print()
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(names, sublist) for sublist in df.values]
# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])
Turning this all into a DataFrame
# Import the pandas package
import pandas as pd
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(names, sublist) for sublist in df.values]
# Turn list of dicts into a dataframe: df
df2 = pd.DataFrame(list_of_dicts)
print(df2.shape)
# Print the head of the dataframe
df2.head()
# Open a connection to the file
with open('WDI_Data.csv') as f:
# Skip the column names
f.readline()
# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Process only the first 1000 rows
for j in range(0, 1000):
# Split the current line into a list: line
line = f.readline().split(',')
# Get the value for the first column: first_col
first_col = line[0]
# If the column value is in the dict, increment its value
if first_col in counts_dict.keys():
counts_dict[first_col] += 1
# Else, add to the dict and set value to 1
else:
counts_dict[first_col] = 1
# Print the resulting dictionary
print(counts_dict)
In the previous exercise, you processed a file line by line for a given number of lines. What if, however, we want to do this for the entire file?
In this case, it would be useful to use generators. Generators allow users to lazily evaluate data.
- This concept of lazy evaluation is useful when you have to deal with very large datasets, because it lets you generate values in an efficient manner by yielding only chunks of data at a time instead of the whole thing at once.
Define a generator function read_large_file() that produces a generator object which yields a single line from a file each time next() is called on it.
# Define read_large_file()
def read_large_file(file_object):
"""A generator function to read a large file lazily."""
# Loop indefinitely until the end of the file
while True:
# Read a line from the file: data
data = file_object.readline()
# Break if this is the end of the file
if not data:
break
# Yield the line of data
yield data
# Open a connection to the file
with open('WDI_Data.csv') as file:
# Create a generator object for the file: gen_file
gen_file = read_large_file(file)
# Print the first three lines of the file
print(next(gen_file))
print(next(gen_file))
print(next(gen_file))
- You've just created a generator function that you can use to help you process large files.
- Next, process the file line by line to build a dictionary counting how many times each country appears in a column of the dataset.
- This time, you'll process the entire dataset!
# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Open a connection to the file
with open('WDI_Data.csv') as file:
# Iterate over the generator from read_large_file()
for line in read_large_file(file):
row = line.split(',')
first_col = row[0]
if first_col in counts_dict.keys():
counts_dict[first_col] += 1
else:
counts_dict[first_col] = 1
# Print
print(counts_dict)
Writing an iterator to load data in chunks
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('WDI_Data.csv', chunksize=1000)
# Get the first dataframe chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
# Check out the head of the dataframe
print(df_urb_pop.head())
# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['Country Code'] == 'CEB']
# Zip dataframe columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
df_pop_ceb['Urban population (% of total)'])
# Turn zip object into list: pops_list
pops_list = list(pops)
# Print pops_list
print(pops_list)
import matplotlib.pyplot as plt  # needed for the plots below

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('WDI_Data.csv', chunksize=1000)
# Get the first dataframe chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
# Zip dataframe columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
df_pop_ceb['Urban population (% of total)'])
# Turn zip object into list: pops_list
pops_list = list(pops)
# Use list comprehension to create new dataframe column 'Total Urban Population'
# (the urban share is a percentage, hence the 0.01 factor)
df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()
# Define plot_pop()
def plot_pop(filename, country_code):
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv(filename, chunksize=1000)
# Initialize empty dataframe: data
data = pd.DataFrame()
# Iterate over each dataframe chunk
for df_urb_pop in urb_pop_reader:
# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code]
# Zip dataframe columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
df_pop_ceb['Urban population (% of total)'])
# Turn zip object into list: pops_list
pops_list = list(pops)
        # Use list comprehension to create new dataframe column 'Total Urban Population'
        # (the urban share is a percentage, hence the 0.01 factor)
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
# Append dataframe chunk to data: data
data = data.append(df_pop_ceb)
# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()
# Set the filename: fn
fn = 'ind_pop_data.csv'
# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')
# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')
List comprehensions and generators
Nested list comprehensions
- [[output expression] for iterator variable in iterable]
- Collapses the for loops for building lists into a single line
- Components:
  - Iterable
  - Iterator variable (represents members of the iterable)
  - Output expression
# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]
# Print the matrix
for row in matrix:
print(row)
pair_2=[(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
pair_2
Using conditionals in comprehensions
- [output expression for iterator variable in iterable if predicate expression]
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member) >= 7]
# Print the new list
print(new_fellowship)
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create list comprehension: new_fellowship
new_fellowship = [member if len(member) >= 7 else '' for member in fellowship]
# Print the new list
print(new_fellowship)
Dict comprehensions
- Recall that the main difference between a list comprehension and a dict comprehension is the use of curly braces {} instead of []. Additionally, members of the dictionary are created using a colon :, as in key:value.
- Creates dictionaries, using curly braces {} instead of brackets []
# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create dict comprehension: new_fellowship
new_fellowship = {member:len(member) for member in fellowship}
# Print the new dict
print(new_fellowship)
Generator expressions
- Recall list comprehension
- Use ( ) instead of [ ]
g = (2 * num for num in range(10))
g
List comprehensions vs. generators
- List comprehension: returns a list
- Generator expression: returns a generator object
- Both can be iterated over (see the sketch after the example below)
(num for num in range(10*1000000) if num % 2 == 0)
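A minimal sketch contrasting the two; exact byte counts vary by platform and Python version:

import sys
# A list comprehension materializes every element up front...
evens_list = [num for num in range(10 * 1000000) if num % 2 == 0]
# ...while the equivalent generator expression produces values on demand
evens_gen = (num for num in range(10 * 1000000) if num % 2 == 0)
print(sys.getsizeof(evens_list))  # tens of megabytes
print(sys.getsizeof(evens_gen))   # a small, constant-size object
# Both can be iterated over
print(next(evens_gen), next(evens_gen), next(evens_gen))  # 0 2 4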
Generator functions
Generator functions are functions that, like generator expressions, yield a series of values instead of returning a single value. A generator function is defined like a regular function, but whenever it generates a value, it uses the keyword yield instead of return.
- Produces generator objects when called
- Defined like a regular function - def
- Yields a sequence of values instead of returning a single value
- Generates a value with yield keyword
def num_sequence(n):
"""Generate values from 0 to n."""
i = 0
while i < n:
yield i
i += 1
test = num_sequence(7)
print(type(test))  # <class 'generator'>
next(test)
next(test)  # each call resumes the generator where it left off
- Extract the column 'created_at' from df and assign the result to tweet_time. Fun fact: the extracted column in tweet_time here is a Series data structure!
- Create a list comprehension that extracts the time from each row in tweet_time. Each row is a string timestamp; slice entry[11:19] (0-based positions 11 through 18) to pull out the clock time. Use entry as the iterator variable and assign the result to tweet_clock_time.
import pandas as pd
df = pd.read_csv('tweets.csv')
# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']
# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time]
# Print the extracted times
print(tweet_clock_time[:100])
Conditional list comprehensions for time-stamped data
- Add a conditional expression to the list comprehension so that you only select the times in which entry[17:19] is equal to '19'.
# Extract the created_at column from df: tweet_time
tweet_time = df['created_at']
# Extract the clock time: tweet_clock_time
tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19']
# Print the extracted times
print(tweet_clock_time)
Python iterators, loading data in chunks with pandas
Iterators, loading a file in chunks
Iterators vs. iterables
An iterable is an object that can return an iterator.
- Examples: lists, strings, dictionaries, file connections
- An object with an associated iter() method
- Applying iter() to an iterable creates an iterator
An iterator is an object that keeps state and produces the next value when you call next() on it.
- Produces the next value with next()
a=[1,2,3,4]
b=iter([1,2,3,4])
c=iter([5,6,7,8])
print(a)
print(b)
print(next(b), next(b), next(b), next(b))
print(list(c))
Iterating over iterables
- Note: this behavior is Python 3; in Python 2, range() builds and returns an actual list
In Python 3, range() doesn't actually create the list; instead, it creates a range object with an iterator that produces the values until it reaches the limit.
If range() created the actual list, calling it with a value of 10^100 might not work, since a number that big can exceed a regular computer's memory. The value 10^100 is what's called a googol: a 1 followed by a hundred 0s. That's a huge number!
- Calling range() with 10^100 won't actually pre-create the list (a quick check follows the snippet below).
# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))
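A quick check of the laziness, continuing the cell above (assumes Python 3, where range is lazy):

# Pull a few values; no 10**100-element list is ever built
print(next(googol))  # 0
print(next(googol))  # 1
print(next(googol))  # 2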
Iterating over dictionaries
a = {1: 9, 'what': 'why?'}
for key, value in a.items():
    print(key, value)
Iterating over file connections
f = open('university_towns.txt')
type(f)
iter(f)
iter(f) == f  # True: a file object is its own iterator
next(f)
next(iter(f))
# Create a list of strings: mutants
mutants = ['charles xavier', 'bobby drake', 'kurt wagner', 'max eisenhardt', 'kitty pride']
# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))
# Print the list of tuples
print(mutant_list)
print()
# Unpack and print the tuple pairs
for index1, value1 in enumerate(mutants):
    print(index1, value1)
print("\nChange the start index\n")
for index2, value2 in enumerate(mutants, start=3):
    print(index2, value2)
Using zip
zip() takes any number of iterables and returns a zip object that is an iterator of tuples.
- If you want to print the values of a zip object, you can convert it into a list and then print it.
- Printing just a zip object will not show the values unless you unpack it first.
In Python 2, zip() returns a list
Docstring: zip(seq1 [, seq2 [...]]) -> [(seq1[0], seq2[0] ...), (...)]
Return a list of tuples, where each tuple contains the i-th element from each of the argument sequences. The returned list is truncated in length to the length of the shortest argument sequence.
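A minimal sketch of the truncation behavior the docstring describes, with made-up lists:

shorter = [1, 2]
longer = ['a', 'b', 'c', 'd']
print(list(zip(shorter, longer)))  # [(1, 'a'), (2, 'b')]: truncated to the shorter input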
aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy','thermokinesis','teleportation','magnetokinesis','intangibility']
# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))
# Print the list of tuples
print(mutant_data)
print()
# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)
# Print the zip object
print(type(mutant_zip))
# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip:
print(value1, value2, value3)
Loading data in chunks¶
- There can be too much data to hold in memory
- Solution: load data in chunks!
- Pandas function: read_csv()
- Specify the chunk: chunksize
import pandas as pd
from time import time

start = time()
df = pd.read_csv('kamcord_data.csv')
print('used {:.2f} s'.format(time() - start))
print(df.shape)
df.head(1)
Explore
a = pd.read_csv('kamcord_data.csv', chunksize=4)
b = pd.read_csv('kamcord_data.csv', iterator=True)
next(a)       # discard the first 4-row chunk
x = next(a)
y = next(a)
y.append(x, ignore_index=True)        # one way to combine chunks
pd.concat([x, y], ignore_index=True)  # the more general way
1st way of loading data in chunks
start = time()
c = 0
for chunk in pd.read_csv('kamcord_data.csv', chunksize=50000):
    if c == 0:
        df = chunk
    else:
        df = df.append(chunk, ignore_index=True)
    c += 1
print(c)
print('used {:.2f} s'.format(time() - start))
print(df.shape)
df.head(1)
2nd way of loading data in chunks
start = time()
want = []
for chunk in pd.read_csv('kamcord_data.csv', chunksize=50000):
    want.append(chunk)
print(len(want))
df = pd.concat(want, ignore_index=True)
print('used {:.2f} s'.format(time() - start))
print(df.shape)
df.head(1)
3rd way of loading data in chunks
start = time()
want = []
f = pd.read_csv('kamcord_data.csv', iterator=True)
go = True
while go:
    try:
        want.append(f.get_chunk(50000))
    except Exception as e:
        print(type(e))  # get_chunk() raises StopIteration at the end of the file
        go = False
print(len(want))
df = pd.concat(want, ignore_index=True)
print('used {:.2f} s'.format(time() - start))
print(df.shape)
df.head(1)
Processing large amounts of Twitter data by chunks
import pandas as pd
# Import package
import json
# Initialize empty list to store tweets: tweets_data
tweets_data = []
# Open connection to file
h = open('tweets.txt', 'r')
# Read in tweets and store in list: tweets_data
for i in h:
    try:
        tmp = json.loads(i)
        tweets_data.append(tmp)
        print('O', end=' ')  # parsed OK
    except ValueError:       # skip lines that aren't valid JSON
        print('X', end=' ')
h.close()
t_df = pd.DataFrame(tweets_data)
print()
print(t_df.shape)
t_df.to_csv('tweets.csv',index=False, encoding= 'utf-8')
t_df.head(1)
Processing large amounts of data by chunks
# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):
# Iterate over the column in dataframe
for entry in chunk['lang']:
if entry in counts_dict.keys():
counts_dict[entry] += 1
else:
counts_dict[entry] = 1
# Print the populated dictionary
print(counts_dict)
Extracting information for large amounts of Twitter data
- Make the logic reusable
- Wrap it in a function with def
# Define count_entries()
def count_entries(csv_file, c_size, colname):
"""Return a dictionary with counts of
occurrences as value for each key."""
# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Iterate over the file chunk by chunk
for chunk in pd.read_csv(csv_file, chunksize=c_size):
# Iterate over the column in dataframe
for entry in chunk[colname]:
if entry in counts_dict.keys():
counts_dict[entry] += 1
else:
counts_dict[entry] = 1
# Return counts_dict
return counts_dict
# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')
# Print result_counts
print(result_counts)
Python odds and ends: scope, filter, reduce
Description | Code | Comments |
---|---|---|
quickly assign values | a, b, c = (3, 7, 12) | tuple unpacking |
nested functions | an outer function returns its inner function | see the sketch below |
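The note breaks off here. Below is a minimal sketch of the closure idea from the last row, plus the filter and reduce named in the section title; the values and the make_adder helper are made up for illustration:

from functools import reduce

# Nested functions: the outer function returns the inner one (a closure)
def make_adder(n):
    def add(x):
        return x + n  # n is captured from the enclosing scope
    return add

add5 = make_adder(5)
print(add5(10))  # 15
# filter() keeps the elements for which the predicate is True
print(list(filter(lambda v: v > 5, [3, 7, 12])))   # [7, 12]
# reduce() folds a sequence into a single value
print(reduce(lambda acc, v: acc + v, [3, 7, 12]))  # 22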