Filtering and Targeting Data

  • Where Clauses

Where Clauses

● Restrict data returned by a query based on boolean conditions

● Compare a column against a value or another column

● Commonly used comparisons: ==, <=, >=, !=
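The bullets above can be sketched without a database: comparing a Column inside where() renders directly to SQL. A minimal sketch, assuming a made-up people table (the column names are illustrative only); it uses the modern select(...) call rather than the legacy select([...]) form seen in the cells below:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()
# Stand-in table with hypothetical columns, used only to render SQL
people = Table('people', metadata,
               Column('name', String),
               Column('age', Integer),
               Column('retirement_age', Integer))

# Compare a column against a value...
stmt1 = select(people).where(people.c.age >= 21)
print(stmt1)

# ...or against another column
stmt2 = select(people).where(people.c.age != people.c.retirement_age)
print(stmt2)
```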

Connecting to a PostgreSQL Database

  • When connecting to a PostgreSQL database, many prefer the psycopg2 driver, as it supports practically all of PostgreSQL's features efficiently and is the default driver for the postgresql dialect in SQLAlchemy.

    • There are four components to the connection string in this exercise: the dialect and driver ('postgresql+psycopg2://')

    • followed by the username and password ('student:datacamp')

    • followed by the host and port ('@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/')

    • and finally, the database name ('census').
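Concatenating the four components listed above yields the full SQLAlchemy URL. A sketch of the assembly (the notebook cells below connect to a local server instead):

```python
# Assemble the connection-string components described above
dialect_driver = 'postgresql+psycopg2://'
user_pass = 'student:datacamp'
host_port = '@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/'
database = 'census'

url = dialect_driver + user_pass + host_port + database
print(url)
```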

In [1]:
from sqlalchemy import create_engine
import pandas as pd

# Import create_engine function
from sqlalchemy import create_engine

# Create an engine to the census database
engine = create_engine('postgresql+psycopg2://postgres:iri@localhost')

connection = engine.connect()


# create table & insert data from Pandas df to PostgreSQL
df = pd.read_csv('all_050_in_36.P1.csv')
# df.to_sql('census', connection)


# Use the .table_names() method on the engine to print the table names
print(engine.table_names())
[u'census', u'crime2']
In [2]:
r = connection.execute('select * from census')
r.fetchall()[:4]
Out[2]:
[(0L, 36003L, 50L, 36L, 3L, 99999L, 999L, None, None, u'Allegany County', 48946L, 26140L, 49927L, 24505L, 48946L, 49927L),
 (1L, 36005L, 50L, 36L, 5L, 35620L, 408L, None, None, u'Bronx County', 1385108L, 511896L, 1332650L, 490659L, 1385108L, 1332650L),
 (2L, 36007L, 50L, 36L, 7L, 13780L, 999L, None, None, u'Broome County', 200600L, 90563L, 200536L, 88817L, 200600L, 200536L),
 (3L, 36009L, 50L, 36L, 9L, 36460L, 160L, None, None, u'Cattaraugus County', 80317L, 41111L, 83955L, 39839L, 80317L, 83955L)]

Expressions

● Provide more complex conditions than simple operators

● E.g. in_(), like(), between()

● Many more in the documentation

● Available as method on a Column

Filter data selected from a Table - Simple

  • A where() clause is used to filter the data that a statement returns.
    • E.g., to select all the records from the census table where NAME starts with 'A':
In [3]:
from sqlalchemy import MetaData, Table, select, or_, and_, not_

metadata = MetaData()

census = Table('census', metadata, autoload=True, autoload_with=engine)
In [4]:
print census.columns.keys()
['index', 'GEOID', 'SUMLEV', 'STATE', 'COUNTY', 'CBSA', 'CSA', 'NECTA', 'CNECTA', 'NAME', 'POP100', 'HU100', 'POP100.2000', 'HU100.2000', 'P001001', 'P001001.2000']
In [5]:
stmt = select([census])
print stmt
SELECT census.index, census."GEOID", census."SUMLEV", census."STATE", census."COUNTY", census."CBSA", census."CSA", census."NECTA", census."CNECTA", census."NAME", census."POP100", census."HU100", census."POP100.2000", census."HU100.2000", census."P001001", census."P001001.2000" 
FROM census
In [6]:
stmt = stmt.where(census.columns.NAME.startswith('A'))
print stmt
SELECT census.index, census."GEOID", census."SUMLEV", census."STATE", census."COUNTY", census."CBSA", census."CSA", census."NECTA", census."CNECTA", census."NAME", census."POP100", census."HU100", census."POP100.2000", census."HU100.2000", census."P001001", census."P001001.2000" 
FROM census 
WHERE (census."NAME" LIKE :NAME_1 || '%%')
In [7]:
results = connection.execute(stmt).fetchall()
results
Out[7]:
[(0L, 36003L, 50L, 36L, 3L, 99999L, 999L, None, None, u'Allegany County', 48946L, 26140L, 49927L, 24505L, 48946L, 49927L),
 (51L, 36001L, 50L, 36L, 1L, 10580L, 104L, None, None, u'Albany County', 304204L, 137739L, 294565L, 129972L, 304204L, 294565L)]
In [8]:
# Loop over the results and print the state, county, name, and populations
for result in results:
    print(result.STATE, result.COUNTY, result.NAME, result["POP100.2000"], result.POP100)
(36L, 3L, u'Allegany County', 49927L, 48946L)
(36L, 1L, u'Albany County', 294565L, 304204L)

Use the crime data

  • State (string)

  • Agency type (string)

  • Agency name (string)

  • Race (signed int64)

  • Religion (signed int64)

  • Sexual orientation (signed int64)

  • Ethnicity (signed int64)

  • Disability (signed int64)

  • Gender (signed int64)

  • Gender Identity (signed int64)

  • 1st quarter (signed int64)

  • 2nd quarter (signed int64)

  • 3rd quarter (signed int64)

  • 4th quarter (signed int64)

  • Population (string)

In [9]:
from sqlalchemy import create_engine
import pandas as pd



# Create an engine to the census database
engine = create_engine('postgresql+psycopg2://postgres:iri@localhost')

connection = engine.connect()


# create table & insert data from Pandas df to PostgreSQL
df = pd.read_csv('table13.csv')
df[['Gender','Population']] = df[['Gender','Population']].apply(lambda x: pd.to_numeric(x.str.replace(',',''), errors='coerce'))
# df.to_sql('crime2',connection)


# Use the .table_names() method on the engine to print the table names
print(engine.table_names())
[u'census', u'crime2']
In [10]:
df.head(1)
Out[10]:
State Agency type Agency name Race Religion Sexual orientation Ethnicity Disability Gender Gender Identity 1st quarter 2nd quarter 3rd quarter 4th quarter Population
0 Alabama Cities Florence 2 0 0 0 0 0.0 0.0 0.0 1.0 0.0 1.0 39481.0

Check the DataFrame dtypes

In [11]:
df.dtypes
Out[11]:
State                  object
Agency type            object
Agency name            object
Race                    int64
Religion                int64
Sexual orientation      int64
Ethnicity               int64
Disability              int64
Gender                float64
Gender Identity       float64
1st quarter           float64
2nd quarter           float64
3rd quarter           float64
4th quarter           float64
Population            float64
dtype: object

Transform the Population column from string to numeric

  • remove the ',' thousands separators first
In [12]:
df = pd.read_csv('table13.csv')
df.dtypes
Out[12]:
State                  object
Agency type            object
Agency name            object
Race                    int64
Religion                int64
Sexual orientation      int64
Ethnicity               int64
Disability              int64
Gender                 object
Gender Identity       float64
1st quarter           float64
2nd quarter           float64
3rd quarter           float64
4th quarter           float64
Population             object
dtype: object
In [13]:
pd.to_numeric(df['Population'].str.replace(',','')).head()
Out[13]:
0     39481.0
1     84139.0
2     35154.0
3     94126.0
4    299455.0
Name: Population, dtype: float64

Batch-convert datatypes

Specifying errors='coerce' converts values that cannot be parsed to NaN

In [14]:
df[['Gender','Population']].apply(lambda x: pd.to_numeric(x.str.replace(',',''), errors='coerce')).head()
Out[14]:
Gender Population
0 0.0 39481.0
1 0.0 84139.0
2 0.0 35154.0
3 0.0 94126.0
4 0.0 299455.0

Summary statistics

In [15]:
df['Agency type'].value_counts()
Out[15]:
Cities                       1326
Metropolitan Counties         216
Universities and Colleges     109
Nonmetropolitan Counties      102
State Police Agencies          53
Other Agencies                 20
Name: Agency type, dtype: int64
In [16]:
df['State'].unique()
Out[16]:
array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'DC', 'Delaware', 'Florida', 'Georgia',
       'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky',
       'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska',
       'Nevada', 'New_Hampshire', 'New_Jersey', 'New_Mexico', 'New_York',
       'North_Carolina', 'North_Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode_Island', 'South_Carolina', 'South_Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West_Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

Filter data selected from a Table - Expressions

In addition to standard Python comparators, we can also use methods such as in_() to create more powerful where() clauses.

You can see a full list of expressions in the SQLAlchemy Documentation.

http://docs.sqlalchemy.org/en/latest/core/sqlelement.html#module-sqlalchemy.sql.expression

In [17]:
states = ['Connecticut', 'Virginia']
In [18]:
from sqlalchemy import select, or_, and_, not_
from sqlalchemy import create_engine, MetaData, Table


# Create an engine to the census database
engine = create_engine('postgresql+psycopg2://postgres:iri@localhost')

connection = engine.connect()


metadata = MetaData()

crime = Table('crime2', metadata, autoload=True, autoload_with=engine)

# Create a query for the crime2 table: stmt


# Append a where clause to match all the states in_ the list states
stmt = select([crime]).where(crime.c.State.in_(states))
print stmt

# Loop over the ResultProxy and print the state, population, and agency info
c = 0
for result in connection.execute(stmt):
    print(result.State, result.Population, result['Agency type'],result['Agency name'] ), 
    c+=1
    print c
    if c==10: break
SELECT crime2.index, crime2."State", crime2."Agency type", crime2."Agency name", crime2."Race", crime2."Religion", crime2."Sexual orientation", crime2."Ethnicity", crime2."Disability", crime2."Gender", crime2."Gender Identity", crime2."1st quarter", crime2."2nd quarter", crime2."3rd quarter", crime2."4th quarter", crime2."Population" 
FROM crime2 
WHERE crime2."State" IN (:State_1, :State_2)
(u'Connecticut', 19338.0, u'Cities', u'Bethel') 1
(u'Connecticut', 147076.0, u'Cities', u'Bridgeport') 2
(u'Connecticut', 10369.0, u'Cities', u'Canton') 3
(u'Connecticut', 83363.0, u'Cities', u'Danbury') 4
(u'Connecticut', 12806.0, u'Cities', u'Derby') 5
(u'Connecticut', 12934.0, u'Cities', u'East Hampton') 6
(u'Connecticut', 51275.0, u'Cities', u'East Hartford') 7
(u'Connecticut', 11455.0, u'Cities', u'East Windsor') 8
(u'Connecticut', 44661.0, u'Cities', u'Enfield') 9
(u'Connecticut', 34782.0, u'Cities', u'Glastonbury') 10

SQLAlchemy also provides conjunctions such as and_(), or_(), and not_() to build more complex filters.

Order by Clauses

● Allows us to control the order in which records are returned in the query results

● Available as a method on statements: order_by()

  • Order by Descending

    • Wrap the column with desc() in the order_by() clause

Order by Multiple

● Just separate multiple columns with a comma

● Orders completely by the first column

● Then if there are duplicates in the first column, orders by the second column

● Repeats until all columns are ordered
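The ordering rules above can be sketched in isolation. A minimal sketch with a made-up table (the combined where()/order_by() query in the cell below shows the same technique on real data); desc() wraps the first column, and a second column acts as the tie-breaker:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, desc, select

metadata = MetaData()
# Hypothetical table used only to render the ORDER BY clause
places = Table('places', metadata,
               Column('state', String),
               Column('name', String),
               Column('pop', Integer))

# Descending on pop first; ties broken by name ascending
stmt = select(places).order_by(desc(places.c.pop), places.c.name)
print(stmt)
```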

In [19]:
print crime.columns.keys()
['index', 'State', 'Agency type', 'Agency name', 'Race', 'Religion', 'Sexual orientation', 'Ethnicity', 'Disability', 'Gender', 'Gender Identity', '1st quarter', '2nd quarter', '3rd quarter', '4th quarter', 'Population']
In [20]:
crime.columns['Agency name']
Out[20]:
Column('Agency name', TEXT(), table=<crime2>)
In [21]:
from sqlalchemy import desc

statement = select([crime]).where(
  and_(crime.c.State == 'New_York',
       or_(crime.columns['Agency type'] == 'Cities',
          crime.columns.Race == 2,
           crime.c.Gender == 3
         )
      )
  ).order_by(desc(crime.c.Population), crime.c['Agency name'])

print statement
SELECT crime2.index, crime2."State", crime2."Agency type", crime2."Agency name", crime2."Race", crime2."Religion", crime2."Sexual orientation", crime2."Ethnicity", crime2."Disability", crime2."Gender", crime2."Gender Identity", crime2."1st quarter", crime2."2nd quarter", crime2."3rd quarter", crime2."4th quarter", crime2."Population" 
FROM crime2 
WHERE crime2."State" = :State_1 AND (crime2."Agency type" = :Agency type_1 OR crime2."Race" = :Race_1 OR crime2."Gender" = :Gender_1) ORDER BY crime2."Population" DESC, crime2."Agency name"
In [22]:
connection.execute(statement).fetchall()[:10]
Out[22]:
[(1205L, u'New_York', u'Other Agencies', u'New York City Metropolitan Transportation Authority', 2L, 15L, 0L, 0L, 0L, 0.0, 0.0, 4.0, 5.0, 4.0, 4.0, None),
 (1145L, u'New_York', u'Cities', u'New York', 44L, 152L, 99L, 17L, 1L, 0.0, 1.0, 56.0, 84.0, 81.0, 93.0, 8396126.0),
 (1126L, u'New_York', u'Cities', u'Buffalo', 11L, 1L, 2L, 2L, 0L, 1.0, 0.0, 6.0, 3.0, 4.0, 4.0, 258789.0),
 (1152L, u'New_York', u'Cities', u'Rochester', 3L, 1L, 0L, 0L, 0L, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 210562.0),
 (1164L, u'New_York', u'Cities', u'Yonkers', 1L, 3L, 1L, 0L, 0L, 1.0, 1.0, 2.0, 0.0, 3.0, 2.0, 199134.0),
 (1123L, u'New_York', u'Cities', u'Amherst Town', 3L, 0L, 0L, 0L, 0L, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 118296.0),
 (1122L, u'New_York', u'Cities', u'Albany', 1L, 2L, 0L, 0L, 0L, 0.0, 0.0, 0.0, 2.0, 1.0, 0.0, 97956.0),
 (1134L, u'New_York', u'Cities', u'Greece Town', 3L, 0L, 0L, 0L, 0L, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 96667.0),
 (1151L, u'New_York', u'Cities', u'Ramapo Town', 0L, 1L, 0L, 0L, 0L, 0.0, 0.0, 0.0, 0.0, None, 1.0, 87204.0),
 (1128L, u'New_York', u'Cities', u'Clarkstown Town', 1L, 2L, 0L, 0L, 0L, 0.0, 0.0, 2.0, 1.0, 0.0, 0.0, 80705.0)]
In [37]:
connection.execute('select * from crime2').first()
Out[37]:
(0L, u'Alabama', u'Cities', u'Florence', 2L, 0L, 0L, 0L, 0L, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 39481.0)

func

scalar()

In [48]:
print connection.execute('select sum(crime2."Population") from crime2').fetchall()

print connection.execute('select sum(crime2."Population") from crime2').scalar()
[(111029896.0,)]
111029896.0

Counting, Summing and Grouping Data

SQL Functions

● E.g. Count, Sum

● from sqlalchemy import func

● More efficient than processing in Python

● Aggregate data

Counting Distinct Data

  • SQLAlchemy's func module provides access to built-in SQL functions that can make operations like counting and summing faster and more efficient.
In [54]:
from sqlalchemy import func

# Build a query to count the distinct states values: stmt
stmt = select([func.count(crime.columns.State.distinct())])

print stmt

# Execute the query and store the scalar result: distinct_state_count
distinct_state_count = connection.execute(stmt).scalar()

# Print the distinct_state_count
print(distinct_state_count)

print connection.execute(stmt).fetchall()
SELECT count(DISTINCT crime2."State") AS count_1 
FROM crime2
50
[(50L,)]

summing example

In [56]:
connection.execute(select([func.sum(crime.c.Population)])).scalar()
Out[56]:
111029896.0

Group by

● Allows us to group row by common values.

Count of Records by State

  • Often, we want a count of records for each distinct value in another column.
    • The .group_by() method helps answer this type of query.
      • You can pass a column to the .group_by() method and apply an aggregate function like sum() or count() to the other selected columns.
        • Much like the .order_by() method, .group_by() can take multiple columns as arguments.
In [62]:
# Build a query to select the state and count of records by state: stmt
stmt = select([crime.columns.State, func.count(crime.columns.Population)])

# Group stmt by state
stmt = stmt.group_by(crime.columns.State)

print stmt

# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()

print 
# Print results
print(results)

print 
# Print the keys/column names of the results returned
print(results[0].keys())
SELECT crime2."State", count(crime2."Population") AS count_1 
FROM crime2 GROUP BY crime2."State"

[(u'Alabama', 4L), (u'Indiana', 9L), (u'Minnesota', 35L), (u'New_Jersey', 137L), (u'Louisiana', 2L), (u'California', 199L), (u'Connecticut', 46L), (u'Alaska', 1L), (u'Nevada', 5L), (u'Oklahoma', 20L), (u'Idaho', 9L), (u'Maryland', 8L), (u'Ohio', 83L), (u'Delaware', 7L), (u'Florida', 26L), (u'Maine', 14L), (u'Illinois', 29L), (u'Montana', 8L), (u'Oregon', 12L), (u'Texas', 43L), (u'Mississippi', 2L), (u'South_Dakota', 5L), (u'Kansas', 22L), (u'North_Dakota', 11L), (u'Wisconsin', 15L), (u'DC', 1L), (u'Pennsylvania', 13L), (u'Tennessee', 45L), (u'Georgia', 5L), (u'Rhode_Island', 7L), (u'Vermont', 9L), (u'Colorado', 34L), (u'Arkansas', 13L), (u'Washington', 56L), (u'North_Carolina', 38L), (u'Virginia', 30L), (u'Wyoming', 1L), (u'Missouri', 13L), (u'Iowa', 6L), (u'New_Hampshire', 16L), (u'Arizona', 17L), (u'New_York', 58L), (u'South_Carolina', 26L), (u'Utah', 24L), (u'Kentucky', 50L), (u'Nebraska', 8L), (u'Massachusetts', 78L), (u'New_Mexico', 3L), (u'West_Virginia', 12L), (u'Michigan', 119L)]

['State', u'count_1']

Determining the Population Sum by State

  • To avoid confusion with auto-generated result column names like count_1,
    • use the .label() method to provide a name for the resulting column.
In [64]:
# Build an expression to calculate the sum of Population, labeled 'population'
pop = func.sum(crime.columns.Population).label('population')
print pop
sum(crime2."Population")
In [65]:
# Build a query to select the state and the population sum: stmt
stmt = select([crime.columns.State, pop])

# Group stmt by state
stmt = stmt.group_by(crime.columns.State)
print stmt
SELECT crime2."State", sum(crime2."Population") AS population 
FROM crime2 GROUP BY crime2."State"
In [68]:
# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()

# Print results
print(results)

print 
# Print the keys/column names of the results returned
print(results[0].keys())
[(u'Alabama', 252900.0), (u'Indiana', 1453808.0), (u'Minnesota', 1911807.0), (u'New_Jersey', 3990457.0), (u'Louisiana', 387093.0), (u'California', 23717414.0), (u'Connecticut', 1970207.0), (u'Alaska', 299455.0), (u'Nevada', 2229342.0), (u'Oklahoma', 1038537.0), (u'Idaho', 470535.0), (u'Maryland', 800955.0), (u'Ohio', 3771479.0), (u'Delaware', 77814.0), (u'Florida', 3203198.0), (u'Maine', 221930.0), (u'Illinois', 4076305.0), (u'Montana', 309240.0), (u'Oregon', 1277020.0), (u'Texas', 9657128.0), (u'Mississippi', 115607.0), (u'South_Dakota', 285606.0), (u'Kansas', 786176.0), (u'North_Dakota', 328165.0), (u'Wisconsin', 1310774.0), (u'DC', 646449.0), (u'Pennsylvania', 2249969.0), (u'Tennessee', 2491040.0), (u'Georgia', 539915.0), (u'Rhode_Island', 378290.0), (u'Vermont', 132827.0), (u'Colorado', 2874986.0), (u'Arkansas', 380920.0), (u'Washington', 3061603.0), (u'North_Carolina', 3062521.0), (u'Virginia', 2152170.0), (u'Wyoming', 31884.0), (u'Missouri', 1408607.0), (u'Iowa', 199962.0), (u'New_Hampshire', 321835.0), (u'Arizona', 3809343.0), (u'New_York', 10664465.0), (u'South_Carolina', 564070.0), (u'Utah', 1027223.0), (u'Kentucky', 1751788.0), (u'Nebraska', 818783.0), (u'Massachusetts', 3454660.0), (u'New_Mexico', 580664.0), (u'West_Virginia', 222600.0), (u'Michigan', 4260370.0)]

['State', 'population']

Group by

● Supports multiple columns to group by, with a pattern similar to order_by()

● Requires all selected columns to be grouped or aggregated by a function

Group by Multiple

Handling ResultSets from Functions

● SQLAlchemy auto-generates “column names” for functions in the ResultSet

● The column names are often func_# such as count_1

● Replace them with the label() method
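Grouping by multiple columns is not demonstrated in the cells elsewhere, so here is a minimal sketch with a hypothetical table: every selected column is either in the group_by() or wrapped in an aggregate, and label() replaces the auto-generated name:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, func, select

metadata = MetaData()
# Hypothetical table used only to render SQL
crime = Table('crime', metadata,
              Column('state', String),
              Column('agency_type', String),
              Column('population', Integer))

# Group by two columns; label() names the aggregate instead of sum_1
stmt = (select(crime.c.state,
               crime.c.agency_type,
               func.sum(crime.c.population).label('population'))
        .group_by(crime.c.state, crime.c.agency_type))
print(stmt)
```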

SQLAlchemy and Pandas for Visualization

● DataFrame can take a SQLAlchemy ResultSet

● Make sure to set the DataFrame columns to the ResultSet keys

Save the results as a DataFrame

In [73]:
df2 = pd.DataFrame(results, columns=results[0].keys())
df2.head()
Out[73]:
State population
0 Alabama 252900.0
1 Indiana 1453808.0
2 Minnesota 1911807.0
3 New_Jersey 3990457.0
4 Louisiana 387093.0

Graphing

In [74]:
import matplotlib.pyplot as plt
%matplotlib inline
In [156]:
print df2.shape
df2[11:22].plot.barh(alpha=.6,color='g')
sns.set_context("notebook", font_scale=.5, )
(50, 2)
In [99]:
import seaborn as sns
sns.barplot('State','population',data = df2.sort_values('population', ascending=False)[:5])
Out[99]:
<matplotlib.axes._subplots.AxesSubplot at 0xf259eb8>
In [152]:
plt.figure(figsize=(15,6))
bar = sns.barplot('State','population',data = df2.sort_values('population'))
sns.set_context("notebook", font_scale=2, )

bar.set_xticklabels(df2['State'].values,rotation=90, size=10)

plt.show()
In [ ]:
 
In [ ]: