Connecting to a database

● Engine: common interface to the database from SQLAlchemy

● Connection string: All the details required to find the database (and login, if necessary)

__

● 'sqlite:///census_nyc.sqlite'

Driver+Dialect Filename

create an engine

connect to engine

● Before querying your database, you’ll want to know what is in it: what the tables are, for example:

Engines and Connection Strings

  • An engine is just a common interface to a database, and the information it requires to connect to one is contained in a connection string, such as sqlite:///census_nyc.sqlite
  • Here, sqlite is the database driver, while census_nyc.sqlite is the path to a SQLite file contained in the local directory named census_nyc.sqlite.

document

  • create an engine that connects to a local SQLite file named census.sqlite. Then, print the names of the tables it contains using the .table_names() method. Note that when you just want to print the table names, you do not need to use engine.connect() after creating the engine.
In [47]:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///census.sqlite')

print engine.table_names()

connection = engine.connect()
[u'cencus', u'census']

read a csv file and save the data to a sqlite database

In [11]:
import pandas as pd
df = pd.read_csv('all_050_in_36.P1.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 15 columns):
GEOID           62 non-null int64
SUMLEV          62 non-null int64
STATE           62 non-null int64
COUNTY          62 non-null int64
CBSA            62 non-null int64
CSA             62 non-null int64
NECTA           0 non-null float64
CNECTA          0 non-null float64
NAME            62 non-null object
POP100          62 non-null int64
HU100           62 non-null int64
POP100.2000     62 non-null int64
HU100.2000      62 non-null int64
P001001         62 non-null int64
P001001.2000    62 non-null int64
dtypes: float64(2), int64(12), object(1)
memory usage: 7.3+ KB
In [14]:
print df.shape
df.head(3)
(62, 15)
Out[14]:
GEOID SUMLEV STATE COUNTY CBSA CSA NECTA CNECTA NAME POP100 HU100 POP100.2000 HU100.2000 P001001 P001001.2000
0 36003 50 36 3 99999 999 NaN NaN Allegany County 48946 26140 49927 24505 48946 49927
1 36005 50 36 5 35620 408 NaN NaN Bronx County 1385108 511896 1332650 490659 1385108 1332650
2 36007 50 36 7 13780 999 NaN NaN Broome County 200600 90563 200536 88817 200600 200536

save DataFrame to database through the sqlalchemy connection

In [21]:
df.to_sql('census',connection)

check tables in engine

In [45]:
engine.table_names()
Out[45]:
[]

check data in engine

In [46]:
connection.execute('SELECT * FROM census LIMIT 10').fetchall()
Out[46]:
[(0, 36003, 50, 36, 3, 99999, 999, None, None, u'Allegany County', 48946, 26140, 49927, 24505, 48946, 49927),
 (1, 36005, 50, 36, 5, 35620, 408, None, None, u'Bronx County', 1385108, 511896, 1332650, 490659, 1385108, 1332650),
 (2, 36007, 50, 36, 7, 13780, 999, None, None, u'Broome County', 200600, 90563, 200536, 88817, 200600, 200536),
 (3, 36009, 50, 36, 9, 36460, 160, None, None, u'Cattaraugus County', 80317, 41111, 83955, 39839, 80317, 83955),
 (4, 36011, 50, 36, 11, 12180, 532, None, None, u'Cayuga County', 80026, 36489, 81963, 35477, 80026, 81963),
 (5, 36013, 50, 36, 13, 27460, 999, None, None, u'Chautauqua County', 134905, 66920, 139750, 64900, 134905, 139750),
 (6, 36015, 50, 36, 15, 21300, 999, None, None, u'Chemung County', 88830, 38369, 91070, 37745, 88830, 91070),
 (7, 36017, 50, 36, 17, 99999, 999, None, None, u'Chenango County', 50477, 24710, 51401, 23890, 50477, 51401),
 (8, 36019, 50, 36, 19, 38460, 999, None, None, u'Clinton County', 82128, 35888, 79894, 33091, 82128, 79894),
 (9, 36021, 50, 36, 21, 26460, 104, None, None, u'Columbia County', 63096, 32775, 63094, 30207, 63096, 63094)]
In [24]:
connection.execute('SELECT COUNT(*) FROM cencus LIMIT 10').fetchall()
Out[24]:
[(62,)]

Metadata

  • Metadata is literally "data about data." This term refers to information about data itself -- perhaps the origin, size, formatting or other characteristics of a data item. In the database field, metadata is essential to understanding and interpreting the contents of a data warehouse.

Autoloading Tables from a Database

reflection

● Reflection reads database and builds SQLAlchemy Table objects

  • SQLAlchemy can be used to automatically load tables from a database using something called reflection
    • Reflection is the process of reading the database and building the metadata based on that information.
      • It's the opposite of creating a Table by hand and is very useful for working with existing databases.
  • To perform reflection, we need to import the Table object from the SQLAlchemy package.
    • Then we use the Table object to read our table from the engine and autoload the columns.
      • Using the Table object in this manner looks like we are using a function.

Your job is to reflect the census table available on your engine into the census variable.

In [25]:
from sqlalchemy import MetaData, Table

metadata = MetaData()

# Reflect census table from the engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)

# Print census table metadata
print(repr(census))
Table('census', MetaData(bind=None), Column('index', BIGINT(), table=<census>), Column('GEOID', BIGINT(), table=<census>), Column('SUMLEV', BIGINT(), table=<census>), Column('STATE', BIGINT(), table=<census>), Column('COUNTY', BIGINT(), table=<census>), Column('CBSA', BIGINT(), table=<census>), Column('CSA', BIGINT(), table=<census>), Column('NECTA', FLOAT(), table=<census>), Column('CNECTA', FLOAT(), table=<census>), Column('NAME', TEXT(), table=<census>), Column('POP100', BIGINT(), table=<census>), Column('HU100', BIGINT(), table=<census>), Column('POP100.2000', BIGINT(), table=<census>), Column('HU100.2000', BIGINT(), table=<census>), Column('P001001', BIGINT(), table=<census>), Column('P001001.2000', BIGINT(), table=<census>), schema=None)

Viewing Table Details

  • census.columns.keys() would return a list of column names of the census table.

  • use the metadata container to find out more details about the reflected table such as the columns and their types.

    • table objects are stored in the metadata.tables dictionary
      • get the metadata of our census table with metadata.tables['census']. This is similar to your use of the repr() function on the census table from the previous exercise.
In [30]:
# Print the column names
print census.columns.keys()
['index', 'GEOID', 'SUMLEV', 'STATE', 'COUNTY', 'CBSA', 'CSA', 'NECTA', 'CNECTA', 'NAME', 'POP100', 'HU100', 'POP100.2000', 'HU100.2000', 'P001001', 'P001001.2000']
In [31]:
# Print full table metadata
print(repr(metadata.tables['census']))
Table('census', MetaData(bind=None), Column('index', BIGINT(), table=<census>), Column('GEOID', BIGINT(), table=<census>), Column('SUMLEV', BIGINT(), table=<census>), Column('STATE', BIGINT(), table=<census>), Column('COUNTY', BIGINT(), table=<census>), Column('CBSA', BIGINT(), table=<census>), Column('CSA', BIGINT(), table=<census>), Column('NECTA', FLOAT(), table=<census>), Column('CNECTA', FLOAT(), table=<census>), Column('NAME', TEXT(), table=<census>), Column('POP100', BIGINT(), table=<census>), Column('HU100', BIGINT(), table=<census>), Column('POP100.2000', BIGINT(), table=<census>), Column('HU100.2000', BIGINT(), table=<census>), Column('P001001', BIGINT(), table=<census>), Column('P001001.2000', BIGINT(), table=<census>), schema=None)

SQL Statements

● Select, Insert, Update & Delete data

● Create & Alter data

Basic SQL querying

● SELECT column_name FROM table_name

● SELECT pop2008 FROM People

● SELECT * FROM People

conventional way of executing sql

  • applying the .execute() method on our connection, we can leverage a raw SQL query to query all the records in our census table.
    • The object returned by the .execute() method is a ResultProxy.
      • On this ResultProxy, we can then use the .fetchall() method to get our results - that is, the ResultSet.
In [32]:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///census.sqlite')

engine = create_engine('sqlite:///census_nyc.sqlite')

stmt = 'SELECT * FROM census'
result_proxy = connection.execute(stmt)
results = result_proxy.fetchall()
In [42]:
print type(results)

print type(results)
print results[0]

print type(results[0])

print 
print results[0].keys()
print results[0].NAME
<type 'list'>
<type 'list'>
(0, 36003, 50, 36, 3, 99999, 999, None, None, u'Allegany County', 48946, 26140, 49927, 24505, 48946, 49927)
<class 'sqlalchemy.engine.result.RowProxy'>

[u'index', u'GEOID', u'SUMLEV', u'STATE', u'COUNTY', u'CBSA', u'CSA', u'NECTA', u'CNECTA', u'NAME', u'POP100', u'HU100', u'2000', u'2000', u'P001001', u'2000']
Allegany County

use SQLAlchemy to build Queries

● Provides a Pythonic way to build SQL statements

● Hides differences between backend database types

  • SQLAlchemy provides a nice "Pythonic" way of interacting with databases.
    • So rather than dealing with the differences between specific dialects of traditional SQL such as MySQL or PostgreSQL, you can leverage the Pythonic framework of SQLAlchemy to streamline your workflow and more efficiently query your data.
      • For this reason, it is worth learning even if you may already be familiar with traditional SQL.
  • make use of the select() function of the sqlalchemy module.
    • This function requires a list of tables or columns as the only required argument.
In [52]:
from sqlalchemy import Table, MetaData, select
metadata = MetaData()

census = Table('census', metadata, autoload=True, autoload_with=engine)

stmt = select([census])

results = connection.execute(stmt).fetchall()

print type(stmt)
print stmt
<class 'sqlalchemy.sql.selectable.Select'>
SELECT census.index, census."GEOID", census."SUMLEV", census."STATE", census."COUNTY", census."CBSA", census."CSA", census."NECTA", census."CNECTA", census."NAME", census."POP100", census."HU100", census."POP100.2000", census."HU100.2000", census."P001001", census."P001001.2000" 
FROM census
In [56]:
print type(results), type(results[0]), len(results)
<type 'list'> <class 'sqlalchemy.engine.result.RowProxy'> 62

SQLAlchemy Select Statement

● Requires a list of one or more Tables or Columns

● Using a table will select all the columns in it

In [58]:
first = results[0]
type(first)
Out[58]:
sqlalchemy.engine.result.RowProxy
In [60]:
print dir(first)
['__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_keymap', '_op', '_parent', '_processors', '_row', 'has_key', 'items', 'iterkeys', 'itervalues', 'keys', 'values']
In [63]:
print first.keys()
['index', 'GEOID', 'SUMLEV', 'STATE', 'COUNTY', 'CBSA', 'CSA', 'NECTA', 'CNECTA', 'NAME', 'POP100', 'HU100', 'POP100.2000', 'HU100.2000', 'P001001', 'P001001.2000']
In [68]:
first['POP100'], first['NAME']
Out[68]:
(48946, u'Allegany County')
  • you can get data from the first column by either using first_row[0] or by column name such as first_row['column_name']
In [73]:
first[9], first['NAME']
Out[73]:
(u'Allegany County', u'Allegany County')

Handling a ResultSet

Recall the differences between a ResultProxy and a ResultSet:

  • ResultProxy: The object returned by the .execute() method. It can be used in a variety of ways to get the data returned by the query.

  • ResultSet: The actual data asked for in the query when using a fetch method such as .fetchall() on a ResultProxy.

This separation between the ResultSet and ResultProxy allows us to fetch as much or as little data as we desire.

In [ ]: