Connecting to a database¶

● Engine: common interface to the database from SQLAlchemy

● Connection string: All the details required to find the database (and login, if necessary)

__¶

● 'sqlite:///census_nyc.sqlite'

Driver+Dialect Filename

create an engine¶

connect to engine¶

● Before querying your database, you’ll want to know what is in it: what the tables are, for example:

Engines and Connection Strings¶

An engine is just a common interface to a database, and the information it requires to connect to one is contained in a connection string, such as sqlite:///census_nyc.sqlite

Here, sqlite is the database driver, while census_nyc.sqlite is the path to a SQLite file contained in the local directory named census_nyc.sqlite.

document¶

http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls

create an engine that connects to a local SQLite file named census.sqlite. Then, print the names of the tables it contains using the .table_names() method. Note that when you just want to print the table names, you do not need to use engine.connect() after creating the engine.

from sqlalchemy import create_engine

engine = create_engine('sqlite:///census.sqlite')

print engine.table_names()

connection = engine.connect()

[u'cencus', u'census']

read a csv file and save the data to a sqlite database¶

import pandas as pd
df = pd.read_csv('all_050_in_36.P1.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 15 columns):
GEOID           62 non-null int64
SUMLEV          62 non-null int64
STATE           62 non-null int64
COUNTY          62 non-null int64
CBSA            62 non-null int64
CSA             62 non-null int64
NECTA           0 non-null float64
CNECTA          0 non-null float64
NAME            62 non-null object
POP100          62 non-null int64
HU100           62 non-null int64
POP100.2000     62 non-null int64
HU100.2000      62 non-null int64
P001001         62 non-null int64
P001001.2000    62 non-null int64
dtypes: float64(2), int64(12), object(1)
memory usage: 7.3+ KB

print df.shape
df.head(3)

(62, 15)

save DataFrame to database through the sqlalchemy connection¶

df.to_sql('census',connection)

check tables in engine¶

engine.table_names()

[]

check data in engine¶

connection.execute('SELECT * FROM census LIMIT 10').fetchall()

[(0, 36003, 50, 36, 3, 99999, 999, None, None, u'Allegany County', 48946, 26140, 49927, 24505, 48946, 49927),
 (1, 36005, 50, 36, 5, 35620, 408, None, None, u'Bronx County', 1385108, 511896, 1332650, 490659, 1385108, 1332650),
 (2, 36007, 50, 36, 7, 13780, 999, None, None, u'Broome County', 200600, 90563, 200536, 88817, 200600, 200536),
 (3, 36009, 50, 36, 9, 36460, 160, None, None, u'Cattaraugus County', 80317, 41111, 83955, 39839, 80317, 83955),
 (4, 36011, 50, 36, 11, 12180, 532, None, None, u'Cayuga County', 80026, 36489, 81963, 35477, 80026, 81963),
 (5, 36013, 50, 36, 13, 27460, 999, None, None, u'Chautauqua County', 134905, 66920, 139750, 64900, 134905, 139750),
 (6, 36015, 50, 36, 15, 21300, 999, None, None, u'Chemung County', 88830, 38369, 91070, 37745, 88830, 91070),
 (7, 36017, 50, 36, 17, 99999, 999, None, None, u'Chenango County', 50477, 24710, 51401, 23890, 50477, 51401),
 (8, 36019, 50, 36, 19, 38460, 999, None, None, u'Clinton County', 82128, 35888, 79894, 33091, 82128, 79894),
 (9, 36021, 50, 36, 21, 26460, 104, None, None, u'Columbia County', 63096, 32775, 63094, 30207, 63096, 63094)]

connection.execute('SELECT COUNT(*) FROM cencus LIMIT 10').fetchall()

[(62,)]

Metadata¶

Metadata is literally "data about data." This term refers to information about data itself -- perhaps the origin, size, formatting or other characteristics of a data item. In the database field, metadata is essential to understanding and interpreting the contents of a data warehouse.

Autoloading Tables from a Database¶

reflection¶

● Reflection reads database and builds SQLAlchemy Table objects

SQLAlchemy can be used to automatically load tables from a database using something called reflection
- Reflection is the process of reading the database and building the metadata based on that information.
  - It's the opposite of creating a Table by hand and is very useful for working with existing databases.

To perform reflection, we need to import the Table object from the SQLAlchemy package.
- Then we use the Table object to read our table from the engine and autoload the columns.
  - Using the Table object in this manner looks like we are using a function.

Your job is to reflect the census table available on your engine into the census variable.

from sqlalchemy import MetaData, Table

metadata = MetaData()

# Reflect census table from the engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)

# Print census table metadata
print(repr(census))

Table('census', MetaData(bind=None), Column('index', BIGINT(), table=<census>), Column('GEOID', BIGINT(), table=<census>), Column('SUMLEV', BIGINT(), table=<census>), Column('STATE', BIGINT(), table=<census>), Column('COUNTY', BIGINT(), table=<census>), Column('CBSA', BIGINT(), table=<census>), Column('CSA', BIGINT(), table=<census>), Column('NECTA', FLOAT(), table=<census>), Column('CNECTA', FLOAT(), table=<census>), Column('NAME', TEXT(), table=<census>), Column('POP100', BIGINT(), table=<census>), Column('HU100', BIGINT(), table=<census>), Column('POP100.2000', BIGINT(), table=<census>), Column('HU100.2000', BIGINT(), table=<census>), Column('P001001', BIGINT(), table=<census>), Column('P001001.2000', BIGINT(), table=<census>), schema=None)

Viewing Table Details¶

census.columns.keys() would return a list of column names of the census table.
use the metadata container to find out more details about the reflected table such as the columns and their types.
- table objects are stored in the metadata.tables dictionary
  - get the metadata of our census table with metadata.tables['census']. This is similar to your use of the repr() function on the census table from the previous exercise.

# Print the column names
print census.columns.keys()

['index', 'GEOID', 'SUMLEV', 'STATE', 'COUNTY', 'CBSA', 'CSA', 'NECTA', 'CNECTA', 'NAME', 'POP100', 'HU100', 'POP100.2000', 'HU100.2000', 'P001001', 'P001001.2000']

# Print full table metadata
print(repr(metadata.tables['census']))

Table('census', MetaData(bind=None), Column('index', BIGINT(), table=<census>), Column('GEOID', BIGINT(), table=<census>), Column('SUMLEV', BIGINT(), table=<census>), Column('STATE', BIGINT(), table=<census>), Column('COUNTY', BIGINT(), table=<census>), Column('CBSA', BIGINT(), table=<census>), Column('CSA', BIGINT(), table=<census>), Column('NECTA', FLOAT(), table=<census>), Column('CNECTA', FLOAT(), table=<census>), Column('NAME', TEXT(), table=<census>), Column('POP100', BIGINT(), table=<census>), Column('HU100', BIGINT(), table=<census>), Column('POP100.2000', BIGINT(), table=<census>), Column('HU100.2000', BIGINT(), table=<census>), Column('P001001', BIGINT(), table=<census>), Column('P001001.2000', BIGINT(), table=<census>), schema=None)

SQL Statements¶

● Select, Insert, Update & Delete data

● Create & Alter data

Basic SQL querying¶

● SELECT column_name FROM table_name

● SELECT pop2008 FROM People

● SELECT * FROM People

conventional way of executing sql¶

applying the .execute() method on our connection, we can leverage a raw SQL query to query all the records in our census table.
- The object returned by the .execute() method is a ResultProxy.
  - On this ResultProxy, we can then use the .fetchall() method to get our results - that is, the ResultSet.

from sqlalchemy import create_engine

engine = create_engine('sqlite:///census.sqlite')

engine = create_engine('sqlite:///census_nyc.sqlite')

stmt = 'SELECT * FROM census'
result_proxy = connection.execute(stmt)
results = result_proxy.fetchall()

print type(results)

print type(results)
print results[0]

print type(results[0])

print 
print results[0].keys()
print results[0].NAME

<type 'list'>
<type 'list'>
(0, 36003, 50, 36, 3, 99999, 999, None, None, u'Allegany County', 48946, 26140, 49927, 24505, 48946, 49927)
<class 'sqlalchemy.engine.result.RowProxy'>

[u'index', u'GEOID', u'SUMLEV', u'STATE', u'COUNTY', u'CBSA', u'CSA', u'NECTA', u'CNECTA', u'NAME', u'POP100', u'HU100', u'2000', u'2000', u'P001001', u'2000']
Allegany County

use SQLAlchemy to build Queries¶

● Provides a Pythonic way to build SQL statements

● Hides differences between backend database types

SQLAlchemy provides a nice "Pythonic" way of interacting with databases.
- So rather than dealing with the differences between specific dialects of traditional SQL such as MySQL or PostgreSQL, you can leverage the Pythonic framework of SQLAlchemy to streamline your workflow and more efficiently query your data.
  - For this reason, it is worth learning even if you may already be familiar with traditional SQL.

make use of the select() function of the sqlalchemy module.
- This function requires a list of tables or columns as the only required argument.

from sqlalchemy import Table, MetaData, select
metadata = MetaData()

census = Table('census', metadata, autoload=True, autoload_with=engine)

stmt = select([census])

results = connection.execute(stmt).fetchall()

print type(stmt)
print stmt

<class 'sqlalchemy.sql.selectable.Select'>
SELECT census.index, census."GEOID", census."SUMLEV", census."STATE", census."COUNTY", census."CBSA", census."CSA", census."NECTA", census."CNECTA", census."NAME", census."POP100", census."HU100", census."POP100.2000", census."HU100.2000", census."P001001", census."P001001.2000" 
FROM census

print type(results), type(results[0]), len(results)

<type 'list'> <class 'sqlalchemy.engine.result.RowProxy'> 62

SQLAlchemy Select Statement¶

● Requires a list of one or more Tables or Columns

● Using a table will select all the columns in it

first = results[0]
type(first)

sqlalchemy.engine.result.RowProxy

print dir(first)

['__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_keymap', '_op', '_parent', '_processors', '_row', 'has_key', 'items', 'iterkeys', 'itervalues', 'keys', 'values']

print first.keys()

['index', 'GEOID', 'SUMLEV', 'STATE', 'COUNTY', 'CBSA', 'CSA', 'NECTA', 'CNECTA', 'NAME', 'POP100', 'HU100', 'POP100.2000', 'HU100.2000', 'P001001', 'P001001.2000']

first['POP100'], first['NAME']

(48946, u'Allegany County')

you can get data from the first column by either using first_row[0] or by column name such as first_row['column_name']

first[9], first['NAME']

(u'Allegany County', u'Allegany County')

Handling a ResultSet¶

Recall the differences between a ResultProxy and a ResultSet:

ResultProxy: The object returned by the .execute() method. It can be used in a variety of ways to get the data returned by the query.
ResultSet: The actual data asked for in the query when using a fetch method such as .fetchall() on a ResultProxy.

This separation between the ResultSet and ResultProxy allows us to fetch as much or as little data as we desire.

	GEOID	SUMLEV	STATE	COUNTY	CBSA	CSA	NECTA	CNECTA	NAME	POP100	HU100	POP100.2000	HU100.2000	P001001	P001001.2000
0	36003	50	36	3	99999	999	NaN	NaN	Allegany County	48946	26140	49927	24505	48946	49927
1	36005	50	36	5	35620	408	NaN	NaN	Bronx County	1385108	511896	1332650	490659	1385108	1332650
2	36007	50	36	7	13780	999	NaN	NaN	Broome County	200600	90563	200536	88817	200600	200536