Simple URL-based APIs tutorial
What is an API?
- A set of protocols and routines
- A bunch of code
- Allows two software programs to communicate with each other
In [7]:
import requests
url = 'http://www.omdbapi.com/?t=Split'
r = requests.get(url)
json_data = r.json()
for key, value in json_data.items():
    print(key + ':', value)

with open("a_movie.json", 'w+') as save:
    save.write(r.text)
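Since r.json() already decodes the response into a dictionary, an alternative to writing the raw response text is serializing that dictionary with json.dump. A minimal sketch, with a small sample dictionary standing in for the real decoded response:

```python
import json

# Sample dict standing in for the dictionary decoded from the response
json_data = {'Title': 'Split', 'Year': '2016'}

with open('a_movie.json', 'w') as save:
    json.dump(json_data, save, indent=2)   # pretty-printed JSON on disk

with open('a_movie.json') as f:
    print(json.load(f) == json_data)       # True: round-trips to the same dict
```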
Loading and exploring a JSON
- with open(file_path) as file:
In [9]:
import json
# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
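For nested JSON, json.dumps with indent makes the structure easier to scan than printing the keys one by one. A sketch with a made-up nested dictionary:

```python
import json

# Made-up nested structure, for illustration only
json_data = {'Title': 'Split', 'Ratings': [{'Source': 'IMDb', 'Value': '7.3/10'}]}

# indent=2 pretty-prints the nesting; sort_keys gives stable ordering
print(json.dumps(json_data, indent=2, sort_keys=True))
```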
API requests
- Pull some movie data down from the Open Movie Database (OMDb) using their API.
- The movie you'll query the API about is The Social Network.
- The query string should have one argument: t=social+network.
- Apply the json() method to the response object r and store the resulting dictionary in the variable json_data.
In [20]:
# Import requests package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com/?t=social+network'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Print the text of the response
print(r.text)
print(type(r.text))
print(type(r.json()))

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for key in json_data.keys():
    print(key + ': ', json_data[key])
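Rather than spelling out the query string by hand, the standard library's urllib.parse.urlencode can build it from a dictionary (a sketch; only the t parameter from the exercise is used):

```python
from urllib.parse import urlencode

# Build the OMDb query string from a dict instead of by hand
params = {'t': 'social network'}
url = 'http://www.omdbapi.com/?' + urlencode(params)
print(url)  # http://www.omdbapi.com/?t=social+network
```

urlencode takes care of escaping, turning the space into + automatically.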
Wikipedia API
In [2]:
# Import package
import requests
# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=machine+learning'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print the Wikipedia page extract
ml_extract = json_data['query']['pages']['233488']['extract']
print(ml_extract)
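The numeric page ID ('233488' above) differs per article, so hardcoding it breaks for any other title. A sketch that iterates over whatever page IDs come back instead (the nested dictionary below just mimics the shape of the API response):

```python
# Sample response shaped like the Wikipedia API result above
json_data = {'query': {'pages': {'233488': {'extract': '<p>Machine learning is...</p>'}}}}

# Works for any page ID, no hardcoding needed
for page_id, page in json_data['query']['pages'].items():
    print(page.get('extract', ''))
```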
Get data from the web using Python - 1: BeautifulSoup, requests, urllib
Basics
- urllib
- requests
- BeautifulSoup
Importing flat files from the web
- Data from the University of California, Irvine's Machine Learning Repository: http://archive.ics.uci.edu/ml/index.html
- 'winequality-red.csv': a flat file containing tabular data on physiochemical properties of red wine, such as pH, alcohol content and citric acid content, along with a wine quality rating.
In [1]:
# Import urlretrieve (Python 3; in Python 2 this was urllib.urlretrieve)
from urllib.request import urlretrieve
# Import pandas
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Save file locally
urlretrieve(url, 'winequality-red.csv')
# Read file into a DataFrame and print its shape
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.shape)
In [2]:
df.head(3)
Out[2]:
Opening and reading flat files from the web
- To load a file from the web into a DataFrame without first saving it locally, you can use pandas directly.
In [3]:
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')
# Print the shape of the DataFrame
# print(df.head())
print(df.shape)
# Plot first column of df
pd.DataFrame.hist(df.iloc[:, 0:1], alpha=.4, figsize=(6, 3))
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
Importing non-flat files from the web
- use pd.read_excel() to import an Excel spreadsheet.
In [4]:
# Import package
import pandas as pd
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'
# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheet_name=None)
# Print the sheetnames to the shell
print(xl.keys())
# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())
print(type(xl))
type(xl['1700'])
Out[4]:
In [5]:
from urllib.request import urlopen, Request
request = Request('http://jishichao.com')
response = urlopen(request)
html = response.read()
response.close()
In [6]:
print(type(html))
len(html)
Out[6]:
Printing HTTP request results in Python using urllib
- You have just packaged and sent a GET request to "http://docs.datacamp.com/teach/" and caught the response. You saw that such a response is an http.client.HTTPResponse object. The question remains: what can you do with this response?
- Well, as it came from an HTML page, you can read it to extract the HTML; in fact, an http.client.HTTPResponse object has an associated read() method.
In [7]:
# Import packages
from urllib.request import urlopen, Request
# Specify the url
url = "http://docs.datacamp.com/teach/"
# This packages the request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Extract the response: html
html = response.read()
print(type(html))
print()
# Print the html
print(html[:300])
# Be polite and close the response!
response.close()
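A with statement closes the response automatically, making the explicit response.close() unnecessary. A sketch of the pattern, using a data: URL so it runs without network access:

```python
from urllib.request import urlopen

# The context manager closes the response on exit
with urlopen('data:,Hello') as response:
    html = response.read()

print(html)  # b'Hello'
```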
In [8]:
import requests
r = requests.get('http://jishichao.com')
text = r.text
In [9]:
print(type(text))
print(type(text.encode('utf-8')))
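The encode call above returns bytes, the raw UTF-8 representation of the str. A minimal sketch of that round trip:

```python
# str in, bytes out, and back again
text = 'café'
raw = text.encode('utf-8')          # bytes: b'caf\xc3\xa9'
print(type(text).__name__, type(raw).__name__)  # str bytes
print(raw.decode('utf-8') == text)  # True
```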
Beautiful Soup
Parsing HTML with BeautifulSoup
- Import the function BeautifulSoup from the package bs4
- Package the request to the URL, send the request and catch the response with a single function requests.get(), assigning the response to the variable r.
- Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
- Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup()
- Use the method prettify() on soup and assign the result to pretty_soup
In [14]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extracts the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()
# Print the response
print(type(pretty_soup))
print()
print(pretty_soup[:300])
Turning a webpage into data using BeautifulSoup: getting the text
- Extract the title from the HTML soup soup using the attribute title and assign the result to guido_title.
- Extract the text from the HTML soup soup using the method get_text() and assign to guido_text.
In [11]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Get the title of Guido's webpage: guido_title
guido_title = soup.title
# Print the title of Guido's webpage to the shell
print(guido_title)
# Get Guido's text: guido_text
guido_text = soup.get_text()
# Print Guido's text to the shell
print(guido_text[100:300])
Turning a webpage into data using BeautifulSoup: getting the hyperlinks
- Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a>; store the result in the variable a_tags.
- The variable a_tags is a result set: your job now is to iterate over it with a for loop and print the actual URLs of the hyperlinks; to do this, for every element link in a_tags, print link.get('href').
In [12]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extracts the response as html: html_doc
html_doc = r.text
# create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Print the title of Guido's webpage
print(soup.title)
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')
# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
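For comparison, the same href extraction can be done with the standard library's html.parser alone. A sketch using an inline snippet of HTML (the hrefs are made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')

parser = LinkCollector()
parser.feed('<a href="http://jishichao.com">home</a><a href="/about">about</a>')
print(parser.links)  # ['http://jishichao.com', '/about']
```

BeautifulSoup is more convenient for real pages, but this shows there is no magic behind find_all('a').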
In [13]:
type(a_tags), type(a_tags[0])
Out[13]: