Get data from the web using Python (Part 1): BeautifulSoup, requests, urllib
Contents:
- basics
- using urllib
- using requests
- parsing HTML with BeautifulSoup
Importing flat files from the web
Note: this notebook uses Python 2.
The dataset comes from the University of California, Irvine's Machine Learning Repository: http://archive.ics.uci.edu/ml/index.html
- 'winequality-red.csv': a flat file containing tabular data on the physicochemical properties of red wine, such as pH, alcohol content, and citric acid content, along with a wine quality rating.
In [1]:
# Import package
import urllib
# Import pandas
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Save the file locally (in Python 2, urlretrieve lives directly in urllib)
urllib.urlretrieve(url, 'winequality-red.csv')
# Read the local file into a DataFrame and print its shape
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.shape)
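For reference, a minimal Python 3 sketch of the same download (in Python 3, urlretrieve moved into the urllib.request submodule):

# Python 3 equivalent of the cell above
from urllib.request import urlretrieve
import pandas as pd

url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
urlretrieve(url, 'winequality-red.csv')
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.shape)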
In [2]:
df.head(3)
Opening and reading flat files from the web
- To load a file from the web into a DataFrame without first saving it locally, you can pass the URL directly to pandas.
In [3]:
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Read file into a DataFrame directly from the URL: df
df = pd.read_csv(url, sep=';')
# Print the shape of the DataFrame
print(df.shape)
# Plot a histogram of the first column of df (.iloc replaces the deprecated .ix)
df.iloc[:, 0:1].hist(alpha=.4, figsize=(6, 3))
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
Importing non-flat files from the web
- use pd.read_excel() to import an Excel spreadsheet.
In [4]:
# Import package
import pandas as pd
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'
# Read in all sheets of the Excel file: xl
# (sheetname=None returns a dict of all sheets; newer pandas renamed this keyword to sheet_name)
xl = pd.read_excel(url, sheetname=None)
# Print the sheet names to the shell
print(xl.keys())
# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())
print(type(xl))
print(type(xl['1700']))
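Since xl is a dict mapping sheet names to DataFrames, you can iterate over every sheet; a quick sketch:

# Keys are sheet names (strings), values are DataFrames
for sheet_name in xl:
    print(sheet_name, xl[sheet_name].shape)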
In [5]:
from urllib2 import urlopen, Request

# Package the GET request, send it, and catch the response
request = Request('http://jishichao.com')
response = urlopen(request)
# Read the raw HTML from the response, then close it
html = response.read()
response.close()
In [6]:
print(type(html))
print(len(html))
Printing HTTP request results in Python using urllib
- You have just packaged and sent a GET request to "http://docs.datacamp.com/teach/" and then caught the response. You saw that such a response is an HTTP response object (http.client.HTTPResponse in Python 3; urllib2's urlopen returns a similar file-like object in Python 2). The question remains: what can you do with this response?
- Well, as it came from an HTML page, you could read it to extract the HTML; in fact, such a response object has an associated read() method.
In [7]:
# Import packages
from urllib2 import urlopen, Request
# Specify the url
url = "http://docs.datacamp.com/teach/"
# This packages the request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Extract the response: html
html = response.read()
print(type(html))
print('')
# Print the html
print(html[:300])
# Be polite and close the response!
response.close()
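To avoid forgetting the close() call, you can wrap the response in contextlib.closing(), which closes it automatically when the with-block exits; a small sketch using the same url:

from contextlib import closing
from urllib2 import urlopen, Request

# closing() guarantees response.close() runs, even if read() raises
with closing(urlopen(Request(url))) as response:
    html = response.read()
print(html[:100])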
In [8]:
import requests

# requests packages the request, sends it, and catches the response in one call
r = requests.get('http://jishichao.com')
# The text attribute holds the response body decoded to unicode
text = r.text
In [9]:
print(type(text))                   # unicode in Python 2
print(type(text.encode('utf-8')))  # encoding yields a byte string
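Beyond r.text, a requests Response object also exposes the status code, the encoding requests guessed from the headers, and the raw bytes; a short sketch:

import requests

r = requests.get('http://jishichao.com')
print(r.status_code)    # 200 means the request succeeded
print(r.encoding)       # encoding guessed from the response headers
print(type(r.content))  # raw body as a byte string, before decoding to r.text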
Beautiful Soup
Parsing HTML with BeautifulSoup
- Import the function BeautifulSoup from the package bs4.
- Package the request to the URL, send the request, and catch the response with the single function requests.get(), assigning the response to the variable r.
- Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
- Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup().
- Use the method prettify() on soup and assign the result to pretty_soup.
In [14]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()
# Print the type and the first part of the prettified HTML
print(type(pretty_soup))
print('')
print(pretty_soup[:300])
Turning a webpage into data using BeautifulSoup: getting the text
- Extract the title from the HTML soup soup using the attribute title and assign the result to guido_title.
- Extract the text from the HTML soup soup using the method get_text() and assign to guido_text.
In [11]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Get the title of Guido's webpage: guido_title
guido_title = soup.title
# Print the title of Guido's webpage to the shell
print(guido_title)
# Get Guido's text: guido_text
guido_text = soup.get_text()
# Print Guido's text to the shell
print(guido_text[100:300])
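Note that soup.title returns the whole <title> tag; to get just the enclosed text, use its string attribute. A quick sketch:

# soup.title is a Tag object; .string gives the text inside the tag
print(soup.title.string)
# get_text(strip=True) also trims leading/trailing whitespace
print(soup.get_text(strip=True)[:100])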
Turning a webpage into data using BeautifulSoup: getting the hyperlinks
- Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a>; store the result in the variable a_tags.
- The variable a_tags is a ResultSet: your job now is to iterate over it with a for loop and print the actual URLs of the hyperlinks; to do this, for every element link in a_tags, print link.get('href').
In [12]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Print the title of Guido's webpage
print(soup.title)
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')
# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
In [13]:
print(type(a_tags), type(a_tags[0]))
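Many href values are relative paths; to turn them into absolute URLs, resolve each one against the page URL with urljoin (from urlparse in Python 2, urllib.parse in Python 3). A sketch:

from urlparse import urljoin

# Resolve each (possibly relative) href against the page URL
for link in a_tags:
    href = link.get('href')
    if href:  # some <a> tags carry no href attribute
        print(urljoin(url, href))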