Get data from the web using Python (1): BeautifulSoup, requests, urllib

basics

using urllib

requests

beautifulsoup

 

Importing flat files from the web

  • This notebook uses Python 2.

  • The data comes from the University of California, Irvine Machine Learning Repository:

http://archive.ics.uci.edu/ml/index.html

  • The flat file 'winequality-red.csv' contains tabular data on the physicochemical properties of red wine, such as pH, alcohol content and citric acid content, along with a wine quality rating.
In [1]:
# Import package
import urllib

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urllib.urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print df.shape
(1599, 12)
In [2]:
df.head(3)
Out[2]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
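
The cell above relies on Python 2's urllib.urlretrieve. As a rough sketch of the same download-then-read step under Python 3, where the function lives in urllib.request: the snippet below is kept offline on purpose, so it writes a tiny semicolon-separated stand-in file (invented sample rows, not the real dataset) and fetches it through a file:// URL instead of the S3 address. The names sample_rows, source_path and local_copy are illustrative assumptions, and the standard csv module stands in for pd.read_csv(..., sep=';').

```python
import csv
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # Python 3 home of urlretrieve

# Build a tiny ';'-separated stand-in for winequality-red.csv (made-up rows).
sample_rows = [
    ["fixed acidity", "pH", "quality"],
    ["7.4", "3.51", "5"],
    ["7.8", "3.20", "5"],
]
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source.csv")
with open(source_path, "w", newline="") as f:
    csv.writer(f, delimiter=";").writerows(sample_rows)

# A file:// URL keeps the sketch offline; swap in the real https URL to download.
url = Path(source_path).as_uri()

# Save the file locally, as urllib.urlretrieve does in the Python 2 cell.
local_copy = os.path.join(workdir, "winequality-red.csv")
urlretrieve(url, local_copy)

# Read the saved copy back; delimiter=';' plays the role of sep=';'.
with open(local_copy, newline="") as f:
    rows = list(csv.reader(f, delimiter=";"))

header, data = rows[0], rows[1:]
print(len(data), len(header))  # number of data rows, number of columns
```

With the real URL in place of the file:// one, the remaining lines should work unchanged, though the full dataset is better read with pandas as shown above.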

Opening and reading flat files from the web

  • To load a file from the web into a DataFrame without first saving it locally, pass the URL directly to pd.read_csv().
In [3]:
# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# Print the shape of the DataFrame
print df.shape

# Plot first column of df (iloc selects by position; the older .ix indexer has been removed from pandas)
pd.DataFrame.hist(df.iloc[:, 0:1], alpha=.4, figsize=(6,3))
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
(1599, 12)

Importing non-flat files from the web

  • Use pd.read_excel() to import an Excel spreadsheet; passing sheetname=None reads every sheet into a dict keyed by sheet name.
In [4]:
# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl
# (sheetname=None returns a dict of DataFrames; pandas 0.21+ spells this argument sheet_name)
xl = pd.read_excel(url, sheetname=None)

# Print the sheetnames to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())

print type(xl)
type(xl['1700'])
[u'1700', u'1900']
                 country       1700
0            Afghanistan  34.565000
1  Akrotiri and Dhekelia  34.616667
2                Albania  41.312000
3                Algeria  36.720000
4         American Samoa -14.307000
<type 'dict'>
Out[4]:
pandas.core.frame.DataFrame
In [5]:
from urllib2 import urlopen, Request

request = Request('http://jishichao.com')

response = urlopen(request)

html = response.read()

response.close()
In [6]:
print type(html)
len(html)
<type 'str'>
Out[6]:
4843

Printing HTTP request results in Python using urllib

  • You have just packaged and sent a GET request to "http://docs.datacamp.com/teach/" and then caught the response. In Python 3 such a response is an http.client.HTTPResponse object; in Python 2, as used here, urllib2.urlopen() returns a comparable file-like object. The question remains: what can you do with this response?
  • Since it came from an HTML page, you can read it to extract the HTML: the response object has an associated read() method.
In [7]:
# Import packages
from urllib2 import urlopen, Request

# Specify the url
url = "http://docs.datacamp.com/teach/"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

print type(html)
print 
# Print the html
print(html[:300])


# Be polite and close the response!
response.close()
<type 'str'>

<!DOCTYPE html>
<link rel="shortcut icon" href="images/favicon.ico" />
<html>

  <head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>Home</title>
  <meta name="description" content="A
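
For readers on Python 3, here is a hedged sketch of the same request/response cycle. urllib2 no longer exists there; Request and urlopen live in urllib.request, and read() returns bytes that must be decoded. To stay runnable without network access, the snippet serves a small invented page from a temporary file via a file:// URL; page_path and the page content are assumptions for illustration only.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen  # Python 3 replacement for urllib2

# Write a small stand-in page so the sketch runs without network access.
page_path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(page_path, "w", encoding="utf-8") as f:
    f.write("<!DOCTYPE html>\n<html><head><title>Home</title></head></html>\n")

# Swap this for an "http://..." address to query a real server.
url = Path(page_path).as_uri()

# Package the request, send it, and let the with-block close the response.
request = Request(url)
with urlopen(request) as response:
    raw = response.read()  # bytes in Python 3
    # Use the declared charset if there is one, otherwise assume UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"

html = raw.decode(charset)  # decode the bytes to get text (str)
print(type(raw).__name__, type(html).__name__)
print(html[:300])
```

The with-block takes over the role of the explicit response.close() call in the Python 2 cell.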

Requests

  • requests offers a higher-level API than urllib/urllib2 and is very widely used.

Performing HTTP requests in Python using requests

  • Do the same using the higher-level requests library.
In [8]:
import requests
r = requests.get('http://jishichao.com')
text = r.text
In [9]:
print type(text)
print type(text.encode('utf-8'))
<type 'unicode'>
<type 'str'>
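
The two types printed above reflect the split between decoded text and raw bytes: r.text is text, and encode('utf-8') turns it into the byte string that travels over the wire. A minimal sketch of that round trip, using an invented sample string rather than the real page content (the syntax runs on both Python 2 and 3):

```python
# Decoded text, comparable to r.text (sample string is made up).
text = u"welcome 23333 \u8349\u6ce5\u9a6c"

# Encoding produces the byte representation.
raw = text.encode("utf-8")

# Each of the three CJK characters takes three bytes in UTF-8,
# so the byte string is longer than the text.
print(len(text), len(raw))

# Decoding with the same codec restores the original text.
restored = raw.decode("utf-8")
print(restored == text)
```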

Beautiful Soup

Parsing HTML with BeautifulSoup

  • Import the function BeautifulSoup from the package bs4
  • Package the request to the URL, send the request and catch the response with a single function requests.get(), assigning the response to the variable r.
  • Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
  • Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup()
  • Use the method prettify() on soup and assign the result to pretty_soup
In [14]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print type(pretty_soup)
print 
print(pretty_soup[:300])
<type 'unicode'>

<!DOCTYPE html>
<html lang="en">
 <head>
  <link href="../static/my.css" rel="stylesheet" type="text/css"/>
  <title>
   welcome 23333
  </title>
  <!--      <script>
        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
        (i[r].q=i[r].q||[]).push(arguments)},i[r

Turning a webpage into data using BeautifulSoup: getting the text

  • Extract the title from the BeautifulSoup object soup using the attribute title and assign the result to guido_title.
  • Extract the text from soup using the method get_text() and assign it to guido_text.
In [11]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
# (naming the parser explicitly avoids bs4's "no parser specified" warning)
soup = BeautifulSoup(html_doc, "lxml")

# Get the title of the webpage: guido_title
guido_title = soup.title

# Print the title of the webpage to the shell
print(guido_title)

# Get the text of the webpage: guido_text
guido_text = soup.get_text()

# Print part of the text to the shell
print(guido_text[100:300])
<title>welcome 23333 </title>

An interactive Data Visualization Web App I wrote

My Notebook website built by Python Flask deployed on AWS
A image downloader for a specific website 'worldcosplay', the program helps you open their

Turning a webpage into data using BeautifulSoup: getting the hyperlinks

  • Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a>; store the result in the variable a_tags.
  • The variable a_tags is a result set: enumerate over it with a for loop and print the actual URLs of the hyperlinks; that is, for every element link in a_tags, print() link.get('href').
In [12]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
# (naming the parser explicitly avoids bs4's "no parser specified" warning)
soup = BeautifulSoup(html_doc, "lxml")

# Print the title of the webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
<title>welcome 23333 </title>
http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=
      result&fr=&sf=1&fmq=1467292435965_R&pv=&ic=0&nc=1&z=&se=1&showtab
      =0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=草泥马动态图
http://shichaoji.com
http://www.jishichao.com:7777
http://www.jishichao.com:10086
https://jishichao.com
/windows0
/windows2
/windows1
/mac1
/linux64
/linux32
/plotting
./
In [13]:
type(a_tags), type(a_tags[0])
Out[13]:
(bs4.element.ResultSet, bs4.element.Tag)
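
Several of the hrefs printed above are relative (/plotting, ./). As a rough, dependency-free sketch of the same link extraction, the snippet below swaps BeautifulSoup for the standard library's html.parser and then resolves each href against the page URL with urljoin. This is a substitute technique, not what the cells above do; the LinkCollector class and the inline markup are hypothetical, loosely modelled on the output above. The code is written for Python 3.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # skip anchors that carry no href
                self.hrefs.append(href)


# Inline stand-in page (invented markup).
html_doc = """
<html><body>
  <a href="http://shichaoji.com">blog</a>
  <a href="/plotting">plots</a>
  <a href="./">home</a>
  <a name="no-href">anchor without a link</a>
</body></html>
"""

base_url = "http://jishichao.com/"
collector = LinkCollector()
collector.feed(html_doc)

# Resolve relative links against the page URL so every entry is absolute.
absolute_links = [urljoin(base_url, href) for href in collector.hrefs]
for link in absolute_links:
    print(link)
```

With BeautifulSoup itself, the same urljoin call could be applied to each link.get('href') value, after filtering out None for anchors without an href.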

Python odds and ends

 Unpacking argv: from sys import argv

a, b, c, … = argv

> python file.py b c …

argv will be a list containing

['file.py', 'b', 'c', …]

a = the name of the Python file itself

b, c, … are the arguments passed when executing the Python file
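
A small sketch of the unpacking described above. To keep it runnable anywhere, it unpacks a simulated argv list instead of the real sys.argv; the helper unpack_args and the sample arguments are assumptions made for illustration.

```python
def unpack_args(argv):
    """Split an argv-style list into the script name and two arguments."""
    # Unpacking is strict: the list must hold exactly three items.
    script, first, second = argv
    return script, first, second


# Simulates `python file.py apple banana` without touching sys.argv.
fake_argv = ["file.py", "apple", "banana"]
script, first, second = unpack_args(fake_argv)
print(script, first, second)

# Too few (or too many) arguments raise ValueError during unpacking.
try:
    unpack_args(["file.py", "apple"])
    unpack_failed = False
except ValueError:
    unpack_failed = True
print(unpack_failed)
```

In a real script, sys.argv would be passed in place of fake_argv.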

 The argument mode points to a string beginning with one of the following
 sequences (Additional characters may follow these sequences.):

 ``r''   Open text file for reading.  The stream is positioned at the
         beginning of the file.

 ``r+''  Open for reading and writing.  The stream is positioned at the
         beginning of the file.

 ``w''   Truncate file to zero length or create text file for writing.
         The stream is positioned at the beginning of the file.

 ``w+''  Open for reading and writing.  The file is created if it does not
         exist, otherwise it is truncated.  The stream is positioned at
         the beginning of the file.

 ``a''   Open for writing.  The file is created if it does not exist.  The
         stream is positioned at the end of the file.  Subsequent writes
         to the file will always end up at the then current end of file,
         irrespective of any intervening fseek(3) or similar.

 ``a+''  Open for reading and writing.  The file is created if it does not
         exist.  The stream is positioned at the end of the file.  Subsequent
         writes to the file will always end up at the then current end of
         file, irrespective of any intervening fseek(3) or similar.
                  | r   r+   w   w+   a   a+
------------------|--------------------------
read              | +   +        +        +
write             |     +    +   +    +   +
create            |          +   +    +   +
truncate          |          +   +
position at start | +   +    +   +
position at end   |                   +   +

where meanings are: (just to avoid any misinterpretation)

  • read – reading from file is allowed
  • write – writing to file is allowed
  • create – file is created if it does not exist yet
  • truncate – during opening of the file it is made empty (all content of the file is erased)
  • position at start – after file is opened, initial position is set to the start of the file
  • position at end – after file is opened, initial position is set to the end of the file

Commonly used file object methods:

  • close — Closes the file. Like File->Save.. in your editor.
  • read — Reads the contents of the file. You can assign the result to a variable.
  • readline — Reads just one line of a text file.
  • truncate — Empties the file. Watch out if you care about the file.
  • write('stuff') — Writes “stuff” to the file.
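
The mode table can be checked with a short experiment. The following is a minimal sketch using a throwaway temporary file (the path and contents are invented); it exercises 'w', 'a', 'r' and 'r+' and records what each one leaves behind.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# 'w' creates the file (or truncates an existing one) and writes from the start.
with open(path, "w") as f:
    f.write("first\n")

# 'a' creates the file if missing and always writes at the end.
with open(path, "a") as f:
    f.write("second\n")

# 'r' reads from the start; the file must already exist.
with open(path, "r") as f:
    after_append = f.read()

# 'r+' reads and writes without truncating; the position starts at 0,
# so this write overwrites the first five characters in place.
with open(path, "r+") as f:
    f.write("FIRST")
with open(path) as f:
    after_rplus = f.read()

# 'w' again: merely opening the file is enough to empty it.
with open(path, "w"):
    pass
size_after_w = os.path.getsize(path)

# 'r' on a missing file fails instead of creating it.
try:
    open(path + ".missing", "r")
    missing_opened = True
except IOError:
    missing_opened = False

print(repr(after_append), repr(after_rplus), size_after_w, missing_opened)
```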
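 File manipulation, manipulate.py:

from sys import argv

# argv[0] is the script name, argv[1] the file to read
script, input_file = argv

current_file = open(input_file)

def print_all(f):
    # Print everything from the current position to the end of the file
    print f.read()

def rewind(f):
    # Move the file position back to byte 0
    f.seek(0)

def print_a_line(line_count, f):
    # Print the line counter followed by the next line of the file
    print line_count, f.readline()

print_all(current_file)
rewind(current_file)

# current_line is an ordinary counter, incremented by hand
current_line = 1
print_a_line(current_line, current_file)

current_line = current_line + 1
print_a_line(current_line, current_file)

 $ python manipulate.py test.txt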

Why does seek(0) not set the current_line to 0?
First, the seek() function is dealing in bytes, not lines. The code seek(0) moves the file to the 0 byte (first byte) in the file. Second, current_line is just a variable and has no real connection to the file at all. We are manually incrementing it.

 

How does readline() know where each line is?
Inside readline() is code that scans each byte of the file until it finds a \n character, then stops reading the file to return what it found so far. The file f is responsible for maintaining the current position in the file after each readline() call, so that it will keep reading each line.
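
Both answers can be observed directly. This is a small sketch, assuming an invented three-line file opened in binary mode so that tell() reports a plain byte offset; the variable names are illustrative.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(b"line one\nline two\nline three\n")

with open(path, "rb") as f:
    first = f.readline()         # reads up to and including the first '\n'
    second = f.readline()        # continues from where the last call stopped
    offset_after_two = f.tell()  # the file's own position, in bytes

    f.seek(0)                    # back to byte 0; no Python variable changes
    again = f.readline()         # so the first line is read a second time

print(repr(first), repr(second), offset_after_two, repr(again))
```

The position lives inside the file object, which is why rewinding affects the next readline() call but not any counter kept in a separate variable.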

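Excel shortcut tips (translated from Chinese)

Action | Shortcut

1> Moving and scrolling within a worksheet
Move one cell up, down, left, or right | Arrow keys
Move to the edge of the current data region | CTRL+Arrow key
Move to the beginning of the row | HOME
Move to the beginning of the worksheet | CTRL+HOME
Move to the last cell of the worksheet | CTRL+END
Move down one screen | PAGE DOWN
Move up one screen | PAGE UP
Move right one screen | ALT+PAGE DOWN
Move left one screen | ALT+PAGE UP
Move to the next sheet in the workbook | CTRL+PAGE DOWN
Move to the previous sheet in the workbook | CTRL+PAGE UP
Move to the next workbook or window | CTRL+F6 or CTRL+TAB
Move to the previous workbook or window | CTRL+SHIFT+F6
Move to the next pane in a split workbook | F6
Move to the previous pane in a split workbook | SHIFT+F6
Scroll to display the active cell | CTRL+BACKSPACE
Display the "Go To" dialog box | F5
Display the "Find" dialog box | SHIFT+F5
Repeat the last "Find" action | SHIFT+F4
Move between unlocked cells on a protected worksheet | TAB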
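2> Moving within a worksheet in END mode
Turn END mode on or off | END
Move by one block of data within a row or column | END, Arrow key
Move to the last cell of the worksheet | END, HOME
Move right to the last non-blank cell in the current row | END, ENTER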
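3> Moving within a worksheet with SCROLL LOCK on
Turn SCROLL LOCK on or off | SCROLL LOCK
Move to the cell in the upper-left corner of the window | HOME
Move to the cell in the lower-right corner of the window | END
Scroll up or down one row | UP ARROW or DOWN ARROW
Scroll left or right one column | LEFT ARROW or RIGHT ARROW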
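4> Previewing and printing a document
Display the "Print" dialog box | CTRL+P
While in print preview:
Move around the document when zoomed in | Arrow keys
Scroll one page at a time when zoomed out | PAGE UP
Go to the first page when zoomed out | CTRL+UP ARROW
Go to the last page when zoomed out | CTRL+DOWN ARROW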
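5> Worksheets, charts, and macros
Insert a new worksheet | SHIFT+F11
Create a chart that uses the current range | F11 or ALT+F1
Display the "Macro" dialog box | ALT+F8
Display the Visual Basic Editor | ALT+F11
Insert a Microsoft Excel 4.0 macro sheet | CTRL+F11
Move to the next sheet in the workbook | CTRL+PAGE DOWN
Move to the previous sheet in the workbook | CTRL+PAGE UP
Select the current and next sheet in the workbook | SHIFT+CTRL+PAGE DOWN
Select the current and previous sheet in the workbook | SHIFT+CTRL+PAGE UP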
6> Selecting chart sheets
Select the next sheet in the workbook | CTRL+PAGE DOWN
Select the previous sheet in the workbook | CTRL+PAGE UP,END, SHIFT+ENTER
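7> Entering data in a worksheet
Complete a cell entry and move down in the selection | ENTER
Start a new line within the cell | ALT+ENTER
Fill the selected cell range with the current entry | CTRL+ENTER
Complete a cell entry and move up in the selection | SHIFT+ENTER
Complete a cell entry and move right in the selection | TAB
Complete a cell entry and move left in the selection | SHIFT+TAB
Cancel a cell entry | ESC
Delete the character to the left of the insertion point, or delete the selection | BACKSPACE
Delete the character to the right of the insertion point, or delete the selection | DELETE
Delete text from the insertion point to the end of the line | CTRL+DELETE
Move one character up, down, left, or right | Arrow keys
Move to the beginning of the line | HOME
Repeat the last action | F4 or CTRL+Y
Edit a cell comment | SHIFT+F2
Create names from row or column labels | CTRL+SHIFT+F3
Fill down | CTRL+D
Fill right | CTRL+R
Define a name | CTRL+F3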
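8> Formatting data
Display the "Style" dialog box | ALT+' (apostrophe)
Display the "Format Cells" dialog box | CTRL+1
Apply the "General" number format | CTRL+SHIFT+~
Apply the "Currency" format with two decimal places | CTRL+SHIFT+$
Apply the "Percentage" format with no decimal places | CTRL+SHIFT+%
Apply the "Scientific" number format with two decimal places | CTRL+SHIFT+^
Apply the "Date" format with day, month, and year | CTRL+SHIFT+#
Apply the "Time" format with hour and minute, marked AM or PM | CTRL+SHIFT+@
Apply the number format with a thousands separator and a minus sign (-) for negative values | CTRL+SHIFT+!
Apply the outline border | CTRL+SHIFT+&
Remove the outline border | CTRL+SHIFT+_
Apply or remove bold formatting | CTRL+B
Apply or remove italic formatting | CTRL+I
Apply or remove underlining | CTRL+U
Apply or remove strikethrough | CTRL+5
Hide rows | CTRL+9
Unhide rows | CTRL+SHIFT+( (opening parenthesis)
Hide columns | CTRL+0 (zero)
Unhide columns | CTRL+SHIFT+) (closing parenthesis)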
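9> Editing data
Edit the active cell and place the insertion point at the end of the line | F2
Cancel an entry in the cell or formula bar | ESC
Edit the active cell and clear its existing contents | BACKSPACE
Paste a defined name into a formula | F3
Complete a cell entry | ENTER
Enter a formula as an array formula | CTRL+SHIFT+ENTER
Display the Formula Palette after typing a function name in a formula | CTRL+A
Insert the argument names and parentheses for a function after typing its name in a formula | CTRL+SHIFT+A
Display the "Spelling" dialog box | F7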
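10> Inserting, deleting, and copying a selection
Copy the selection | CTRL+C
Cut the selection | CTRL+X
Paste the selection | CTRL+V
Clear the contents of the selection | DELETE
Delete the selection | CTRL+HYPHEN
Undo the last action | CTRL+Z
Insert blank cells | CTRL+SHIFT+PLUS SIGN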
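11> Moving within a selection
Move from top to bottom within the selection | ENTER
Move from bottom to top within the selection | SHIFT+ENTER
Move from left to right within the selection | TAB
Move from right to left within the selection | SHIFT+TAB
Move clockwise to the next corner of the selection | CTRL+PERIOD
Move right to a nonadjacent selection | CTRL+ALT+RIGHT ARROW
Move left to a nonadjacent selection | CTRL+ALT+LEFT ARROW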
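12> Selecting cells, columns, or rows
Select the current region around the active cell | CTRL+SHIFT+* (asterisk)
Extend the selection by one cell | SHIFT+Arrow key
Extend the selection to the last non-blank cell in the same row or column as the active cell | CTRL+SHIFT+Arrow key
Extend the selection to the beginning of the row | SHIFT+HOME
Extend the selection to the beginning of the worksheet | CTRL+SHIFT+HOME
Extend the selection to the last used cell of the worksheet | CTRL+SHIFT+END
Select the entire column | CTRL+SPACEBAR
Select the entire row | SHIFT+SPACEBAR
Select the entire worksheet | CTRL+A
With multiple cells selected, select only the active cell | SHIFT+BACKSPACE
Extend the selection down one screen | SHIFT+PAGE DOWN
Extend the selection up one screen | SHIFT+PAGE UP
With an object selected, select all objects on the worksheet | CTRL+SHIFT+SPACEBAR
Switch between hiding objects, displaying objects, and displaying object placeholders | CTRL+6
Show or hide the Standard toolbar | CTRL+7
Turn on extending the selection with the arrow keys | F8
Add cells from another range to the selection | SHIFT+F8
Extend the selection to the cell in the upper-left corner of the window | SCROLL LOCK, SHIFT+HOME
Extend the selection to the cell in the lower-right corner of the window | SCROLL LOCK, SHIFT+END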
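13> Extending a selection in END mode
Turn END mode on or off | END
Extend the selection to the last non-blank cell in the same column or row as the active cell | END, SHIFT+Arrow key
Extend the selection to the last cell on the worksheet that contains data | END, SHIFT+HOME
Extend the selection to the last cell in the current row | END, SHIFT+ENTER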
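14> Selecting cells with special characteristics
Select the current region around the active cell | CTRL+SHIFT+* (asterisk)
Select the current array, i.e. the array the active cell belongs to | CTRL+/
Select all cells with comments | CTRL+SHIFT+O (letter O)
Select cells in a row whose values do not match the active cell in that row | CTRL+\
Select cells in a column whose values do not match the active cell in that column | CTRL+SHIFT+|
Select cells directly referenced by formulas in the selection | CTRL+[ (opening bracket)
Select all cells directly or indirectly referenced by formulas in the selection | CTRL+SHIFT+{ (opening brace)
Select only cells with formulas that refer directly to the active cell | CTRL+] (closing bracket)
Select all cells with formulas that refer directly or indirectly to the active cell | CTRL+SHIFT+} (closing brace)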