Importing flat files from the web¶

The code in these notes uses Python 2.
The data comes from the University of California, Irvine's Machine Learning Repository.

http://archive.ics.uci.edu/ml/index.html

The flat file 'winequality-red.csv' contains tabular data on the physicochemical properties of red wine, such as pH, alcohol content and citric acid content, along with a wine quality rating.

# Import package
import urllib

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urllib.urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print df.shape

(1599, 12)

df.head(3)
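
The snippet above is Python 2. As a minimal sketch for readers on Python 3, where urlretrieve lives in urllib.request, the same download-then-parse step might look like the following. The helper name download_and_read, the three sample rows, and the file:// URL (which stands in for the real web address so the example runs offline) are assumptions for illustration. The standard csv module is swapped in for pandas to keep the sketch free of dependencies.

```python
import csv
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # Python 3 location of urlretrieve


def download_and_read(url, local_path, delimiter=';'):
    """Save the file at `url` to `local_path`, then parse it as delimited text."""
    urlretrieve(url, local_path)
    with open(local_path, newline='') as f:
        return list(csv.reader(f, delimiter=delimiter))


# Offline demo: a small local file served through a file:// URL (hypothetical data).
tmp_dir = tempfile.mkdtemp()
source = Path(tmp_dir) / 'sample.csv'
source.write_text('"fixed acidity";"pH";"quality"\n7.4;3.51;5\n7.8;3.20;5\n')

rows = download_and_read(source.as_uri(), os.path.join(tmp_dir, 'copy.csv'))
print(rows[0])        # header row
print(len(rows) - 1)  # number of data rows
```

With the real address, the call would take the url variable from the snippet above in place of source.as_uri().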

Opening and reading flat files from the web¶

To load a file from the web into a DataFrame without first saving it locally, pass the URL directly to pd.read_csv().

# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# Print the head of the DataFrame (disabled) and its shape
# print(df.head())
print df.shape

# Plot the first column of df (iloc selects columns by position)
pd.DataFrame.hist(df.iloc[:, 0:1], alpha=.4, figsize=(6,3))
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()

(1599, 12)
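
To illustrate the idea of reading a remote flat file without keeping a local copy, here is a rough standard-library sketch for Python 3. It is not how pandas works internally. The function name read_delimited and the sample data are hypothetical, and a file:// URL again stands in for the real address so the code runs offline.

```python
import csv
import io
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_delimited(url, delimiter=';'):
    """Parse delimited text straight from `url` into a list of dicts."""
    with urlopen(url) as response:
        text = response.read().decode('utf-8')
    # The parsed rows live only in memory; nothing is written to disk here.
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Offline demo with hypothetical data.
sample = Path(tempfile.mkdtemp()) / 'sample.csv'
sample.write_text('fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n')

records = read_delimited(sample.as_uri())
print(len(records))
print(records[0]['pH'])
```

Note that csv returns every value as a string, whereas pd.read_csv() infers numeric types for the columns.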

Importing non-flat files from the web¶

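Use pd.read_excel() to import an Excel spreadsheet. Passing sheetname=None reads every sheet and returns a dict that maps sheet names to DataFrames (newer pandas versions spell this argument sheet_name).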

# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheetname=None)

# Print the sheetnames to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())

print type(xl)
type(xl['1700'])

[u'1700', u'1900']
                 country       1700
0            Afghanistan  34.565000
1  Akrotiri and Dhekelia  34.616667
2                Albania  41.312000
3                Algeria  36.720000
4         American Samoa -14.307000
<type 'dict'>

pandas.core.frame.DataFrame

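# Import the request helpers (Python 2: urllib2)
from urllib2 import urlopen, Request

# Package the GET request
request = Request('http://jishichao.com')

# Send the request and catch the response
response = urlopen(request)

# Read the body of the response, then close the connection
html = response.read()
response.close()

# The body is a byte string (str in Python 2)
print type(html)
len(html)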

<type 'str'>

4843

Printing HTTP request results in Python using urllib¶

The previous snippet packaged and sent a GET request and then caught the response. The question remains: what can you do with this response?

As it came from an HTML page, you can read it to extract the HTML: the response object has an associated read() method. In Python 3 the response is an http.client.HTTPResponse object; under Python 2's urllib2, used here, it is a file-like object that offers the same read() method. The example below does this for "http://docs.datacamp.com/teach/".

# Import packages
from urllib2 import urlopen, Request

# Specify the url
url = "http://docs.datacamp.com/teach/"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

print type(html)
print 
# Print the html
print(html[:300])


# Be polite and close the response!
response.close()

<type 'str'>

<!DOCTYPE html>
<link rel="shortcut icon" href="images/favicon.ico" />
<html>

  <head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>Home</title>
  <meta name="description" content="A
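
For comparison, a hedged Python 3 sketch of the same request-and-read pattern: urllib2 was merged into urllib.request, read() returns bytes that must be decoded, and a with block closes the response automatically. The helper name fetch_html and the sample page are assumptions for illustration, and a file:// URL stands in for the real address so the sketch runs offline.

```python
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_html(url, encoding='utf-8'):
    """Send a GET request to `url` and return the decoded body as text."""
    request = Request(url)              # package the request
    with urlopen(request) as response:  # send it; the response closes on exit
        raw = response.read()           # bytes in Python 3
    return raw.decode(encoding)


# Offline demo with a hypothetical page.
page = Path(tempfile.mkdtemp()) / 'page.html'
page.write_text('<!DOCTYPE html>\n<html><head><title>Home</title></head></html>\n',
                encoding='utf-8')

html = fetch_html(page.as_uri())
print(type(html).__name__)
print(html[:15])
```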

Requests¶

The requests library is easier to use than urllib and is very widely used.

Performing HTTP requests in Python using requests¶

Do the same using the higher-level requests library.

import requests
r = requests.get('http://jishichao.com')
text = r.text

print type(text)
print type(text.encode('utf-8'))

<type 'unicode'>
<type 'str'>
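
The output above shows that r.text is text (unicode in Python 2) and that encoding it yields a byte string (str in Python 2). As a small sketch of the same distinction under Python 3, where the two types are named str and bytes, using a made-up sample string:

```python
# A text string: unicode in Python 2, str in Python 3 (sample value is made up).
text = 'caf\u00e9 23333'

# Encoding turns text into bytes: str in Python 2, bytes in Python 3.
data = text.encode('utf-8')

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Decoding with the same codec recovers the original text.
print(data.decode('utf-8') == text)
```

The byte string is one element longer than the text because the accented character takes two bytes in UTF-8.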

Beautiful Soup¶

Parsing HTML with BeautifulSoup¶

Import BeautifulSoup from the package bs4.

Package the request to the URL, send it, and catch the response with the single function requests.get(), assigning the response to the variable r.
Use the text attribute of r to get the HTML of the webpage as a string; store the result in the variable html_doc.
Create a BeautifulSoup object soup from the resulting HTML by calling BeautifulSoup().
Call the method prettify() on soup and assign the result to pretty_soup.

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print type(pretty_soup)
print 
print(pretty_soup[:300])

<type 'unicode'>

<!DOCTYPE html>
<html lang="en">
 <head>
  <link href="../static/my.css" rel="stylesheet" type="text/css"/>
  <title>
   welcome 23333
  </title>
  <!--      <script>
        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
        (i[r].q=i[r].q||[]).push(arguments)},i[r

Turning a webpage into data using BeautifulSoup: getting the text¶

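Extract the title from the BeautifulSoup object soup using the attribute title and assign the result to guido_title.

Extract the text from soup using the method get_text() and assign it to guido_text. The guido_ prefix in the variable names is a leftover; the page parsed here is http://jishichao.com.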

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML, naming the parser explicitly: soup
soup = BeautifulSoup(html_doc, "lxml")

# Get the title of the webpage: guido_title
guido_title = soup.title

# Print the title of the webpage to the shell
print(guido_title)

# Get the text of the webpage: guido_text
guido_text = soup.get_text()

# Print part of the text to the shell
print(guido_text[100:300])

<title>welcome 23333 </title>

An interactive Data Visualization Web App I wrote

My Notebook website built by Python Flask deployed on AWS
A image downloader for a specific website 'worldcosplay', the program helps you open their
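
To make the idea of pulling a title and visible text out of markup concrete without any third-party package, here is a rough sketch built on the standard library's html.parser for Python 3. It is a swapped-in technique, not BeautifulSoup, and only approximates what soup.title and get_text() return. The class name TitleAndText and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndText(HTMLParser):
    """Collect the <title> contents and all non-blank text chunks."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        if data.strip():
            self.chunks.append(data.strip())


# Hypothetical sample page.
sample = ('<html><head><title>welcome 23333 </title></head>'
          '<body><p>My Notebook website</p><p>An image downloader</p></body></html>')

parser = TitleAndText()
parser.feed(sample)
parser.close()

page_title = parser.title.strip()
page_text = '\n'.join(parser.chunks)
print(page_title)
print(page_text)
```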

Turning a webpage into data using BeautifulSoup: getting the hyperlinks¶

Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a>; store the result in the variable a_tags.

The variable a_tags is a result set: iterate over it with a for loop and print the actual URLs of the hyperlinks. To do this, for every element link in a_tags, print() link.get('href').

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'http://jishichao.com'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML, naming the parser explicitly: soup
soup = BeautifulSoup(html_doc, "lxml")

# Print the title of the webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

<title>welcome 23333 </title>
http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=
      result&fr=&sf=1&fmq=1467292435965_R&pv=&ic=0&nc=1&z=&se=1&showtab
      =0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=草泥马动态图
http://shichaoji.com
http://www.jishichao.com:7777
http://www.jishichao.com:10086
https://jishichao.com
/windows0
/windows2
/windows1
/mac1
/linux64
/linux32
/plotting
./

type(a_tags), type(a_tags[0])

(bs4.element.ResultSet, bs4.element.Tag)
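
The hrefs printed above mix absolute URLs with relative paths such as /windows0 and ./. As a minimal Python 3 sketch, the standard library's urljoin can resolve them against the page address. The hrefs list below is typed in by hand from the output above rather than scraped, and the trailing slash on base_url is an assumption.

```python
from urllib.parse import urljoin

# Address of the page the links were found on.
base_url = 'http://jishichao.com/'

# A few hrefs copied by hand from the output above.
hrefs = ['http://shichaoji.com', '/windows0', '/plotting', './']

# Absolute URLs pass through unchanged; relative ones are resolved against base_url.
absolute = [urljoin(base_url, href) for href in hrefs]

for link in absolute:
    print(link)
```

In the loop from the snippet above, the same call would be urljoin(url, link.get('href')).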

Output of df.head(3) from the first section:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5

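Python odds and ends¶

Import argv with from sys import argv, then unpack it: a, b, c, ... = argv

> python file.py b c ...

argv is then the list ['file.py', 'b', 'c', ...]. After unpacking, a holds the name of the Python file itself, and b, c, ... hold the arguments passed on the command line when the file was executed.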
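
As a small sketch of that unpacking in Python 3, the star syntax collects however many arguments follow the script name. The helper name unpack_args is hypothetical, and the list passed in simulates a command line so the example runs anywhere.

```python
def unpack_args(argv):
    """Split an argv-style list into the script name and the remaining arguments."""
    script, *params = argv
    return script, params


# Simulated command line for: python file.py b c
script, params = unpack_args(['file.py', 'b', 'c'])
print(script)
print(params)

# In a real script: from sys import argv, then unpack_args(argv).
```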
The argument mode points to a string beginning with one of the following sequences (additional characters may follow these sequences):

'r': Open text file for reading. The stream is positioned at the beginning of the file.
'r+': Open for reading and writing. The stream is positioned at the beginning of the file.
'w': Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file.
'w+': Open for reading and writing. The file is created if it does not exist, otherwise it is truncated. The stream is positioned at the beginning of the file.
'a': Open for writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.
'a+': Open for reading and writing. The file is created if it does not exist. The stream is positioned at the end of the file. Subsequent writes to the file will always end up at the then current end of file, irrespective of any intervening fseek(3) or similar.

                  | r   r+  w   w+  a   a+
------------------|------------------------
read              | +   +       +       +
write             |     +   +   +   +   +
create            |         +   +   +   +
truncate          |         +   +
position at start | +   +   +   +
position at end   |                 +   +

where the meanings are (just to avoid any misinterpretation):

read: reading from the file is allowed
write: writing to the file is allowed
create: the file is created if it does not exist yet
truncate: the file is emptied when it is opened (all content of the file is erased)
position at start: after the file is opened, the initial position is set to the start of the file
position at end: after the file is opened, the initial position is set to the end of the file
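
The mode table can be checked with a short experiment. This is a minimal Python 3 sketch that uses a throwaway temporary file; the file name modes.txt and the strings written are arbitrary choices for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' creates the file (or truncates an existing one) and writes from the start.
with open(path, 'w') as f:
    f.write('first\n')

# 'a' positions the stream at the end, so the existing content is kept.
with open(path, 'a') as f:
    f.write('second\n')

with open(path, 'r') as f:
    after_append = f.read()

# 'r+' allows reading and writing without truncating; writing starts at position 0.
with open(path, 'r+') as f:
    f.write('FIRST')

with open(path, 'r') as f:
    after_rplus = f.read()

# Opening with 'w' again empties the file even if nothing is written.
with open(path, 'w'):
    pass
size_after_w = os.path.getsize(path)

print(repr(after_append))
print(repr(after_rplus))
print(size_after_w)
```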
File methods:

`close`: Closes the file. Like `File->Save..` in your editor.
`read`: Reads the contents of the file. You can assign the result to a variable.
`readline`: Reads just one line of a text file.
`truncate`: Empties the file. Watch out if you care about the file.
`write('stuff')`: Writes "stuff" to the file.

File manipulation, manipulate.py:

from sys import argv

script, input_file = argv

current_file = open(input_file)

def print_all(f):
    print f.read()

def rewind(f):
    f.seek(0)

def print_a_line(line_count, f):
    print line_count, f.readline()

$ python manipulate.py test.txt

Why does seek(0) not set the current_line to 0?
First, the seek() function is dealing in bytes, not lines. The code seek(0) moves the file to the 0 byte (first byte) in the file. Second, current_line is just a variable and has no real connection to the file at all. We are manually incrementing it.

How does readline() know where each line is?
Inside readline() is code that scans each byte of the file until it finds a \n character, then stops reading the file to return what it found so far. The file f is responsible for maintaining the current position in the file after each readline() call, so that it will keep reading each line.
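
The two answers above can be observed directly with tell(), which reports the current byte offset. This is a small Python 3 sketch; the in-memory io.BytesIO stream and its contents are stand-ins for a real file opened in binary mode.

```python
import io

# An in-memory byte stream stands in for a file opened with open(path, 'rb').
f = io.BytesIO(b'line one\nline two\nline three\n')

first = f.readline()        # reads up to and including the first b'\n'
position = f.tell()         # byte offset stored by the stream after that read
second = f.readline()       # continues from the stored position

f.seek(0)                   # jump back to byte 0; no line counter is involved
first_again = f.readline()  # so the first line is returned again

print(first, position)
print(second)
print(first_again == first)
```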

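Excel shortcuts¶

	problem	shortcut
0	1> Excel shortcuts: moving and scrolling in a worksheet. Move one cell up, down, left, or right	Arrow keys
1	Move to the edge of the current data region	CTRL+Arrow key
2	Move to the beginning of the row	HOME
3	Move to the beginning of the worksheet	CTRL+HOME
4	Move to the last cell of the worksheet	CTRL+END
5	Move down one screen	PAGE DOWN
6	Move up one screen	PAGE UP
7	Move one screen to the right	ALT+PAGE DOWN
8	Move one screen to the left	ALT+PAGE UP
9	Move to the next sheet in the workbook	CTRL+PAGE DOWN
10	Move to the previous sheet in the workbook	CTRL+PAGE UP
11	Move to the next workbook or window	CTRL+F6 or CTRL+TAB
12	Move to the previous workbook or window	CTRL+SHIFT+F6
13	Move to the next pane in a split workbook	F6
14	Move to the previous pane in a split workbook	SHIFT+F6
15	Scroll to display the active cell	CTRL+BACKSPACE
16	Display the Go To dialog box	F5
17	Display the Find dialog box	SHIFT+F5
18	Repeat the last Find action	SHIFT+F4
19	Move between unlocked cells in a protected worksheet	TAB
20	2> Excel shortcuts: moving in a worksheet in END mode
21	Turn END mode on or off	END
22	Move by one block of data within a row or column	END, Arrow key
23	Move to the last cell of the worksheet	END, HOME
24	Move right to the last non-blank cell in the current row	END, ENTER
25	3> Excel shortcuts: moving in a worksheet with SCROLL LOCK on
26	Turn SCROLL LOCK on or off	SCROLL LOCK
27	Move to the cell in the upper-left corner of the window	HOME
28	Move to the cell in the lower-right corner of the window	END
29	Scroll up or down one row	UP ARROW or DOWN ARROW
30	Scroll left or right one column	LEFT ARROW or RIGHT ARROW
31	4> Excel shortcuts: previewing and printing a document
32	Display the Print dialog box	CTRL+P
33	In print preview:
34	Move around the document when zoomed in	Arrow keys
35	Scroll one page at a time when zoomed out	PAGE UP
36	Scroll to the first page when zoomed out	CTRL+UP ARROW
37	Scroll to the last page when zoomed out	CTRL+DOWN ARROW
38	5> Excel shortcuts: worksheets, charts, and macros
39	Insert a new worksheet	SHIFT+F11
40	Create a chart that uses the current range	F11 or ALT+F1
41	Display the Macro dialog box	ALT+F8
42	Display the Visual Basic Editor	ALT+F11
43	Insert a Microsoft Excel 4.0 macro sheet	CTRL+F11
44	Move to the next sheet in the workbook	CTRL+PAGE DOWN
45	Move to the previous sheet in the workbook	CTRL+PAGE UP
46	Select the current and next sheet in the workbook	SHIFT+CTRL+PAGE DOWN
47	Select the current and previous sheet in the workbook	SHIFT+CTRL+PAGE UP
48	6> Excel shortcuts: selecting chart sheets
49	Select the next sheet in the workbook	CTRL+PAGE DOWN
50	Select the previous sheet in the workbook	CTRL+PAGE UP, END, SHIFT+ENTER
51	7> Excel shortcuts: entering data in a worksheet
52	Complete a cell entry and move down in the selection	ENTER
53	Start a new line in the same cell	ALT+ENTER
54	Fill the selected cell range with the current entry	CTRL+ENTER
55	Complete a cell entry and move up in the selection	SHIFT+ENTER
56	Complete a cell entry and move right in the selection	TAB
57	Complete a cell entry and move left in the selection	SHIFT+TAB
58	Cancel a cell entry	ESC
59	Delete the character to the left of the insertion point, or delete the selection	BACKSPACE
60	Delete the character to the right of the insertion point, or delete the selection	DELETE
61	Delete text from the insertion point to the end of the line	CTRL+DELETE
62	Move one character up, down, left, or right	Arrow keys
63	Move to the beginning of the line	HOME
64	Repeat the last action	F4 or CTRL+Y
65	Edit a cell comment	SHIFT+F2
66	Create names from row or column labels	CTRL+SHIFT+F3
67	Fill down	CTRL+D
68	Fill right	CTRL+R
69	Define a name	CTRL+F3
70	8> Excel shortcuts: formatting data
71	Display the Style dialog box	ALT+' (apostrophe)
72	Display the Format Cells dialog box	CTRL+1
73	Apply the General number format	CTRL+SHIFT+~
74	Apply the Currency format with two decimal places	CTRL+SHIFT+$
75	Apply the Percentage format with no decimal places	CTRL+SHIFT+%
76	Apply the Scientific number format with two decimal places	CTRL+SHIFT+^
77	Apply the Date format with year, month, and day	CTRL+SHIFT+#
78	Apply the Time format with hour and minute, indicating AM or PM	CTRL+SHIFT+@
79	Apply the number format with a thousands separator and a minus sign (-) for negative values	CTRL+SHIFT+!
80	Apply an outline border	CTRL+SHIFT+&
81	Remove the outline border	CTRL+SHIFT+_
82	Apply or remove bold formatting	CTRL+B
83	Apply or remove italic formatting	CTRL+I
84	Apply or remove underline formatting	CTRL+U
85	Apply or remove strikethrough formatting	CTRL+5
86	Hide rows	CTRL+9
87	Unhide rows	CTRL+SHIFT+( (opening parenthesis)
88	Hide columns	CTRL+0 (zero)
89	Unhide columns	CTRL+SHIFT+) (closing parenthesis)
90	9> Excel shortcuts: editing data
91	Edit the active cell and place the insertion point at the end of the line	F2
92	Cancel an entry in the cell or formula bar	ESC
93	Edit the active cell and clear its existing contents	BACKSPACE
94	Paste a defined name into a formula	F3
95	Complete a cell entry	ENTER
96	Enter a formula as an array formula	CTRL+SHIFT+ENTER
97	Display the Formula Palette after typing a function name in a formula	CTRL+A
98	Insert the argument names and parentheses for a function after typing its name in a formula	CTRL+SHIFT+A
99	Display the Spelling dialog box	F7
100	10> Excel shortcuts: inserting, deleting, and copying a selection
101	Copy the selection	CTRL+C
102	Cut the selection	CTRL+X
103	Paste the selection	CTRL+V
104	Clear the contents of the selection	DELETE
105	Delete the selection	CTRL+HYPHEN
106	Undo the last action	CTRL+Z
107	Insert blank cells	CTRL+SHIFT+PLUS SIGN
108	11> Excel shortcuts: moving within a selection
109	Move from top to bottom within the selection	ENTER
110	Move from bottom to top within the selection	SHIFT+ENTER
111	Move from left to right within the selection	TAB
112	Move from right to left within the selection	SHIFT+TAB
113	Move clockwise to the next corner of the selection	CTRL+PERIOD
114	Move right to a nonadjacent selection	CTRL+ALT+RIGHT ARROW
115	Move left to a nonadjacent selection	CTRL+ALT+LEFT ARROW
116	12> Excel shortcuts: selecting cells, columns, or rows
117	Select the region around the current cell	CTRL+SHIFT+* (asterisk)
118	Extend the selection by one cell	SHIFT+Arrow key
119	Extend the selection to the last non-blank cell in the same row or column as the active cell	CTRL+SHIFT+Arrow key
120	Extend the selection to the beginning of the row	SHIFT+HOME
121	Extend the selection to the beginning of the worksheet	CTRL+SHIFT+HOME
122	Extend the selection to the last used cell of the worksheet	CTRL+SHIFT+END
123	Select the entire column	CTRL+SPACEBAR
124	Select the entire row	SHIFT+SPACEBAR
125	Select the entire worksheet	CTRL+A
126	With multiple cells selected, select only the active cell	SHIFT+BACKSPACE
127	Extend the selection down one screen	SHIFT+PAGE DOWN
128	Extend the selection up one screen	SHIFT+PAGE UP
129	With an object selected, select all objects on the worksheet	CTRL+SHIFT+SPACEBAR
130	Toggle between hiding objects, displaying objects, and displaying object placeholders	CTRL+6
131	Show or hide the Standard toolbar	CTRL+7
132	Turn on extending the selection with the arrow keys	F8
133	Add cells from another range to the selection	SHIFT+F8
134	Extend the selection to the cell in the upper-left corner of the window	SCROLL LOCK, SHIFT+HOME
135	Extend the selection to the cell in the lower-right corner of the window	SCROLL LOCK, SHIFT+END
136	13> Excel shortcuts: extending a selection in END mode
137	Turn END mode on or off	END
138	Extend the selection to the last non-blank cell in the same column or row as the active cell	END, SHIFT+Arrow key
139	Extend the selection to the last cell on the worksheet that contains data	END, SHIFT+HOME
140	Extend the selection to the last cell in the current row	END, SHIFT+ENTER
141	14> Excel shortcuts: selecting cells with special characteristics
142	Select the current region around the active cell	CTRL+SHIFT+* (asterisk)
143	Select the current array, that is, the array containing the active cell	CTRL+/
144	Select all cells with comments	CTRL+SHIFT+O (the letter O)
145	Select cells in a row whose values do not match the active cell in that row	CTRL+\
146	Select cells in a column whose values do not match the active cell in that column	CTRL+SHIFT+|
147	Select cells directly referenced by formulas in the selection	CTRL+[ (opening bracket)
148	Select all cells directly or indirectly referenced by formulas in the selection	CTRL+SHIFT+{ (opening brace)
149	Select only cells with formulas that refer directly to the active cell	CTRL+] (closing bracket)
150	Select all cells with formulas that refer directly or indirectly to the active cell	CTRL+SHIFT+} (closing brace)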

get data from the web using Python -1 beautifulsoup, requests, urllib
