Get data from the web using Python (Part 1): BeautifulSoup, requests, urllib
Contents:
- basics
- using urllib
- using requests
- parsing HTML with BeautifulSoup
Importing flat files from the web
Note: this notebook uses Python 2.
The dataset comes from the University of California, Irvine's Machine Learning Repository: http://archive.ics.uci.edu/ml/index.html
- 'winequality-red.csv': a flat file containing tabular data on the physicochemical properties of red wine, such as pH, alcohol content, and citric acid content, along with a wine quality rating.
In [1]:
# Import package
import urllib
# Import pandas
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Save the file locally (in Python 2, urlretrieve lives directly in urllib)
urllib.urlretrieve(url, 'winequality-red.csv')
# Read the local file into a DataFrame and print its shape
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.shape)
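For reference, a minimal Python 3 sketch of the same download (in Python 3, urlretrieve moved into the urllib.request submodule):

# Python 3 equivalent of the cell above
from urllib.request import urlretrieve
import pandas as pd

url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
urlretrieve(url, 'winequality-red.csv')
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.shape)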
In [2]:
df.head(3)
Opening and reading flat files from the web
- To load a file from the web into a DataFrame without first saving it locally, you can pass the URL directly to pandas.
In [3]:
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Read file into a DataFrame directly from the URL: df
df = pd.read_csv(url, sep=';')
# Print the shape of the DataFrame
print(df.shape)
# Plot a histogram of the first column of df (.iloc replaces the deprecated .ix)
df.iloc[:, 0:1].hist(alpha=.4, figsize=(6, 3))
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
Importing non-flat files from the web
- use pd.read_excel() to import an Excel spreadsheet.
In [4]:
# Import package
import pandas as pd
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'
# Read in all sheets of the Excel file: xl
# (sheetname=None returns a dict of all sheets; newer pandas renamed this keyword to sheet_name)
xl = pd.read_excel(url, sheetname=None)
# Print the sheet names to the shell
print(xl.keys())
# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())
print(type(xl))
print(type(xl['1700']))
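Since xl is a dict mapping sheet names to DataFrames, you can iterate over every sheet; a quick sketch:

# Keys are sheet names (strings), values are DataFrames
for sheet_name in xl:
    print(sheet_name, xl[sheet_name].shape)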
In [5]:
from urllib2 import urlopen, Request

# Package the GET request, send it, and catch the response
request = Request('http://jishichao.com')
response = urlopen(request)
# Read the raw HTML from the response, then close it
html = response.read()
response.close()
In [6]:
print(type(html))
print(len(html))
Printing HTTP request results in Python using urllib
- You have just packaged and sent a GET request to "http://docs.datacamp.com/teach/" and then caught the response. You saw that such a response is an HTTP response object (http.client.HTTPResponse in Python 3; urllib2's urlopen returns a similar file-like object in Python 2). The question remains: what can you do with this response?
- Well, as it came from an HTML page, you could read it to extract the HTML; in fact, such a response object has an associated read() method.
In [7]:
# Import packages
from urllib2 import urlopen, Request
# Specify the url
url = "http://docs.datacamp.com/teach/"
# This packages the request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Extract the response: html
html = response.read()
print(type(html))
print('')
# Print the html
print(html[:300])
# Be polite and close the response!
response.close()
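To avoid forgetting the close() call, you can wrap the response in contextlib.closing(), which closes it automatically when the with-block exits; a small sketch using the same url:

from contextlib import closing
from urllib2 import urlopen, Request

# closing() guarantees response.close() runs, even if read() raises
with closing(urlopen(Request(url))) as response:
    html = response.read()
print(html[:100])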
In [8]:
import requests

# requests packages the request, sends it, and catches the response in one call
r = requests.get('http://jishichao.com')
# The text attribute holds the response body decoded to unicode
text = r.text
In [9]:
print(type(text))                   # unicode in Python 2
print(type(text.encode('utf-8')))  # encoding yields a byte string
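Beyond r.text, a requests Response object also exposes the status code, the encoding requests guessed from the headers, and the raw bytes; a short sketch:

import requests

r = requests.get('http://jishichao.com')
print(r.status_code)    # 200 means the request succeeded
print(r.encoding)       # encoding guessed from the response headers
print(type(r.content))  # raw body as a byte string, before decoding to r.text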
Beautiful Soup
Parsing HTML with BeautifulSoup
- Import the function BeautifulSoup from the package bs4.
- Package the request to the URL, send the request, and catch the response with the single function requests.get(), assigning the response to the variable r.
- Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
- Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup().
- Use the method prettify() on soup and assign the result to pretty_soup.
In [14]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()
# Print the type and the first part of the prettified HTML
print(type(pretty_soup))
print('')
print(pretty_soup[:300])
Turning a webpage into data using BeautifulSoup: getting the text
- Extract the title from the HTML soup soup using the attribute title and assign the result to guido_title.
- Extract the text from the HTML soup soup using the method get_text() and assign to guido_text.
In [11]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Get the title of Guido's webpage: guido_title
guido_title = soup.title
# Print the title of Guido's webpage to the shell
print(guido_title)
# Get Guido's text: guido_text
guido_text = soup.get_text()
# Print Guido's text to the shell
print(guido_text[100:300])
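Note that soup.title returns the whole <title> tag; to get just the enclosed text, use its string attribute. A quick sketch:

# soup.title is a Tag object; .string gives the text inside the tag
print(soup.title.string)
# get_text(strip=True) also trims leading/trailing whitespace
print(soup.get_text(strip=True)[:100])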
Turning a webpage into data using BeautifulSoup: getting the hyperlinks
- Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a>; store the result in the variable a_tags.
- The variable a_tags is a ResultSet: your job now is to iterate over it with a for loop and print the actual URLs of the hyperlinks; to do this, for every element link in a_tags, print link.get('href').
In [12]:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'http://jishichao.com'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, "lxml")
# Print the title of Guido's webpage
print(soup.title)
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')
# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
In [13]:
print(type(a_tags), type(a_tags[0]))
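Many href values are relative paths; to turn them into absolute URLs, resolve each one against the page URL with urljoin (from urlparse in Python 2, urllib.parse in Python 3). A sketch:

from urlparse import urljoin

# Resolve each (possibly relative) href against the page URL
for link in a_tags:
    href = link.get('href')
    if href:  # some <a> tags carry no href attribute
        print(urljoin(url, href))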