SCRAPING USING URLLIB
[root@rhel7 html]# python
Python 2.7.5 (default, Feb 11 2014, 07:46:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import urllib
>>> f=urllib.urlopen("https://www.microsoft.com")
>>> print f
<addinfourl at 140286118192624 whose fp = <socket._fileobject object at 0x7f96f39ed7d0>>
>>> f
<addinfourl at 140286118192624 whose fp = <socket._fileobject object at 0x7f96f39ed7d0>>
Important: f is a response object, and through it we can read several details about the URL, as listed below.
@@@ To get the response headers, including the server software that hosts the page
>>> print f.info()
Server: Apache
ETag: "6082151bd56ea922e1357f5896a90d0a:1425454794"
Last-Modified: Wed, 04 Mar 2015 07:39:54 GMT
Accept-Ranges: bytes
Content-Length: 1020
Content-Type: text/html
Date: Mon, 10 Aug 2015 02:33:46 GMT
Connection: close
X-N: S
@@@ To get the URL from the f object
>>> print f.geturl()
https://www.microsoft.com
@@@ To read the HTML source of the page
>>> print f.read()
<html><head><title>Microsoft Corporation</title><meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7"></meta><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta><meta name="SearchTitle" content="Microsoft.com" scheme=""></meta><meta name="Description" content="Get product information, support, and news from Microsoft." scheme=""></meta><meta name="Title" content="Microsoft.com Home Page" scheme=""></meta><meta name="Keywords" content="Microsoft, product, support, help, training, Office, Windows, software, download, trial, preview, demo, business, security, update, free, computer, PC, server, search, download, install, news" scheme=""></meta><meta name="SearchDescription" content="Microsoft.com Homepage" scheme=""></meta></head><body><p>Your current User-Agent string appears to be from an automated process, if this is incorrect, please click this link:<a href="http://www.microsoft.com/en/us/default.aspx?redir=true">United States English Microsoft Homepage</a></p></body></html>
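The body returned above says the server took the default User-Agent for an automated process. Here is a minimal sketch of attaching an explicit User-Agent before fetching. It assumes Python 3, where `urllib.request.Request` plays the role of `urllib2.Request` from Python 2. The agent string and the helper name `build_request` are hypothetical, and whether the site then serves its regular page is not verified here.

```python
from urllib.request import Request

# Hypothetical agent string, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) example-scraper/0.1"

def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://www.microsoft.com")
# Request stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
# To actually fetch the page: urllib.request.urlopen(req).read()
```

Passing `req` to `urlopen` instead of a bare URL string sends the custom header with the request.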
@@@ You can also pull individual headers out of f.info() by name
>>> print f.info()['Server']
Apache
>>> print f.info()['Date']
Mon, 10 Aug 2015 02:33:46 GMT
>>>
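The same by-name lookups work in Python 3, where `.get()` also supplies a fallback when a header is absent. This is a small sketch assuming Python 3.4 or later. It opens a `data:` URL so that it runs without network access, which means the only headers present are the ones urllib synthesizes (there is no Server or Date header). The sample HTML is made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Made-up page content, embedded in a data: URL so no network is needed.
html = "<html><body><p>hello</p></body></html>"
url = "data:text/html," + quote(html)

with urlopen(url) as f:
    headers = f.info()
    content_type = headers["content-type"]      # lookup is case-insensitive
    server = headers.get("Server", "unknown")   # fallback for a missing header
    body = f.read().decode("ascii")             # read() returns bytes in Python 3

print(content_type)
print(server)
```

Against a real HTTP server, `headers.get("Server", "unknown")` would return the server's own value when it sends one.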
@@@ Get the HTTP response code
>>> print f.code
200
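A bare number like 200 is easier to act on once it is classified. This is a hedged sketch assuming Python 3, since `http.HTTPStatus` does not exist in the Python 2.7 session above. The helper `describe_status` is hypothetical; the code passed to it would come from `f.code` (or `f.status` in Python 3).

```python
from http import HTTPStatus

def describe_status(code):
    """Return (reason phrase, success flag) for an HTTP status code."""
    phrase = HTTPStatus(code).phrase   # raises ValueError for unknown codes
    ok = 200 <= code < 300             # 2xx means the request succeeded
    return phrase, ok

print(describe_status(200))
print(describe_status(404))
```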
@@@ You can URL-encode query parameters that are held in a dictionary
>>> d={1:"this",2:"tow j "}
>>> print urllib.urlencode(d)
1=this&2=tow+j+
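In Python 3 the same function lives in `urllib.parse`. The sketch below, assuming Python 3.7 or later (so dictionary order is preserved), encodes the same dictionary, appends it to a URL, and decodes it again with `parse_qs`. The `www.example.com` endpoint is a placeholder and not part of the original session.

```python
from urllib.parse import parse_qs, urlencode

params = {1: "this", 2: "tow j "}
query = urlencode(params)                        # spaces are encoded as "+"
url = "https://www.example.com/search?" + query  # placeholder endpoint

# parse_qs reverses the encoding; keys come back as strings, values as lists.
decoded = parse_qs(query)
print(query)
print(decoded)
```

Note that the round trip does not restore the integer keys: everything in a query string is text.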