SCRAPING USING URLLIB



[root@rhel7 html]# python
Python 2.7.5 (default, Feb 11 2014, 07:46:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import  urllib
>>> f=urllib.urlopen("https://www.microsoft.com")
>>> print  f
<addinfourl at 140286118192624 whose fp = <socket._fileobject object at 0x7f96f39ed7d0>>
>>> f
<addinfourl at 140286118192624 whose fp = <socket._fileobject object at 0x7f96f39ed7d0>>

Important: f is a file-like response object. Through it we can read several details about the URL and the server's response, as listed below.

@@@  To get the response headers, including the name of the server hosting the page

>>> print f.info()

Server: Apache
ETag: "6082151bd56ea922e1357f5896a90d0a:1425454794"
Last-Modified: Wed, 04 Mar 2015 07:39:54 GMT
Accept-Ranges: bytes
Content-Length: 1020
Content-Type: text/html
Date: Mon, 10 Aug 2015 02:33:46 GMT
Connection: close
X-N: S

@@@  To get the URL from the f object

>>> print  f.geturl()
https://www.microsoft.com

@@@  To read the HTML source of the page

>>> print f.read()

<html><head><title>Microsoft Corporation</title><meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7"></meta><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta><meta name="SearchTitle" content="Microsoft.com" scheme=""></meta><meta name="Description" content="Get product information, support, and news from Microsoft." scheme=""></meta><meta name="Title" content="Microsoft.com Home Page" scheme=""></meta><meta name="Keywords" content="Microsoft, product, support, help, training, Office, Windows, software, download, trial, preview, demo,  business, security, update, free, computer, PC, server, search, download, install, news" scheme=""></meta><meta name="SearchDescription" content="Microsoft.com Homepage" scheme=""></meta></head><body><p>Your current User-Agent string appears to be from an automated process, if this is incorrect, please click this link:<a href="http://www.microsoft.com/en/us/default.aspx?redir=true">United States English Microsoft Homepage</a></p></body></html>
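
Note that the body above is a placeholder page: the server saw urllib's default User-Agent and treated the request as automated. A minimal sketch of preparing a request with a browser-like User-Agent follows. It assumes Python 3, where this lives in urllib.request (the session above is Python 2.7, where the comparable class is urllib2.Request). The helper name build_request and the User-Agent string are illustrative assumptions, not part of the original session.

```python
from urllib.request import Request

# Hypothetical User-Agent value, chosen only for illustration.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) example-scraper/0.1"


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a Request carrying a custom User-Agent header (nothing is sent yet)."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.microsoft.com")
# urllib stores header names capitalized, so the key becomes "User-agent".
print(req.get_header("User-agent"))
print(req.full_url)
```

Passing req to urllib.request.urlopen would send it. Whether the server then returns the full page depends on the server, so this sketch makes no promise about the result.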


@@@  You can also pick out individual headers from f.info()

>>> print f.info()['Server']
Apache
>>> print f.info()['Date']
Mon, 10 Aug 2015 02:33:46 GMT
>>> 
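
Header lookup can also be tried without a network call. In Python 3 the object returned by info() is an email-style message, so the sketch below parses a few of the header lines printed above with the standard email module. Building the headers from a string is an assumption made for illustration; a live response supplies them itself.

```python
from email import message_from_string

# Header lines copied from the f.info() output above.
RAW_HEADERS = (
    "Server: Apache\n"
    "Content-Length: 1020\n"
    "Content-Type: text/html\n"
    "Date: Mon, 10 Aug 2015 02:33:46 GMT\n"
    "\n"
)

headers = message_from_string(RAW_HEADERS)

print(headers["Server"])                      # exact header name
print(headers["server"])                      # lookup ignores case
print(headers.get("X-Missing", "not sent"))   # default for an absent header
print(int(headers["Content-Length"]))         # values arrive as strings
```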

@@@  To get the HTTP response code

>>> print f.code
200
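
The calls shown so far (info, geturl, read and the response code) can be gathered into one helper. This is a sketch for Python 3, where urlopen has moved to urllib.request. The function name summarize and the dictionary keys are assumptions made for illustration.

```python
from urllib.request import urlopen


def summarize(url):
    """Open `url` and collect the details the session above prints one by one."""
    with urlopen(url) as response:
        return {
            "final_url": response.geturl(),           # like f.geturl()
            "status": response.getcode(),             # like f.code
            "headers": dict(response.info().items()), # like f.info()
            "body": response.read(),                  # like f.read(), but bytes
        }
```

Calling summarize("https://www.microsoft.com") would perform the same request as the session above. Two differences from Python 2 are worth keeping in mind: the body comes back as bytes rather than str, and urlopen raises urllib.error.HTTPError for error statuses instead of returning the response.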

@@@  To URL-encode key/value pairs held in a dictionary

>>> d={1:"this",2:"tow j "}
>>> print urllib.urlencode(d)
1=this&2=tow+j+
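
For comparison, here is a sketch of the same encoding in Python 3, where urlencode lives in urllib.parse. The example.com URL is a placeholder, and parse_qs is included only to show that the encoding can be reversed.

```python
from urllib.parse import parse_qs, urlencode

# Same pairs as the dictionary d above; a list of tuples fixes the order.
pairs = [(1, "this"), (2, "tow j ")]

query = urlencode(pairs)
print(query)            # 1=this&2=tow+j+

# Appending the query string to a URL (example.com is a placeholder host).
url = "https://example.com/search?" + query
print(url)

# parse_qs reverses the encoding; keys and values come back as strings.
print(parse_qs(query))  # {'1': ['this'], '2': ['tow j ']}
```

Spaces become "+" and the keys are converted to strings, which is why the integer keys 1 and 2 come back as '1' and '2'.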
