Dissecting urllib2

November 20, 2010

Overview

The core classes in urllib2:
Request: a concrete URL request; it carries all of the information about the request and is not limited to the HTTP protocol.
OpenerDirector: works together with BaseHandler; by composing different handlers it can process different kinds of requests.
BaseHandler: the class that takes part in carrying out a request; every concrete handler inherits from it.

In urllib2 a request goes through three phases: request, open, and response.
request: build the Request object with all the information this request needs, such as the header fields of the HTTP protocol.
open: carry out the actual request; the Request object is handed to lower-level classes that complete the request and return a response.
response: post-process the returned response object.
There is also an error phase, but it is only triggered passively, when something goes wrong. From the caller's point of view the three phases look like the sketch after this list.
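A minimal sketch of the three phases as seen by a caller (the URL and the header value are placeholders of mine):

import urllib2

# request phase: build a Request object carrying everything about the request
req = urllib2.Request("http://example.com/", headers={"User-Agent": "demo"})

# open phase: the installed handlers do the real work and return a response
response = urllib2.urlopen(req)

# response phase: the (post-processed) result behaves like a file object
print response.code
print response.read()[:100]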

OpenerDirector

Because the concrete work of each request is done by handlers, and a single request may involve many of them, OpenerDirector is the class that ties them together: different handlers can be registered (added) on it to help process a request. Handler methods generally follow the naming convention protocol_request|open|response, matching the three phases for each protocol. Here is the code, with a few comments added.

class OpenerDirector:
    def __init__(self):
        # manage the individual handlers
        # all registered handlers
        self.handlers = []
        # methods registered for the different phases, keyed by protocol/kind
        self.handle_open = {}
        self.handle_error = {}
        self.process_response = {}
        self.process_request = {}

    # register a handler
    def add_handler(self, handler):
        # check for a BaseHandler method to make sure the handler inherits from BaseHandler
        if not hasattr(handler, "add_parent"):
            raise TypeError("expected BaseHandler instance, got %r" %
                            type(handler))

        # (handler inspection code omitted: it checks which phase methods the handler
        # defines, registers them in the dicts above, and sets `added`)

        # If the handler was registered successfully, add_parent (a BaseHandler method) is called,
        # so the handler can reach this OpenerDirector as self.parent; HTTPErrorProcessor relies on that
        if added:
            # the handlers must work in an specific order, the order
            # is specified in a Handler attribute
            bisect.insort(self.handlers, handler)
            handler.add_parent(self)

    def close(self):
        # Only exists for backwards compatibility.
        pass

    # call meth_name on every handler registered under `kind` in the given chain
    def _call_chain(self, chain, kind, meth_name, *args):
        # Handlers raise an exception if no one else should try to handle
        # the request, or return None if they can't but another handler
        # could.  Otherwise, they return the response.
        handlers = chain.get(kind, ())
        for handler in handlers:
            func = getattr(handler, meth_name)

            result = func(*args)
            if result is not None:
                return result

    # The core method: the three phases of a request are carried out here
    def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
        # accept a URL or a Request object
        if isinstance(fullurl, basestring):
            req = Request(fullurl, data)
        else:
            req = fullurl
            if data is not None:
                req.add_data(data)

        req.timeout = timeout
        protocol = req.get_type()

        # pre-process request
        # call the <protocol>_request method of every registered pre-processor
        meth_name = protocol+"_request"
        for processor in self.process_request.get(protocol, []):
            meth = getattr(processor, meth_name)
            req = meth(req)

        # the open phase
        response = self._open(req, data)

        # post-process response
        # call the <protocol>_response method of every registered post-processor
        meth_name = protocol+"_response"
        for processor in self.process_response.get(protocol, []):
            meth = getattr(processor, meth_name)
            response = meth(req, response)

        return response

    # The open phase itself has three sub-steps: default, protocol and unknown.
    # They are tried in that order; the first one that returns a result wins.
    def _open(self, req, data=None):
        result = self._call_chain(self.handle_open, 'default',
                                  'default_open', req)
        if result:
            return result

        protocol = req.get_type()
        result = self._call_chain(self.handle_open, protocol, protocol +
                                  '_open', req)
        if result:
            return result

        return self._call_chain(self.handle_open, 'unknown',
                                'unknown_open', req)

    # The error phase is passive: it dispatches to the error handlers registered in handle_error
    def error(self, proto, *args):
        # ... code omitted: look up self.handle_error and dispatch to the
        # matching http_error_<code> / <protocol>_error methods via _call_chain ...
        pass
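To see the naming convention in action, here is a small handler of my own (the class name, URL and print statements are illustrative, not part of urllib2). Because it defines http_request and http_response, add_handler registers it under process_request['http'] and process_response['http'], and open() then calls both methods for every HTTP request:

import urllib2

class LoggingProcessor(urllib2.BaseHandler):
    # handler_order decides where bisect.insort places this handler
    handler_order = 500

    def http_request(self, req):             # <protocol>_request: pre-process
        print "-> %s %s" % (req.get_method(), req.get_full_url())
        return req

    def http_response(self, req, response):  # <protocol>_response: post-process
        print "<- %d" % response.code
        return response

opener = urllib2.build_opener(LoggingProcessor)
opener.open("http://example.com/")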

Handler

urllib2 provides many handlers for different kinds of requests. The common ones such as HTTPHandler and FTPHandler are fairly easy to understand, so this section looks at HTTPCookieProcessor and HTTPRedirectHandler.

HTTPCookieProcessor handles cookies, which are indispensable for many requests that need authentication. The actual cookie logic lives in the cookielib module; this handler just calls into it, attaching cookies to the outgoing request during the request phase and extracting them from the response during the response phase.
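A minimal usage sketch, assuming a site that sets a session cookie on login (the URLs are placeholders):

import cookielib
import urllib2

# one CookieJar shared by every request made through this opener
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Set-Cookie headers from the first response are stored in cj,
# then sent back automatically on the second request
opener.open("http://example.com/login")
opener.open("http://example.com/profile")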

HTTPRedirectHandler handles 30x statuses. Let's go straight to the source; the English comments already explain it quite well.

class HTTPRedirectHandler(BaseHandler):
    # maximum number of redirections to any single URL
    # this is needed because of the state that cookies introduce
    max_repeats = 4
    # maximum total number of redirections (regardless of URL) before
    # assuming we're in a loop
    max_redirections = 10

    # Build a new Request for the redirect target URL, carrying over the headers of the current request
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        """Return a Request or None in response to a redirect.

        This is called by the http_error_30x methods when a
        redirection response is received.  If a redirection should
        take place, return a new Request to allow http_error_30x to
        perform the redirect.  Otherwise, raise HTTPError if no-one
        else should try to handle this url.  Return None if you can't
        but another Handler might.
        """
        m = req.get_method()
        if (code in (301, 302, 303, 307) and m in ("GET", "HEAD")
            or code in (301, 302, 303) and m == "POST"):
            # Strictly (according to RFC 2616), 301 or 302 in response
            # to a POST MUST NOT cause a redirection without confirmation
            # from the user (of urllib2, in this case).  In practice,
            # essentially all clients do redirect in this case, so we
            # do the same.
            # be conciliant with URIs containing a space
            newurl = newurl.replace(' ', '%20')
            newheaders = dict((k,v) for k,v in req.headers.items()
                              if k.lower() not in ("content-length", "content-type")
                             )
            return Request(newurl,
                           headers=newheaders,
                           origin_req_host=req.get_origin_req_host(),
                           unverifiable=True)
        else:
            raise HTTPError(req.get_full_url(), code, msg, headers, fp)

    # Implementation note: To avoid the server sending us into an
    # infinite loop, the request object needs to track what URLs we
    # have already seen.  Do this by adding a handler-specific
    # attribute to the Request object.
    # handle a 302 response
    def http_error_302(self, req, fp, code, msg, headers):
        # Some servers (incorrectly) return multiple Location headers
        # (so probably same goes for URI).  Use first header.
        # get the redirect target URL
        if 'location' in headers:
            newurl = headers.getheaders('location')[0]
        elif 'uri' in headers:
            newurl = headers.getheaders('uri')[0]
        else:
            return

        # fix a possible malformed URL
        urlparts = urlparse.urlparse(newurl)
        if not urlparts.path:
            urlparts = list(urlparts)
            urlparts[2] = "/"
        newurl = urlparse.urlunparse(urlparts)

        newurl = urlparse.urljoin(req.get_full_url(), newurl)

        # XXX Probably want to forget about the state of the current
        # request, although that might interact poorly with other
        # handlers that also use handler-specific request attributes
        # build the new request
        new = self.redirect_request(req, fp, code, msg, headers, newurl)
        if new is None:
            return

        # loop detection
        # .redirect_dict has a key url if url was previously visited.
        # Loop detection: record visited URLs in redirect_dict
        # and cap how many redirections are allowed
        if hasattr(req, 'redirect_dict'):
            visited = new.redirect_dict = req.redirect_dict
            if (visited.get(newurl, 0) >= self.max_repeats or
                len(visited) >= self.max_redirections):
                raise HTTPError(req.get_full_url(), code,
                                self.inf_msg + msg, headers, fp)
        else:
            visited = new.redirect_dict = req.redirect_dict = {}
        visited[newurl] = visited.get(newurl, 0) + 1

        # Don't close the fp until we are sure that we won't use it
        # with HTTPError.
        fp.read()
        fp.close()

        # fetch the new URL through the parent OpenerDirector
        return self.parent.open(new, timeout=req.timeout)

    # the other 30x codes are handled by the same method as 302
    http_error_301 = http_error_303 = http_error_307 = http_error_302

    inf_msg = "The HTTP server returned a redirect error that would " \
              "lead to an infinite loop.\n" \
              "The last 30x error message was:\n

Error handler

Error handling deserves a section of its own because of how it works: in urllib2, turning bad responses into errors is the job of the HTTPErrorProcessor handler.

class HTTPErrorProcessor(BaseHandler):
    """Process HTTP error responses."""
    handler_order = 1000  # after all other processing

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        # According to RFC 2616, "2xx" code indicates that the client's
        # request was successfully received, understood, and accepted.
        # Any status code outside 2xx is treated as an error and dispatched
        # through OpenerDirector.error() to whichever handler can deal with it
        if not (200 <= code < 300):
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)

        return response

    https_response = http_response
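Seen from the caller, this is why a non-2xx response from urlopen surfaces as an HTTPError: HTTPErrorProcessor hands the response to OpenerDirector.error(), which ends up in HTTPDefaultErrorHandler, and that handler raises. The URL below is just a placeholder:

import urllib2

try:
    urllib2.urlopen("http://example.com/does-not-exist")
except urllib2.HTTPError, e:
    # non-2xx status codes end up here
    print e.code, e.msg
except urllib2.URLError, e:
    # network-level failures (DNS errors, connection refused, ...) come here
    print e.reason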

urlopen, install_opener, build_opener

These are module-level functions of urllib2; the module keeps a global variable that holds an OpenerDirector instance.
urlopen simply calls that OpenerDirector instance's open method.
install_opener makes a given OpenerDirector instance the current (global) opener.
The key function is build_opener, which decides which handlers the OpenerDirector contains:

def build_opener(*handlers):
    """Create an opener object from a list of handlers.

    The opener will use several default handlers, including support
    for HTTP, FTP and when applicable, HTTPS.

    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    import types
    def isclass(obj):
        return isinstance(obj, types.ClassType) or hasattr(obj, "__bases__")

    opener = OpenerDirector()
    # handlers that are loaded by default;
    # if a subclass of one of them is passed in, the subclass replaces it
    default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                       HTTPDefaultErrorHandler, HTTPRedirectHandler,
                       FTPHandler, FileHandler, HTTPErrorProcessor]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = set()
    # find the default handlers that are being replaced
    for klass in default_classes:
        for check in handlers:
            # a handler argument may be either a class or an instance
            if isclass(check):
                if issubclass(check, klass):
                    skip.add(klass)
            elif isinstance(check, klass):
                skip.add(klass)
    # drop the replaced defaults
    for klass in skip:
        default_classes.remove(klass)
    # add the remaining default handlers
    for klass in default_classes:
        opener.add_handler(klass())
    # then add the handlers that were passed in
    for h in handlers:
        # instantiate if a class was passed
        if isclass(h):
            h = h()
        opener.add_handler(h)
    return opener
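Putting the three module-level functions together (the cookie handler and URL here are illustrative):

import cookielib
import urllib2

# build an opener from the defaults plus a cookie processor,
# then install it as the global opener used by urllib2.urlopen
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
urllib2.install_opener(opener)

# urlopen now goes through this opener and all of its handlers
print urllib2.urlopen("http://example.com/").read()[:200]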

Summary

Clearly urllib2 is quite extensible: the loose coupling between the opener and its handlers lets us add handlers for any other protocol. Here is an HTTPClient class that implements file upload (click to download); it uses the upload module from https://github.com/seisen/urllib2_file. That module conflicts with HTTPCookieProcessor, though, so I added two functions so the file-upload handler is only switched in when a file actually needs to be uploaded.
You can append the following to urllib2_file.py:

def install_FHandler():
    # swap urllib2's HTTPHandler for urllib2_file's upload-capable newHTTPHandler
    # and reset the cached global opener so the change takes effect
    urllib2._old_HTTPHandler = urllib2.HTTPHandler
    urllib2.HTTPHandler = newHTTPHandler
    urllib2._opener = None

def uninstall_FHandler():
    # restore the original HTTPHandler and reset the global opener again
    urllib2.HTTPHandler = urllib2._old_HTTPHandler
    urllib2._opener = None
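A hypothetical usage sketch, assuming urllib2_file's convention of passing a dict that contains an open file object as the POST data (check that module for the exact format; the URL and field names are placeholders):

install_FHandler()
try:
    # with newHTTPHandler installed, dict data containing a file object
    # is assumed to be sent as a multipart upload (per urllib2_file)
    data = {"description": "monthly report", "file": open("report.pdf", "rb")}
    urllib2.urlopen("http://example.com/upload", data)
finally:
    # restore the normal HTTPHandler so cookie handling keeps working
    uninstall_FHandler()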
