The Rise of Async: aiohttp vs requests

As a crawler library, requests is synchronous: it handles one request at a time. You can of course speed it up by launching threads or processes, but that wastes machine resources. Now that async has taken off, aiohttp's performance is very satisfying: it is an asynchronous, concurrent HTTP client with async/await support built on asyncio, so it can perform concurrent I/O in a single thread.
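The single-thread concurrency idea can be sketched without any HTTP at all. In this minimal example, `fake_fetch` and the 0.2 s delay are made up for illustration; `asyncio.sleep` stands in for a network wait:

```python
import asyncio
import time

async def fake_fetch(i):
    # simulate a 0.2 s network wait; while one coroutine sleeps,
    # the event loop runs the others on the same thread
    await asyncio.sleep(0.2)
    return i

async def main():
    start = time.perf_counter()
    # ten 0.2 s "requests" run concurrently, so the total is ~0.2 s, not 2 s
    results = await asyncio.gather(*(fake_fetch(i) for i in range(10)))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))
```

The wall-clock time stays close to a single wait, which is the whole point of single-thread concurrent I/O.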

Let's run a crawl test against Sina News: 50 articles per run.

Test seed URL: https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1&r=

Note: the parameters of this URL make it return a JSON payload containing 50 news entries (num=50) for one page.
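For illustration, the same query string can be rebuilt from a parameter dict with the standard library (the dict below simply mirrors the parameters visible in the URL above):

```python
from urllib.parse import urlencode

# parameters copied from the seed URL: num is entries per response, page selects the page
params = {'pageid': 153, 'lid': 2509, 'k': '', 'num': 50, 'page': 1, 'r': ''}
url = 'https://feed.mix.sina.com.cn/api/roll/get?' + urlencode(params)
print(url)
```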


  • First, the requests version
import requests, time, random, json
from lxml import etree

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'referer': 'https://news.sina.com.cn/roll/',
}

test_url = 'https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1&r={}'

def requests_func():
    # fetch the news list; the random value fills the r= cache-busting parameter
    res = requests.get(test_url.format(random.random()), headers=header)
    json_data = json.loads(res.text)
    data_list = json_data['result']['data']
    # fetch each article page one by one, synchronously
    for index, data in enumerate(data_list, start=1):
        url = data['url']
        res = requests.get(url, headers=header)
        page = etree.HTML(res.content.decode('utf8'))
        title = page.xpath('//title')[0].text
        print(index, ':', title, url)

if __name__ == '__main__':
    start = time.time()
    requests_func()
    end = time.time()
    print('requests elapsed:', end - start)

From the timing output, it took about 10 seconds to crawl all 50 articles, which is reasonably fast. Take note of the sequential numbering in the output; we will compare it against the async runs below.
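The thread-based speed-up mentioned at the start could look roughly like this; `fetch_title` is a hypothetical stand-in for the real `requests.get` call plus title extraction, with the sleep simulating a 0.2 s network wait:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_title(url):
    # hypothetical stand-in for requests.get(url) + XPath title extraction;
    # the sleep simulates a 0.2 s network wait
    time.sleep(0.2)
    return 'title of ' + url

urls = ['https://example.com/news/%d' % i for i in range(20)]

start = time.perf_counter()
# 10 worker threads overlap the waits: ~0.4 s instead of ~4 s sequentially
with ThreadPoolExecutor(max_workers=10) as pool:
    titles = list(pool.map(fetch_title, urls))
elapsed = time.perf_counter() - start
print(len(titles), round(elapsed, 2))
```

This works, but each thread costs memory and scheduling overhead, which is exactly the waste the async approach below avoids.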


  • Next, the aiohttp version
import asyncio, aiohttp, time, random, json
from lxml import etree

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'referer': 'https://news.sina.com.cn/roll/',
}

test_url = 'https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1&r={}'

async def aiohttp_func():
    async with aiohttp.ClientSession() as session:
        async with session.get(test_url.format(random.random()), headers=header) as res:
            html = await res.read()
            json_data = json.loads(html)
            data_list = json_data['result']['data']
            # still awaiting each article in turn -- no real concurrency yet
            for index, data in enumerate(data_list, start=1):
                url = data['url']
                async with session.get(url, headers=header) as page_res:
                    html = await page_res.read()
                page = etree.HTML(html)
                title = page.xpath('//title')[0].text
                print(index, ':', title, url)

if __name__ == '__main__':
    start = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(aiohttp_func())
    end = time.time()
    print('aiohttp elapsed:', end - start)

Whoa, more than twice as fast. Faster, yes, but not as impressive as advertised? That is because the article pages are still being awaited one after another. It doesn't end there: behold the power of ensure_future!


  • The version using ensure_future
import asyncio, aiohttp, time, random, json
from lxml import etree

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'referer': 'https://news.sina.com.cn/roll/',
}

test_url = 'https://feed.mix.sina.com.cn/api/roll/get?pageid=153&lid=2509&k=&num=50&page=1&r={}'

tasks = []

async def get_data(url, index):
    # NOTE: opening a ClientSession per request works, but reusing one shared session is cheaper
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=header) as res:
            html = await res.read()
            page = etree.HTML(html)
            title = page.xpath('//title')[0].text
            print(index, ':', title, url)

async def aiohttp_func_future():
    async with aiohttp.ClientSession() as session:
        async with session.get(test_url.format(random.random()), headers=header) as res:
            html = await res.read()
            json_data = json.loads(html)
            data_list = json_data['result']['data']
            # schedule every article fetch immediately instead of awaiting each one
            for index, data in enumerate(data_list, start=1):
                task = asyncio.ensure_future(get_data(data['url'], index))
                tasks.append(task)

if __name__ == '__main__':
    start = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(aiohttp_func_future())
    # wait for all the scheduled article fetches to finish
    loop.run_until_complete(asyncio.gather(*tasks))
    end = time.time()
    print('aiohttp elapsed:', end - start)

This speed is downright scary: under one second. Async concurrency is astonishing. Compare the numbering with the requests output and you will notice that the async version does not finish the articles in their original order.
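The out-of-order prints happen because each task prints whenever its own I/O completes. If order matters, note that `asyncio.gather` returns results in the order the coroutines were passed in, regardless of completion order; a small sketch with simulated random delays (no network involved):

```python
import asyncio, random

async def fetch(i):
    # random delay: tasks complete in arbitrary order...
    await asyncio.sleep(random.uniform(0, 0.1))
    return i

async def main():
    # ...but gather yields the results in submission order
    return await asyncio.gather(*(fetch(i) for i in range(10)))

order = asyncio.run(main())
print(order)
```

So collecting return values through `gather`, rather than printing inside each task, recovers the original ordering for free.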

When applied to network operations, asynchronous code performs superbly. In fairness, though, the aiohttp version takes noticeably more code than requests, and faster does not automatically mean better: it depends on the actual use case. When you need callbacks, for example, aiohttp demands extra plumbing, and used poorly it becomes a burden instead of a help. Picking the tool that fits the real requirement is always the best choice...
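The callback plumbing mentioned above looks roughly like this with `Task.add_done_callback`; `fetch` and its delay are simulated here so the sketch runs without a network:

```python
import asyncio

results = []

async def fetch(i):
    await asyncio.sleep(0.05)   # simulated I/O wait
    return i * 2

def on_done(task):
    # a done-callback receives the finished Task and must unwrap the result itself
    results.append(task.result())

async def main():
    tasks = [asyncio.ensure_future(fetch(i)) for i in range(5)]
    for t in tasks:
        t.add_done_callback(on_done)
    await asyncio.gather(*tasks)

asyncio.run(main())
print(sorted(results))
```

The callback style forces you to thread state through module-level containers or closures, which is part of why aiohttp code grows faster than its requests equivalent.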


If this article inspired you, or you have questions or suggestions about its features or methods, leave a comment below or contact me directly. Thanks for reading and for your support!
