A Node.js Web Crawler

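There are plenty of web crawlers written for other languages and platforms, such as Python and Java. So how could we leave out our do-everything JavaScript 😂? The principle is very similar to the batch image download and configuration tool I built earlier for our product folks: the core is calling Node's http module.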

A web crawler basically consists of the following parts:

Program entry
Request module
Data parsing

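The program entry could be a web page, which could also display the crawled data and analysis results. But I want to focus on the core modules and not spend much effort on pages and styling, so I'll build a Node command-line tool instead. The most mature library for that is commander.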

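For the request module, I only want to crawl Baidu pages and Zhihu pages. Both are served over HTTPS. Fortunately Node's https and http modules are almost identical in functionality, so all that's needed is to get familiar with the API, which is easy.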

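For the data parsing module: the fetched page content is a string, so regular expressions could be used to match it, but that is too tedious. Is there a better way? What comes back is HTML, so wouldn't it be convenient to parse it the way jQuery manipulates the DOM? As it happens, there is a server-side jQuery-like library, cheerio.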
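For comparison, here is a minimal regex-only sketch of pulling titles and links out of result markup. The sample HTML and the extractTitles name are made up for illustration. It works on a small, fixed snippet like this one, but the pattern has to anticipate every attribute order and nesting variation, which is the tedium a DOM-style parser avoids.

```javascript
// Regex-only extraction, for comparison with a DOM-style parser.
// The sample markup and the extractTitles name are hypothetical.
function extractTitles(html) {
  var pattern = /<h3[^>]*>\s*<a[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/g;
  var results = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    results.push({
      link: match[1],
      // strip nested tags such as <em> from the title
      title: match[2].replace(/<[^<>]+>/g, '').trim()
    });
  }
  return results;
}

var sample =
  '<div class="result"><h3 class="t"><a href="https://example.com/a">' +
  'Learn <em>WebGL</em></a></h3></div>' +
  '<div class="result"><h3><a target="_blank" href="https://example.com/b">' +
  'WebGL basics</a></h3></div>';

console.log(extractTitles(sample));
```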

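Once the pages are fetched and the data has been extracted, the rest is simple: store it in a database or write it to a file. Let's start implementing the modules above.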

Program entry

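Start by configuring and initializing commander. For usage details, see the official documentation at https://www.npmjs.com/package/commander; I won't explain it further here.
First, add the following node to package.json. It registers a command-line command named "grab".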

"bin": {
  "grab": "bin/grab.js"
},

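Next, define the commander program in grab.js (as an npm bin script, the file needs #!/usr/bin/env node as its first line). After that we can run the command as "grab baidu <keyword>"; bd can be used as shorthand for baidu. The Zhihu command is defined the same way as the Baidu one, so it isn't repeated here.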

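// commander 2.x style import; on recent versions use require('commander').program instead
var program = require('commander');
// `request` used below is this project's own request module (see the next section)

program
    // .allowUnknownOption() // do not report unknown options as errors
    .version('0.0.1')
    .usage('This is my web crawler 😎'
      +'\n  grab [option]'
      +'\n    bd baidu: baidu search'
      +'\n    zh zhihu: zhihu search');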

program
    .command('baidu <cmd>')
    .alias('bd')
    .description('search baidu')
    .option("-t, --tieba", "baidu tieba")
    .action(function(cmd, options){
      console.log('baidu search "%s":', cmd);
      request.baiduSearch(cmd);
    }).on('--help', function() {
      console.log('  Examples:');
      console.log();
      console.log('    grab bd    <cmd>');
      console.log('    grab baidu <cmd>');
      console.log();
    });

program.parse(process.argv);

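Request module

The https module offers two main ways to issue a request. Both are lightly wrapped here:

The get approach is for simple requests: pass a URL and it issues a GET request. This is all Zhihu needs.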

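var https = require('https');

// Simple GET helper: fetches url and passes the complete response body to callback.
function get(url,callback) {
    return https.get(url,function(response) {
        var body = '';

        // Decode as UTF-8 so multi-byte characters split across chunks stay intact.
        response.setEncoding('utf8');

        response.on('data', function(data) {
            body += data;
        });

        response.on('end', function() {
            callback(body);
        });
    }).on('error', function(e) {
        console.log('problem with request: ' + e.message);
    });
}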
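One detail matters when accumulating chunks of a Chinese page: if the chunks arrive as Buffers, a multi-byte UTF-8 character can be split across two of them, and converting each chunk to a string on its own corrupts it. The sketch below simulates this with Node's built-in string_decoder module, which is what a response stream uses once setEncoding('utf8') has been called. The two-chunk split is contrived for the demonstration.

```javascript
var StringDecoder = require('string_decoder').StringDecoder;

// Simulate a body that arrives in two chunks, with the split falling
// in the middle of a three-byte UTF-8 character.
var text = '知乎';
var bytes = Buffer.from(text, 'utf8'); // 6 bytes in total
var chunks = [bytes.subarray(0, 2), bytes.subarray(2)];

// Naive accumulation: each Buffer is converted to a string on its own.
var naive = '';
chunks.forEach(function (chunk) { naive += chunk; });

// Decoder-based accumulation: incomplete sequences are held back until
// the remaining bytes arrive.
var decoder = new StringDecoder('utf8');
var decoded = '';
chunks.forEach(function (chunk) { decoded += decoder.write(chunk); });
decoded += decoder.end();

console.log(naive === text);   // false
console.log(decoded === text); // true
```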

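The request approach can issue POST requests as well as GET requests, and it also lets you change the port and the request headers. It is mainly for the Baidu crawler, which is more restrictive: Baidu requires request headers to be set (the code below sends a browser User-Agent), and its query parameters are more involved and need dedicated configuration. See resources online for the details.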

function request(options,callback){
    // var postData = qs.stringify({});
    var body = '', // start empty; an uninitialized body would prefix the result with "undefined"
    req = https.request(options, (res) => {
        console.log('STATUS: ' + res.statusCode);
        // console.log('HEADERS: ' + JSON.stringify(res.headers));
        res.setEncoding('utf8');
        res.on('data', function (chunk) {
            body+=chunk;
        });
        res.on('end',function(){
            callback(body)
        });
    });

    req.on('error', function(e) {
        console.log('problem with request: ' + e.message);
    });

    // write data to request body
    // req.write(postData);
    req.end();
}

function baiduRequset(pageNo,pageSize,keyword){
    var path='/s?'+qs.stringify({
        ie:'utf-8',
        f:8,
        rsv_bp:1,
        tn:'baidu',
        rn:pageSize,
        pn:pageNo*pageSize,
        wd:keyword
    }),
    options = {
        hostname: 'www.baidu.com',
        port: 443,
        path: path,
        method: 'GET',
        headers: {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
        }
    };

    request(options,function(body){
        saveFile(pageNo,keyword,body);
        showBaiduResult(pageNo,body);
    });
}
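The paging logic is easier to check when the option building is separated from the network call. The following is a minimal sketch under some assumptions: buildBaiduOptions is a hypothetical name, qs is taken to be Node's built-in querystring module, the User-Agent is a placeholder, and only a subset of the query parameters above is included. As in the code above, pageNo is zero-based and pn is computed as an offset (pageNo * pageSize) rather than a page number.

```javascript
var qs = require('querystring');

// Hypothetical helper: builds https.request options for one results page.
// pageNo is zero-based; pn is the offset of the first result on that page.
function buildBaiduOptions(pageNo, pageSize, keyword) {
  return {
    hostname: 'www.baidu.com',
    port: 443,
    method: 'GET',
    path: '/s?' + qs.stringify({
      ie: 'utf-8',
      tn: 'baidu',
      rn: pageSize,           // results per page
      pn: pageNo * pageSize,  // offset of the first result
      wd: keyword             // percent-encoded by qs.stringify
    }),
    headers: { 'User-Agent': 'Mozilla/5.0' } // placeholder value
  };
}

console.log(buildBaiduOptions(2, 10, 'webgl').path);
// /s?ie=utf-8&tn=baidu&rn=10&pn=20&wd=webgl
```

Because qs.stringify percent-encodes values, a Chinese keyword can be passed in directly without any extra escaping.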

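Data parsing

Once the data is fetched, all we need to do is call cheerio, pull out the results the way jQuery reads DOM content, and display them. They could of course also be saved to a file or a database.

/**
 * Display the search results
 * @param  {number} pageNo zero-based page index
 * @param  {string} body   HTML of the results page
 * @return {undefined}     nothing; results are printed to the console
 */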
function showBaiduResult(pageNo,body){
  var title,summary,link,
      reg=/<[^<>]+>/g,
      $ = cheerio.load(body,{decodeEntities: false});

  $('#content_left .result').each(function(i,item){
    var $a = $(item).find('h3 a');
    title = $a.html();
    link = $a.attr('href');
    summary=$(item).find('.c-abstract').html();
    if(title){
      console.log(`Page ${pageNo+1}, result ${i+1}`);
      console.log(`link: ${link}`.green);
      // console.log(`title: ${title}`);
      console.log('title: ');
      ouputColor(title);
      if(summary){
        // console.log(`summary: ${summary}`);
        console.log('summary: ');
        ouputColor(summary);
      }
    }
    console.log('------------------------------');
    console.log('');
  });
}

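// Zhihu
exports.zhihuSearch=function(keyword,cb){
  // encode the keyword so non-ASCII input (e.g. Chinese) still forms a valid URL
  get('https://www.zhihu.com/search?type=content&q='+encodeURIComponent(keyword),function(content){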
    var title,summary;
    var $ = cheerio.load(content,{decodeEntities: false});
    saveFile(0,keyword,content);
    $('.list .item').each(function(i,item){
      title=$(item).find('.js-title-link').html();
      summary=$(item).find('.summary').html();
      if(title){
        // title=(''+title).replace(/<[^<>]+>/g,'');
        // summary=(''+summary).replace(/<.+>/g,'');
        console.log('title: ');
        ouputColor(title);
        if(summary){
          console.log('summary: ');
          ouputColor(summary);
        }
      }
      console.log('------------------------------');
      console.log('');
    });
  });
};
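The ouputColor helper called above is not shown in this post. Below is one possible stand-in, a sketch only: it assumes matched keywords are wrapped in em tags in the extracted HTML, colors them with an ANSI escape sequence, and strips every other tag. The highlightToAnsi name and the choice of red are hypothetical.

```javascript
// Hypothetical stand-in for an ouputColor-style helper; the original
// implementation is not shown, so this is only one possible approach.
var RED = '\u001b[31m';
var RESET = '\u001b[0m';

function highlightToAnsi(html) {
  return String(html)
    // color the highlighted keywords
    .replace(/<em\b[^>]*>([\s\S]*?)<\/em>/g, RED + '$1' + RESET)
    // drop any remaining tags
    .replace(/<[^<>]+>/g, '')
    .trim();
}

console.log(highlightToAnsi('Learn <em>WebGL</em> in <b>one</b> day'));
```

In a terminal that supports ANSI colors, the keyword is printed in red and the rest of the line as plain text.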

Running the crawler

With the functionality finished, let's first try crawling Zhihu:

grab zh webgl

The fetched HTML files are saved in the download folder, and the results are shown on the command line at the same time.
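The saveFile helper called in the code above is not shown in this post. A minimal sketch of what such a helper might look like follows. The file naming scheme and the optional dir argument are assumptions made for this illustration, not the original implementation.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical sketch of a saveFile-style helper; the original is not shown.
// The naming scheme and the optional `dir` argument are assumptions.
function saveFile(pageNo, keyword, body, dir) {
  dir = dir || path.join(process.cwd(), 'download');
  // create the folder if it is missing (recursive option needs Node 10.12+)
  fs.mkdirSync(dir, { recursive: true });
  // keep the keyword safe to use as part of a file name
  var safeKeyword = String(keyword).replace(/[\\\/:*?"<>|\s]+/g, '_');
  var file = path.join(dir, safeKeyword + '_' + (pageNo + 1) + '.html');
  fs.writeFileSync(file, body, 'utf8');
  return file;
}

// Example: saveFile(0, 'webgl', html) would write download/webgl_1.html
// under the current working directory.
```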

To run the Baidu crawler, use the following command:

grab bd webgl

Summary

What's implemented here is only the most basic crawler functionality. For the code, see net_grab.
