PHP can write crawlers too: Guzzle is a powerful third-party HTTP client library for PHP, and Goutte parses HTML documents

PHP 2022-08-04 21:20:48

PHP can write crawlers too

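When crawlers come up, most people think of Python first: powerful third-party libraries such as requests and bs4 make it the popular choice for writing them. But PHP, as "the best language in the world", can of course be used to build crawlers too.
I wrote a small crawler that scrapes articles from the community site (source code).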

Prerequisites

Composer
Guzzle: a powerful third-party HTTP client library for PHP; it can be installed with Composer
Goutte: a third-party library for parsing HTML documents; it can be installed with Composer
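
To make the role of an HTML parser concrete, here is a minimal sketch of the kind of extraction the crawler performs later (picking out links that sit directly inside h2 headings). It deliberately uses PHP's built-in DOMDocument and DOMXPath as a stand-in, not Goutte's API, so it runs without any packages; the helper name extractHeadingLinks and the sample HTML are assumptions made for illustration.

```php
<?php
// Stand-in for Goutte: PHP's bundled DOM extension, not the Goutte API.
// extractHeadingLinks() is a hypothetical helper name.
function extractHeadingLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    // "//h2/a" is the XPath counterpart of the CSS selector "h2 > a".
    foreach ($xpath->query('//h2/a') as $anchor) {
        $links[] = [
            'title' => trim($anchor->textContent),
            'href'  => $anchor->getAttribute('href'),
        ];
    }
    return $links;
}

// Sample markup (made up): one heading link and one unrelated link.
$html = '<html><body><h2><a href="/articles/1">First post</a></h2>'
      . '<p><a href="/about">About</a></p></body></html>';
print_r(extractHeadingLinks($html));
```

Only the link inside the h2 is returned; the link in the paragraph is ignored. Goutte offers the same idea through CSS selectors, which is what the crawler below relies on.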

Getting started

1. Install the two libraries

Goutte: composer require fabpot/goutte
Guzzle: composer require guzzlehttp/guzzle:~6.0

2. Create the command

php artisan make:command Spider

3. Command arguments

protected $signature = 'command:spider {concurrency} {keyWords*}'; // concurrency: number of concurrent requests; keyWords: one or more search keywords
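
With this signature, an invocation such as `php artisan command:spider 5 laravel php` should give the command a concurrency of 5 and the keywords laravel and php, because the trailing `*` makes keyWords an array argument. The sketch below only mimics that mapping in plain PHP to show the shape of the data; it is not Laravel's parser, and splitSpiderArguments is a hypothetical helper name.

```php
<?php
// Hypothetical helper that mimics how the signature
// 'command:spider {concurrency} {keyWords*}' splits positional arguments.
// This is an illustration only, not Laravel's own argument parser.
function splitSpiderArguments(array $args): array
{
    if (count($args) < 2) {
        throw new InvalidArgumentException(
            'Expected a concurrency value followed by at least one keyword.'
        );
    }

    // The first positional value fills {concurrency}. Casting to int is a
    // choice made in this sketch; command-line values arrive as strings.
    $concurrency = (int) array_shift($args);

    // Every remaining value is collected into the {keyWords*} array.
    return ['concurrency' => $concurrency, 'keyWords' => array_values($args)];
}

// Equivalent of: php artisan command:spider 5 laravel php
print_r(splitSpiderArguments(['5', 'laravel', 'php']));
```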

4. Write the crawler

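I wrote a simple one, mainly for scraping the community's articles: it reads the search keywords from the command-line arguments, crawls the matching articles, and saves their content locally. Here is the code.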

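<?php

namespace App\Console\Commands;

use Goutte\Client as GoutteClient;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Pool;
use Illuminate\Console\Command;
use Illuminate\Support\Facades\Storage;

class Spider extends Command
{
    protected $signature = 'command:spider {concurrency} {keyWords*}';

    public function handle()
    {
        $concurrency = (int) $this->argument('concurrency');  // number of concurrent requests
        $keyWords = $this->argument('keyWords');    // search keywords
        $guzzleClient = new GuzzleClient();
        $client = new GoutteClient();
        $client->setClient($guzzleClient);   // let Goutte send its requests through Guzzle
        $request = function () use ($client, $keyWords) {
            foreach ($keyWords as $key) {
                // URL-encode the keyword so spaces and non-ASCII text survive
                $url = 'https://laravel-china.org/search?q=' . urlencode($key);
                yield function () use ($client, $url) {
                    return $client->request('GET', $url);   // returns a Crawler for the search page
                };
            }
        };
        $pool = new Pool($guzzleClient, $request(), [
            'concurrency' => $concurrency,
            // $response is the Crawler returned by the closure above
            'fulfilled' => function ($response, $index) use ($client) {
                $response->filter('h2 > a')->each(function ($node) use ($client) {
                    // only follow links that carry no title attribute
                    if ((string) $node->attr('title') === '') {
                        $title = $node->text();             // article title
                        $link = $node->attr('href');        // article link
                        $crawler = $client->request('GET', $link);       // open the article
                        $content = $crawler->filter('#emojify')->first()->text();     // extract the content
                        Storage::disk('local')->put($title, $content);           // save it locally
                    }
                });
            },
            'rejected' => function ($reason, $index) {
                $this->error("Error is " . $reason);
            }
        ]);
        // start crawling
        $promise = $pool->promise();
        $promise->wait();
    }
}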
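
One thing worth noting about the code above: as far as I can tell, Goutte's request() is synchronous, so each closure handed to the pool has already finished its HTTP call by the time the pool sees the result, and the concurrency setting has little effect. A possible variation, sketched below under stated assumptions, lets Guzzle issue the search requests asynchronously with getAsync() and parses each response with Symfony's DomCrawler (the component Goutte builds on). It assumes guzzlehttp/guzzle, symfony/dom-crawler and symfony/css-selector are installed and Composer's autoloader is loaded; buildSearchUrl and crawlSearchPages are hypothetical helper names, not part of the original command.

```php
<?php
// Assumes Composer's autoloader has been loaded before crawlSearchPages() is called.
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\DomCrawler\Crawler;

// Hypothetical helper: builds the search URL used in the article.
function buildSearchUrl(string $keyword): string
{
    return 'https://laravel-china.org/search?q=' . urlencode($keyword);
}

// Hypothetical helper: fetches all search pages through a Guzzle pool and
// returns the "h2 > a" titles and hrefs, keyed by the keyword's index.
function crawlSearchPages(Client $client, array $keywords, int $concurrency): array
{
    $results = [];

    $requests = function () use ($client, $keywords) {
        foreach ($keywords as $keyword) {
            // Yield a closure that returns a promise, so the pool decides
            // when each request actually starts.
            yield function () use ($client, $keyword) {
                return $client->getAsync(buildSearchUrl($keyword));
            };
        }
    };

    $pool = new Pool($client, $requests(), [
        'concurrency' => $concurrency,
        'fulfilled' => function (ResponseInterface $response, $index) use (&$results) {
            // Parse the raw HTML body once the response has arrived.
            $crawler = new Crawler((string) $response->getBody());
            $results[$index] = $crawler->filter('h2 > a')->each(function (Crawler $node) {
                return ['title' => trim($node->text()), 'href' => $node->attr('href')];
            });
        },
        'rejected' => function ($reason, $index) use (&$results) {
            // A failed request simply yields no links in this sketch.
            $results[$index] = [];
        },
    ]);

    // Block until every request in the pool has settled.
    $pool->promise()->wait();

    return $results;
}
```

Inside the command, this could be called as crawlSearchPages(new Client(), $keyWords, $concurrency), with the article pages fetched in a second pool in the same way. Whether that is worth the extra code depends on how many keywords and articles are involved.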