Proxy Scraper

A PHP library for scraping lists of free proxies.

Quick start

composer require vantoozz/proxy-scraper:~2 php-http/guzzle6-adapter hanneskod/classtools

<?php declare(strict_types = 1);

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

foreach (proxyScraper()->get() as $proxy) {
    echo $proxy . "\n";
}

Older versions

This is version 2 of the library. For version 1, please check the v1 branch.

Upgrade

For how to upgrade, see the "Upgrade from version 1" section below.

Setup

The library requires a PSR-18 compatible HTTP client. To use the library, install one of them first, e.g.:

composer require php-http/guzzle6-adapter

All available clients are listed on Packagist: https://packagist.org/providers/psr/http-client-implementation.

Then install the proxy-scraper library itself:

composer require vantoozz/proxy-scraper:~2

Usage

Auto-configuration

The simplest way to start using the library is the proxyScraper() function, which instantiates and configures all the scrapers.

Please note that, in addition to php-http/guzzle6-adapter, the auto-configuration function requires the hanneskod/classtools dependency.

composer require php-http/guzzle6-adapter hanneskod/classtools

<?php declare(strict_types = 1);

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

foreach (proxyScraper()->get() as $proxy) {
    echo $proxy . "\n";
}

HTTP Client

You can override the default parameters of the HTTP client like this:

<?php declare(strict_types=1);

use GuzzleHttp\Client as GuzzleClient;
use Http\Adapter\Guzzle6\Client as HttpAdapter;
use Http\Message\MessageFactory\GuzzleMessageFactory as MessageFactory;
use Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient;

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = new Psr18HttpClient(
    new HttpAdapter(new GuzzleClient([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])),
    new MessageFactory
);

foreach (proxyScraper($httpClient)->get() as $proxy) {
    echo $proxy . "\n";
}

You can also configure the scraper and the underlying HTTP client manually:

Single scraper

<?php declare(strict_types = 1);

use GuzzleHttp\Client as GuzzleClient;
use Http\Adapter\Guzzle6\Client as HttpAdapter;
use Http\Message\MessageFactory\GuzzleMessageFactory as MessageFactory;
use Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient;
use Vantoozz\ProxyScraper\Scrapers;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = new Psr18HttpClient(
    new HttpAdapter(new GuzzleClient([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])),
    new MessageFactory
);

$scraper = new Scrapers\FreeProxyListScraper($httpClient);

foreach ($scraper->get() as $proxy) {
    echo $proxy . "\n";
}

Composite scraper

You can easily get data from many scrapers at once:

<?php declare(strict_types = 1);

use GuzzleHttp\Client as GuzzleClient;
use Http\Adapter\Guzzle6\Client as HttpAdapter;
use Http\Message\MessageFactory\GuzzleMessageFactory as MessageFactory;
use Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient;
use Vantoozz\ProxyScraper\Scrapers;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = new Psr18HttpClient(
    new HttpAdapter(new GuzzleClient([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])),
    new MessageFactory
);

$compositeScraper = new Scrapers\CompositeScraper;

$compositeScraper->addScraper(new Scrapers\FreeProxyListScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\CoolProxyScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\SocksProxyScraper($httpClient));

foreach ($compositeScraper->get() as $proxy) {
    echo $proxy . "\n";
}
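
A composite scraper can yield many proxies, and you may only need a few. Since the ScraperInterface examples in this README declare get() as returning a \Generator, one option is to stop iterating early. The sketch below is plain PHP and does not use the library: take() is a hypothetical helper, and the anonymous generator stands in for $compositeScraper->get(). Whether a real scraper fetches its data lazily is not covered here, so treat this as a way to limit consumption rather than a guarantee about network calls.

```php
<?php declare(strict_types=1);

/**
 * Yields at most $limit items from any iterable, then stops.
 * Hypothetical helper, not part of the library.
 */
function take(iterable $items, int $limit): \Generator
{
    if ($limit <= 0) {
        return;
    }
    foreach ($items as $item) {
        yield $item;
        // Stop right after the last wanted item so the source is not advanced further.
        if (--$limit === 0) {
            return;
        }
    }
}

// Stand-in for $compositeScraper->get(); a real scraper yields Proxy objects.
$fakeScraper = (function (): \Generator {
    yield '192.0.2.1:8080';
    yield '192.0.2.2:8080';
    yield '192.0.2.3:8080';
})();

// false = do not preserve generator keys, just collect the values.
$firstTwo = iterator_to_array(take($fakeScraper, 2), false);

foreach ($firstTwo as $proxy) {
    echo $proxy . "\n";
}
```

With the library installed, the same helper could presumably wrap the real generator, e.g. take($compositeScraper->get(), 10).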

Error handling

Sometimes things go wrong. This example shows how to handle errors while getting data from many scrapers:

<?php declare(strict_types = 1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;

require_once __DIR__ . '/vendor/autoload.php';

$compositeScraper = new Scrapers\CompositeScraper;

// Set exception handler
$compositeScraper->handleScraperExceptionWith(function (ScraperException $e) {
    echo 'An error occurs: ' . $e->getMessage() . "\n";
});

// Fake scraper throwing an exception
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface
{
    public function get(): \Generator
    {
        throw new ScraperException('some error');
    }
});

// Fake scraper with no exceptions
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface
{
    public function get(): \Generator
    {
        yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
    }
});

// Run composite scraper
foreach ($compositeScraper->get() as $proxy) {
    echo $proxy . "\n";
}

This will output:

An error occurs: some error
192.168.0.1:8888
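
The output above shows that a failing scraper is reported to the handler while the remaining scrapers still run. As a rough illustration of that pattern, here is a self-contained sketch in plain PHP. It is not the library's CompositeScraper code: drainAll() and the two closures are hypothetical stand-ins, and \RuntimeException is used in place of ScraperException so the snippet runs without the library.

```php
<?php declare(strict_types=1);

/**
 * Iterates several sources in turn. If a source throws, the exception is
 * passed to $onError and iteration continues with the next source.
 * Illustrative only; this is not the library's implementation.
 *
 * @param callable[] $sources each callable returns an iterable
 */
function drainAll(array $sources, callable $onError): \Generator
{
    foreach ($sources as $source) {
        try {
            yield from $source();
        } catch (\RuntimeException $e) {
            $onError($e);
        }
    }
}

$sources = [
    // Fails immediately, like the first fake scraper above.
    function (): iterable {
        throw new \RuntimeException('some error');
    },
    // Yields one item, like the second fake scraper above.
    function (): iterable {
        yield '192.168.0.1:8888';
    },
];

// Collect error messages instead of echoing them.
$errors = [];
$onError = function (\RuntimeException $e) use (&$errors): void {
    $errors[] = $e->getMessage();
};

// false = ignore keys, since "yield from" reuses the inner keys.
$proxies = iterator_to_array(drainAll($sources, $onError), false);

foreach ($errors as $message) {
    echo 'An error occurs: ' . $message . "\n";
}
foreach ($proxies as $proxy) {
    echo $proxy . "\n";
}
```

Capturing $errors by reference in the closure is one way to keep the messages for later logging; a similar closure could be passed to handleScraperExceptionWith().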

In the same way, you can configure exception handling for the scraper created by the proxyScraper() function, since it returns an instance of CompositeScraper:

<?php declare(strict_types = 1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = proxyScraper();

$scraper->handleScraperExceptionWith(function (ScraperException $e) {
    echo 'An error occurs: ' . $e->getMessage() . "\n";
});

Validating proxies

Scraped proxies can be checked with a validator pipeline, to which you add validation steps:

<?php declare(strict_types = 1);

use Vantoozz\ProxyScraper\Exceptions\ValidationException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;
use Vantoozz\ProxyScraper\Validators;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new class implements Scrapers\ScraperInterface
{
    public function get(): \Generator
    {
        yield new Proxy(new Ipv4('104.202.117.106'), new Port(1234));
        yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
    }
};

$validator = new Validators\ValidatorPipeline;
$validator->addStep(new Validators\Ipv4RangeValidator);

foreach ($scraper->get() as $proxy) {
    try {
        $validator->validate($proxy);
        echo '[OK] ' . $proxy . "\n";
    } catch (ValidationException $e) {
        echo '[Error] ' . $e->getMessage() . ': ' . $proxy . "\n";
    }
}

This will output:

[OK] 104.202.117.106:1234
[Error] IPv4 is in private range: 192.168.0.1:8888
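
The internals of Ipv4RangeValidator are not shown in this README. To give a feel for the kind of check involved, the following sketch uses PHP's built-in filter_var() to reject private and reserved IPv4 ranges. isPublicIpv4() is a hypothetical helper written for illustration, and the library's validator may apply different rules.

```php
<?php declare(strict_types=1);

/**
 * Returns true when $ip is a syntactically valid IPv4 address outside
 * the private and reserved ranges. Hypothetical helper, not library code.
 */
function isPublicIpv4(string $ip): bool
{
    return false !== filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_IPV4 | FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    );
}

// The same two addresses as in the validation example above.
foreach (['104.202.117.106', '192.168.0.1'] as $ip) {
    echo (isPublicIpv4($ip) ? '[OK] ' : '[Error] ') . $ip . "\n";
}
```

A check like this only looks at the address itself. It says nothing about whether the proxy is reachable or working.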

Metrics

A Proxy object may have metrics (metadata) associated with it.

By default, a Proxy object has a source metric:

<?php declare(strict_types = 1);

use GuzzleHttp\Client as GuzzleClient;
use Http\Adapter\Guzzle6\Client as HttpAdapter;
use Http\Message\MessageFactory\GuzzleMessageFactory as MessageFactory;
use Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = new Psr18HttpClient(
    new HttpAdapter(new GuzzleClient([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])),
    new MessageFactory
);

$scraper = new Scrapers\FreeProxyListScraper($httpClient);

/** @var Proxy $proxy */
$proxy = $scraper->get()->current();

foreach ($proxy->getMetrics() as $metric) {
    echo $metric->getName() . ': ' . $metric->getValue() . "\n";
}

This will output:

source: Vantoozz\ProxyScraper\Scrapers\FreeProxyListScraper

Note: the examples use Guzzle as the HTTP client.

Testing

Unit tests

./vendor/bin/phpunit --testsuite=unit

Integration tests

./vendor/bin/phpunit --testsuite=integration

System tests

php ./tests/systemTests.php

Upgrade from version 1

The biggest difference from version 1 is that the library now uses a PSR-18 compatible HTTP client instead of an HTTPlug compatible one.

Instead of

$httpClient = new \Vantoozz\ProxyScraper\HttpClient\GuzzleHttpClient(
    new \GuzzleHttp\Client([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])
);

the client should be instantiated like this:

$httpClient = new \Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient(
    new \Http\Adapter\Guzzle6\Client(new \GuzzleHttp\Client([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])),
    new \Http\Message\MessageFactory\GuzzleMessageFactory
);