A proxy load-balancing server that makes full use of a proxy pool.

huge67/proxy_tower

 
 

proxy_tower

A proxy load-balancing server that lets web crawlers use a proxy pool more effectively

Chinese documentation

Problems it tries to solve:

  1. Free proxies usually have a low success rate
  2. Paid proxies expire at unpredictable times and are hard to use to the full
  3. Invalid proxies keep being used over and over

Note: proxy_tower does not find proxies itself; you supply the proxy sources

Features

  • Multiple forwarding
    • Forward each request to multiple proxies
    • Return the fastest valid response

Multiple forwarding can increase the success rate when using free or unstable proxies
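The idea of multiple forwarding can be illustrated with a minimal asyncio sketch. This is not proxy_tower's actual code: `first_valid`, `fake_proxy` and `demo` are hypothetical names, and the simulated proxies stand in for real upstream requests. It races several fetches and returns the first result that passes a validity check.

```python
import asyncio


async def first_valid(fetchers, is_valid):
    """Start every fetcher at once and return the first result that
    passes is_valid, or None if no result does. (Hypothetical helper.)"""
    tasks = [asyncio.ensure_future(fetch()) for fetch in fetchers]
    try:
        # as_completed yields awaitables in order of completion
        for finished in asyncio.as_completed(tasks):
            try:
                result = await finished
            except Exception:
                continue  # a failing proxy simply drops out of the race
            if is_valid(result):
                return result
        return None
    finally:
        # Stop any requests that are still in flight
        for task in tasks:
            task.cancel()


async def fake_proxy(delay, body):
    """Simulated proxy: waits `delay` seconds, then returns `body`."""
    await asyncio.sleep(delay)
    return body


async def demo():
    fetchers = [
        lambda: fake_proxy(0.01, "captcha page"),     # fastest, but invalid
        lambda: fake_proxy(0.05, "ratingValue 8.7"),  # slower, but valid
        lambda: fake_proxy(0.50, "ratingValue 8.7"),  # cancelled before it finishes
    ]
    return await first_valid(fetchers, lambda body: "ratingValue" in body)
```

In this sketch the fastest response is rejected because it fails the check, so the second proxy's response is the one returned.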

  • Response verification
    • A pattern identifies a family of pages on the target site that share a URL prefix and a similar HTML structure, such as movie.douban.com/subject/ for http://movie.douban.com/subject/6981153/
    • Patterns and verification rules are stored in a prefix tree, which makes it easy and efficient to verify responses from different sites
    • Separate proxy pools for different patterns
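To show how a prefix tree can map a URL to its pattern, here is a small character-level trie. It is a sketch under assumptions, not the project's implementation: `PatternTrie`, `insert` and `longest_match` are hypothetical names, and the real tree may split keys differently.

```python
_RULE = object()  # sentinel key marking "a pattern ends at this node"


class PatternTrie:
    """Hypothetical character-level prefix tree for URL patterns."""

    def __init__(self):
        self.root = {}

    def insert(self, pattern, rule):
        node = self.root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[_RULE] = rule

    def longest_match(self, url):
        """Return the rule of the longest stored pattern that prefixes url."""
        key = url.split("://", 1)[-1]  # drop the scheme, if any
        node, best = self.root, None
        for ch in key:
            if ch not in node:
                break
            node = node[ch]
            if _RULE in node:
                best = node[_RULE]
        return best


trie = PatternTrie()
trie.insert("movie.douban.com/subject/", {"rule": "whitelist", "value": "ratingValue"})
match = trie.longest_match("http://movie.douban.com/subject/6981153/")
miss = trie.longest_match("http://example.com/")
```

Lookup cost depends only on the length of the URL, not on how many patterns are stored, which is why a structure like this suits a crawler that targets many sites.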

Requirements

  • Python >= 3.6
  • redis server

Getting started

  1. pip3 install -r requirements.txt
  2. python3 proxy_entrance.py
  3. python3 bench.py

config.py

from os import getenv

# proxy_tower relies heavily on Redis, which stores proxies and validation rules
redis_host = getenv('redis_host', '127.0.0.1')
redis_port = getenv('redis_port', 6379)
redis_db = getenv('redis_db', 0)
redis_password = getenv('redis_password')
redis_addr = 'redis://{}:{}/{}'.format(redis_host, redis_port, redis_db)

# more options, see config.py

Docker

# With an existing Redis server. Note that 127.0.0.1 will not work as the Redis
# address, because inside the container it refers to the container itself
docker pull worldwonderer/proxy_tower
docker run --env redis_host=<redis-ip> --env redis_port=<6379> --env redis_password=<foobared> -p 8893:8893 worldwonderer/proxy_tower

# Without an existing Redis server
# Highly recommended: this also starts the web UI
cd proxy_tower/
docker-compose up

Response Verification

Two kinds of verification rules are currently supported:

  1. whitelist: if the response contains the specified keywords, it is considered valid
  2. xpath: if the XPath expression extracts the specified value from the response, it is considered valid

import json
import redis

r = redis.StrictRedis()

# whitelist: the response must contain the keyword 'ratingValue'
r.hset("response_check_pattern", "movie.douban.com/subject/", json.dumps({'pattern': 'movie.douban.com/subject/', 'rule': 'whitelist', 'value': 'ratingValue'}))

# xpath: 'rule' holds the XPath expression and 'value' the text it must extract
# (the Chinese heading means "People who like this movie also like").
# This writes the same hash field as the whitelist example, so it replaces that rule.
r.hset("response_check_pattern", "movie.douban.com/subject/", json.dumps({'pattern': 'movie.douban.com/subject/', 'rule': '//*[@id="recommendations"]/h2/i', 'value': '喜欢这部电影的人也喜欢'}))

After configuring the verification rule for the pattern movie.douban.com/subject/, when you crawl pages such as https://movie.douban.com/subject/27119724/, proxy_tower verifies the response content and scores the proxy accordingly
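As a rough illustration of that verify-then-score step, the sketch below applies a whitelist rule stored as JSON and adjusts a score. The function names and the plus or minus one scoring scheme are assumptions made for the example; the README does not describe how proxy_tower actually computes scores, and the xpath rule is left out to keep the sketch dependency-free.

```python
import json


def check_whitelist(rule_json, body):
    """Return True if the response body contains the rule's keyword.
    Assumes 'value' is a single keyword string, as in the example rule."""
    rule = json.loads(rule_json)
    if rule["rule"] != "whitelist":
        raise ValueError("this sketch only handles whitelist rules")
    return rule["value"] in body


def update_score(score, valid, step=1):
    """Hypothetical scoring: reward a valid response, penalise an invalid one."""
    return score + step if valid else score - step


stored = json.dumps({
    "pattern": "movie.douban.com/subject/",
    "rule": "whitelist",
    "value": "ratingValue",
})

good = check_whitelist(stored, '<span property="v:average" class="ratingValue">8.7</span>')
bad = check_whitelist(stored, "<html>Please verify you are human</html>")
score = update_score(update_score(0, good), bad)
```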

Adding proxies

You can add a proxy source in models/proxy.py that loads proxies either from a file or from an API

# File
class ProxyFile(ProxySource):

    def __init__(self, tag, file_path):
        self.file_path = file_path
        self.tag = tag

    async def fetch_proxies(self):
        with open(self.file_path, 'r') as f:
            proxy_candidates = re.findall(self.proxy_pattern, f.read())
            for proxy in proxy_candidates:
                yield Proxy.parse(proxy, tag=self.tag, support_https=True, paid=False)


# API
class ProxyApi(ProxySource):

    def __init__(self, tag, api, valid_time):
        self.api = api
        self.tag = tag
        self.valid_time = valid_time

    async def fetch_proxies(self):
        r = await crawl("GET", self.api)
        text = await r.text()
        proxy_candidates = re.findall(self.proxy_pattern, text)
        for proxy in proxy_candidates:
            yield Proxy.parse(proxy, tag=self.tag, valid_time=self.valid_time, paid=False)

Proxies from different sources have different properties. You can tag each proxy and set its properties when it is created:

  • valid_time
  • support_https
  • paid
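The sketch below shows one way tagged proxies with these properties could be built from raw text. It is a simplified stand-in, not the project's `Proxy.parse`: `SimpleProxy`, `parse_proxies` and the `PROXY_PATTERN` regex are hypothetical, and the unit of `valid_time` is not specified in this README.

```python
import re
from typing import Iterator, NamedTuple, Optional

# Assumed pattern: an IPv4-looking address followed by ":port"
PROXY_PATTERN = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})\b")


class SimpleProxy(NamedTuple):
    """Hypothetical stand-in for the project's Proxy model."""
    host: str
    port: int
    tag: str
    valid_time: Optional[int] = None  # unit is an assumption; not stated in the README
    support_https: bool = False
    paid: bool = False


def parse_proxies(text: str, tag: str, **props) -> Iterator[SimpleProxy]:
    """Yield one SimpleProxy per host:port found in text, all sharing
    the same tag and the same initial properties."""
    for host, port in PROXY_PATTERN.findall(text):
        yield SimpleProxy(host, int(port), tag, **props)


free = list(parse_proxies("1.2.3.4:8080 junk 10.0.0.1:3128", tag="free", support_https=True))
```

Setting the properties once per source keeps every proxy from that source consistent, which matches the per-source `tag` used in the classes above.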

Dashboard

proxy_tower_dashboard

  • Display all proxies and their info
  • View, modify or add patterns
  • A line chart of each pattern's success rate

HTTPS

For sites that require HTTPS, add 'Need-Https': 'yes' to the request headers; proxy_tower will then pick proxies marked support_https

Note: do not use https in the URL, e.g. use http://www.bilibili.com instead of https://www.bilibili.com
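A client-side helper can apply both rules before a request is sent. The sketch below is an assumption-laden example: `prepare_request` is a hypothetical name, and the default address assumes proxy_tower is reachable as an HTTP proxy on local port 8893 (the port published in the Docker command). It only builds the request parameters and sends nothing.

```python
def prepare_request(url, tower="http://127.0.0.1:8893"):
    """Rewrite an https:// URL to http:// and add the Need-Https header,
    returning parameters that most HTTP clients can consume.
    (Hypothetical helper; the tower address is an assumption.)"""
    headers = {}
    if url.startswith("https://"):
        url = "http://" + url[len("https://"):]
        headers["Need-Https"] = "yes"
    return {"url": url, "headers": headers, "proxy": tower}


params = prepare_request("https://www.bilibili.com")
plain = prepare_request("http://example.com/page")
```

The returned `url`, `headers` and `proxy` values can then be handed to whichever HTTP client the crawler already uses.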

Todo

  • Test
  • Support conditional expressions in verification rules

