Kazuho's Weblog: perl

Wednesday, December 27, 2017

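Showing pull request numbers in git blame

When you develop on GitHub with a pull-request-based workflow, you end up wanting git blame to show pull request numbers instead of commit ids while investigating why a piece of code looks the way it does.

So I wrote git-blame-pr.pl.

It prints output like the following, which makes the investigation go much faster.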

$ git-blame-pr.pl lib/core/request.c
(snip)
PR #446   
PR #606   h2o_iovec_t h2o_get_redirect_method(h2o_iovec_t method, int status)
PR #606   {
PR #606       if (h2o_memis(method.base, method.len, H2O_STRLIT("POST")) && !(status == 307 || status == 308))
PR #606           method = h2o_iovec_init(H2O_STRLIT("GET"));
PR #606       return method;
PR #606   }
PR #606   
PR #1436  static void do_push_path(void *_req, const char *path, size_t path_len, int is_critical)
PR #1436  {
PR #1436      h2o_req_t *req = _req;
PR #1436  
PR #1436      if (req->conn->callbacks->push_path != NULL)
PR #1436          req->conn->callbacks->push_path(req, path, path_len, is_critical);
PR #1436  }
PR #1436  
PR #1169  h2o_iovec_t h2o_push_path_in_link_header(h2o_req_t *req, const char *value, size_t value_len)
PR #446   {
PR #1169      h2o_iovec_t ret = h2o_iovec_init(value, value_len);
PR #446   
PR #1436      h2o_extract_push_path_from_link_header(&req->pool, value, value_len, req->path_normalized, req->input.scheme,
PR #1436                                             req->input.authority, req->res_is_delegated ? req->scheme : NULL,
PR #1436                                             req->res_is_delegated ? &req->authority : NULL, do_push_path, req, &ret);
PR #446   
PR #1169      return ret;
PR #446   }
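
The script itself is not shown in this post. As a rough illustration of one way such a mapping could be derived (an assumption, not necessarily what git-blame-pr.pl does), the subject line that GitHub writes for merge commits, "Merge pull request #N from ...", can be parsed to recover the number. The helper name pr_number_from_subject and the sample subjects are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the pull request number from the subject line
# of a GitHub merge commit; returns undef for any other commit subject.
sub pr_number_from_subject {
    my ($subject) = @_;
    return $subject =~ /\AMerge pull request #([0-9]+) from / ? $1 : undef;
}

# Made-up sample subjects, only to show the two outcomes.
for my $subject (
    'Merge pull request #1436 from example-user/example-branch',
    'fix typo',
) {
    my $pr = pr_number_from_subject($subject);
    printf "%-9s %s\n", (defined $pr ? "PR #$pr" : '(none)'), $subject;
}
```

A real tool would still have to work out which merge commit brought each blamed commit into the main branch; that part is not covered here.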

Tuesday, March 24, 2015

released Server::Starter 0.21; no more external dependencies, easy to fat-pack

I am happy to announce the release of Server-Starter version 0.21.

In this release I have removed the dependencies on Perl modules that are not in core (e.g. Proc::Wait3, List::MoreUtils, Scope::Guard).

With no dependencies on XS modules, the module is now easier to install on any system.

The change also makes it possible to fat-pack the start_server script; just run the fatpack-simple script on it.

$ fatpack-simple start_server
-> perl-strip Server/Starter.pm
-> perl-strip Server/Starter/Guard.pm
-> Successfully created start_server.fatpack
$

Have fun!

Tuesday, February 17, 2015

Writing signal-aware waitpid in Perl

As I mentioned in my talk at YAPC::Asia a couple of years ago, Perl's wait functions (e.g. wait, waitpid) do not fail with EINTR when a signal is received.

This is a problem if you want to wait for child processes until a signal arrives. Proc::Wait3 can be a solution, but the module may be hard to install since it is an XS module. It should also be noted that the module provides a replacement for wait only; no workaround exists for waitpid.

So today I scratched my head wondering whether I could come up with a pure-Perl solution, and here it is. The Perl script below launches a worker process (that just sleeps) and waits until the process completes or SIGTERM is received.

use strict;
use warnings;
use Errno ();

our $got_sigterm = 0;
our $sighandler_should_die = 0;

# fork a child process that does the task
my $child_pid = fork;
die "fork failed:$!"
    unless defined $child_pid;
if ($child_pid == 0) {
    # in child process, do something...
    sleep 100;
    exit 0;
}

$SIG{TERM} = sub {
    $got_sigterm = 1;
    die "dying to exit from waitpid"
        if $sighandler_should_die;
};

warn "master process:$$, child process:$child_pid";

# parent process, wait for child exit or SIGTERM
while (! $got_sigterm) {
    if (my_waitpid($child_pid, 0) == $child_pid) {
        # exit the loop if child died
        warn "child process exited";
        $child_pid = -1;
        last;
    }
}

if ($child_pid != -1) {
    warn "got SIGTERM, stopping the child";
    kill 'TERM', $child_pid;
    while (waitpid($child_pid, 0) != $child_pid) {
    }
}

sub my_waitpid {
    my @args = @_;
    local $@;
    my $ret = eval {
        local $sighandler_should_die = 1;
        die "exit from eval"
            if $got_sigterm;
        waitpid($args[0], $args[1]);
    };
    if ($@) {
        $ret = -1;
        $! = Errno::EINTR;
    }
    return $ret;
}

The trick is that waitpid is wrapped in an eval within the my_waitpid function, and the signal handler calls die to exit the eval if the $sighandler_should_die flag is set. It is also essential to check the $got_sigterm flag within the eval block after setting the $sighandler_should_die flag. Otherwise there would be a race condition: a SIGTERM delivered between the loop's check of $got_sigterm and the setting of $sighandler_should_die would only set the flag, and waitpid would then block without being interrupted.

With these tricks it is now possible to implement process managers in pure Perl!

Tuesday, July 1, 2014

The JSON SQL Injection Vulnerability

tl;dr

Many SQL query builders written in Perl do not provide mitigation against the JSON SQL injection vulnerability.

Developers should either type-check the input values taken from JSON (or any other hierarchical data structure) before passing them to the query builders or, better, consider migrating to query builders whose API is immune to such a vulnerability.

Note: an explanation in Japanese by the discoverer of the issue is available here.

Background

Traditionally, web applications have been designed to take HTML FORMs as their input. But today, more and more web applications receive their input as JSON or other kinds of hierarchically structured data containers, thanks to the popularization of XMLHttpRequest and smartphone apps.

Designed in the old days, a number of Perl modules including SQL::Maker use unblessed references to define SQL expressions. The following example illustrates how operators are specified in user code. The first query consists of an equality match. The second is generated by using a hashref to specify the comparison operator.

use SQL::Maker;
my $maker = SQL::Maker->new(…);

# generates: SELECT * FROM `user` WHERE `name`=?
$maker->select('user', ['*'], {name => $query->param('name')}); 

# generates: SELECT * FROM `fruit` WHERE `price`<=?
$maker->select('fruit', ['*'], {price => {'<=', $query->param('max_price')}});

This approach raised no security concerns at the time it was invented, when the only source of input was HTML FORMs, since form input is represented as a set of key-value pairs in which all values are scalars. In other words, it is impossible to inject SQL expressions via HTML FORMs because the query parser guarantees that the value of a parameter foo (i.e. $query->param('foo')) is never a hashref.

JSON SQL Injection

But the story has changed with JSON. JSON objects are represented as hashrefs in Perl, and thus similar code receiving JSON is vulnerable to SQL operator injection.

Consider the code below.

use SQL::Maker;
my $maker = SQL::Maker->new(…);

# find a user with the given name
$maker->select('user', ['*'], {name => $json->{'name'}});

The intention of the developer is to execute an SQL query that fetches the user information using an equality match. If the input is {"name": "John Doe"}, the condition of the generated query would be name='John Doe', and the row for the specified person would be returned.

But what happens if the name field of the JSON is an object? If the supplied input is {"name": {"!=": ""}}, the query condition becomes name!='' and the database will return all rows with non-empty names. Technically speaking, SQL::Maker accepts any string supplied as the key as the operator to be used (i.e. there is no whitelisting), so the attack is not limited to changing the operator. (EDIT: Jun 3 2014)

A similar problem exists with the handling of JSON arrays; if the name field of the JSON is an array, the IN operator would be used instead of the intended = operator.

In other words, the code snippet contains an operator injection vulnerability, referred to hereafter as JSON SQL injection. The possibility of an attacker changing the operator may not seem like a high-risk issue, but there are scenarios in which an unexpected result set leads to unintended information disclosure or other hazardous behavior of the application.
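
The mechanism can be reproduced without a database. The following is a minimal sketch using a toy condition builder, build_condition, which is a hypothetical stand-in that only mimics the "scalar means =, hashref means operator, arrayref means IN" convention described above; it is not SQL::Maker's actual code. JSON::PP is used for decoding because it ships with Perl.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Toy condition builder (NOT SQL::Maker's implementation): the type of the
# value decides which operator ends up in the generated condition.
sub build_condition {
    my ($column, $value) = @_;
    if (ref $value eq 'HASH') {
        # The hash key is used as the operator without any whitelisting.
        my ($op, $operand) = %$value;
        return ("`$column` $op ?", $operand);
    }
    if (ref $value eq 'ARRAY') {
        my $placeholders = join ',', ('?') x @$value;
        return ("`$column` IN ($placeholders)", @$value);
    }
    return ("`$column` = ?", $value);
}

for my $input (
    '{"name": "John Doe"}',
    '{"name": {"!=": ""}}',
    '{"name": ["alice", "bob"]}',
) {
    my $json = decode_json($input);
    my ($condition, @bind) = build_condition(name => $json->{name});
    print "$input => $condition\n";
}
```

The same line of application code yields an equality match, a != comparison, or an IN list depending solely on what the client sent.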

To prevent such attacks, application developers should either assert that the values are not references (which represent arrays/hashes in JSON), or forcibly convert the values to scalars as shown in the snippet below.

use SQL::Maker;
my $maker = SQL::Maker->new(…);

# find a user with the given argument, forcibly converted to a string
$maker->select('user', ['*'], {name => $json->{'name'} . ''});
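
The other option mentioned above, asserting the type, could look like the following sketch. The helper assert_scalar is a hypothetical name introduced for illustration, and rejecting undefined values is an extra assumption that may or may not suit a given application.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value unchanged if it is a defined,
# non-reference scalar; die otherwise so the request can be rejected.
sub assert_scalar {
    my ($name, $value) = @_;
    die "parameter '$name' is missing\n"
        unless defined $value;
    die "parameter '$name' must be a scalar, got a " . ref($value) . " reference\n"
        if ref $value;
    return $value;
}

# A plain string passes through, a hashref (JSON object) is refused.
for my $value ('John Doe', { '!=' => '' }) {
    my $checked = eval { assert_scalar(name => $value) };
    print defined $checked ? "accepted: $checked\n" : "rejected: $@";
}
```

Unlike the string conversion, this approach refuses unexpected input instead of silently turning a reference into a string such as HASH(0x...).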

Programmers Deserve a Better API

As explained, the programming interface provided by SQL builders including SQL::Maker behaves as specified, and thus it is the responsibility of the users to ensure that the supplied data has the correct types.

But it should also be said that the programming interface is now inadequate in the sense that it is prone to the vulnerability. It would be better if we had a safer way to build SQL queries.

To serve that purpose, we have done two things:

developed SQL::QueryMaker
introduced strict mode to SQL::Maker

SQL::QueryMaker and the Strict Mode of SQL::Maker

SQL::QueryMaker is a module that we have developed and released just recently. It is not a fully-featured query builder but a module that concentrates on building query conditions. Instead of using unblessed references, the module uses blessed references (i.e. objects) to represent SQL expressions, and exports global functions for creating such objects. Such objects are accepted by the most recent versions of SQL::Maker as query conditions.

Besides that, we have also introduced strict mode to SQL::Maker. When operating under strict mode, SQL::Maker will not accept unblessed references as its arguments, so it is impossible for attackers to inject SQL operators even if the application developers forgot to type-check the supplied JSON values.

The two together provide an interface immune to JSON SQL injection. The code snippet shown below is an example using these features. Please consult the documentation of the modules for more detail.

use SQL::Maker;
use SQL::QueryMaker;

my $maker = SQL::Maker->new(
   …,
   strict => 1,
);

# generates: SELECT * FROM `user` WHERE `name`=?
$maker->select('user', ['*'], {name => $json->{'name'}});

# generates: SELECT * FROM `fruit` WHERE `price`<=?
$maker->select('fruit', ['*'], {price => sql_le($json->{'max_price'})});
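
The idea behind strict mode, accepting plain scalars and blessed expression objects while refusing unblessed references, can be sketched in a few lines. This is a toy check written for illustration under that assumption; check_strict and the My::Expr class are hypothetical names and do not reproduce the actual code of SQL::Maker or SQL::QueryMaker.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Toy strict-mode check: anything that is a reference but not an object
# could have come straight from decoded JSON, so it is refused.
sub check_strict {
    my ($value) = @_;
    die "unblessed reference is not allowed in strict mode\n"
        if ref($value) && !blessed($value);
    return $value;
}

# Hypothetical expression object standing in for what a function such as
# sql_le() would return.
my $expr = bless { op => '<=', bind => [100] }, 'My::Expr';

for my $value ('John Doe', $expr, { '!=' => '' }, ['a', 'b']) {
    my $verdict = eval { check_strict($value); 1 } ? 'accepted' : 'rejected';
    printf "%-8s %s\n", (ref($value) || 'scalar'), $verdict;
}
```

Because JSON decoders produce only unblessed hashrefs and arrayrefs for objects and arrays, an attacker cannot forge a value that passes as an expression object.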

A Similar Problem May Exist in Other Languages / Libraries

I would not be surprised if the same weakness exists in other Perl modules or in similar libraries available for other programming languages, since from the programmer's standpoint it seems natural to change the behaviour of the match condition based on the type of the supplied value.

Generally speaking, application developers should not expect that a value within JSON is of a certain type. You should always check the type before using it. OTOH we believe that library developers should provide a programming interface immune to vulnerabilities, as we have done in the case of SQL::Maker and SQL::QueryMaker.

Note: the issue was originally reported by Mr. Toshiharu Sugiyama, my colleague working at DeNA Co., Ltd.

Tuesday, April 22, 2014

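[perl][memo] File::Temp bad know-how

■ Summary

A temporary file created by tempfile(...) may be flocked in some environments
tempfile(UNLINK => 1) retains a reference to the temporary file's handle
In other words, when UNLINK is specified, the automatic close that relies on reference counting does not happen, so you need to close the handle explicitly
Also, in some cases the file cannot be flocked until it has been closed explicitly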
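
A minimal sketch of the explicit close summarized above, using only File::Temp and Fcntl from core. The scratch data and variable names are illustrative, and the reopen-and-lock step is just one way to confirm that the file can be locked afterwards.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Temp qw(tempfile);

# UNLINK => 1 asks File::Temp to remove the file when the program exits.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "scratch data\n";

# Close explicitly instead of relying on $fh going out of scope.
close($fh)
    or die "failed to close $filename: $!";

# The file still exists at this point, so it can be reopened and locked.
open(my $lock_fh, '+<', $filename)
    or die "failed to open $filename: $!";
my $locked = flock($lock_fh, LOCK_EX | LOCK_NB);
print $locked ? "locked $filename\n" : "could not lock $filename: $!\n";
close($lock_fh);
```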

■ Log

16:23:30 <kazuho_> Hmm, perl automatically closes a file handle when its refcnt drops to zero, right?
16:23:43 <tokuhirom> Yes, it does
16:23:48 <tokuhirom> It would be weird if it didn't lol
16:32:33 <kazuho_> https://gist.github.com/kazuho/11168660
16:32:37 <kazuho_> This is what was going on
16:32:53 <tokuhirom> Ah, that one.
16:33:01 <tokuhirom> File::Temp does all sorts of messy things under the hood
16:42:37 <kazuho_> on linux, perl -MFile::Temp=tempfile -e '(undef, my $fn) = tempfile(UNLINK => 1); sleep 100'
16:42:47 <kazuho_> even with that, the temporary file stays open
16:49:41 <kazuho_> so a function called _deferred_unlink keeps holding on to $fh > File::Temp
16:50:50 <tokuhirom> Setting UNLINK => 1 makes the file take forever to actually get unlinked, that's pretty awful lol
16:51:16 <kazuho_> Or rather,
16:51:22 <kazuho_> > # close the filehandle without checking its state
16:51:23 <kazuho_> > # in order to make real sure that this is closed
16:51:30 <kazuho_> it keeps holding $fh for that reason, so
16:51:38 <kazuho_> it doesn't get closed automatically even when the refcnt goes down
16:52:13 <tokuhirom> I feel like I once saw people get around it with UNLINK => 0
16:52:27 <kazuho_> But then it won't delete the file automatically lol
16:52:33 <kazuho_> I'll close it explicitly...
16:53:03 <kazuho_> I'm trying to wrap select in Starlet with an flock on a temporary file, but
16:53:15 <kazuho_> on osx tempfile specifies EXLOCK
16:53:23 <kazuho_> → with UNLINK set, the file stays open
16:53:28 <kazuho_> → can't lock it!!!
16:53:34 <kazuho_> that's the problem, so
16:54:01 <tokuhirom> So closing explicitly is the right answer
16:54:07 <kazuho_> Right
16:54:09 <kazuho_> Well, I do think it's a bug that the file stays open when UNLINK => 1 is set
16:54:15 <kazuho_> You'll run out of file descriptors!!
16:54:52 <kazuho_> Well, I guess in that case the answer is to use tempdir and create the files in it by hand
16:54:53 <tokuhirom> But it doesn't look like that can be changed at this point lol
16:54:57 <kazuho_> There's probably no way to change it
16:54:58 <tokuhirom> ah-
16:55:40 <kazuho_> Thanks, thanks

Friday, April 11, 2014

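[memo] Revisiting the thundering herd problem on the occasion of the Starlet 0.22 release

@takezawa sent me a PR that lets Starlet, a Perl-based web application server, listen on multiple ports, and I merged it. Hooray!

In the process I revisited the thundering herd problem of prefork TCP servers, so here are my notes. For background on the thundering herd problem, see "prefork サーバーと thundering herd 問題 - naoyaのはてなダイアリー" and "Starman と Starlet のベンチマークと Accept Serialization - Hateburo: kazeburo hatenablog".

First, I wrote this test script: thundering-herd.pl

By running it as described in its comments, you can check whether the thundering herd problem exists in two kinds of cases.

Using it, I reached the following conclusions.

The thundering herd problem caused by calling accept(2) can be considered a thing of the past in many environments
The thundering herd problem still exists when connections are awaited with select(2) and accept(2) is called afterwards (which is only natural)

That said, I have only tested on linux 2.6.32 and OS X 10.8.5, so please let me know if you have anything to add m(__)m

Also, one thing to watch out for when writing a TCP server that bind(2)s to multiple ports: when a prefork server establishes connections by calling select(2) and then accept(2), the sockets must be put in non-blocking mode. Otherwise a worker that loses the race for a connection blocks in accept(2), and you may end up unable to connect even though idle worker processes exist.

I had forgotten to review this point until I wrote the test script above, but when I checked, takezawa's code handled it properly, which rather impressed me. Thank you for the great work.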
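
The select(2) then accept(2) pattern with non-blocking listeners can be sketched as follows. This is a single-process illustration under stated assumptions, not Starlet's code: the two loopback listeners on OS-assigned ports, the accept_one helper, and the test client are all made up for the example.

```perl
use strict;
use warnings;
use IO::Select;
use IO::Socket::INET;

# Two listeners on OS-assigned loopback ports stand in for the server's ports.
my @listeners;
for (1 .. 2) {
    my $sock = IO::Socket::INET->new(
        Listen    => 5,
        LocalAddr => '127.0.0.1',
        LocalPort => 0,
        Proto     => 'tcp',
    ) or die "failed to listen: $!";
    # Non-blocking: accept must not block if another worker wins the race.
    $sock->blocking(0);
    push @listeners, $sock;
}
my $select = IO::Select->new(@listeners);

# Hypothetical worker step: wait up to $timeout seconds, then try to accept
# from each listener that select reported as readable.
sub accept_one {
    my ($select, $timeout) = @_;
    for my $listener ($select->can_read($timeout)) {
        my $conn = $listener->accept;
        return $conn if $conn;
        # Someone else already took the connection; try the next listener.
        next if $!{EAGAIN} || $!{EWOULDBLOCK};
        die "accept failed: $!";
    }
    return;    # nothing accepted in this round
}

# Connect a client to the second port and let the "worker" pick it up.
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $listeners[1]->sockport,
    Proto    => 'tcp',
) or die "failed to connect: $!";
my $conn = accept_one($select, 1);
print $conn ? "accepted on port " . $conn->sockport . "\n" : "no connection\n";
```

With blocking listeners, a worker that lost the race on one port would sit in accept and could not serve a connection arriving on the other port.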

Tuesday, February 25, 2014

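A study of "rational" security measures for web applications

※※※ An informal essay by someone who works on web-related research and development, mainly middleware ※※※

Web vulnerabilities can be broadly divided into two kinds: those caused by bugs in web applications and those caused by bugs in web browsers.

Those whose job is to develop or provide web applications are required to minimize the former, that is, vulnerabilities caused by bugs in web applications^Note 1.

That said, it is hard to guarantee the absence of vulnerabilities. The realistic option is to adopt a design in which a bug in the web application does not become a vulnerability (or in which the damage is limited).

Consider memory isolation between processes and userID-based file access control in an OS. Because the OS enforces security, malicious code executed on the application side does not become a vulnerability, or its impact can be kept small.

There are many similar examples in web technology. For XSS countermeasures, for instance, we can list

Content Security Policy

A feature that restricts which scripts may be executed

HttpOnly

Cookies that cannot be accessed from scripts

Template engines with automatic escaping

Xslate etc.^Note 2

and so on. Likewise, for SQL injection countermeasures,

(Some kinds of) 2-way SQL

SQL is written in external files, and the program can only manipulate the parameters to be bound
Reference: 外だしSQLの使い方 | DBFlute

Restricting access methods with stored procedures

Reference: パスワードが漏洩しないウェブアプリの作り方〜ソルトつきハッシュで満足する前に考えるべきこと

are among the known techniques.

By choosing one of these measures, or combining them, you can build applications in which vulnerabilities do not surface (or are unlikely to surface) even when coding mistakes exist.

However, technologies of this kind, to a greater or lesser degree, impose restrictions on the application code.

For example, Content Security Policy makes it hard to run inline <SCRIPT> tags (expected to be addressed in 1.1), and there are probably many web applications for which none of the SQL injection countermeasures listed above is practical. There are also problems that are inherently hard^Note 3 to address, such as information leaks caused by a missing condition in an SQL WHERE clause.

As shown above, the general computer-science approach of raising resistance to vulnerabilities by using a common module (or a lower layer) that enforces how access is made is also effective in web technology and should be used actively^Note 4. On the other hand, as mentioned, there are areas where further progress is awaited^Note 5.

Note 1: The latter is primarily a problem for web browser vendors to address. Of course, there are cases where a mitigation should be implemented on the web application side if that is possible

Note 2: I do not know the current state of template engines well, so I will not enumerate them. DOM API based approaches are also omitted from this article.

Note 3: Because web applications are often required to express access restriction and access in a single query. This might be solvable by an approach such as an ORM that introduces the concept of capabilities, but...

Note 4: "IPA 独立行政法人情報処理推進機構：安全なウェブサイトの作り方" classifies vulnerabilities into nine types, and most of them can be avoided by using common code (web application frameworks, libraries, etc.) with which "it does not matter if the application programmer makes a mistake"; such implementations deserve to be praised

Note 5: Which is why I find it interesting as a research topic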

Wednesday, December 18, 2013

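On the need for regular expression literals in programming languages

This is a summary of what I wrote on Twitter.

Using JavaScript as an example, I explain the pros and cons of providing regular expression literals as part of a programming language's specification.

■ More concise code

Needless to say, code is more concise with regular expression literals.

(new RegExp("abc")).exec(s)  // without a literal
/abc/.exec(s)                // with a literal

Also, without regular expression literals you need to escape twice, once for the string literal and once for the regular expression, which hurts the maintainability of the code^Note 1.

new RegExp("\\\\n");  // without a literal
/\\n/                 // with a literal

■ When errors are detected

Without regular expression literals, mistakes in a regular expression cannot be detected until it is actually evaluated. With regular expression literals, errors can be detected at compile time^Note 2, which improves development efficiency and quality.

new RegExp("(abc")  // runtime exception
/(abc/              // error detected at compile (startup) time

■ Execution speed

Without regular expression literals, people tend to write code that compiles the regular expression every time it is applied. This degrades execution speed considerably. With regular expression literals, the language implementation can compile the regular expression once and reuse it, so execution is faster.

new RegExp("abc").exec(s) // the regular expression is compiled on every execution
/abc/.exec(s)             // the regular expression is compiled once, regardless of the number of executions

You sometimes see the opinion that regular expressions are "probably slower than simple string functions", but such a generalization is wrong^Note 3. For a speed comparison on JavaScript engines, for example, see regexp-indexof・jsPerf. You will find that in some web browsers the regular expression is faster than String#indexOf.

■ A simpler and more powerful string API

From the three points above, I think it is clear that, as long as regular expressions are going to be used, adopting regular expression literals improves the productivity of the language's users.

The remaining question is whether adopting regular expression literals makes the programming language more complicated and harder for its users to work with.

On this point the following trade-off exists.

Programming languages such as Python and PHP, which do not actively afford the use of regular expressions, have many string functions. Users are required either to memorize the specifications of these functions or to consult the documentation each time.

In contrast, languages that provide regular expression literals, such as JavaScript and Perl, have relatively few string functions, because programmers can easily write the processing they need with regular expressions.

Regular expressions also make it easy to write processing that combines several tests, such as the one below. With the string-function approach, you have no choice but to combine several functions to do the same.

/^\s*abc/   // leading whitespace, followed by abc

■ Summary

As described above, introducing regular expression literals into the language specification lets you provide a simpler and more powerful language, in exchange for forcing programmers to learn regular expressions^Note 4.

Those are roughly the points to consider when deciding whether to adopt regular expression literals while developing a language implementation. The rest depends on what the language is for and what kind of developers will use it, and comes down to the implementers' sense of balance.

■ Addendum (added 2013/12/19):

The pros and cons of introducing regular expression literals are as above; here I add my thoughts on whether regular expressions should be recommended for string processing.

From the standpoint of designing a programming language specification, there is no general answer to the question of which is better as a means of string processing: recommending string functions or recommending regular expressions.

For example, the trim function found in PHP and other languages (which removes whitespace from both ends of a string) is more concise than the equivalent regular expression (s.replace(/^\s*(.*?)\s*$/, "$1")). Providing well-named functions for frequently occurring string operations in order to improve readability is a reasonable approach.

The opposite case looks like this.

When implementing string processing, sometimes you want to remove one (and only one) newline from the end of a string, and sometimes you want to remove all trailing newlines. With regular expressions it is easy to meet these needs (/\n$/ or /\n*$/), whereas it is not realistic for a language implementation to provide a string function for each of them.

Or, to check that only appropriate characters are used in a phone number input form, a regular expression (/^[0-9\-]+$/) is a better fit than string functions.

Thus, compared with string functions, the regular expression approach tends to be less readable for simple processing, but is superior in that processing with many variations can be written in a uniform way.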
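
For comparison, the three patterns from the addendum can be written in Perl roughly as follows. This is a sketch with hypothetical helper names; it uses \A and \z instead of ^ and $ because Perl's $ also matches just before a final newline, which would let "03-1234\n" pass the phone number check.

```perl
use strict;
use warnings;

# Remove exactly one trailing newline, if there is one.
sub strip_one_newline {
    my ($s) = @_;
    $s =~ s/\n\z//;
    return $s;
}

# Remove every trailing newline.
sub strip_all_newlines {
    my ($s) = @_;
    $s =~ s/\n+\z//;
    return $s;
}

# True if the string consists only of digits and hyphens.
sub looks_like_phone_number {
    my ($s) = @_;
    return $s =~ /\A[0-9-]+\z/ ? 1 : 0;
}

printf "%d newline(s) left\n", (strip_one_newline("abc\n\n") =~ tr/\n//);
printf "%d newline(s) left\n", (strip_all_newlines("abc\n\n") =~ tr/\n//);
print looks_like_phone_number('03-1234-5678') ? "ok\n" : "not ok\n";
```

All three variations share the same syntax, which is the uniformity the paragraph above refers to.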

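Note 1: This is not a problem in languages that can avoid the collision of escape sequences (e.g. Perl) or that have raw string literals (e.g. Python)
Note 2: At startup, for many interpreter-based implementations
Note 3: Whether it is program source code or a regular expression, it is executed on a VM after compilation (in some cases after being compiled to machine code), so any speed difference should be understood as a matter of how good that translation is
Note 4: Some may dislike regular expressions because there are too few constraints on how they can be used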

Saturday, October 15, 2011

Unix Programming with Perl 2 (my slides at YAPC::Asia 2011)

Today's slides are here:

Unix Programming with Perl 2

Last year's slides are here:

Unix Programming with Perl

Thursday, April 28, 2011

Zero-downtime operation of web applications using Server::Starter, Parallel::Prefork, Starlet (SoozyConference 7 presentation slides)

These are the slides from my presentation at SoozyConference 7, held in January.

Webアプリケーションの無停止稼働

Friday, February 4, 2011

5x performance - switching from LWP to Furl & Net::DNS::Lite

Recently I rewrote some of our code that used LWP::UserAgent to use Furl instead, and have observed a more than 5x increase in performance (the average CPU time spent per HTTP request has dropped by 82%).

This clearly shows that if you are having performance issues with LWP::UserAgent, it is a good idea to switch to Furl. Here are my recommendations when doing so:

use the low-level interface (Furl::HTTP)

The OO-interface provided by Furl.pm is not as fast as the low level interface, though it is still about twice as fast as LWP::UserAgent.

use Net::DNS::Lite for accurate timeouts

The timeout parameter of LWP::UserAgent is not accurate. The module may wait longer than the specified timeout while looking up hostnames. Furl provides a callback to use non-default hostname lookup functions with support for timeouts, and Net::DNS::Lite can be used for the purpose.

use Cache::LRU to cache DNS queries

In general DNS queries are lightweight, but caching the responses can still have a positive effect (if the cache is very fast). Cache::LRU is suitable for such a use case.

The code snippet below illustrates how to set up a Furl::HTTP object with the described configuration.

use Cache::LRU;
use Furl;
use Net::DNS::Lite;

# setup cache for Net::DNS::Lite
$Net::DNS::Lite::CACHE = Cache::LRU->new(
    size => 256,
);

# create HTTP object...
my $furl = Furl::HTTP->new(
    inet_aton => \&Net::DNS::Lite::inet_aton,
    timeout   => $timeout_in_seconds,
    ...
);

Monday, November 29, 2010

Cache::LRU (a handy and fast in-memory cache module in pure-perl)

Last week, after not being able to find a handy and fast cache module, I decided to write one myself, and the outcome is Cache::LRU.

Cache::LRU is an in-memory cache module written in pure Perl. It has no dependencies, and the code is less than 100 lines long. Yet it is faster than other modules, as the results of a primitive benchmark show (note: Cache::FastMmap is a shared-memory cache, and the overhead of inter-process communication needs to be taken into consideration).

$ perl -Ilib benchmark/simple.pl
cache_hit:
                          Rate Cache::Ref::LRU (List) Cache::FastMmap Cache::Ref::LRU (Array) Cache::FastMmap (raw) Tie::Cache::LRU Cache::LRU
Cache::Ref::LRU (List)  29.1/s                     --            -14%                    -31%                  -37%            -72%       -80%
Cache::FastMmap         33.9/s                    17%              --                    -20%                  -27%            -68%       -77%
Cache::Ref::LRU (Array) 42.3/s                    45%             25%                      --                   -9%            -60%       -71%
Cache::FastMmap (raw)   46.4/s                    60%             37%                     10%                    --            -56%       -69%
Tie::Cache::LRU          105/s                   260%            209%                    147%                  125%              --       -29%
Cache::LRU               148/s                   408%            336%                    249%                  218%             41%         --

cache_set:
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
                       s/iter Cache::Ref::LRU (List) Tie::Cache::LRU  Cache::LRU
Cache::Ref::LRU (List)   3.02                     --            -43%        -70%
Tie::Cache::LRU          1.71                    77%              --        -47%
Cache::LRU              0.910                   232%             88%          --
$ uname -a
Darwin ******** 10.5.0 Darwin Kernel Version 10.5.0: Fri Nov  5 23:20:39 PDT 2010; root:xnu-1504.9.17~1/RELEASE_I386 i386