How to use mod_proxy to fetch a document via HTTP?
Web applications sometimes need to fetch another document. If the document is available on
the local machine it can simply be read from the disk. However, if it is on the Web a Perl programmer
tends to create an LWP::UserAgent object to
do it.
Under normal circumstances there is nothing wrong with this approach. Under mod_perl, however,
mod_proxy is perhaps already available in the same process. That means there is already an engine
that can perform HTTP requests. It is written in C and uses the Apache memory management functions, so it will
probably have a smaller memory footprint than a Perl implementation. Wouldn't it be good if one could use it from
a mod_perl handler?
It turns out to be possible. The general idea is to combine a subrequest with an output filter. Let's see how to do that.
Subrequests
If you already know what a subrequest is you can skip this section.
There are several ways to look at a subrequest. One might say it is Apache's version of stat(2) because one can
get a document's metadata. Others look at it as a tool to include the content of another document in the current one.
Mod_autoindex, for example, uses subrequests to look up metadata for each directory element.
Mod_dir uses them to see if a DirectoryIndex is available. Mod_include's
<!--#include virtual="" --> directive uses a subrequest to include the content of
another document in the current one.
A mod_perl handler creates a subrequest this way:
use Apache2::SubRequest ();
my $subr=$r->lookup_uri("/local/uri");
This runs all phases of the HTTP request cycle up to and including the fixup phase. Then control is passed back to the handler and it can inspect the subrequest object. So far no output is written by the subrequest.
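The pause between lookup and run can be used to look at what Apache found out about the target. The following is only a minimal sketch: `subrequest_summary` is a hypothetical helper name, not part of mod_perl. It assumes nothing more than the `status`, `filename` and `content_type` accessors that Apache2::RequestRec provides on the object returned by `lookup_uri`, and it is written as a plain function so it can be tried outside Apache with a stand-in object.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of mod_perl): summarize what the
# lookup phases determined about a subrequest before it is run.
# $subr is expected to be the object returned by $r->lookup_uri(...).
sub subrequest_summary {
    my ($subr) = @_;

    my $file = $subr->filename;        # may be undefined
    my $type = $subr->content_type;    # may be undefined

    return sprintf "status=%d file=%s type=%s",
        $subr->status, $file // '-', $type // '-';
}

# Inside a handler one might write (requires Apache2::SubRequest
# and Apache2::RequestRec to be loaded):
#   my $subr = $r->lookup_uri("/local/uri");
#   warn subrequest_summary($subr);    # ends up in the error log
```

A status other than 200 at this point would suggest that one of the lookup phases already rejected the URI, in which case running the subrequest is probably pointless.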
Eventually, the handler decides to run the subrequest further. In Perl this looks like:
$subr->run;
This statement runs the subrequest's response phase and, unless you take further measures, includes its output in the current output stream.
A replacement for the SSI statement <!--#include virtual="/other/document" --> in Perl
would look like
$r->lookup_uri("/other/document")->run;
An example
Suppose we have a little SSI document called /ssi/t1.html. It contains one line:
>><!--#echo var="DATE_LOCAL" --><<
Don't forget to apply the INCLUDES output filter to the URI:
<Location /ssi/t1.html>
SetOutputFilter INCLUDES
</Location>
Let's see how it works:
$ curl http://localhost/ssi/t1.html
>>Monday, 15-Mar-2010 10:17:46 CET<<
So, the SSI stuff works. Now let's try to include that document in a document generated by a mod_perl handler:
sub handler {
    my ($r)=@_;
    my $subr=$r->lookup_uri("/ssi/t1.html");
    $r->print("BEFORE\n");
    $subr->run;
    $r->print("AFTER\n");
    return Apache2::Const::OK;
}
Try it out:
$ curl http://localhost/test
BEFORE
>>Monday, 15-Mar-2010 10:23:53 CET<<
AFTER
The output filter
But we don't want the content of the other document in our output. Instead we want it as the content of a Perl variable. That's why we need a filter that applies to the subrequest. It should collect the subrequest output and drop it afterwards. A very simple implementation would read:
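use Apache2::RequestRec ();    # $r->filename, $r->status, ...
use Apache2::RequestIO ();     # $r->print
use Apache2::Filter ();
use Apache2::SubRequest ();
use APR::Brigade ();
use APR::Bucket ();
use Apache2::Const -compile=>'OK';
my $content;
sub filter {
    my ($f, $bb) = @_;
    # consume every bucket: append its data to $content, then drop it
    while (my $b = $bb->first) {
        $b->read(my $buf);
        $content.=$buf;
        $b->delete;
    }
    # nothing is passed on to the next filter
    return Apache2::Const::OK;
}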
Error handling is completely omitted here to show the idea. This filter would also collect the HTTP error message if the request fails, which is probably not what you want.
Let's modify our mod_perl handler above to try it out:
my $content;
sub filter {
    my ($f, $bb) = @_;
    while (my $b = $bb->first) {
        $b->read(my $buf);
        $content.=$buf;
        $b->delete;
    }
    return Apache2::Const::OK;
}
sub handler {
    my ($r)=@_;
    my $subr=$r->lookup_uri("/ssi/t1.html");
    $content='';
    $subr->add_output_filter( \&filter );
    $r->print("BEFORE\n");
    $subr->run;
    $r->print("AFTER\n");
    $r->print("content=$content");
    return Apache2::Const::OK;
}
Note, the subrequest is run between printing BEFORE and AFTER. The next line prints
the content of the $content variable.
$ curl http://localhost/test
BEFORE
AFTER
content=>><!--#echo var="DATE_LOCAL" --><<
Oops, what is that? Why do I see the SSI code instead of the date? The point is, mod_include is also implemented
as a filter. So, it depends on the filter order whether it is applied before our filter or after it. But that's another story. I just wanted to
point out that there are pitfalls like that when fetching local URLs this way. Perhaps I'll write another article about that ... some day.
But one thing works for sure. The subrequest output has been removed from the output stream and collected as $content.
Configuring mod_proxy
Let's start with a really simple reverse proxy configuration:
<Location /--/proxy>
ProxyPass http://foertsch.name/ModPerl-Tricks
</Location>
After restarting Apache this page should be available through the proxy. Let's try a HEAD request:
$ curl -I http://localhost/--/proxy/using-modproxy-as-useragent.shtml
HTTP/1.1 200 OK
Date: Mon, 15 Mar 2010 12:29:15 GMT
Server: Apache
Last-Modified: Mon, 15 Mar 2010 12:25:12 GMT
ETag: "89a68-23e5-481d5f8fa1a00"
Accept-Ranges: bytes
Content-Type: text/html; charset=UTF-8
Putting it all together
Now, it's really a small step to fetch the content of http://foertsch.name/ModPerl-Tricks/index.shtml through mod_proxy:
sub handler {
    my ($r)=@_;
    my $subr=$r->lookup_uri("/--/proxy/index.shtml");
    $content='';
    $subr->add_output_filter( \&filter );
    $r->print("BEFORE\n");
    $subr->run;
    $r->print("AFTER\n");
    $r->print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
    $r->print($content);
    $r->print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n");
    return Apache2::Const::OK;
}
The only thing that has really changed is the parameter passed to $r->lookup_uri.
$ curl http://localhost/test
BEFORE
AFTER
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<?xml version="1.0" encoding="UTF-8"?>
[ a lot of lines deleted ]
</html>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
So, the general idea works. But it seems very inconvenient to have to configure a proxy location for each server that I might eventually want to fetch content from.
Accessing an arbitrary URL
But hey, there is a pause between the subrequest creation and its execution. Maybe we can modify it to go to whatever
location we want. First, let's have a look at how mod_proxy sets $r->filename because
that's the key point:
sub handler {
    my ($r)=@_;
    my $subr=$r->lookup_uri("/--/proxy/index.shtml");
    $r->print("subr_filename=".$subr->filename."\n");
    return Apache2::Const::OK;
}
$ curl http://localhost/test
subr_filename=proxy:http://foertsch.name/ModPerl-Tricks/index.shtml
That looks quite promising. Let's try to change that before running the subrequest:
sub handler {
    my ($r)=@_;
    my $subr=$r->lookup_uri("/--/proxy");
    $content='';
    $subr->add_output_filter( \&filter );
    $subr->filename("proxy:http://kabatinte.net/");
    $subr->run;
    $r->print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
    $r->print($content);
    $r->print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n");
    $r->print("subr: status=".$subr->status."\n");
    return Apache2::Const::OK;
}
What is new? First, we set a new filename before running the subrequest, second, the parameter passed to lookup_uri
has changed and third, $subr->status is printed out.
But see what happens. I am calling curl -v to show the HTTP status line and the headers as well:
$ curl http://localhost/test -v
* About to connect() to localhost port 80 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /test HTTP/1.1
> User-Agent: curl/7.19.6 (x86_64-unknown-linux-gnu) libcurl/7.19.6 OpenSSL/0.9.8k zlib/1.2.3 libidn/1.10
> Host: localhost
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Mon, 15 Mar 2010 13:31:41 GMT
< Server: Apache
< Transfer-Encoding: chunked
< Content-Type: text/plain
< X-Pad: avoid browser bug
<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>303 See Other</title>
</head><body>
<h1>See Other</h1>
<p>The document has moved <a href="http://www.kabatinte.net/shop/1.shtml">here</a>.</p>
</body></html>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
subr: status=303
* Connection #0 to host localhost left intact
* Closing connection #0
Oops, at first glance that looks like a redirect. But look a bit closer. The main request's status is 200. Only the
request to http://kabatinte.net results in a 303. Note that the output filter collects the subrequest's
output regardless of its status.
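Since the filter collects error pages and redirect notices just like real content, the caller probably wants to look at `$subr->status` before trusting `$content`. A minimal sketch of such a check follows; `content_if_success` is a hypothetical helper name, and treating exactly the 2xx range as success is an assumption that may not fit every application.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the collected output only if the
# subrequest finished with a 2xx status, otherwise return undef.
sub content_if_success {
    my ($status, $content) = @_;
    return ($status >= 200 && $status < 300) ? $content : undef;
}

# In the handler above, after $subr->run, one might write:
#   my $body = content_if_success($subr->status, $content);
#   defined $body or warn "fetch failed with status ".$subr->status;
```

With the 303 response shown above this would return undef, so the HTML of the redirect notice is not mistaken for the requested document.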
By now we actually have a nice tool. Configure an otherwise unused URI as a reverse proxy and then use this function to get the content of an arbitrary URL:
{
    our $PROXY='/--/proxy';
    my $content;
    sub filter {
        my ($f, $bb) = @_;
        while (my $b = $bb->first) {
            $b->read(my $buf);
            $content.=$buf;
            $b->delete;
        }
        return Apache2::Const::OK;
    }
    sub fetch_url {
        my ($r, $url)=@_;
        my $subr;
        if( $url=~m!^/! ) { # local request
            $subr=$r->lookup_uri($url);
        } else { # remote via mod_proxy
            $subr=$r->lookup_uri($PROXY);
            $subr->filename("proxy:".$url);
        }
        $subr->add_output_filter( \&filter );
        $content='';
        $subr->run;
        return $content;
    }
}
In fact, I have used this approach for years.
Getting rid of /--/proxy
But having to configure a special URI as a proxy only to trigger the mod_proxy handler
feels a bit inconvenient. Recently I had an idea how to get rid of it. Is it possible to turn an arbitrary request
into a proxy request after $r->lookup_uri returns? As it turns out, the answer is yes and it is really simple.
$subr->handler('proxy_server') activates mod_proxy and $subr->proxyreq(2)
makes it a reverse proxy request.
sub handler {
    my ($r)=@_;
    my $subr=$r->lookup_uri("/");
    $content='';
    $subr->add_output_filter( \&filter );
    $subr->proxyreq(2);                             # reverse proxy request
    $subr->filename("proxy:http://kabatinte.net/"); # where to go
    $subr->handler('proxy_server');                 # let mod_proxy handle it
    $subr->run;
    $r->print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
    $r->print($content);
    $r->print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n");
    $r->print("subr: status=".$subr->status."\n");
    return Apache2::Const::OK;
}
Now you can drop that <Location /--/proxy> from the httpd.conf and see if it still works.
ModPerl2::Tools now implements this approach.
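The three calls that turn a subrequest into a proxy request can be bundled into one small function. This is only a sketch: `make_reverse_proxy_request` is a hypothetical name and is not the ModPerl2::Tools interface. The local constant mirrors the value 2 used above, which corresponds to PROXYREQ_REVERSE in httpd.h; under mod_perl the same value should be available as Apache2::Const::PROXYREQ_REVERSE from the `:proxy` constant group.

```perl
use strict;
use warnings;

# Same value as PROXYREQ_REVERSE in httpd.h. Defined locally so the
# sketch does not depend on Apache2::Const being loaded.
use constant PROXYREQ_REVERSE => 2;

# Hypothetical helper: turn an already looked-up subrequest into a
# reverse proxy request for $url. Returns the subrequest so calls
# can be chained.
sub make_reverse_proxy_request {
    my ($subr, $url) = @_;

    $subr->proxyreq(PROXYREQ_REVERSE);   # mark as reverse proxy request
    $subr->filename("proxy:$url");       # tell mod_proxy where to go
    $subr->handler('proxy_server');      # activate mod_proxy's handler

    return $subr;
}

# Inside a handler one might write:
#   my $subr = make_reverse_proxy_request($r->lookup_uri("/"),
#                                         "http://kabatinte.net/");
#   $subr->add_output_filter( \&filter );
#   $subr->run;
```

Keeping the filter setup outside the helper leaves the caller free to decide whether the output should be collected or passed through to the client.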
How about other request methods, POST, HEAD etc?
As long as you don't need to pass a request body they work. Let's try an invalid method:
sub handler {
    my ($r)=@_;
    my $subr=$r->lookup_uri("/");
    $content='';
    $subr->add_output_filter( \&filter );
    $subr->proxyreq(2);
    $subr->filename("proxy:http://kabatinte.net/");
    $subr->handler('proxy_server');
    $subr->method('BLUB');                          # deliberately invalid
    $subr->run;
    $r->print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
    $r->print($content);
    $r->print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n");
    $r->print("subr: status=".$subr->status."\n");
    return Apache2::Const::OK;
}
$ curl http://localhost/test
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>501 Method Not Implemented</title>
</head><body>
<h1>Method Not Implemented</h1>
<p>BLUB to /index.shtml not supported.<br />
</p>
</body></html>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
subr: status=501
The log file entry on kabatinte.net shows a method BLUB with an HTTP status code of 501.
If you need to send a request body, though, it gets complicated. Subrequests don't have a body by definition. I am running
a slightly patched Apache that can pass the request body of the main request to a subrequest. Apache 2.4 will have
mod_request, which can buffer a request body. Perhaps it can also be used to generate custom request
bodies for subrequests.
HTTPS & FTP
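They work, provided the matching proxy modules are available: HTTPS needs mod_ssl with SSLProxyEngine On,
and FTP needs mod_proxy_ftp. Try out https://kabatinte.net/ or
ftp://ftp.suse.com/pub/.
With a more or less up-to-date Apache one can even use CA and client SSL certificates, revocation lists, etc.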
Last updated: 15.03.2010

