Downloader Middleware

Learn how to write custom middleware to hook into Roach’s request/response cycle.

Downloader middleware sits between the Roach engine and the Downloader. The Downloader is in charge of handling the HTTP side of things, i.e. sending requests and retrieving responses. Every outgoing request and incoming response will get passed through the downloader middleware.

Writing Middleware

Downloader middleware are classes which implement DownloaderMiddlewareInterface.

interface DownloaderMiddlewareInterface extends
    RequestMiddlewareInterface,
    ResponseMiddlewareInterface
{
}

A downloader middleware deals with both outgoing requests and incoming responses. When writing downloader middleware, however, we are often interested in only one of these concerns, not both. For this reason, Roach splits the two concerns into two separate interfaces: RequestMiddlewareInterface and ResponseMiddlewareInterface, respectively.

The main DownloaderMiddlewareInterface is simply the combination of these two interfaces. This separation allows us to implement only the interface we care about, instead of always having to satisfy both. Behind the scenes, Roach wraps our middleware in an adapter class that provides the missing methods so that our middleware always satisfies the full interface.

Request Middleware

Downloader middleware that process outgoing requests before they get sent need to implement RequestMiddlewareInterface.

interface RequestMiddlewareInterface extends ConfigurableInterface
{
    public function handleRequest(Request $request): Request;
}

This interface defines a single method, handleRequest. This method will get called with the Request that’s about to be sent and is supposed to return another Request object.

Request middleware can be used to apply some transformation to all outgoing requests, like the built-in UserAgentMiddleware.
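As a sketch of such a transformation, here is a hypothetical middleware that attaches an authorization header to every outgoing request. The AuthHeaderMiddleware name and the addHeader method on Request are assumptions for illustration, not part of this page's documented API:

```php
<?php

use RoachPHP\Downloader\Middleware\RequestMiddlewareInterface;
use RoachPHP\Http\Request;
use RoachPHP\Support\Configurable;

class AuthHeaderMiddleware implements RequestMiddlewareInterface
{
    use Configurable;

    public function handleRequest(Request $request): Request
    {
        // Hypothetical example: attach an Authorization header
        // to every outgoing request before it gets sent.
        return $request->addHeader('Authorization', 'Bearer my-token');
    }
}
```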

Note that downloader request middleware gets run after spider request middleware. The exception to this are the initial requests, which don’t get sent through the spider middleware at all.

Dropping Requests

If we don’t want the request to get sent, we can drop it by calling the Request class’s drop method and returning the result.

<?php

use RoachPHP\Downloader\Middleware\RequestMiddlewareInterface;
use RoachPHP\Http\Request;
use RoachPHP\Support\Configurable;

class MyRequestMiddleware implements RequestMiddlewareInterface
{
    use Configurable;

    public function handleRequest(Request $request): Request
    {
        // Make sure to provide a useful reason why this
        // request got dropped.
        return $request->drop('Whoops, did I do that?');
    }
}

Dropping a request prevents any further downloader middleware from running and the request will not get sent. Roach will also fire a RequestDropped event which we can subscribe to in an extension.

Short-Circuiting a Request

Sometimes you may want to simply set the response of a request without the request actually getting sent, e.g. in a caching middleware. In this case you may use the Request::withResponse() method inside your middleware.

<?php

use RoachPHP\Downloader\Middleware\RequestMiddlewareInterface;
use RoachPHP\Http\Request;
use RoachPHP\Support\Configurable;

class CachingMiddleware implements RequestMiddlewareInterface
{
    use Configurable;

    public function handleRequest(Request $request): Request
    {
        // $this->cache stands in for whatever cache
        // implementation our application provides.
        return $request->withResponse(
            $this->cache->forRequest($request)
        );
    }
}

All request and response middleware will still get run for this request. However, the request will not actually get sent and will instead immediately be sent through the processing pipeline. All events still get fired as usual.

Defining Configuration Options

Check out the dedicated page about configuring middleware and extensions to learn how to define configuration options for our middleware.

Response Middleware

Downloader middleware that deal with responses need to implement ResponseMiddlewareInterface.

interface ResponseMiddlewareInterface extends ConfigurableInterface
{
    public function handleResponse(Response $response): Response;
}

This interface defines a single method, handleResponse, which takes a Response object as a parameter and is supposed to return another Response object.

Downloader response middleware gets run immediately after a response is received. This means it gets run before any spider response middleware.

Dropping Responses

To drop a response, we can call the Response class’s drop method and return the result.

<?php

use RoachPHP\Downloader\Middleware\ResponseMiddlewareInterface;
use RoachPHP\Http\Response;
use RoachPHP\Support\Configurable;

class MyResponseMiddleware implements ResponseMiddlewareInterface
{
    use Configurable;

    public function handleResponse(Response $response): Response
    {
        return $response->drop('Responses only get processed during working hours');
    }
}

Dropping a response prevents any further downloader middleware from running and the response will not get passed to the spider. Roach will fire a ResponseDropped event which we can subscribe to in an extension.

Accessing the Request

Every Response stores a reference to its corresponding Request. This allows us to keep track of information across the HTTP boundary. We can access the request through the getRequest method on the Response.

$response->getRequest();
// => RoachPHP\Http\Request {#2666}
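For instance, a response middleware might use the originating request for logging. This is a sketch; the getUri method on the Request is an assumption for illustration:

```php
<?php

use RoachPHP\Downloader\Middleware\ResponseMiddlewareInterface;
use RoachPHP\Http\Response;
use RoachPHP\Support\Configurable;

class LogResponseMiddleware implements ResponseMiddlewareInterface
{
    use Configurable;

    public function handleResponse(Response $response): Response
    {
        // The originating request is available on every response.
        $uri = $response->getRequest()->getUri();
        error_log("Received response for {$uri}");

        return $response;
    }
}
```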

Defining Configuration Options

Check out the dedicated page about configuring middleware and extensions to learn how to define configuration options for our middleware.

Built-in Middleware

Roach ships with various built-in downloader middleware for common tasks when dealing with HTTP requests and responses.

Handling HTTP Errors

The built-in RoachPHP\Downloader\Middleware\HttpErrorMiddleware automatically drops requests with a non-successful HTTP status. According to the HTTP standard, responses with a status in the 2xx range (200-299) are considered successful.

This middleware is enabled by default if your spider extends from BasicSpider.

Configuration Options

| Name | Default | Description |
| --- | --- | --- |
| handleStatus | [] | A list of HTTP statuses outside the 200-299 range that should be handled by your spider. For instance, setting this option to [404] would allow your spider to process 404 responses. |
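Overriding this option in a spider might look as follows. This sketch assumes BasicSpider lives in the RoachPHP\Spider namespace and uses the [class, options] registration format shown elsewhere on this page:

```php
<?php

use RoachPHP\Downloader\Middleware\HttpErrorMiddleware;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    public array $downloaderMiddleware = [
        // Also hand 404 responses to the spider instead of dropping them.
        [HttpErrorMiddleware::class, ['handleStatus' => [404]]],
    ];
}
```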

Setting the User-Agent Header

In order to attach the same User-Agent header to every outgoing request, we can use the RoachPHP\Downloader\Middleware\UserAgentMiddleware middleware.

Configuration Options

| Name | Default | Description |
| --- | --- | --- |
| userAgent | roach-php | The user agent to attach to every outgoing HTTP request. |
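A sketch of registering the middleware with a custom user agent, assuming BasicSpider lives in the RoachPHP\Spider namespace:

```php
<?php

use RoachPHP\Downloader\Middleware\UserAgentMiddleware;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    public array $downloaderMiddleware = [
        // Identify the crawler with a custom user agent string.
        [UserAgentMiddleware::class, ['userAgent' => 'MyBot/1.0']],
    ];
}
```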

Request Deduplication

To avoid sending duplicate requests, we can register the RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware for our spider. Any request to a URL that has already been crawled or scheduled during the same run will be dropped.

Configuration Options

| Name | Default | Description |
| --- | --- | --- |
| ignore_url_fragments | false | Whether or not URL fragments should be ignored when comparing URLs. If set to true, https://url.com/foo#bar will be considered a duplicate of https://url.com/foo. |
| ignore_trailing_slashes | true | Whether or not trailing slashes should be ignored when comparing URLs. |
| ignore_query_string | false | Whether or not the query string should be ignored when comparing URLs. When set to true, the query string is ignored entirely. Even when set to false, two URLs whose query strings contain the same key-value pairs in a different order are considered identical. |

Using this middleware is recommended for most use-cases as it allows you to write your spider in a very naive way without having to worry about bombarding the server with duplicate requests.
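Registering the middleware might look like this sketch, which assumes BasicSpider lives in the RoachPHP\Spider namespace:

```php
<?php

use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    public array $downloaderMiddleware = [
        // Treat URLs that differ only in their fragment as duplicates.
        [RequestDeduplicationMiddleware::class, [
            'ignore_url_fragments' => true,
        ]],
    ];
}
```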

Managing Cookies

Roach can automatically keep track of cookies for us if we register the built-in RoachPHP\Downloader\Middleware\CookieMiddleware. This middleware extracts the Set-Cookie headers from every response and sends the cookies back in subsequent requests, just like a browser does.

Be aware that Roach currently uses a shared cookie jar for all requests and responses of a run. This means having multiple session cookies for the same domain is currently unsupported.
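Since this middleware takes no configuration options, registering it should only require the class name. As a sketch, assuming BasicSpider lives in the RoachPHP\Spider namespace:

```php
<?php

use RoachPHP\Downloader\Middleware\CookieMiddleware;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    public array $downloaderMiddleware = [
        // No configuration options, so the bare class name suffices.
        CookieMiddleware::class,
    ];
}
```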

Respecting Robots.txt

If we want to write a good spider, it’s a good idea to respect the robots.txt directives of the sites we’re crawling. Roach comes with a RoachPHP\Downloader\Middleware\RobotsTxtMiddleware which compares every request against the site’s robots.txt (if there is one) and drops the request if we’re not allowed to index the page.

Since this middleware uses spatie/robots-txt behind the scenes, it will also inspect the page’s meta tags as well as response headers.

Executing Javascript

Many sites don’t directly return the final HTML but depend on some Javascript being run first. To deal with this, Roach includes a RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware we can use in our spider.

This middleware will intercept every response and swap out its body with the body returned after executing Javascript. This means that in our spider, we don’t have to care about whether or not Javascript needed to be run. We can simply write our scraper as if we’re dealing with static HTML.

Prerequisites

In order to use this middleware, we first need to require the spatie/browsershot package in our application.

composer require spatie/browsershot

The middleware uses this package behind the scenes to execute Javascript. Browsershot, in turn, uses Puppeteer, which controls a headless Chrome instance. This means we need to ensure that Puppeteer is installed on our system.

Check out the requirements section of spatie/browsershot for more information on how to install Puppeteer for your system.

Configuration Options

Most of the middleware’s configuration options configure the underlying Browsershot instance. Check out the Browsershot documentation for a more complete description of what each of the configuration values does.

| Name | Default | Description |
| --- | --- | --- |
| chromiumArguments | [] | Custom arguments which will get passed to Chromium. Corresponds to Browsershot::addChromiumArguments. |
| chromePath | null | Custom path to a Chrome or Chromium executable. Will default to the executable installed by Puppeteer. Corresponds to Browsershot::setChromePath. |
| binPath | null | Custom script path that gets executed instead of Browsershot’s default script. Corresponds to Browsershot::setBinPath. |
| nodeModulePath | null | Path to an alternative node_modules folder to use. Corresponds to Browsershot::setNodeModulePath. |
| includePath | null | Overrides the PATH environment variable Browsershot uses to find executables. This is an alternative to specifying paths to the various executables individually. Corresponds to Browsershot::setIncludePath. |
| nodeBinary | null | Custom path to the node executable. Corresponds to Browsershot::setNodeBinary. |
| npmBinary | null | Custom path to the npm executable. Corresponds to Browsershot::setNpmBinary. |
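Registering the middleware with a few of these options might look like this sketch. The file paths are hypothetical placeholders, and the RoachPHP\Spider\BasicSpider import is an assumption:

```php
<?php

use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    public array $downloaderMiddleware = [
        [ExecuteJavascriptMiddleware::class, [
            // Hypothetical paths; adjust for your own environment.
            'chromePath' => '/usr/bin/chromium',
            'nodeBinary' => '/usr/local/bin/node',
        ]],
    ];
}
```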

Proxy Middleware

The ProxyMiddleware allows you to configure HTTP proxies that Roach will use when crawling specific hosts. Under the hood, this will configure the proxy option on the underlying Guzzle request, so see their documentation for a more detailed description of the parameters.

Configuration Options

| Name | Default | Description |
| --- | --- | --- |
| loader | null | The class that is used to load the proxy configuration. By default, this will load the configuration from the array specified in the proxy option. You can implement your own configuration loader by implementing the ConfigurationLoaderInterface, e.g. to load the proxy configurations from a database. |
| proxy | [] | A string or dictionary of hosts and proxy options (see Proxy Options). If a string is provided, the same proxy settings are used for every host. Otherwise, the downloader will check if a proxy was configured for the host of the current request. If no proxy settings exist for the host and no wildcard proxy was configured, the request will be sent without using a proxy. |

Proxy Options

Proxy options simply get passed through to the underlying Guzzle request. See the Guzzle documentation for more detailed information.

use RoachPHP\Downloader\Middleware\ProxyMiddleware;

class MySpider extends BasicSpider
{
    public array $downloaderMiddleware = [
        [ProxyMiddleware::class, [
            'proxy' => [
                'example.com' => [
                    'http' => 'http://localhost:8125', // Use this proxy with "http"
                    'https' => 'http://localhost:9124', // Use this proxy with "https"
                    'no' => ['.mit.edu'], // Don't use a proxy with these
                ],
            ],
        ]],
    ];
}

If the proxy option is set to a string, the same proxy URL will be used for all hosts and protocols.

So this

public array $downloaderMiddleware = [
  [ProxyMiddleware::class, ['proxy' => 'http://localhost:8125']],
];

is equivalent to this.

public array $downloaderMiddleware = [
  [ProxyMiddleware::class, [
      'proxy' => [
         '*' => [
            'http' => 'http://localhost:8125',
            'https' => 'http://localhost:8125',
            'no' => [],
         ],
      ],
  ]],
];

This also works on a per-host basis. This

public array $downloaderMiddleware = [
  [ProxyMiddleware::class, [
      'proxy' => [
         'example.com' => 'http://localhost:8125',
      ],
  ]],
];

is equivalent to this.

public array $downloaderMiddleware = [
  [ProxyMiddleware::class, [
      'proxy' => [
         'example.com' => [
            'http' => 'http://localhost:8125',
            'https' => 'http://localhost:8125',
            'no' => [],
         ]
      ],
  ]],
];