Spiders
Define how websites get crawled and how data is scraped from their pages.
Spiders are classes which define how a website will get processed. This includes both crawling for links and extracting data from specific pages (scraping).
Example spider
It's easiest to explain all the different parts of a spider by looking at an example. Here's a spider that extracts the title and subtitle of all pages of this very documentation.
<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class RoachDocsSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://roach-php.dev/docs/spiders',
    ];

    public function parse(Response $response): \Generator
    {
        $title = $response->filter('h1')->text();

        $subtitle = $response
            ->filter('main > div:nth-child(2) p:first-of-type')
            ->text();

        yield $this->item([
            'title' => $title,
            'subtitle' => $subtitle,
        ]);
    }
}
Here’s how this spider will be processed:
- Roach starts by sending requests to all URLs defined inside the $startUrls property of the spider. In our case, there's only the single URL https://roach-php.dev/docs/spiders.
- The response of each request gets passed to the parse method of the spider.
- Inside the parse method, we filter the response using CSS selectors to extract both the title and subtitle. Check out the page on scraping responses for more information.
- We then yield an item from our method by calling $this->item(...) and passing in an array of our data.
- The item will then get sent through the item processing pipeline.
- Since there are no further requests to be sent, the spider closes.
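Since parse is a generator, it isn't limited to yielding a single item. Here's a hedged sketch (the URL and selector below are placeholders, not part of the example above) showing a single parse method that both emits an item and queues a follow-up request using the $this->request() helper covered further down this page.

<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class PaginatedSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        // Hypothetical listing page, used only for illustration.
        'https://example.com/articles',
    ];

    public function parse(Response $response): \Generator
    {
        // Emit an item scraped from the current page.
        yield $this->item([
            'title' => $response->filter('h1')->text(),
        ]);

        // Queue a follow-up request. Since we don't specify a
        // different parse method, its response will be handled
        // by this `parse` method again.
        yield $this->request('GET', 'https://example.com/articles?page=2');
    }
}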
Generating the initial requests
When Roach starts a run of your spider, it first generates the initial requests from the spider’s starting URLs. These URLs are often referred to as seeds. There are several different ways you can define the starting URLs for a spider.
Static URLs
The most straightforward way of specifying the starting URLs for a spider is via the $startUrls property.
<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://roach-php.dev/docs/spiders',
    ];

    public function parse(Response $response): \Generator
    {
        /* ... */
    }
}
Roach will send a request to each URL defined in this array and call the parse method for each respective response.
Manually Creating the Request Objects
While using the $startUrls property is very convenient, it makes a few assumptions:
- it assumes that all requests are to be sent as GET requests,
- it assumes that the initial requests will get processed by the parse method of our spider.
Since $startUrls is a property, we can only define static values in it. We can't add dynamic parts to the URLs, like the current date, for example.
If we need complete control over the requests that get created, we can instead override the initialRequests method on the spider.
<?php

use DateTime;
use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    public function parse(Response $response): \Generator
    {
        /* ... */
    }

    /** @return Request[] */
    protected function initialRequests(): array
    {
        $yesterday = (new DateTime('yesterday'))->format('Y/m/d');

        return [
            new Request(
                'GET',
                "https://fussballdaten.de/kalender/{$yesterday}",
                [$this, 'parse']
            ),
        ];
    }
}
The initialRequests method needs to return an array of Request objects. Since we now instantiate the Request objects directly, we are free to configure these requests however we want. In the example above, we're setting the start URL to a dynamic value based on the current date.
The Request class has the following constructor:

Request::__construct(
    string $method,
    string $uri,
    callable $parseMethod,
    array $options = [],
);
The $options parameter takes an array of Guzzle request options, which allows us to configure the underlying Guzzle request directly if that's the flexibility we need.
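As an illustration, here's a sketch of a spider whose initial request sends a custom Accept header and a query string. The URL is a placeholder; 'headers' and 'query' are standard Guzzle request options.

<?php

use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class ApiSpider extends BasicSpider
{
    public function parse(Response $response): \Generator
    {
        /* ... */
    }

    /** @return Request[] */
    protected function initialRequests(): array
    {
        return [
            new Request(
                'GET',
                // Placeholder URL used only for illustration.
                'https://example.com/search',
                [$this, 'parse'],
                [
                    // Standard Guzzle request options.
                    'headers' => ['Accept' => 'application/json'],
                    'query' => ['page' => 1],
                ],
            ),
        ];
    }
}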
Using the initialRequests method, we can also provide a different parse method for the initial requests.
<?php

use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class RoachDocsSpider extends BasicSpider
{
    public function parseOverview(Response $response): \Generator
    {
        // We're only interested in the overview page
        // because we can extract the links we're
        // actually interested in from it.
        $pages = $response
            ->filter('main > div:first-child a')
            ->links();

        foreach ($pages as $page) {
            // Since we're not specifying a different parse method,
            // all article pages will get handled by the
            // spider's `parse` method.
            yield $this->request('GET', $page->getUri());
        }
    }

    public function parse(Response $response): \Generator
    {
        // Actually parse the subpages...
    }

    /** @return Request[] */
    protected function initialRequests(): array
    {
        return [
            new Request(
                'GET',
                'https://roach-php.dev',
                // Specify a different parse method for
                // the initial request.
                [$this, 'parseOverview']
            ),
        ];
    }
}
This pattern can be useful when we don’t know the URLs we want to crawl ahead of time. Think of a news site with the current stories on the front page. We could define the initial request of our spider to crawl the front page and extract the URLs to the actual stories from it. Then, we send requests to each of these pages to scrape them.
What makes this so clean is that we're able to define a different parsing method for each kind of request. The logic for extracting URLs from the front page is completely different from the logic for scraping an actual article. Imagine having to do both of these things in a single method. It would be madness! Madness, I say!
Configuring Spiders
The way to change the behavior of Roach is by registering middleware and extensions for our spider. By doing so, we can add default headers to each request, deal with cookies, collect metrics for our runs, and much more.
There are three different kinds of middleware.
- Downloader Middleware — This middleware sits between the Roach engine and the Downloader, the component in charge of dealing with the HTTP side of things. Every outgoing request and incoming response gets passed through this middleware stack.
- Spider Middleware — The spider middleware sits between the engine and your spider. It handles responses before they get passed to your spider’s parse callback, as well as items and new requests that get emitted by our spiders.
- Item Processors — Every item that our spiders emit gets passed through the item processing pipeline. This pipeline consists of multiple item processors which, well, process the items. “Processing” can mean many different things, of course: anything from cleaning up data and filtering duplicates to storing things in a database or even sending notification emails.
Extensions, on the other hand, don't live in a specific context, but instead listen for events that get fired at various points during a run. Don't worry if this difference seems a little too esoteric at this point. It will be explained in more detail in the section about extending Roach.
Defining Spider Configuration
There is no definitive way to load a spider's configuration. You might prefer defining all configuration in separate config files or loading it dynamically from the database. To help you get started, Roach provides a BasicSpider base class that you can extend. This class allows you to provide your spider's configuration as class properties.
<?php

use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    /**
     * The spider middleware that should be used for runs
     * of this spider.
     */
    public array $spiderMiddleware = [];

    /**
     * The downloader middleware that should be used for
     * runs of this spider.
     */
    public array $downloaderMiddleware = [];

    /**
     * The item processors that emitted items will be sent
     * through.
     */
    public array $itemProcessors = [];

    /**
     * The extensions that should be used for runs of this
     * spider.
     */
    public array $extensions = [];

    /**
     * How many requests are allowed to be sent concurrently.
     */
    public int $concurrency = 2;

    /**
     * The delay (in seconds) between requests. Note that there
     * is no delay between concurrent requests. Instead, Roach
     * will wait for the `$requestDelay` before sending the
     * next "batch" of concurrent requests.
     */
    public int $requestDelay = 2;
}
To register the RequestDeduplicationMiddleware that ships with Roach, we would add its fully qualified class name (FQCN) to the $downloaderMiddleware array.
public array $downloaderMiddleware = [
    RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware::class,
];
Passing Options to Middleware
Some middleware or extensions might allow you to pass options to them. For example, the built-in UserAgentMiddleware allows us to define a custom userAgent that will be attached to every request. In these cases, we use a slightly different syntax when registering the middleware.
public array $downloaderMiddleware = [
    [
        RoachPHP\Downloader\Middleware\UserAgentMiddleware::class,
        ['userAgent' => 'Mozilla/5.0 (compatible; RoachPHP/0.1.0)'],
    ],
];
Instead of passing the FQCN of the middleware directly, we pass an array. The first entry of the array is the FQCN of the middleware, and the second entry is an array of options that will be passed to it. In the example above, we're setting the userAgent option of the middleware to Mozilla/5.0 (compatible; RoachPHP/0.1.0).
The exact options we can specify are defined by each handler individually. These options are explained on the respective sub-pages where we take a look at the built-in middleware and extensions.
This process is identical for $spiderMiddleware, $itemProcessors and $extensions.
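Both registration styles can also be mixed within the same array. As a sketch combining the two downloader middleware mentioned on this page:

public array $downloaderMiddleware = [
    // Registered by FQCN only...
    RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware::class,

    // ...or together with an array of options.
    [
        RoachPHP\Downloader\Middleware\UserAgentMiddleware::class,
        ['userAgent' => 'Mozilla/5.0 (compatible; RoachPHP/0.1.0)'],
    ],
];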
Running Spiders
After we have set up our spider, it's finally time to run it. Luckily, Roach makes this super easy. All we have to do is pass the class name of our spider to the static Roach::startSpider() method and Roach will take care of the rest.
<?php
use App\Spiders\MySpider;
use RoachPHP\Roach;
Roach::startSpider(MySpider::class);
Starting a Spider from the CLI
To start a spider directly from the CLI, we can use the roach:run command and pass it the fully qualified name of our spider.
vendor/bin/roach roach:run App\\Spiders\\MySpider
Retrieving Scraped Items After a Run
The startSpider method does not have a return value. If we want to get back all items that were scraped during a run, we can use the Roach::collectSpider() method instead.
<?php
use App\Spiders\MySpider;
use RoachPHP\Roach;
// $items is an array<int, ItemInterface>
$items = Roach::collectSpider(MySpider::class);
The return value of this method is an array of all items our spider scraped during the run. Note that for an item to be considered scraped, it needs to have passed through the spider's item pipeline without getting dropped.
Other than that, this method looks and behaves exactly the same as the startSpider method.
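As a quick sketch of working with the result, assuming the returned items expose their scraped fields via a get method:

<?php

use App\Spiders\MySpider;
use RoachPHP\Roach;

$items = Roach::collectSpider(MySpider::class);

foreach ($items as $item) {
    // Assumes items expose their scraped fields via `get`;
    // 'title' is the key used by the example spider above.
    echo $item->get('title'), PHP_EOL;
}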
Fun fact: behind the scenes, Roach simply registers an additional extension for the spider run to keep track of all scraped items.
Overriding Spider Configuration
Sometimes, it can be useful to override parts of a spider’s configuration only for a single run. For example, you might want to start a spider run based on some user input or enable a specific extension while debugging.
We can achieve this without changing the default spider configuration by passing an Overrides object as the second parameter to the Roach::startSpider method.
<?php

use App\Spiders\MySpider;
use RoachPHP\Roach;
use RoachPHP\Spider\Configuration\Overrides;

Roach::startSpider(
    MySpider::class,
    new Overrides(startUrls: ['https://my-override-url.com']),
);
The Overrides object's constructor accepts the same configuration options that our spider class does.
<?php

final class Overrides
{
    public function __construct(
        public ?array $startUrls = null,
        public ?array $downloaderMiddleware = null,
        public ?array $spiderMiddleware = null,
        public ?array $itemProcessors = null,
        public ?array $extensions = null,
        public ?int $concurrency = null,
        public ?int $requestDelay = null,
    ) {
    }
}
Any overrides passed to Roach::startSpider will get merged with the spider's default configuration before the run is started. If we don't want to override certain options, we can either explicitly pass null to the constructor or use named arguments to skip them entirely.
Roach::startSpider(
    MySpider::class,
    // Only override `requestDelay`. Use spider defaults for
    // everything else.
    new Overrides(requestDelay: 2),
);
Passing Additional Context to Spiders
In order to pass arbitrary data into our spider when starting the run, we can provide the optional $context parameter to either Roach::startSpider or Roach::collectSpider.
Roach::startSpider(
    MySpider::class,
    null,
    ['some-key' => 'some-value'],
);

// Or use named parameters
Roach::startSpider(
    MySpider::class,
    context: ['some-key' => 'some-value'],
);
The $context array can contain arbitrary data and can be accessed via $this->context inside our spider.
// Pass context when starting a run
Roach::startSpider(MySpider::class, context: ['foo' => 'bar']);

final class MySpider extends BasicSpider
{
    public function parse(Response $response): \Generator
    {
        $value = $this->context['foo']; // "bar"

        // ...
    }
}
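The same context parameter works when collecting items. Mirroring the example above:

// Collect the scraped items from a run that was started with context.
$items = Roach::collectSpider(MySpider::class, context: ['foo' => 'bar']);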