# Crawler

A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

## Usage

    Crawler.crawl("", max_depths: 2)

## Configurations

| Option          | Type    | Default Value         | Description |
|-----------------|---------|-----------------------|-------------|
| `:max_depths`   | integer | `3`                   | Maximum nested depth of pages to crawl. |
| `:workers`      | integer | `10`                  | Maximum number of concurrent workers for crawling. |
| `:interval`     | integer | `0`                   | Rate limit control: number of milliseconds to wait before crawling more pages. Defaults to `0`, which is effectively no rate limit. |
| `:timeout`      | integer | `5000`                | Timeout for fetching a page, in milliseconds. |
| `:user_agent`   | string  | `Crawler/x.x.x (...)` | User-Agent value sent with fetch requests. |
| `:save_to`      | string  | `nil`                 | When provided, the path for saving crawled pages. |
| `:parser`       | module  | `Crawler.Parser`      | The default parser. Override it when you need to handle parsing differently or add extra functionality. |
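
The options in the table are passed as a keyword list in the second argument to `Crawler.crawl`. As a minimal sketch of how overrides combine with the documented defaults, the snippet below uses a hypothetical `CrawlOptions` helper (not part of Crawler) and a placeholder URL. Both are assumptions made for illustration only.

```elixir
defmodule CrawlOptions do
  # Hypothetical helper, not part of Crawler. Defaults are copied from the
  # table above; :user_agent is left out because its default is version-specific.
  @defaults [
    max_depths: 3,
    workers: 10,
    interval: 0,
    timeout: 5_000,
    save_to: nil,
    parser: Crawler.Parser
  ]

  # Merge caller overrides onto the defaults; overrides win on conflicts.
  def build(overrides \\ []) when is_list(overrides) do
    Keyword.merge(@defaults, overrides)
  end
end

# Two levels deep, at most 5 workers, a 500 ms rate-limit interval.
opts = CrawlOptions.build(max_depths: 2, workers: 5, interval: 500)

# Placeholder URL (assumption). The call only runs when Crawler is loaded.
url = "https://example.com/"
if Code.ensure_loaded?(Crawler), do: Crawler.crawl(url, opts)
```

Crawler applies its own defaults, so passing only the options you want to change should be enough in practice. The helper simply makes the effective values explicit.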

## Features Backlog

Crawler is under active development. Below is a non-exhaustive list of features, both implemented and planned.

- [x] Set the maximum crawl depth.
- [x] Save to disk.
- [x] Set timeouts.
- [ ] Crawl assets (CSS and images, etc).
- [ ] The ability to manually stop/pause/restart the crawler.
- [ ] Restrict crawlable domains, paths or file types.
- [x] Limit concurrent crawlers.
- [x] Limit rate of crawling.
- [x] Set crawler's user agent.
- [ ] The ability to retry a failed crawl.
- [ ] DSL for scraping page content.

## Changelog

## License

