README.md
# ReqCrawl
[![ReqCrawl version](https://img.shields.io/hexpm/v/req_crawl.svg)](https://hex.pm/packages/req_crawl)
[![Hex Docs](https://img.shields.io/badge/hex-docs-lightgreen.svg)](https://hexdocs.pm/req_crawl/)
[![Hex Downloads](https://img.shields.io/hexpm/dt/req_crawl)](https://hex.pm/packages/req_crawl)
[![Twitter Follow](https://img.shields.io/twitter/follow/ac_alejos?style=social)](https://twitter.com/ac_alejos)
Req plugins to support common crawling functions
## Installation
```elixir
def deps do
[
{:req_crawl, "~> 0.2.0"}
{:saxy, "~> 1.5"} # Optionally to use `ReqCrawl.Sitemap`
]
end
```
## Plugins
### ReqCrawl.Robots
A Req plugin to parse robots.txt files
You can attach this plugin to any `%Req.Request` you use for a crawler and it will only run against
URLs with a path of `/robots.txt`.
It outputs a map with the following fields:
* `:errors` - A list of any errors encountered during parsing
* `:sitemaps` - A list of the sitemaps
* `:rules` - A map of the rules with User-Agents as the keys and a map with the following values as the fields:
* `:allow` - A list of allowed paths
* `:disallow` - A list of the disallowed paths
### ReqCrawl.Sitemap
Gathers all URLs from a Sitemap or SitemapIndex according to the specification described
at <https://sitemaps.org/protocol.html>
Supports the following formats:
* `.xml` (for `sitemap` and `sitemapindex`)
* `.txt` (for `sitemap`)
Outputs a 2-Tuple of `{type, urls}` where `type` is one of `:sitemap` or `:sitemapindex` and `urls` is a list
of URL strings extracted from the body.
Output is stored in the `ReqResponse` in the private field under the `:crawl_sitemap` key