README.rst

==========
treewalker
==========

.. image:: https://travis-ci.org/AntoineGagne/treewalker.svg?branch=master
    :target: https://travis-ci.org/AntoineGagne/treewalker

.. image:: http://img.shields.io/hexpm/v/treewalker.svg?style=flat
    :target: https://hex.pm/packages/treewalker

.. image:: https://img.shields.io/github/release/AntoineGagne/treewalker?color=brightgreen
    :target: https://github.com/AntoineGagne/treewalker/releases

.. image:: https://coveralls.io/repos/github/AntoineGagne/treewalker/badge.svg?branch=master
    :target: https://coveralls.io/github/AntoineGagne/treewalker?branch=master

:Author: `Antoine Gagné <gagnantoine@gmail.com>`_

.. contents::
    :backlinks: none

.. sectnum::

A web crawler in Erlang that respects ``robots.txt``.

Installation
============

This library is available on `hex.pm <https://hex.pm/packages/treewalker>`_.

Keep in mind that the library is not yet stable and its API may be subject to changes.

Usage
=====

.. code-block:: erlang

    %% This will add the specified crawler to the supervision tree
    {ok, _} = treewalker:add_crawler(example, #{scraper => example_scraper,
                                                fetcher => example_fetcher,
                                                max_depth => 3,
                                                link_filter => example_link_filter,
                                                store => example_store}),
    %% Starts crawling
    ok = treewalker:start_crawler(example),
    %% ...
    %% Stops the crawler
    %% The pending requests will be completed and dropped
    ok = treewalker:stop_crawler(example),

Options
=======

The following settings are available via the ``sys.config`` configuration:

.. code-block:: erlang

    {treewalker, [
                  %% The minimum delay to wait before retrying a failed request
                  {min_retry_delay, pos_integer()},
                  %% The maximum delay to wait before retrying a failed request
                  {max_retry_delay, pos_integer()},
                  %% The maximum amount of retries of a failed request
                  {max_retries, pos_integer()},
                  %% The maximum amount of delay before starting a request (in seconds)
                  {max_worker_delay, pos_integer()},
                  %% The maximum amount of concurrent workers making HTTP requests
                  {max_concurrent_worker, pos_integer()},
                  %% The user agent making the HTTP requests
                  {user_agent, binary()}]},

Development
===========

Running all the tests and linters
---------------------------------

You can run all the tests and linters with the ``rebar3`` alias:

.. code-block:: sh

    rebar3 check