README.md

## ExCrawlzy

Another crawling library but with more than just crawl

You can crawl sites and transform content to json/map using CSS selectors with no other libraries, utilities included
and more than a simple integration, you can transform a simple site to json with fields like lists or another sub-json
structures

## Installation

If [available in Hex](https://hexdocs.pm/ex_crawlzy), the package can be installed
by adding `ex_crawlzy` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:ex_crawlzy, "~> 0.1.0"}
  ]
end
```

## Usage

Just use the function [`ExCrawlzy.crawl/1`](https://hexdocs.pm/ex_crawlzy/0.1.0/ExCrawlzy.html#crawl/1) to
crawl and [`ExCrawlzy.parse/2`](https://hexdocs.pm/ex_crawlzy/0.1.0/ExCrawlzy.html#parse/2) to parse to json

#### Basic usage

```elixir
site = "https://example.site"

fields = %{
  # shortcut for use a function from ExCrawlzy.Utils
  body: {"div#the_body", :text}
#  module/function way
#  body: {"div#the_body", {ExCrawlzy.Utils, :text}}
#  body: {"div#the_body", fn content -> 
#   ExCrawlzy.Utils.text(content)
#  end}
}

{:ok, content} = ExCrawlzy.crawl(site)
{:ok, %{body: body}} = ExCrawlzy.parse(fields, content)
```

#### Using Client

You can create a module pre-configured with key, selector and processing functions and just call using the function
`crawl/1` inside the same module

```elixir
defmodule ExampleCrawler do
  use ExCrawlzy.Client.Json
  
  add_field(:title, "head title", :text)
  add_field(:body, "div#the_body", :text)
  add_field(:inner_field, "div#the_body div#inner_field", :text)
  add_field(:inner_second_field, "div#inner_second_field", :text_alt)
  add_field(:number, "div#the_number", :text)
  add_field(:exist, "div#the_body div#exist", :exist)
  add_field(:not_exist, "div#the_body div#not_exist", :exist)
  add_field(:link, "a.link_class", :link)
  add_field(:img, "img.img_class", :img)

  def text_alt(sub_doc) do
    ExCrawlzy.Utils.text(sub_doc)
  end
end

site = "https://example.site"

{:ok, data} = ExampleCrawler.crawl(site)
```

#### List of elements

You can create a client that parses multiple elements from html using css selectors

Using `list_selector/1` you can define the selector that all elements matches, the next is define as the client 
`ExCrawlzy.Client.Json` and this are the inner elements

```elixir
defmodule ExampleCrawlerList do
  use ExCrawlzy.Client.List

  list_size(2)
  list_selector("div.possible_value")
  add_field(:field_1, "div.field_1", :text)
  add_field(:field_2, "div.field_2", :text)
end

site = "https://example_list.site"

{:ok, data} = ExampleCrawlerList.crawl(site)
```

#### A good example

```elixir
defmodule GithubProfilePinnedRepos do
  use ExCrawlzy.Client.List

  list_selector("div.pinned-item-list-item")
  add_field(:name, "a.mr-1 span.repo", :text)
  add_field(:link, "a.mr-1", :link)
  add_field(:access, "span.Label", :text)
  add_field(:description, "p.pinned-item-desc", :text)
  add_field(:language, "span.d-inline-block span[itemprop=\"programmingLanguage\"]", :text)

  def link(doc) do
    path = ExCrawlzy.Utils.props("href", doc)
    "https://github.com#{path}"
  end
end

site = "https://github.com/nicolkill"

{
  :ok, 
  [
    %{
      access: "Public",
      description: "An API Prototype Platform",
      link: "https://github.com/nicolkill/dbb",
      name: "dbb",
      language: "Elixir"
    },
    %{
      access: "Public",
      description: "JSON Schema verifier in Elixir",
      link: "https://github.com/nicolkill/map_schema_validator",
      name: "map_schema_validator",
      language: "Elixir"
    },
    %{
      access: "Public",
      description: "",
      link: "https://github.com/nicolkill/ex_crawlzy",
      name: "ex_crawlzy",
      language: "Elixir"
    }
  ]
} == ExampleCrawlerList.crawl(site)
```