# ExCavator

Excavate JavaScript variable values from HTML script tags using proper AST parsing — no regex needed.

Uses [OXC](https://oxc.rs/) (via `igniter_js`) for fast Rust-based JS parsing, then walks the ESTree AST in pure Elixir to extract variable bindings.

## Installation

Add `excavator` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:excavator, "~> 0.1.0"}
  ]
end
```

## Usage

### Extract all variables from HTML

```elixir
html = """
<html>
  <script>
    window.__INITIAL_STATE__ = {"user": "alice", "token": "abc-xyz"};
    var apiKey = "secret_key_890";
    const data = JSON.parse('{"items":[1,2,3]}');
  </script>
</html>
"""

{:ok, results} = ExCavator.extract_all(html)
# [
#   %{name: "window.__INITIAL_STATE__", value: %{"user" => "alice", "token" => "abc-xyz"}, source: :assignment},
#   %{name: "apiKey", value: "secret_key_890", source: :variable_declaration},
#   %{name: "data", value: %{"items" => [1, 2, 3]}, source: :variable_declaration}
# ]
```
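
Because each result is a plain map, the list can be reshaped with standard `Enum` and `Map` functions. The sketch below hard-codes the `results` list shown above instead of calling the library, so it runs on its own. It builds a name-to-value lookup and picks out the names that came from declarations.

```elixir
# Hard-coded copy of the results shown above, so this runs without ExCavator.
results = [
  %{name: "window.__INITIAL_STATE__", value: %{"user" => "alice", "token" => "abc-xyz"}, source: :assignment},
  %{name: "apiKey", value: "secret_key_890", source: :variable_declaration},
  %{name: "data", value: %{"items" => [1, 2, 3]}, source: :variable_declaration}
]

# Build a lookup table keyed by variable name.
by_name = Map.new(results, fn %{name: name, value: value} -> {name, value} end)

# Keep only the names that came from var/let/const declarations.
declared =
  for %{name: name, source: :variable_declaration} <- results, do: name
```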

### Extract a specific variable

```elixir
{:ok, state} = ExCavator.extract(html, "__INITIAL_STATE__")
# %{"user" => "alice", "token" => "abc-xyz"}

{:ok, key} = ExCavator.extract(html, "apiKey")
# "secret_key_890"
```

### Parse JS directly (skip HTML)

```elixir
{:ok, results} = ExCavator.extract_from_js("const x = 42;")
# [%{name: "x", value: 42, source: :variable_declaration}]
```

## Supported patterns

- `var/let/const x = VALUE` — string, number, boolean, null, object, array
- `window.x = VALUE` — dotted member assignments (including nested: `window.a.b`)
- `const x = JSON.parse('...')` — string arg decoded with Jason
- Negative numbers (`-42`), template literals (`` `hello` ``)
- Nested objects and arrays
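
To illustrate how a dotted name such as `window.a.b` can be recovered from an AST, here is a minimal sketch. It assumes standard ESTree `MemberExpression` and `Identifier` nodes represented as maps with string keys. The `DottedNameSketch` module is hypothetical and is not ExCavator's actual code.

```elixir
defmodule DottedNameSketch do
  # Hypothetical helper: turns an ESTree member-expression node into a
  # dotted name, or returns :error for shapes it does not understand
  # (for example computed access such as a[b]).
  def dotted_name(%{"type" => "Identifier", "name" => name}), do: {:ok, name}

  def dotted_name(%{
        "type" => "MemberExpression",
        "computed" => false,
        "object" => object,
        "property" => %{"type" => "Identifier", "name" => prop}
      }) do
    # Resolve the left-hand side first, then append this property.
    with {:ok, prefix} <- dotted_name(object), do: {:ok, prefix <> "." <> prop}
  end

  def dotted_name(_other), do: :error
end

# Hand-written node for the expression `window.a.b`.
node = %{
  "type" => "MemberExpression",
  "computed" => false,
  "object" => %{
    "type" => "MemberExpression",
    "computed" => false,
    "object" => %{"type" => "Identifier", "name" => "window"},
    "property" => %{"type" => "Identifier", "name" => "a"}
  },
  "property" => %{"type" => "Identifier", "name" => "b"}
}

result = DottedNameSketch.dotted_name(node)
```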

## How it works

1. **Floki** parses HTML and extracts inline `<script>` tag contents
2. **OXC** (via `igniter_js` Rustler NIF) parses JS to ESTree AST
3. Pure Elixir pattern matching walks the AST to find assignments and convert values
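
The third step can be sketched as follows. This is an illustration under assumptions, not the library's implementation: it assumes the parser returns standard ESTree nodes as maps with string keys, and the `AstWalkSketch` module and its function names are hypothetical. The AST is written by hand so the example runs without the NIF.

```elixir
defmodule AstWalkSketch do
  # Collect bindings from the top-level statements of a Program node.
  def extract(%{"type" => "Program", "body" => body}) do
    Enum.flat_map(body, &statement/1)
  end

  # var/let/const x = VALUE; declarations whose value cannot be
  # converted are skipped.
  defp statement(%{"type" => "VariableDeclaration", "declarations" => decls}) do
    for %{"id" => %{"type" => "Identifier", "name" => name}, "init" => init} <- decls,
        {:ok, value} <- [to_value(init)] do
      %{name: name, value: value, source: :variable_declaration}
    end
  end

  # Every other statement type is ignored in this sketch.
  defp statement(_other), do: []

  # Convert a value node into an Elixir term.
  defp to_value(%{"type" => "Literal", "value" => value}), do: {:ok, value}

  defp to_value(%{"type" => "UnaryExpression", "operator" => "-", "argument" => arg}) do
    case to_value(arg) do
      {:ok, n} when is_number(n) -> {:ok, -n}
      _ -> :error
    end
  end

  defp to_value(%{"type" => "ArrayExpression", "elements" => elements}) do
    collect(elements, &to_value/1)
  end

  defp to_value(%{"type" => "ObjectExpression", "properties" => props}) do
    with {:ok, pairs} <- collect(props, &property/1), do: {:ok, Map.new(pairs)}
  end

  # Calls, identifiers, and anything else are not static values.
  defp to_value(_unsupported), do: :error

  defp property(%{"type" => "Property", "key" => key, "value" => value}) do
    with {:ok, k} <- key_name(key), {:ok, v} <- to_value(value), do: {:ok, {k, v}}
  end

  defp property(_other), do: :error

  defp key_name(%{"type" => "Identifier", "name" => name}), do: {:ok, name}
  defp key_name(%{"type" => "Literal", "value" => v}) when is_binary(v), do: {:ok, v}
  defp key_name(_other), do: :error

  # Apply fun to every item; fail as a whole if any single item fails.
  defp collect(items, fun) do
    items
    |> Enum.reduce_while({:ok, []}, fn item, {:ok, acc} ->
      case fun.(item) do
        {:ok, v} -> {:cont, {:ok, [v | acc]}}
        :error -> {:halt, :error}
      end
    end)
    |> case do
      {:ok, acc} -> {:ok, Enum.reverse(acc)}
      :error -> :error
    end
  end
end

# Hand-written ESTree for `const x = 42;`.
ast = %{
  "type" => "Program",
  "body" => [
    %{
      "type" => "VariableDeclaration",
      "kind" => "const",
      "declarations" => [
        %{
          "type" => "VariableDeclarator",
          "id" => %{"type" => "Identifier", "name" => "x"},
          "init" => %{"type" => "Literal", "value" => 42}
        }
      ]
    }
  ]
}

bindings = AstWalkSketch.extract(ast)
```

Returning `:error` for unsupported nodes, instead of raising, lets the walker skip bindings such as `var token = generateToken();` without aborting the whole script.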

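## Manual approach without ExCavator

For comparison, the steps below show the regex-based technique that scrapers commonly use when no JavaScript parser is available. It is more brittle than the AST-based extraction described above.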

### Step 1: Extract the `<script>` contents with Floki

To parse HTML in Elixir, **[Floki](https://hexdocs.pm/floki/readme.html)** is a widely used library. It lets you query HTML with CSS selectors.

First, add Floki to your `mix.exs`:
```elixir
def deps do
  [
    {:floki, "~> 0.35.0"} # check hex.pm for the latest version
  ]
end
```

Then, parse the HTML to isolate the script tags:
```elixir
html = """
<html>
  <head>
    <script>
      var unimportant = "ignore me";
    </script>
    <script>
      window.__INITIAL_STATE__ = {"user_id": 123, "token": "abc-xyz"};
      var apiKey = "secret_key_890";
    </script>
  </head>
  <body>...</body>
</html>
"""

# Parse the document
{:ok, document} = Floki.parse_document(html)

# Find all script tags and extract their raw text content.
# Floki.text/2 skips script contents unless `js: true` is passed.
script_contents =
  document
  |> Floki.find("script")
  |> Enum.map_join("\n", &Floki.text(&1, js: true)) # Combine them to search all at once
```

### Step 2: Extract the JavaScript Variables

Elixir's standard library does not include a JavaScript parser. Without one, scrapers typically extract variables with a **regex** combined with a JSON decoder (such as `Jason`) instead of parsing or executing the JavaScript.

Here are the two most common scenarios:

#### Scenario A: Extracting a simple string/integer variable
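If you just need a simple assignment (e.g., `var apiKey = "secret_key_890";`), a regex is usually enough. The pattern below matches a double-quoted string value; it is sensitive to spacing and to the declaration keyword.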

```elixir
# Extracting the apiKey variable
regex = ~r/var apiKey = "(.*?)";/

case Regex.run(regex, script_contents) do
  [_, api_key] -> 
    IO.puts("Found API Key: #{api_key}")
  nil -> 
    IO.puts("API Key not found")
end
```
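
For the integer case, a slightly more tolerant sketch is shown here. It uses a hypothetical `retryLimit` variable, assumes the value is a plain integer literal, accepts any declaration keyword, and allows flexible whitespace before converting the capture with `String.to_integer/1`.

```elixir
script = """
var apiKey = "secret_key_890";
let retryLimit=3;
"""

# Accept var/let/const, tolerate optional whitespace, and capture an
# optional minus sign followed by digits.
regex = ~r/\b(?:var|let|const)\s+retryLimit\s*=\s*(-?\d+)\s*;/

retry_limit =
  case Regex.run(regex, script, capture: :all_but_first) do
    [digits] -> String.to_integer(digits)
    nil -> nil
  end
```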

#### Scenario B: Extracting a JSON-like object
Often, modern web apps inject state into the HTML (like `window.__INITIAL_STATE__ = {...};`). You can use Regex to grab the JSON payload and decode it with `Jason`.

```elixir
# Add {:jason, "~> 1.4"} to your mix.exs deps

# Look for the variable and capture from the opening brace to the first `};`
regex = ~r/window\.__INITIAL_STATE__ = (\{.*?\});/s

case Regex.run(regex, script_contents) do
  [_, json_string] ->
    # Decode the captured string into an Elixir map
    case Jason.decode(json_string) do
      {:ok, parsed_state} -> 
        IO.inspect(parsed_state, label: "Parsed State")
        # Now you can access parsed_state["token"]
      {:error, _} -> 
        IO.puts("Found the variable, but it wasn't valid JSON")
    end
  nil ->
    IO.puts("Initial state not found")
end
```
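
The non-greedy pattern stops at the first `};` it encounters, so a payload that contains that sequence inside a string is cut short. One possible workaround, sketched here with a hypothetical `BraceScan` module, is to locate the start of the assignment with a regex and then scan forward while tracking brace depth and string state. It assumes the payload is JSON, so only double-quoted strings are tracked.

```elixir
defmodule BraceScan do
  # Hypothetical helper: returns the balanced {...} text that starts at the
  # beginning of `input`, or :error if the braces never balance.
  def take_object("{" <> _ = input), do: scan(input, input, 0, false, 0)
  def take_object(_other), do: :error

  # scan(original, remaining, depth, inside_string?, byte_position)
  defp scan(_input, <<>>, _depth, _in_str, _pos), do: :error

  # Inside a string: skip escaped bytes, leave on an unescaped quote.
  defp scan(input, <<"\\", _c, rest::binary>>, depth, true, pos),
    do: scan(input, rest, depth, true, pos + 2)

  defp scan(input, <<"\"", rest::binary>>, depth, true, pos),
    do: scan(input, rest, depth, false, pos + 1)

  defp scan(input, <<_c, rest::binary>>, depth, true, pos),
    do: scan(input, rest, depth, true, pos + 1)

  # Outside a string: track quotes and brace depth.
  defp scan(input, <<"\"", rest::binary>>, depth, false, pos),
    do: scan(input, rest, depth, true, pos + 1)

  defp scan(input, <<"{", rest::binary>>, depth, false, pos),
    do: scan(input, rest, depth + 1, false, pos + 1)

  # The brace that closes the outermost object ends the scan.
  defp scan(input, <<"}", _rest::binary>>, 1, false, pos),
    do: {:ok, binary_part(input, 0, pos + 1)}

  defp scan(input, <<"}", rest::binary>>, depth, false, pos),
    do: scan(input, rest, depth - 1, false, pos + 1)

  defp scan(input, <<_c, rest::binary>>, depth, false, pos),
    do: scan(input, rest, depth, false, pos + 1)
end

script = ~S|window.__INITIAL_STATE__ = {"note": "ends with };", "nested": {"a": 1}}; var x = 1;|

# Find where the assignment ends, then scan from the opening brace.
[{start, len}] = Regex.run(~r/window\.__INITIAL_STATE__\s*=\s*/, script, return: :index)
rest = binary_part(script, start + len, byte_size(script) - start - len)
{:ok, json_string} = BraceScan.take_object(rest)
```

The resulting `json_string` can then be passed to `Jason.decode/1` as in the example above.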

### What if the JavaScript is too complex for Regex?

If the JavaScript is heavily minified, formatted unpredictably, or requires actual execution (e.g., the variable is produced by a call such as `var token = generateToken();`), a regex is not enough. Static values can still be recovered by parsing the code into an AST, which is what ExCavator does; values that only exist at runtime require executing the script.

Outside of ExCavator, you have two options in Elixir:
1. **Tree-sitter bindings:** You can use the [tree_sitter](https://github.com/elixir-tree-sitter/tree_sitter) library with `tree_sitter_javascript` to parse the JS into an AST and traverse it. This is robust but has a steep learning curve.
2. **Execute it externally:** Use a library like [NodeJS](https://github.com/revelrylabs/elixir-nodejs) to send the script content to a background Node.js process, evaluate it, and return the result back to Elixir.