Lua Hooks Guide

High-performance web monitoring with zero dependencies

7 Hook Stages • Hot Reload • Error Resilient

Overview

SpyWeb is a zero-dependency web monitoring engine that lets you customize every stage of the pipeline with Lua scripts (Luau dialect). Place a hooks.lua file alongside your job's config.toml and define any combination of the 7 hook functions.

All hooks are optional. Omit any you don't need — spyweb runs fine with just a config.toml.

Key Rules

  • Return the value (modified or not) to continue the pipeline
  • Return nil or false to drop/skip (behavior varies per hook)
  • Lua globals persist across runs — useful for pagination counters
  • Hot-reload creates a fresh Lua VM and resets all state
  • Hook errors are logged and skipped — they never crash the job
  • Efficiency: Only define the hooks you actually need. SpyWeb pre-detects which functions exist at startup and skips the processing logic entirely for any that are missing.

Pipeline Stages

Every scrape run flows through these stages in order:

before_fetch(request)      (executor thread)
    Modify URL, headers, or return nil to skip this run entirely.
[HTTP fetch]               (background thread)
    Automatic — uses request config from previous stage.
after_fetch(response)      (executor thread)
    Modify the response body, or return nil to skip extraction.
[CSS extraction]           (background thread)
    Automatic — raw parser with DOM fallback.
after_extract(items)       (executor thread)
    Batch filter/modify all items. nil or an empty table means no items.
filter_item(item)          (executor thread)
    Per-item filter. Replaces the built-in keyword filter if defined.
before_store(items)        (executor thread)
    Last chance before DB insert. nil skips both store and notify.
[dedup + insert]           (background thread)
    Automatic — atomic check-and-insert.
before_notify(items)       (executor thread)
    Reshape or silence notifications. Items are already stored.
before_webhook(payload)    (executor thread)
    Reshape or silence webhook POSTs. Receives the full JSON payload.
[notify + webhook]         (background thread)
    Automatic — desktop notification + webhook POST.

File Setup

Each job with Lua hooks lives in its own directory under jobs/:

jobs/
└── my-scraper/
    ├── config.toml    # job configuration (required)
    └── hooks.lua      # Lua hooks (optional)

spyweb auto-discovers hooks.lua when loading the job. No config changes needed — just create the file.

Hook Reference

1 before_fetch(request)

Called before every HTTP request. Modify the URL, add headers, or skip entirely.

Field              Type     Description
request.url        string   Target URL (mutable)
request.headers    table    HTTP headers, string→string (mutable)

Return value       Effect
request table      Continue with the modified request
nil / false        Skip this entire run

-- Basic: add auth header
function before_fetch(request)
    request.headers["Authorization"] = "Bearer my-token"
    return request
end

2 after_fetch(response)

Called after HTTP fetch. Inspect status, modify body, or skip extraction.

Field              Type     Description
response.body      string   HTML body (mutable)
response.status    number   HTTP status code
response.url       string   Final URL (after redirects)

function after_fetch(response)
    if response.status ~= 200 then
        print("Bad status: " .. response.status)
        return nil  -- skip extraction
    end
    -- Strip unwanted content from the body
    response.body = response.body:gsub("<script.-</script>", "")
    return response
end

3 after_extract(items)

Batch hook — receives all extracted items at once. Good for cross-item logic, dedup, or sorting.

No short-circuit: returning nil or empty table just means no items, not a pipeline abort.

function after_extract(items)
    -- Remove duplicate titles
    local seen = {}
    local unique = {}
    for _, item in ipairs(items) do
        local title = item.fields.title or ""
        if not seen[title] and title ~= "" then
            seen[title] = true
            table.insert(unique, item)
        end
    end
    return unique
end

4 filter_item(item)

Mutually exclusive with keyword filter. If filter_item() exists in your script, the built-in keyword filter is skipped entirely — even if you have keywords in your config.

Called once per item. Return the item to keep it, or nil to drop it before it reaches the database.

Field           Type
item.fields     table (string→string)
item.matches    array of strings

function filter_item(item)
    local title = (item.fields.title or ""):lower()

    -- Drop unwanted items
    if title == "" or title:find("sponsored") then
        return nil
    end

    -- Mutate fields before storage
    item.fields.title = title:sub(1,1):upper() .. title:sub(2)

    return item
end

5 before_store(items)

⚠️ Footgun warning: Items dropped by before_store returning nil never enter the database. This means they will appear as new again on the next run, because the DB has no record of them. This is intentional but can cause infinite notification loops if misused.

Last chance to drop items before dedup + insert. Return nil to skip storing and notifying.

function before_store(items)
    -- Only store during business hours
    local hour = os.date("*t").hour
    if hour < 9 or hour > 17 then
        return nil  -- don't store, don't notify
    end
    return items
end

6 before_notify(items)

Reshape or silence notifications. Items are already stored at this point — dropping them here only affects what gets notified or webhooked.

function before_notify(items)
    -- Silence notifications if fewer than 3 new items
    if #items < 3 then return nil end

    -- Only notify for the first 5 items
    local capped = {}
    for i = 1, math.min(#items, 5) do
        capped[i] = items[i]
    end
    return capped
end

7 before_webhook(payload)

Reshape or silence webhook POSTs. Receives the full JSON payload (job_name, item_count, items).

function before_webhook(payload)
    -- Add a secret token to the payload
    payload.secret = "my-secure-token"
    
    -- Filter items in the webhook but keep them in the notification
    local filtered = {}
    for _, item in ipairs(payload.items) do
        local price = tonumber(item.fields.price)  -- nil if missing or non-numeric
        if price and price < 100 then
            table.insert(filtered, item)
        end
    end
    payload.items = filtered
    payload.item_count = #filtered
    
    if #filtered == 0 then return nil end -- skip webhook
    return payload
end

Global Functions

spyweb injects several global helper functions into the Lua environment.

http_get(url, [headers])

Performs a synchronous HTTP GET request. The headers argument is an optional Lua table containing key-value pairs for request headers.

local html = http_get("https://api.example.com", {
    ["Authorization"] = "Bearer token123",
    ["X-Custom-Header"] = "my-value"
})

http_post(url, body, [headers])

Performs a synchronous HTTP POST request. By default, it sets the Content-Type to application/x-www-form-urlencoded unless overridden in the headers argument.

local response = http_post("https://api.example.com", '{"foo":"bar"}', {
    ["Content-Type"] = "application/json",
    ["Accept"] = "application/json"
})

log(message)

Appends a message to hook.log inside the job's directory. This is the recommended way to debug hooks without using external libraries.

function after_extract(items)
    log("Extracted " .. #items .. " items")
    return items
end

Config Reference

Job Config (config.toml)

Field           Type       Default    Description
name            string     required   Display name
url             string     required   Target URL
selector        string     required   CSS selector for item containers
fields          array      required   Fields to extract
enabled         bool       true       Enable/disable the job
interval        u32        30         Seconds between runs
keywords        string[]   none       Filter items by keywords
search_fields   string[]   none       Limit keyword search to specific fields
debug           bool       false      Save raw HTML + JSON
headers         table      none       Custom HTTP headers
hash_fields     string[]   all        Fields used for the deduplication hash
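Putting the fields above together, a hypothetical job config might look like this (the job name, URL, and selectors are placeholders, not real endpoints):

```toml
# config.toml for a hypothetical "rust-jobs" job; URL and selectors are placeholders
name = "rust-jobs"
url = "https://example.com/jobs"
selector = ".job-row"
fields = ["title:h2", "link:a@href", "company:.company"]
interval = 60
keywords = ["rust"]
search_fields = ["title"]
# Key deduplication on the link alone so volatile fields don't resurface old items
hash_fields = ["link"]
```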

Field Syntax

# Shorthand: "name:selector" or "name:selector@attribute"
fields = ["title:h2", "link:a@href", "desc:.summary"]

# Full form
fields = [
  { name = "title", selector = "h2", att = "text" },
  { name = "link", selector = "a", att = "href" },
]

Examples

Cycle through pages using a Lua global counter:

-- hooks.lua — globals persist across runs
page = 1

function before_fetch(request)
    if string.find(request.url, "?") then
        request.url = request.url .. "&page=" .. page
    else
        request.url = request.url .. "?page=" .. page
    end
    page = page + 1
    if page > 5 then page = 1 end
    return request
end

function filter_item(item)
    if (item.fields.title or "") == "" then
        return nil
    end
    return item
end

Replace the keyword filter with custom Lua logic:

-- hooks.lua — replaces built-in keyword filter
blocked = { ["spam co"] = true }

function filter_item(item)
    local title = (item.fields.title or ""):lower()
    local company = (item.fields.company or ""):lower()

    if blocked[company] then return nil end
    if title:find("intern") then return nil end

    -- Must mention a wanted tech
    local has_wanted_tech = false
    for _, tech in ipairs({"rust", "go", "python"}) do
        if title:find(tech) then
            has_wanted_tech = true
            break
        end
    end
    if not has_wanted_tech then return nil end

    return item
end

Add auth headers dynamically:

-- hooks.lua — dynamic headers
function before_fetch(request)
    request.headers["Authorization"] = "Bearer my-api-key"
    request.headers["X-Custom"] = "spyweb"
    return request
end

function after_fetch(response)
    if response.status == 401 then
        print("Auth failed!")
        return nil
    end
    return response
end

Tips & Gotchas

  • Globals persist: Lua globals survive across runs. Great for counters, caches, pagination state. Hot-reload resets them (fresh VM).
  • filter_item vs keywords: Mutually exclusive. If filter_item() exists, the built-in keyword filter does not run at all.
  • Error handling: Hook errors are logged to stderr and skipped. The pipeline continues with the original data. A broken script never kills a job.
  • before_store footgun: Items dropped by before_store have no DB record and will reappear as new on the next run. Use filter_item for permanent drops.
  • Threading: Hooks run on the async executor thread. Blocking HTTP and DB operations run in background threads. Keep Lua code fast and non-blocking.
  • Luau dialect: SpyWeb uses Luau by default for stability. For standard Lua 5.4 support (enabling C modules), see the Building from Source guide.
  • Standard libraries: os.time, os.date, math.*, and require are available for code sharing. Modules are searched in the job's directory first, then in the global shared/ folder. The io library is disabled; use log() instead.
  • Logging: Calls to log(msg) append to hook.log in the job's directory. This file is persistent and useful for long-term debugging.
  • Hot reload: Edit config.toml or hooks.lua and spyweb auto-detects the change and reloads. All Lua state resets. No restart needed.
  • Deduplication: By default, SpyWeb hashes all fields to determine uniqueness. Use hash_fields only if you want to lock an item's identity to a specific field (like a URL) and ignore changes in volatile data (like "time ago").
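The module lookup described above can be sketched as follows. The utils module name and its trim helper are hypothetical examples; the pcall guard keeps the hook working even if shared/utils.lua doesn't exist yet:

```lua
-- hooks.lua sketch: pull a helper from a hypothetical shared module.
-- require() checks the job's own directory first, then the global shared/ folder.
local ok, utils = pcall(require, "utils")

-- Fallback implementation so the hook still works if the module is absent.
local function trim(s)
    return (s:gsub("^%s*(.-)%s*$", "%1"))
end
if ok and type(utils) == "table" and utils.trim then
    trim = utils.trim
end

function filter_item(item)
    item.fields.title = trim(item.fields.title or "")
    if item.fields.title == "" then return nil end
    return item
end
```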