Lua Hooks Guide

Customize every stage of the scraping pipeline

6 Hook Stages · Hot Reload · Error Resilient

Overview

spyweb lets you customize every stage of the scraping pipeline with Lua scripts (Luau dialect). Place a hooks.lua alongside your job's config.toml and define any combination of 6 hook functions.

All hooks are optional. Omit any you don't need — spyweb runs fine with just a config.toml.

Key Rules

  • Return the value (modified or not) to continue the pipeline
  • Return nil or false to drop/skip (behavior varies per hook)
  • Lua globals persist across runs — useful for pagination counters
  • Hot-reload creates a fresh Lua VM and resets all state
  • Hook errors are logged and skipped — they never crash the job

Pipeline Stages

Every scrape run flows through these stages in order:

  1. before_fetch(request) (executor thread)
     Modify the URL or headers, or return nil to skip this run entirely.
  2. [HTTP fetch] (background thread)
     Automatic: uses the request config from the previous stage.
  3. after_fetch(response) (executor thread)
     Modify the response body, or return nil to skip extraction.
  4. [CSS extraction] (background thread)
     Automatic: raw parser with DOM fallback.
  5. after_extract(items) (executor thread)
     Batch filter/modify all items. nil or an empty table means no items.
  6. filter_item(item) (executor thread)
     Per-item filter. Replaces the built-in keyword filter if defined.
  7. before_store(items) (executor thread)
     Last chance before the DB insert. nil skips store and notify.
  8. [dedup + insert] (background thread)
     Automatic: atomic check-and-insert.
  9. before_notify(items) (executor thread)
     Reshape or silence notifications. Items are already stored.
  10. [notify + webhook] (background thread)
     Automatic: desktop notification + webhook POST.
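
The stage order above can be sketched as a pass-through hooks.lua. This is a minimal skeleton, not a required template: each hook returns its argument unchanged, and the comments restate the skip behavior described in this guide.

```lua
-- hooks.lua: pass-through skeleton, one function per hook stage.
-- Each hook returns its argument unchanged.

function before_fetch(request)
    return request          -- nil/false would skip this run
end

function after_fetch(response)
    return response         -- nil would skip extraction
end

function after_extract(items)
    return items            -- nil or {} means "no items"
end

-- Caution: merely defining filter_item disables the built-in keyword
-- filter, so leave this one out if you rely on keywords in config.toml.
function filter_item(item)
    return item             -- nil would drop this item
end

function before_store(items)
    return items            -- nil would skip store + notify
end

function before_notify(items)
    return items            -- nil would silence notifications
end
```

Delete the hooks you do not need; all of them are optional.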

File Setup

Each job with Lua hooks lives in its own directory under jobs/:

jobs/
└── my-scraper/
    ├── config.toml    # job configuration (required)
    └── hooks.lua       # Lua hooks (optional)

spyweb auto-discovers hooks.lua when loading the job. No config changes needed — just create the file.

Hook Reference

1 before_fetch(request)

Called before every HTTP request. Modify the URL, add headers, or skip entirely.

| Field | Type | Mutable | Description |
|---|---|---|---|
| request.url | string | yes | Target URL |
| request.headers | table | yes | HTTP headers (string→string) |

| Return | Effect |
|---|---|
| request table | Continue with the modified request |
| nil / false | Skip this entire run |

-- Basic: add auth header
function before_fetch(request)
    request.headers["Authorization"] = "Bearer my-token"
    return request
end
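
Returning nil from before_fetch skips the whole run. As a minimal sketch of that path, the hook below fetches only on every third scheduled run, relying on the fact that Lua globals persist across runs. The names run_count and FETCH_EVERY are illustrative and not part of spyweb.

```lua
-- Illustrative names (not part of spyweb): run_count, FETCH_EVERY.
run_count = 0              -- global, so it survives between runs
local FETCH_EVERY = 3

function before_fetch(request)
    run_count = run_count + 1
    -- Fetch on runs 1, 4, 7, ...; skip the others by returning nil.
    if (run_count - 1) % FETCH_EVERY ~= 0 then
        return nil
    end
    return request
end
```

A hot-reload resets run_count to 0, so the first run after a reload always fetches.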

2 after_fetch(response)

Called after HTTP fetch. Inspect status, modify body, or skip extraction.

| Field | Type | Mutable | Description |
|---|---|---|---|
| response.body | string | yes | HTML body |
| response.status | number | | HTTP status code |
| response.url | string | | Final URL (after redirects) |

function after_fetch(response)
    if response.status ~= 200 then
        print("Bad status: " .. response.status)
        return nil  -- skip extraction
    end
    -- Strip unwanted content from the body
    response.body = response.body:gsub("<script.-</script>", "")
    return response
end
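
Because response.url holds the final URL after redirects, it can also be used to detect that the site bounced the request somewhere unexpected. The sketch below assumes, purely for illustration, that expired sessions are redirected to a path containing "/login"; LOGIN_MARKER is a hypothetical name.

```lua
-- Assumption: the target site redirects expired sessions to a path
-- containing "/login". Adjust the marker for your site.
local LOGIN_MARKER = "/login"

function after_fetch(response)
    -- Plain-text search (4th argument true), so the marker is not
    -- interpreted as a Lua pattern.
    if response.url:find(LOGIN_MARKER, 1, true) then
        print("Redirected to login page: " .. response.url)
        return nil  -- skip extraction for this run
    end
    return response
end
```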

3 after_extract(items)

Batch hook — receives all extracted items at once. Good for cross-item logic, dedup, or sorting.

No short-circuit: returning nil or an empty table just means there are no items for this run; it does not abort the pipeline.

function after_extract(items)
    -- Remove duplicate titles
    local seen = {}
    local unique = {}
    for _, item in ipairs(items) do
        local title = item.fields.title or ""
        if not seen[title] and title ~= "" then
            seen[title] = true
            table.insert(unique, item)
        end
    end
    return unique
end
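
Sorting is the other batch use case mentioned above. A minimal sketch, assuming items arrives as an ordinary Lua array (as the ipairs loop above implies) and that a title field is configured:

```lua
function after_extract(items)
    -- Sort in place by lowercased title. Items without a title compare
    -- as the empty string and therefore sort first.
    table.sort(items, function(a, b)
        local ta = (a.fields.title or ""):lower()
        local tb = (b.fields.title or ""):lower()
        return ta < tb
    end)
    return items
end
```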

4 filter_item(item)

Mutually exclusive with keyword filter. If filter_item() exists in your script, the built-in keyword filter is skipped entirely — even if you have keywords in your config.

Called once per item. Return the item to keep it, or nil to drop it before it reaches the database.

| Field | Type | Mutable |
|---|---|---|
| item.fields | table (string→string) | yes |
| item.matches | array of strings | |

function filter_item(item)
    local title = (item.fields.title or ""):lower()

    -- Drop unwanted items
    if title == "" or title:find("sponsored") then
        return nil
    end

    -- Mutate fields before storage
    item.fields.title = title:sub(1,1):upper() .. title:sub(2)

    return item
end
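
Since item.fields maps strings to strings, numeric comparisons need an explicit conversion. The sketch below assumes a hypothetical price field (for example configured as "price:.price") and an illustrative MAX_PRICE threshold; neither is defined by spyweb.

```lua
-- Hypothetical field "price"; MAX_PRICE is an illustrative threshold.
local MAX_PRICE = 100

function filter_item(item)
    local raw = item.fields.price or ""
    -- Keep only digits and the decimal point: "$1,299.00" -> "1299.00".
    -- Assigning to a single local discards gsub's second return value.
    local cleaned = raw:gsub("[^%d%.]", "")
    local price = tonumber(cleaned)
    if not price or price > MAX_PRICE then
        return nil  -- unparseable or above the threshold: drop
    end
    return item
end
```

Note that the cleanup assumes a dot as the decimal separator; prices formatted like "1.299,00" would need different handling.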

5 before_store(items)

⚠️ Footgun warning: Items dropped by before_store returning nil never enter the database. This means they will appear as new again on the next run, because the DB has no record of them. This is intentional but can cause infinite notification loops if misused.

Last chance to drop items before dedup + insert. Return nil to skip storing and notifying.

function before_store(items)
    -- Only store between 09:00 and 17:59 local time
    local hour = os.date("*t").hour
    if hour < 9 or hour > 17 then
        return nil  -- don't store, don't notify
    end
    return items
end
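
The "reappears as new" behavior can also be used deliberately. The sketch below caps how many items are stored per run, on the assumption that before_store may return a shorter list and that the remaining items are picked up by later runs because they have no DB record. MAX_PER_RUN is an illustrative name, not a spyweb setting.

```lua
-- MAX_PER_RUN is an illustrative limit, not a spyweb setting.
local MAX_PER_RUN = 20

function before_store(items)
    if #items <= MAX_PER_RUN then
        return items
    end
    -- Keep the first MAX_PER_RUN items. The rest get no DB record and
    -- should show up as new again on the next run.
    local batch = {}
    for i = 1, MAX_PER_RUN do
        batch[i] = items[i]
    end
    return batch
end
```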

6 before_notify(items)

Reshape or silence notifications. Items are already stored at this point, so dropping them here only affects what is sent as a desktop notification or to the webhook.

function before_notify(items)
    -- Silence notifications if fewer than 3 new items
    if #items < 3 then return nil end

    -- Only notify for the first 5 items
    local capped = {}
    for i = 1, math.min(#items, 5) do
        capped[i] = items[i]
    end
    return capped
end

Global Functions

spyweb injects several global helper functions into the Lua environment.

http_get(url, [headers])

Performs a synchronous HTTP GET request. The headers argument is an optional Lua table containing key-value pairs for request headers.

local html = http_get("https://api.example.com", {
    ["Authorization"] = "Bearer token123",
    ["X-Custom-Header"] = "my-value"
})
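
One way to use http_get from a hook is to fetch a token once and cache it in a global, so the blocking call does not run on every scrape. This is a sketch under stated assumptions: that http_get returns the response body as a string (as the snippet above suggests), that it may raise on failure (hence pcall), and that the token endpoint exists; TOKEN_URL and cached_token are hypothetical names.

```lua
-- Assumptions: http_get returns the body as a string; the endpoint
-- below is hypothetical.
local TOKEN_URL = "https://auth.example.com/token"
cached_token = nil  -- global: persists across runs until hot-reload

function before_fetch(request)
    if not cached_token then
        -- pcall guards against http_get raising an error.
        local ok, body = pcall(http_get, TOKEN_URL)
        local token = ""
        if ok and type(body) == "string" then
            -- Trim surrounding whitespace from the token.
            token = body:match("^%s*(.-)%s*$")
        end
        if token == "" then
            print("token fetch failed; skipping this run")
            return nil
        end
        cached_token = token
    end
    request.headers["Authorization"] = "Bearer " .. cached_token
    return request
end
```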

http_post(url, body, [headers])

Performs a synchronous HTTP POST request. By default, it sets the Content-Type to application/x-www-form-urlencoded unless overridden in the headers argument.

local response = http_post("https://api.example.com", '{"foo":"bar"}', {
    ["Content-Type"] = "application/json",
    ["Accept"] = "application/json"
})
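
When relying on the default application/x-www-form-urlencoded Content-Type, the body has to be encoded by the script. The helper below is a small sketch for that; form_encode and url_encode are illustrative names, not spyweb globals, and the URL in the usage comment is hypothetical. Spaces are emitted as %20, which form decoders generally accept alongside "+".

```lua
-- Percent-encode everything except RFC 3986 unreserved characters.
local function url_encode(s)
    return (tostring(s):gsub("[^%w%-%._~]", function(c)
        return string.format("%%%02X", string.byte(c))
    end))
end

-- Build "k1=v1&k2=v2" from a table with string keys.
-- Keys are sorted so the output is stable between calls.
function form_encode(params)
    local keys = {}
    for k in pairs(params) do
        keys[#keys + 1] = k
    end
    table.sort(keys)
    local parts = {}
    for _, k in ipairs(keys) do
        parts[#parts + 1] = url_encode(k) .. "=" .. url_encode(params[k])
    end
    return table.concat(parts, "&")
end

-- Usage inside a hook (URL is hypothetical):
-- local reply = http_post("https://api.example.com/search",
--                         form_encode({ q = "rust jobs", page = 2 }))
```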

Config Reference

Job Config (config.toml)

| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Display name |
| url | string | required | Target URL |
| selector | string | required | CSS selector for containers |
| fields | array | required | Fields to extract |
| enabled | bool | true | Enable/disable |
| interval | u32 | 30 | Seconds between runs |
| keywords | string[] | none | Filter by keywords |
| search_fields | string[] | none | Limit keyword search to fields |
| debug | bool | false | Save raw HTML + JSON |
| headers | table | none | Custom HTTP headers |

Field Syntax

# Shorthand: "name:selector" or "name:selector@attribute"
fields = ["title:h2", "link:a@href", "desc:.summary"]

# Full form
fields = [
  { name = "title", selector = "h2", att = "text" },
  { name = "link", selector = "a", att = "href" },
]
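
On the Lua side, the field names configured here show up as keys of item.fields, as the hook examples in this guide use them (item.fields.title and so on). A field such as "link:a@href" often yields site-relative paths, so the sketch below prefixes them with a base URL. BASE_URL is an assumption to be replaced with the scheme and host of your job's url.

```lua
-- BASE_URL is an assumption: set it to the scheme + host of your job's url.
local BASE_URL = "https://example.com"

function after_extract(items)
    for _, item in ipairs(items) do
        local link = item.fields.link
        -- Only root-relative paths ("/jobs/42") are rewritten.
        -- Protocol-relative ("//cdn...") and absolute URLs are left alone.
        if link and link:sub(1, 1) == "/" and link:sub(2, 2) ~= "/" then
            item.fields.link = BASE_URL .. link
        end
    end
    return items
end
```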

Examples

Cycle through pages using a Lua global counter:

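-- hooks.lua: globals persist across runs
page = 1

function before_fetch(request)
    -- Plain-text search (4th argument true): "?" is a pattern
    -- quantifier in Lua, so avoid pattern matching here.
    if string.find(request.url, "?", 1, true) then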
        request.url = request.url .. "&page=" .. page
    else
        request.url = request.url .. "?page=" .. page
    end
    page = page + 1
    if page > 5 then page = 1 end
    return request
end

function filter_item(item)
    if (item.fields.title or "") == "" then
        return nil
    end
    return item
end

Replace the keyword filter with custom Lua logic:

-- hooks.lua: replaces built-in keyword filter
blocked = { ["spam co"] = true }

function filter_item(item)
    local title = (item.fields.title or ""):lower()
    local company = (item.fields.company or ""):lower()

    if blocked[company] then return nil end
    if title:find("intern") then return nil end

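    -- Must mention a wanted tech. This is a plain substring match,
    -- so "go" also matches titles containing "google".
    local has_wanted_tech = false
    for _, tech in ipairs({"rust","go","python"}) do
        if title:find(tech, 1, true) then
            has_wanted_tech = true
            break
        end
    end
    if not has_wanted_tech then return nil end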

    return item
end

Add auth headers dynamically:

-- hooks.lua: dynamic headers
function before_fetch(request)
    request.headers["Authorization"] = "Bearer my-api-key"
    request.headers["X-Custom"] = "spyweb"
    return request
end

function after_fetch(response)
    if response.status == 401 then
        print("Auth failed!")
        return nil
    end
    return response
end

Tips & Gotchas

| Topic | Detail |
|---|---|
| Globals persist | Lua globals survive across runs. Great for counters, caches, pagination state. Hot-reload resets them (fresh VM). |
| filter_item vs keywords | Mutually exclusive. If filter_item() exists, the built-in keyword filter does not run at all. |
| Error handling | Hook errors are logged to stderr and skipped. The pipeline continues with the original data. A broken script never kills a job. |
| before_store footgun | Items dropped by before_store have no DB record and will reappear as new next run. Use filter_item for permanent drops. |
| Threading | Hooks run on the async executor thread. Blocking HTTP and DB operations run in background threads. Lua code should be fast and non-blocking. |
| Luau dialect | spyweb uses Luau (Roblox's Lua dialect), which is derived from Lua 5.1, so standard Lua 5.1 syntax works. Type annotations are optional but supported. |
| Hot reload | Edit config.toml or hooks.lua → spyweb auto-detects changes and reloads. All Lua state resets. No restart needed. |
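
Because a failing hook is skipped and the pipeline continues with the original data, an error inside filter_item presumably lets the item through unfiltered. If you would rather fail closed, one option is to catch errors yourself with pcall. A minimal sketch, where wants is an illustrative helper name:

```lua
-- Illustrative helper holding the actual filtering logic; it may raise.
local function wants(item)
    local title = item.fields.title:lower()  -- errors if title is nil
    return title:find("rust", 1, true) ~= nil
end

function filter_item(item)
    local ok, keep = pcall(wants, item)
    if not ok then
        -- On failure, pcall's second result is the error message.
        print("filter_item failed: " .. tostring(keep))
        return nil  -- fail closed: drop the item
    end
    if keep then
        return item
    end
    return nil
end
```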