The SpyWeb Master Guide

High-performance web monitoring with zero dependencies

7 Hook Stages · Hot Reload · Error Resilient

Overview

SpyWeb is a zero-dependency web monitoring engine that lets you customize every stage of the pipeline with Lua scripts (Luau dialect). Place a hooks.lua alongside your job's config.toml and define any combination of 7 hook functions.

All hooks are optional. Omit any you don't need — spyweb runs fine with just a config.toml.

Key Rules

  • Return the value (modified or not) to continue the pipeline
  • Return nil or false to drop/skip (behavior varies per hook)
  • Plain Lua globals persist across runs only while the job task stays alive
  • store_* is job-scoped persistent storage and global_store_* is shared across jobs
  • Hot-reload creates a fresh Lua VM and resets global Lua state, but stored values survive
  • Hook errors are logged and skipped — they never crash the job
  • Efficiency: Only define the hooks you actually need. SpyWeb pre-detects which functions exist at startup and skips the processing logic entirely for any that are missing.
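For example, a minimal hooks.lua that follows these conventions might look like this sketch (the title check is just an illustration):

```lua
-- hooks.lua — a minimal sketch of the return-value conventions
function before_fetch(request)
    -- Returning the (possibly modified) value continues the pipeline
    return request
end

function filter_item(item)
    -- Returning nil drops the item before it reaches the database
    if (item.fields.title or "") == "" then
        return nil
    end
    return item
end
```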

Pipeline Stages

Every scrape run flows through these stages in order:

1. before_fetch(request) · executor thread
   Modify URL, headers, or return nil to skip this run entirely

   [HTTP fetch] · background thread
   Automatic — uses request config from previous stage

2. after_fetch(response) · executor thread
   Modify response body, or return nil to skip extraction

   [CSS extraction] · background thread
   Automatic — raw parser with DOM fallback

3. after_extract(items) · executor thread
   Batch filter/modify all items. nil or empty = no items

4. filter_item(item) · executor thread
   Per-item filter. Replaces built-in keyword filter if defined

5. before_store(items) · executor thread
   Last chance before DB insert. nil = skip store + notify

   [dedup + insert] · background thread
   Automatic — atomic check-and-insert

6. before_notify(items) · executor thread
   Reshape or silence notifications. Items already stored

7. before_webhook(payload) · executor thread
   Reshape or silence webhook POSTs. Full JSON payload

   [notify + webhook] · background thread
   Automatic — desktop notification + webhook POST
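One way to see this order in practice is a hooks.lua that only logs each stage and passes everything through unchanged (log() is the injected helper described under Global Functions):

```lua
-- hooks.lua — trace every stage to hook.log without changing behavior
function before_fetch(request)   log("1 before_fetch: " .. request.url)           return request  end
function after_fetch(response)   log("2 after_fetch: status " .. response.status) return response end
function after_extract(items)    log("3 after_extract: " .. #items .. " items")   return items    end
function filter_item(item)       log("4 filter_item")                             return item     end
function before_store(items)     log("5 before_store: " .. #items .. " items")    return items    end
function before_notify(items)    log("6 before_notify: " .. #items .. " items")   return items    end
function before_webhook(payload) log("7 before_webhook")                          return payload  end
```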

File Setup

Each job with Lua hooks lives in its own directory under jobs/:

jobs/
└── my-scraper/
    ├── config.toml    # job configuration (required)
    └── hooks.lua       # Lua hooks (optional)

spyweb auto-discovers hooks.lua when loading the job. No config changes needed — just create the file.

Hook Reference

1 before_fetch(request)

Called before every HTTP request. Modify the URL, add headers, or skip entirely.

Field            Type    Description
request.url      string  Target URL
request.headers  table   HTTP headers (string→string)

Return         Effect
request table  Continue with modified request
nil / false    Skip this entire run

-- Basic: add auth header
function before_fetch(request)
    request.headers["Authorization"] = "Bearer my-token"
    return request
end

2 after_fetch(response)

Called after HTTP fetch. Inspect status, modify body, or skip extraction.

Field            Type    Description
response.body    string  HTML body
response.status  number  HTTP status code
response.url     string  Final URL (after redirects)

function after_fetch(response)
    if response.status ~= 200 then
        print("Bad status: " .. response.status)
        return nil  -- skip extraction
    end
    -- Strip unwanted content from the body
    response.body = response.body:gsub("<script.-</script>", "")
    return response
end

3 after_extract(items)

Batch hook — receives all extracted items at once. Good for cross-item logic, dedup, or sorting.

No short-circuit: returning nil or empty table just means no items, not a pipeline abort.

function after_extract(items)
    -- Remove duplicate titles
    local seen = {}
    local unique = {}
    for _, item in ipairs(items) do
        local title = item.fields.title or ""
        if not seen[title] and title ~= "" then
            seen[title] = true
            table.insert(unique, item)
        end
    end
    return unique
end

4 filter_item(item)

Mutually exclusive with keyword filter. If filter_item() exists in your script, the built-in keyword filter is skipped entirely — even if you have keywords in your config.

Called once per item. Return the item to keep it, or nil to drop it before it reaches the database.

Field         Type
item.fields   table (string→string)
item.matches  array of strings

function filter_item(item)
    local title = (item.fields.title or ""):lower()

    -- Drop unwanted items
    if title == "" or title:find("sponsored") then
        return nil
    end

    -- Mutate fields before storage
    item.fields.title = title:sub(1,1):upper() .. title:sub(2)

    return item
end

5 before_store(items)

⚠️ Footgun warning: Items dropped by before_store returning nil never enter the database. This means they will appear as new again on the next run, because the DB has no record of them. This is intentional but can cause infinite notification loops if misused.

Last chance to drop items before dedup + insert. Return nil to skip storing and notifying.

function before_store(items)
    -- Only store during business hours
    local hour = os.date("*t").hour
    if hour < 9 or hour > 17 then
        return nil  -- don't store, don't notify
    end
    return items
end

6 before_notify(items)

Reshape or silence notifications. Items are already stored at this point — dropping them here only affects what gets notified/webhooked.

function before_notify(items)
    -- Silence notifications if fewer than 3 new items
    if #items < 3 then return nil end

    -- Only notify for the first 5 items
    local capped = {}
    for i = 1, math.min(#items, 5) do
        capped[i] = items[i]
    end
    return capped
end

7 before_webhook(payload)

Reshape or silence webhook POSTs. Receives the full JSON payload (job_name, item_count, items).

function before_webhook(payload)
    -- Add a secret token to the payload
    payload.secret = "my-secure-token"
    
    -- Filter items in the webhook but keep them in the notification
    local filtered = {}
    for _, item in ipairs(payload.items) do
        local price = tonumber(item.fields.price)
        if price and price < 100 then -- skip items with missing/non-numeric price
            table.insert(filtered, item)
        end
    end
    payload.items = filtered
    payload.item_count = #filtered
    
    if #filtered == 0 then return nil end -- skip webhook
    return payload
end

Global Functions

spyweb injects several global helper functions into the Lua environment for networking, logging, and storage.

Storage & Persistence

SpyWeb supports state management through both Runtime Memory (transient) and an Embedded Database (persistent). Memory is fast but resets on reload; the database survives restarts.

💡 Race Condition Note: Both Runtime Memory and Job-Local Storage are safe by default. Hooks for a single job are execution-locked (sequential), so race conditions are impossible within a single job. Use global_store_incr strictly when you need to mutate state shared across multiple concurrent jobs.

Function               Description
store_set(key, value)  Save a string (prefixed with job name)
store_get(key)         Retrieve a string or nil
store_delete(key)      Remove a key

Example: Job-Scoped Counter

local c = tonumber(store_get("count") or "0")
store_set("count", tostring(c + 1))

Function                          Description
global_store_set(key, value)      Save a string (shared across all jobs)
global_store_get(key)             Retrieve a shared string or nil
global_store_delete(key)          Remove a shared key
global_store_incr(k, def, delta)  Atomic shared increment across all jobs

Example: Atomic Shared Logic

-- WRONG APPROACH (race condition prone)
local v = tonumber(global_store_get("visits") or "0")
global_store_set("visits", tostring(v + 1))

-- CORRECT APPROACH (race condition safe)
global_store_incr("visits", 0, 1)
Standard Lua variables persist in memory as long as the job is active. Because each job runs in its own isolated VM, these variables are naturally job-scoped and cannot be accessed by other jobs.

⚠️ Plain globals reset on hot-reload or restart. Use them for transient state only.

Example: Simple Memory Counter

-- This resets to 1 if you edit hooks.lua or restart SpyWeb
visit_count = (visit_count or 0) + 1
log("Session visit: " .. visit_count)

http_get(url, [headers])

Performs a synchronous HTTP GET request. The headers argument is an optional Lua table containing key-value pairs for request headers.

local html = http_get("https://api.example.com", {
    ["Authorization"] = "Bearer token123",
    ["X-Custom-Header"] = "my-value"
})

http_post(url, body, [headers])

Performs a synchronous HTTP POST request. By default, it sets the Content-Type to application/x-www-form-urlencoded unless overridden in the headers argument.

local response = http_post("https://api.example.com", '{"foo":"bar"}', {
    ["Content-Type"] = "application/json",
    ["Accept"] = "application/json"
})

log(message)

Appends a message to hook.log inside the job's directory. This is the recommended way to debug hooks without using external libraries.

function after_extract(items)
    log("Extracted " .. #items .. " items")
    return items
end

Config Reference

Jobs in SpyWeb are configured via TOML. You can use a single jobs.toml or a modular jobs/my-job/config.toml structure.

Job Config

Field        Type      Default   Description
name         string    required  Unique job name
url          string    required  Target URL
selector     string    required  CSS selector for items
fields       array     required  Fields to extract
interval     u32       30        Run interval (seconds)
keywords     string[]  none      Filter by keywords
hash_fields  string[]  all       Fields for dedup hash
headers      table     none      Custom HTTP headers
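Putting these fields together, a complete config.toml might look like this sketch (all values are illustrative):

```toml
name = "my-scraper"
url = "https://example.com/listings"
selector = "div.listing"
fields = ["title:h2", "link:a@href"]
interval = 60                               # seconds
keywords = ["rust", "remote"]               # ignored if filter_item() is defined
hash_fields = ["link"]                      # dedup on the link only
headers = { "User-Agent" = "Mozilla/5.0" }
```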

Validation Rules

  • interval must be > 0
  • All URLs must be absolute (including webhooks/proxies)
  • Job names and normalized IDs must be unique
  • hash_fields must exist in the extracted fields

Field Syntax

# Shorthand
fields = ["title:h2", "link:a@href"]

# Full form
fields = [
  { name = "title", selector = "h2", att = "text" }
]

Deduplication (hash_fields)

By default, SpyWeb hashes all extracted fields. Use hash_fields to ignore volatile data like "time ago" or "view count":

# Only use the link to determine if an item is new
hash_fields = ["link"]

Advanced Features

[proxy]
enabled = true
rotate = "RoundRobin" # or "Sticky", "Random"
urls = ["socks5://p1:1080", "http://p2:8080"]
[webhook]
enabled = true
url = "https://hooks.example.com/spyweb"
headers = { "Authorization" = "Bearer token" }
[notification]
enabled = true
title = "{item_count} new items in {job_name}"
body = "Title: {title}\nLink: {link}"

Available tags: {job_name}, {item_count}, {matches}, and any field name.

REST API

SpyWeb runs a built-in web server on 127.0.0.1:7979 which serves both the dashboard UI and a JSON API for querying your scraped records.

Endpoint                                          Description
GET /                                             HTML records viewer (the UI dashboard)
GET /api/records?job_id=<id>                      Fetch JSON records for a specific job
GET /api/records?job_id=<id>&limit=50&after=<ts>  Paginated record access
GET /api/jobs                                     List all active job IDs and their status

💡 Bring Your Own UI: The default dashboard is served from ui/index.html. Because SpyWeb serves this dynamically, you can replace it with your own React/Vue build instantly — no restart required.
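Assuming a job whose normalized ID is my-scraper (a hypothetical name), the endpoints can be queried with curl while SpyWeb is running:

```shell
# List all active job IDs and their status
curl "http://127.0.0.1:7979/api/jobs"

# Fetch up to 50 records for one job
curl "http://127.0.0.1:7979/api/records?job_id=my-scraper&limit=50"
```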

Examples

The "Sentinel" Pattern: Transform SpyWeb into an autonomous monitoring agent. Features a Circuit Breaker with Self-Healing, failure masking, and Price Drop Detection using persistent storage to track state across scrape runs.

-- ==========================================================================
-- SPYWEB SENTINEL: The "Set & Forget" VPS Monitor
-- ==========================================================================
-- This script transforms SpyWeb into an autonomous monitoring agent ideal 
-- for long-term VPS deployments. 

local NTFY_TOPIC = "my-spyweb-alerts" -- CHANGE THIS to your unique topic
local FAIL_THRESHOLD = 30             -- Alert after 30 consecutive failures
local STOP_THRESHOLD = 50             -- Stop fetching after 50 failures

function send_push(title, message, priority)
    log("Pushing alert: " .. title)
    http_post("https://ntfy.sh/" .. NTFY_TOPIC, message, {
        ["Title"] = title,
        ["Priority"] = tostring(priority or 3),
        ["Tags"] = "spyweb,sentinel"
    })
end

local ONE_DAY = 24 * 60 * 60 -- 24 hours in seconds

-- Circuit Breaker & Self-Healing: Stop fetching on fatal errors, 
-- but try to revive the job automatically after 24 hours.
function before_fetch(request)
    local fail_count = tonumber(store_get("fail_count")) or 0
    local stop_time = tonumber(store_get("stop_time")) or 0

    if fail_count >= STOP_THRESHOLD then
        local now = os.time()
        local elapsed = now - stop_time
        
        if elapsed < ONE_DAY then
            return nil -- Still in cooldown
        else
            log("Cooldown finished. Attempting auto-recovery...")
            store_set("stop_time", tostring(now)) -- re-arm the cooldown so a still-broken site is retried once per day
        end
    end
    return request
end

-- Track failures and only alert when a real problem persists
function after_fetch(response)
    local fail_count = tonumber(store_get("fail_count")) or 0

    if response.status ~= 200 then
        fail_count = fail_count + 1
        store_set("fail_count", tostring(fail_count))
        
        if fail_count == FAIL_THRESHOLD then
            send_push("🚨 Job Failing", "Monitoring failed " .. fail_count .. " times.", 4)
        elseif fail_count == STOP_THRESHOLD then
            store_set("stop_time", tostring(os.time()))
            send_push("💀 FATAL: Backing Off", "Entering 24-hour cooldown.", 5)
        end
    else
        if fail_count >= FAIL_THRESHOLD then
            send_push("✅ Job Recovered", "Monitoring is back online.", 3)
        end
        store_delete("fail_count")
        store_delete("stop_time")
    end
    return response
end

-- Price Drop Alerts: push notifications when an item goes on sale
-- or drops below its previously recorded lowest price.
function before_notify(items)
    for _, item in ipairs(items) do
        local price = tonumber(item.fields.price or "0")
        local title = (item.fields.title or ""):upper()
        
        -- Check persistent store for the last seen minimum price
        local storage_key = "min_price_" .. (item.fields.id or item.fields.title or "unknown")
        local last_min = tonumber(store_get(storage_key) or "999999")

        if price > 0 and price < last_min then
            send_push("📉 PRICE DROP: " .. item.fields.title, 
                      "Now $" .. price .. "! (Was $" .. last_min .. ")", 4)
            store_set(storage_key, tostring(price))
        elseif title:find("SALE") or title:find("OFF") then
            send_push("🎁 ON SALE: " .. item.fields.title, "Check it out!", 3)
        end
    end
    return items
end

-- --------------------------------------------------------------------------
-- HEARTBEAT (Optional)
-- --------------------------------------------------------------------------
-- Enable this to get a once-a-day "Hi" so you know the VPS is still running.
-- Note: only one before_notify can be defined per script, so merge this logic
-- into the before_notify above if you enable it. store_get/store_set are
-- better than plain globals here if you want heartbeat state to survive
-- restarts too.
-- last_heartbeat = 0
-- function before_notify(items)
--     local now = os.time()
--     if now - last_heartbeat >= (24 * 60 * 60) then
--         send_push("💓 Heartbeat", "Hey! Just letting you know I'm alive and kicking.", 2)
--         last_heartbeat = now
--     end
--     return items
-- end

Cycle through pages using a Lua global counter:

-- hooks.lua — globals persist while the job process stays alive
function before_fetch(request)
    page = page or 1
    if string.find(request.url, "?", 1, true) then -- plain find, no pattern matching
        request.url = request.url .. "&page=" .. page
    else
        request.url = request.url .. "?page=" .. page
    end
    page = page + 1
    if page > 5 then page = 1 end
    -- Use store_set("page", ...) if you want this to survive restart.
    return request
end

function filter_item(item)
    if (item.fields.title or "") == "" then
        return nil
    end
    return item
end

Replace the keyword filter with custom Lua logic:

-- hooks.lua — replaces built-in keyword filter
blocked = { ["spam co"] = true }

function filter_item(item)
    local title = (item.fields.title or ""):lower()
    local company = (item.fields.company or ""):lower()

    if blocked[company] then return nil end
    if title:find("intern") then return nil end

    -- Must mention a wanted tech
    local has_wanted_tech = false
    for _, tech in ipairs({"rust", "go", "python"}) do
        if title:find(tech) then
            has_wanted_tech = true
            break
        end
    end
    if not has_wanted_tech then return nil end

    return item
end

Add auth headers dynamically:

-- hooks.lua — dynamic headers
function before_fetch(request)
    request.headers["Authorization"] = "Bearer my-api-key"
    request.headers["X-Custom"] = "spyweb"
    return request
end

function after_fetch(response)
    if response.status == 401 then
        print("Auth failed!")
        return nil
    end
    return response
end

Tips & Gotchas

  • Persistent storage: store_* is job-scoped. global_store_* is shared. Atomic logic is only needed for global state, via global_store_incr.
  • Globals persist: plain Lua globals survive across runs only while the current job task stays alive. Hot-reload resets them with a fresh VM.
  • filter_item vs keywords: mutually exclusive. If filter_item() exists, the built-in keyword filter does not run at all.
  • Error handling: hook errors are logged to stderr and skipped. The pipeline continues with the original data. A broken script never kills a job.
  • before_store footgun: items dropped by before_store have no DB record and will reappear as new next run. Use filter_item for permanent drops.
  • Threading: hooks run on the async executor thread. Blocking HTTP and DB operations run in background threads. Lua code should be fast and non-blocking.
  • Luau dialect: SpyWeb uses Luau by default for stability. For standard Lua 5.4 support (enabling C modules), see the Building from Source guide.
  • Standard libraries: os.time, os.date, math.*, and require are available for code sharing. Modules are searched in the job's directory first, then in the global shared/ folder. The io library is disabled; use log() instead.
  • Logging: calls to log(msg) append to hook.log in the job's directory. This file is persistent and useful for long-term debugging.
  • Hot reload: edit config.toml or hooks.lua and spyweb auto-detects the change and reloads. All Lua state resets. No restart needed.
  • Deduplication: by default, SpyWeb hashes all fields to determine uniqueness. Use hash_fields only if you want to lock an item's identity to a specific field (like a URL) and ignore changes in volatile data (like "time ago").
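The require search order makes simple code sharing possible. A sketch, assuming a hypothetical shared module at shared/utils.lua:

```lua
-- shared/utils.lua (hypothetical)
local M = {}
function M.trim(s)
    -- strip leading and trailing whitespace
    return (s:gsub("^%s+", ""):gsub("%s+$", ""))
end
return M
```

```lua
-- jobs/my-scraper/hooks.lua
local utils = require("utils") -- job directory is searched first, then shared/

function filter_item(item)
    item.fields.title = utils.trim(item.fields.title or "")
    if item.fields.title == "" then return nil end
    return item
end
```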