Think of monitoring like having a health checkup for your application. You want to know if it's running smoothly, where it might be struggling, and catch problems before your users do. Our monitoring setup combines several key components that work together to give us a complete picture of system health.
Synthetic monitoring: Your digital test pilot
Synthetic monitoring is like having a robot that pretends to be your user. It visits your site, clicks around, and reports back on how things are working. This proactive approach helps us catch issues before real users encounter them.
We use automated tests to simulate real user behavior. Think of it as quality assurance that never sleeps. These tests run from different locations around the world, checking if your blog loads quickly, if API calls to Arweave or Base are responding, and if everything renders correctly.
What it looks like in practice:
- A script loads your homepage every 5 minutes from New York, London, and Tokyo
- Another test checks if a blog post with on-chain data loads within 2 seconds
- If something breaks, you know immediately (not when users start complaining)
Tools like Playwright, Puppeteer, or Selenium power these synthetic tests. They're particularly valuable for static sites on GitHub Pages, where there's no server-side telemetry to tell you when something goes wrong.
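As a rough sketch, a minimal Playwright check might look like this (the URL and the 2-second budget are illustrative, not our real configuration):

```typescript
import { chromium } from 'playwright';

// Load a page and fail if it is slow or broken.
async function checkPage(url: string, budgetMs: number): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    const start = Date.now();
    const response = await page.goto(url, { waitUntil: 'load' });
    const elapsed = Date.now() - start;

    if (!response || response.status() >= 400) {
      throw new Error(`Bad status: ${response?.status()}`);
    }
    if (elapsed > budgetMs) {
      throw new Error(`Load took ${elapsed}ms, budget is ${budgetMs}ms`);
    }
    console.log(`OK: ${url} loaded in ${elapsed}ms`);
  } finally {
    await browser.close();
  }
}

checkPage('https://example.com', 2000).catch((err) => {
  console.error(err);
  process.exit(1); // a non-zero exit is all a scheduler needs to raise an alert
});
```

Run on a schedule (cron, or a GitHub Actions scheduled workflow) from a few regions, this is the whole "robot user" in about thirty lines.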
Instrumentation: Teaching your code to speak
Instrumentation is the process of adding telemetry to your application. Think of it as teaching your code to talk about what it's doing. Instead of wondering why something is slow, you get concrete data about performance, errors, and usage patterns.
You embed monitoring code directly into your application. When someone requests a page, your instrumented code records how long it took to process, whether it succeeded, and any relevant details along the way.
Here's how we instrument different parts of our stack:
- Node.js apps: Use prom-client to expose metrics like request latency and error rates through a /metrics endpoint
- Elixir services: The telemetry library emits events (like DuckDB query duration) that can be aggregated and exposed for Prometheus to scrape
- On-chain interactions: Wrap API calls to track request duration, success rates, and gas usage
- Infrastructure: Node Exporter collects system metrics like CPU and memory usage
For example, when your Markdown-to-HTML converter processes a file, instrumentation tracks how many files were processed and how long each conversion took. This data flows into Prometheus as metrics you can analyze and alert on.
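As a hedged sketch of that Node.js pattern, here is how prom-client could track the converter (the metric names and the convertFile stub are illustrative, not our production code):

```typescript
import express from 'express';
import client from 'prom-client';

// Counter: how many files we have processed (only ever goes up).
const filesProcessed = new client.Counter({
  name: 'markdown_files_processed_total',
  help: 'Total Markdown files converted to HTML',
});

// Histogram: how long each conversion took, bucketed in seconds.
const conversionDuration = new client.Histogram({
  name: 'markdown_conversion_duration_seconds',
  help: 'Time spent converting a single file',
  buckets: [0.01, 0.05, 0.1, 0.5, 1],
});

async function convertFile(path: string): Promise<void> {
  const endTimer = conversionDuration.startTimer(); // start the clock
  // ... the real Markdown-to-HTML work on `path` would happen here ...
  endTimer();           // records the elapsed seconds into the histogram
  filesProcessed.inc();
}

// Expose everything on /metrics for Prometheus to scrape.
const app = express();
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});
app.listen(9100);

convertFile('posts/hello-world.md').catch(console.error); // one conversion so the metrics have data
```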
Metrics: The numbers that matter
Metrics are numerical measurements that capture the state and performance of your system at specific points in time. Think of them as vital signs for your application. Just like a doctor checks your heart rate and blood pressure, metrics tell you if your system is healthy.
Metrics come in different types:
- Counters track things that only go up (like total requests served)
- Gauges measure current values that can go up or down (like active user sessions)
- Histograms show distributions of values (like response time percentiles)
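In prom-client terms, declaring the three types looks like this (the names are illustrative):

```typescript
import client from 'prom-client';

// Counter: only goes up.
const requestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests served',
});
requestsTotal.inc(); // one more request

// Gauge: a current value that moves in both directions.
const activeSessions = new client.Gauge({
  name: 'active_user_sessions',
  help: 'Sessions currently open',
});
activeSessions.inc(); // session opened
activeSessions.dec(); // session closed

// Histogram: a distribution of observations.
const responseTime = new client.Histogram({
  name: 'http_response_time_seconds',
  help: 'Response time distribution',
  buckets: [0.1, 0.3, 1, 3],
});
responseTime.observe(0.42); // record one response that took 420ms
```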
Common metrics we track:
- Response time for web requests
- Error rates and success rates
- Database query performance
- Memory and CPU usage
- Active user connections
Prometheus excels at storing and querying time-series metrics data. It's optimized for fast aggregations and can handle millions of data points efficiently.
Logs: The system's diary
Logs are textual records of events that happen in your system. They're like a detailed diary of what your application did, when it did it, and what went wrong. While metrics tell you "what" happened, logs provide the context and details about "why."
Good logs include:
- Timestamps for when events occurred
- Log levels (INFO, WARN, ERROR) to indicate severity
- Structured data that's easy to parse and search
- Relevant context like user IDs or request IDs
What we log:
- User actions and API requests
- Database queries and their execution times
- Error messages with stack traces
- Business logic events (like successful payments)
- Security-related events (login attempts, permission changes)
Tools like Loki, Elasticsearch, or simple file-based logging can store and index your logs for searching and analysis.
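As a sketch, here is what those structured entries might look like with pino, one common Node.js logger (the IDs and field names are illustrative):

```typescript
import pino from 'pino';

const logger = pino(); // emits one JSON object per line, with timestamp and level

// A child logger stamps shared context onto every entry it writes.
const reqLogger = logger.child({ requestId: 'req-8f3a', userId: 'user-42' });

reqLogger.info({ route: '/posts/hello-world', durationMs: 182 }, 'request completed');
// => {"level":30,"time":...,"requestId":"req-8f3a","userId":"user-42",
//     "route":"/posts/hello-world","durationMs":182,"msg":"request completed"}

reqLogger.error({ err: new Error('Arweave fetch timed out') }, 'metadata fetch failed');
```

Because every entry is JSON with consistent keys, a store like Loki or Elasticsearch can filter on any of those fields directly.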
Traces: Following the request journey
Distributed tracing tracks a single request as it flows through multiple services in your system. Imagine following a package through the postal system, from sender to recipient, tracking every stop along the way.
A trace consists of multiple "spans," where each span represents work done by a particular service. When you load a blog post, the trace might include spans for:
- Receiving the HTTP request
- Querying the database for post content
- Fetching metadata from Arweave
- Rendering the final HTML
- Sending the response
Why tracing matters:
- Identifies bottlenecks in complex workflows
- Shows how services depend on each other
- Helps debug issues that span multiple components
- Provides timing information for each step in a request
OpenTelemetry (for instrumenting your code) and Jaeger (for collecting and visualizing traces) are popular tools for implementing distributed tracing. They add minimal overhead while providing valuable insights into system behavior.
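As a minimal sketch, creating spans with the OpenTelemetry JavaScript API looks like this, assuming the SDK is already configured to export to a backend such as Jaeger (fetchPost and renderHtml are hypothetical stand-ins):

```typescript
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('blog-service');

// Hypothetical stand-ins for real data access and rendering.
async function fetchPost(slug: string): Promise<string> { return `# ${slug}`; }
function renderHtml(markdown: string): string { return `<article>${markdown}</article>`; }

// Each span records one unit of work; nesting them builds the request's tree.
async function handlePostRequest(slug: string): Promise<string> {
  return tracer.startActiveSpan('handle-post-request', async (span) => {
    try {
      const post = await tracer.startActiveSpan('db-query', async (dbSpan) => {
        const result = await fetchPost(slug);
        dbSpan.end();
        return result;
      });
      return renderHtml(post);
    } finally {
      span.end(); // spans must be ended explicitly, or they never export
    }
  });
}

handlePostRequest('hello-world').then(console.log);
```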
Dashboard: Your monitoring command center
A dashboard is your visual window into system health. It takes all the raw data from metrics, logs, and traces and presents it in charts, graphs, and tables that humans can quickly understand. Think of it as the instrument panel in your car, showing speed, fuel level, and engine temperature at a glance.
Good dashboards answer key questions instantly:
- Is my system currently healthy?
- Are there any ongoing issues?
- How is performance trending over time?
- Where should I focus my attention?
Dashboard best practices:
- Show the most critical metrics prominently
- Use colors meaningfully (red for problems, green for healthy)
- Include both current status and historical trends
- Make it easy to drill down from high-level overviews to detailed data
Grafana is the most popular dashboarding tool in the monitoring space. It connects to multiple data sources and offers flexible visualization options.
Correlation layer: Connecting the dots
The correlation layer is what makes monitoring data truly powerful. It connects related information across metrics, logs, and traces so you can understand the complete picture when investigating issues.
Correlation typically works through shared identifiers:
- Request IDs that appear in logs, metrics, and traces for the same user request
- User IDs that help track behavior across different system components
- Session IDs that group related activities together
- Transaction IDs for business processes that span multiple services
Why correlation matters: When you see a spike in response times (metrics), correlation helps you find the related error messages (logs) and identify which specific service is slow (traces). Without correlation, you're solving puzzles with pieces from different boxes.
Modern observability platforms automatically correlate data when you include these shared identifiers in your telemetry.
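As a sketch of the pattern in an Express app, a middleware can mint one request ID and stamp it onto both the logger and the active span (the shape is illustrative):

```typescript
import express from 'express';
import pino from 'pino';
import { randomUUID } from 'node:crypto';
import { trace } from '@opentelemetry/api';

const app = express();
const logger = pino();

// One shared requestId per request, written into both logs and the trace,
// is enough for a backend to join the two when you investigate later.
app.use((req, res, next) => {
  const requestId = randomUUID();
  res.locals.log = logger.child({ requestId });
  trace.getActiveSpan()?.setAttribute('request.id', requestId);
  next();
});

app.get('/posts/:slug', (req, res) => {
  res.locals.log.info({ slug: req.params.slug }, 'serving post');
  res.send('ok');
});

app.listen(3000);
```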
Alert: Your early warning system
Alerts are automated notifications that wake you up when something goes wrong. They're like smoke detectors for your application, designed to catch problems before they become disasters.
Effective alerts have three key characteristics:
- Actionable: They tell you about problems you can actually fix
- Timely: They fire quickly enough to prevent or minimize impact
- Reliable: They don't cry wolf with false positives
Common alert scenarios:
- Website down or returning errors
- Response times exceeding acceptable thresholds
- Error rates climbing above normal levels
- Infrastructure resources (CPU, memory, disk) reaching capacity
- Business metrics falling outside expected ranges
Alert fatigue is real. Too many alerts train people to ignore them. Focus on alerting for conditions that require immediate human intervention, not every minor blip in your metrics.
Prometheus Alertmanager, PagerDuty, and similar tools can route alerts to the right people through email, Slack, SMS, or phone calls based on severity and time of day.
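As an illustration, a Prometheus alerting rule for the rising-error-rate scenario might look like this (the metric name and the 5% threshold are assumptions, not our real values):

```yaml
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        # Page someone if more than 5% of requests fail, sustained for 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are returning 5xx errors"
```

The `for: 5m` clause is what keeps this from crying wolf: a brief blip recovers before anyone gets paged.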
How these components work together
Each monitoring component serves a specific purpose, but their real power emerges when they work together:
- Synthetic monitoring provides the user perspective, ensuring your application works as customers expect.
- Instrumentation gives you internal visibility into how your code and infrastructure perform.
- Metrics offer quantitative measurements that are easy to alert on and trend over time.
- Logs provide detailed context when something goes wrong and you need to debug.
- Traces show you the complete journey of requests through your distributed system.
- Dashboards make all this data visible and actionable for your team.
- Correlation connects related data so you can debug efficiently.
- Alerts ensure you know about problems quickly enough to respond effectively.
Building a comprehensive monitoring system means implementing these components thoughtfully and ensuring they complement each other. Start with the basics (metrics and logs), then add more sophisticated components (tracing and correlation) as your system grows in complexity.
Next: Approach to build