A comprehensive guide to Datadog's agent, metrics, logs, APM, infrastructure monitoring, alerting, dashboards, synthetics, security, and integrations – built from official documentation.
Datadog is a cloud-scale observability and security platform that unifies metrics, logs, traces, real user monitoring, synthetic testing, and security signals into a single pane of glass across any infrastructure.
Collect 75–100 system metrics every 15–20 seconds from hosts, containers, cloud services, and serverless functions via the lightweight Agent.
End-to-end distributed tracing with flame graphs, service maps, error tracking, deployment comparison, and Continuous Profiler.
Centralized log ingestion, parsing pipelines, real-time Live Tail search, and Flex Logs tiered storage with up to 7-year retention.
Threshold, anomaly, forecast, composite, and ML-based monitors with multi-channel alerting via PagerDuty, Slack, OpsGenie, and more.
Cloud SIEM, Cloud Security Management, App & API Protection, Code Security, and Workload Protection on one unified platform.
Watchdog AI for automated anomaly detection, root cause analysis, LLM Observability, and Issue Correlation across services.
Three pillars of observability: Datadog unifies Metrics (what is happening), Logs (why it happened), and Traces (where in the call chain). With APM enabled, the Agent auto-injects trace IDs into logs – a click on any log takes you directly to the correlated distributed trace.
A lightweight Agent deployed on each host collects and buffers telemetry, forwarding it to the Datadog SaaS backend over HTTPS (metrics) and SSL-encrypted TCP (logs) for processing, storage, and analysis.
Collector – runs all configured checks and gathers metrics on a 15–20 second interval. Passes output to the local Aggregator and Forwarder.
Forwarder – sends payloads to Datadog over HTTPS. Buffers metrics in memory during network outages, preventing data loss. Discards the oldest data only when memory limits are reached.
Trace Agent (APM) – collects distributed traces; runs as a separate process within the Agent, enabled by default. Listens on port 8126.
Process Agent – collects live process and container info. Requires explicit enablement for full process monitoring.
DogStatsD – Go implementation of StatsD. Accepts custom metrics over UDP (port 8125) or a Unix socket, then aggregates and forwards them to the backend.
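DogStatsD speaks the plain-text StatsD datagram format with a Datadog extension for tags (`|#tag1,tag2` appended to the datagram). The sketch below builds and sends one gauge datagram over UDP by hand, useful for understanding what the client libraries do under the hood; the helper function name is ours, not part of any Datadog library.

```python
import socket

def dogstatsd_datagram(name, value, metric_type, tags=None):
    """Build a StatsD-format datagram with the Datadog tag extension.
    metric_type: "g" (gauge), "c" (count), "h" (histogram), "d" (distribution)."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

# Fire-and-forget over UDP to the local Agent (default port 8125).
msg = dogstatsd_datagram("app.queue.depth", 42, "g", ["env:prod", "service:worker"])
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(msg.encode("utf-8"), ("127.0.0.1", 8125))
sock.close()
```

Because UDP is connectionless, submission never blocks the application even if no Agent is listening, which is why DogStatsD is safe to call on hot code paths.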
| Data Type | Default Retention |
|---|---|
| Metrics | 15 months |
| Custom span-based metrics | 15 months |
| Indexed spans / traces | 15 days |
| Ingested spans (in-flight) | 15 minutes |
| Standard Tier logs | 3–30 days (configurable) |
| Flex Logs (frozen tier) | Up to 7 years |
| RUM sessions | 30 days |
The Datadog Agent is open-source software written in Go that runs on every monitored host. Agent 7 is the latest major version. Resource footprint: ~0.08% CPU avg, ~95 MB RAM, 880 MB–1.3 GB disk on a c5.xlarge.
BASH – Linux one-liner
# Install Agent 7 on Linux
DD_API_KEY="<YOUR_API_KEY>" DD_SITE="datadoghq.com" \
bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"
# Service management
sudo systemctl start datadog-agent
sudo systemctl stop datadog-agent
sudo datadog-agent status
DOCKER
docker run -d --name dd-agent \
  -e DD_API_KEY="<API_KEY>" \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7
KUBERNETES – Datadog Operator (recommended)
helm repo add datadog https://helm.datadoghq.com
helm install datadog-operator datadog/datadog-operator
# DatadogAgent custom resource
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    clusterName: my-cluster
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
  features:
    apm:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true
    liveProcesses:
      enabled: true
YAML – /etc/datadog-agent/datadog.yaml
api_key: <YOUR_API_KEY>
site: datadoghq.com  # or datadoghq.eu, us3.datadoghq.com
hostname: my-host.example.com

## Tags applied to ALL telemetry from this host
tags:
  - env:production
  - team:platform
  - region:us-east-1

## Enable features
logs_enabled: true
apm_config:
  enabled: true
process_config:
  process_collection:
    enabled: true

## DogStatsD
dogstatsd_port: 8125
dogstatsd_socket: /var/run/datadog/dsd.socket

## Optional proxy
proxy:
  https: http://proxy.corp:3128
| Environment | Method |
|---|---|
| Linux | install_script_agent7.sh |
| Windows | MSI installer / Chocolatey |
| macOS | Homebrew / DMG |
| Docker | gcr.io/datadoghq/agent:7 |
| Kubernetes | Datadog Operator (recommended) |
| Kubernetes (alt) | datadog/datadog Helm chart |
| AWS ECS | Daemon service / Fargate sidecar |
| AWS Lambda | Lambda Extension / Forwarder |
| IoT | IoT Agent (lightweight binary) |
YAML – conf.d/nginx.d/conf.yaml
init_config:

instances:
  - nginx_status_url: http://localhost/nginx_status
    tags:
      - service:nginx
      - env:production
K8s – Autodiscovery annotations
annotations:
  ad.datadoghq.com/nginx.check_names: '["nginx"]'
  ad.datadoghq.com/nginx.init_configs: '[{}]'
  ad.datadoghq.com/nginx.instances: '[{"nginx_status_url":"http://%%host%%/nginx_status"}]'
Fleet Automation: Remotely configure, upgrade, and manage all Agents across all environments from the Datadog UI – no SSH required. Supports automatic rollback if an Agent fails to restart after upgrade.
Metrics are time-series data points identified by a name and tags. They can be collected by the Agent, submitted via API/DogStatsD, or imported from cloud provider integrations.
| Type | Description | Use Case |
|---|---|---|
| COUNT | Events in a flush interval | http.requests |
| RATE | Count divided by time interval | requests.per_second |
| GAUGE | Instantaneous value at flush | cpu.usage, mem.free |
| HISTOGRAM | Statistical distribution | response.time |
| DISTRIBUTION | Global percentile calculations | latency.p99 |
| SET | Count of unique elements | unique_users |
HISTOGRAM outputs: When using HISTOGRAM, DogStatsD automatically sends .avg, .count, .median, .95percentile, .max, and .min as separate metric streams.
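To make those derived streams concrete, here is a small sketch that computes all six aggregates from one flush interval's samples. This is illustrative only, not the Agent's implementation; in particular, real DogStatsD percentile ranking details may differ slightly from the nearest-rank method used here.

```python
import math

def histogram_aggregates(samples):
    """Compute the six streams DogStatsD derives from HISTOGRAM samples
    collected during one flush interval (illustrative sketch)."""
    s = sorted(samples)
    n = len(s)

    def pct(p):
        # Nearest-rank percentile over the sorted flush buffer.
        return s[max(0, math.ceil(p * n) - 1)]

    return {
        "avg": sum(s) / n,
        "count": n,
        "median": pct(0.5),
        "95percentile": pct(0.95),
        "max": s[-1],
        "min": s[0],
    }

agg = histogram_aggregates([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
```

Each key corresponds to one emitted metric stream, e.g. `db.query.time.avg` and `db.query.time.95percentile`.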
PYTHON
from datadog import initialize, statsd
initialize(statsd_host='localhost', statsd_port=8125)
# Gauge
statsd.gauge('app.queue.depth', 42,
tags=['env:prod', 'service:worker'])
# Increment counter
statsd.increment('app.page.views',
tags=['page:home'])
# Histogram (timing)
statsd.histogram('db.query.time', 0.042,
tags=['query:get_user'])
# Distribution (global percentiles)
statsd.distribution('api.response.time', 125.3,
tags=['endpoint:/checkout'])
| Namespace | Captures |
|---|---|
| trace.<span>.hits | Request count per service |
| trace.<span>.errors | Error count per service |
| trace.<span>.apdex | Apdex score (HTTP/web) |
| runtime.* | Language runtime metrics |
Datadog Log Management centralizes logs from all sources with real-time search, Grok parsing pipelines, faceted exploration, alerting, and Flex Logs for long-term cost-effective retention.
YAML – datadog.yaml
logs_enabled: true
YAML – conf.d/python.d/conf.yaml
logs:
  - type: file
    path: /var/log/myapp/*.log
    source: python
    service: my-service
    tags:
      - env:production
DOCKER – label-based collection
labels:
  com.datadoghq.ad.logs: '[{"source":"nginx","service":"web"}]'
| Limit | Value |
|---|---|
| Max log size (HTTPS) | 1 MB |
| Recommended per log | < 25 KB |
| Agent auto-split threshold | 900 KB |
| Max tags per log event | 100 |
| Max JSON attributes | 256 |
| Max attribute key length | 50 chars |
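The size limits above explain why the Agent auto-splits oversized log lines near 900 KB, keeping each piece safely under the 1 MB HTTPS cap. The function below is a simplified, byte-oriented illustration of that idea, not the Agent's actual splitting logic (which is line- and encoding-aware).

```python
def split_log(payload: bytes, limit: int = 900_000) -> list[bytes]:
    """Chunk an oversized log payload into pieces at or below `limit`
    bytes, mirroring (in spirit) the Agent's ~900 KB auto-split."""
    if not payload:
        return [b""]
    return [payload[i:i + limit] for i in range(0, len(payload), limit)]

chunks = split_log(b"x" * 2_000_000)
```

For application logs, the better fix is staying under the recommended 25 KB per event rather than relying on splitting.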
Standard Tier: Fully indexed logs for real-time search, monitors, dashboards, and Live Tail. Configurable 3–30 day retention. Full Log Explorer capabilities.
Flex Logs: Cost-effective tier for lower-query-frequency logs. In-place searchability without rehydration. The Flex Frozen sub-tier stores up to 7 years for compliance and forensic investigation.
Archives: Query logs archived directly in cloud storage (S3, GCS, Azure Blob) or Flex Frozen – without exporting or rehydrating. Ideal for audits and long-range analytics.
Trace-log correlation: APM auto-injects trace IDs into logs. Clicking a log entry with a trace ID jumps immediately to the correlated trace – no manual query building.
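The correlation works because each log event carries `dd.trace_id` and `dd.span_id` attributes that match the active span. With ddtrace, log injection fills these in automatically; the sketch below wires them in manually through a JSON formatter to show the shape of a correlated log line. The formatter class and the callback that supplies IDs are ours, for illustration only.

```python
import json
import logging

class TraceJsonFormatter(logging.Formatter):
    """Emit JSON logs carrying dd.trace_id / dd.span_id so the backend
    can link each log line to its distributed trace."""
    def __init__(self, get_trace_context):
        super().__init__()
        # Callback returning (trace_id, span_id) for the current request.
        self.get_trace_context = get_trace_context

    def format(self, record):
        trace_id, span_id = self.get_trace_context()
        return json.dumps({
            "message": record.getMessage(),
            "level": record.levelname,
            "dd.trace_id": trace_id,
            "dd.span_id": span_id,
        })

# Stand-in context provider; with ddtrace this would read the active span.
formatter = TraceJsonFormatter(lambda: ("1234567890", "987654"))
record = logging.LogRecord("app", logging.INFO, __file__, 1, "user fetched", None, None)
line = json.loads(formatter.format(record))
```

As long as the IDs in the log match the trace, the backend (or you, with a search on `dd.trace_id`) can pivot between the two.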
GROK – Apache combined access log rule
access_log %{ip:network.client.ip} %{notSpace:http.ident} %{notSpace:http.auth} \
\[%{date("dd/MMM/yyyy:HH:mm:ss Z"):date}\] \
"%{word:http.method} %{notSpace:http.url} %{notSpace:http.version}" \
%{integer:http.status_code} %{integer:network.bytes_written}
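For intuition about what that Grok rule extracts, here is a rough Python regex equivalent run against a sample access-log line. This is purely illustrative: in Datadog, Grok parsing runs server-side in log pipelines, and the group names below are simplified stand-ins for the dotted attribute paths (`network.client.ip`, `http.status_code`, and so on).

```python
import re

# Approximate regex counterpart of the Apache combined access log Grok rule.
LOG_RE = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) (?P<version>\S+)" '
    r'(?P<status_code>\d+) (?P<bytes_written>\d+)'
)

line = '10.0.0.1 - frank [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
fields = LOG_RE.match(line).groupdict()
```

Each named group becomes a structured attribute you can facet, filter, and alert on, which is exactly the payoff of pipeline parsing.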
Datadog APM provides end-to-end distributed tracing, flame graphs, service maps, deployment tracking, and Continuous Profiler – with deep correlation to logs, metrics, and RUM.
Installs the Agent and instruments the app in one step – no code changes. The simplest starting point.
Language-specific libraries for Python, Java, Go, Ruby, Node.js, .NET, PHP, C++, Rust.
Send OTel metrics, traces, and logs into Datadog via the Collector with Datadog Exporter.
Add instrumentation to live running services via the Datadog UI – no code deploys or restarts required.
Auto-generated dependency map of all services with real-time error rates and latency per connection.
Full call tree of any trace with time-spent visualization. Identify slowest code paths instantly.
Intelligent error grouping across services. Track new vs. regressing issues by deployed version.
Compare error rate, latency, and throughput before/during/after each deployment. Auto-detect faulty deploys via Watchdog.
Always-on low-overhead code profiling in production. See exactly which methods consume CPU, memory, and I/O.
Trace IDs injected into logs. View logs side-by-side with the trace that generated them.
PYTHON
pip install ddtrace
# Auto-instrument at startup (recommended)
DD_SERVICE="my-api" DD_ENV="prod" DD_VERSION="1.2.0" \
ddtrace-run python app.py
# Manual span creation
from ddtrace import tracer

with tracer.trace("db.query", resource="SELECT users") as span:
    span.set_tag("db.type", "postgres")
    result = db_query()
JAVA – JVM flag
java -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.service=my-app \
  -Ddd.env=production \
  -Ddd.version=1.0.0 \
  -jar app.jar
YAML – OTel Collector Datadog Exporter
exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com
    traces:
      compute_stats_by_span_kind: true

service:
  pipelines:
    traces:
      exporters: [datadog]
    metrics:
      exporters: [datadog]
Unified Service Tagging: Apply env, service, and version consistently across all telemetry types from a service to enable seamless pivoting between metrics, logs, and traces.
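In practice, the three unified service tags usually come from the standard `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` environment variables, which the tracing libraries and Agent read automatically. The helper below (our own, not a Datadog API) shows how an application might assemble the same tag trio for custom metric submission.

```python
import os

def unified_service_tags() -> list[str]:
    """Build the env/service/version tag trio from the standard
    DD_* environment variables; unset variables are skipped."""
    mapping = {"env": "DD_ENV", "service": "DD_SERVICE", "version": "DD_VERSION"}
    return [f"{tag}:{os.environ[var]}" for tag, var in mapping.items() if var in os.environ]

# Simulate the pod spec's environment for demonstration.
os.environ.update({"DD_ENV": "production", "DD_SERVICE": "checkout-api", "DD_VERSION": "1.4.2"})
tags = unified_service_tags()
```

Attaching the same three tags to custom metrics keeps them pivotable against the traces and logs emitted by the instrumented service.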
Monitor hosts, containers, Kubernetes clusters, cloud services, and serverless functions from a unified Infrastructure List with real-time metrics, health status, and live process monitoring.
| Category | Example Metrics |
|---|---|
| CPU | system.cpu.user, system.load.1 |
| Memory | system.mem.used, system.swap.used |
| Disk I/O | system.io.rkb_s, system.disk.used |
| Network | system.net.bytes_rcvd, system.net.bytes_sent |
The Cluster Agent efficiently gathers monitoring data from across an orchestrated cluster. It distributes check configurations to node Agents and ensures only one instance of each check runs per workload – preventing duplicate data collection across replicas.
The Cluster Agent holds configs and dispatches them to node Agents every 10 seconds. If a node Agent stops reporting, the Cluster Agent removes it from the active pool and redistributes its configurations.
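The "one check instance per workload" guarantee amounts to a deterministic assignment of each check to exactly one healthy node Agent, recomputed when the pool changes. The sketch below illustrates that idea with simple hash-based placement; it is a conceptual toy, not the Cluster Agent's actual dispatch algorithm.

```python
import hashlib

def dispatch_checks(checks, agents):
    """Assign each cluster check to exactly one node Agent from the
    healthy pool. Hash-based placement keeps the assignment stable
    and deterministic (illustrative only)."""
    assignment = {}
    live = sorted(agents)  # canonical ordering so placement is reproducible
    for check in checks:
        idx = int(hashlib.sha256(check.encode()).hexdigest(), 16) % len(live)
        assignment[check] = live[idx]
    return assignment

a = dispatch_checks(["nginx", "postgres", "redis"], ["node-1", "node-2"])
# If node-2 stops reporting, recompute over the surviving pool:
b = dispatch_checks(["nginx", "postgres", "redis"], ["node-1"])
```

The key property is that every check lands on exactly one Agent, and removing a dead Agent simply redistributes its checks on the next dispatch cycle.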
For AWS Lambda, Datadog collects metrics, traces, and logs via the Lambda Extension (preferred; runs alongside the function inside the execution environment) or the Lambda Forwarder (CloudWatch-based). Supports enhanced Lambda metrics, cold start detection, and X-Ray integration.
Monitors evaluate metric, log, or trace queries against defined conditions and trigger alerts with notifications to PagerDuty, Slack, email, OpsGenie, and more. Evaluation frequency defaults to 1 minute.
Alert when a metric threshold is crossed over a rolling window. Simple or multi-alert modes grouped by any tag.
Alert on indexed log count, attribute unique count, or measure. Supports group-by facets. Max 2-day rolling window.
Monitor service APM metrics (hits, errors, latency) or alert on Trace Analytics Indexed Span patterns.
ML-based detection learns seasonal patterns. Alerts on statistically unexpected deviations without manual thresholds.
Predicts when a metric will breach a threshold. Ideal for disk capacity and resource planning.
Alert when a Synthetic API test or browser test fails or exceeds latency thresholds.
Alert based on OK / WARNING / CRITICAL status submitted by Agent integration checks.
Combine monitors with boolean logic (AND, OR, NOT). Alert only when multiple conditions are simultaneously true.
Alert on slow queries, connection pool saturation, replication lag for PostgreSQL, MySQL, SQL Server, Oracle.
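Among the monitor types above, the forecast monitor's core idea is extrapolating a trend to the moment a threshold will be breached. The toy below does this with a least-squares linear fit over daily disk-usage samples; real forecast monitors use far richer seasonal models, so treat this purely as an intuition aid.

```python
def days_until_breach(samples, threshold):
    """Fit a least-squares line to equally spaced (daily) samples and
    extrapolate to when the threshold is crossed. Returns None if the
    metric is not trending toward the threshold."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / \
            sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Days remaining, measured from the most recent sample.
    return (threshold - intercept) / slope - (n - 1)

# Disk usage growing ~2 points/day, currently at 78%: ~6 days until 90%.
eta = days_until_breach([70, 72, 74, 76, 78], 90)
```

This is why forecast monitors suit disk capacity: alerting on "will breach in N days" gives lead time that a plain threshold cannot.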
TERRAFORM
resource "datadog_monitor" "high_cpu" {
  name    = "High CPU Usage"
  type    = "metric alert"
  message = "CPU > 90% on {{host.name}} @pagerduty"
  query   = "avg(last_5m):avg:system.cpu.user{env:production} by {host} > 90"

  monitor_thresholds {
    critical = 90
    warning  = 75
  }

  notify_no_data    = true
  no_data_timeframe = 20
  tags              = ["env:production", "team:platform"]
}
| Option | Description |
|---|---|
| evaluation_window | Time range for query (last_5m, last_1h) |
| evaluation_frequency | How often query runs (default 1 min) |
| critical | Value triggering ALERT state |
| warning | Value triggering WARNING state |
| notify_no_data | Alert if no data is received |
| renotify_interval | Re-alert on sustained state (minutes) |
| require_full_window | Only evaluate with complete data window |
| multi_alert | Separate alert per dimension (e.g., per host) |
Three SLO types:
| Type | Based On |
|---|---|
| Metric-based | Good events / total events ratio |
| Monitor-based | Uptime % derived from monitor state |
| Time Slice | % of time windows metric was within threshold |
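For a metric-based SLO, the arithmetic is simple: the SLI is the good/total ratio, and the error budget is the fraction of events the target allows to be bad. The sketch below (our own helper, not a Datadog API) makes that concrete.

```python
def error_budget(good, total, target=0.999):
    """Metric-based SLO arithmetic: SLI = good/total; the error budget
    is the allowed bad-event count, and budget_consumed is the share
    of it already spent."""
    sli = good / total
    allowed_bad = (1 - target) * total
    bad = total - good
    return {
        "sli": sli,
        "budget_consumed": bad / allowed_bad if allowed_bad else float("inf"),
        "remaining_bad_events": allowed_bad - bad,
    }

# 99.9% target over 1M requests allows 1,000 bad events; 500 used = 50% spent.
status = error_budget(good=999_500, total=1_000_000, target=0.999)
```

Monitor-based and Time Slice SLOs replace the event ratio with uptime derived from monitor state or from per-window threshold compliance, but the budget framing is the same.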
Dashboards provide real-time insight into system health and business KPIs. Build from any combination of metrics, logs, traces, RUM, and events with template variables for dynamic scoping.
| Type | Use Case |
|---|---|
| Timeboard | All widgets share the same time range. Best for metric correlation during investigations. |
| Screenboard | Free-form layout with independent time ranges per widget. Best for NOC status displays. |
| Notebook | Markdown + live graphs. Best for postmortems, runbooks, and incident investigations. |
TERRAFORM
resource "datadog_dashboard" "service_health" {
  title       = "Service Health"
  layout_type = "ordered"

  template_variable {
    name    = "env"
    prefix  = "env"
    default = "production"
  }

  widget {
    timeseries_definition {
      request {
        q = "avg:trace.web.request.duration{$env} by {service}"
      }
      title = "Request Latency by Service"
    }
  }
}
Datadog Sheets: Spreadsheet-style interface for analyzing telemetry – perform lookups, build pivot tables, create calculated columns, join datasets. Results can be added to dashboards or shared with colleagues.
Synthetics proactively tests endpoints and journeys from Datadog-managed global locations. RUM captures real user interactions and performance from actual browsers and mobile apps.
| Type | Description |
|---|---|
| API Test | HTTP, gRPC, WebSocket, TCP, SSL, DNS checks. Assert on status codes, body, headers, latency. |
| Multistep API | Chain multiple API calls with variables passed between steps. Test full auth + action flows. |
| Browser Test | Headless Chrome tests that record and replay user journeys. Detect visual regressions and broken UI. |
| Mobile Test | Native iOS and Android app testing on real devices. |
BASH – datadog-ci
npm install -g @datadog/datadog-ci

# Run Synthetic tests in CI pipeline
datadog-ci synthetics run-tests \
  --public-id "abc-123-xyz" \
  --apiKey $DD_API_KEY \
  --failOnCriticalErrors
Datadog Feature Flags integrates with your existing feature flag provider to track flag evaluations alongside RUM data. Correlate feature flag rollouts directly with performance regressions and error spikes in the same view.
Datadog unifies observability and security on one platform – eliminating the context-switching between tools that slows down incident response when a performance issue has a security dimension.
Detect, investigate, and respond to security threats across cloud and on-premises systems. Correlates logs, metrics, and network data to surface high-fidelity signals.
Continuously audits cloud configurations, assesses identity risks (CIEM), and detects runtime threats across AWS, GCP, and Azure.
Detects and blocks threats targeting production applications and APIs in real time, with APM trace context for each attack signal.
Detects and fixes vulnerabilities in first-party code, open-source dependencies (SCA), and infrastructure-as-code from dev through runtime (IAST).
Uses eBPF to monitor file, network, and process activity at the kernel level. Detects privilege escalation, cryptomining, and unusual process behavior.
Immutable audit log of all user and configuration changes across the Datadog platform – who changed what monitor, API key, or Agent config and when.
Datadog ships 1,000+ vendor-backed integrations – each providing ready-made Agent checks, dashboards, and monitors. Connect to cloud providers, databases, message queues, CI/CD tools, and observability standards.
Datadog Internal Developer Portal (IDP): Software Catalog + Self-Service Actions + Scorecards. Visualize service hierarchies, enable self-service infrastructure provisioning, and evaluate production-readiness before release.
Tags are key:value metadata attached to every metric, log, trace, and event. A consistent tagging strategy is the foundation of effective filtering, alerting, and root-cause analysis in Datadog.
| Tag | Purpose | Example |
|---|---|---|
| env | Deployment environment | production |
| service | Service / application name | checkout-api |
| version | Deployed code version | 1.4.2 |
ENV VARS – Kubernetes pod spec
env:
  - name: DD_ENV
    value: production
  - name: DD_SERVICE
    value: checkout-api
  - name: DD_VERSION
    value: "1.4.2"
  - name: DD_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
Tagging best practices:
- Use key:value format consistently everywhere
- env: for deployment environment (production, staging, dev)
- service: for service-level views
- version: for deployment tracking
- team: for ownership routing in monitor notifications
- region: and availability-zone: for geographic scoping

| Source | Example Tags |
|---|---|
| Agent config (datadog.yaml) | env:prod, team:platform |
| Cloud provider metadata | aws:us-east-1, instance_type:c5 |
| Container labels / K8s labels | app:frontend, kube_namespace:default |
| Integration check config | db:postgres-prod |
| DogStatsD metric submission | endpoint:/checkout |
Datadog exposes a comprehensive REST API for programmatic access to all platform resources. Official SDKs, Terraform provider, and datadog-ci CLI enable full Datadog-as-Code workflows.
| Endpoint | Action |
|---|---|
| POST /api/v1/series | Submit custom metrics |
| POST /api/v2/logs/events | Send log events |
| GET /api/v1/monitors | List all monitors |
| POST /api/v1/monitor | Create a monitor |
| GET /api/v1/dashboard | List all dashboards |
| POST /api/v1/events | Post to Events Stream |
| POST /api/v2/query | Metrics query over time range |
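The metric submission endpoint above takes a JSON body with a `series` array of metric objects, each holding `[timestamp, value]` points; the API key travels in the `DD-API-KEY` request header. The sketch below builds that payload shape without sending it (no network call), so the structure can be inspected; the helper function is ours, not part of a Datadog SDK.

```python
import json
import time

def series_payload(metric, points, tags=None, metric_type="gauge", host=None):
    """Build the JSON body for POST /api/v1/series. `points` is a list
    of (unix_timestamp, value) pairs. Auth goes in the DD-API-KEY
    header when the request is actually sent (omitted here)."""
    series = {"metric": metric, "points": points, "type": metric_type}
    if tags:
        series["tags"] = tags
    if host:
        series["host"] = host
    return {"series": [series]}

now = int(time.time())
body = series_payload("app.queue.depth", [(now, 42)], tags=["env:prod"])
payload = json.dumps(body)  # ready to POST to /api/v1/series
```

In most cases the official API clients or DogStatsD are preferable, but knowing the wire shape helps when debugging submissions from curl or custom tooling.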
PYTHON – API client
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi

config = Configuration()
config.api_key["apiKeyAuth"] = "<DD_API_KEY>"
config.api_key["appKeyAuth"] = "<DD_APP_KEY>"

with ApiClient(config) as api_client:
    api = MonitorsApi(api_client)
    print(api.list_monitors())
TERRAFORM – Provider setup
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}

# Auth via DD_API_KEY + DD_APP_KEY env vars
provider "datadog" {
  api_url = "https://api.datadoghq.com/"
}
BASH
npm install -g @datadog/datadog-ci

# Upload source maps for RUM error tracking
datadog-ci sourcemaps upload ./dist \
  --service my-app --release-version 1.4.2

# Report CI test results
datadog-ci junit upload --service my-app ./test-results.xml
Core terms used across the Datadog platform and documentation.