Complete Technical Reference Guide

Datadog
Observability Platform

A comprehensive guide to Datadog's Agent, metrics, logs, APM, infrastructure monitoring, alerting, dashboards, synthetics, security, and integrations, built from official documentation.

Agent v7 | Metrics & Logs | APM & Tracing | Monitors & Alerts | Security | 1,000+ Integrations | OpenTelemetry | Kubernetes

01 - Product Overview

What is Datadog?

Datadog is a cloud-scale observability and security platform that unifies metrics, logs, traces, real user monitoring, synthetic testing, and security signals into a single pane of glass across any infrastructure.

Infrastructure Monitoring

Collect 75 to 100 system metrics every 15 to 20 seconds from hosts, containers, cloud services, and serverless functions via the lightweight Agent.

APM & Distributed Tracing

End-to-end distributed tracing with flame graphs, service maps, error tracking, deployment comparison, and Continuous Profiler.

Log Management

Centralized log ingestion, parsing pipelines, real-time Live Tail search, and Flex Logs tiered storage with up to 7-year retention.

Monitors & Alerting

Threshold, anomaly, forecast, composite, and ML-based monitors with multi-channel alerting via PagerDuty, Slack, OpsGenie, and more.

Security

Cloud SIEM, Cloud Security Management, App & API Protection, Code Security, and Workload Protection on one unified platform.

AI & ML Features

Watchdog AI for automated anomaly detection, root cause analysis, LLM Observability, and Issue Correlation across services.

Three pillars of observability: Datadog unifies Metrics (what is happening), Logs (why it happened), and Traces (where in the call chain). With APM enabled, the tracing library can inject trace IDs into logs, so a click on a correlated log takes you directly to its distributed trace.

02 - Platform Architecture

How Datadog Works

A lightweight Agent deployed on each host collects and buffers telemetry, then forwards it to the Datadog SaaS backend for processing, storage, and analysis. Metrics travel over HTTPS. Logs use compressed HTTPS by default in recent Agent 7 releases, with TLS-encrypted TCP as the older transport.

YOUR INFRASTRUCTURE
  • Linux Host: Agent + Checks
  • Kubernetes: DaemonSet + Cluster Agent
  • Docker: Container Agent
  • AWS Lambda: Extension / Forwarder
  • Windows: Agent Service
↓ collected by ↓
DATADOG AGENT PROCESSES
  • Collector: runs checks, gathers metrics
  • Forwarder: buffers + sends over HTTPS
  • DogStatsD: custom metrics over UDP/UDS, port 8125
  • APM Agent: traces from apps, port 8126
  • Process Agent: live process info
↓ HTTPS (metrics) / HTTPS or TLS-encrypted TCP (logs) ↓
DATADOG BACKEND (SaaS)
  • Metrics Store: 15-month retention
  • Log Management: Standard / Flex tiers
  • Trace Storage: 15 days of indexed spans
  • Watchdog AI: ML anomaly detection
↓ visualize / alert / notify ↓
Dashboards | Monitors | APM Traces | Log Explorer | PagerDuty / Slack

Agent Internal Components

Collector: runs all configured checks and gathers metrics on a 15 to 20 second interval, then passes the output to the local Aggregator and Forwarder.

Forwarder: sends payloads to Datadog over HTTPS. It buffers metrics in memory during network outages and discards the oldest data only when the memory limit is reached.

APM Agent: a separate, optional process that collects distributed traces. Enabled by default. Listens on port 8126.

Process Agent: collects live process and container info. Requires explicit enablement for full process monitoring.

DogStatsD: a Go implementation of StatsD. Accepts custom metrics over UDP (port 8125) or a Unix socket, then aggregates and forwards them to the backend.
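The Forwarder's buffering behaviour can be pictured with a small sketch. It assumes a hypothetical `RetryBuffer` class capped by payload count; the real Agent bounds its buffer by memory, and its internals are not shown in this guide.

```python
from collections import deque


class RetryBuffer:
    """Hypothetical in-memory buffer that drops the oldest payload when full."""

    def __init__(self, max_payloads):
        # deque(maxlen=...) evicts from the left once capacity is reached
        self._queue = deque(maxlen=max_payloads)

    def add(self, payload):
        self._queue.append(payload)

    def flush(self, send):
        """Send buffered payloads in order; stop at the first failure."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break  # network still down: keep the rest for the next attempt
            self._queue.popleft()
            sent += 1
        return sent

    def pending(self):
        return list(self._queue)


buf = RetryBuffer(max_payloads=3)
for i in range(5):  # simulate an outage: 5 payloads arrive, room for 3
    buf.add(f"payload-{i}")
print(buf.pending())  # the two oldest payloads were discarded
```

Only the newest payloads survive a long outage, which is the trade-off the Forwarder description implies.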

Data Retention

Data Type | Default Retention
Metrics | 15 months
Custom span-based metrics | 15 months
Indexed spans / traces | 15 days
Ingested spans (in-flight) | 15 minutes
Standard Tier logs | 3 to 30 days (configurable)
Flex Logs (frozen tier) | Up to 7 years
RUM sessions | 30 days

03 - The Datadog Agent

Agent Installation & Configuration

The Datadog Agent is open-source software written in Go that runs on every monitored host. Agent 7 is the latest major version. Resource footprint: ~0.08% CPU avg, ~95 MB RAM, 880 MB to 1.3 GB disk on a c5.xlarge.

Installation

BASH: Linux one-liner
# Install Agent 7 on Linux
DD_API_KEY="<YOUR_API_KEY>" DD_SITE="datadoghq.com" \
  bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"

# Service management
sudo systemctl start   datadog-agent
sudo systemctl stop    datadog-agent
sudo datadog-agent     status

DOCKER
docker run -d --name dd-agent \
  -e DD_API_KEY="<API_KEY>" \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7
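
KUBERNETES: Datadog Operator (recommended)
helm repo add datadog https://helm.datadoghq.com
helm install datadog-operator datadog/datadog-operator

# The resource below reads the API key from a secret named datadog-secret, e.g.:
#   kubectl create secret generic datadog-secret --from-literal api-key=<API_KEY>
# DatadogAgent custom resource: save as datadog-agent.yaml, then run
#   kubectl apply -f datadog-agent.yaml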
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    clusterName: my-cluster
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
  features:
    apm:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true
    liveProcesses:
      enabled: true

Core Configuration: datadog.yaml

YAML: /etc/datadog-agent/datadog.yaml
api_key: <YOUR_API_KEY>
site:    datadoghq.com         # or datadoghq.eu, us3.datadoghq.com
hostname: my-host.example.com

## Tags applied to ALL telemetry from this host
tags:
  - env:production
  - team:platform
  - region:us-east-1

## Enable features
logs_enabled:       true
apm_config:
  enabled:          true
process_config:
  process_collection:
    enabled:        true

## DogStatsD
dogstatsd_port:     8125
dogstatsd_socket:   /var/run/datadog/dsd.socket

## Optional proxy
proxy:
  https:            http://proxy.corp:3128

Deployment Options

Environment | Method
Linux | install_script_agent7.sh
Windows | MSI installer / Chocolatey
macOS | Homebrew / DMG
Docker | gcr.io/datadoghq/agent:7
Kubernetes | Datadog Operator (recommended)
Kubernetes (alt) | datadog/datadog Helm chart
AWS ECS | Daemon service / Fargate sidecar
AWS Lambda | Lambda Extension / Forwarder
IoT | IoT Agent (lightweight binary)

Integration Check Config

YAML: conf.d/nginx.d/conf.yaml
init_config:

instances:
  - nginx_status_url: http://localhost/nginx_status
    tags:
      - service:nginx
      - env:production

K8s: Autodiscovery annotations
annotations:
  ad.datadoghq.com/nginx.check_names:  '["nginx"]'
  ad.datadoghq.com/nginx.init_configs: '[{}]'
  ad.datadoghq.com/nginx.instances:
    '[{"nginx_status_url":"http://%%host%%/nginx_status"}]'
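At runtime the `%%host%%` template variable in the annotation is replaced with the address of the discovered container. A minimal sketch of that substitution follows; `resolve_instances` is a hypothetical helper, the pod IP is made up, and only `%%host%%` is handled.

```python
import json


def resolve_instances(template, host):
    """Replace the %%host%% template variable and parse the JSON instances."""
    return json.loads(template.replace("%%host%%", host))


template = '[{"nginx_status_url":"http://%%host%%/nginx_status"}]'
instances = resolve_instances(template, "10.0.0.5")  # hypothetical pod IP
print(instances[0]["nginx_status_url"])
```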

Fleet Automation: Remotely configure, upgrade, and manage all Agents across all environments from the Datadog UI, with no SSH required. Supports automatic rollback if an Agent fails to restart after upgrade.

04 - Metrics

Metrics Collection & Custom Metrics

Metrics are time-series data points identified by a name and tags. They can be collected by the Agent, submitted via API/DogStatsD, or imported from cloud provider integrations.

Metric Types

Type | Description | Use Case
COUNT | Events in a flush interval | http.requests
RATE | Count divided by time interval | requests.per_second
GAUGE | Instantaneous value at flush | cpu.usage, mem.free
HISTOGRAM | Statistical distribution | response.time
DISTRIBUTION | Global percentile calculations | latency.p99
SET | Count of unique elements | unique_users
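To make the difference between these types concrete, the sketch below summarises one flush interval of raw samples in four ways. The `flush` function is a hypothetical, simplified model and not the Agent's aggregation code.

```python
def flush(samples, interval_seconds):
    """Summarise one flush interval of raw samples for several metric types."""
    total = sum(samples)
    return {
        "count": total,                     # COUNT: total over the interval
        "rate": total / interval_seconds,   # RATE: count divided by the interval
        "gauge": samples[-1],               # GAUGE: last value seen before flush
        "set": len(set(samples)),           # SET: number of unique values
    }


result = flush([3, 5, 5, 7], interval_seconds=10)
print(result)
```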

HISTOGRAM outputs: For a HISTOGRAM metric, the Agent sends .avg, .count, .median, .95percentile, and .max as separate metric streams by default. Additional aggregations such as .min can be enabled in the Agent configuration.
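A rough sketch of how one interval of samples fans out into several histogram series is shown here. `histogram_aggregates` is a hypothetical helper, and the nearest-rank percentile is an assumption rather than the Agent's exact method.

```python
import math
import statistics


def histogram_aggregates(name, samples):
    """Derive histogram-style series from one interval's samples (simplified)."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile: an assumption, not the Agent's algorithm
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        f"{name}.avg": sum(ordered) / len(ordered),
        f"{name}.count": len(ordered),
        f"{name}.median": statistics.median(ordered),
        f"{name}.95percentile": ordered[p95_index],
        f"{name}.max": ordered[-1],
    }


series = histogram_aggregates("db.query.time", [10, 20, 30, 40, 100])
for key, value in series.items():
    print(key, value)
```

Each key becomes its own time series, which is why a single histogram counts as several custom metrics.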

Submitting Custom Metrics via DogStatsD

PYTHON
from datadog import initialize, statsd
initialize(statsd_host='localhost', statsd_port=8125)

# Gauge
statsd.gauge('app.queue.depth', 42,
  tags=['env:prod', 'service:worker'])

# Increment counter
statsd.increment('app.page.views',
  tags=['page:home'])

# Histogram (timing)
statsd.histogram('db.query.time', 0.042,
  tags=['query:get_user'])

# Distribution (global percentiles)
statsd.distribution('api.response.time', 125.3,
  tags=['endpoint:/checkout'])
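Under the client library, each call becomes a short text datagram of the form `name:value|type|@sample_rate|#tags`. The sketch below builds such datagrams with the standard library only; `format_datagram` and `send` are hypothetical helper names, not part of the `datadog` package.

```python
import socket


def format_datagram(name, value, metric_type, tags=None, sample_rate=None):
    """Build a DogStatsD datagram: name:value|type[|@rate][|#tag1,tag2]."""
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send(datagram, host="localhost", port=8125):
    """Fire-and-forget UDP send to a local DogStatsD listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Type codes: g = gauge, c = count, h = histogram, d = distribution, s = set
gauge = format_datagram("app.queue.depth", 42, "g", tags=["env:prod", "service:worker"])
print(gauge)  # app.queue.depth:42|g|#env:prod,service:worker
```

Calling `send(gauge)` would submit the metric to an Agent listening on port 8125.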

APM Metric Namespaces

Namespace | Captures
trace.<span>.hits | Request count per service
trace.<span>.errors | Error count per service
trace.<span>.apdex | Apdex score (HTTP/web)
runtime.* | Language runtime metrics

05 - Log Management

Log Collection, Pipelines & Storage

Datadog Log Management centralizes logs from all sources with real-time search, Grok parsing pipelines, faceted exploration, alerting, and Flex Logs for long-term cost-effective retention.

Collection: File, Docker, K8s, AWS, API
→ Processing: Grok parsing pipelines
→ Enrichment: tags, attributes, lookups
→ Indexing: retention filters + Flex
→ Log Explorer: search, Live Tail, facets
→ Monitors: threshold & anomaly alerts

Enabling Log Collection

YAML: datadog.yaml
logs_enabled: true

YAML: conf.d/python.d/conf.yaml
logs:
  - type:    file
    path:    /var/log/myapp/*.log
    source:  python
    service: my-service
    tags:
      - env:production

DOCKER: label-based collection
labels:
  com.datadoghq.ad.logs:
    '[{"source":"nginx","service":"web"}]'

Log Limits

Limit | Value
Max log size (HTTPS) | 1 MB
Recommended per log | < 25 KB
Agent auto-split threshold | 900 KB
Max tags per log event | 100
Max JSON attributes | 256
Max attribute key length | 50 chars
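These limits can be checked before a log is shipped. The sketch below is a hypothetical pre-flight validator for a flat JSON event; the constant names and the decimal interpretation of MB and KB are assumptions.

```python
import json

MAX_LOG_BYTES = 1_000_000     # 1 MB per log over HTTPS (decimal MB assumed)
RECOMMENDED_BYTES = 25_000    # recommended size per log
MAX_TAGS = 100
MAX_ATTRIBUTES = 256
MAX_KEY_LENGTH = 50


def check_log(event, tags):
    """Return a list of limit warnings for a flat JSON log event."""
    problems = []
    size = len(json.dumps(event).encode("utf-8"))
    if size > MAX_LOG_BYTES:
        problems.append("exceeds max log size")
    elif size > RECOMMENDED_BYTES:
        problems.append("larger than recommended size")
    if len(tags) > MAX_TAGS:
        problems.append("too many tags")
    if len(event) > MAX_ATTRIBUTES:
        problems.append("too many attributes")
    if any(len(key) > MAX_KEY_LENGTH for key in event):
        problems.append("attribute key too long")
    return problems


print(check_log({"message": "ok"}, ["env:production"]))
```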

Storage Tiers

Standard Tier

Fully indexed logs for real-time search, monitors, dashboards, Live Tail. Configurable 3 to 30 day retention. Full Log Explorer capabilities.

Flex Logs

Cost-effective tier for lower-query-frequency logs. In-place searchability without rehydration. Flex Frozen sub-tier stores up to 7 years for compliance and forensic investigation.

Archive Search

Query logs archived directly in cloud storage (S3, GCS, Azure Blob) or Flex Frozen without exporting or rehydrating. Ideal for audits and long-range analytics.

Trace-log correlation: APM tracing libraries inject trace IDs into logs. Clicking a log entry that carries a trace ID jumps straight to the correlated trace, with no manual query building.
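Correlation depends on the log carrying the trace identifiers as attributes. The sketch below builds such a JSON log line by hand, using the `dd.trace_id` and `dd.span_id` attribute names; `correlated_log` is a hypothetical helper and the IDs are made up, since in a real service the tracing library supplies them.

```python
import json


def correlated_log(message, trace_id, span_id, service, env, version):
    """Build a JSON log line carrying trace correlation attributes."""
    return json.dumps({
        "message": message,
        "dd.trace_id": str(trace_id),
        "dd.span_id": str(span_id),
        "dd.service": service,
        "dd.env": env,
        "dd.version": version,
    })


line = correlated_log(
    "payment accepted", 1234567890, 987654321,  # hypothetical IDs
    service="checkout-api", env="production", version="1.4.2",
)
print(line)
```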

Grok Parser Example

GROK: Apache combined access log rule
access_log %{ip:network.client.ip} %{notSpace:http.ident} %{notSpace:http.auth} \
  \[%{date("dd/MMM/yyyy:HH:mm:ss Z"):date}\] \
  "%{word:http.method} %{notSpace:http.url} %{notSpace:http.version}" \
  %{integer:http.status_code} %{integer:network.bytes_written}
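For readers more used to regular expressions, the rule above corresponds roughly to the Python sketch below. It is an approximation for illustration only: the group names, the `ATTRIBUTES` mapping, and the sample line are assumptions, and Datadog's Grok matchers are stricter than `\S+`.

```python
import re

# Rough regex counterpart of the Grok rule; group names stand in for attributes
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<url>\S+) (?P<version>\S+)" '
    r'(?P<status_code>\d+) (?P<bytes_written>\d+)'
)

# Map regex group names to the attribute paths used in the rule
ATTRIBUTES = {
    "client_ip": "network.client.ip",
    "ident": "http.ident",
    "auth": "http.auth",
    "date": "date",
    "method": "http.method",
    "url": "http.url",
    "version": "http.version",
    "status_code": "http.status_code",
    "bytes_written": "network.bytes_written",
}


def parse(line):
    """Return extracted attributes, or None when the line does not match."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return None
    parsed = {ATTRIBUTES[k]: v for k, v in match.groupdict().items()}
    parsed["http.status_code"] = int(parsed["http.status_code"])
    parsed["network.bytes_written"] = int(parsed["network.bytes_written"])
    return parsed


sample = '192.0.2.10 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse(sample))
```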

06 - APM & Distributed Tracing

Application Performance Monitoring

Datadog APM provides end-to-end distributed tracing, flame graphs, service maps, deployment tracking, and Continuous Profiler, with deep correlation to logs, metrics, and RUM.

Instrumentation Methods

Single Step Instrumentation

Installs the Agent and instruments the application in one step, with no code changes. The simplest starting point.

Tracing Libraries

Language-specific libraries for Python, Java, Go, Ruby, Node.js, .NET, PHP, C++, Rust.

OpenTelemetry

Send OTel metrics, traces, and logs into Datadog via the Collector with Datadog Exporter.

Dynamic Instrumentation

Add instrumentation to live running services via the Datadog UI, with no code deploys or restarts required.

Key APM Features

Service Map

Auto-generated dependency map of all services with real-time error rates and latency per connection.

Flame Graphs

Full call tree of any trace with time-spent visualization. Identify slowest code paths instantly.

Error Tracking

Intelligent error grouping across services. Track new vs. regressing issues by deployed version.

Deployment Tracking

Compare error rate, latency, and throughput before/during/after each deployment. Auto-detect faulty deploys via Watchdog.

Continuous Profiler

Always-on low-overhead code profiling in production. See exactly which methods consume CPU, memory, and I/O.

Trace-Log Correlation

Trace IDs injected into logs. View logs side-by-side with the trace that generated them.

Python APM

PYTHON
# Shell: install the tracing library
pip install ddtrace

# Shell: auto-instrument at startup (recommended)
DD_SERVICE="my-api" DD_ENV="prod" DD_VERSION="1.2.0" \
  ddtrace-run python app.py

# Python: manual span creation inside application code
from ddtrace import tracer

with tracer.trace("db.query", resource="SELECT users") as span:
    span.set_tag("db.type", "postgres")
    result = db_query()  # db_query() stands in for your own database call

Java APM

JAVA: JVM flag
java -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.service=my-app \
  -Ddd.env=production \
  -Ddd.version=1.0.0 \
  -jar app.jar

OpenTelemetry Collector

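YAML: OTel Collector Datadog Exporter
# A pipeline needs at least one receiver; OTLP is assumed here
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  datadog:
    api:
      key:  ${DD_API_KEY}
      site: datadoghq.com
    traces:
      compute_stats_by_span_kind: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]
    metrics:
      receivers: [otlp]
      exporters: [datadog]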

Unified Service Tagging: Apply env, service, and version consistently across all telemetry types from a service to enable seamless pivoting between metrics, logs, and traces.

07 - Infrastructure Monitoring

Infrastructure & Container Monitoring

Monitor hosts, containers, Kubernetes clusters, cloud services, and serverless functions from a unified Infrastructure List with real-time metrics, health status, and live process monitoring.

Infrastructure Views

  • Infrastructure List: every host with key metrics and tag filtering
  • Host Map: hexagonal heatmap of all hosts by any metric
  • Containers page: resource metrics and faceted search across containers
  • Container Images: every image in your environment, plus vulnerability data
  • Orchestrator Explorer: monitor pods, deployments, namespaces
  • Control Plane Monitoring: API server, scheduler, controller manager, etcd
  • Live Processes: real-time process list with CPU, memory, I/O
  • Network Performance Monitoring: eBPF-based traffic flow visibility

Key System Metrics

Category | Example Metrics
CPU | system.cpu.user, system.load.1
Memory | system.mem.used, system.swap.used
Disk I/O | system.io.rkb_s, system.disk.used
Network | system.net.bytes_rcvd, system.net.bytes_sent

Cloud Integrations

AWS | EC2, ECS, EKS, Lambda, RDS, S3, CloudWatch, and more
GCP | GCE, GKE, Cloud SQL, Cloud Run, Pub/Sub, and more
Azure | VMs, AKS, Functions, Event Hubs, Blob, and more

Kubernetes Cluster Agent

The Cluster Agent efficiently gathers monitoring data from across an orchestrated cluster. It distributes check configurations to node Agents and ensures only one instance of each check runs per workload, preventing duplicate data collection across replicas.

The Cluster Agent holds configs and dispatches them to node Agents every 10 seconds. If a node Agent stops reporting, the Cluster Agent removes it from the active pool and redistributes its configurations.
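The dispatch-and-redistribute behaviour can be sketched as follows. This is a toy model under stated assumptions: the `dispatch` and `remove_node` helpers and the least-loaded placement rule are hypothetical, and the Cluster Agent's real balancing logic is not described in this guide.

```python
def _least_loaded(assignment):
    """Pick the node currently running the fewest checks."""
    return min(assignment, key=lambda node: len(assignment[node]))


def dispatch(checks, nodes):
    """Assign each check to exactly one node Agent."""
    assignment = {node: [] for node in nodes}
    for check in checks:
        assignment[_least_loaded(assignment)].append(check)
    return assignment


def remove_node(assignment, dead_node):
    """Drop a node that stopped reporting and redistribute its checks."""
    orphaned = assignment.pop(dead_node)
    for check in orphaned:
        assignment[_least_loaded(assignment)].append(check)
    return assignment


plan = dispatch(["redis", "postgres", "kafka", "http_check"], ["node-a", "node-b"])
print(plan)
plan = remove_node(plan, "node-b")
print(plan)
```

Every check appears on exactly one node before and after the failure, which is the property that prevents duplicate collection.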

Serverless

For AWS Lambda, Datadog collects metrics, traces, and logs via the Lambda Extension (preferred; it runs as a companion process inside the Lambda execution environment) or the Lambda Forwarder (CloudWatch-based). Supports enhanced Lambda metrics, cold start detection, and X-Ray integration.

08 - Monitors & Alerting

Monitors, Alerts & SLOs

Monitors evaluate metric, log, or trace queries against defined conditions and trigger alerts with notifications to PagerDuty, Slack, email, OpsGenie, and more. Evaluation frequency defaults to 1 minute.

Monitor Types

Metric Monitor

Alert when a metric threshold is crossed over a rolling window. Simple or multi-alert modes grouped by any tag.

Log Monitor

Alert on indexed log count, attribute unique count, or measure. Supports group-by facets. Max 2-day rolling window.

APM Monitor

Monitor service APM metrics (hits, errors, latency) or alert on Trace Analytics Indexed Span patterns.

Anomaly Monitor

ML-based detection learns seasonal patterns. Alerts on statistically unexpected deviations without manual thresholds.

Forecast Monitor

Predicts when a metric will breach a threshold. Ideal for disk capacity and resource planning.

Synthetic Monitor

Alert when a Synthetic API test or browser test fails or exceeds latency thresholds.

Service Check

Alert based on OK / WARNING / CRITICAL status submitted by Agent integration checks.

Composite Monitor

Combine monitors with boolean logic (AND, OR, NOT). Alert only when multiple conditions are simultaneously true.

Database Monitoring

Alert on slow queries, connection pool saturation, replication lag for PostgreSQL, MySQL, SQL Server, Oracle.

Monitor: Terraform

TERRAFORM
resource "datadog_monitor" "high_cpu" {
  name    = "High CPU Usage"
  type    = "metric alert"
  message = "CPU > 90% on {{host.name}} @pagerduty"

  query = "avg(last_5m):avg:system.cpu.user{env:production} by {host} > 90"

  monitor_thresholds {
    critical = 90
    warning  = 75
  }
  notify_no_data    = true
  no_data_timeframe = 20
  tags = ["env:production", "team:platform"]
}

Notification Channels

@pagerduty
@slack-channel
@email
@opsgenie
@victorops
@webhook
@teams
@jira

Monitor Configuration Reference

Option | Description
evaluation_window | Time range for query (last_5m, last_1h)
evaluation_frequency | How often query runs (default 1 min)
critical | Value triggering ALERT state
warning | Value triggering WARNING state
notify_no_data | Alert if no data is received
renotify_interval | Re-alert on sustained state (minutes)
require_full_window | Only evaluate with complete data window
multi_alert | Separate alert per dimension (e.g., per host)
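How these options interact during a single evaluation can be sketched in a few lines. The `evaluate` function below is a hypothetical, simplified model of an `avg(last_5m) > threshold` query with one point per minute; the state names and the "SKIPPED" outcome are illustrative assumptions.

```python
def evaluate(points, critical, warning=None, require_full_window=False, window_size=5):
    """Return the state for one evaluation of an avg-over-window query."""
    values = [p for p in points if p is not None]  # None marks a missing point
    if not values:
        return "NO DATA"
    if require_full_window and len(values) < window_size:
        return "SKIPPED"  # incomplete window: do not evaluate this round
    average = sum(values) / len(values)
    if average > critical:
        return "ALERT"
    if warning is not None and average > warning:
        return "WARN"
    return "OK"


print(evaluate([95, 92, 96, 91, 94], critical=90, warning=75))
print(evaluate([80, 80, 80, 80, 80], critical=90, warning=75))
print(evaluate([95, None, 96], critical=90, require_full_window=True))
```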

Service Level Objectives (SLOs)

Three SLO types:

Type | Based On
Metric-based | Good events / total events ratio
Monitor-based | Uptime % derived from monitor state
Time Slice | % of time windows in which the metric was within threshold
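For the metric-based type, the arithmetic is a ratio plus an error budget. The sketch below works through it with a hypothetical `slo_status` helper and made-up event counts.

```python
def slo_status(good_events, total_events, target_percent):
    """Compute the SLI and the share of error budget still unspent."""
    sli = 100.0 * good_events / total_events
    allowed_bad = total_events * (100.0 - target_percent) / 100.0
    actual_bad = total_events - good_events
    remaining = 100.0 * (1 - actual_bad / allowed_bad) if allowed_bad else 0.0
    return {"sli_percent": sli, "error_budget_remaining_percent": remaining}


# 1,000,000 requests, 500 of them bad, against a 99.9% target
status = slo_status(good_events=999_500, total_events=1_000_000, target_percent=99.9)
print(status)
```

With a 99.9% target, about 1,000 bad events are allowed in this window, so 500 bad events consume roughly half of the budget.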

09 - Dashboards & Visualization

Dashboards

Dashboards provide real-time insight into system health and business KPIs. Build from any combination of metrics, logs, traces, RUM, and events with template variables for dynamic scoping.

Dashboard Types

Type | Use Case
Timeboard | All widgets share the same time range. Best for metric correlation during investigations.
Screenboard | Free-form layout with independent time ranges per widget. Best for NOC status displays.
Notebook | Markdown + live graphs. Best for postmortems, runbooks, and incident investigations.

Widget Types

Timeseries
Query Value
Top List
Table
Distribution / Heatmap
Pie Chart
Scatter Plot
Geo Map
SLO Widget
Service Map
Log Stream
Alert Graph
Change
Funnel
Wildcard (Vega-Lite)

Template Variables: Terraform

TERRAFORM
resource "datadog_dashboard" "service_health" {
  title       = "Service Health"
  layout_type = "ordered"

  template_variable {
    name    = "env"
    prefix  = "env"
    default = "production"
  }

  widget {
    timeseries_definition {
      request {
        q = "avg:trace.web.request.duration{$env} by {service}"
      }
      title = "Request Latency by Service"
    }
  }
}

Datadog Sheets: Spreadsheet-style interface for analyzing telemetry. Perform lookups, build pivot tables, create calculated columns, and join datasets. Results can be added to dashboards or shared with colleagues.

10 - Synthetic Monitoring & RUM

Synthetics & Real User Monitoring

Synthetics proactively tests endpoints and journeys from Datadog-managed global locations. RUM captures real user interactions and performance from actual browsers and mobile apps.

Synthetic Test Types

Type | Description
API Test | HTTP, gRPC, WebSocket, TCP, SSL, DNS checks. Assert on status codes, body, headers, latency.
Multistep API | Chain multiple API calls with variables passed between steps. Test full auth + action flows.
Browser Test | Headless Chrome tests that record and replay user journeys. Detect visual regressions and broken UI.
Mobile Test | Native iOS and Android app testing with real device simulation.

Continuous Testing (CI/CD)

BASH: datadog-ci
npm install -g @datadog/datadog-ci

# Run Synthetic tests in a CI pipeline (needs both an API key and an application key)
datadog-ci synthetics run-tests \
  --public-id "abc-123-xyz" \
  --apiKey "$DD_API_KEY" \
  --appKey "$DD_APP_KEY" \
  --failOnCriticalErrors

Real User Monitoring (RUM)

  • Session Replay: pixel-perfect, video-like replay of real user sessions
  • Core Web Vitals: LCP, INP (which replaced FID), and CLS tracking with custom user timings
  • Error Tracking: frontend JS errors grouped and prioritized by user impact
  • RUM-APM Correlation: link frontend sessions to backend distributed traces
  • Funnel Analysis: track conversion rates through multi-step user flows
  • RUM Recommendations: AI-powered performance improvement suggestions (Preview)
  • Mobile RUM: iOS and Android performance monitoring and crash tracking

Feature Flags & A/B Testing

Datadog Feature Flags integrates with your existing feature flag provider to track flag evaluations alongside RUM data. Correlate feature flag rollouts directly with performance regressions and error spikes in the same view.

11 - Security

Security Products

Datadog unifies observability and security on one platform, eliminating the context-switching between tools that slows down incident response when a performance issue has a security dimension.

Cloud SIEM

Detect, investigate, and respond to security threats across cloud and on-premises systems. Correlates logs, metrics, and network data to surface high-fidelity signals.

Cloud Security Management

Continuously audits cloud configurations, assesses identity risks (CIEM), and detects runtime threats across AWS, GCP, and Azure.

App & API Protection

Detects and blocks threats targeting production applications and APIs in real time, with APM trace context for each attack signal.

Code Security

Detects and fixes vulnerabilities in first-party code, open-source dependencies (SCA), and infrastructure-as-code from dev through runtime (IAST).

Workload Protection

Uses eBPF to monitor file, network, and process activity at the kernel level. Detects privilege escalation, cryptomining, and unusual process behavior.

Audit Trail

Immutable audit log of all user and configuration changes across the Datadog platform: who changed which monitor, API key, or Agent config, and when.

12 - Integrations

Integrations Ecosystem

Datadog ships 1,000+ vendor-backed integrations, each providing ready-made Agent checks, dashboards, and monitors. Connect to cloud providers, databases, message queues, CI/CD tools, and observability standards.

Cloud Platforms

AWS
Google Cloud
Azure
Alibaba Cloud

Databases & Caches

PostgreSQL
MySQL
MongoDB
Redis
Cassandra
Elasticsearch
Oracle
SQL Server
CockroachDB

Message Queues & Streaming

Apache Kafka
RabbitMQ
Amazon SQS / Kinesis
Google Pub/Sub

Web Servers & Proxies

Nginx
Apache
HAProxy
Envoy
Istio
Traefik

DevOps & CI/CD

GitHub Actions
GitLab
Jenkins
CircleCI
ArgoCD
Terraform
Ansible
Chef
Puppet

Alerting & Incident Management

PagerDuty
OpsGenie
VictorOps
Slack
Microsoft Teams
Jira
ServiceNow
Webhooks

Observability Standards

OpenTelemetry (OTLP)
Prometheus
StatsD
JMX / JVM metrics

Datadog Internal Developer Portal (IDP): Software Catalog + Self-Service Actions + Scorecards. Visualize service hierarchies, enable self-service infrastructure provisioning, and evaluate production-readiness before release.

13 - Tagging Strategy

Tags & Unified Service Tagging

Tags are key:value metadata attached to every metric, log, trace, and event. A consistent tagging strategy is the foundation of effective filtering, alerting, and root-cause analysis in Datadog.

Unified Service Tagging (Required)

Tag | Purpose | Example
env | Deployment environment | production
service | Service / application name | checkout-api
version | Deployed code version | 1.4.2

ENV VARS: Kubernetes pod spec
env:
  - name:  DD_ENV
    value: production
  - name:  DD_SERVICE
    value: checkout-api
  - name:  DD_VERSION
    value: 1.4.2
  - name:  DD_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
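Inside the container, those environment variables map directly onto the three unified tags. The sketch below shows the mapping with a hypothetical `unified_service_tags` helper; tracing libraries perform an equivalent step internally, so this is for illustration only.

```python
import os


def unified_service_tags(environ=None):
    """Build env/service/version tags from DD_ENV, DD_SERVICE, DD_VERSION."""
    environ = os.environ if environ is None else environ
    mapping = {"env": "DD_ENV", "service": "DD_SERVICE", "version": "DD_VERSION"}
    # Skip any variable that is unset or empty
    return [f"{tag}:{environ[var]}" for tag, var in mapping.items() if environ.get(var)]


tags = unified_service_tags({
    "DD_ENV": "production",
    "DD_SERVICE": "checkout-api",
    "DD_VERSION": "1.4.2",
})
print(tags)  # ['env:production', 'service:checkout-api', 'version:1.4.2']
```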

Tagging Best Practices

  • Use lowercase key:value format consistently everywhere
  • Always tag with env: (production, staging, dev)
  • Always tag with service: for service-level views
  • Always tag with version: for deployment tracking
  • Use team: for ownership routing in monitor notifications
  • Use region: and availability-zone: for geographic scoping
  • Avoid high-cardinality values on metrics (user_id, request_id); use traces for those
  • Limit custom metric tag cardinality to control monthly custom metric costs
  • Apply tags at the Agent level for host-wide application
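The cardinality advice above comes down to counting distinct tag combinations, since each unique combination on a metric is its own time series. The sketch below compares a bounded tag with a per-user tag; `series_count` and the sample data are hypothetical.

```python
def series_count(points):
    """Count distinct (metric, tag set) combinations, i.e. distinct time series."""
    return len({(name, frozenset(tags)) for name, tags in points})


# 1,000 submissions tagged by one of three endpoints
low = [("api.requests", ["env:prod", f"endpoint:/e{i % 3}"]) for i in range(1000)]
# 1,000 submissions tagged by a unique user ID each
high = [("api.requests", ["env:prod", f"user_id:{i}"]) for i in range(1000)]

print(series_count(low), series_count(high))  # 3 1000
```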

Tag Sources (All Applied Automatically)

Source | Example Tags
Agent config (datadog.yaml) | env:prod, team:platform
Cloud provider metadata | region:us-east-1, instance-type:c5.xlarge
Container labels / K8s labels | app:frontend, kube_namespace:default
Integration check config | db:postgres-prod
DogStatsD metric submission | endpoint:/checkout

14 - API, SDKs & Infrastructure as Code

Datadog API & IaC

Datadog exposes a comprehensive REST API for programmatic access to all platform resources. Official SDKs, Terraform provider, and datadog-ci CLI enable full Datadog-as-Code workflows.

REST API: Key Endpoints

Endpoint | Action
POST /api/v1/series | Submit custom metrics
POST /api/v2/logs | Send log events
GET /api/v1/monitor | List all monitors
POST /api/v1/monitor | Create a monitor
GET /api/v1/dashboard | List all dashboards
POST /api/v1/events | Post to Events Stream
GET /api/v1/query | Metrics query over time range

PYTHON: API client
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi

config = Configuration()
config.api_key["apiKeyAuth"] = "<DD_API_KEY>"
config.api_key["appKeyAuth"] = "<DD_APP_KEY>"

with ApiClient(config) as api_client:
    api = MonitorsApi(api_client)
    print(api.list_monitors())
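Without the SDK, the same API can be reached with plain HTTP. The sketch below prepares, but deliberately does not send, a `POST /api/v1/series` request using only the standard library; `build_series_request` is a hypothetical helper, and the metric name and key are placeholders.

```python
import json
import time
import urllib.request


def build_series_request(api_key, metric, value, tags, site="datadoghq.com"):
    """Prepare (but do not send) a POST /api/v1/series request."""
    payload = {
        "series": [{
            "metric": metric,
            "type": "gauge",
            "points": [[int(time.time()), value]],  # [unix timestamp, value]
            "tags": tags,
        }]
    }
    return urllib.request.Request(
        url=f"https://api.{site}/api/v1/series",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


request = build_series_request("<DD_API_KEY>", "app.queue.depth", 42, ["env:prod"])
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would submit it; that step is left out so the example runs without credentials.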

Terraform Provider

TERRAFORM: Provider setup
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}
# Auth via DD_API_KEY + DD_APP_KEY env vars
provider "datadog" {
  api_url = "https://api.datadoghq.com/"
}

Terraform Resources

datadog_monitor
datadog_dashboard
datadog_service_level_objective
datadog_synthetics_test
datadog_logs_index
datadog_logs_pipeline
datadog_metric_tag_configuration
datadog_user / datadog_role
datadog_security_monitoring_rule
datadog_integration_aws
datadog_downtime

datadog-ci CLI

BASH
npm install -g @datadog/datadog-ci

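# Upload source maps for RUM error tracking
# (the --minified-path-prefix URL is a placeholder for where your bundles are served)
datadog-ci sourcemaps upload ./dist \
  --service my-app --release-version 1.4.2 \
  --minified-path-prefix https://example.com/static/js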

# Report CI test results
datadog-ci junit upload --service my-app ./test-results.xml

15 - Glossary

Key Terminology

Core terms used across the Datadog platform and documentation.

Agent
Open-source Go software that runs on monitored hosts to collect metrics, logs, traces, and events and forward them to Datadog.
DogStatsD
StatsD-compatible daemon built into the Agent for receiving custom application metrics over UDP (port 8125) or Unix socket.
APM
Application Performance Monitoring: distributed tracing, service maps, latency analysis, and Continuous Profiler for application code.
Span
A named, timed unit of work in a distributed trace. Represents one operation, such as an HTTP call, DB query, or function invocation.
Trace
A collection of spans representing the complete end-to-end journey of a single request through a distributed system.
Tag
A key:value pair attached to metrics, logs, traces, and events to enable filtering, scoping, and grouping in dashboards and monitors.
Unified Service Tagging
Applying env, service, and version tags consistently across all telemetry types, enabling seamless correlation between metrics, logs, and traces.
Monitor
A rule that evaluates metric, log, or trace data against conditions and triggers notifications when thresholds are crossed.
SLO
Service Level Objective: a measurable reliability target (e.g., 99.9% uptime) tracked over a rolling or calendar time window.
Watchdog
Datadog's ML-powered anomaly detection engine. Automatically surfaces unusual patterns in metrics, logs, and traces without manual threshold configuration.
Autodiscovery
Mechanism for automatically detecting and configuring integration checks based on container labels or Kubernetes annotations in dynamic environments.
Cluster Agent
A special Kubernetes deployment that efficiently coordinates monitoring data collection across an entire cluster, distributing checks to node Agents.
Forwarder
Agent component that buffers telemetry in memory and sends it to Datadog backend over HTTPS. Handles network interruptions without data loss.
RUM
Real User Monitoring: captures actual end-user interactions, page loads, JS errors, and performance metrics from real browsers and mobile apps.
Session Replay
Pixel-perfect playback of a real user session, showing exactly what the user experienced in their browser for UX debugging.
Synthetic Test
A scripted, scheduled test that proactively checks API endpoints or user journeys from Datadog-managed global locations.
Flex Logs
Cost-effective log storage tier supporting in-place search without rehydration. Flex Frozen sub-tier provides up to 7-year compliance retention.
Grok Parser
A pattern-based log parsing tool in Datadog pipelines that extracts structured attributes from raw unstructured log text using named capture groups.
Fleet Automation
Datadog feature for remotely managing, configuring, and upgrading all Agents across all environments directly from the Datadog UI.
Apdex
Application Performance Index: a 0 to 1 score measuring user satisfaction based on response time thresholds. Available for HTTP/web APM services.
DORA Metrics
Deployment Frequency, Lead Time, Change Failure Rate, and Time to Restore, tracked in Datadog IDP to measure software delivery performance.
Issue Correlation
AI-powered feature that automatically maps related issues across services, tracing problems to their true origin to reduce alert noise.