A comprehensive guide to Ocient's architecture, storage technology, query engine, machine learning, geospatial capabilities, deployment options, and administration — drawn directly from official documentation.
Ocient is a modern, purpose-built hyperscale data warehouse engineered to run real-time OLAP analytics on the world's largest and most complex structured and semi-structured datasets using standard ANSI SQL.
Founded in 2016 by industry veterans, Ocient set out to solve a fundamental problem: traditional analytical databases don't scale cost-effectively to extremely large datasets, while Big Data solutions can scale but are expensive, complex, and too slow for interactive analysis. Ocient bridges that gap — delivering interactive query speeds at petabyte scale.
"Queries that used to take hours, or not run at all, now execute in seconds, and systems that used to fill up multiple data center racks now require up to 90% less space and energy." — Ocient Documentation
A purpose-built execution engine and I/O layer deliver 10×–50× faster performance than competing solutions on hyperscale datasets.
Full SQL dialect closely following PostgreSQL conventions. JDBC, pyocient, and ODBC drivers available. No proprietary query language to learn.
Designed to run on standard hardware. Supports on-premises, OcientCloud®, and public cloud (AWS, GCP) deployments.
OcientML® places the entire machine learning stack inside the database. Train and score models against petabytes of data with SQL.
OcientGeo® provides native geospatial and spatiotemporal analytics including point, polygon, and complex geographic analysis.
Up to 90% smaller data center footprint. OcientCloud® runs on 100% renewable energy in a LEED-certified facility.
Key differentiation: Ocient consolidates real-time analytics, traditional OLAP, ETL/ELT pipelines, geospatial analysis, and in-database machine learning onto a single unified platform — eliminating the cost and complexity of multiple specialized systems.
Ocient is a distributed system built around the Compute-Adjacent Storage Architecture™ (CASA) — co-locating NVMe SSD storage with compute resources to eliminate common bottlenecks.
Every Ocient system is composed of three distinct node roles, each with a specific responsibility. These roles are consistent across all environments — only the number of nodes varies.
SQL nodes are the entry point to the system. They receive incoming SQL from JDBC/pyocient clients, parse statements, and create an execution plan using one of two optimization methods. Once planned, the work is distributed to foundation nodes. SQL nodes also handle final aggregations and joins on intermediate result sets returned from foundation nodes, then package and return results to the client.
Administrators connect to SQL nodes via CLI or SQL client to issue DDL/DCL commands, which the node then propagates throughout the system. Multiple SQL nodes provide automatic load balancing across connections.
Foundation nodes are the heart of Ocient — they store user data in columnar format on NVMe SSDs and perform the bulk of query processing. The CASA principle means data and compute are co-located: when a query arrives, a foundation node processes as much of it as possible against its own local data before returning intermediate results upstream.
Foundation nodes contain the majority of storage in an Ocient system and are typically the most numerous node type. They connect to the SQL nodes over the 100 Gbps high-speed network, and every foundation node is connected to every SQL node.
Loader nodes handle the full ETL/ELT ingestion lifecycle: extracting from batch file sources (e.g., S3) or streaming sources (e.g., Kafka), transforming using SQL functions, indexing data, and loading into foundation nodes. They operate in a horizontal scale-out fashion, so adding more loader nodes increases ingestion throughput transparently.
Loader nodes also enforce exactly-once delivery guarantees — ensuring data is never duplicated even in the case of network or system failure during loading.
Two separate networks connect the nodes of an Ocient system. A 100 Gbps high-speed network handles query execution traffic between SQL nodes and foundation nodes, and data movement between loader nodes and foundation nodes. A 10 Gbps network handles administrative flows — DDL/DCL command propagation across the system.
Every foundation node is connected to every SQL node. All nodes are connected through these two networks, ensuring high throughput and low latency for every workload type.
CASA (Compute-Adjacent Storage Architecture): By co-locating NVMe drive storage with compute resources, Ocient avoids the network bottlenecks common in cloud-style separated storage/compute architectures. Data never needs to leave the node for first-pass processing.
Ocient stores data in a columnar format organized into segments and segment groups, protected by erasure coding for fault tolerance without the overhead of full data replication.
While tables are created and queried using familiar SQL syntax, Ocient stores data on disk in a highly compressed columnar format. Segments are the fundamental storage unit: they contain rows organized by column, along with embedded indexes and statistical metadata used to accelerate query processing.
As data is ingested by loader nodes, it is initially stored in row-based pages on foundation nodes for rapid ingestion throughput. As pages accumulate, loader nodes convert them into columnar segments — highly compressed structures that include data, multiple indexes, and metadata.
Multiple segments combine to form segment groups. A segment group has a fixed width (number of segments) and a defined number of parity blocks for resilience. Segment groups are physically stored in a storage cluster — a set of foundation nodes with an associated storage space.
When configuring an Ocient system, administrators define at least one storage space and storage cluster. At the storage space level, administrators set the width (number of segments per group) and parity_width (number of parity blocks), which determines the level of fault tolerance.
Ocient uses erasure coding — not data replication — for hardware fault tolerance. Erasure coding computes parity blocks that allow the system to reconstruct any missing data, without needing to store a second (or third) full copy of all data.
This design means an Ocient system requires significantly less storage than replication-based approaches, while still providing full recovery from hardware failures. The coding block is the smallest unit of recovery and the unit of parity calculation.
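For illustration (these widths are assumed, not Ocient defaults): a segment group carrying 16 data segments plus 2 parity blocks adds only 2/16 = 12.5% storage overhead, yet the group survives the loss of any two of its members. Triple replication with comparable resilience would add 200% overhead.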
Each table can designate a column as its TimeKey. This column partitions data on disk by time, enabling the system to rapidly skip irrelevant time ranges during query execution without reading unnecessary data. Since most analytical queries include a time filter, this is a critical performance mechanism.
In addition to the TimeKey, tables can specify a Clustering Key: one or more columns that are frequently queried together. The system subdivides time-partitioned segments further on disk according to these key columns, enabling fast lookup of records with matching key values within a partition.
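As a sketch of how these keys prune I/O (using the events table defined in the Indexing section below; predicate values are hypothetical), a query that filters on both the TimeKey column and a leading Clustering Key column lets the system skip entire time partitions and then seek directly to matching key values:
-- event_ts is the TimeKey: partitions outside January 2024 are skipped entirely
-- user_id leads the Clustering Key: matching rows are found without scanning the partition
SELECT event_type, COUNT(*) AS events
FROM events
WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
  AND event_ts < TIMESTAMP '2024-02-01 00:00:00'
  AND user_id = 42
GROUP BY event_type;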
Ocient uses SQL-defined data pipelines as the primary mechanism for ingesting data from both batch file sources and real-time streaming sources, with real-time transformation and exactly-once delivery semantics.
A data pipeline is a database object that defines end-to-end data processing: extraction from a source, optional transformations, and load into one or more Ocient tables. Pipelines are defined, started, stopped, and modified using standard DDL statements.
A pipeline definition contains three sections: the data source, the extract format, and the transformation/target specification:
-- Create a pipeline loading CSV data from S3 into the orders table
CREATE PIPELINE orders_pipeline
SOURCE S3
ENDPOINT 'https://s3.us-east-1.amazonaws.com'
BUCKET 'my-data-bucket'
FILTER 'orders/'
EXTRACT FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
INTO "public"."orders"
SELECT
$1 AS id,
$2 AS user_id,
$3 AS product_id,
TO_TIMESTAMP($4, 'YYYY-MM-DD HH24:MI:SS') AS order_ts,
CAST($5 AS DECIMAL(10,2)) AS amount;
-- Lifecycle management
PREVIEW PIPELINE orders_pipeline;
START PIPELINE orders_pipeline;
STOP PIPELINE orders_pipeline;
DROP PIPELINE orders_pipeline;
| State | Description |
|---|---|
| CREATED | Pipeline defined but never started. No tasks created, no files listed. |
| RUNNING | Actively processing data. At least one task is queued, running, or cancelling. |
| STOPPED | User-initiated stop. All tasks complete, failed, or cancelled. Position retained. |
| COMPLETED | All assigned work finished within error limits. All tasks complete. |
| FAILED | Error limits exceeded. At least one task failed. Pipeline will not retry. |
Exactly-Once Delivery: Loader nodes maintain pipeline position and deduplication state. If a pipeline is stopped and restarted, it continues from where it left off — no data is duplicated and no data is lost.
Scale-Out Loading: Pipelines execute across all available loader nodes in parallel. Work is partitioned across tasks (file chunks or Kafka partitions). Add more loader nodes to increase throughput without any configuration changes.
ELT Support: In addition to pipelines, Ocient supports CREATE TABLE AS SELECT (CTAS) and INSERT INTO … SELECT for ELT workflows that extract data and write results directly into new or existing tables.
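A minimal ELT sketch using the orders table from the pipeline example above (the rollup table name is hypothetical):
-- Materialize a daily rollup with CTAS
CREATE TABLE daily_order_totals AS
SELECT CAST(order_ts AS DATE) AS order_date, SUM(amount) AS total_amount
FROM "public"."orders"
GROUP BY 1;
-- Append newly loaded rows with INSERT INTO ... SELECT
INSERT INTO daily_order_totals
SELECT CAST(order_ts AS DATE), SUM(amount)
FROM "public"."orders"
WHERE order_ts >= CURRENT_DATE
GROUP BY 1;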
Ocient's query engine was built from scratch using modern design principles, deeply integrated with the storage and I/O layer to minimize on-disk reads and maximize parallel processing throughput.
When a user issues a SQL statement through a JDBC or pyocient client, the receiving SQL node parses the statement, builds an execution plan, and distributes the work to the foundation nodes.
The foundation nodes do as much work as possible locally (filtering, projection, local aggregation) before returning intermediate results to the SQL node, minimizing data movement across the network.
For each query, the Ocient system compiles a custom I/O pipeline for every relevant data segment. These pipelines are tailored to use any applicable keys or indexes to reduce the volume of data that must be read from disk.
This design means that query performance directly benefits from proper index configuration — the system does not perform table scans when indexes can be used to skip irrelevant data blocks.
SQL nodes use two complementary query optimization strategies. The cost-based optimizer analyzes data statistics and available indexes to construct efficient execution plans. For complex workloads, the system can also apply rule-based optimizations to rewrite query plans before execution.
INNER, LEFT/RIGHT OUTER, CROSS joins across large tables. Join pushdown to foundation nodes where possible.
Standard (SUM, AVG, COUNT, MIN, MAX) plus sorted aggregates and window/analytic functions.
Full OVER() clause support with PARTITION BY, ORDER BY, ROWS/RANGE frames, LAG, LEAD, RANK, DENSE_RANK, NTILE (see the sketch after this list).
CREATE/ALTER/DROP TABLE, VIEW, INDEX, SCHEMA, DATABASE. GRANT/REVOKE role-based access control.
Query JSON and complex data types including arrays, tuples, and IP addresses inline with SQL operators.
Configurable result set caching to avoid re-executing identical queries. Managed per database via DBA settings.
Assign priority levels to users and groups. Control resource allocation and query priority across concurrent workloads.
SQL dialect closely follows PostgreSQL conventions. Most PostgreSQL functions work identically in Ocient.
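A brief sketch of the window-function support noted above, again using the orders table from the pipeline example:
-- Rank each user's orders by amount and compare each order to the prior one
SELECT user_id, order_ts, amount,
       RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS amount_rank,
       LAG(amount) OVER (PARTITION BY user_id ORDER BY order_ts) AS prev_amount
FROM "public"."orders";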
-- JDBC Connection String
jdbc:ocient://<sql_node_host>:4050/<database>
// Java JDBC example: open a connection to an Ocient SQL node
import java.sql.Connection;
import java.sql.DriverManager;
public class OcientConnect {
    public static void main(String[] args) throws Exception {
        // Register the Ocient JDBC driver, then connect (default port 4050)
        Class.forName("com.ocient.jdbc.JDBCDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:ocient://host:4050/mydb", "username", "password");
        conn.close();
    }
}
# Python pyocient Example
import pyocient
conn = pyocient.connect(
dsn="ocient://user:pass@host:4050/mydb"
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM events WHERE ts > '2024-01-01'")
rows = cursor.fetchall()
Multi-layer indexing is central to Ocient's performance model. Indexes are embedded in segments alongside the data, and the segment keys (TimeKey and Clustering Key) require no separate storage overhead.
Partitions all data in a table by time. Queries with time filters skip entire partitions, dramatically reducing I/O. Defined at table creation time and cannot be changed later. Recommended for any time-series dataset. No additional storage required.
Sorts and subdivides data within time partitions by one or more columns frequently queried together. Enables fast lookup without full partition scans. Ideal for high-cardinality filter columns like user_id, device_id, or IP address. No additional storage required.
Dense or sparse B-tree style index on numeric columns. Dramatically reduces I/O for equality and range queries on numeric columns that are not part of the clustering key. Useful for columns with medium-to-high cardinality.
Full text index on VARCHAR columns for exact-match and prefix queries. Enables fast lookup on string identifier columns without full segment scans.
An N-gram index on VARCHAR columns enables efficient LIKE queries with wildcard patterns (e.g., WHERE col LIKE '%substring%'). Particularly useful for log analysis and text search workloads.
Spatial index on ST_POINT and ST_POLYGON columns for bounding box and containment queries. Used by OcientGeo® functions to efficiently resolve geographic predicates without scanning entire segments.
-- Create table with TimeKey and Clustering Key
CREATE TABLE events (
event_id BIGINT,
user_id BIGINT,
event_ts TIMESTAMP,
event_type VARCHAR(64),
ip_addr IP,
payload VARCHAR(512)
)
TIMEKEY event_ts
CLUSTERING KEY (user_id, event_type);
-- Add secondary indexes after table creation
CREATE INDEX idx_ip ON events(ip_addr) USING NUMERIC;
CREATE INDEX idx_payload ON events(payload) USING NGRAM;
CREATE INDEX idx_event_type ON events(event_type) USING STRING;
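A few query shapes that exercise these secondary indexes (a sketch; predicate values are hypothetical):
-- The NGRAM index on payload serves the wildcard LIKE pattern
SELECT COUNT(*) FROM events WHERE payload LIKE '%timeout%';
-- The STRING index on event_type serves exact-match lookups
SELECT COUNT(*) FROM events WHERE event_type = 'login';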
Best Practice: Configure all keys and indexes before loading large amounts of data. Data loaded before index creation is stored without the index structures; re-indexing existing data requires a segment rebuild operation.
OcientML® places the entire machine learning stack inside the Ocient Hyperscale Data Warehouse, eliminating the need for separate ML tooling, data movement, or external model training infrastructure.
Key insight: Traditional ML workflows require extracting data from a data warehouse, loading it into a separate ML platform, training the model, and then returning predictions. OcientML® collapses this entire workflow into a single system — train and score directly against petabytes of fresh data using SQL.
Ordinary least squares regression for continuous numeric target prediction. Train and score in SQL.
Binary and multinomial classification. Outputs class probabilities alongside predictions.
Unsupervised cluster assignment. Assign cluster IDs to new data using a trained K-means model in SQL queries.
Instance-based classification and regression. Find nearest neighbors at query time across large datasets.
Interpretable classification and regression trees with configurable depth and split criteria.
Multi-layer perceptrons for both classification and regression tasks. Configurable layers and activation functions.
Probabilistic classifier based on Bayes' theorem. Fast and effective for text classification and anomaly detection.
Time series forecasting using autoregressive models. Forecast future values based on historical patterns.
Dimensionality reduction. Reduce feature space for downstream analysis or visualization.
Classification and regression with kernel support. Effective for high-dimensional data.
-- Train a logistic regression model
CREATE MODEL churn_model
TYPE LOGISTIC
TARGET churned
FEATURES (tenure_days, monthly_spend, support_tickets, last_login_days)
AS SELECT tenure_days, monthly_spend, support_tickets,
last_login_days, churned
FROM customer_features
WHERE training_set = TRUE;
-- Score new customers using the trained model
SELECT
customer_id,
PREDICT(churn_model, tenure_days, monthly_spend, support_tickets, last_login_days) AS churn_probability
FROM customers
WHERE active = TRUE;
-- K-Means clustering on user behavior
CREATE MODEL user_segments
TYPE KMEANS
CLUSTERS 5
AS SELECT page_views, session_duration, purchase_count
FROM user_behavior;
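Cluster assignment for new rows can then follow the same PREDICT-style pattern shown in the logistic regression example (a sketch; the exact invocation for K-means models is assumed):
-- Assign each user to one of the 5 trained segments
SELECT page_views, session_duration, purchase_count,
       PREDICT(user_segments, page_views, session_duration, purchase_count) AS segment_id
FROM user_behavior;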
Trained models are stored as database objects, subject to the same role-based access controls as tables and views. Models can be exported to BI tools via built-in connectors or scored directly inside dashboards using the SQL interface.
OcientGeo® provides a comprehensive suite of geospatial and spatiotemporal analysis capabilities built directly into the SQL engine — enabling complex geographic queries at petabyte scale without external GIS systems.
| Type | Description |
|---|---|
| ST_POINT | Single geographic coordinate (lon, lat) |
| ST_LINESTRING | Ordered sequence of points forming a line or path |
| ST_POLYGON | Closed polygon defined by a ring of coordinates |
| GEOGRAPHY | Generic geography type with spherical earth calculations |
| GEOHASH | Compact string encoding of a geographic location |
-- Find all events within 5km of a location
SELECT event_id, event_ts,
ST_Distance(location,
ST_GeoGPoint(-87.6298, 41.8781)) AS dist_m
FROM mobile_events
WHERE ST_DWithin(
location,
ST_GeoGPoint(-87.6298, 41.8781),
5000 -- meters
)
ORDER BY dist_m;
-- Count devices in a polygon region
SELECT COUNT(*) AS devices_in_zone
FROM device_pings
WHERE ST_Contains(
ST_Polygon_FromEWKT('POLYGON((-87.7 41.9,-87.5 41.9,...))'),
device_location
);
-- Aggregate by geohash grid cell
SELECT
ST_GeoHash(location, 6) AS cell,
COUNT(*) AS event_count
FROM events
GROUP BY cell
ORDER BY event_count DESC;
Network tower coverage analysis, customer location density mapping, spatiotemporal churn analysis.
Geo-targeted ad delivery, store visit attribution, location-based audience segmentation.
Territory management, satellite imagery analysis, geofence monitoring and compliance.
Fraud detection via location anomaly, risk exposure by geography, branch coverage modeling.
Ocient is designed for flexible deployment with consistent feature parity across all options. Every deployment includes 24/7 critical on-call support, subscription licensing, training, and access to updates.
Management Services: Regardless of deployment model, organizations can engage the Ocient Management Services team to handle system setup, 24/7 monitoring, software updates, and ongoing operations. This is offered as a fully managed layer on top of any deployment option.
Ocient provides a Simulator — a single-node development and testing environment — for developers and data teams to explore Ocient's SQL dialect, pipeline functionality, and ML capabilities without provisioning a full multi-node system. The Simulator is ideal for query development, schema design, and pipeline prototyping.
As a unified platform, Ocient consolidates security capabilities in one place, simplifying compliance and reducing the attack surface compared to multi-system analytics stacks.
Optional TLS/SSL encryption for data in transit between clients and SQL nodes. Encryption at rest configurable at the storage space level.
Native username/password authentication, Single Sign-On (SSO) integration, and configurable authentication policies per user or group.
SQL GRANT/REVOKE DCL statements for fine-grained table, schema, database, and system-level permissions. User groups and roles for scalable access management (see the sketch after this list).
Comprehensive log-level monitoring and auditing of all query and administrative activity. System catalog tables expose audit trails for compliance reporting.
SOC 2 Type 2 certified. OcientCloud® operates in a LEED-certified facility. Designed to support HIPAA, FedRAMP-aligned, and government-regulated workloads.
Separate admin network (10 Gbps) isolated from query traffic (100 Gbps). Configurable network access controls and firewall integration at the infrastructure level.
ML models created with OcientML® are governed by the same RBAC framework as database objects — access to train, score, or drop models is controlled via SQL GRANT/REVOKE.
User group priority settings control query scheduling and resource allocation. Prevent any single workload or user from monopolizing system resources.
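A minimal sketch of these access controls, assuming PostgreSQL-style GRANT/REVOKE syntax (role and object names hypothetical):
-- Grant read access on a table to a role, then revoke it
GRANT SELECT ON "public"."events" TO analyst_role;
REVOKE SELECT ON "public"."events" FROM analyst_role;
-- OcientML models are governed by the same GRANT/REVOKE framework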
Ocient provides comprehensive tooling for both Database Administrators (DBAs), who manage data structures, users, and performance, and System Administrators, who manage node configuration, monitoring, and maintenance.
DBAs interact with Ocient through standard SQL DDL and DCL commands via any connected SQL client. Key responsibilities include defining databases, schemas, tables, views, and indexes; managing users, groups, and permissions; and monitoring query and pipeline activity through the system catalog.
System administrators handle the physical infrastructure layer of an Ocient deployment, including node configuration, network setup, software updates, monitoring, and maintenance.
All operational metadata in Ocient is exposed through queryable system catalog tables (prefixed with sys_). DBAs and system admins can query these tables using standard SQL to monitor the state of the system in real time.
-- View all pipeline events
SELECT * FROM sys_pipeline_events
WHERE pipeline_name = 'orders_pipeline'
ORDER BY event_ts DESC;
-- View pipeline errors for troubleshooting
SELECT * FROM sys_pipeline_errors
WHERE error_ts > NOW() - INTERVAL '1 hour';
-- Check active queries and their state
SELECT query_id, user_name, state, elapsed_ms, query_text
FROM sys_queries
WHERE state = 'RUNNING';
-- Check node health and storage utilization
SELECT node_id, node_role, status, storage_used_gb, storage_total_gb
FROM sys_nodes;
Ocient provides a native Datadog integration that ships metrics in the OpenMetrics format to Datadog dashboards. Install the integration with:
# Install the Ocient integration via the Datadog Agent
datadog-agent integration install -t datadog-ocient==1.0.0
# ocient.d/conf.yaml
instances:
  - use_openmetrics: true
    openmetrics_endpoint: http://<master_node_host>:9090/metrics
Metrics include query performance counters, disk usage, database table statistics, loading throughput, and system health indicators.
Ocient integrates with the broader data ecosystem through standard SQL interfaces and purpose-built connectors for BI tools, data engineering platforms, and monitoring systems.
Official Java JDBC driver for connecting any JVM-based application, BI tool, or ETL framework to Ocient. Default port 4050. Used by Tableau, DBeaver, Metabase, and other JDBC-compatible tools.
com.ocient.jdbc.JDBCDriver
jdbc:ocient://host:4050/db
Official Python driver (DB-API 2.0 compliant). Enables Python applications, data science notebooks, and Airflow DAGs to connect to Ocient with native Python syntax.
pip install pyocient
import pyocient
conn = pyocient.connect(dsn="...")
Ocient provides a machine-readable documentation endpoint at https://docs.ocient.com/llms.txt for optimal content extraction by LLM-based AI tools. This enables AI coding assistants to provide accurate SQL and API suggestions when working with Ocient systems.
Real-time bidding analytics, attribution modeling, audience segmentation at billions of events per day.
Network visibility, data retention & regulatory compliance, CDR analysis, spatiotemporal network performance.
National security analytics, large-scale surveillance data processing, satellite imagery, intelligence workflows.
Real-time fraud detection, risk analytics, trade surveillance, regulatory reporting on massive transaction histories.
Log analytics, infrastructure performance monitoring, security event correlation at hyperscale data volumes.
Location intelligence, proximity analysis, spatiotemporal data fusion for mapping and territory analytics.
Core terminology used throughout the Ocient Hyperscale Data Warehouse documentation and product.