DuckDB: In-Process Analytical Database

DuckDB is an in-process SQL OLAP database management system designed for analytical workloads. Often called the "SQLite for analytics," DuckDB combines the simplicity of embedded databases with the power of columnar analytical processing, making it ideal for data science, analytics, and ETL workflows.

Overview

What is DuckDB?

DuckDB is a columnar analytical database that runs embedded within applications, similar to how SQLite works for OLTP workloads. It's designed to efficiently execute complex analytical queries on structured data without requiring a separate database server.

Key Characteristics

  • In-Process: Runs within your application process
  • Columnar Storage: Optimized for analytical queries
  • ACID Compliant: Full transaction support
  • SQL Standard: Extensive SQL feature support
  • Zero Dependencies: Single binary with no external dependencies
  • Cross-Platform: Works on Linux, macOS, Windows, and more

Design Philosophy

DuckDB follows the "embedded analytics" philosophy:

  • Simple Deployment: No server setup or configuration
  • High Performance: Vectorized execution engine
  • Developer Friendly: Easy integration with existing workflows
  • Data Science Focus: Built for analytical workloads

Architecture and Design

Core Architecture

┌─────────────────────────────────────┐
│             Application             │
├─────────────────────────────────────┤
│              DuckDB API             │
├─────────────────────────────────────┤
│           Query Processor           │
│   ┌─────────┬─────────┬─────────┐   │
│   │ Parser  │Optimizer│Executor │   │
│   └─────────┴─────────┴─────────┘   │
├─────────────────────────────────────┤
│            Storage Engine           │
│  ┌─────────────┬─────────────────┐  │
│  │  Columnar   │   Transaction   │  │
│  │   Storage   │     Manager     │  │
│  └─────────────┴─────────────────┘  │
└─────────────────────────────────────┘

Vectorized Execution Engine

DuckDB uses vectorized query execution for high performance:

Traditional row-at-a-time:
┌─────┐      ┌────┐      ┌─────┐
│ Row │ ───> │ Op │ ───> │ Row │
└─────┘      └────┘      └─────┘

DuckDB vectorized (batches of ~2048 values):
┌─────────────┐      ┌────┐      ┌─────────────┐
│   Vector    │ ───> │ Op │ ───> │   Vector    │
│ (2048 rows) │      └────┘      │ (2048 rows) │
└─────────────┘                  └─────────────┘
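
The difference is easiest to see outside of DuckDB. The toy Python sketch below, using NumPy as a stand-in for DuckDB's C++ vector primitives, contrasts processing one value at a time with processing whole batches; it is purely illustrative and not DuckDB internals.

import numpy as np

values = np.random.rand(100_000)

# Row-at-a-time: one interpreter round-trip per value
total_rows = 0.0
for v in values:
    if v > 0.5:
        total_rows += v

# Vectorized: each operator works on a whole batch of values at once
total_vec = values[values > 0.5].sum()

assert np.isclose(total_rows, total_vec)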

Columnar Storage Format

Row-Oriented (Traditional):
┌────┬──────┬─────┬──────┐
│ ID │ Name │ Age │ City │
├────┼──────┼─────┼──────┤
│  1 │ John │  25 │ NYC  │
│  2 │ Jane │  30 │ LA   │
│  3 │ Bob  │  35 │ CHI  │
└────┴──────┴─────┴──────┘

Column-Oriented (DuckDB):
ID: [1, 2, 3]
Name: [John, Jane, Bob]
Age: [25, 30, 35]
City: [NYC, LA, CHI]
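
A small, purely illustrative Python sketch of why this matters for analytics: with a columnar layout, an aggregate over one column only touches that column's array, while a row layout drags every field along.

rows = [
    {"id": 1, "name": "John", "age": 25, "city": "NYC"},
    {"id": 2, "name": "Jane", "age": 30, "city": "LA"},
    {"id": 3, "name": "Bob",  "age": 35, "city": "CHI"},
]
avg_age_rows = sum(r["age"] for r in rows) / len(rows)    # walks whole rows

columns = {
    "id":   [1, 2, 3],
    "name": ["John", "Jane", "Bob"],
    "age":  [25, 30, 35],
    "city": ["NYC", "LA", "CHI"],
}
avg_age_cols = sum(columns["age"]) / len(columns["age"])  # touches one array

print(avg_age_rows, avg_age_cols)  # 30.0 30.0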

Installation and Setup

Installation Methods

1. Command Line Interface

# Download and install DuckDB CLI
wget https://github.com/duckdb/duckdb/releases/download/v0.9.2/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
chmod +x duckdb
sudo mv duckdb /usr/local/bin/

# Verify installation
duckdb --version

2. Python Integration

# Install DuckDB Python package
pip install duckdb

# Optional: install pandas and PyArrow for DataFrame/Arrow interoperability
pip install duckdb pandas pyarrow

3. Language Bindings

# R
install.packages("duckdb")

# Node.js
npm install duckdb

# Java
# Add to pom.xml
<dependency>
    <groupId>org.duckdb</groupId>
    <artifactId>duckdb_jdbc</artifactId>
    <version>0.9.2</version>
</dependency>

Basic Usage

Command Line Interface

# Start DuckDB CLI
duckdb

# Connect to a database file
duckdb mydata.duckdb

# Run SQL commands
D SELECT 'Hello, DuckDB!' as greeting;
┌─────────────────┐
│    greeting     │
│     varchar     │
├─────────────────┤
│ Hello, DuckDB!  │
└─────────────────┘

Python Integration

import duckdb

# Create connection
conn = duckdb.connect('mydata.duckdb')

# Execute query
result = conn.execute("SELECT 42 as answer").fetchall()
print(result) # [(42,)]

# Close connection
conn.close()
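
A connection object is optional for quick exploration: the module-level duckdb.sql() call runs against a default in-memory database and can query data files directly (the CSV path below is just an example).

import duckdb

# Default in-memory database; nothing is written to disk
print(duckdb.sql("SELECT 42 AS answer").fetchall())   # [(42,)]

# Query a file directly without creating a table first (example path)
duckdb.sql("SELECT COUNT(*) FROM 'sales_data.csv'").show()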

Core Features and Capabilities

1. SQL Support

DuckDB provides extensive SQL standard compliance:

Data Types

-- Numeric types
CREATE TABLE numeric_example (
    tiny_int    TINYINT,        -- 1 byte
    small_int   SMALLINT,       -- 2 bytes
    int_val     INTEGER,        -- 4 bytes
    big_int     BIGINT,         -- 8 bytes
    decimal_val DECIMAL(10,2),  -- fixed-point: precision 10, scale 2
    real_val    REAL,           -- 4-byte float
    double_val  DOUBLE          -- 8-byte float
);

-- String and binary types
CREATE TABLE string_example (
    varchar_col VARCHAR(100),
    text_col    TEXT,
    blob_col    BLOB
);

-- Date and time types
CREATE TABLE datetime_example (
    date_col        DATE,
    time_col        TIME,
    timestamp_col   TIMESTAMP,
    timestamptz_col TIMESTAMPTZ,
    interval_col    INTERVAL
);

-- Complex types
CREATE TABLE complex_example (
    array_col  INTEGER[],
    struct_col STRUCT(name VARCHAR, age INTEGER),
    map_col    MAP(VARCHAR, INTEGER),
    json_col   JSON
);
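
Nested values can also be constructed directly as literals; a minimal check from Python (literal values only, no table required):

import duckdb

# LIST, STRUCT and MAP literals evaluated in a throwaway in-memory query
duckdb.sql("""
    SELECT
        [1, 2, 3]                    AS list_col,
        {'name': 'Alice', 'age': 30} AS struct_col,
        MAP(['a', 'b'], [1, 2])      AS map_col
""").show()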

Advanced SQL Features

-- Window functions
SELECT
name,
salary,
AVG(salary) OVER (PARTITION BY department) as dept_avg,
ROW_NUMBER() OVER (ORDER BY salary DESC) as salary_rank
FROM employees;

-- Common Table Expressions (CTEs)
WITH high_earners AS (
SELECT * FROM employees WHERE salary > 100000
),
dept_stats AS (
SELECT
department,
COUNT(*) as emp_count,
AVG(salary) as avg_salary
FROM high_earners
GROUP BY department
)
SELECT * FROM dept_stats ORDER BY avg_salary DESC;

-- Recursive CTEs
WITH RECURSIVE fibonacci(n, a, b) AS (
SELECT 1, 0, 1
UNION ALL
SELECT n + 1, b, a + b
FROM fibonacci
WHERE n < 10
)
SELECT n, b as fibonacci_number FROM fibonacci;

2. Data Import and Export

CSV Files

-- Import CSV
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('sales_data.csv');

-- Export CSV
COPY sales TO 'output.csv' (HEADER, DELIMITER ',');

-- Advanced CSV reading
SELECT * FROM read_csv(
'data.csv',
columns = {'id': 'INTEGER', 'name': 'VARCHAR', 'date': 'DATE'},
header = true,
delimiter = ',',
quote = '"'
);
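
The same ingestion is available from Python; a short sketch (the file name is an example):

import duckdb

# read_csv returns a relation that can be further queried or materialized
rel = duckdb.read_csv('sales_data.csv')
print(rel.limit(5).df())

# Or go straight to SQL with automatic type sniffing
duckdb.sql("CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_data.csv')")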

Parquet Files

-- Read Parquet
SELECT * FROM read_parquet('data.parquet');

-- Write Parquet
COPY (SELECT * FROM sales WHERE year = 2023)
TO 'sales_2023.parquet' (FORMAT PARQUET);

-- Multiple Parquet files
SELECT * FROM read_parquet('data/*.parquet');

JSON Data

-- Read JSON
SELECT * FROM read_json_auto('data.json');

-- Extract JSON fields
SELECT
json_extract(data, '$.name') as name,
json_extract(data, '$.age') as age
FROM json_table;

-- JSON array processing
SELECT
unnest(json_extract(data, '$.items')) as item
FROM json_data;

3. Data Analysis Functions

Statistical Functions

-- Descriptive statistics
SELECT
COUNT(*) as count,
AVG(salary) as mean_salary,
MEDIAN(salary) as median_salary,
STDDEV(salary) as std_dev,
MIN(salary) as min_salary,
MAX(salary) as max_salary,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) as q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) as q3
FROM employees;

-- Correlation analysis
SELECT CORR(salary, experience_years) as salary_experience_correlation
FROM employees;

-- Linear regression
SELECT
REGR_SLOPE(salary, experience_years) as slope,
REGR_INTERCEPT(salary, experience_years) as intercept,
REGR_R2(salary, experience_years) as r_squared
FROM employees;
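
These aggregates compose directly with DataFrames; a self-contained sketch using a small, made-up employees table:

import duckdb
import pandas as pd

employees = pd.DataFrame({
    "salary": [52000, 61000, 75000, 90000, 120000],
    "experience_years": [1, 3, 5, 8, 12],
})

print(duckdb.sql("""
    SELECT
        COUNT(*)                       AS n,
        AVG(salary)                    AS mean_salary,
        MEDIAN(salary)                 AS median_salary,
        CORR(salary, experience_years) AS corr_salary_experience
    FROM employees
""").df())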

Time Series Analysis

-- Time series aggregation
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as monthly_revenue,
LAG(SUM(amount)) OVER (ORDER BY DATE_TRUNC('month', order_date)) as prev_month,
(SUM(amount) - LAG(SUM(amount)) OVER (ORDER BY DATE_TRUNC('month', order_date))) /
LAG(SUM(amount)) OVER (ORDER BY DATE_TRUNC('month', order_date)) * 100 as growth_rate
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

-- Moving averages
SELECT
date,
value,
AVG(value) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7day
FROM time_series_data;
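
The window-function pattern above can be tried end to end against a tiny synthetic series; a minimal sketch:

import duckdb
import pandas as pd

time_series_data = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "value": range(14),
})

print(duckdb.sql("""
    SELECT
        date,
        value,
        AVG(value) OVER (
            ORDER BY date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS moving_avg_7day
    FROM time_series_data
""").df())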

Integration with Data Science Tools

1. Python Pandas Integration

import pandas as pd
import duckdb

# Create sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# Query DataFrame directly with DuckDB
result = duckdb.sql("SELECT * FROM df WHERE age > 25").df()
print(result)

# Register DataFrame for multiple queries
duckdb.register('employees', df)
result = duckdb.sql("SELECT AVG(salary) FROM employees").fetchone()[0]
print(f"Average salary: ${result:,.2f}")

# Convert DuckDB result to pandas
conn = duckdb.connect()
conn.execute("CREATE TABLE sales AS SELECT * FROM read_csv('sales.csv')")
df_result = conn.execute("SELECT * FROM sales").df()

2. Apache Arrow Integration

import pyarrow as pa
import duckdb

# Create Arrow table ("table" is a reserved SQL word, so use another name)
arrow_table = pa.table({
    'id': [1, 2, 3, 4],
    'value': [10.5, 20.3, 30.1, 40.7]
})

# Query the Arrow table directly via replacement scan on the local variable
result = duckdb.sql("SELECT * FROM arrow_table WHERE value > 20").arrow()
print(result)

# Zero-copy integration via explicit registration
conn = duckdb.connect()
conn.register('arrow_data', arrow_table)
result = conn.execute("SELECT SUM(value) FROM arrow_data").fetchone()[0]

3. Jupyter Notebook Integration

# One common approach: the jupysql extension plus the DuckDB SQLAlchemy driver
# pip install jupysql duckdb-engine
%load_ext sql
%sql duckdb://

# Use the SQL cell magic
%%sql
SELECT
department,
COUNT(*) as employee_count,
AVG(salary) as avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;

# Visualize results
import matplotlib.pyplot as plt

result = %sql SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department
result.DataFrame().plot(x='department', y='avg_salary', kind='bar')
plt.show()

Performance Optimization

1. Query Optimization

Efficient Filtering

-- Use column statistics for better performance
SELECT * FROM large_table WHERE date_column >= '2023-01-01';

-- Avoid functions in WHERE clauses
-- Bad:
SELECT * FROM sales WHERE YEAR(order_date) = 2023;
-- Good:
SELECT * FROM sales WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01';

Projection Pushdown

-- Select only needed columns
SELECT customer_id, total_amount
FROM orders
WHERE order_date >= '2023-01-01';

-- Avoid SELECT *
-- SELECT * FROM large_table; -- Reads all columns

Join Optimization

-- Use appropriate join types
SELECT c.name, COUNT(o.order_id)
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name;

-- Consider join order for large tables
-- DuckDB automatically optimizes, but be aware of data sizes

2. Memory Management

import duckdb

# Configure memory limit
conn = duckdb.connect()
conn.execute("SET memory_limit='4GB'")

# Inspect current memory usage
print(conn.execute("SELECT * FROM duckdb_memory()").fetchall())

# For very large results, fetch in batches rather than materializing everything
reader = conn.execute("SELECT * FROM large_table").fetch_record_batch()

3. Parallel Processing

-- Configure thread count
SET threads=8;

-- Check current configuration
SELECT * FROM duckdb_settings() WHERE name LIKE '%thread%';

-- CSV reading is parallelized automatically across the configured threads
SELECT * FROM read_csv('large_file.csv');
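
The same settings can be applied from Python, either per session or at connect time through the config argument; a small sketch:

import duckdb

# Per-session settings
conn = duckdb.connect()
conn.execute("SET threads=8")

# Or pass configuration when opening the database
conn2 = duckdb.connect(config={"threads": 4, "memory_limit": "4GB"})
print(conn2.execute("SELECT current_setting('threads')").fetchone())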

Advanced Features

1. Extensions

DuckDB supports various extensions for additional functionality:

-- Install and load extensions
INSTALL httpfs;
LOAD httpfs;

-- Read from S3
SELECT * FROM read_parquet('s3://bucket/data.parquet');

-- Install spatial extension
INSTALL spatial;
LOAD spatial;

-- Spatial queries
SELECT ST_Distance(
ST_Point(lon1, lat1),
ST_Point(lon2, lat2)
) as distance
FROM locations;
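
Extensions can also be managed from the Python API; a brief sketch using the httpfs extension shown above:

import duckdb

conn = duckdb.connect()
conn.install_extension("httpfs")
conn.load_extension("httpfs")

# Loaded extensions are visible in the catalog
print(conn.execute(
    "SELECT extension_name FROM duckdb_extensions() WHERE loaded"
).fetchall())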

2. User-Defined Functions (UDFs)

import duckdb

# Type hints let DuckDB infer the SQL parameter and return types
def custom_function(x: int) -> int:
    return x * 2 + 1

# Register Python function
conn = duckdb.connect()
conn.create_function('my_func', custom_function)

# Use in SQL
result = conn.execute("SELECT my_func(5)").fetchone()[0]
print(result) # 11

3. Prepared Statements

import duckdb

conn = duckdb.connect()

# The Python API prepares parameterized statements implicitly; pass the
# parameter values as the second argument to execute()
query = "SELECT * FROM employees WHERE department = ? AND salary > ?"

# Execute with different parameters
result1 = conn.execute(query, ['Engineering', 80000]).fetchall()
result2 = conn.execute(query, ['Sales', 60000]).fetchall()
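
For bulk parameterized inserts the same mechanism scales via executemany; a small sketch with a made-up table:

# Hypothetical table used only for illustration
conn.execute("CREATE TABLE IF NOT EXISTS visits (user_id INTEGER, page VARCHAR)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, '/home'), (1, '/pricing'), (2, '/docs')],
)
print(conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0])  # 3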

Use Cases and Applications

1. Data Science and Analytics

import duckdb
import pandas as pd
import matplotlib.pyplot as plt

# Load and analyze large dataset
conn = duckdb.connect()

# ETL pipeline
conn.execute("""
CREATE TABLE clean_data AS
SELECT
customer_id,
DATE_TRUNC('month', order_date) as month,
SUM(amount) as monthly_spend,
COUNT(*) as order_count
FROM raw_orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id, DATE_TRUNC('month', order_date)
""")

# Statistical analysis
stats = conn.execute("""
SELECT
month,
AVG(monthly_spend) as avg_spend,
STDDEV(monthly_spend) as std_spend,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY monthly_spend) as median_spend
FROM clean_data
GROUP BY month
ORDER BY month
""").df()

# Visualization
stats.plot(x='month', y='avg_spend', kind='line')
plt.show()

2. ETL and Data Processing

-- Complex ETL pipeline
CREATE TABLE processed_sales AS
WITH daily_sales AS (
SELECT
DATE(order_timestamp) as sale_date,
product_id,
SUM(quantity * unit_price) as daily_revenue,
COUNT(*) as transaction_count
FROM raw_transactions
WHERE order_timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(order_timestamp), product_id
),
product_metrics AS (
SELECT
product_id,
AVG(daily_revenue) as avg_daily_revenue,
SUM(daily_revenue) as total_revenue,
SUM(transaction_count) as total_transactions
FROM daily_sales
GROUP BY product_id
)
SELECT
p.product_name,
pm.avg_daily_revenue,
pm.total_revenue,
pm.total_transactions,
pm.total_revenue / pm.total_transactions as avg_transaction_value
FROM product_metrics pm
JOIN products p ON pm.product_id = p.product_id
ORDER BY pm.total_revenue DESC;

3. Business Intelligence

-- Executive dashboard queries
-- Revenue trends
SELECT
DATE_TRUNC('quarter', order_date) as quarter,
SUM(amount) as revenue,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(amount) / COUNT(DISTINCT customer_id) as revenue_per_customer
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '2 years'
GROUP BY DATE_TRUNC('quarter', order_date)
ORDER BY quarter;

-- Customer segmentation
WITH customer_metrics AS (
SELECT
customer_id,
SUM(amount) as total_spent,
COUNT(*) as order_count,
MAX(order_date) as last_order_date,
CURRENT_DATE - MAX(order_date) as days_since_last_order
FROM orders
GROUP BY customer_id
)
SELECT
CASE
WHEN total_spent > 10000 THEN 'High Value'
WHEN total_spent > 1000 THEN 'Medium Value'
ELSE 'Low Value'
END as value_segment,
CASE
WHEN days_since_last_order <= 30 THEN 'Active'
WHEN days_since_last_order <= 90 THEN 'At Risk'
ELSE 'Churned'
END as activity_segment,
COUNT(*) as customer_count,
AVG(total_spent) as avg_spent
FROM customer_metrics
GROUP BY 1, 2
ORDER BY 1, 2;

Comparison with Other Databases

DuckDB vs SQLite

Aspect         DuckDB              SQLite
-------------  ------------------  --------------------
Workload       OLAP (analytics)    OLTP (transactions)
Storage        Columnar            Row-based
Performance    Fast aggregations   Fast point queries
Use case       Data analysis       Application database
SQL features   Advanced analytics  Basic SQL

DuckDB vs ClickHouse

Aspect        DuckDB           ClickHouse
------------  ---------------  ----------------------
Deployment    Embedded         Server-based
Complexity    Simple           Complex setup
Concurrency   Single process   High concurrency
Scale         Single machine   Distributed
Use case      Local analytics  Large-scale analytics

DuckDB vs Pandas

Aspect           DuckDB                   Pandas
---------------  -----------------------  ----------------------------
Memory           Can spill out-of-core    Primarily in-memory
Performance      Vectorized SQL engine    Vectorized NumPy operations
Data size        Larger-than-memory data  Limited by available RAM
Query language   SQL                      Python API
Learning curve   SQL knowledge            Python knowledge

Best Practices

1. Data Modeling

-- Use appropriate data types
CREATE TABLE optimized_table (
    id         INTEGER PRIMARY KEY,
    created_at TIMESTAMP,
    category   VARCHAR(50),    -- length hints are accepted but not enforced
    amount     DECIMAL(10,2),  -- appropriate precision and scale
    metadata   JSON            -- JSON for flexible attributes
);

-- Indexes (ART) mainly speed up highly selective point lookups and enforce
-- uniqueness; min/max zonemaps for range filters are maintained automatically
CREATE INDEX idx_timestamp ON sales(order_date);
CREATE INDEX idx_category ON sales(category);

2. Query Patterns

-- Batch operations for better performance
INSERT INTO target_table
SELECT * FROM source_table
WHERE condition;

-- Use EXPLAIN to understand query plans
EXPLAIN SELECT * FROM large_table WHERE date_col >= '2023-01-01';

-- Leverage columnar storage
-- Good: SELECT specific columns
SELECT customer_id, amount FROM orders;
-- Avoid: SELECT * when not needed

3. Memory and Performance

import duckdb

# Configure for your workload
conn = duckdb.connect()
conn.execute("SET memory_limit='8GB'")
conn.execute("SET threads=4")
conn.execute("SET enable_progress_bar=true")

# Use connection pooling for applications
class DuckDBPool:
    def __init__(self, database_path, max_connections=10):
        self.database_path = database_path
        self.connections = []
        self.max_connections = max_connections

    def get_connection(self):
        if self.connections:
            return self.connections.pop()
        return duckdb.connect(self.database_path)

    def return_connection(self, conn):
        if len(self.connections) < self.max_connections:
            self.connections.append(conn)
        else:
            conn.close()

Troubleshooting and Debugging

1. Common Issues

Memory Issues

-- Check memory usage
SELECT * FROM duckdb_memory();

-- Reduce memory usage
SET memory_limit='2GB';
SET preserve_insertion_order=false;      -- lets more operators stream
SET temp_directory='/tmp/duckdb_spill';  -- spill large operations to disk

Performance Issues

-- Analyze query performance
EXPLAIN ANALYZE SELECT * FROM large_table WHERE condition;

-- Inspect storage-level statistics for a table
PRAGMA storage_info('table_name');

-- Column statistics (min/max zonemaps) are maintained automatically as data
-- is written, so no separate statistics-refresh step is required
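
The rendered plan can also be captured programmatically from Python; a sketch (the table and filter are placeholders):

import duckdb

conn = duckdb.connect()
plan = conn.execute(
    "EXPLAIN ANALYZE SELECT * FROM large_table WHERE date_col >= '2023-01-01'"
).fetchall()
for _, plan_text in plan:
    print(plan_text)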

2. Debugging Queries

import duckdb

conn = duckdb.connect()

# Enable profiling and write each query's profile to a JSON file
conn.execute("SET enable_profiling='json'")
conn.execute("SET profiling_output='query_profile.json'")

# Run query
result = conn.execute("SELECT * FROM large_table").fetchall()

# Inspect the profile written for the query
with open('query_profile.json') as f:
    print(f.read())

3. Error Handling

import duckdb

try:
    conn = duckdb.connect('database.duckdb')
    result = conn.execute("SELECT * FROM non_existent_table")
except duckdb.CatalogException as e:
    print(f"Table not found: {e}")
except duckdb.ConversionException as e:
    print(f"Data conversion error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
finally:
    if 'conn' in locals():
        conn.close()

Production Deployment

1. Application Integration

# Flask web application example
from flask import Flask, request, jsonify
import duckdb
import json

app = Flask(__name__)

class DatabaseManager:
    def __init__(self, db_path):
        self.db_path = db_path
        self.init_database()

    def init_database(self):
        conn = duckdb.connect(self.db_path)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS analytics_data (
                timestamp TIMESTAMP,
                event_type VARCHAR,
                user_id INTEGER,
                properties JSON
            )
        """)
        conn.close()

    def get_connection(self):
        return duckdb.connect(self.db_path)

db_manager = DatabaseManager('analytics.duckdb')

@app.route('/api/events', methods=['POST'])
def add_event():
    data = request.json
    conn = db_manager.get_connection()
    try:
        conn.execute(
            "INSERT INTO analytics_data VALUES (?, ?, ?, ?)",
            (data['timestamp'], data['event_type'], data['user_id'],
             json.dumps(data['properties'])))  # serialize dict for the JSON column
        return jsonify({'status': 'success'})
    finally:
        conn.close()

@app.route('/api/analytics', methods=['GET'])
def get_analytics():
    conn = db_manager.get_connection()
    try:
        result = conn.execute("""
            SELECT
                event_type,
                COUNT(*) as event_count,
                COUNT(DISTINCT user_id) as unique_users
            FROM analytics_data
            WHERE timestamp >= CURRENT_DATE - INTERVAL '7 days'
            GROUP BY event_type
            ORDER BY event_count DESC
        """).fetchall()

        return jsonify([{
            'event_type': row[0],
            'event_count': row[1],
            'unique_users': row[2]
        } for row in result])
    finally:
        conn.close()

2. Backup and Recovery

#!/bin/bash
# Backup script for DuckDB

DB_PATH="/path/to/production.duckdb"
BACKUP_DIR="/backups/duckdb"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p $BACKUP_DIR

# Export data to Parquet (portable format)
duckdb $DB_PATH << EOF
EXPORT DATABASE '$BACKUP_DIR/backup_$TIMESTAMP' (FORMAT PARQUET);
EOF

# Compress backup
tar -czf $BACKUP_DIR/backup_$TIMESTAMP.tar.gz $BACKUP_DIR/backup_$TIMESTAMP/
rm -rf $BACKUP_DIR/backup_$TIMESTAMP/

# Keep only last 7 days of backups
find $BACKUP_DIR -name "backup_*.tar.gz" -mtime +7 -delete

echo "Backup completed: backup_$TIMESTAMP.tar.gz"
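
Restoring is the mirror image: unpack the archive and load it with IMPORT DATABASE. A sketch, assuming the paths and a timestamp produced by the script above:

import duckdb

# Paths and timestamp are assumptions matching the backup script above
conn = duckdb.connect("/path/to/restored.duckdb")
conn.execute("IMPORT DATABASE '/backups/duckdb/backup_20240101_120000'")
conn.close()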

3. Monitoring and Observability

import duckdb
import time
import logging
from contextlib import contextmanager

class DuckDBMonitor:
    def __init__(self, db_path):
        self.db_path = db_path
        self.logger = logging.getLogger(__name__)

    @contextmanager
    def monitored_connection(self):
        start_time = time.time()
        conn = None
        try:
            conn = duckdb.connect(self.db_path)
            yield conn
        except Exception as e:
            self.logger.error(f"Database error: {e}")
            raise
        finally:
            if conn:
                # Log performance metrics
                duration = time.time() - start_time
                self.logger.info(f"Query duration: {duration:.2f}s")

                # Check memory usage
                memory_info = conn.execute("SELECT * FROM duckdb_memory()").fetchall()
                self.logger.info(f"Memory usage: {memory_info}")

                conn.close()

# Usage
monitor = DuckDBMonitor('production.duckdb')

with monitor.monitored_connection() as conn:
    result = conn.execute("SELECT COUNT(*) FROM large_table").fetchone()

Future Roadmap and Ecosystem

Upcoming Features

  1. Enhanced Streaming: Better support for streaming data processing
  2. Distributed Computing: Multi-node processing capabilities
  3. ML Integration: Native machine learning functions
  4. Cloud Integration: Better cloud storage integration

Ecosystem Tools

  • DuckDB Extensions: Spatial, HTTP, JSON, etc.
  • BI Tools: Grafana, Metabase, Superset integration
  • Data Tools: dbt, Airbyte, Dagster support
  • Language Bindings: Python, R, Java, Node.js, Rust, Go

Conclusion

DuckDB represents a paradigm shift in analytical databases by bringing high-performance OLAP capabilities to embedded environments. Its key strengths include:

Advantages

  • Simplicity: No server setup or configuration required
  • Performance: Vectorized execution and columnar storage
  • Integration: Seamless integration with data science tools
  • Standards: Extensive SQL support and ACID transactions
  • Portability: Single file database with cross-platform support

Best Use Cases

  • Data Science: Local analytics and exploration
  • ETL Pipelines: Data transformation and processing
  • Embedded Analytics: Applications requiring analytical capabilities
  • Prototyping: Quick analytical application development
  • Edge Computing: Analytics at the edge without infrastructure

When to Choose DuckDB

  • Need analytical performance without server complexity
  • Working with medium-sized datasets (GB to TB range)
  • Integrating analytics into existing applications
  • Developing data science workflows
  • Requiring SQL compliance and ACID properties

DuckDB fills the gap between simple embedded databases like SQLite and complex distributed analytical systems like ClickHouse, making it an excellent choice for modern analytical workloads that prioritize simplicity and performance.
