Apache Livy: REST API for Apache Spark
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It supports submitting Spark jobs or snippets of Spark code, retrieving results synchronously or asynchronously, and managing Spark contexts, all through a simple REST interface or an RPC client library.
Overview
What is Apache Livy?
Apache Livy is a REST service for Apache Spark that provides:
- Remote Spark Context Management: Create, manage, and destroy Spark contexts remotely
- Job Submission: Submit Spark applications via REST API
- Interactive Sessions: Support for interactive Spark sessions
- Multiple Language Support: Scala, Python, R, and SQL
- Security: Authentication and authorization support
Key Benefits
- Remote Access: Access Spark clusters without direct cluster access
- Multi-tenancy: Multiple users can share the same Spark cluster safely
- Language Agnostic: Support for multiple programming languages
- Session Management: Automatic session lifecycle management
- Security: Built-in security features for enterprise environments
Architecture
┌─────────────────┐      REST API     ┌─────────────────┐
│   Client Apps   │ ────────────────► │   Livy Server   │
│                 │                   │                 │
│  - Web Apps     │                   │  - Session Mgmt │
│  - Notebooks    │                   │  - Job Queue    │
│  - Scripts      │                   │  - Security     │
└─────────────────┘                   └─────────────────┘
                                               │
                                               │ Spark Submit
                                               ▼
                                      ┌─────────────────┐
                                      │  Spark Cluster  │
                                      │                 │
                                      │  - Driver       │
                                      │  - Executors    │
                                      │  - Resources    │
                                      └─────────────────┘
Installation and Setup
Prerequisites
# Apache Spark (required)
export SPARK_HOME=/path/to/spark
# Java 8 or 11
java -version
# Python (for PySpark support)
python --version
# R (for SparkR support)
R --version
Installation Methods
1. Download and Install
# Download Livy
wget https://downloads.apache.org/incubator/livy/0.8.0-incubating/apache-livy-0.8.0-incubating-bin.zip
unzip apache-livy-0.8.0-incubating-bin.zip
cd apache-livy-0.8.0-incubating-bin
# Set environment variables
export LIVY_HOME=$(pwd)
export PATH=$LIVY_HOME/bin:$PATH
2. Build from Source
# Clone repository
git clone https://github.com/apache/incubator-livy.git
cd incubator-livy
# Build with Maven
mvn clean package -DskipTests
# Built artifacts in assembly/target/
3. Docker Installation
# Pull Livy Docker image
docker pull apache/livy:0.8.0-incubating
# Run Livy server
docker run -d \
-p 8998:8998 \
-e SPARK_HOME=/opt/spark \
apache/livy:0.8.0-incubating
Configuration
livy.conf
# Server configuration
livy.server.host = 0.0.0.0
livy.server.port = 8998
# Spark configuration
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
# Session configuration
livy.server.session.state-retain.sec = 600s
livy.server.session.timeout = 1h
livy.server.session.timeout-check = true
# Security
livy.server.auth.type = kerberos
livy.server.launch.kerberos.keytab = /path/to/livy.keytab
livy.server.launch.kerberos.principal = livy/hostname@REALM
# Impersonation
livy.impersonation.enabled = true
# File upload
livy.file.local-dir-whitelist = /tmp/livy-uploads
livy.file.upload.max-size = 100MB
# Recovery
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = filesystem
livy.server.recovery.state-store.url = file:///tmp/livy-recovery
spark-defaults.conf (for Livy)
# Spark configuration for Livy sessions
spark.master=yarn
spark.submit.deployMode=cluster
spark.executor.memory=2g
spark.executor.cores=2
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=10
# Serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer
# History server
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://namenode:port/spark-history
Core Concepts
1. Sessions
Livy manages Spark applications through sessions:
Interactive Sessions
- Spark Sessions: For Scala/Java code
- PySpark Sessions: For Python code
- SparkR Sessions: For R code
- SQL Sessions: For SQL queries
Batch Sessions
- Batch Jobs: Submit complete Spark applications
- One-time Execution: Jobs run to completion and terminate
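Interactive and batch workloads map onto two different endpoint families, described in detail below. As a quick, non-authoritative illustration (assuming a Livy server on localhost:8998 and placeholder application paths), the sketch creates one session of each type with the requests library:
import requests
LIVY_URL = "http://localhost:8998"  # assumed local Livy endpoint
# Interactive session: "kind" selects the interpreter (spark, pyspark, sparkr, or sql)
session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()
print("interactive session:", session["id"], session["state"])
# Batch session: a self-contained application that runs to completion (placeholder jar path)
batch = requests.post(
    f"{LIVY_URL}/batches",
    json={"file": "hdfs:///apps/my-spark-app.jar", "className": "com.example.MySparkApp"}
).json()
print("batch job:", batch["id"], batch["state"])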
2. Session Lifecycle
Created → Starting → Idle → Busy → Idle → ... → Shutting_down → Dead
- Created: the session is registered when the client calls POST /sessions
- Starting: the underlying Spark application is launching
- Idle: the session is ready and waiting for code
- Busy: a statement is currently executing; the session returns to Idle when it finishes
- Shutting_down: the session is being stopped, typically after DELETE /sessions/{sessionId}
- Dead: cleanup is complete and the session can no longer be used
3. REST API Endpoints
Session Management
- GET /sessions - List all sessions
- POST /sessions - Create new session
- GET /sessions/{sessionId} - Get session info
- DELETE /sessions/{sessionId} - Delete session
Code Execution
- POST /sessions/{sessionId}/statements - Execute code
- GET /sessions/{sessionId}/statements - List statements
- GET /sessions/{sessionId}/statements/{statementId} - Get statement result
Batch Jobs
- GET /batches - List batch jobs
- POST /batches - Submit batch job
- GET /batches/{batchId} - Get batch info
- DELETE /batches/{batchId} - Kill batch job
Usage Examples
1. Interactive Sessions
Creating Sessions
# Create Spark session
curl -X POST \
http://localhost:8998/sessions \
-H 'Content-Type: application/json' \
-d '{
  "kind": "spark",
  "conf": {
    "spark.executor.memory": "2g",
    "spark.executor.cores": "2"
  }
}'
# Response
{
  "id": 0,
  "appId": null,
  "owner": null,
  "proxyUser": null,
  "state": "starting",
  "kind": "spark",
  "appInfo": {
    "driverLogUrl": null,
    "sparkUiUrl": null
  },
  "log": []
}
Executing Code
# Execute Scala code
curl -X POST \
http://localhost:8998/sessions/0/statements \
-H 'Content-Type: application/json' \
-d '{
"code": "val data = Array(1, 2, 3, 4, 5)\nval distData = sc.parallelize(data)\ndistData.reduce((a, b) => a + b)"
}'
# Response
{
"id": 0,
"code": "val data = Array(1, 2, 3, 4, 5)\nval distData = sc.parallelize(data)\ndistData.reduce((a, b) => a + b)",
"state": "waiting",
"output": null,
"progress": 0.0
}
Getting Results
# Get statement result
curl http://localhost:8998/sessions/0/statements/0
# Response
{
  "id": 0,
  "code": "val data = Array(1, 2, 3, 4, 5)\nval distData = sc.parallelize(data)\ndistData.reduce((a, b) => a + b)",
  "state": "available",
  "output": {
    "status": "ok",
    "execution_count": 0,
    "data": {
      "text/plain": "res0: Int = 15"
    }
  },
  "progress": 1.0
}
2. Python Client Example
import requests
import json
import time
class LivyClient:
    def __init__(self, livy_url="http://localhost:8998"):
        self.livy_url = livy_url
        self.session_id = None

    def create_session(self, kind="pyspark", **conf):
        """Create a new Livy session"""
        data = {
            "kind": kind,
            "conf": conf
        }
        response = requests.post(
            f"{self.livy_url}/sessions",
            headers={"Content-Type": "application/json"},
            data=json.dumps(data)
        )
        if response.status_code == 201:
            session_info = response.json()
            self.session_id = session_info["id"]
            # Wait for session to be ready
            self._wait_for_session_ready()
            return session_info
        else:
            raise Exception(f"Failed to create session: {response.text}")

    def _wait_for_session_ready(self):
        """Wait for session to be in 'idle' state"""
        while True:
            session_info = self.get_session_info()
            state = session_info["state"]
            if state == "idle":
                break
            elif state in ["error", "dead"]:
                raise Exception(f"Session failed with state: {state}")
            time.sleep(2)

    def get_session_info(self):
        """Get session information"""
        response = requests.get(f"{self.livy_url}/sessions/{self.session_id}")
        return response.json()

    def execute_code(self, code):
        """Execute code in the session"""
        data = {"code": code}
        response = requests.post(
            f"{self.livy_url}/sessions/{self.session_id}/statements",
            headers={"Content-Type": "application/json"},
            data=json.dumps(data)
        )
        if response.status_code == 201:
            statement_info = response.json()
            statement_id = statement_info["id"]
            # Wait for execution to complete
            return self._wait_for_statement_completion(statement_id)
        else:
            raise Exception(f"Failed to execute code: {response.text}")

    def _wait_for_statement_completion(self, statement_id):
        """Wait for statement execution to complete"""
        while True:
            response = requests.get(
                f"{self.livy_url}/sessions/{self.session_id}/statements/{statement_id}"
            )
            statement_info = response.json()
            state = statement_info["state"]
            if state == "available":
                return statement_info["output"]
            elif state == "error":
                raise Exception(f"Statement execution failed: {statement_info}")
            time.sleep(1)

    def delete_session(self):
        """Delete the session"""
        if self.session_id:
            response = requests.delete(f"{self.livy_url}/sessions/{self.session_id}")
            return response.status_code == 200

# Usage example
client = LivyClient()
try:
    # Create session
    session = client.create_session(
        kind="pyspark",
        **{
            "spark.executor.memory": "2g",
            "spark.executor.cores": "2"
        }
    )
    print(f"Created session: {session['id']}")
    # Execute PySpark code (kept flush-left so it runs cleanly on the remote interpreter)
    code = """
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Create sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Show results
df.show()
df.count()
"""
    result = client.execute_code(code)
    print("Execution result:", result)
finally:
    # Clean up
    client.delete_session()
3. Batch Job Submission
# Submit batch job
curl -X POST \
http://localhost:8998/batches \
-H 'Content-Type: application/json' \
-d '{
  "file": "hdfs://namenode:port/path/to/my-spark-app.jar",
  "className": "com.example.MySparkApp",
  "args": ["arg1", "arg2"],
  "conf": {
    "spark.executor.memory": "4g",
    "spark.executor.cores": "4"
  }
}'
# Response
{
  "id": 0,
  "state": "starting",
  "appId": null,
  "appInfo": {
    "driverLogUrl": null,
    "sparkUiUrl": null
  },
  "log": []
}
4. File Upload and Usage
# Upload JAR file
curl -X POST \
http://localhost:8998/sessions/0/upload-jar \
-F "jar=@/path/to/my-library.jar"
# Upload Python file
curl -X POST \
http://localhost:8998/sessions/0/upload-pyfile \
-F "file=@/path/to/my-module.py"
# Use uploaded files in code
curl -X POST \
http://localhost:8998/sessions/0/statements \
-H 'Content-Type: application/json' \
-d '{
"code": "import my_module\nmy_module.my_function()"
}'
Advanced Features
1. Session Configuration
# Advanced session configuration
session_config = {
    "kind": "pyspark",
    "proxyUser": "data_scientist",
    "jars": [
        "hdfs://namenode:port/path/to/library1.jar",
        "hdfs://namenode:port/path/to/library2.jar"
    ],
    "pyFiles": [
        "hdfs://namenode:port/path/to/module1.py",
        "hdfs://namenode:port/path/to/module2.py"
    ],
    "files": [
        "hdfs://namenode:port/path/to/config.properties"
    ],
    "driverMemory": "2g",
    "driverCores": 2,
    "executorMemory": "4g",
    "executorCores": 4,
    "numExecutors": 10,
    "conf": {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
    }
}
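This dictionary is simply the request body for POST /sessions; a minimal sketch of submitting it (assuming the requests library and a Livy server on localhost:8998):
import requests
# session_config is the dictionary defined above
response = requests.post("http://localhost:8998/sessions", json=session_config)
response.raise_for_status()
print("Created session", response.json()["id"])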
2. Security Configuration
Kerberos Authentication
# livy.conf
livy.server.auth.type = kerberos
livy.server.auth.kerberos.principal = HTTP/livy-server@REALM
livy.server.auth.kerberos.keytab = /path/to/livy.keytab
# Enable SPNEGO for web UI
livy.server.auth.kerberos.name-rules = RULE:[2:$1@$0](.*@REALM)s/@.*//
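On the client side, requests to a Kerberos-protected Livy server need SPNEGO authentication. A minimal sketch, assuming the third-party requests-kerberos package and a valid ticket already obtained via kinit:
import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED  # pip install requests-kerberos
# A Kerberos ticket must already exist in the local credential cache (kinit).
auth = HTTPKerberosAuth(mutual_authentication=REQUIRED)
response = requests.get("http://livy-server:8998/sessions", auth=auth)
print(response.status_code, response.json())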
SSL Configuration
# Enable SSL
livy.keystore = /path/to/keystore.jks
livy.keystore.password = keystorepassword
livy.key-password = keypassword
# Client certificate authentication
livy.server.auth.type = certificate
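With SSL enabled, clients switch to https and verify the server certificate against the CA that signed the keystore. A minimal sketch (the CA bundle path is a placeholder):
import requests
# Point verification at the CA bundle for the Livy server's certificate (placeholder path).
response = requests.get("https://livy-server:8998/sessions", verify="/path/to/ca-bundle.pem")
print(response.json())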
3. Monitoring and Logging
import requests
import time

class LivyMonitor:
    def __init__(self, livy_url):
        self.livy_url = livy_url

    def get_all_sessions(self):
        """Get information about all sessions"""
        response = requests.get(f"{self.livy_url}/sessions")
        return response.json()["sessions"]

    def get_session_logs(self, session_id, from_line=0, size=100):
        """Get session logs"""
        params = {"from": from_line, "size": size}
        response = requests.get(
            f"{self.livy_url}/sessions/{session_id}/log",
            params=params
        )
        return response.json()

    def get_session_metrics(self, session_id):
        """Get session metrics and statistics"""
        session_info = requests.get(
            f"{self.livy_url}/sessions/{session_id}"
        ).json()
        app_info = session_info.get("appInfo") or {}  # appInfo can be null while starting
        return {
            "session_id": session_id,
            "state": session_info["state"],
            "app_id": session_info.get("appId"),
            "spark_ui_url": app_info.get("sparkUiUrl"),
            "driver_log_url": app_info.get("driverLogUrl")
        }

    def monitor_sessions(self, interval=30):
        """Continuously monitor all sessions"""
        while True:
            sessions = self.get_all_sessions()
            print(f"\n=== Session Status at {time.strftime('%Y-%m-%d %H:%M:%S')} ===")
            for session in sessions:
                metrics = self.get_session_metrics(session["id"])
                print(f"Session {metrics['session_id']}: {metrics['state']} "
                      f"(App: {metrics['app_id']})")
            time.sleep(interval)

# Usage
monitor = LivyMonitor("http://localhost:8998")
monitor.monitor_sessions()
4. Integration with Jupyter Notebooks
# Install sparkmagic
pip install sparkmagic
# Configure sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
# Configure Livy endpoint
cat > ~/.sparkmagic/config.json << EOF
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  }
}
EOF
Performance Optimization
1. Session Management
class OptimizedLivyClient:
    # _create_session, _delete_session and _execute_code are assumed to wrap the
    # REST calls shown in the LivyClient example above and to work with session IDs.
    def __init__(self, livy_url, pool_size=5):
        self.livy_url = livy_url
        self.session_pool = []  # pool of ready-to-use session IDs
        self.pool_size = pool_size
        self._initialize_pool()

    def _initialize_pool(self):
        """Pre-create sessions for better performance"""
        for _ in range(self.pool_size):
            session_id = self._create_session()
            self.session_pool.append(session_id)

    def get_session(self):
        """Get an available session from the pool"""
        if self.session_pool:
            return self.session_pool.pop()
        return self._create_session()

    def return_session(self, session_id):
        """Return a session to the pool"""
        if len(self.session_pool) < self.pool_size:
            self.session_pool.append(session_id)
        else:
            self._delete_session(session_id)

    def execute_with_pool(self, code):
        """Execute code using the session pool"""
        session_id = self.get_session()
        try:
            return self._execute_code(session_id, code)
        finally:
            self.return_session(session_id)
2. Batch Processing Optimization
# Optimized batch job configuration
curl -X POST \
http://localhost:8998/batches \
-H 'Content-Type: application/json' \
-d '{
  "file": "hdfs://namenode:port/path/to/optimized-app.jar",
  "className": "com.example.OptimizedSparkApp",
  "conf": {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.executor.instances": "20",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.dynamicAllocation.enabled": "false"
  }
}'
Best Practices
1. Session Management
# Best practices for session management
import time

# The _-prefixed helpers below (_create_session, _submit_code, _get_statement_result,
# _get_all_sessions, _get_last_activity_time, _delete_session) are assumed to wrap the
# REST calls shown in the LivyClient example above.
class BestPracticeLivyClient:
    def __init__(self, livy_url):
        self.livy_url = livy_url
        self.session_timeout = 3600  # 1 hour

    def create_session_with_retry(self, max_retries=3):
        """Create session with retry logic"""
        for attempt in range(max_retries):
            try:
                return self._create_session()
            except Exception as e:
                if attempt == max_retries - 1:
                    raise e
                time.sleep(2 ** attempt)  # Exponential backoff

    def execute_with_timeout(self, code, timeout=300):
        """Execute code with timeout"""
        start_time = time.time()
        statement_id = self._submit_code(code)
        while time.time() - start_time < timeout:
            result = self._get_statement_result(statement_id)
            if result["state"] == "available":
                return result["output"]
            elif result["state"] == "error":
                raise Exception(f"Execution failed: {result}")
            time.sleep(1)
        raise TimeoutError(f"Code execution timed out after {timeout} seconds")

    def cleanup_idle_sessions(self, max_idle_time=1800):
        """Clean up sessions that have been idle too long"""
        sessions = self._get_all_sessions()
        current_time = time.time()
        for session in sessions:
            if session["state"] == "idle":
                # Check last activity time
                last_activity = self._get_last_activity_time(session["id"])
                if current_time - last_activity > max_idle_time:
                    self._delete_session(session["id"])
2. Error Handling
import logging
import requests
from enum import Enum

class LivyError(Exception):
    pass

class SessionState(Enum):
    NOT_STARTED = "not_started"
    STARTING = "starting"
    IDLE = "idle"
    BUSY = "busy"
    SHUTTING_DOWN = "shutting_down"
    ERROR = "error"
    DEAD = "dead"

class RobustLivyClient:
    def __init__(self, livy_url):
        self.livy_url = livy_url
        self.logger = logging.getLogger(__name__)

    def execute_code_safely(self, code, session_id=None):
        """Execute code with comprehensive error handling"""
        try:
            if session_id is None:
                session_id = self._get_or_create_session()
            # Check session health
            if not self._is_session_healthy(session_id):
                self.logger.warning(f"Session {session_id} unhealthy, recreating")
                session_id = self._recreate_session(session_id)
            # Execute code
            return self._execute_code(session_id, code)
        except requests.exceptions.ConnectionError:
            self.logger.error("Cannot connect to Livy server")
            raise LivyError("Livy server unavailable")
        except requests.exceptions.Timeout:
            self.logger.error("Request to Livy server timed out")
            raise LivyError("Request timeout")
        except Exception as e:
            self.logger.error(f"Unexpected error: {e}")
            raise LivyError(f"Execution failed: {e}")

    def _is_session_healthy(self, session_id):
        """Check if session is in a healthy state"""
        try:
            session_info = self._get_session_info(session_id)
            state = SessionState(session_info["state"])
            return state in [SessionState.IDLE, SessionState.BUSY]
        except Exception:
            return False
3. Resource Management
import logging

# _create_session, _delete_session and execute_code are assumed to be implemented
# as in the LivyClient example above.
class ResourceManagedLivyClient:
    def __init__(self, livy_url, max_concurrent_sessions=10):
        self.livy_url = livy_url
        self.max_concurrent_sessions = max_concurrent_sessions
        self.active_sessions = set()
        self.logger = logging.getLogger(__name__)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.cleanup_all_sessions()

    def create_managed_session(self, **config):
        """Create session with resource limits"""
        if len(self.active_sessions) >= self.max_concurrent_sessions:
            raise LivyError("Maximum concurrent sessions reached")
        session = self._create_session(**config)
        self.active_sessions.add(session["id"])
        return session

    def cleanup_all_sessions(self):
        """Clean up all managed sessions"""
        for session_id in list(self.active_sessions):
            try:
                self._delete_session(session_id)
                self.active_sessions.remove(session_id)
            except Exception as e:
                self.logger.warning(f"Failed to cleanup session {session_id}: {e}")

# Usage with context manager
with ResourceManagedLivyClient("http://localhost:8998") as client:
    session = client.create_managed_session(kind="pyspark")
    result = client.execute_code("spark.range(100).count()")
    # Sessions automatically cleaned up on exit
Troubleshooting
Common Issues and Solutions
1. Session Creation Failures
import requests

def diagnose_session_creation_failure(livy_client):
    """Diagnose why session creation is failing"""
    # Check Livy server status
    try:
        response = requests.get(f"{livy_client.livy_url}/sessions")
        if response.status_code != 200:
            print(f"Livy server error: {response.status_code}")
            return
    except requests.exceptions.ConnectionError:
        print("Cannot connect to Livy server")
        return
    # Check resource availability
    sessions = response.json()["sessions"]
    active_sessions = [s for s in sessions if s["state"] not in ["dead", "error"]]
    print(f"Active sessions: {len(active_sessions)}")
    # Check Spark cluster resources
    for session in active_sessions:
        if (session.get("appInfo") or {}).get("sparkUiUrl"):
            print(f"Session {session['id']}: {session['appInfo']['sparkUiUrl']}")
2. Performance Issues
def analyze_performance_issues(session_id, livy_client):
    """Analyze performance issues in a Livy session"""
    # Get session information (livy_client is assumed to expose
    # get_session_info(session_id) and get_session_logs as shown earlier)
    session_info = livy_client.get_session_info(session_id)
    # Check if the Spark UI is available
    spark_ui_url = (session_info.get("appInfo") or {}).get("sparkUiUrl")
    if spark_ui_url:
        print(f"Check Spark UI: {spark_ui_url}")
    # Get recent logs
    logs = livy_client.get_session_logs(session_id, size=50)
    # Look for common performance indicators
    performance_indicators = [
        "GC overhead limit exceeded",
        "OutOfMemoryError",
        "Task serialization failed",
        "Shuffle fetch failed"
    ]
    for log_line in logs.get("log", []):
        for indicator in performance_indicators:
            if indicator in log_line:
                print(f"Performance issue detected: {log_line}")
3. Memory Issues
# Increase session memory limits
curl -X POST \
http://localhost:8998/sessions \
-H 'Content-Type: application/json' \
-d '{
  "kind": "pyspark",
  "conf": {
    "spark.executor.memory": "8g",
    "spark.driver.memory": "4g",
    "spark.memory.fraction": "0.8",
    "spark.sql.execution.arrow.pyspark.enabled": "true"
  }
}'
Integration Examples
1. Web Application Integration
from flask import Flask, request, jsonify
import threading
import queue

app = Flask(__name__)

class LivyService:
    def __init__(self):
        # Reuse the LivyClient defined earlier and create one shared session up front
        # so that submitted jobs have a Spark context to run in.
        self.livy_client = LivyClient("http://localhost:8998")
        self.livy_client.create_session(kind="pyspark")
        self.job_queue = queue.Queue()
        self.results = {}
        self.worker_thread = threading.Thread(target=self._worker)
        self.worker_thread.daemon = True
        self.worker_thread.start()

    def _worker(self):
        """Background worker to process jobs"""
        while True:
            job_id, code = self.job_queue.get()
            try:
                result = self.livy_client.execute_code(code)
                self.results[job_id] = {"status": "completed", "result": result}
            except Exception as e:
                self.results[job_id] = {"status": "error", "error": str(e)}
            finally:
                self.job_queue.task_done()

    def submit_job(self, job_id, code):
        """Submit job for execution"""
        self.results[job_id] = {"status": "running"}
        self.job_queue.put((job_id, code))

    def get_result(self, job_id):
        """Get job result"""
        return self.results.get(job_id, {"status": "not_found"})

livy_service = LivyService()

@app.route('/submit', methods=['POST'])
def submit_job():
    data = request.json
    job_id = data['job_id']
    code = data['code']
    livy_service.submit_job(job_id, code)
    return jsonify({"status": "submitted", "job_id": job_id})

@app.route('/result/<job_id>')
def get_result(job_id):
    result = livy_service.get_result(job_id)
    return jsonify(result)

if __name__ == '__main__':
    app.run(debug=True)
2. Airflow Integration
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

def submit_spark_job_via_livy(**context):
    """Submit Spark job via Livy"""
    livy_client = LivyClient("http://livy-server:8998")
    # Create session
    session = livy_client.create_session(
        kind="pyspark",
        **{
            "spark.executor.memory": "4g",
            "spark.executor.cores": "2"
        }
    )
    try:
        # Execute Spark code
        code = """
# Your Spark ETL code here
df = spark.read.parquet("hdfs://path/to/input")
processed_df = df.groupBy("category").count()
processed_df.write.mode("overwrite").parquet("hdfs://path/to/output")
"""
        result = livy_client.execute_code(code)
        # Check if execution was successful
        if result["status"] == "ok":
            return "Job completed successfully"
        else:
            raise Exception(f"Job failed: {result}")
    finally:
        # Clean up session
        livy_client.delete_session()

# Define DAG
dag = DAG(
    'spark_etl_via_livy',
    default_args={
        'owner': 'data-team',
        'depends_on_past': False,
        'start_date': datetime(2023, 1, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5)
    },
    schedule_interval=timedelta(hours=1),
    catchup=False
)

# Define task
spark_task = PythonOperator(
    task_id='run_spark_etl',
    python_callable=submit_spark_job_via_livy,
    dag=dag
)
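For batch-style jobs, the Airflow Livy provider also ships a dedicated operator that submits to POST /batches and polls until completion. A hedged sketch, assuming the apache-airflow-providers-apache-livy package and an Airflow connection named livy_default that points at the Livy server:
from airflow.providers.apache.livy.operators.livy import LivyOperator
# Alternative to the PythonOperator above: submit a pre-built application as a Livy batch.
livy_batch_task = LivyOperator(
    task_id='run_spark_etl_batch',
    file='hdfs://namenode:port/path/to/my-spark-app.jar',  # placeholder path
    class_name='com.example.MySparkApp',
    conf={'spark.executor.memory': '4g'},
    livy_conn_id='livy_default',
    polling_interval=30,  # seconds between status checks while waiting for the batch
    dag=dag
)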
Conclusion
Apache Livy provides a powerful REST interface for Apache Spark that enables:
Key Benefits
- Remote Access: Access Spark clusters without direct access
- Multi-language Support: Scala, Python, R, and SQL
- Session Management: Automatic lifecycle management
- Security: Enterprise-grade authentication and authorization
Best Use Cases
- Web Applications: Integrate Spark into web services
- Notebooks: Interactive data science environments
- Multi-tenant Environments: Shared Spark clusters
- Microservices: Spark as a service architecture
When to Choose Livy
- Need REST API access to Spark
- Multiple users sharing Spark cluster
- Integration with web applications
- Remote Spark job submission requirements
Livy bridges the gap between Spark's powerful processing capabilities and modern application architectures, making Spark accessible through standard REST APIs.
Resources
Related Documentation
- Apache Spark Guide
- JupyterHub Setup
- Alluxio Integration