Apache Livy: REST API for Apache Spark

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It supports submitting Spark jobs or snippets of Spark code, retrieving results synchronously or asynchronously, and managing SparkContexts, all through a simple REST interface or an RPC client library.

Overview

What is Apache Livy?

Apache Livy is a REST service for Apache Spark that provides:

  • Remote Spark Context Management: Create, manage, and destroy Spark contexts remotely
  • Job Submission: Submit Spark applications via REST API
  • Interactive Sessions: Support for interactive Spark sessions
  • Multiple Language Support: Scala, Python, R, and SQL
  • Security: Authentication and authorization support

Key Benefits

  • Remote Access: Access Spark clusters without direct cluster access
  • Multi-tenancy: Multiple users can share the same Spark cluster safely
  • Language Agnostic: Support for multiple programming languages
  • Session Management: Automatic session lifecycle management
  • Security: Built-in security features for enterprise environments

Architecture

┌─────────────────┐    REST API     ┌─────────────────┐
│  Client Apps    │ ──────────────► │  Livy Server    │
│                 │                 │                 │
│  - Web Apps     │                 │  - Session Mgmt │
│  - Notebooks    │                 │  - Job Queue    │
│  - Scripts      │                 │  - Security     │
└─────────────────┘                 └─────────────────┘
                                             │
                                             │ Spark Submit
                                             ▼
                                    ┌─────────────────┐
                                    │  Spark Cluster  │
                                    │                 │
                                    │  - Driver       │
                                    │  - Executors    │
                                    │  - Resources    │
                                    └─────────────────┘

Installation and Setup

Prerequisites

# Apache Spark (required)
export SPARK_HOME=/path/to/spark

# Java 8 or 11
java -version

# Python (for PySpark support)
python --version

# R (for SparkR support)
R --version

Installation Methods

1. Download and Install

# Download Livy
wget https://downloads.apache.org/incubator/livy/0.8.0-incubating/apache-livy-0.8.0-incubating-bin.zip
unzip apache-livy-0.8.0-incubating-bin.zip
cd apache-livy-0.8.0-incubating-bin

# Set environment variables
export LIVY_HOME=$(pwd)
export PATH=$LIVY_HOME/bin:$PATH

2. Build from Source

# Clone repository
git clone https://github.com/apache/incubator-livy.git
cd incubator-livy

# Build with Maven
mvn clean package -DskipTests

# Built artifacts in assembly/target/

3. Docker Installation

# Pull Livy Docker image
docker pull apache/livy:0.8.0-incubating

# Run Livy server
docker run -d \
-p 8998:8998 \
-e SPARK_HOME=/opt/spark \
apache/livy:0.8.0-incubating
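
Whichever installation method you use, a quick sanity check is to list sessions on the Livy port. A minimal sketch using Python's requests library (host and port assumed to be the defaults):

import requests

# An empty session list confirms the server is up and answering REST calls.
resp = requests.get("http://localhost:8998/sessions")
resp.raise_for_status()
print(resp.json())  # e.g. {'from': 0, 'total': 0, 'sessions': []}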

Configuration

livy.conf

# Server configuration
livy.server.host = 0.0.0.0
livy.server.port = 8998

# Spark configuration
livy.spark.master = yarn
livy.spark.deploy-mode = cluster

# Session configuration
livy.server.session.timeout-check = true
livy.server.session.timeout = 1h
livy.server.session.state-retain.sec = 600s

# Security
livy.server.auth.type = kerberos
livy.server.launch.kerberos.keytab = /path/to/livy.keytab
livy.server.launch.kerberos.principal = livy/hostname@REALM

# Impersonation
livy.impersonation.enabled = true

# File upload
livy.file.local-dir-whitelist = /tmp/livy-uploads
livy.file.upload.max-size = 100MB

# Recovery
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = filesystem
livy.server.recovery.state-store.url = file:///tmp/livy-recovery

spark-defaults.conf (for Livy)

# Spark configuration for Livy sessions
spark.master=yarn
spark.submit.deployMode=cluster
spark.executor.memory=2g
spark.executor.cores=2
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=10

# Serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer

# History server
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://namenode:port/spark-history

Core Concepts

1. Sessions

Livy manages Spark applications through sessions (request payload sketches follow the lists below):

Interactive Sessions

  • Spark Sessions: For Scala/Java code
  • PySpark Sessions: For Python code
  • SparkR Sessions: For R code
  • SQL Sessions: For SQL queries

Batch Sessions

  • Batch Jobs: Submit complete Spark applications
  • One-time Execution: Jobs run to completion and terminate
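
The kind field in a session-creation request selects the interpreter, while batch jobs are submitted to a separate endpoint. A minimal sketch of the request payloads (values are illustrative):

# Interactive session payloads (POST /sessions); "kind" picks the language.
spark_session   = {"kind": "spark"}     # Scala
pyspark_session = {"kind": "pyspark"}   # Python
sparkr_session  = {"kind": "sparkr"}    # R
sql_session     = {"kind": "sql"}       # SQL

# Batch job payload (POST /batches); the job runs to completion and terminates.
batch_job = {"file": "hdfs:///path/to/app.py", "args": ["--date", "2023-01-01"]}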

2. Session Lifecycle

Created → Starting → Idle → Busy → Idle → ... → Shutting_down → Dead

  • Created: session requested via POST /sessions
  • Starting: the Spark application is starting up
  • Idle: session is ready and waiting for code
  • Busy: session is executing a submitted statement
  • Shutting_down: session deletion requested via DELETE /sessions/{sessionId}
  • Dead: the session has terminated and its resources are cleaned up
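
Clients typically poll the session state and wait for idle before submitting statements. A minimal sketch using the GET /sessions/{sessionId}/state endpoint (URL and session id are illustrative):

import time
import requests

def wait_until_idle(livy_url="http://localhost:8998", session_id=0):
    """Poll the session state until it is ready to accept statements."""
    while True:
        state = requests.get(f"{livy_url}/sessions/{session_id}/state").json()["state"]
        if state == "idle":
            return
        if state in ("error", "dead", "killed"):
            raise RuntimeError(f"Session ended in state: {state}")
        time.sleep(2)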

3. REST API Endpoints

Session Management

  • GET /sessions - List all sessions
  • POST /sessions - Create new session
  • GET /sessions/{sessionId} - Get session info
  • DELETE /sessions/{sessionId} - Delete session

Code Execution

  • POST /sessions/{sessionId}/statements - Execute code
  • GET /sessions/{sessionId}/statements - List statements
  • GET /sessions/{sessionId}/statements/{statementId} - Get statement result

Batch Jobs

  • GET /batches - List batch jobs
  • POST /batches - Submit batch job
  • GET /batches/{batchId} - Get batch info
  • DELETE /batches/{batchId} - Kill batch job

Usage Examples

1. Interactive Sessions

Creating Sessions

# Create Spark session
curl -X POST \
  http://localhost:8998/sessions \
  -H 'Content-Type: application/json' \
  -d '{
    "kind": "spark",
    "conf": {
      "spark.executor.memory": "2g",
      "spark.executor.cores": "2"
    }
  }'

# Response
{
  "id": 0,
  "appId": null,
  "owner": null,
  "proxyUser": null,
  "state": "starting",
  "kind": "spark",
  "appInfo": {
    "driverLogUrl": null,
    "sparkUiUrl": null
  },
  "log": []
}

Executing Code

# Execute Scala code
curl -X POST \
  http://localhost:8998/sessions/0/statements \
  -H 'Content-Type: application/json' \
  -d '{
    "code": "val data = Array(1, 2, 3, 4, 5)\nval distData = sc.parallelize(data)\ndistData.reduce((a, b) => a + b)"
  }'

# Response
{
  "id": 0,
  "code": "val data = Array(1, 2, 3, 4, 5)\nval distData = sc.parallelize(data)\ndistData.reduce((a, b) => a + b)",
  "state": "waiting",
  "output": null,
  "progress": 0.0
}

Getting Results

# Get statement result
curl http://localhost:8998/sessions/0/statements/0

# Response
{
  "id": 0,
  "code": "val data = Array(1, 2, 3, 4, 5)\nval distData = sc.parallelize(data)\ndistData.reduce((a, b) => a + b)",
  "state": "available",
  "output": {
    "status": "ok",
    "execution_count": 0,
    "data": {
      "text/plain": "res0: Int = 15"
    }
  },
  "progress": 1.0
}

2. Python Client Example

import requests
import json
import time

class LivyClient:
    def __init__(self, livy_url="http://localhost:8998"):
        self.livy_url = livy_url
        self.session_id = None

    def create_session(self, kind="pyspark", **conf):
        """Create a new Livy session"""
        data = {
            "kind": kind,
            "conf": conf
        }

        response = requests.post(
            f"{self.livy_url}/sessions",
            headers={"Content-Type": "application/json"},
            data=json.dumps(data)
        )

        if response.status_code == 201:
            session_info = response.json()
            self.session_id = session_info["id"]

            # Wait for session to be ready
            self._wait_for_session_ready()
            return session_info
        else:
            raise Exception(f"Failed to create session: {response.text}")

    def _wait_for_session_ready(self):
        """Wait for session to be in 'idle' state"""
        while True:
            session_info = self.get_session_info()
            state = session_info["state"]

            if state == "idle":
                break
            elif state in ["error", "dead"]:
                raise Exception(f"Session failed with state: {state}")

            time.sleep(2)

    def get_session_info(self):
        """Get session information"""
        response = requests.get(f"{self.livy_url}/sessions/{self.session_id}")
        return response.json()

    def execute_code(self, code):
        """Execute code in the session"""
        data = {"code": code}

        response = requests.post(
            f"{self.livy_url}/sessions/{self.session_id}/statements",
            headers={"Content-Type": "application/json"},
            data=json.dumps(data)
        )

        if response.status_code == 201:
            statement_info = response.json()
            statement_id = statement_info["id"]

            # Wait for execution to complete
            return self._wait_for_statement_completion(statement_id)
        else:
            raise Exception(f"Failed to execute code: {response.text}")

    def _wait_for_statement_completion(self, statement_id):
        """Wait for statement execution to complete"""
        while True:
            response = requests.get(
                f"{self.livy_url}/sessions/{self.session_id}/statements/{statement_id}"
            )
            statement_info = response.json()
            state = statement_info["state"]

            if state == "available":
                return statement_info["output"]
            elif state == "error":
                raise Exception(f"Statement execution failed: {statement_info}")

            time.sleep(1)

    def delete_session(self):
        """Delete the session"""
        if self.session_id:
            response = requests.delete(f"{self.livy_url}/sessions/{self.session_id}")
            return response.status_code == 200

# Usage example
client = LivyClient()

try:
    # Create session
    session = client.create_session(
        kind="pyspark",
        **{
            "spark.executor.memory": "2g",
            "spark.executor.cores": "2"
        }
    )
    print(f"Created session: {session['id']}")

    # Execute PySpark code
    code = """
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Create sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Show results
df.show()
df.count()
"""

    result = client.execute_code(code)
    print("Execution result:", result)

finally:
    # Clean up
    client.delete_session()

3. Batch Job Submission

# Submit batch job
curl -X POST \
  http://localhost:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{
    "file": "hdfs://namenode:port/path/to/my-spark-app.jar",
    "className": "com.example.MySparkApp",
    "args": ["arg1", "arg2"],
    "conf": {
      "spark.executor.memory": "4g",
      "spark.executor.cores": "4"
    }
  }'

# Response
{
  "id": 0,
  "state": "starting",
  "appId": null,
  "appInfo": {
    "driverLogUrl": null,
    "sparkUiUrl": null
  },
  "log": []
}

4. File Upload and Usage

# Upload JAR file
curl -X POST \
  http://localhost:8998/sessions/0/upload-jar \
  -F "jar=@/path/to/my-library.jar"

# Upload Python file
curl -X POST \
  http://localhost:8998/sessions/0/upload-pyfile \
  -F "file=@/path/to/my-module.py"

# Use uploaded files in code
curl -X POST \
  http://localhost:8998/sessions/0/statements \
  -H 'Content-Type: application/json' \
  -d '{
    "code": "import my_module\nmy_module.my_function()"
  }'

Advanced Features

1. Session Configuration

# Advanced session configuration
session_config = {
    "kind": "pyspark",
    "proxyUser": "data_scientist",
    "jars": [
        "hdfs://namenode:port/path/to/library1.jar",
        "hdfs://namenode:port/path/to/library2.jar"
    ],
    "pyFiles": [
        "hdfs://namenode:port/path/to/module1.py",
        "hdfs://namenode:port/path/to/module2.py"
    ],
    "files": [
        "hdfs://namenode:port/path/to/config.properties"
    ],
    "driverMemory": "2g",
    "driverCores": 2,
    "executorMemory": "4g",
    "executorCores": 4,
    "numExecutors": 10,
    "conf": {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
    }
}

2. Security Configuration

Kerberos Authentication

# livy.conf
livy.server.auth.type = kerberos
livy.server.auth.kerberos.principal = HTTP/livy-server@REALM
livy.server.auth.kerberos.keytab = /path/to/livy.keytab

# Enable SPNEGO for web UI
livy.server.auth.kerberos.name-rules = RULE:[2:$1@$0](.*@REALM)s/@.*//
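
With Kerberos/SPNEGO enabled, REST clients have to authenticate every request. One common, non-Livy-specific approach is the requests-kerberos package; a minimal sketch, assuming the client already holds a valid ticket (kinit):

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Attach SPNEGO negotiation to requests against the Kerberized Livy server.
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
resp = requests.get("http://livy-server:8998/sessions", auth=auth)
print(resp.status_code, resp.json())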

SSL Configuration

# Enable SSL
livy.keystore = /path/to/keystore.jks
livy.keystore.password = keystorepassword
livy.key-password = keypassword

# Client certificate authentication
livy.server.auth.type = certificate
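
On the client side, point your HTTP library at the HTTPS endpoint and at the CA bundle that signed the server certificate. A minimal sketch (the paths are illustrative):

import requests

# Verify the Livy server's TLS certificate against a custom CA bundle.
resp = requests.get(
    "https://livy-server:8998/sessions",
    verify="/path/to/ca-bundle.pem",
)
print(resp.json())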

3. Monitoring and Logging

import time
import requests

class LivyMonitor:
    def __init__(self, livy_url):
        self.livy_url = livy_url

    def get_all_sessions(self):
        """Get information about all sessions"""
        response = requests.get(f"{self.livy_url}/sessions")
        return response.json()["sessions"]

    def get_session_logs(self, session_id, from_line=0, size=100):
        """Get session logs"""
        params = {"from": from_line, "size": size}
        response = requests.get(
            f"{self.livy_url}/sessions/{session_id}/log",
            params=params
        )
        return response.json()

    def get_session_metrics(self, session_id):
        """Get session metrics and statistics"""
        session_info = requests.get(
            f"{self.livy_url}/sessions/{session_id}"
        ).json()

        return {
            "session_id": session_id,
            "state": session_info["state"],
            "app_id": session_info.get("appId"),
            "spark_ui_url": session_info.get("appInfo", {}).get("sparkUiUrl"),
            "driver_log_url": session_info.get("appInfo", {}).get("driverLogUrl")
        }

    def monitor_sessions(self, interval=30):
        """Continuously monitor all sessions"""
        while True:
            sessions = self.get_all_sessions()

            print(f"\n=== Session Status at {time.strftime('%Y-%m-%d %H:%M:%S')} ===")
            for session in sessions:
                metrics = self.get_session_metrics(session["id"])
                print(f"Session {metrics['session_id']}: {metrics['state']} "
                      f"(App: {metrics['app_id']})")

            time.sleep(interval)

# Usage
monitor = LivyMonitor("http://localhost:8998")
monitor.monitor_sessions()

4. Integration with Jupyter Notebooks

# Install sparkmagic
pip install sparkmagic

# Configure sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

# Configure Livy endpoint
cat > ~/.sparkmagic/config.json << EOF
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  }
}
EOF
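
Inside a notebook running the regular IPython kernel you can then load the sparkmagic extension and run cells against Livy. A brief sketch (magic names as documented by sparkmagic; the session is created via the %manage_spark widget):

%load_ext sparkmagic.magics
%manage_spark        # widget for adding a Livy endpoint and creating a session

%%spark
df = spark.range(100)
df.count()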

Performance Optimization

1. Session Management

# _create_session, _execute_code, and _delete_session are assumed to wrap the
# REST calls shown in the LivyClient example above and to work with session ids.
class OptimizedLivyClient:
    def __init__(self, livy_url, pool_size=5):
        self.livy_url = livy_url
        self.session_pool = []
        self.pool_size = pool_size
        self._initialize_pool()

    def _initialize_pool(self):
        """Pre-create sessions for better performance"""
        for i in range(self.pool_size):
            session_id = self._create_session()
            self.session_pool.append(session_id)

    def get_session(self):
        """Get an available session id from the pool"""
        if self.session_pool:
            return self.session_pool.pop()
        else:
            return self._create_session()

    def return_session(self, session_id):
        """Return a session to the pool, or delete it if the pool is full"""
        if len(self.session_pool) < self.pool_size:
            self.session_pool.append(session_id)
        else:
            self._delete_session(session_id)

    def execute_with_pool(self, code):
        """Execute code using the session pool"""
        session_id = self.get_session()
        try:
            result = self._execute_code(session_id, code)
            return result
        finally:
            self.return_session(session_id)

2. Batch Processing Optimization

# Optimized batch job configuration
curl -X POST \
  http://localhost:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{
    "file": "hdfs://namenode:port/path/to/optimized-app.jar",
    "className": "com.example.OptimizedSparkApp",
    "conf": {
      "spark.executor.memory": "8g",
      "spark.executor.cores": "4",
      "spark.executor.instances": "20",
      "spark.sql.adaptive.enabled": "true",
      "spark.sql.adaptive.coalescePartitions.enabled": "true",
      "spark.sql.adaptive.skewJoin.enabled": "true",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.dynamicAllocation.enabled": "false"
    }
  }'

Best Practices

1. Session Management

# Best practices for session management
import time

# Helper methods (_create_session, _submit_code, _get_statement_result, etc.)
# are assumed to wrap the REST calls shown in the LivyClient example above.
class BestPracticeLivyClient:
    def __init__(self, livy_url):
        self.livy_url = livy_url
        self.session_timeout = 3600  # 1 hour

    def create_session_with_retry(self, max_retries=3):
        """Create session with retry logic"""
        for attempt in range(max_retries):
            try:
                session = self._create_session()
                return session
            except Exception as e:
                if attempt == max_retries - 1:
                    raise e
                time.sleep(2 ** attempt)  # Exponential backoff

    def execute_with_timeout(self, code, timeout=300):
        """Execute code with timeout"""
        start_time = time.time()
        statement_id = self._submit_code(code)

        while time.time() - start_time < timeout:
            result = self._get_statement_result(statement_id)
            if result["state"] == "available":
                return result["output"]
            elif result["state"] == "error":
                raise Exception(f"Execution failed: {result}")
            time.sleep(1)

        raise TimeoutError(f"Code execution timed out after {timeout} seconds")

    def cleanup_idle_sessions(self, max_idle_time=1800):
        """Clean up sessions that have been idle too long"""
        sessions = self._get_all_sessions()
        current_time = time.time()

        for session in sessions:
            if session["state"] == "idle":
                # Check last activity time
                last_activity = self._get_last_activity_time(session["id"])
                if current_time - last_activity > max_idle_time:
                    self._delete_session(session["id"])

2. Error Handling

import logging
import requests
from enum import Enum

class LivyError(Exception):
    pass

class SessionState(Enum):
    NOT_STARTED = "not_started"
    STARTING = "starting"
    IDLE = "idle"
    BUSY = "busy"
    SHUTTING_DOWN = "shutting_down"
    ERROR = "error"
    DEAD = "dead"

class RobustLivyClient:
    def __init__(self, livy_url):
        self.livy_url = livy_url
        self.logger = logging.getLogger(__name__)

    def execute_code_safely(self, code, session_id=None):
        """Execute code with comprehensive error handling"""
        try:
            if session_id is None:
                session_id = self._get_or_create_session()

            # Check session health
            if not self._is_session_healthy(session_id):
                self.logger.warning(f"Session {session_id} unhealthy, recreating")
                session_id = self._recreate_session(session_id)

            # Execute code
            result = self._execute_code(session_id, code)
            return result

        except requests.exceptions.ConnectionError:
            self.logger.error("Cannot connect to Livy server")
            raise LivyError("Livy server unavailable")

        except requests.exceptions.Timeout:
            self.logger.error("Request to Livy server timed out")
            raise LivyError("Request timeout")

        except Exception as e:
            self.logger.error(f"Unexpected error: {e}")
            raise LivyError(f"Execution failed: {e}")

    def _is_session_healthy(self, session_id):
        """Check if session is in a healthy state"""
        try:
            session_info = self._get_session_info(session_id)
            state = SessionState(session_info["state"])
            return state in [SessionState.IDLE, SessionState.BUSY]
        except Exception:
            return False

3. Resource Management

import logging

class ResourceManagedLivyClient:
    def __init__(self, livy_url, max_concurrent_sessions=10):
        self.livy_url = livy_url
        self.max_concurrent_sessions = max_concurrent_sessions
        self.active_sessions = set()
        self.logger = logging.getLogger(__name__)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.cleanup_all_sessions()

    def create_managed_session(self, **config):
        """Create session with resource limits"""
        if len(self.active_sessions) >= self.max_concurrent_sessions:
            raise LivyError("Maximum concurrent sessions reached")

        session = self._create_session(**config)
        self.active_sessions.add(session["id"])
        return session

    def cleanup_all_sessions(self):
        """Clean up all managed sessions"""
        for session_id in list(self.active_sessions):
            try:
                self._delete_session(session_id)
                self.active_sessions.remove(session_id)
            except Exception as e:
                self.logger.warning(f"Failed to cleanup session {session_id}: {e}")

# Usage with context manager
with ResourceManagedLivyClient("http://localhost:8998") as client:
    session = client.create_managed_session(kind="pyspark")
    result = client.execute_code("spark.range(100).count()")
    # Sessions automatically cleaned up on exit

Troubleshooting

Common Issues and Solutions

1. Session Creation Failures

import requests

def diagnose_session_creation_failure(livy_client):
    """Diagnose why session creation is failing"""

    # Check Livy server status
    try:
        response = requests.get(f"{livy_client.livy_url}/sessions")
        if response.status_code != 200:
            print(f"Livy server error: {response.status_code}")
            return
    except requests.exceptions.ConnectionError:
        print("Cannot connect to Livy server")
        return

    # Check resource availability
    sessions = response.json()["sessions"]
    active_sessions = [s for s in sessions if s["state"] not in ["dead", "error"]]

    print(f"Active sessions: {len(active_sessions)}")

    # Check Spark cluster resources
    for session in active_sessions:
        if session.get("appInfo", {}).get("sparkUiUrl"):
            print(f"Session {session['id']}: {session['appInfo']['sparkUiUrl']}")

2. Performance Issues

def analyze_performance_issues(session_id, livy_client):
    """Analyze performance issues in Livy session"""

    # Get session information
    session_info = livy_client.get_session_info(session_id)

    # Check if Spark UI is available
    spark_ui_url = session_info.get("appInfo", {}).get("sparkUiUrl")
    if spark_ui_url:
        print(f"Check Spark UI: {spark_ui_url}")

    # Get recent logs
    logs = livy_client.get_session_logs(session_id, size=50)

    # Look for common performance indicators
    performance_indicators = [
        "GC overhead limit exceeded",
        "OutOfMemoryError",
        "Task serialization failed",
        "Shuffle fetch failed"
    ]

    for log_line in logs.get("log", []):
        for indicator in performance_indicators:
            if indicator in log_line:
                print(f"Performance issue detected: {log_line}")

3. Memory Issues

# Increase session memory limits
curl -X POST \
  http://localhost:8998/sessions \
  -H 'Content-Type: application/json' \
  -d '{
    "kind": "pyspark",
    "conf": {
      "spark.executor.memory": "8g",
      "spark.driver.memory": "4g",
      "spark.memory.fraction": "0.8",
      "spark.sql.execution.arrow.pyspark.enabled": "true"
    }
  }'

Integration Examples

1. Web Application Integration

from flask import Flask, request, jsonify
import threading
import queue

app = Flask(__name__)

class LivyService:
    def __init__(self):
        # LivyClient is the helper class from the Python client example above
        self.livy_client = LivyClient("http://localhost:8998")
        self.job_queue = queue.Queue()
        self.results = {}
        self.worker_thread = threading.Thread(target=self._worker)
        self.worker_thread.daemon = True
        self.worker_thread.start()

    def _worker(self):
        """Background worker to process jobs"""
        while True:
            job_id, code = self.job_queue.get()
            try:
                result = self.livy_client.execute_code(code)
                self.results[job_id] = {"status": "completed", "result": result}
            except Exception as e:
                self.results[job_id] = {"status": "error", "error": str(e)}
            finally:
                self.job_queue.task_done()

    def submit_job(self, job_id, code):
        """Submit job for execution"""
        self.results[job_id] = {"status": "running"}
        self.job_queue.put((job_id, code))

    def get_result(self, job_id):
        """Get job result"""
        return self.results.get(job_id, {"status": "not_found"})

livy_service = LivyService()

@app.route('/submit', methods=['POST'])
def submit_job():
    data = request.json
    job_id = data['job_id']
    code = data['code']

    livy_service.submit_job(job_id, code)
    return jsonify({"status": "submitted", "job_id": job_id})

@app.route('/result/<job_id>')
def get_result(job_id):
    result = livy_service.get_result(job_id)
    return jsonify(result)

if __name__ == '__main__':
    app.run(debug=True)
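
A quick way to exercise the service from another process (a hedged sketch; assumes the Flask development server is running on its default port 5000):

import requests

# Submit a snippet for execution and poll for its result.
requests.post("http://localhost:5000/submit",
              json={"job_id": "job-1", "code": "spark.range(10).count()"})
print(requests.get("http://localhost:5000/result/job-1").json())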

2. Airflow Integration

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

def submit_spark_job_via_livy(**context):
    """Submit Spark job via Livy"""
    # LivyClient is the helper class from the Python client example above
    livy_client = LivyClient("http://livy-server:8998")

    # Create session
    session = livy_client.create_session(
        kind="pyspark",
        **{
            "spark.executor.memory": "4g",
            "spark.executor.cores": "2"
        }
    )

    try:
        # Execute Spark code
        code = """
# Your Spark ETL code here
df = spark.read.parquet("hdfs://path/to/input")
processed_df = df.groupBy("category").count()
processed_df.write.mode("overwrite").parquet("hdfs://path/to/output")
"""

        result = livy_client.execute_code(code)

        # Check if execution was successful
        if result["status"] == "ok":
            return "Job completed successfully"
        else:
            raise Exception(f"Job failed: {result}")

    finally:
        # Clean up session
        livy_client.delete_session()

# Define DAG
dag = DAG(
    'spark_etl_via_livy',
    default_args={
        'owner': 'data-team',
        'depends_on_past': False,
        'start_date': datetime(2023, 1, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5)
    },
    schedule_interval=timedelta(hours=1),
    catchup=False
)

# Define task
spark_task = PythonOperator(
    task_id='run_spark_etl',
    python_callable=submit_spark_job_via_livy,
    dag=dag
)
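
Newer Airflow releases also ship a Livy provider (apache-airflow-providers-apache-livy) whose LivyOperator submits batch jobs directly, without hand-rolled session handling. A minimal hedged sketch, assuming the provider is installed and a livy_default connection points at the Livy server:

from airflow.providers.apache.livy.operators.livy import LivyOperator

# Submits POST /batches and polls the batch until it finishes.
livy_batch = LivyOperator(
    task_id="run_spark_batch",
    file="hdfs://namenode:port/path/to/my-spark-app.jar",
    class_name="com.example.MySparkApp",
    args=["arg1", "arg2"],
    conf={"spark.executor.memory": "4g"},
    livy_conn_id="livy_default",
    polling_interval=30,
    dag=dag,
)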

Conclusion

Apache Livy provides a powerful REST interface for Apache Spark that enables:

Key Benefits

  • Remote Access: Access Spark clusters without direct access
  • Multi-language Support: Scala, Python, R, and SQL
  • Session Management: Automatic lifecycle management
  • Security: Enterprise-grade authentication and authorization

Best Use Cases

  • Web Applications: Integrate Spark into web services
  • Notebooks: Interactive data science environments
  • Multi-tenant Environments: Shared Spark clusters
  • Microservices: Spark as a service architecture

When to Choose Livy

  • Need REST API access to Spark
  • Multiple users sharing Spark cluster
  • Integration with web applications
  • Remote Spark job submission requirements

Livy bridges the gap between Spark's powerful processing capabilities and modern application architectures, making Spark accessible through standard REST APIs.

Resources

Integration Resources

  • Apache Spark Guide
  • JupyterHub Setup
  • Alluxio Integration