The Internals of
Structured Query Execution

Apache Spark 2.4.1 / Spark SQL

@jaceklaskowski / StackOverflow / GitHub
The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming

Structured Queries

  1. Structured query is a query over data that is described by a schema
    • In other words, data has a structure
  2. Schema is a tuple of three elements:
    1. Name
    2. Type
    3. Nullability
  3. Remember SQL? It's a Structured Query Language
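
The three elements of a schema map directly onto Spark SQL's StructField. A minimal sketch, assuming spark-sql is on the classpath; the field names "id" and "name" are made up for illustration:

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Each field carries the three elements from the slide: name, type, nullability.
// The field names are hypothetical and only serve the example.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))

// Print the schema as a tree
schema.printTreeString()

// Look up a single field by name and read its three elements
val idField = schema("id")
println(s"${idField.name}: ${idField.dataType.simpleString}, nullable = ${idField.nullable}")
```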

Examples of Structured Queries

    // Dataset API for Scala
    spark.table("t1")
      .join(spark.table("t2"), "id")
      .where($"id" % 2 === 0)

    // SQL
    sql("""SELECT t1.* FROM t1
           INNER JOIN t2 ON t1.id = t2.id
           WHERE t1.id % 2 = 0""")

Why is Structure important?

  1. Storage
  2. The less space needed to store large data sets, the better
  3. Tungsten project (More in the following slides)

Query Languages in Spark SQL

  1. High-level declarative query languages:
    • Good ol' SQL
    • Untyped row-based DataFrame
    • Typed Dataset
  2. Another "axis" is the programming language
    • Scala, Java, Python, R


SQL

  1. Structured Query Language
  2. SQL 92
  3. Hive QL
  4. ANTLR grammar (from Presto)

DataFrames (and Schema)

  1. DataFrame — a collection of rows with a schema
  2. Row and RowEncoder
  3. DataFrame = Dataset[Row]
    • Type alias in Scala
  4. Switch to The Internals of Spark SQL

Datasets (and Encoders)

  1. Dataset - strongly-typed API for working with structured data in Scala
  2. Encoder - Serialization and deserialization API
    • Converts a JVM object of type T to and from an InternalRow
    • ExpressionEncoder is the one and only implementation
  3. Switch to The Internals of Spark SQL
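
A minimal sketch of the encoder round trip described above, assuming the Spark 2.4 API (toRow and fromRow are methods of ExpressionEncoder in that release line and may differ in other versions). Person is a made-up type for illustration:

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Hypothetical domain type used only for this example
case class Person(id: Long, name: String)

// Encoders.product gives an Encoder[Person]; the cast relies on
// ExpressionEncoder being the only implementation
val encoder = Encoders.product[Person].asInstanceOf[ExpressionEncoder[Person]]

// Serialize: JVM object => InternalRow
val row: InternalRow = encoder.toRow(Person(0L, "Jacek"))

// Deserialize: InternalRow => JVM object
// (the encoder has to be resolved and bound to its schema first)
val person: Person = encoder.resolveAndBind().fromRow(row)

println(encoder.schema.simpleString)
println(person)
```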

Project Catalyst

  1. Catalyst — Tree Manipulation Framework
  2. TreeNode with child nodes
    • Children are TreeNodes again
    • Recursive data structure
  3. Rules to manipulate TreeNodes
    • Rules executed sequentially
    • Loops supported
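
"Rules executed sequentially" and "loops supported" can be sketched with a toy rule executor in plain Scala. This is a simplified model, not Spark's RuleExecutor; Batch, execute and the two rules are hypothetical names, and a "plan" is just a list of operator names:

```scala
// Toy model only: a batch is a named list of rules with an iteration limit
final case class Batch[T](name: String, maxIterations: Int, rules: Seq[T => T])

// Runs batches in order; inside a batch the rules run one after another,
// and the batch repeats until the plan stops changing or the limit is hit
def execute[T](plan: T, batches: Seq[Batch[T]]): T =
  batches.foldLeft(plan) { (current, batch) =>
    var result = current
    var iteration = 0
    var changed = true
    while (changed && iteration < batch.maxIterations) {
      val next = batch.rules.foldLeft(result)((p, rule) => rule(p))
      changed = next != result
      result = next
      iteration += 1
    }
    result
  }

// Rule 1: merge two Projects at the top of the plan (one pair per pass)
val collapseProjects: List[String] => List[String] = {
  case "Project" :: "Project" :: rest => "Project" :: rest
  case other => other
}
// Rule 2: drop operators that do nothing
val removeNoOps: List[String] => List[String] = _.filterNot(_ == "NoOp")

val plan = List("NoOp", "Project", "Project", "Project", "Scan")
val rules = Seq(collapseProjects, removeNoOps)

// A single pass leaves work undone; looping reaches the fixed point
val once = execute(plan, Seq(Batch("once", 1, rules)))
val fixedPoint = execute(plan, Seq(Batch("fixed point", 100, rules)))
println(once)       // List(Project, Project, Project, Scan)
println(fixedPoint) // List(Project, Scan)
```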


TreeNode

  1. TreeNode is a node with child nodes (children)
    • Recursive data structure
    • Builds a tree of TreeNodes

What a nice class definition in Scala, isn't it?
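
To make the recursive structure concrete, here is a toy model in plain Scala. It is not Spark's TreeNode, which has many more members; Node, Lit, Attr, Plus and foldConstants are hypothetical names, and only the idea of a self-typed node with children and a bottom-up transform is borrowed:

```scala
// Toy model of a recursive tree node; T is the concrete node family
abstract class Node[T <: Node[T]] { self: T =>
  def children: Seq[T]
  def withNewChildren(newChildren: Seq[T]): T

  // Rewrites the tree bottom-up: children first, then this node
  def transformUp(rule: PartialFunction[T, T]): T = {
    val rewritten: T =
      if (children.isEmpty) this
      else withNewChildren(children.map(_.transformUp(rule)))
    if (rule.isDefinedAt(rewritten)) rule(rewritten) else rewritten
  }
}

abstract class Expr extends Node[Expr]

case class Lit(value: Int) extends Expr {
  def children: Seq[Expr] = Nil
  def withNewChildren(newChildren: Seq[Expr]): Expr = this
}
case class Attr(name: String) extends Expr {
  def children: Seq[Expr] = Nil
  def withNewChildren(newChildren: Seq[Expr]): Expr = this
}
case class Plus(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
  def withNewChildren(newChildren: Seq[Expr]): Expr =
    Plus(newChildren(0), newChildren(1))
}

// A rule is a partial function over nodes: fold additions of two literals
val foldConstants: PartialFunction[Expr, Expr] = {
  case Plus(Lit(a), Lit(b)) => Lit(a + b)
}

val tree = Plus(Attr("id"), Plus(Lit(1), Lit(2)))
val folded = tree.transformUp(foldConstants)
println(folded) // Plus(Attr(id),Lit(3))
```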


Query Plans

  1. QueryPlan — the base node for relational operators
  2. LogicalPlan — the base logical query plan
  3. SparkPlan — the base physical query plan
  4. Switch to The Internals of Spark SQL

Example of TreeNode (SparkPlan)

Example of TreeNode (explain output)

Catalyst Expressions

  1. Expression — an executable node (in a Catalyst tree)
  2. Evaluates to a JVM object given InternalRow

    def eval(input: InternalRow = null): Any
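
A small sketch of calling eval directly, assuming spark-catalyst is on the classpath. These are internal Catalyst classes, so treat the exact constructors as version-specific (checked against the Spark 2.4 line):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Add, BoundReference, Literal}
import org.apache.spark.sql.types.IntegerType

// An expression over literals needs no input row (input defaults to null)
val constant = Add(Literal(1), Literal(2))
val three = constant.eval()
println(three) // 3

// BoundReference reads the field at the given position of the input row
val plusOne = Add(BoundReference(0, IntegerType, nullable = false), Literal(1))
val fortyTwo = plusOne.eval(InternalRow(41))
println(fortyTwo) // 42
```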


Example: RemoveAllHints Logical Rule


Query Execution

  1. QueryExecution - the heart of any structured query
    • Structured Query Execution Pipeline
    • Execution Phases
  2. Use Dataset.explain to see the plans
  3. Use Dataset.queryExecution to access the phases
  4. QueryExecution.toRdd to generate an RDD (more later)
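
The phases can be inspected one by one. A minimal sketch, assuming a local SparkSession (in spark-shell the builder call simply returns the existing session); the application name is arbitrary:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder
  .master("local[*]")
  .appName("query-execution-demo") // arbitrary name for this example
  .getOrCreate()

val q = spark.range(10).where(col("id") === 4)

// Prints the parsed, analyzed and optimized logical plans and the physical plan
q.explain(extended = true)

// The same phases as objects
val qe = q.queryExecution
println(qe.logical)       // logical plan as written
println(qe.analyzed)      // after Spark Analyzer
println(qe.optimizedPlan) // after the logical optimizer
println(qe.sparkPlan)     // physical plan chosen by Spark Planner
println(qe.executedPlan)  // physical plan after the preparation rules
```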

QueryExecution Pipeline

Spark Analyzer

  1. Spark Analyzer - Validates logical query plans and expressions
  2. RuleExecutor of LogicalPlans
  3. Once a structured query has passed Spark Analyzer, it is considered valid and will be executed
  4. Allows for registering new rules

Catalyst (Base) Optimizer

  1. Catalyst Query Optimizer
    • Base of query optimizers
    • Optimizes logical plans and expressions
    • Predefined batches of rules
  2. Cost-Based Optimization
  3. Allows for extendedOperatorOptimizationRules
  4. Switch to The Internals of Spark SQL

Spark Logical Optimizer

  1. Spark Logical Optimizer
    • Custom Catalyst query optimizer
    • Adds new optimization rules
  2. Defines extension points
    • preOptimizationBatches
    • postHocOptimizationBatches
    • ExperimentalMethods.extraOptimizations
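
A minimal sketch of the extraOptimizations extension point, assuming a local SparkSession. CountingRule is a hypothetical rule that leaves the plan untouched and only counts how often the optimizer calls it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

val spark = SparkSession.builder
  .master("local[*]")
  .appName("extra-optimizations-demo") // arbitrary name for this example
  .getOrCreate()

// Hypothetical rule: returns the plan unchanged and counts its invocations
object CountingRule extends Rule[LogicalPlan] {
  @volatile var invocations = 0
  def apply(plan: LogicalPlan): LogicalPlan = {
    invocations += 1
    plan
  }
}

// Register the rule with the session's logical optimizer
spark.experimental.extraOptimizations = Seq(CountingRule)

// Optimizing any query now runs the rule as well
val optimized = spark.range(1).queryExecution.optimizedPlan
println(s"CountingRule was applied ${CountingRule.invocations} time(s)")
```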

Spark Planner

  1. Spark Planner
    • Plans optimized logical plans to physical plans
  2. At least one physical plan for any given logical plan
    • Exactly one in Spark 2.3
  3. Defines extension point
    • ExperimentalMethods.extraStrategies
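
A minimal sketch of the extraStrategies extension point, assuming a local SparkSession. NoopStrategy is a hypothetical strategy that never produces a physical plan, so planning falls through to the built-in strategies:

```scala
import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

val spark = SparkSession.builder
  .master("local[*]")
  .appName("extra-strategies-demo") // arbitrary name for this example
  .getOrCreate()

// Hypothetical strategy: an empty result means "this strategy does not apply"
object NoopStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

// Register the strategy with the session's Spark Planner
spark.experimental.extraStrategies = Seq(NoopStrategy)

// Queries are still planned by the built-in strategies
val physicalPlan = spark.range(1).queryExecution.sparkPlan
println(physicalPlan)
```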

Spark Physical Optimizer (Preparations)

  1. Physical query optimizations (aka preparation rules)
    • Whole-Stage Code Generation
    • Reuse Exchanges and Subqueries
    • EnsureRequirements (with Bucketing)
  2. Note the type Rule[SparkPlan]
    • Rules take a SparkPlan and produce a SparkPlan


SparkSessionExtensions

  • Hooks and Extension Points
  • Customize a SparkSession with user-defined extensions
    • Custom query execution rules
    • Custom relational entity parser
  • injectCheckRule
  • injectOptimizerRule
  • injectParser
  • injectPlannerStrategy
  • injectPostHocResolutionRule
  • injectResolutionRule
  • Use Builder.withExtensions or spark.sql.extensions

SparkSessionExtensions Example

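        import org.apache.spark.sql.SparkSession
        val spark = SparkSession
          .builder
          .withExtensions { extensions =>
            extensions.injectResolutionRule { session =>
              ??? // placeholder: return your Rule[LogicalPlan] here
            }
            extensions.injectOptimizerRule { session =>
              ??? // placeholder: return your Rule[LogicalPlan] here
            }
          }
          .getOrCreate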
Spark Core's RDD

  1. Anything executable on Spark has to be an RDD
    • ...or simply a job (of stages)
  2. RDD describes a distributed computation
  3. RDDs live in SparkContext on the driver
  4. RDD is composed of partitions and compute method
    • Partitions are parts of your data
    • compute method is the code you wrote
  5. Partitions become tasks at runtime
  6. Task is your code executed on a part of data
  7. Tasks are executed on executors
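
The relationship between partitions and tasks can be seen with a tiny RDD. A minimal sketch, assuming a local SparkSession; the numbers are arbitrary:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("rdd-demo") // arbitrary name for this example
  .getOrCreate()
val sc = spark.sparkContext

// 8 numbers split into 4 partitions; the function passed to map is "your code"
val rdd = sc.parallelize(0 until 8, numSlices = 4).map(_ * 2)

// One task per partition will run once an action is called
println(rdd.getNumPartitions) // 4

// collect is an action: it submits a job and brings the results to the driver
val doubled = rdd.collect()
println(doubled.mkString(", ")) // 0, 2, 4, 6, 8, 10, 12, 14
```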

(No) RDD at runtime

RDD Lineage

  1. RDD lineage shows an RDD with its dependencies
  2. RDD lineage is a graph of computation steps
  3. RDD.toDebugString
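
A minimal sketch of inspecting a lineage, assuming a local SparkSession. The shuffle introduced by reduceByKey shows up as a separate step in the output:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("rdd-lineage-demo") // arbitrary name for this example
  .getOrCreate()
val sc = spark.sparkContext

// Three computation steps: parallelize, map, reduceByKey
val sums = sc.parallelize(0 until 8, numSlices = 4)
  .map(n => (n % 2, n))
  .reduceByKey(_ + _)

// The lineage as text, from the final RDD back to its source
val lineage = sums.toDebugString
println(lineage)

// The direct parents of the final RDD
println(sums.dependencies)
```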

QueryExecution Pipeline...Again

Structured Queries and RDDs

  1. QueryExecution.toRdd - the very last phase in a query execution
  2. Spark SQL generates an RDD to execute a structured query
  3. Spark SQL uses higher-level structured queries to express RDD-based distributed computations
    • RDD API is like assembler (or JVM bytecode)
    • Dataset API is like Scala or Python
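
A minimal sketch of the last phase, assuming a local SparkSession: toRdd exposes the RDD of internal rows that Spark SQL generated for the query, and its lineage can be inspected like any other RDD's:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder
  .master("local[*]")
  .appName("to-rdd-demo") // arbitrary name for this example
  .getOrCreate()

// A structured query...
val q = spark.range(4).where(col("id") % 2 === 0)

// ...and the RDD generated to execute it
val rdd: RDD[InternalRow] = q.queryExecution.toRdd
println(rdd.toDebugString)

// Running an action on the RDD executes the query: ids 0 and 2 match
val matching = rdd.count()
println(matching) // 2
```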

Debugging Query Execution

  1. debug Scala package object with debug and debugCodegen
      import org.apache.spark.sql.execution.debug._
      val q = spark.range(10).where('id === 4)
      q.debug()
      // The same query in SQL
      val q2 = sql("SELECT * FROM RANGE(10) WHERE id = 4")
      q2.debugCodegen()
  2. Switch to The Internals of Spark SQL

Whole-Stage Java Code Generation

  1. Whole-Stage CodeGen physical query optimization
  2. Collapses a query plan tree into a single optimized function
  3. Applied to structured queries through CollapseCodegenStages physical optimization
    • spark.sql.codegen.wholeStage internal property
  4. Switch to The Internals of Spark SQL
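
Whether a query was collapsed can be checked on the executed plan. A minimal sketch, assuming a local SparkSession and the Spark 2.4 class name WholeStageCodegenExec; the property is flipped only for the demonstration and restored afterwards:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.WholeStageCodegenExec
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder
  .master("local[*]")
  .appName("whole-stage-codegen-demo") // arbitrary name for this example
  .getOrCreate()

// True if the executed plan of a fresh query has a WholeStageCodegenExec node
def usesWholeStageCodegen(): Boolean =
  spark.range(10).where(col("id") === 4)
    .queryExecution.executedPlan
    .find(_.isInstanceOf[WholeStageCodegenExec])
    .isDefined

val key = "spark.sql.codegen.wholeStage"
val original = spark.conf.get(key)

spark.conf.set(key, "true")
val enabled = usesWholeStageCodegen()

spark.conf.set(key, "false")
val disabled = usesWholeStageCodegen()

// Restore the previous setting
spark.conf.set(key, original)
println(s"with codegen: $enabled, without codegen: $disabled")
```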

Whole-Stage Code Gen in Web UI

CollapseCodegenStages Optimization Rule

Tungsten Execution Backend

  1. Tungsten Execution Backend (aka Project Tungsten)
    • Optimizing Spark jobs for CPU and memory efficiency
    • It is assumed that network and disk I/O are not performance bottlenecks
  2. InternalRow data abstraction
  3. UnsafeRow
    • Backed by raw memory
    • Uses sun.misc.Unsafe
  4. Switch to The Internals of Spark SQL
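
A small sketch of turning a generic InternalRow into an UnsafeRow, assuming spark-catalyst is on the classpath. These are internal classes, so the exact API is version-specific (checked against the Spark 2.4 line):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{DataType, LongType}

// A projection that converts rows with a single long field to UnsafeRow
val toUnsafe = UnsafeProjection.create(Array[DataType](LongType))

val unsafeRow: UnsafeRow = toUnsafe(InternalRow(42L))

// Values are read straight from the row's backing memory
println(unsafeRow.getLong(0)) // 42

// One 8-byte word for the null bitset plus one 8-byte word for the long value
println(unsafeRow.getSizeInBytes) // 16
```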


  1. Structured Queries
  2. Query Languages
  3. Project Catalyst
  4. Query Execution
  5. SparkSessionExtensions
  6. Spark Core's RDD
  7. Debugging Query Execution
  8. Whole-Stage Java Code Generation
  9. Tungsten Execution Backend