Spark — Save Dataset In Memory Outside Heap
This article is for readers who already have some familiarity with Spark and Dataset/DataFrame. I am going to show how to persist a DataFrame in off-heap memory, so the executors' heap memory is not used for the persisted data. The examples below were coded and executed from the Scala spark-shell, so you may see settings specific to that. By the way, use persist or cache only when needed, e.g. when running multiple actions over the same DataFrame/Dataset.
Enable Off Heap Storage
By default, off-heap memory is disabled. You can enable it by setting the configurations below:
- spark.memory.offHeap.size — Off heap size in bytes
- spark.memory.offHeap.enabled — value must be true to enable off heap storage
Read more about these at — https://spark.apache.org/docs/latest/configuration.html#memory-management
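If you are building a standalone application rather than using the shell, the same settings can also be supplied programmatically when constructing the SparkSession. A minimal sketch follows; the app name and the off-heap size are just illustrative values:
import org.apache.spark.sql.SparkSession

// Build a SparkSession with off-heap storage enabled (illustrative values)
val spark = SparkSession.builder()
  .appName("offheap-demo")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "1000000000") // off-heap size in bytes (~1 GB here)
  .getOrCreate()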
You can enable these settings:
- In spark-shell, use the command
spark-shell --conf "spark.memory.offHeap.size=1000000000" --conf "spark.memory.offHeap.enabled=true"
- While using spark-submit, pass the same --conf flags (a sample invocation follows this list)
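For illustration, a spark-submit invocation might look like the following; the main class and jar names here are hypothetical placeholders:
spark-submit \
  --class com.example.OffHeapDemo \
  --conf "spark.memory.offHeap.size=1000000000" \
  --conf "spark.memory.offHeap.enabled=true" \
  my-app.jar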
Sample Data
We will use a small sample file to test. The data is stored in the file sparkdata.txt.
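The original sample file is not reproduced here. As a purely hypothetical illustration, a semicolon-delimited file with a header row (matching the read options used below) could look like this:
id;name;city
1;Alice;London
2;Bob;Paris
3;Charlie;Berlin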
Read Data And Persist To Off Heap
- Read
val data = spark.read.format("csv").option("header", "true").option("delimiter", ";").load("sparkdata.txt")
- Persist
import org.apache.spark.storage._
data.persist(StorageLevel.OFF_HEAP)
- Show (or any other action)
data.show
Validate The DataFrame Was Read From Off Heap For The Action
Open the Spark UI and go to the Storage tab, then check the Storage Level of the persisted entry. By the way, you can open the Spark UI for spark-shell too. Note that the UI shows both persisted Datasets and RDDs under the RDDs section.
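If you prefer to verify from code rather than the UI, the storage level of a persisted Dataset can also be inspected directly; a small sketch using the same data value as above:
// Inspect the effective storage level of the persisted Dataset
val level = data.storageLevel
println(level)            // prints the StorageLevel description
println(level.useOffHeap) // true when off-heap memory is used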
Want to experiment more?
- Unpersist the data with data.unpersist, then check Spark UI -> Storage tab. It will be blank as no data is persisted now.
data.unpersist
- Now persist the same data using a storage level that does not use off-heap memory, e.g. DISK_ONLY
data.persist(StorageLevel.DISK_ONLY)
- Perform an action, e.g. show
data.show
- Check the Spark UI -> Storage tab -> Storage Level of the entry there. Compare this storage level with the one we saw above for off heap; a short sketch of the difference follows.
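As a rough illustration, the difference is also visible in the flags of the two StorageLevel constants themselves:
import org.apache.spark.storage.StorageLevel

// OFF_HEAP allows off-heap memory; DISK_ONLY uses only disk
println(StorageLevel.OFF_HEAP.useOffHeap)   // true
println(StorageLevel.DISK_ONLY.useOffHeap)  // false
println(StorageLevel.DISK_ONLY.useDisk)     // true
println(StorageLevel.DISK_ONLY.useMemory)   // false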