Spark — Save Dataset In Memory Outside Heap
This article is for readers who already have some familiarity with Spark and Dataset/DataFrame. I am going to show how to persist a DataFrame in off-heap memory, so the executors' heap memory is not used for the persisted data. The examples below were coded and executed from the Scala spark-shell, so you may see settings specific to that. By the way, use persist or cache only when needed, e.g. when running multiple actions over the same DataFrame/Dataset.
Enable Off Heap Storage
By default, off-heap memory is disabled. You can enable it by setting the configurations below:
- spark.memory.offHeap.size — Off heap size in bytes
- spark.memory.offHeap.enabled — value must be true to enable off heap storage
Read more about these at — https://spark.apache.org/docs/latest/configuration.html#memory-management
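If you are building a standalone application rather than using the shell, the same settings can also be supplied programmatically when constructing the SparkSession. A minimal sketch follows; the app name and the off-heap size are just illustrative values:
import org.apache.spark.sql.SparkSession

// Build a SparkSession with off-heap storage enabled (illustrative values)
val spark = SparkSession.builder()
  .appName("offheap-demo")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "1000000000") // off-heap size in bytes (~1 GB here)
  .getOrCreate()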
You can enable these settings:
- In spark-shell, use the command
spark-shell --conf "spark.memory.offHeap.size=1000000000" --conf "spark.memory.offHeap.enabled=true"
- While using spark-submit, pass the same --conf flags (a sample invocation follows this list)
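For illustration, a spark-submit invocation might look like the following; the main class and jar names here are hypothetical placeholders:
spark-submit \
  --class com.example.OffHeapDemo \
  --conf "spark.memory.offHeap.size=1000000000" \
  --conf "spark.memory.offHeap.enabled=true" \
  my-app.jar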
Sample Data
We will use a small sample file to test. The data is stored in the file sparkdata.txt.
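The original sample file is not reproduced here. As a purely hypothetical illustration, a semicolon-delimited file with a header row (matching the read options used below) could look like this:
id;name;city
1;Alice;London
2;Bob;Paris
3;Charlie;Berlin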
Read Data And Persist To Off Heap
- Read
val data = spark.read.format("csv").option("header", "true").option("delimiter", ";").load("sparkdata.txt")
- Persist
import org.apache.spark.storage._
data.persist(StorageLevel.OFF_HEAP)
- Show (or any other action)
data.show
Validate The DataFrame Was Read From Off Heap For The Action
Open the Spark UI and go to the Storage tab, then check the Storage Level of the persisted entry. By the way, you can open the Spark UI for spark-shell too. Note that the UI shows both persisted Datasets and RDDs under the RDDs section.
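If you prefer to verify from code rather than the UI, the storage level of a persisted Dataset can also be inspected directly; a small sketch using the same data value as above:
// Inspect the effective storage level of the persisted Dataset
val level = data.storageLevel
println(level)            // prints the StorageLevel description
println(level.useOffHeap) // true when off-heap memory is used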
Want to experiment more?
- Unpersist the data with data.unpersist, then check Spark UI -> Storage tab. It will be blank as no data is persisted now.
data.unpersist
- Now persist the same data using a storage level that does not use off-heap memory, e.g. DISK_ONLY
data.persist(StorageLevel.DISK_ONLY)
- Perform an action, e.g. show
data.show
- Check the Spark UI -> Storage tab -> Storage Level of the entry there. Compare this storage level with the one we saw above for off heap; a short sketch of the difference follows.
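As a rough illustration, the difference is also visible in the flags of the two StorageLevel constants themselves:
import org.apache.spark.storage.StorageLevel

// OFF_HEAP allows off-heap memory; DISK_ONLY uses only disk
println(StorageLevel.OFF_HEAP.useOffHeap)   // true
println(StorageLevel.DISK_ONLY.useOffHeap)  // false
println(StorageLevel.DISK_ONLY.useDisk)     // true
println(StorageLevel.DISK_ONLY.useMemory)   // false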