The `.persist()` method in Apache Spark is used to store intermediate data so that Spark doesn't have to recompute it every time. This can make your jobs run much faster when the same data is used in multiple actions.
- Without `.persist()`:
  - Every action (e.g., `count()`, `collect()`) recomputes the entire DataFrame or RDD.
- With `.persist()`:
  - Data is saved in memory (or on disk, depending on the storage level).
  - Subsequent actions reuse this stored data instead of recomputing it.
- After `.unpersist()`:
  - The data is removed from memory/disk, freeing resources.
Storage Levels
Spark provides different storage levels to balance memory use and speed:
- MEMORY_ONLY → Fastest, but data is lost if it doesn’t fit in memory.
- MEMORY_AND_DISK → Stores in memory; spills to disk if too large.
- DISK_ONLY → Slower, but uses less memory.
- Serialized options (e.g., MEMORY_ONLY_SER) → Save space but require CPU to deserialize.
Why Use `.persist()`?
- Reduce latency → Avoid repeating heavy computations.
- Improve throughput → Reuse datasets for multiple actions.
- Stability → Prevent failures from repeatedly recalculating big datasets.
- Control resources → Helps manage memory vs computation trade-offs.