For any data operation in Apache Spark, one of the first tasks is creating a DataFrame. This is done with the `createDataFrame()` API, called on a `SparkSession`. There are two main things to consider:
- Data Source – Data can come from a list of dictionaries, a list of lists or tuples, or even directly from a Pandas DataFrame.
- Schema – The structure of your data. This can be:
  - Auto-inferred: Spark automatically detects column names and data types.
  - Explicit: You define the schema yourself. For simple cases, a string schema works (e.g. `'name STRING, age INT'`). For more complex cases, a `StructType` gives you full control, including nullability.