For any data operation in Apache Spark, one of the first tasks is creating a DataFrame. This is done with the `createDataFrame()` API. There are two main things to consider:
- Data Source – Data can come from a list of dictionaries, a list of lists or tuples, or even directly from a pandas DataFrame.
- Schema – The structure of your data. This can be:
  - Auto-inferred: Spark automatically detects column names and data types.
  - Explicit: You define the schema yourself. For simple cases, a string schema works (e.g. `'name STRING, age INT'`). For more complex cases, a `StructType` gives you full control, including nullability.