
For almost any data operation in Apache Spark, one of the first tasks is creating a DataFrame. This is done with the createDataFrame() API. There are two main things to consider:

  1. Data Source – Data can come from a list of dictionaries, a list of lists or tuples, or even directly from a Pandas DataFrame.
  2. Schema – The structure of your data. This can be:
    • Auto-Inferred: Spark automatically detects column names and data types.
    • Explicit: You define the schema yourself. For simple cases, a DDL-style string schema works (e.g. 'name STRING, age INT'). For more complex cases, a StructType gives you full control, including over nullability.