CleanerCrafter

The CleanerCrafter handles missing values in a dataset by either filling them or dropping the affected rows, depending on the selected strategy.

Overview

from mlfcrafter import CleanerCrafter

crafter = CleanerCrafter(
    strategy="auto",
    str_fill="missing",
    int_fill=0.0
)

Parameters

strategy (str)

Default: "auto"

Available strategies:

  • "auto": Choose the fill value by data type (numerical columns use int_fill, categorical columns use str_fill)
  • "mean": Fill numerical columns with the column mean (categorical columns are left unchanged)
  • "median": Fill numerical columns with the column median (categorical columns are left unchanged)
  • "mode": Fill all columns with the most frequent value
  • "drop": Drop rows containing any missing values
  • "constant": Fill with constant values (str_fill for strings, int_fill for numbers)

str_fill (str)

Default: "missing"

Fill value for categorical/string columns.

int_fill (float)

Default: 0.0

Fill value for numerical columns.
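The column-level strategies can be illustrated with a small, dependency-free sketch. It is not the library's implementation: fill_column is a hypothetical helper, columns are plain Python lists with None marking a missing value, and the type check is a simplifying assumption. The "drop" strategy is left out because it acts on whole rows rather than on a single column.

```python
from statistics import mean, median, mode


def fill_column(values, strategy="auto", str_fill="missing", int_fill=0.0):
    """Return a copy of `values` with None entries filled per `strategy`.

    Hypothetical helper that mirrors the documented semantics only.
    """
    present = [v for v in values if v is not None]
    # Assumption: a column is "numerical" if every present value is int/float.
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )

    if strategy in ("auto", "constant"):
        fill = int_fill if numeric else str_fill
    elif strategy == "mean":
        fill = mean(present) if numeric else None    # categorical: unchanged
    elif strategy == "median":
        fill = median(present) if numeric else None  # categorical: unchanged
    elif strategy == "mode":
        fill = mode(present) if present else None    # most frequent value
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")

    # A fill of None leaves missing entries as they were.
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 70]
cities = ["Oslo", None, "Oslo", "Rome"]

print(fill_column(ages, "auto"))    # [20, 0.0, 30, 70]
print(fill_column(ages, "mean"))    # [20, 40, 30, 70]
print(fill_column(ages, "median"))  # [20, 30, 30, 70]
print(fill_column(cities, "mode"))  # ['Oslo', 'Oslo', 'Oslo', 'Rome']
print(fill_column(cities, "mean"))  # ['Oslo', None, 'Oslo', 'Rome']
```

Note that "mean" and "median" leave the gap in cities untouched, so a dataset with categorical columns may still contain missing values after either of those strategies.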

Context Input

  • data: Dataset to clean (required)

Context Output

  • data: Cleaned dataset
  • cleaned_shape: Shape after cleaning
  • missing_values_handled: Flag indicating cleaning was performed
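The input/output contract above can be sketched as a function over a context dictionary. This is an illustration under stated assumptions, not the crafter's actual code: clean_context is a hypothetical stand-in, the dataset is a list of row dictionaries rather than the library's own data structure, only the "drop" strategy is modeled, and both the ValueError and the pass-through of unrelated keys are assumed behavior.

```python
def clean_context(context):
    """Hypothetical stand-in for a cleaning step using the "drop" strategy."""
    if "data" not in context:
        # Assumption: a missing required key is reported as an error.
        raise ValueError("context must contain 'data'")

    rows = context["data"]
    # Keep only rows with no missing (None) values.
    kept = [row for row in rows if all(v is not None for v in row.values())]
    n_cols = len(rows[0]) if rows else 0

    # Return a new context: other keys pass through, documented keys are set.
    return {
        **context,
        "data": kept,
        "cleaned_shape": (len(kept), n_cols),
        "missing_values_handled": True,
    }


context = {
    "data": [
        {"age": 20, "city": "Oslo"},
        {"age": None, "city": "Rome"},
    ],
    "source": "data.csv",  # unrelated key, assumed to be preserved
}
out = clean_context(context)
print(out["cleaned_shape"])           # (1, 2)
print(out["missing_values_handled"])  # True
```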

Example Usage

from mlfcrafter import MLFChain
from mlfcrafter.crafters import CleanerCrafter, DataIngestCrafter

# Automatic cleaning
pipeline = MLFChain()
pipeline.add_crafter(DataIngestCrafter("data.csv"))
pipeline.add_crafter(CleanerCrafter(strategy="auto"))
result = pipeline.run()

# Mean imputation for numerical columns
pipeline = MLFChain()
pipeline.add_crafter(DataIngestCrafter("data.csv"))
pipeline.add_crafter(CleanerCrafter(strategy="mean"))
result = pipeline.run()

# Custom fill values
pipeline = MLFChain()
pipeline.add_crafter(DataIngestCrafter("data.csv"))
pipeline.add_crafter(CleanerCrafter(
    strategy="constant",
    str_fill="Unknown",
    int_fill=-1
))
result = pipeline.run()