Model Training¶
Learn how to train and evaluate machine learning models using MLFCrafter's ModelCrafter.
ModelCrafter Overview¶
The ModelCrafter
handles model selection, training, and evaluation. It supports multiple algorithms and provides train/test splitting with evaluation.
Supported Algorithms¶
MLFCrafter supports three main algorithms:
from mlfcrafter import ModelCrafter
# Random Forest (default)
model = ModelCrafter(model_name="random_forest")
# Logistic Regression
model = ModelCrafter(model_name="logistic_regression")
# XGBoost
model = ModelCrafter(model_name="xgboost")
Basic Training¶
Simple Model Training¶
from mlfcrafter import MLFChain, DataIngestCrafter, CleanerCrafter, ModelCrafter
# Train a Random Forest model
pipeline = MLFChain(
DataIngestCrafter("data.csv"),
CleanerCrafter(strategy="auto"),
ModelCrafter(model_name="random_forest")
)
result = pipeline.run()
print(f"Training accuracy: {result['train_score']:.4f}")
print(f"Test accuracy: {result['test_score']:.4f}")
Model Parameters¶
Random Forest Parameters¶
# Customize Random Forest
model = ModelCrafter(
model_name="random_forest",
model_params={
"n_estimators": 100, # Number of trees
"max_depth": 10, # Maximum depth
"min_samples_split": 5, # Min samples to split
"random_state": 42 # For reproducibility
}
)
XGBoost Parameters¶
# Customize XGBoost
model = ModelCrafter(
model_name="xgboost",
model_params={
"n_estimators": 200, # Number of boosting rounds
"learning_rate": 0.1, # Learning rate
"max_depth": 6, # Tree depth
"random_state": 42
}
)
Logistic Regression Parameters¶
# Customize Logistic Regression
model = ModelCrafter(
model_name="logistic_regression",
model_params={
"C": 1.0, # Regularization strength
"max_iter": 1000, # Maximum iterations
"random_state": 42
}
)
Training Configuration¶
Data Splitting¶
model = ModelCrafter(
model_name="random_forest",
test_size=0.2, # 20% for testing
random_state=61, # Reproducible splits
stratify=True # Maintain class balance
)
Complete Training Pipeline¶
# Full training pipeline
pipeline = MLFChain(
# Data processing
DataIngestCrafter("customer_data.csv"),
CleanerCrafter(strategy="median"),
ScalerCrafter(scaler_type="standard"),
# Model training
ModelCrafter(
model_name="xgboost",
model_params={
"n_estimators": 150,
"learning_rate": 0.05,
"max_depth": 6
},
test_size=0.25,
stratify=True
),
# Model evaluation
ScorerCrafter(
metrics=["accuracy", "precision", "recall", "f1"]
)
)
result = pipeline.run()
# Print results
print(f"Model: {result['model_name']}")
print(f"Features: {len(result['features'])}")
print(f"Test accuracy: {result['test_score']:.4f}")
print("\nDetailed metrics:")
for metric, score in result['scores'].items():
print(f" {metric}: {score:.4f}")
Model Comparison¶
Compare different algorithms:
models = {
"Random Forest": ModelCrafter(
model_name="random_forest",
model_params={"n_estimators": 100}
),
"XGBoost": ModelCrafter(
model_name="xgboost",
model_params={"learning_rate": 0.1}
),
"Logistic Regression": ModelCrafter(
model_name="logistic_regression",
model_params={"C": 1.0}
)
}
results = {}
for name, model in models.items():
pipeline = MLFChain(
DataIngestCrafter("data.csv"),
CleanerCrafter(strategy="auto"),
ScalerCrafter(scaler_type="standard"),
model,
ScorerCrafter()
)
result = pipeline.run()
results[name] = result['scores']
# Compare results
print("Model Comparison:")
for name, scores in results.items():
print(f"{name}: Accuracy={scores['accuracy']:.4f}, F1={scores['f1']:.4f}")
Model Evaluation¶
Using ScorerCrafter¶
# Evaluate with multiple metrics
pipeline = MLFChain(
# ... other crafters ...
ScorerCrafter(
metrics=["accuracy", "precision", "recall", "f1"]
)
)
result = pipeline.run()
# Access detailed scores
scores = result['scores']
print(f"Accuracy: {scores['accuracy']:.4f}")
print(f"Precision: {scores['precision']:.4f}")
print(f"Recall: {scores['recall']:.4f}")
print(f"F1-Score: {scores['f1']:.4f}")
Custom Evaluation¶
# Just specific metrics
scorer = ScorerCrafter(metrics=["accuracy", "f1"])
# Or just accuracy
scorer = ScorerCrafter(metrics=["accuracy"])
Algorithm Selection Guide¶
Choose Random Forest when:¶
- Need good baseline performance
- Want feature importance
- Working with mixed data types
- Don't want to tune many parameters
Choose XGBoost when:¶
- Want maximum performance
- Have time for parameter tuning
- Working with structured data
- Need to handle missing values automatically
Choose Logistic Regression when:¶
- Need interpretable results
- Want fast training/prediction
- Working with linear relationships
- Building baseline models
Complete Example¶
from mlfcrafter import MLFChain, DataIngestCrafter, CleanerCrafter, ScalerCrafter, ModelCrafter, ScorerCrafter, DeployCrafter
# E-commerce customer classification
pipeline = MLFChain(
# Load and process data
DataIngestCrafter("customers.csv"),
CleanerCrafter(
strategy="auto",
str_fill="Unknown",
int_fill=0
),
ScalerCrafter(scaler_type="robust"),
# Train XGBoost model
ModelCrafter(
model_name="xgboost",
model_params={
"n_estimators": 200,
"learning_rate": 0.05,
"max_depth": 6,
"random_state": 42
},
test_size=0.2,
stratify=True
),
# Evaluate performance
ScorerCrafter(),
# Deploy model
DeployCrafter(
model_path="customer_model.joblib",
save_format="joblib"
)
)
# Run complete pipeline
result = pipeline.run()
# Results
print(f"✅ Model trained successfully!")
print(f"📊 Test F1-Score: {result['scores']['f1']:.4f}")
print(f"💾 Model saved to: {result['deployment_path']}")
Best Practices¶
Start Simple¶
Begin with Random Forest using default parameters, then optimize.
Use Proper Scaling¶
Always use ScalerCrafter before ModelCrafter, especially for logistic regression.
Validate Performance¶
Use multiple metrics (accuracy, precision, recall, F1) to evaluate models.
Set Random Seeds¶
Use consistent random_state
values for reproducible results.
Monitor Overfitting¶
Compare train_score
vs test_score
- large gaps indicate overfitting.