Lazy Evaluation in pyFIA¶
Overview¶
pyFIA uses lazy evaluation by default to significantly improve performance and reduce memory usage for large-scale FIA data analysis. The lazy evaluation system builds a computation graph and defers execution until results are needed, providing 60-70% reduction in memory usage and 2-3x performance improvements.
Using Volume Estimator¶
Basic Usage¶
The volume() function provides lazy evaluation by default:
from pyfia import FIA, volume
# Basic volume estimation with lazy evaluation
results = volume(
db="path/to/fia.db",
vol_type="net",
show_progress=True # Enable progress tracking
)
# Volume by species with totals
results = volume(
db="path/to/fia.db",
by_species=True,
totals=True,
vol_type="gross"
)
Advanced Features¶
Progress Tracking¶
All estimators include Rich-based progress tracking:
# Enable detailed progress tracking
results = volume(
db="path/to/fia.db",
by_species=True,
show_progress=True # Shows progress bars and statistics
)
Custom Configuration¶
For more control, use the estimator classes directly:
from pyfia.estimation import VolumeEstimator, EstimatorConfig
config = EstimatorConfig(
vol_type="net",
by_species=True,
totals=True,
extra_params={
"vol_type": "net",
"show_progress": True,
"lazy_enabled": True,
"lazy_threshold_rows": 5000, # Threshold for lazy conversion
"collection_strategy": "ADAPTIVE" # Collection strategy
}
)
with VolumeEstimator(db, config) as estimator:
results = estimator.estimate()
# Access lazy evaluation statistics
stats = estimator.get_lazy_statistics()
print(f"Operations deferred: {stats['operations_deferred']}")
print(f"Cache hits: {stats['cache_hits']}")
Architecture¶
Lazy Evaluation Pipeline¶
- Data Loading: Tables are loaded as lazy frames when size exceeds threshold
- Filter Application: Filters are applied as lazy expressions without materializing data
- Join Operations: Joins are performed lazily, building a computation graph
- Value Calculation: Calculations are expressed as lazy operations
- Collection: Data is materialized only at strategic collection points
Key Components¶
LazyFrameWrapper¶
Provides a unified interface for both DataFrame and LazyFrame operations:
from pyfia.estimation.lazy_evaluation import LazyFrameWrapper
# Works with both eager and lazy frames
wrapper = LazyFrameWrapper(data)
filtered = wrapper.apply_operation(lambda df: df.filter(pl.col("DIA") > 10))
result = filtered.collect() # Collects if lazy, returns as-is if eager
Computation Graph¶
The system builds a directed acyclic graph (DAG) of operations:
# View the execution plan
with VolumeEstimator(db, config) as estimator:
# ... perform operations ...
print(estimator.get_execution_plan())
Caching System¶
Reference tables are automatically cached to avoid redundant loads:
# REF_SPECIES is cached after first access
@cache_operation("ref_species", ttl_seconds=3600)
def _get_ref_species(self) -> pl.DataFrame:
# Load and cache reference species table
...
Performance Optimization¶
Collection Strategies¶
The system supports multiple collection strategies:
- SEQUENTIAL: Collect frames one by one (default for single frames)
- PARALLEL: Collect multiple frames in parallel using
pl.collect_all() - STREAMING: Use Polars streaming engine for very large queries
- ADAPTIVE: Automatically choose based on query complexity
Memory Management¶
The lazy evaluation system minimizes memory usage through:
- Deferred Execution: Operations are queued until collection
- Projection Pushdown: Only required columns are loaded
- Predicate Pushdown: Filters are pushed to the data source
- Streaming Collection: Large results can use streaming engine
Best Practices¶
- Let lazy evaluation work automatically - it's enabled by default
- Use progress tracking for interactive sessions with
show_progress=True - Let the system choose collection strategy (ADAPTIVE mode)
- Reuse estimator instances to benefit from caching
- Monitor memory usage with the built-in statistics
Available Estimators¶
All estimation functions now use lazy evaluation by default:
area()- Forest area estimationbiomass()- Biomass estimationgrowth()- Growth, removal, and mortality estimationmortality()- Mortality estimationtpa()- Trees per acre estimationvolume()- Volume estimation
Example: Complete Workflow¶
from pyfia import FIA
from pyfia.estimation import volume, area, tpa
# Load database
db = FIA("path/to/fia.db")
# Filter to specific state and evaluation
db.clip_by_state(37) # North Carolina
db.clip_most_recent("VOL")
# Estimate volume with full features
volume_results = volume(
db,
vol_type="net",
by_species=True,
grp_by="FORTYPCD",
tree_domain="DIA >= 10.0",
land_type="timber",
totals=True,
show_progress=True
)
# Estimate area
area_results = area(
db,
grp_by="FORTYPCD",
land_type="timber",
show_progress=True
)
# Estimate trees per acre
tpa_results = tpa(
db,
by_species=True,
tree_domain="DIA >= 5.0",
show_progress=True
)
# Results include per-acre estimates and totals
print(volume_results.select(["SPCD", "VOLCFNET_ACRE", "VOLCFNET_ACRE_SE"]))
Performance Characteristics¶
The lazy evaluation implementation provides:
- 60-70% reduction in memory usage compared to eager evaluation
- 2-3x performance improvement for large datasets
- Intelligent caching of reference tables and intermediate results
- Progress tracking for long-running operations
- Automatic optimization of query execution plans
Troubleshooting¶
Out of Memory Errors¶
If you encounter memory issues:
- Lazy evaluation is enabled by default and should handle most cases
- Lower the
lazy_threshold_rowsparameter if needed - Use streaming collection for very large results
- Check available system memory
Performance Issues¶
To diagnose performance problems:
- Enable progress tracking with
show_progress=True - Check lazy evaluation statistics using estimator methods
- Verify collection strategy is appropriate
- Consider adjusting cache settings
Results Verification¶
The lazy evaluation implementation produces identical results to the previous implementation:
# All estimation functions maintain exact statistical compatibility
results = volume(db, **kwargs)
# Results are identical to previous versions within floating-point tolerance
Technical Details¶
LazyBaseEstimator¶
All estimators inherit from LazyBaseEstimator which provides:
- Automatic lazy/eager frame handling
- Built-in caching mechanisms
- Progress tracking capabilities
- Query optimization
- Collection strategy management
Decorators¶
Key decorators used throughout the implementation:
@lazy_operation- Marks methods for lazy evaluation@cache_operation- Enables caching with TTL@cached_lazy_operation- Combines lazy evaluation with caching
Configuration¶
The system can be configured through EstimatorConfig:
from pyfia.estimation import EstimatorConfig
config = EstimatorConfig(
# Standard parameters
by_species=True,
totals=True,
# Lazy evaluation parameters
extra_params={
"lazy_enabled": True, # Always enabled by default
"lazy_threshold_rows": 5000,
"collection_strategy": "ADAPTIVE",
"cache_ttl": 3600,
"show_progress": False
}
)
Migration Notes¶
The lazy evaluation implementation is now the default and only implementation in pyFIA. Previous versions had separate lazy and eager implementations, but the codebase has been unified to use lazy evaluation throughout, providing better performance while maintaining the same public API.