# Performance Guide

## Table of Contents

- Overview
- Performance Characteristics
- Benchmark Results
- Optimization Strategies
- Performance Tuning
- Best Practices
- Scalability Guide
- Troubleshooting
- FAQ
## Overview

SocialMapper is engineered for high-performance geospatial analysis with careful attention to optimization at every layer. Our architecture leverages modern Python capabilities including concurrent processing, intelligent caching, and machine learning-based clustering to deliver production-ready performance.

### Key Performance Features

- Unified Caching System (NEW): Automatic caching for Census API, geocoding, and network graphs with configurable TTL
- HTTP Connection Pooling (NEW): Persistent connections reduce overhead by 50-70%
- Batch Processing (NEW): Optimized batching for Census data and geocoding operations
- Memory Optimization Tools (NEW): DataFrame optimization, lazy loading, and memory profiling
- Performance Presets (NEW): Pre-configured settings (fast, balanced, memory-efficient)
- Intelligent Caching: Multi-level caching system reduces API calls by up to 95%
- Concurrent Processing: Parallel execution delivers 4-8x performance improvements
- Smart Clustering: ML-based POI clustering reduces network downloads by 60-80%
- Adaptive Algorithms: Auto-selection of optimal strategies based on workload
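
Most of these features are switched on through the performance presets and unified cache covered later in this guide. As a quick orientation, here is a minimal sketch using the `get_performance_config` and `CacheManager` APIs from the Optimization Strategies section:

```python
from socialmapper.performance import CacheManager, get_performance_config

# 'balanced' is the default trade-off; 'fast' and 'memory_efficient' are the other presets
config = get_performance_config(preset='balanced')

# One cache manager covers Census, geocoding, and network-graph caching
cache = CacheManager(config)
```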
## Performance Characteristics

### Typical Operation Times

| Operation | Small (1-10 POIs) | Medium (10-100 POIs) | Large (100-1000 POIs) |
|---|---|---|---|
| Isochrone Generation | 5-15 seconds | 30-120 seconds | 5-15 minutes |
| Census Data Retrieval | 1-3 seconds | 5-10 seconds | 30-60 seconds |
| POI Search (OSM) | 2-5 seconds | 5-15 seconds | 20-40 seconds |
| Geocoding | <1 second | 2-5 seconds | 10-30 seconds |
| Export (GeoJSON) | <1 second | 1-3 seconds | 5-10 seconds |
| Export (GeoParquet) | <0.5 seconds | <2 seconds | 3-7 seconds |
*Note: Times measured on standard hardware (4-core CPU, 16 GB RAM) with good network connectivity.*
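
To reproduce these numbers on your own hardware, a simple wall-clock measurement around a single call is enough. This sketch assumes the `api.create_isochrone` call used in the examples later in this guide:

```python
import time

start = time.perf_counter()
iso = api.create_isochrone("Portland, OR", travel_time=15, travel_mode="drive")
elapsed = time.perf_counter() - start
print(f"Isochrone generated in {elapsed:.1f} s")  # Compare against the table above
```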
### Memory Usage Patterns

| Dataset Size | Base Memory | Peak Memory | With Caching |
|---|---|---|---|
| 10 POIs | ~150 MB | ~300 MB | ~400 MB |
| 100 POIs | ~200 MB | ~800 MB | ~1.2 GB |
| 1000 POIs | ~300 MB | ~2.5 GB | ~3.5 GB |
### Network Dependencies

SocialMapper's performance depends on several external APIs:

| API Service | Typical Latency | Rate Limits | Cache Benefit |
|---|---|---|---|
| OSM Overpass | 200-500ms | None (be respectful) | 80-95% hit rate |
| Census API | 100-300ms | 500 req/hour | 90-98% hit rate |
| OpenRouteService | 150-400ms | 2500 req/day (free) | 70-85% hit rate |
| OSRM | 50-200ms | None (self-hosted) | 85-95% hit rate |
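
These hit rates translate directly into API-call budgets. With a 95% hit rate, for example, 1,000 Census lookups produce only about 50 live requests, comfortably inside the 500 requests/hour limit. A back-of-the-envelope check:

```python
lookups = 1000    # Census lookups in your analysis
hit_rate = 0.95   # Warm-cache hit rate from the table above
live_calls = lookups * (1 - hit_rate)
print(f"Live API calls: {live_calls:.0f} of {lookups}")  # ~50, under the 500/hour limit
```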
## Benchmark Results

### Core Operation Benchmarks

#### Isochrone Generation (15-minute travel time, driving mode)

```text
# Single POI performance
Location: Urban (Portland, OR)
  Without cache:   12.3 seconds
  With warm cache:  2.1 seconds (83% improvement)

Location: Rural (Eastern Oregon)
  Without cache:   18.7 seconds
  With warm cache:  3.4 seconds (82% improvement)
```
#### Batch Processing Performance

```text
# 50 POIs in Portland Metro Area
Sequential processing:     425 seconds
Concurrent (4 workers):    112 seconds (3.8x faster)
Concurrent + clustering:    67 seconds (6.3x faster)

# 100 POIs across Oregon
Sequential processing:   1,240 seconds
Concurrent (8 workers):    198 seconds (6.3x faster)
Concurrent + clustering:   142 seconds (8.7x faster)
```
### Comparison with Alternatives

| Feature | SocialMapper | Alternative A | Alternative B |
|---|---|---|---|
| 50 POI Isochrones | 67 sec | 320 sec | 450 sec |
| Memory Usage | 450 MB | 1.2 GB | 890 MB |
| Cache Hit Rate | 85-95% | 40-60% | No caching |
| Concurrent Support | Yes (8x speedup) | Limited (2x) | No |
| Clustering Optimization | ML-based | Simple radius | None |
*Benchmarks performed on identical hardware with the same dataset.*
### Real-World Performance Examples

```text
# Healthcare accessibility analysis for Portland
POIs: 147 hospitals and clinics
Travel time: 30 minutes, driving
Processing time: 3 min 24 sec
Cache savings: 89% fewer API calls

# Food desert mapping for Oregon
POIs: 523 grocery stores
Travel time: 15 minutes, walking
Processing time: 18 min 12 sec
Memory peak: 1.8 GB

# Transit coverage analysis
POIs: 89 transit stops
Travel time: 45 minutes, multimodal
Processing time: 5 min 48 sec
Network cache hits: 92%
```
## Optimization Strategies

### 1. Enable and Configure Caching

SocialMapper now includes a unified caching system covering the Census API, geocoding, and network graphs:
```python
from socialmapper.performance import CacheManager, get_performance_config

# Use a performance preset: 'fast', 'balanced', or 'memory_efficient'
config = get_performance_config(preset='fast')
cache = CacheManager(config)

# Check cache statistics
stats = cache.get_stats()
print(f"Census cache: {stats['census']['count']} items, {stats['census']['size_mb']:.2f} MB")
print(f"Geocoding cache: {stats['geocoding']['count']} items, {stats['geocoding']['size_mb']:.2f} MB")

# Cache Census data with a custom TTL (one week)
cache.set_census("geoid_key", {"B01003_001E": 2543}, ttl_hours=168)

# Use the decorator for automatic caching
@cache.cache_census_data(ttl_hours=24)
def fetch_demographics(location):
    # Expensive API call; results will be cached
    return get_census_data(location, ["population", "median_income"])
```
### 2. Use Batch Processing

SocialMapper now includes optimized batch fetchers for Census data and geocoding:
```python
from socialmapper.performance import BatchCensusDataFetcher, BatchGeocodingFetcher

# Batch-fetch Census data with automatic caching
census_fetcher = BatchCensusDataFetcher()
geoids = ["060370001001", "060370001002", "060370001003"]
variables = ["B01003_001E", "B19013_001E"]
census_data = census_fetcher.fetch_batch(geoids, variables, year=2023)

# Batch-geocode addresses with caching
geo_fetcher = BatchGeocodingFetcher()
addresses = ["123 Main St, Seattle, WA", "456 Oak Ave, Portland, OR"]
geocoded = geo_fetcher.geocode_batch(addresses)

# GOOD: Process multiple locations together
locations = ["Portland, OR", "Eugene, OR", "Salem, OR"]
results = api.analyze_locations_batch(
    locations=locations,
    travel_time=20,
    use_concurrent=True,  # Auto-enabled for 3+ locations
    max_workers=4
)

# AVOID: Processing locations individually in a loop
results = []
for location in locations:  # Inefficient: one network round-trip per location
    result = api.create_isochrone(location, travel_time=20)
    results.append(result)
```
### 3. Leverage Intelligent Clustering

```python
# Clustering automatically groups nearby POIs
pois = api.search_pois(
    location="Oregon",
    query="hospital",
    radius=50000  # 50 km
)

# Auto-clustering kicks in for 5+ POIs
isochrones = api.create_isochrones_batch(
    pois=pois,
    travel_time=30,
    use_clustering=None,       # Auto-decides based on POI distribution
    max_cluster_radius_km=15   # Tune based on density
)
```
### 4. Choose Appropriate Travel Modes

```python
# Walking: fastest processing (smaller networks)
walking_iso = api.create_isochrone(
    location="Portland, OR",
    travel_time=15,
    travel_mode="walk"  # ~2-3 seconds
)

# Driving: moderate processing (larger networks)
driving_iso = api.create_isochrone(
    location="Portland, OR",
    travel_time=15,
    travel_mode="drive"  # ~5-8 seconds
)

# For analysis, consider walking for urban areas
# (THRESHOLD is a density cutoff you define for your study area)
if urban_density > THRESHOLD:
    mode = "walk"  # Faster and often more relevant
else:
    mode = "drive"
```
### 5. Use HTTP Connection Pooling

SocialMapper now includes connection pooling to reduce request overhead:

```python
from socialmapper.performance import (
    get_http_session,
    get_performance_config,
    init_connection_pool,
)

# Get a session with connection pooling (automatically configured)
session = get_http_session()

# Make requests over persistent connections
response = session.get('https://api.census.gov/data/2023/acs/acs5')
data = response.json()

# Initialize with a custom configuration
config = get_performance_config(
    preset='fast',
    http_pool_connections=20,
    http_pool_maxsize=20,
    http_timeout_seconds=30
)
pool = init_connection_pool(config)
```
### 6. Memory Optimization

Optimize memory usage for large datasets:

```python
from socialmapper.performance import (
    optimize_dataframe_memory,
    memory_efficient_iterator,
    MemoryMonitor,
    get_memory_stats
)
import pandas as pd

# Optimize DataFrame memory usage (typically a 50-80% reduction)
df = pd.DataFrame({'geoid': ['060370001001'] * 10000, 'population': [2543.0] * 10000})
df_optimized = optimize_dataframe_memory(df)

# Process large lists in chunks
large_geoid_list = [...]  # thousands of GEOIDs
for chunk in memory_efficient_iterator(large_geoid_list, chunk_size=100):
    results = fetch_census_data(chunk, variables)

# Monitor memory usage
with MemoryMonitor("processing isochrones") as monitor:
    isochrones = [create_isochrone(loc, 15) for loc in locations]
print(f"Memory used: {monitor.memory_delta_mb:.2f} MB")

# Get current memory statistics
stats = get_memory_stats()
print(f"Process memory: {stats['used_mb']:.1f} MB")
print(f"Available: {stats['available_mb']:.1f} MB")
```
### 7. Optimize Data Formats

```python
# Use GeoParquet for better performance
results = api.export_results(
    data=isochrones,
    format="geoparquet",   # 3-5x faster than GeoJSON
    compression="snappy"   # Balance of speed and size
)

# Enable Arrow for GeoPandas I/O
import os
os.environ["PYOGRIO_USE_ARROW"] = "1"
```
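
To consume the export, GeoParquet loads directly into GeoPandas without the JSON parsing overhead. A short sketch; the file name is illustrative and depends on where `export_results` wrote its output:

```python
import geopandas as gpd

# Reading GeoParquet is typically several times faster than parsing GeoJSON
gdf = gpd.read_parquet("isochrones.parquet")  # hypothetical output path
print(gdf.crs, len(gdf))
```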
## Performance Tuning

### Configuration Options

```python
from socialmapper import config

# Fast mode: maximum performance, higher memory usage
config.set_performance_mode("fast")
# - Aggressive caching
# - Maximum concurrent workers
# - Larger memory buffers
# - Suitable for: servers, workstations

# Balanced mode: the default, good for most use cases
config.set_performance_mode("balanced")
# - Standard caching
# - Adaptive concurrency
# - Moderate memory usage
# - Suitable for: most applications

# Memory-efficient mode: minimum memory footprint
config.set_performance_mode("memory-efficient")
# - Minimal caching
# - Limited concurrency
# - Stream processing
# - Suitable for: containers, limited resources
```
### Advanced Tuning Parameters

```python
from socialmapper.isochrone import create_isochrones_from_poi_list

# Fine-tune concurrent processing
results = create_isochrones_from_poi_list(
    poi_data=pois,
    travel_time_limit=30,
    # Concurrency settings
    use_concurrent=True,
    max_network_workers=8,     # Network downloads (I/O bound)
    max_isochrone_workers=4,   # Isochrone calculation (CPU bound)
    # Clustering settings
    use_clustering=True,
    max_cluster_radius_km=20,  # Larger for rural areas
    min_cluster_size=3,        # Minimum POIs to cluster
    # Memory optimization
    simplify_tolerance=0.001,  # Reduce geometry complexity
    use_parquet=True           # Efficient serialization
)
```
### Cache Management

```python
from socialmapper.cache_manager import CacheManager

manager = CacheManager()

# Monitor cache performance
stats = manager.get_cache_statistics()
if stats['network']['hit_rate'] < 0.7:
    print("Low cache hit rate - consider warming the cache")

# Clear specific cache types
manager.clear_cache(cache_type='network')  # Just the network cache
manager.clear_cache(cache_type='census')   # Just the census cache
manager.clear_cache(cache_type='all')      # Everything

# Set cache size limits
manager.set_cache_limit(max_size_gb=5)

# Persist the cache across sessions
manager.save_cache_to_disk("cache_backup.db")
manager.load_cache_from_disk("cache_backup.db")
```
## Best Practices

### Optimal Batch Sizes

```python
# Recommended batch sizes by operation type
# (num_pois and num_locations come from your own dataset)

# Isochrone generation
if num_pois < 10:
    batch_size = num_pois  # Process all at once
elif num_pois < 100:
    batch_size = 20        # Balance memory and speed
else:
    batch_size = 50        # Prevent memory issues

# Census data retrieval
if num_locations < 50:
    batch_size = num_locations  # Single batch
else:
    batch_size = 100            # API rate-limit friendly

# POI search
batch_size = 25  # Optimal for the Overpass API
```
### Cache Warming Strategies

```python
# Pre-warm the cache for a known analysis area
def warm_cache_for_city(city_name: str):
    """Pre-load network data for faster analysis."""
    # Get city bounds
    bounds = api.get_city_bounds(city_name)
    # Download networks for common travel times and modes
    for travel_time in [15, 30, 45]:
        for mode in ['drive', 'walk']:
            api.download_network(
                bbox=bounds,
                travel_time=travel_time,
                travel_mode=mode
            )

# Run before analysis
warm_cache_for_city("Portland, OR")
```
### Error Handling for Production

```python
import logging
import time

from tenacity import retry, stop_after_attempt, wait_exponential

# RateLimitError and NetworkError stand in for the exceptions your API client raises
logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def robust_isochrone_generation(location, **kwargs):
    """Production-ready isochrone generation with retries."""
    try:
        return api.create_isochrone(location, **kwargs)
    except RateLimitError:
        time.sleep(60)  # Wait for the rate limit to reset
        raise
    except NetworkError as e:
        logger.warning(f"Network error: {e}, retrying...")
        raise
```
### Monitoring and Profiling

```python
import cProfile
import pstats

from memory_profiler import profile
from socialmapper.monitoring import ResourceMonitor

# CPU profiling
def profile_performance():
    profiler = cProfile.Profile()
    profiler.enable()
    # Your analysis code
    results = api.analyze_accessibility(...)
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(10)  # Top 10 time consumers

# Memory profiling (line-by-line via memory_profiler)
@profile
def memory_intensive_operation():
    large_dataset = api.process_state_data("California")
    return large_dataset

# Resource monitoring
monitor = ResourceMonitor()
monitor.start()
# Run analysis
results = api.complex_analysis()
stats = monitor.stop()
print(f"Peak memory: {stats['peak_memory_mb']:.1f} MB")
print(f"CPU time: {stats['cpu_seconds']:.1f} seconds")
```
## Scalability Guide

### Small Scale (1-10 locations)

Typical use case: individual location analysis, small city study

```python
# Optimal settings for small scale
config = {
    'use_concurrent': False,    # Overhead not worth it
    'use_clustering': False,    # Too few points
    'cache_enabled': True,
    'simplify_tolerance': None  # Keep full detail
}

# Expected performance
# Time: < 1 minute
# Memory: < 500 MB
```
### Medium Scale (10-100 locations)

Typical use case: city-wide analysis, regional studies

```python
# Optimal settings for medium scale
config = {
    'use_concurrent': True,
    'max_workers': 4,
    'use_clustering': True,
    'max_cluster_radius_km': 10,
    'cache_enabled': True,
    'simplify_tolerance': 0.0001  # Slight simplification
}

# Expected performance
# Time: 2-10 minutes
# Memory: 500 MB - 2 GB
```
### Large Scale (100-1000 locations)

Typical use case: state-wide analysis, multi-city comparisons

```python
# Optimal settings for large scale
config = {
    'use_concurrent': True,
    'max_network_workers': 8,
    'max_isochrone_workers': 4,
    'use_clustering': True,
    'max_cluster_radius_km': 15,
    'min_cluster_size': 5,
    'cache_enabled': True,
    'simplify_tolerance': 0.001,  # Aggressive simplification
    'use_parquet': True,
    'batch_size': 50
}

# Consider chunking with periodic checkpoints
def process_large_dataset(locations, chunk_size=100):
    results = []
    for i in range(0, len(locations), chunk_size):
        chunk = locations[i:i + chunk_size]
        chunk_results = api.process_batch(chunk, **config)
        results.extend(chunk_results)
        # Save intermediate results every 500 locations
        if i % 500 == 0:
            save_checkpoint(results)
    return results

# Expected performance
# Time: 15-60 minutes
# Memory: 2-5 GB
```
### Enterprise Scale (1000+ locations)

Typical use case: national analysis, massive datasets

```python
# Distributed processing with Dask
import dask.dataframe as dd
from dask.distributed import Client, progress

def enterprise_scale_processing():
    # Set up a Dask client
    client = Client(n_workers=4, threads_per_worker=2)

    # Partition the dataset
    df = dd.from_pandas(locations_df, npartitions=16)

    # Process partitions in parallel
    results = df.map_partitions(
        lambda partition: process_partition(partition),
        meta=('result', 'object')
    )

    # Compute with a progress display (dask.distributed.progress works with the Client)
    results = results.persist()
    progress(results)
    final_results = results.compute()
    return final_results

# Alternative: use cloud services
from socialmapper.cloud import CloudProcessor

processor = CloudProcessor(
    provider='aws',
    instance_type='c5.4xlarge',
    max_instances=10
)
results = processor.process_distributed(
    locations=massive_dataset,
    parallel_jobs=50
)
```
## Troubleshooting

### Slow Operations

#### Symptom: Isochrone generation taking >30 seconds per location

Common causes and solutions (a diagnostic sketch follows the list):

- Cold cache: the first run must download network data; warm the cache for your study area (see Cache Warming Strategies).
- Large travel times: 45-60 minute isochrones require much larger networks; reduce the travel time or enable geometry simplification.
- Poor network connectivity: downloads from OSM and the Census API dominate runtime; check latency and add retries.
- Inefficient travel mode: driving networks are far larger than walking networks, especially in rural areas; prefer walking for urban analyses.
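
A quick way to check whether a cold cache is the culprit is to inspect hit rates after a run, using the `CacheManager` statistics shown in Cache Management (the 0.7 threshold is an illustrative cutoff):

```python
from socialmapper.cache_manager import CacheManager

manager = CacheManager()
stats = manager.get_cache_statistics()

# A low network-cache hit rate usually means you are paying for cold downloads
hit_rate = stats['network']['hit_rate']
if hit_rate < 0.7:  # illustrative threshold
    print(f"Network cache hit rate is {hit_rate:.0%}; consider pre-warming the cache")
```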
### High Memory Usage

#### Symptom: Memory usage exceeding 2 GB for <100 locations

Solutions (see the sketch after this list):

- Enable geometry simplification (e.g. `simplify_tolerance=0.001`).
- Process in smaller batches.
- Clear intermediate results between batches.
- Use memory-efficient formats (GeoParquet instead of GeoJSON).
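
A minimal sketch combining these fixes, assuming the memory helpers from the Memory Optimization section, a `locations` list of your own, and that `create_isochrones_batch` accepts the same `simplify_tolerance` parameter as `create_isochrones_from_poi_list`:

```python
from socialmapper.performance import MemoryMonitor, memory_efficient_iterator

results = []
with MemoryMonitor("batched isochrones") as monitor:
    for batch in memory_efficient_iterator(locations, chunk_size=25):
        # Simplified geometries and small batches keep peak memory down
        batch_results = api.create_isochrones_batch(
            pois=batch, travel_time=15, simplify_tolerance=0.001
        )
        results.extend(batch_results)
        del batch_results  # Drop intermediate references promptly
print(f"Memory delta: {monitor.memory_delta_mb:.1f} MB")
```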
### Rate Limit Errors

#### Symptom: "429 Too Many Requests" errors

Solutions (a rate-limiter sketch follows the list):

- Implement client-side rate limiting (the Census API allows roughly 500 requests/hour).
- Use caching aggressively so repeat lookups never reach the API.
- Batch requests so each API call covers many records.
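
One simple client-side approach is to space calls so you stay under the hourly budget. This sketch assumes a `fetch` callable of your own and uses the 500 requests/hour Census limit from the Network Dependencies table:

```python
import time

MIN_INTERVAL = 3600 / 500  # 500 requests/hour -> one call every 7.2 s

_last_call = 0.0

def rate_limited(fetch, *args, **kwargs):
    """Sleep just long enough to stay under the hourly request budget."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    return fetch(*args, **kwargs)
```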
### Network Timeouts

#### Symptom: "Network timeout" or "Connection refused" errors

Solutions (a retry sketch follows the list):

- Increase timeout values (e.g. `http_timeout_seconds` in the performance config).
- Implement retry logic with exponential backoff.
- Use fallback services (e.g. a self-hosted OSRM instance when OpenRouteService is unavailable).
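
A sketch combining a longer per-request timeout with backoff, using the pooled session and `tenacity` already shown above (the URL is the Census endpoint from the connection-pooling example):

```python
from tenacity import retry, stop_after_attempt, wait_exponential
from socialmapper.performance import get_http_session

session = get_http_session()

@retry(stop=stop_after_attempt(4), wait=wait_exponential(min=1, max=30))
def fetch_with_timeout(url, **params):
    """Generous per-request timeout plus exponential backoff for transient outages."""
    response = session.get(url, params=params, timeout=60)  # seconds
    response.raise_for_status()
    return response.json()

data = fetch_with_timeout('https://api.census.gov/data/2023/acs/acs5')
```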
## FAQ

### Q: Why is my first analysis always slow?

A: SocialMapper uses extensive caching. The first run downloads and caches network data, making subsequent runs 5-10x faster. This is normal and expected behavior.
### Q: How can I make SocialMapper 10x faster?

A: Combine these optimizations (a consolidated sketch follows):

1. Enable concurrent processing (3-4x speedup)
2. Use intelligent clustering (2-3x speedup)
3. Warm the cache (2-5x speedup)
4. Use appropriate batch sizes
5. Choose optimal travel modes
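
Putting these together, under the same assumptions as the earlier examples (the `api` object, `warm_cache_for_city` from Best Practices, and the batch parameters from Optimization Strategies):

```python
# 1. Warm the cache once for the study area (2-5x on later runs)
warm_cache_for_city("Portland, OR")

# 2. Batch with clustering; concurrency is auto-enabled for batches
isochrones = api.create_isochrones_batch(
    pois=pois,
    travel_time=15,            # Shorter times and walking mode process fastest
    use_clustering=True,       # Nearby POIs share network downloads (2-3x)
    max_cluster_radius_km=15,
)
```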
### Q: What's the maximum number of POIs I can process?

A: Theoretically unlimited, but practically:

- Single machine: 1,000-5,000 POIs comfortably
- With chunking: 10,000+ POIs
- Distributed: 100,000+ POIs
### Q: How much disk space does the cache require?

A: Cache size depends on the coverage area:

- City: 100-500 MB
- State: 1-5 GB
- Country: 10-50 GB

You can limit cache size with `manager.set_cache_limit()`.
### Q: Which travel mode is fastest to process?

A: Processing speed by mode:

1. Walking (fastest): smallest networks, quick processing
2. Biking: moderate network size
3. Driving (slowest): largest networks, especially in rural areas
### Q: Can I use SocialMapper in production?

A: Yes. SocialMapper is production-ready with:

- Comprehensive error handling
- Retry mechanisms
- Resource monitoring
- Cache persistence
- Concurrent processing
- Memory management
### Q: How does SocialMapper compare to commercial alternatives?

A: SocialMapper offers:

- Cost: free and open source vs. $$$$ per month
- Performance: comparable or better with proper configuration
- Flexibility: full control and customization
- Data ownership: your data stays with you
- Limitations: bound by free-tier API limits
### Q: What hardware do I need for good performance?

A: Recommended specifications:

- Minimum: 2 cores, 4 GB RAM, 10 GB disk
- Recommended: 4 cores, 16 GB RAM, 50 GB SSD
- Optimal: 8+ cores, 32 GB RAM, 100 GB NVMe SSD
### Q: How can I contribute performance improvements?

A: We welcome contributions! See our Benchmarks README for:

- How to run benchmarks
- How to add new benchmarks
- Performance testing guidelines
- Optimization opportunities
*Last updated: November 2024. Performance metrics based on SocialMapper v0.9.0.*