Document store design for InfraTraq — 35 collections across 10 categories covering audit logs, API integration, activity tracking, messaging, notifications, caches, user preferences, AI/ML, system observability, and file attachments.
InfraTraq uses a dual-database architecture: PostgreSQL for transactional/relational data (398 tables) and MongoDB for event logs, caches, flexible documents, and high-write workloads (35 collections). The decision framework is simple: if the data requires foreign key constraints, ACID transactions, or complex JOINs, it goes to PostgreSQL. If the data is append-only, has variable schema, needs TTL expiry, or is write-heavy with eventual consistency tolerance, it goes to MongoDB.
| Status | Description | Allowed Actions | Next States |
|---|---|---|---|
| Active | Collection is in production use | Read/Write enabled | Archived |
| TTL Expiring | Documents approaching TTL boundary | Monitor, extend if needed | Active, Archived |
| Archived | Collection data moved to cold storage | Read-only queries | Deleted |
| Maintenance | Collection under reindex or migration | Admin operations only | Active |
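The lifecycle table above can be encoded as a small transition map. A minimal sketch, assuming the state names from the table (the helper function itself is illustrative, not part of InfraTraq):

```python
# Allowed lifecycle transitions, mirroring the state table above.
ALLOWED_TRANSITIONS = {
    "Active": {"Archived"},
    "TTL Expiring": {"Active", "Archived"},
    "Archived": {"Deleted"},
    "Maintenance": {"Active"},
}

def can_transition(current: str, target: str) -> bool:
    """Return True if a collection may move from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

Guarding admin tooling with a check like this prevents, for example, an archived collection from being silently reopened for writes.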
**Route to PostgreSQL when:**
- Foreign key constraints required
- ACID transactions needed
- Complex JOINs across entities
- Strict schema enforcement
- Financial/accounting data
- Master data with referential integrity

**Route to MongoDB when:**
- Schema varies per record
- Append-only write pattern
- TTL auto-expiry needed
- Nested/array fields per document
- High-write throughput (logs, events)
- Flexible JSON payloads

**Cross-database references:**
- Store the entity_type + entity_id reference in MongoDB
- Look up relational details in PostgreSQL
- Use CDC (Change Data Capture) for sync
- Materialized views for cross-DB reporting

**Anti-patterns:**
- Never store financial transactions in MongoDB
- Never build JOINs across MongoDB collections
- Never use MongoDB for approval chains needing FK integrity
- Never skip TTL for high-volume logs
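The decision framework above can be sketched as a routing function. This is illustrative only (the function name and boolean flags are assumptions, not InfraTraq's actual router):

```python
def route_storage(needs_fk: bool, needs_acid: bool, needs_joins: bool,
                  variable_schema: bool, append_only: bool, needs_ttl: bool) -> str:
    # Relational requirements always win: FK/ACID/JOIN data goes to PostgreSQL.
    if needs_fk or needs_acid or needs_joins:
        return "postgresql"
    # Otherwise, document-store traits route the data to MongoDB.
    if variable_schema or append_only or needs_ttl:
        return "mongodb"
    # Default to PostgreSQL when nothing argues for a document store.
    return "postgresql"
```

Note the ordering: a dataset that is both append-only and FK-constrained still lands in PostgreSQL, which matches the anti-pattern list above.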
**Audit logs**
- entity_changelog — Full before/after snapshots on every RDBMS write — TTL: 2-7 years
- field_audit_log — Field-level change tracking for sensitive fields — TTL: 2-7 years

**API integration**
- api_request_log — Inbound/outbound HTTP request + response payloads — TTL: 30-90 days
- webhook_event_log — Webhook delivery attempts and responses — TTL: 30-90 days
- integration_sync_log — Sync job execution details per external system — TTL: 90 days

**Activity tracking**
- user_activity_event — Clickstream: page views, button clicks, time-on-page — TTL: 30-365 days
- search_query_log — Search terms, filters, results count, click-through — TTL: 90 days
- feature_usage_metric — Feature adoption tracking per user/module — TTL: 365 days

**Messaging**
- chat_message — Real-time chat messages with nested attachments — Permanent
- comment_thread — Polymorphic comments on any entity (entity_type + entity_id) — Permanent
- email_communication — Email send/receive history with parsed body/attachments — Permanent

**Notifications**
- notification_delivery — Per-recipient delivery status: sent/delivered/read/failed — TTL: 90 days
- push_notification_log — Mobile push delivery with device/token details — TTL: 90 days
- sms_delivery_log — SMS gateway response tracking — TTL: 90 days

**Caches**
- report_cache — Rendered report output cached by parameter hash — TTL: 1hr-30 days
- dashboard_snapshot — Pre-aggregated KPI data for dashboard tiles — TTL: 24hr
- analytics_aggregate — Pre-computed rollups (daily/weekly/monthly) — TTL: 30 days
- search_index_cache — Full-text search index for instant search — TTL: 7 days

**User preferences**
- user_preference — Per-user settings (locale, timezone, default project) — Permanent
- dashboard_layout — Widget positions, sizes, data sources per user — Permanent
- user_favorite — Pinned reports, bookmarked entities, recent items — Permanent
- user_alert_config — Custom alert thresholds and notification preferences — Permanent

**AI/ML**
- ml_training_dataset — Feature arrays for model training — TTL: 365 days
- ml_model_registry — Model versions, hyperparameters, metrics — TTL: 365 days
- ml_prediction_log — Input features → output predictions with confidence — TTL: 90 days
- nlp_query_history — Parsed intents, extracted entities from NL queries — TTL: 90 days
- ai_conversation — Multi-turn AI assistant conversation context — TTL: 90 days

**System observability**
- performance_metric — Response times, throughput, error rates per endpoint — TTL: 7-90 days
- background_job_log — Job execution details: params, duration, output, errors — TTL: 30 days
- system_health_event — CPU/memory/disk alerts, service health checks — TTL: 30 days
- error_log — Structured error logs with stack traces and context — TTL: 90 days

**File attachments**
- attachment_metadata — Files attached to any entity_type + entity_id (polymorphic) — Permanent

Design each collection with its document schema, key fields, and expected growth pattern. Use the decision framework: FK needed → RDBMS; variable schema → MongoDB; append-only → MongoDB; TTL needed → MongoDB.
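As one example of the polymorphic entity_type + entity_id pattern, a comment_thread document might look like the sketch below. Field names beyond the polymorphic pair are assumptions for illustration, not a fixed schema:

```python
from datetime import datetime, timezone

# Hypothetical comment_thread document. entity_type/entity_id point back
# into PostgreSQL; everything else is document-local and free to vary.
comment_doc = {
    "entity_type": "work_order",   # polymorphic reference into PostgreSQL
    "entity_id": "8841",
    "author_id": "u_1092",
    "body": "Replaced the faulty sensor; closing out.",
    "mentions": ["u_2207"],        # nested array fields are natural in MongoDB
    "created_at": datetime.now(timezone.utc),
}
```

Because the relational details live in PostgreSQL, rendering a comment requires a follow-up lookup by entity_type + entity_id rather than a cross-database JOIN.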
Create compound indexes on frequently queried field combinations. Use TTL indexes for automatic document expiry. Partial indexes for sparse fields. Text indexes for search_index_cache.
```javascript
// TTL index: expire api_request_log documents 90 days after created_at
db.api_request_log.createIndex(
  { "created_at": 1 },
  { expireAfterSeconds: 7776000 }  // 90 days
)

// Compound index for entity lookup, newest changes first
db.entity_changelog.createIndex(
  { "entity_type": 1, "entity_id": 1, "changed_at": -1 }
)
```
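The partial and text indexes mentioned above could be declared as follows. This is a sketch: the `last_error_code` field and the title/body field names are assumed for illustration.

```javascript
// Partial index: only index webhook deliveries that recorded an error
// (the field is sparse, so a full index would be mostly wasted space)
db.webhook_event_log.createIndex(
  { "last_error_code": 1 },
  { partialFilterExpression: { "last_error_code": { $exists: true } } }
)

// Text index for instant search over cached search documents
db.search_index_cache.createIndex(
  { "title": "text", "body": "text" }
)
```

Note that `partialFilterExpression` supports `$exists: true` but not `$exists: false`, which fits the sparse-field use case here.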
Application events flow through the Decision Router. RDBMS writes trigger CDC events that create entity_changelog documents. API calls log to api_request_log. User actions log to user_activity_event. All writes are fire-and-forget for performance.
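The fire-and-forget contract can be sketched with asyncio. The helper below is illustrative (not InfraTraq's actual dispatcher): the point is that a failed log write is recorded but never propagated to the caller.

```python
import asyncio

def log_in_background(coro) -> asyncio.Task:
    """Schedule a log write without awaiting it; errors are recorded, not raised."""
    task = asyncio.ensure_future(coro)
    # If the write failed, surface it to ops logging instead of the request path.
    task.add_done_callback(
        lambda t: t.exception() and print("log write failed:", t.exception())
    )
    return task
```

In a real deployment the `print` would be a structured logger, and a bounded queue or MongoDB write concern `w=0` would cap how much in-flight logging a burst can accumulate.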
Dashboard queries hit dashboard_snapshot and analytics_aggregate for pre-computed data. Audit queries scan entity_changelog with compound index. Search uses search_index_cache with text indexes. Report generation checks report_cache before computing.
Configure TTL per collection based on data retention policy. Audit logs: 2-7 years (regulatory). API logs: 30-90 days (debugging). Caches: 1hr-30 days (freshness). System metrics: 7-90 days (capacity planning). Archive to S3/GCS before TTL expiry for cold storage.
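One way to schedule the archive step is to compute, per collection, the creation-date cutoff below which documents should already be in cold storage. A sketch, assuming the upper bound of each retention range above and an assumed 7-day archive lead time:

```python
from datetime import datetime, timedelta

# Retention upper bounds from the policy above; the 7-day lead time is an
# assumed safety margin so documents reach S3/GCS before TTL deletes them.
RETENTION = {
    "entity_changelog": timedelta(days=7 * 365),  # audit: up to 7 years
    "api_request_log": timedelta(days=90),
    "report_cache": timedelta(days=30),
    "performance_metric": timedelta(days=90),
}
ARCHIVE_LEAD = timedelta(days=7)

def archive_cutoff(collection: str, now: datetime) -> datetime:
    """Documents created before this instant should already be archived."""
    return now - RETENTION[collection] + ARCHIVE_LEAD
```

An archival job would query `{"created_at": {"$lt": archive_cutoff(...)}}`, export the batch to object storage, and let the TTL index handle deletion.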
Monitor collection sizes, index usage, slow queries, and replication lag. Alert on: collection size > threshold, TTL backlog, write concern failures, index miss rate > 5%, oplog window < 24hr.
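The alert conditions above can be expressed as a single threshold check. The 5% index miss rate and 24-hour oplog window come from the policy; the 50 GB size cap is an assumed example value for the unspecified threshold:

```python
def evaluate_alerts(metrics: dict) -> list[str]:
    """Return the names of any breached alert thresholds."""
    alerts = []
    if metrics.get("collection_size_gb", 0) > 50:       # assumed example cap
        alerts.append("collection_size")
    if metrics.get("ttl_backlog_docs", 0) > 0:          # TTL deletions falling behind
        alerts.append("ttl_backlog")
    if metrics.get("index_miss_rate", 0) > 0.05:        # policy: > 5%
        alerts.append("index_miss_rate")
    if metrics.get("oplog_window_hours", 48) < 24:      # policy: < 24hr
        alerts.append("oplog_window")
    return alerts
```

The inputs would come from `collStats`, `serverStatus`, and replication metrics; this sketch only shows the threshold logic.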
```python
from datetime import datetime

# After any RDBMS write, log the change to MongoDB (fire-and-forget).
# `db` is the async MongoDB handle; `request` is passed in explicitly
# rather than read from a global, so the function is testable.
async def log_entity_change(entity_type, entity_id, operation,
                            old_val, new_val, user_id, request):
    doc = {
        "entity_type": entity_type,
        "entity_id": str(entity_id),
        "operation": operation,  # INSERT, UPDATE, DELETE
        "old_value": old_val,
        "new_value": new_val,
        "changed_by": str(user_id),
        "changed_at": datetime.utcnow(),
        "ip_address": request.client.host,
        "session_id": request.state.session_id,
    }
    await db.entity_changelog.insert_one(doc)
```
```python
import hashlib
import json
from datetime import datetime, timedelta

# Check report_cache before generating an expensive report.
async def get_report(report_type, params):
    # Deterministic cache key: sorted params make equal inputs hash equally.
    cache_key = hashlib.sha256(
        f"{report_type}:{json.dumps(params, sort_keys=True)}".encode()
    ).hexdigest()
    cached = await db.report_cache.find_one({"cache_key": cache_key})
    if cached and cached["expires_at"] > datetime.utcnow():
        return cached["output_data"]
    # Cache miss or stale entry: generate and re-cache the report.
    output = await generate_report(report_type, params)
    await db.report_cache.update_one(
        {"cache_key": cache_key},
        {"$set": {
            "output_data": output,
            "expires_at": datetime.utcnow() + timedelta(hours=1),
        }},
        upsert=True,
    )
    return output
```