Database Infrastructure

This document outlines the database infrastructure for the AI Agent Orchestration Platform.

Overview

The platform uses PostgreSQL as its primary database for storing application data, with additional specialized databases for specific use cases. This document covers database setup, configuration, management, backup strategies, and scaling approaches.

Database Architecture

Primary Database: PostgreSQL

PostgreSQL is the main relational database used for:

  • User accounts and authentication
  • Workflow definitions and metadata
  • Agent registry and configuration
  • Execution history and results
  • Platform configuration

Database Architecture Diagram

Note: This is a placeholder for a database architecture diagram. The actual diagram should be created and added to the project.

Additional Databases

The platform also uses these specialized databases:

  • Vector Database (e.g., Milvus, Pinecone): For storing and querying embeddings
  • Time Series Database (e.g., TimescaleDB): For monitoring metrics
  • Graph Database (e.g., Neo4j): For relationship mapping (optional)
  • Document Database (e.g., MongoDB): For unstructured data (optional)

PostgreSQL Configuration

Basic Configuration

The PostgreSQL database uses the following baseline configuration, sized for a server with roughly 4 GB of RAM; adjust the memory settings for your hardware:

# postgresql.conf

# CONNECTIONS
max_connections = 100
superuser_reserved_connections = 3

# MEMORY
shared_buffers = 1GB
work_mem = 32MB
maintenance_work_mem = 256MB
effective_cache_size = 3GB

# WRITE AHEAD LOG
wal_level = replica
max_wal_size = 1GB
min_wal_size = 80MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB

# QUERY OPTIMIZATION
random_page_cost = 1.1
effective_io_concurrency = 200
default_statistics_target = 100

# AUTOVACUUM
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50

# LOGGING
log_destination = 'csvlog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 100MB
log_min_duration_statement = 1000
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0

High Availability Configuration

For production environments, PostgreSQL is configured for high availability:

  • Primary-Replica Setup: Synchronous replication to standby servers
  • Automatic Failover: Using tools like Patroni or Stolon
  • Connection Pooling: Using PgBouncer for efficient connection management (an application-side sketch follows the Patroni example below)
  • Load Balancing: Using HAProxy or similar for read-write splitting

Example Patroni configuration:

scope: meta-agent-cluster
namespace: /service/meta-agent/
name: postgresql0

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgresql0:8008

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 100
        shared_buffers: 1GB
        work_mem: 32MB
        maintenance_work_mem: 256MB
        effective_cache_size: 3GB
        wal_level: replica
        max_wal_size: 1GB
        min_wal_size: 80MB
        checkpoint_completion_target: 0.9
        wal_buffers: 16MB
        random_page_cost: 1.1
        effective_io_concurrency: 200
        default_statistics_target: 100
        autovacuum: on
        autovacuum_max_workers: 3
        autovacuum_naptime: 1min
        autovacuum_vacuum_threshold: 50
        autovacuum_analyze_threshold: 50

  initdb:
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - host replication replicator 127.0.0.1/32 md5
  - host replication replicator 10.0.0.0/8 md5
  - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgresql0:5432
  data_dir: /data/postgresql0
  bin_dir: /usr/lib/postgresql/15/bin
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: replpass
    superuser:
      username: postgres
      password: secretpassword
  parameters:
    unix_socket_directories: '/var/run/postgresql'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
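
PgBouncer typically fronts the cluster for connection pooling, but pooling can also be handled inside the application. A minimal application-side sketch using psycopg2's built-in pool (host and credentials are illustrative; in production the client would point at the PgBouncer or HAProxy endpoint):

import psycopg2.pool

# Small application-side pool; PgBouncer serves the same purpose externally
# and is preferred when many short-lived clients connect.
db_pool = psycopg2.pool.SimpleConnectionPool(
    minconn=2,
    maxconn=10,
    host="localhost",          # or the PgBouncer / HAProxy endpoint
    dbname="meta_agent",
    user="meta_agent_app",
    password="secure_password",
)

conn = db_pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    db_pool.putconn(conn)      # return the connection instead of closing it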

Database Schema Management

Migration Strategy

The platform uses Alembic for database migrations:

  • Version Control: Track schema changes
  • Forward Migration: Apply schema changes
  • Backward Migration: Revert schema changes
  • CI/CD Integration: Automated migration in deployment pipeline

Example Alembic migration script:

"""Add agent execution metrics table

Revision ID: a1b2c3d4e5f6
Revises: 9z8y7x6w5v4
Create Date: 2025-04-18 14:30:45.123456

"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import JSONB, UUID

# revision identifiers, used by Alembic.
revision = 'a1b2c3d4e5f6'
down_revision = '9z8y7x6w5v4'
branch_labels = None
depends_on = None

def upgrade():
    op.create_table(
        'agent_execution_metrics',
        sa.Column('id', UUID, primary_key=True),
        sa.Column('agent_id', UUID, sa.ForeignKey('agents.id'), nullable=False),
        sa.Column('workflow_execution_id', UUID, sa.ForeignKey('workflow_executions.id'), nullable=False),
        sa.Column('start_time', sa.DateTime(timezone=True), nullable=False),
        sa.Column('end_time', sa.DateTime(timezone=True), nullable=True),
        sa.Column('status', sa.String(50), nullable=False),
        sa.Column('cpu_usage', sa.Float, nullable=True),
        sa.Column('memory_usage', sa.Float, nullable=True),
        sa.Column('input_size', sa.Integer, nullable=True),
        sa.Column('output_size', sa.Integer, nullable=True),
        sa.Column('error', sa.Text, nullable=True),
        sa.Column('metadata', JSONB, nullable=True),
        sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.func.now(), nullable=False),
        sa.Column('updated_at', sa.DateTime(timezone=True), server_default=sa.func.now(), onupdate=sa.func.now(), nullable=False)
    )
    op.create_index('idx_agent_execution_metrics_agent_id', 'agent_execution_metrics', ['agent_id'])
    op.create_index('idx_agent_execution_metrics_workflow_execution_id', 'agent_execution_metrics', ['workflow_execution_id'])
    op.create_index('idx_agent_execution_metrics_start_time', 'agent_execution_metrics', ['start_time'])
    op.create_index('idx_agent_execution_metrics_status', 'agent_execution_metrics', ['status'])

def downgrade():
    op.drop_table('agent_execution_metrics')
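
Migrations are normally applied with the Alembic CLI (for example, alembic upgrade head, wrapped by migrate_db.sh or the deployment pipeline). They can also be driven programmatically; a minimal sketch, assuming an alembic.ini at the project root:

from alembic import command
from alembic.config import Config

# Assumes alembic.ini points at the target database URL.
cfg = Config("alembic.ini")

command.upgrade(cfg, "head")     # apply all pending migrations
# command.downgrade(cfg, "-1")   # revert the most recent migration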

Schema Documentation

The database schema is documented using tools like SchemaSpy or pgModeler:

  • Entity-Relationship Diagrams: Visual representation of schema
  • Table Definitions: Detailed table and column descriptions
  • Relationship Documentation: Foreign key relationships
  • Constraint Documentation: Primary keys, unique constraints, etc.

Backup and Recovery

Backup Strategy

The platform implements a comprehensive backup strategy:

  • Full Backups: Daily full database backups
  • WAL Archiving: Continuous transaction log archiving
  • Point-in-Time Recovery: Ability to restore to any point in time
  • Offsite Storage: Backups stored in multiple locations
  • Backup Verification: Regular testing of backup restoration

Example backup script:

#!/bin/bash
# backup_db.sh - Create a full backup of the PostgreSQL database
#
# Expects DB_HOST, DB_USER, and DB_NAME in the environment; supply the
# password non-interactively via PGPASSWORD or ~/.pgpass. WAL archiving
# (archive_command) is configured separately and complements these full backups.

BACKUP_DIR="/backups/postgresql"
BACKUP_FILE="meta_agent_$(date +%Y%m%d_%H%M%S).backup"
RETENTION_DAYS=7

# Create backup directory if it doesn't exist
mkdir -p "$BACKUP_DIR"

# Create full backup in the custom format (-F c), which pg_restore can read
pg_dump -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -F c -f "$BACKUP_DIR/$BACKUP_FILE"

# Compress backup
gzip "$BACKUP_DIR/$BACKUP_FILE"

# Copy to offsite storage
aws s3 cp "$BACKUP_DIR/$BACKUP_FILE.gz" s3://meta-agent-backups/postgresql/

# Remove old backups
find "$BACKUP_DIR" -name "meta_agent_*.backup.gz" -type f -mtime +"$RETENTION_DAYS" -delete

echo "Backup completed: $BACKUP_FILE.gz"

Recovery Procedures

Recovery procedures are documented for the following scenarios:

  • Full Database Restore: Restore from full backup
  • Point-in-Time Recovery: Restore to specific point in time
  • Single Table Restore: Restore specific table
  • Disaster Recovery: Restore in new environment

Example recovery script:

#!/bin/bash
# restore_db.sh - Restore PostgreSQL database from backup
#
# Expects DB_HOST, DB_USER, and DB_NAME in the environment; supply the
# password non-interactively via PGPASSWORD or ~/.pgpass.

BACKUP_FILE=$1

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: ./restore_db.sh [backup_file]"
  echo "Example: ./restore_db.sh meta_agent_20250418_143045.backup.gz"
  exit 1
fi

BACKUP_DIR="/backups/postgresql"

# Check if backup file exists locally; otherwise fetch it from offsite storage
if [ ! -f "$BACKUP_DIR/$BACKUP_FILE" ]; then
  aws s3 cp "s3://meta-agent-backups/postgresql/$BACKUP_FILE" "$BACKUP_DIR/"

  if [ ! -f "$BACKUP_DIR/$BACKUP_FILE" ]; then
    echo "Error: Backup file not found"
    exit 1
  fi
fi

# Uncompress if needed
if [[ $BACKUP_FILE == *.gz ]]; then
  gunzip -c "$BACKUP_DIR/$BACKUP_FILE" > "$BACKUP_DIR/${BACKUP_FILE%.gz}"
  BACKUP_FILE=${BACKUP_FILE%.gz}
fi

# Stop application services so no connections hold the database open
docker-compose stop backend frontend

# Drop and recreate the database (connect to the maintenance DB, not the target)
psql -h "$DB_HOST" -U "$DB_USER" -d postgres -c "DROP DATABASE IF EXISTS $DB_NAME;"
psql -h "$DB_HOST" -U "$DB_USER" -d postgres -c "CREATE DATABASE $DB_NAME;"

# Restore from backup
pg_restore -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -v "$BACKUP_DIR/$BACKUP_FILE"

# Start application services
docker-compose start backend frontend

echo "Database restored from $BACKUP_FILE"

Database Scaling

Vertical Scaling

Strategies for vertical scaling:

  • Resource Allocation: Increase CPU, memory, and storage
  • Storage Optimization: Use faster storage (SSD, NVMe)
  • Configuration Tuning: Optimize PostgreSQL parameters for larger hardware

Horizontal Scaling

Strategies for horizontal scaling:

  • Read Replicas: Distribute read queries across replicas (see the sketch after this list)
  • Sharding: Partition data across multiple database instances
  • Connection Pooling: Efficiently manage database connections
  • Query Optimization: Improve query performance
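
A minimal sketch of application-level read-write splitting with SQLAlchemy (host names and credentials are illustrative; HAProxy or a similar load balancer can provide the same routing at the network layer):

from itertools import cycle

from sqlalchemy import create_engine, text

# One primary for writes plus a rotating set of read replicas.
primary = create_engine(
    "postgresql://meta_agent_app:secure_password@db-primary:5432/meta_agent")
replicas = cycle([
    create_engine(
        "postgresql://meta_agent_app:secure_password@db-replica-1:5432/meta_agent"),
    create_engine(
        "postgresql://meta_agent_app:secure_password@db-replica-2:5432/meta_agent"),
])

def engine_for(readonly: bool):
    # Round-robin reads across replicas; all writes go to the primary.
    return next(replicas) if readonly else primary

with engine_for(readonly=True).connect() as conn:
    rows = conn.execute(text("SELECT id FROM agents LIMIT 10")).fetchall()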

Multi-Region Deployment

For global deployments, the platform uses a multi-region database strategy:

  • Regional Replicas: Read replicas in each region
  • Data Synchronization: Efficient data replication between regions
  • Conflict Resolution: Handle write conflicts in multi-master setup
  • Latency Optimization: Minimize cross-region queries

Database Security

Access Control

Database access control measures:

  • Role-Based Access: Granular permissions for database users
  • Connection Security: TLS encryption for all connections
  • Network Security: Firewall rules and VPC isolation
  • Audit Logging: Track all database access and changes

Example PostgreSQL role setup:

-- Create application roles
CREATE ROLE meta_agent_app WITH LOGIN PASSWORD 'secure_password';
CREATE ROLE meta_agent_readonly WITH LOGIN PASSWORD 'readonly_password';
CREATE ROLE meta_agent_admin WITH LOGIN PASSWORD 'admin_password' SUPERUSER;

-- Grant permissions to application role
GRANT CONNECT ON DATABASE meta_agent TO meta_agent_app;
GRANT USAGE ON SCHEMA public TO meta_agent_app;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO meta_agent_app;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO meta_agent_app;

-- Grant read-only permissions
GRANT CONNECT ON DATABASE meta_agent TO meta_agent_readonly;
GRANT USAGE ON SCHEMA public TO meta_agent_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO meta_agent_readonly;

-- Set default privileges for future tables
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO meta_agent_app;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO meta_agent_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT USAGE ON SEQUENCES TO meta_agent_app;

Data Encryption

Database encryption measures:

  • Encryption at Rest: File system or tablespace encryption
  • Encryption in Transit: TLS for all connections
  • Column-Level Encryption: Encrypt sensitive columns (see the sketch after this list)
  • Key Management: Secure storage and rotation of encryption keys
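
Column-level encryption can be done server-side with pgcrypto or client-side before values reach the database. A minimal client-side sketch using the cryptography package (key handling is simplified here; in practice the key comes from a secrets manager or KMS):

from cryptography.fernet import Fernet

# Never hard-code or generate keys ad hoc in production; load them from
# a secrets manager and rotate them on a schedule.
key = Fernet.generate_key()
fernet = Fernet(key)

ciphertext = fernet.encrypt(b"user-api-token-12345")  # store this column value
plaintext = fernet.decrypt(ciphertext)                # decrypt after reading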

Edge Database Strategy

For edge deployments, the platform uses a specialized database strategy:

  • Local SQLite Database: Lightweight database for edge devices
  • Synchronization: Bi-directional sync with central PostgreSQL (a sketch follows below)
  • Conflict Resolution: Handle sync conflicts
  • Offline Operation: Function without connectivity

See Edge Infrastructure for more details.
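
One common implementation is a local change log: the edge device records writes while offline and replays them against the central PostgreSQL instance once connectivity returns, with last-writer-wins or application-specific rules resolving conflicts. A minimal sketch of such a change log (table and column names are illustrative):

import sqlite3

conn = sqlite3.connect("edge_agent.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS change_log (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name TEXT NOT NULL,
        row_id     TEXT NOT NULL,
        operation  TEXT NOT NULL,       -- 'insert' | 'update' | 'delete'
        payload    TEXT,                -- JSON snapshot of the row
        changed_at REAL NOT NULL,
        synced     INTEGER NOT NULL DEFAULT 0
    )
""")

def pending_changes():
    # Rows not yet pushed to the central PostgreSQL instance.
    return conn.execute(
        "SELECT * FROM change_log WHERE synced = 0 ORDER BY changed_at"
    ).fetchall()

def mark_synced(ids):
    conn.executemany("UPDATE change_log SET synced = 1 WHERE id = ?",
                     [(i,) for i in ids])
    conn.commit()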

Federated Database Strategy

For federated deployments, the platform uses a specialized database strategy:

  • Data Sovereignty: Keep data within organizational boundaries
  • Secure Queries: Privacy-preserving distributed queries
  • Schema Compatibility: Ensure schema compatibility across organizations
  • Audit Trail: Track cross-organization data access

See Federated Infrastructure for more details.

Database Monitoring

The platform implements comprehensive database monitoring (a query sketch follows below):

  • Performance Metrics: Query performance, connection count, etc.
  • Resource Usage: CPU, memory, disk I/O, etc.
  • Slow Query Tracking: Identify and optimize slow queries
  • Replication Lag: Monitor replica synchronization
  • Connection Pooling: Monitor connection pool health

See Monitoring Infrastructure for more details.
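
Many of these metrics can be read directly from PostgreSQL's statistics views. A short sketch using psycopg2 and the read-only role defined earlier:

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="meta_agent",
                        user="meta_agent_readonly",
                        password="readonly_password")

with conn.cursor() as cur:
    # Connection count by state (active, idle, idle in transaction, ...)
    cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state")
    print(cur.fetchall())

    # Replication lag when run on a replica (NULL on the primary)
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS lag")
    print(cur.fetchone())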

Database Management Scripts

Scripts for database management are located in /infra/scripts/:

  • setup_db.sh - Initialize database and roles
  • backup_db.sh - Create database backup
  • restore_db.sh - Restore from backup
  • migrate_db.sh - Run database migrations
  • analyze_db.sh - Analyze database performance

Best Practices

  • Use database migrations for schema changes
  • Implement comprehensive backup strategy
  • Monitor database performance
  • Optimize queries and indexes
  • Implement connection pooling
  • Use appropriate scaling strategy
  • Secure database access
  • Regularly maintain the database (VACUUM, ANALYZE; see the sketch below)
  • Document database schema and procedures
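
For the maintenance item above, note that VACUUM cannot run inside a transaction block, so a scripted client must enable autocommit. A minimal sketch (credentials are the placeholders from the configuration examples above):

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="meta_agent",
                        user="postgres", password="secretpassword")
conn.autocommit = True  # VACUUM refuses to run inside a transaction

with conn.cursor() as cur:
    cur.execute("VACUUM (ANALYZE, VERBOSE) agent_execution_metrics")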

Last updated: 2025-04-18