Daniel Okafor

Senior software engineer focused on backend systems, distributed architectures, and observability. Currently at Arclight Security, building threat detection infrastructure that processes billions of events per day. Previously built ML serving systems at Cortex Labs and API observability tooling at Relay Systems.

San Francisco, CA | Georgia Tech '19 | daniel@danokafor.com

# What I've Built

Real-Time Threat Detection Pipeline

Arclight Security / 2023-present

Replaced a batch-oriented detection system (15-min cycles) with a streaming architecture. The old system meant attackers had established persistence before alerts fired. The new design uses Kafka for event ingestion, Go consumer services for rule evaluation, and ClickHouse for windowed aggregations.

Go · Kafka · ClickHouse · Redis

Key decision: Each rule type gets its own consumer group, so high-volume simple pattern rules don't bottleneck complex multi-step correlation rules. Built a custom windowed aggregation layer in Go with in-memory state and Redis checkpoints to avoid hitting ClickHouse on every event. The state machine engine tracks attack progression across events for multi-stage detection rules that weren't possible in the batch system.

2.3B events/day · 400+ tenants · 47s detection time (was 14min) · -28% false positive rate

Multi-Tenant Data Isolation Layer

Arclight Security / 2023

Our first Fortune 500 customer required contractual guarantees of data isolation. The existing system used application-level WHERE clauses for tenant filtering, which had produced two near-miss cross-tenant data exposure bugs in QA. A single app bug could have exposed security telemetry across tenants.

PostgreSQL RLS · Go · Connection Pooling

Key decision: Three-layer defense-in-depth instead of relying on any single mechanism. Layer 1: PostgreSQL row-level security policies on all tenant-scoped tables, enforced via SET app.current_tenant. Layer 2: a tenant-aware connection pooler that injects tenant context on every connection checkout. Layer 3: CI integration tests that attempt cross-tenant reads for every API endpoint.
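Layers 1 and 2 can be sketched as follows — the table name, policy, and helper are illustrative, not Arclight's actual schema:

```go
package main

import "fmt"

// Layer 1 (illustrative DDL): Postgres itself filters rows by the tenant
// setting, so a missing WHERE clause in application code can no longer
// leak another tenant's data.
const rlsPolicy = `
ALTER TABLE events ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON events
    USING (tenant_id = current_setting('app.current_tenant')::uuid);`

// Layer 2: the statement a tenant-aware pooler runs on every connection
// checkout. Using set_config with a bind parameter keeps the tenant ID out
// of the SQL text entirely, so no escaping is needed.
func tenantSetQuery(tenantID string) (query string, args []any) {
	return "SELECT set_config('app.current_tenant', $1, false)", []any{tenantID}
}

func main() {
	q, args := tenantSetQuery("tenant-a")
	fmt.Println(q, args)
	fmt.Println(rlsPolicy)
}
```

Because the pooler injects the setting on checkout, application code cannot forget it; because RLS enforces it, even raw SQL cannot bypass it.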

0 cross-tenant leaks (18 months) · Passed Fortune 500 security audit · 3 teams adopted the pattern

Monolith-to-Microservices Migration

Arclight Security / 2023-2024

180K-line Django monolith with 31% test coverage, circular imports everywhere, 25-minute deploys. Three frontend teams, two mobile engineers, and six integration partners all consuming the API. A breaking migration would have been a coordination disaster.

Go · Kong · Django · GraphQL

Key decision: Strangler fig pattern behind a Kong API gateway, routing traffic to either the monolith or new Go microservices per endpoint. First refactored the monolith internally -- eliminated all 34 circular imports, raised test coverage from 31% to 58% -- then extracted services. Compatibility tests hit both old and new implementations and diffed responses before each cutover. Migrated 47 endpoints over 4 months with zero consumer-facing breaking changes.
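The response-diff step before each cutover can be sketched like this — a minimal version of the comparison, assuming JSON responses:

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// responsesMatch compares the JSON bodies returned by the monolith and the
// new Go service for the same request, ignoring key order and whitespace.
// A hypothetical sketch of the diff check run before each endpoint cutover.
func responsesMatch(oldBody, newBody []byte) (bool, error) {
	var a, b any
	if err := json.Unmarshal(oldBody, &a); err != nil {
		return false, fmt.Errorf("old: %w", err)
	}
	if err := json.Unmarshal(newBody, &b); err != nil {
		return false, fmt.Errorf("new: %w", err)
	}
	return reflect.DeepEqual(a, b), nil
}

func main() {
	ok, _ := responsesMatch(
		[]byte(`{"id":1,"name":"x"}`),
		[]byte(`{"name":"x","id":1}`),
	)
	fmt.Println(ok) // true: same content, different key order
}
```

Comparing decoded values rather than raw bytes is what makes the check robust to serializer differences between Django and Go.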

95ms p99 latency (was 820ms) · 47 endpoints migrated · 0 breaking changes · +40% feature velocity

ML Model Serving Infrastructure

Cortex Labs / 2021-2022

High-throughput inference gateway for an ML infrastructure platform. The system needed to serve 50M+ requests per day while maintaining p99 under 200ms and surviving 3x traffic spikes without scaling infrastructure.

Go · Python · Kafka · AWS EKS

Key decision: Request batching plus adaptive load shedding. When traffic spikes hit, the system sheds lower-priority requests gracefully instead of queuing everything until timeouts cascade. Also built the A/B experimentation framework for model deployments with automatic rollback on latency or error-rate regressions.
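The shedding decision can be sketched as a priority cutoff that rises with load — thresholds and priority tiers below are illustrative, not Cortex Labs' production values:

```go
package main

import "fmt"

type Priority int

const (
	Low Priority = iota
	Normal
	High
)

// shouldShed drops lower-priority requests as measured load rises, so the
// gateway degrades gracefully instead of queuing everything until timeouts
// cascade. Load is a normalized utilization figure in [0, 1].
func shouldShed(load float64, p Priority) bool {
	switch {
	case load >= 0.95:
		return p < High // keep only high-priority traffic
	case load >= 0.80:
		return p < Normal // start dropping low-priority traffic
	default:
		return false // normal operation: admit everything
	}
}

func main() {
	fmt.Println(shouldShed(0.85, Low), shouldShed(0.85, High)) // true false
}
```

The key property is that shedding is deliberate and priority-aware: a 3x spike costs low-priority traffic, not p99 latency for everyone.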

50M+ requests/day · <200ms p99 latency · -$127K monthly AWS savings · 99.95% uptime (was 99.7%)

API Traffic Capture Agent

Relay Systems / 2019-2021

Relay needed a way to capture customer API traffic for monitoring without requiring SDK instrumentation (which had poor adoption because it meant modifying production code). Built a transparent agent from scratch in Go.

Go · eBPF · gRPC · Docker/K8s

Key decision: A sidecar agent that uses eBPF-based packet inspection when available and falls back to reverse-proxy mode otherwise. Plugin system for framework-specific enrichment (route-parameter extraction for Express, Flask, and Gin). Local buffering with at-least-once delivery guarantees. Customer onboarding dropped from 2 weeks to 30 minutes.

180+ deployments · 12M requests/day captured · <2ms p99 overhead · 70% of closed deals cited the agent

# Open Source

Cortex CLI

1.2K stars

Added multi-model endpoint support to the open-source Cortex CLI. Designed the traffic routing specification (weighted, header-based, automatic rollback) as a backward-compatible extension to the existing YAML config format. Three PRs merged: core routing engine, CLI management commands, and documentation.

Adopted by 60% of enterprise customers within 6 months. Two community members contributed follow-up improvements building on the routing engine.

otel-go-kit

internal, open-sourced

Shared Go middleware package for OpenTelemetry instrumentation. Automatically instruments HTTP/gRPC requests, outgoing calls, database queries, and Kafka operations. Used by all 14 backend services at Arclight with zero manual instrumentation needed for new services.

Structured Logging RFC

internal standard

Authored the internal logging specification and built a shared Go library adopted across 28 engineers and 14 services. Includes PII scrubbing middleware that caught and redacted 14K instances of sensitive data in its first month.
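A minimal sketch of the scrubbing step, with two illustrative patterns — the actual spec covers more PII classes than these:

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative PII patterns; the real library's set is broader.
var piiPatterns = []*regexp.Regexp{
	regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),   // US SSN
	regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`), // email address
}

// scrubPII redacts sensitive substrings before a log line is emitted — the
// kind of middleware hook the shared logging library applies to every record.
func scrubPII(msg string) string {
	for _, re := range piiPatterns {
		msg = re.ReplaceAllString(msg, "[REDACTED]")
	}
	return msg
}

func main() {
	fmt.Println(scrubPII("user alice@example.com ssn 123-45-6789"))
	// user [REDACTED] ssn [REDACTED]
}
```

Running the scrubber inside the logging library, rather than asking 28 engineers to remember it, is why it could catch 14K instances in a month.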

# Currently Thinking About

// TODO: write these up properly

// for now, working notes:

eBPF vs. sidecar proxies

Whether eBPF-based observability can replace sidecar proxies for service mesh telemetry. Cilium is making a strong case, but the debugging story when things go wrong at the kernel level is still rough. Tradeoff between operational simplicity and observability depth.

CRDTs vs. operational transforms

The tradeoffs of CRDTs vs. operational transforms for real-time collaboration -- been reading Martin Kleppmann's work. CRDTs have better theoretical properties, but the metadata overhead is nontrivial. OT is battle-tested, but its dependence on a central server is a hard constraint.

Microservice migrations fail at data

Why most microservice migrations fail at data ownership, not service boundaries. The Arclight migration reinforced this -- we could split services relatively cleanly, but untangling shared database tables took 3x longer than expected.

Weekend project: chess engine in Rust

Building a chess engine in Rust as a weekend project. Currently ~1800 Elo on Lichess puzzles. Using bitboards for board representation and alpha-beta pruning with iterative deepening. The move generation is fun; the evaluation function is where the interesting decisions live.

# About

I'm a backend engineer in San Francisco. I studied Computer Science at Georgia Tech, concentrating in Systems & Architecture, and I've spent the last five years building the kind of infrastructure that's invisible when it works and painful when it doesn't.

My strongest work is in the space between "we need this to handle 10x more traffic" and "we have zero budget for new infrastructure." I care about systems that are as easy to operate at 3 AM as they are to architect on a whiteboard.

Outside of work, I play chess competitively and contribute to open source on weekends, mostly infrastructure tooling. Both scratch the same itch: pattern recognition, thinking several moves ahead, and knowing when to sacrifice short-term advantage for long-term position.

Stack

Go, Python, TypeScript

PostgreSQL, ClickHouse, Kafka, Redis

AWS, Kubernetes, Terraform

OpenTelemetry, Grafana, Datadog

Certifications

AWS Certified Solutions Architect - Associate

Certified Kubernetes Administrator

Links