Fog Compute: Distributed Intelligence for Web3



Most distributed compute projects solve a scaling problem. I’m building one that solves a trust problem — how do you verify that work was done correctly when the workers are anonymous?

Scaling is the easy part. You have a big computation, you break it into chunks, you send chunks to workers, workers return results, you reassemble. MapReduce solved this in 2004. Every cloud provider offers it as a managed service.

Trust is the hard part. In a centralized system, you trust the workers because you own the machines. In a distributed system with anonymous workers, you trust nothing. A worker might return garbage. A worker might return the right answer to the wrong problem. A worker might return a cached result from a different input. You have to verify everything, and verification can’t cost more than the computation itself.


The Problem Space

Distributed compute has three hard problems. Everyone focuses on the first. Almost nobody solves the second and third.

Problem 1: Task Distribution. Breaking work into independent chunks, routing chunks to available workers, handling worker failures and retries. This is a solved problem. Job queues, work stealing, consistent hashing — the literature is deep and the implementations are mature.
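Consistent hashing is worth a quick sketch, since it is the piece that keeps routing stable as workers come and go: adding or removing a worker only remaps a small fraction of chunks. This is an illustrative toy using only the Rust standard library, not Fog Compute's implementation; the `Ring` type and its methods are my own names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Minimal consistent-hash ring: workers are placed at several points on
// a hash circle (virtual nodes), and a task routes to the first worker
// clockwise from the task's own hash.
struct Ring {
    ring: BTreeMap<u64, String>, // hash point -> worker id
}

fn hash_key<K: Hash>(key: &K) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

impl Ring {
    fn new() -> Self {
        Ring { ring: BTreeMap::new() }
    }

    fn add_worker(&mut self, worker: &str, vnodes: u32) {
        for v in 0..vnodes {
            // Each (worker, vnode) pair lands at a different ring point.
            self.ring.insert(hash_key(&(worker, v)), worker.to_string());
        }
    }

    fn route(&self, task_id: u64) -> Option<&String> {
        let h = hash_key(&task_id);
        // First point at or after h, wrapping around to the start.
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, w)| w)
    }
}
```

The same `task_id` always routes to the same worker until the ring changes, which is what makes retries and result deduplication tractable.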

Problem 2: Result Verification. How do you know the worker actually computed the result instead of returning random data? In academic settings, you run the same computation on multiple workers and compare (replication). In production, replication doubles or triples your cost. You need verification that’s cheaper than the computation itself.

Problem 3: Economic Alignment. Workers need incentives to participate honestly. If cheating is more profitable than honest work, rational workers will cheat. The system design has to make honesty the economically optimal strategy — not through appeals to morality, but through mechanism design.

Fog Compute addresses all three. Problem 1 is table stakes. Problems 2 and 3 are the actual engineering challenge.


Why Rust

The backend is written in Rust. This isn’t a fashion choice.

Distributed compute coordinators are long-running processes that handle thousands of concurrent connections, manage work queues, and process results. Memory safety bugs in this environment are catastrophic. A use-after-free in the coordinator can corrupt the work queue, causing valid results to be discarded and invalid results to be accepted. In C++, you find these bugs in production at 3 AM. In Rust, you find them at compile time.

Performance matters too. The coordinator is on the critical path for every task. If the coordinator adds 50ms of overhead per task and you're processing 10,000 tasks per second, the fleet accumulates 500 CPU-seconds of pure coordination overhead every second. That's 500 cores doing nothing but bookkeeping. Rust's zero-cost abstractions keep coordinator overhead under 2ms per task.

And Rust’s type system encodes invariants that would be runtime checks in other languages. A task that hasn’t been verified can’t be marked complete — the type system prevents it. A result from an untrusted worker can’t flow into the trusted results pool without passing through the verification pipeline. These aren’t convention-enforced rules. They’re compiler-enforced guarantees.
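A minimal sketch of what that looks like, using the typestate pattern. The names (`TaskResult`, `mark_complete`) and the shape of the verification check are mine, not the project's; the point is that "complete" is only defined for the `Verified` type, so skipping verification is a compile error rather than a runtime bug.

```rust
use std::marker::PhantomData;

// Zero-sized marker types: the verification status lives in the type,
// not in a runtime flag.
struct Unverified;
struct Verified;

struct TaskResult<State> {
    task_id: u64,
    payload: Vec<u8>,
    _state: PhantomData<State>,
}

impl TaskResult<Unverified> {
    fn new(task_id: u64, payload: Vec<u8>) -> Self {
        TaskResult { task_id, payload, _state: PhantomData }
    }

    // The ONLY way to obtain a TaskResult<Verified> is to pass the check.
    // On failure the unverified result is handed back for re-queueing.
    fn verify(self, check: impl Fn(&[u8]) -> bool) -> Result<TaskResult<Verified>, Self> {
        if check(&self.payload) {
            Ok(TaskResult { task_id: self.task_id, payload: self.payload, _state: PhantomData })
        } else {
            Err(self)
        }
    }
}

// Completion accepts only verified results. Calling this with a
// TaskResult<Unverified> does not type-check.
fn mark_complete(result: &TaskResult<Verified>) -> u64 {
    result.task_id
}
```

Try to call `mark_complete` on a fresh `TaskResult::new(...)` and the compiler rejects it; the rule is enforced at build time, exactly as the paragraph above describes.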


TON and Cocoon Integration

The project integrates with the TON blockchain ecosystem, specifically the Cocoon framework for decentralized compute markets.

TON provides the economic layer. Workers stake tokens to participate. Honest work earns rewards. Caught cheating burns the stake. The stake-to-reward ratio is calibrated so that honest participation is always more profitable than cheating, even accounting for the probability of detection.

Cocoon provides the marketplace. Compute requestors post tasks with requirements (CPU time, memory, deadline, price). Workers bid on tasks they can fulfill. The marketplace matches requestors with workers based on price, worker reputation, and historical completion rate.

The integration is bidirectional. Fog Compute uses TON/Cocoon for worker recruitment and payment. TON/Cocoon uses Fog Compute’s verification system to validate results and update worker reputation scores.


Service Mesh Architecture

The system runs as a service mesh with four core services:

Coordinator Service: Receives compute requests, breaks them into tasks, manages the task queue, and reassembles results. This is the brain. It’s stateful — it tracks every task’s lifecycle from creation through assignment, execution, verification, and completion.
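The lifecycle the coordinator tracks can be sketched as an explicit state machine. State and method names here are hypothetical; the useful property is that transitions which aren't listed simply don't exist, so a result can never jump to "completed" without passing through verification.

```rust
// Hypothetical sketch of the per-task state machine: creation,
// assignment, execution, verification, completion.
#[derive(Debug, Clone, PartialEq)]
enum TaskState {
    Created,
    Assigned { worker_id: u64 },
    Executing { worker_id: u64 },
    AwaitingVerification { worker_id: u64 },
    Completed,
    Failed { reason: String },
}

impl TaskState {
    /// A result arriving only makes sense for a task that is executing;
    /// any other state returns None (an illegal transition).
    fn on_result_received(self) -> Option<TaskState> {
        match self {
            TaskState::Executing { worker_id } => {
                Some(TaskState::AwaitingVerification { worker_id })
            }
            _ => None,
        }
    }

    /// Completion is only reachable from AwaitingVerification.
    fn on_verified(self) -> Option<TaskState> {
        match self {
            TaskState::AwaitingVerification { .. } => Some(TaskState::Completed),
            _ => None,
        }
    }
}
```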

Worker Gateway: Handles communication with external workers. Authentication, protocol translation, heartbeat monitoring, and result ingestion. Workers connect via WebSocket and receive tasks as encrypted payloads. The gateway doesn’t know what the tasks compute — it only knows the task metadata (size, deadline, payment).

Verification Service: Runs the verification pipeline on submitted results. Multiple verification strategies depending on the task type: deterministic replay for pure functions, probabilistic spot-checking for statistical computations, and cryptographic proof verification for zero-knowledge workloads.

Reputation Service: Maintains worker reputation scores based on historical performance. Accuracy, latency, availability, and stake history all factor into the score. The coordinator uses reputation scores to prioritize task assignment — higher-reputation workers get higher-value tasks.
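A toy version of such a score, combining the four factors named above. The weights are illustrative assumptions, not the project's actual calibration; the design point is that accuracy dominates, so good uptime can't rescue a worker that returns bad results.

```rust
// All factors normalized to [0, 1]. Field names are hypothetical.
struct WorkerStats {
    accuracy: f64,      // fraction of results that passed verification
    latency: f64,       // 1.0 = fastest observed, 0.0 = slowest
    availability: f64,  // fraction of heartbeats answered
    stake_history: f64, // record of stake held without slashing
}

fn reputation(s: &WorkerStats) -> f64 {
    // Illustrative weights summing to 1.0, with accuracy weighted
    // heaviest for the reason stated above.
    0.5 * s.accuracy + 0.2 * s.latency + 0.15 * s.availability + 0.15 * s.stake_history
}
```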

The services communicate over gRPC with mTLS. Every service-to-service call is authenticated and encrypted. The mesh is deployed on Kubernetes with automatic scaling — the Worker Gateway scales horizontally based on connected worker count, and the Verification Service scales based on result queue depth.


Verification Strategies

This is the core intellectual property. How do you verify a computation without re-running it?

Deterministic Replay (for pure functions): The coordinator selects 5% of completed tasks at random and re-runs them on a trusted verifier. If the worker’s result matches, the task is confirmed. If it doesn’t match, the worker is flagged and all of their recent results are re-verified.

The key insight: you don’t need to verify every result. You need the probability of catching a cheater to be high enough that the expected cost of cheating exceeds the expected reward. At a 5% spot-check rate with stake slashing, cheating has negative expected value for any rational worker.
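That claim is easy to sanity-check numerically. A back-of-envelope sketch with illustrative numbers (reward, compute cost, and stake are assumptions, not the system's real parameters): cheating beats honesty only when the reward saved by skipping the work exceeds the expected stake loss from being caught.

```rust
// Per-task expected-value comparison. Returns (cheat EV) - (honest EV);
// a negative value means honesty is the better strategy.
fn cheating_expected_value(reward: f64, compute_cost: f64, stake: f64, spot_rate: f64) -> f64 {
    // Cheater: keeps the reward and skips the compute cost, but with
    // probability `spot_rate` the spot-check catches it and the stake burns.
    let cheat = (1.0 - spot_rate) * reward - spot_rate * stake;
    // Honest worker: earns the reward, pays the cost of doing the work.
    let honest = reward - compute_cost;
    cheat - honest
}
```

With a 5% spot-check rate, a stake of 20x the per-task reward, and compute costing half the reward, the edge for cheating comes out negative, matching the claim above. Drop the stake or the spot-check rate to zero and cheating flips to positive expected value, which is why both levers exist.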

Probabilistic Spot-Checking (for statistical computations): Some computations aren’t deterministic — they involve random sampling, approximate algorithms, or floating-point operations that vary across hardware. For these, the verifier runs the same computation with the same random seed and checks that results fall within acceptable bounds.

The bounds are task-specific. A Monte Carlo simulation might accept results within 2 standard deviations. A numerical optimization might accept any result that improves on the previous iteration’s objective value.
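A sketch of that acceptance check for a seeded Monte Carlo estimate. A toy linear congruential generator stands in for a real seeded RNG, and the function names are mine; the structure is the one described above: replay the sampling with the same seed, then accept the worker's value if it lies within k standard errors of the verifier's own estimate.

```rust
// Minimal deterministic generator, for illustration only: same seed,
// same sample stream, on any hardware.
fn lcg_uniform(state: &mut u64) -> f64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    ((*state >> 11) as f64) / ((1u64 << 53) as f64)
}

// Replay the worker's sampling run and check the submitted estimate
// against the verifier's mean, within k standard errors.
fn verify_monte_carlo(worker_estimate: f64, seed: u64, n: usize, k: f64) -> bool {
    let mut state = seed;
    let samples: Vec<f64> = (0..n).map(|_| lcg_uniform(&mut state)).collect();
    let mean = samples.iter().sum::<f64>() / n as f64;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1) as f64;
    let std_err = (var / n as f64).sqrt();
    (worker_estimate - mean).abs() <= k * std_err
}
```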

Cryptographic Proofs (for ZK workloads): For tasks that support zero-knowledge proofs, the worker submits a proof alongside the result. The proof can be verified in constant time regardless of the computation’s complexity. This is the gold standard for verification but only works for computations that have been expressed in a ZK-compatible circuit.


CI/CD with Playwright E2E

The project has 289 tests passing out of 313 total. The failing tests are for features still under development — they’re written test-first as specification.

The CI pipeline runs three test suites:

Unit tests (Rust): Pure function tests for the coordinator logic, verification algorithms, and reputation calculations. These run in under 30 seconds and catch logic bugs.

Integration tests (Docker Compose): Spin up the full service mesh locally, submit tasks through the coordinator, simulate workers producing results, and verify the end-to-end pipeline. These take 3-4 minutes and catch communication failures, protocol mismatches, and state management bugs.

E2E tests (Playwright): Test the admin dashboard — a web UI for monitoring task queues, worker status, and system health. Playwright drives a real browser through the dashboard, verifying that task status updates propagate correctly, worker reputation displays accurately, and the system degrades gracefully when services fail.

The E2E suite caught a bug that unit and integration tests missed: the dashboard showed a task as “completed” before verification finished. The task was complete from the coordinator’s perspective (result received) but not from the system’s perspective (result not yet verified). The Playwright test clicked into the task detail and found the verification status was “pending” while the dashboard card showed “complete.” That’s a trust-breaking UI bug — it could lead an operator to trust unverified results.


Current State and What’s Left

The project is at 35% completion. The coordinator, worker gateway, and reputation service are functional. The verification service has deterministic replay working but probabilistic and ZK verification are still in development.

The TON/Cocoon integration is stubbed — the interfaces are defined and the data flows are mapped, but the actual blockchain interactions aren’t connected yet. That’s deliberate. I’m building the compute layer first and the economic layer second. You can’t design good incentives until you understand the failure modes, and you can’t understand the failure modes until you’ve built and tested the compute layer.

What I’ve learned so far: distributed trust is harder than distributed compute by at least an order of magnitude. The coordination problems are tedious but well-understood. The verification problems are genuinely novel — every computation type needs a different verification strategy, and the wrong strategy either costs too much (re-running everything) or catches too little (spot-checking at too low a rate).


Working on distributed compute, verification systems, or Web3 infrastructure? I’m building in this space and happy to compare notes. Book a call to discuss distributed trust architectures.