Distributed System Computing-&-Designing
A distributed system is a collection of autonomous computing entities (nodes) that communicate over a network and coordinate their actions by passing messages. To the end-user, the collection of independent nodes appears as a single, unified, coherent system.
Resource Sharing: Combining the hardware capacities (compute, storage, and specialized accelerators) of thousands of machines.
Concurrency: Running tasks in parallel across different physical CPUs to dramatically reduce execution times.
Fault Tolerance: If one node goes offline or crashes, the remaining nodes absorb the workload, ensuring the overall system experiences zero downtime.
No Shared Memory: Nodes do not share physical RAM. They only discover the state of other nodes by explicitly transmitting messages across a network link.
No Global Clock: Due to network transmission delays and physical limits, it is impossible for all machines to have a perfectly synchronized global time. Distributed systems rely on logical clocks (like Lamport Timestamps or Vector Clocks) to establish the sequence or ordering of events.
Independent Failures: Individual components fail unpredictably while other parts of the system continue running normally. Handling these partial failures is the primary challenge of distributed software development.
Distributed Architecture specifies how the structural components of a distributed system are logically organized and how they interact with each other.
Centralized vs. Decentralized vs. Distributed Node Frameworks.
Layered Architecture: Components are organized hierarchically. A layer only makes requests to the layer directly beneath it. Common in standard enterprise applications (Presentation layer → Logic layer → Data layer).
Object-Based / Service-Oriented (SOA): Nodes are treated as independent objects or services connected through remote procedure calls (RPCs) or web service protocols.
Event-Driven (Publish-Subscribe): Nodes communicate asynchronously by broadcasting events to a central broker (e.g., Apache Kafka). Producers emit events without knowing who the consumers are, allowing systems to be loosely coupled.
Peer-to-Peer (P2P): Completely decentralized systems where every node acts as both a client and a server (e.g., BitTorrent or Blockchain networks).
The CAP Theorem dictates that a distributed data store can simultaneously provide at most two of the following three guarantees when a network partition occurs:
Consistency (C): Every read request receives the most recent write or an error.
Availability (A): Every non-failing node returns a non-error response for every request (without guaranteeing it contains the most recent write).
Partition Tolerance (P): The system continues to operate despite an arbitrary number of dropped or delayed messages between nodes.
The Reality of CAP: Because physical networks will inevitably experience delays or drops (Partitions), software designers must choose between Consistency (refusing the request to avoid returning old data) or Availability (accepting the request but risking returning stale data).
Distributed Infrastructure refers to the physical networks, bare-metal hardware systems, systems software, and orchestration utilities that support distributed software applications.
Compute Clusters: Groups of independent physical or virtual machines clustered together using high-speed network connections to run unified distributed jobs.
Distributed File Systems (DFS): File systems that store data across multiple physical machines while providing a standard, unified file access interface (e.g., HDFS, Ceph).
Message Brokers / Event Logs: High-throughput streaming platforms that capture, persist, and route messages asynchronously between independent services (e.g., Apache Kafka, RabbitMQ).
Distributed Coordinators: Highly consistent configuration stores used for leader election, cluster state management, and service naming configuration (e.g., Apache ZooKeeper, etcd).
In modern microservices environments, managing node-to-node communication becomes incredibly complex. A service mesh adds a dedicated infrastructure layer to handle this traffic automatically.
Service Mesh Architecture with Sidecar Proxies.
As shown above, a lightweight network proxy (often called a Sidecar Proxy) is deployed directly alongside every microservice instance. Instead of the application code handling network timeouts, encryption, or retries, the sidecar proxies intercept all incoming and outgoing HTTP/gRPC traffic, managing cluster communication safely without altering the underlying code.
Distributed Designing is the process of architecting software systems to maintain data integrity, fast performance, and high availability when spread across many nodes.
Because nodes frequently fail or experience network lags, distributed systems must use strict consensus protocols to agree on a single data value or system state.
Paxos & Raft: Protocols that elect a clear "Leader" node. The leader accepts writes, log entries are safely replicated to a majority of follower nodes, and changes are committed only after acknowledgment.
Sharding: Breaking up a massive database horizontally and distributing the segments across completely different database servers to increase throughput.
Consistent Hashing: An algorithmic mapping technique used to minimize re-sharding disruptions. When a new storage node is added or removed from a cluster, consistent hashing ensures only a tiny fraction of data keys need to be reallocated.
Circuit Breaker Pattern: Prevents a failing downstream service from causing a cascading failure across the entire system. If a service begins failing or timing out, the circuit breaker trips open, instantly returning a fallback error response without wasting system resources on broken network calls.
Idempotency: Designing API operations so that making the exact same request multiple times produces the identical system state as making it once. This is vital because network hiccups cause clients to frequently retry requests.
When moving data between distributed infrastructure components, designers must explicitly choose an architectural trade-off for message deliveries:
| Delivery Guarantee | Operational Behavior | Overhead Cost | Typical Use Case |
|---|---|---|---|
| At-Most-Once | Messages are sent with no retries. Messages can be lost, but are never duplicated. | Lowest overhead | Real-time telemetry, video streaming |
| At-Least-Once | Messages are retries until acknowledged. No data loss occurs, but duplicates can happen. | Medium overhead | Log collection, billing metrics |
| Exactly-Once | Messages are guaranteed to be delivered precisely once via coordinated two-phase commits. | Highest overhead | Financial transactions, banking ledger adjustments |