Load balancing helps us scale horizontally across an ever-increasing number of servers, but caching enables us to make vastly better use of the resources we already have. Caches can exist at all levels of an architecture. A cache is like short-term memory: it has a limited amount of space, but it is typically much faster than the original data source.
Let’s talk about different approaches first.
1) Application server cache:
We could have a cache directly on the application server. Each time a request is made to the service, the node quickly returns local cached data if it exists. If it is not in the cache, the requesting node fetches the data from network storage, such as a database.
How will this solution scale as we grow to many nodes? If we expand to multiple nodes, each node can still host its own cache. However, if the load balancer randomly distributes requests across the nodes, the same request will go to different nodes, increasing cache misses:
- This results in extra network calls for the same data!
- Extra storage, since the same data is stored on two different nodes!
- This may cause inconsistent application behavior. How would you handle consistency and cache invalidation for each node?
Two choices for overcoming this hurdle are global caches and distributed caches.
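The per-node flow described above can be sketched in a few lines. This is a minimal illustration, not production code; the `AppServerCache` class and the dict-backed database stub are hypothetical names introduced here for the example:

```python
# Minimal sketch of a per-node application cache (names are illustrative).
class AppServerCache:
    def __init__(self, fetch_from_db):
        self._cache = {}                  # local, in-process cache
        self._fetch_from_db = fetch_from_db

    def get(self, key):
        if key in self._cache:            # cache hit: served from local memory
            return self._cache[key]
        value = self._fetch_from_db(key)  # cache miss: go to the database
        self._cache[key] = value
        return value


db = {"user:1": "Alice"}                  # stand-in for network storage
cache = AppServerCache(db.get)
cache.get("user:1")                       # miss -> fetched from the DB
cache.get("user:1")                       # hit  -> served from the local cache
```

With many nodes behind a load balancer, each node holds its own `_cache` dict, which is exactly how the duplicated storage and inconsistency problems above arise.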
2) Global caches
All the nodes use the same single cache space. Each application node queries the cache in the same way it would a local one. However, it is very easy to overwhelm a single global cache system as the number of clients and requests increases, though it can be very effective in some architectures.
3) Distributed cache
Typically, the cache is divided up using a consistent hashing function, and each of its nodes owns part of the cached data. If a requesting node is looking for a certain piece of data, it can quickly use the hashing function to locate the information within the distributed cache and determine whether the data is available. One advantage of a distributed cache is therefore the ease with which we can increase the cache space, simply by adding nodes to the pool.
- A disadvantage of distributed caching is resolving a missing node. We may get around this issue by storing multiple copies of the data on different nodes.
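A minimal sketch of the consistent hashing idea follows. The `ConsistentHashRing` class, node names, and the use of MD5 are all illustrative assumptions, not a specific library's API; real systems often add replication on top of this, as noted above:

```python
import hashlib
from bisect import bisect_right

# Sketch of consistent hashing for a distributed cache (illustrative only).
class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            for i in range(vnodes):        # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first node point at or after the key's hash.
        idx = bisect_right(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
ring.node_for("user:42")  # every request node computes the same owner
```

Because only the keys between two ring points move when a node joins or leaves, adding a node to grow the cache disturbs a small fraction of the keyspace rather than rehashing everything.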
4) Content Distribution Network (CDN)
This is the best solution if our site serves large amounts of static media. If the system we are building isn’t large enough to have its own CDN yet, we can serve static media off a separate subdomain such as “content.emrekoca.com” using a lightweight HTTP server like Apache, and later cut the DNS over from our servers to a CDN layer.
How does a typical CDN work?
- A request asks the CDN for a piece of static media.
- The CDN will serve the content if it is available locally.
- If it is not available, the CDN will query the back-end servers for the media and cache it locally.
- Then it serves the media to the requesting client.
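The four steps above amount to a read-through cache at the edge. A minimal sketch, where the `CdnEdge` class and the dict-backed origin are illustrative stand-ins rather than any real CDN's API:

```python
# Sketch of the CDN read-through flow described above (names illustrative).
class CdnEdge:
    def __init__(self, origin):
        self._local = {}       # content cached at this edge location
        self._origin = origin  # back-end (origin) servers

    def serve(self, path):
        if path not in self._local:                  # not available locally:
            self._local[path] = self._origin[path]   # query origin, cache it
        return self._local[path]                     # serve to the client


origin = {"/img/logo.png": b"<png bytes>"}
edge = CdnEdge(origin)
edge.serve("/img/logo.png")  # first request: fetched from origin, then cached
edge.serve("/img/logo.png")  # later requests: served straight from the edge
```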
If the data is modified in the database, it should be invalidated in the cache; otherwise, this can cause inconsistent application behavior. There are three main kinds of write policies:
- Write through cache : Writes go through the cache, and a write is confirmed as successful only if the writes to BOTH the DB and the cache succeed. We get complete data consistency between cache and storage, and nothing is lost in case of a crash, power failure, or other system disruption. However, write latency is higher, as every write hits two separate systems.
- Write around cache : Writes go directly to the DB, bypassing the cache. This may reduce write latency (with one exception!). However, it increases cache misses, because the cache must read the information from the DB on a miss. This can lead to higher read latency for applications that write and quickly re-read data: those reads must go to slower back-end storage and experience higher latency.
- Write back cache : The write is done directly to the caching layer and is confirmed as soon as the write to the cache completes. The cache then asynchronously syncs the write to the DB. This gives very low write latency and high write throughput for write-intensive applications. However, there is a risk of losing data if the caching layer dies, since the only copy of the written data is in the cache. We can mitigate this by having more than one replica acknowledge the write.
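The difference between the first and third policies can be shown in a few lines. This is a deliberately simplified sketch with a plain dict standing in for the DB, and the `flush` call simulates synchronously what a real write-back cache would do asynchronously:

```python
# Sketch contrasting write-through and write-back (illustrative, not production).
class WriteThroughCache:
    def __init__(self, db):
        self.db, self.cache = db, {}

    def put(self, key, value):
        self.cache[key] = value
        self.db[key] = value       # confirmed only after BOTH writes succeed


class WriteBackCache:
    def __init__(self, db):
        self.db, self.cache, self.dirty = db, {}, set()

    def put(self, key, value):
        self.cache[key] = value    # confirmed as soon as the cache write completes
        self.dirty.add(key)        # the DB write is deferred

    def flush(self):               # asynchronous sync, simulated synchronously here
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()


db = {}
wb = WriteBackCache(db)
wb.put("k", "v")                   # fast: only the cache is updated
wb.flush()                         # later: the pending write reaches the DB
```

A crash between `put` and `flush` is exactly the data-loss window described above: `"k"` would exist only in the cache.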
Cache eviction policies
Following are some of the most common cache eviction policies:
- First In First Out (FIFO): The cache evicts the first block accessed first without any regard to how often or how many times it was accessed before.
- Last In First Out (LIFO): The cache evicts the block added most recently first, without any regard to how often or how many times it was accessed before.
- Least Recently Used (LRU): Discards the least recently used items first.
- Most Recently Used (MRU): Discards, in contrast to LRU, the most recently used items first.
- Least Frequently Used (LFU): Counts how often an item is needed. Those that are used least often are discarded first.
- Random Replacement (RR): Randomly selects a candidate item and discards it to make space when necessary.
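To make the policies concrete, here is a tiny simulation (the `simulate` helper is a hypothetical name introduced for this example) showing that FIFO and LRU evict different keys for the very same access pattern:

```python
from collections import OrderedDict

# Tiny demo: FIFO and LRU evict different keys for the same access pattern.
def simulate(policy, capacity, ops):
    cache = OrderedDict()
    for op, key in ops:
        if op == "get" and key in cache and policy == "LRU":
            cache.move_to_end(key)         # LRU refreshes recency on access
        elif op == "put":
            if key in cache:
                del cache[key]
            elif len(cache) >= capacity:
                cache.popitem(last=False)  # evict the oldest entry
            cache[key] = True
    return list(cache)


ops = [("put", "a"), ("put", "b"), ("get", "a"), ("put", "c")]
simulate("FIFO", 2, ops)  # ['b', 'c'] - 'a' evicted (inserted first)
simulate("LRU", 2, ops)   # ['a', 'c'] - 'b' evicted (least recently used)
```

FIFO ignores the `get` on `"a"`, so `"a"` is still the oldest entry and gets evicted; LRU refreshed `"a"` on that access, so `"b"` goes instead.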
Design a distributed key value caching system
The goal of this article is to design a distributed key-value caching system, such as Memcached or Redis (the most popular ones out there today). Let’s start with the questions we have to answer:
- What is the amount of data that we need to cache?
- It depends on what we are building; let’s assume terabytes in this case.
- What should be the cache eviction policy?
- Let’s focus on LRU.
- What should be the access pattern for the given cache or cache Invalidation method?
- Let’s go with write-back-cache!
- What is the kind of QPS we expect for the system?
- Don’t overload the machines! If a single machine has to handle 1M QPS, we run a high risk of elevated latency, or of the machine dying because queries are not answered fast enough!
- Is Latency a very important metric for us?
- The whole point of caching is low latency!
- How about Consistency vs Availability?
- Unavailability in a caching system means a cache miss, which leads to high latency due to reads from a slower medium (disk instead of memory!).
- Choose availability over consistency to reduce latency. Accept eventual consistency, as long as we eventually see new changes within a reasonable time.
- What data structure will you use to implement this?
- A hash map and a linked list should do the job! A doubly linked list gives better performance on the remove operation.
- What happens when a machine handling a shard goes down?
- Single machine: If we only have one machine per shard, then if the machine goes down, all requests to that shard will start hitting the DB and hence there will be elevated latency.
- Multiple machines: If we have many machines, we can have multiple machines per shard, each maintaining exactly the same data.
- Since we have multiple servers maintaining the same data, it is possible that the data is not in sync between the servers.
- This means that a few keys might be missing on some of the servers, and a few servers might have older values for the same keys.
- Master-slave technique: There is only one active server at a time per shard, and it has a follower that keeps receiving the updates. When the master server goes down, the slave server takes over as the master.
- Master and slave can maintain a change log with version number to make sure they are caught up.
- If we are fine with all servers becoming eventually consistent, then we can have one master that takes all the write traffic and many read replicas that serve the read traffic.
- Or we can go with peer-to-peer systems; Apache Cassandra is a great example of this kind of architecture!
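The hash map + doubly linked list answer above can be sketched directly. This is an in-memory, single-node illustration of the LRU building block (the `LRUCache` and `Node` names are introduced here for the example); sharding and replication would sit on top of it:

```python
# Sketch of the map + doubly-linked-list answer above: an O(1) LRU cache.
class Node:
    __slots__ = ("key", "value", "prev", "next")
    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.map = {}                          # key -> node, O(1) lookup
        self.head, self.tail = Node(), Node()  # sentinels; head side = most recent
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):                   # O(1) thanks to the prev pointer
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.next, node.prev = self.head.next, self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.map.get(key)
        if node is None:
            return None
        self._unlink(node)
        self._push_front(node)                 # accessed: now most recently used
        return node.value

    def put(self, key, value):
        if key in self.map:
            self._unlink(self.map.pop(key))
        elif len(self.map) >= self.capacity:
            lru = self.tail.prev               # least recently used sits at the tail
            self._unlink(lru)
            del self.map[lru.key]
        node = Node(key, value)
        self.map[key] = node
        self._push_front(node)
```

The doubly linked list is what makes eviction and the "move to front on access" step O(1): with only a singly linked list, unlinking a node would require walking the list to find its predecessor.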
I have used draw.io for the diagrams in this article (it is a free online tool!).