Emprovise Blog: Performance

Showing posts with label Performance. Show all posts

Monday, May 18, 2020

Akka - Evolution of Multithreading

Over the past decade with rise of mobile applications, distributed micro-services and cloud-infrastructure services, the applications need to be highly responsive, provide high throughput and efficiently utilize system resources. Many real-time systems ranging from financial applications to games cannot wait for single-threaded process to complete. Also resource-intensive systems would take a lot of time to finish tasks without parallelization. Multithreading takes advantage of the underlying hardware by running multiple threads making the application more responsive and efficient.

Multithreading achieved through context switching, gives rise to many problems such as thread management, accessing shared resources, race conditions and deadlocks. Some of the multithreading concepts were developed to resolve these problems. Thread pool allowed to decouple task submission and execution, and alleviate from manual creation of threads. It consists of homogeneous worker threads (as pool size) that are assigned to execute tasks and returned back to the pool once task is finished. A synchronization technique of locks is used to limit access to a resource by multiple threads. Mutex on other hand is used to guard shared data, only allowing a single thread to access a resource. Such locking seriously limits concurrency with blocked caller threads not performing any work while the CPU does the heavy-lifting of suspending and restoring them later. The ExecutorService was introduced as part of Concurrency API in Java, which provide higher abstraction on threads. It managed thread creation and maintained thread pool under the hood, and is capable of running multiple asynchronous tasks. The Callable threads were added to allow to return results back to the caller within the Future object. These and many other improvements help to simplify the concurrency code but is not enough to avoid the complex thread synchronization.

Concurrency means that multiple tasks run in overlapping time periods, while Parallelism means the tasks are split up into smaller subtasks to be processed in parallel. Concurrent tasks are stateful, often with complex interactions, while parallelizable tasks are stateless, do not interact, and are composable. A concurrency model specifies how threads in the the system collaborate to complete the tasks they are are given. The important aspect of a concurrency model is whether the threads are designed to share a common state or each thread has its own own state isolated from other threads. When threads share state by using and accessing the shared object, problems like race conditions and deadlocks may occur. On the other hand when threads have separate state, they need to communicate either by exchanging immutable objects among them, or by sending copies of objects (or data) among them. This allows for no two threads to write to the same object thus avoiding the concurrency problems faced in shared state. The Parallel worker model, which is most common, has a delegator distributes the incoming jobs to different workers. Each worker completes the entire job, working in parallel and running in different threads. The parallel worker model works well for isolated jobs but becomes complex when workers need to access shared data. Event driven model (Reactive model) is were workers perform the partial job and delegate the remaining job to another worker. Each worker is running in its own thread, and shares no state with other workers. Jobs may even be forwarded to more than one worker for concurrent processing. The workers react to events occurring in the system, either received from the outside world or emitted by other workers. Since workers know that no other threads modify their data, the workers can be stateful making them faster than stateless workers. Akka is one such event driven concurrency model.

Akka is a toolkit and runtime for building highly concurrent, distributed, and fault tolerant applications on the JVM. Akka creates a layer between the actor and baseline system so that the actor's only job is to process the messages. All the complexity of creating and scheduling threads, receiving and dispatching messages, and handling race conditions and synchronization, is relegated to the framework to handle transparently. Akka model is a a concurrency model which avoids concurrent access to mutable state and uses asynchronous communications mechanisms to provide concurrency. Akka encourages the push model using messages and a single shared queue, rather than the pull model were each worker processes from its own queue.

The following characteristics of Akka allow you to solve difficult concurrency and scalability challenges in an intuitive way:

Event-driven model: Actors perform work in response to messages. Communication between Actors is asynchronous, allowing Actors to send messages and continue their own work without blocking to wait for a reply.
Strong isolation principles: Actors don't have any public API methods which can be invoked. Instead, its public API is defined through messages that the actor handles. This prevents any sharing of state between Actors, with the only way to observe another actor's state is by sending it a message asking for it.
Location transparency: The system constructs Actors from a factory and returns references to the instances. Because location doesn’t matter, Actor instances can start, stop, move, and restart to scale up and down as well as recover from unexpected failures. It enables the actors to know the origin of the messages they receive. The sender of the message may exist in the same JVM or another JVM, thus allowing Akka actors to run in a distributed environment without any special code.
Lightweight: Each instance consumes only a few hundred bytes, which realistically allows millions of concurrent Actors to exist in a single application.

Note: Lightbend announced the release of Akka 2.6 in Nov 2019 with a new Typed Actor API ("Akka Typed"). In the typed API, each actor needs to declare which message type it is able to handle and the type system enforces that only messages of this type can be sent to the actor. The classic Akka APIs are still supported even though it is recommended to use the new Akka Typed API for new projects. Currently only classic Akka APIs are used in this post but as the new typed APIs is adopted it would be updated.

Actors and Actor System

An actor is a container for State, Behavior, a Mailbox, Child Actors and a Supervisor Strategy. An actor object needs to be shielded from the outside in order to benefit from the actor model hence actors are represented to the outside using actor references, which are objects that can be passed around freely and without restriction.

ActorSystem

All actors form part of an hierarchy known as the actor system. Actor system is a logical organization of actors into a hierarchical structure. It provides the infrastructure through which actors interact with one another. It is a heavyweight structure per application, which allocates n number of threads. Hence it is recommended to create one ActorSystem per application. ActorSystem manages the life cycle of the actors within the system and supervises them. The creator actor becomes the parent of the newly created child actor. The list of children is maintained within the actor's context. Akka creates two built-in guardian actors in the system before creating any other actor. The first actor created by actor system is called the root guardian. It is the parent of all actors in the system, and the last one to stop when the actor system is terminated. The second actor created by actor system is called the system guardian with system namespace. Akka or other libraries built on top of Akka may create actors in the system namespace. The user guardian is the top level actor that we can provide to start all other actors in the application. An actor created using system.actorOf() are children of the user guardian actor. Top-level user-created actors are determined by the user guardian actor as to how they will be supervised. The root guardian is the parent and supervisor of both the user guardian and system guardian.

Messaging and Mailbox

Actors communicate exclusively by exchanging messages. Every actor has an exclusive address and a mailbox through which it can receive messages from other actors. Mailbox messages are processed by the actor in consecutive order. There are multiple mailbox implementations with the default implementation being FIFO.

Akka ensures that each instance of an actor runs in its own lightweight thread, shielding it from rest of the system and that messages are processed one at a time. The messages are required to be immutable since actors can potentially access the same messages concurrently, in order to avoid race conditions and unexpected behaviors. Hence actors can be implemented without explicitly worrying about concurrency and synchronized access using locks. When a message is processed it is matched with the current behavior of the actor which is the function which defines the actions to be taken in reaction to the message at that point in time. When a message is sent to an actor which does not exist or is not already running then it goes to a special Dead Letter actor (/deadLetters).

State

An actor contains many instance variables to maintain state while processing multiple messages. Each actor is provided with the following useful information for performing its tasks via the Akka Actor API:

sender: an ActorRef to the sender of the message currently being processed
context: information and methods relating to the context within which the actor is running
supervisionStrategy: defines the strategy to be used for recovering from errors
self: the ActorRef for the actor itself

Behind the scenes Akka will run sets of actors on sets of real threads, where typically many actors share one thread, and subsequent invocations of one actor may end up being processed on different threads. Akka ensures that this implementation detail does not affect the single-threadedness of handling the actor’s state. Since the internal state is vital to an actor's operations, when the actor fails and is restarted by its supervisor, the state will be created from scratch.

Actor Lifecycle

An actor is a stateful resource that has to be explicitly started and stopped. An actor can create, or spawn, an arbitrary number of child actors, which in turn can spawn children of their own, thus forming an actor hierarchy. ActorSystem hosts the hierarchy and there can be only one root actor, an actor at the top of the hierarchy of the ActorSystem. ActorContext has the contextual information for the actor and the current message. It is used for spawning child actors and supervision, watching other actors to receive a Terminated(otherActor) events, logging and request-response interactions using ask with other actors. ActorContext is also used for accessing self ActorRef using context.getSelf().

Every Actor that is created must also explicitly be destroyed even when it's no longer referenced. The lifecycle of a child actor is tied to the parent, a child can stop itself or be stopped by parent at any time but it can never outlive its parent. In other words, whenever an actor is stopped, all of its children are recursively stopped. Actor can be stopped using the stop method of the ActorSystem. Stopping a child actor is done by calling context.stop(childRef) from the parent.

An Actor has the following life-cycle methods:

Actor's constructor: An actor’s constructor is called just like any other Scala class constructor, when an instance of the class is first created.
preStart: It is only called once directly during the initialization of the first instance, i.e. at creation of its ActorRef. In the case of restarts, preStart() is called from postRestart(), therefore if not overridden, preStart() is called on every restart.
postStop: It is sent just after the actor has been stopped. No messages are processed after this point. It can be used to perform any needed cleanup work. Akka guarantees postStop to run after message queuing has been disabled for the actor. All PostStop signals of the children are processed before the PostStop signal of the parent is processed.
preRestart: According to the Akka documentation, when an actor is restarted, the old actor is informed of the process when preRestart is called with the exception that caused the restart, and the message that triggered the exception. The message may be None if the restart was not caused by processing a message.
postRestart: The postRestart method of the new actor is invoked with the exception that caused the restart. In the default implementation, the preStart method is called.

If initialization needs to occur every time an actor is instantiated, then constructor initialization is used. On contrary, if initialization needs to occur only the first instance of the actor which is created, then initialization is added to preStart and postRestart is overridden to not call the preStart method. Below is the example of implementing all the actor lifecycle methods.

import akka.actor.{Actor,ActorSystem, Props}  
  
class RootActor extends Actor{  
  def receive = {  
    case msg => println("Message received: "+msg);  
    10/0;  
  }  
  override def preStart(){  
    super.preStart();  
    println("preStart method is called");  
  }  
  override def postStop(){  
    super.postStop();  
    println("postStop method is called");  
  }  
  override def preRestart(reason:Throwable, message: Option[Any]){  
    super.preRestart(reason, message);  
    println("preRestart method is called");  
    println("Reason: "+reason);  
  }  
  override def postRestart(reason:Throwable){  
    super.postRestart(reason);  
    println("postRestart is called");  
    println("Reason: "+reason);  
  }  
}

Supervision and Monitoring

In an actor system, each actor is the supervisor of its children. When an actor creates children for delegating its sub-tasks, it will automatically supervise them. If an actor fails to handle a message, it suspends itself and all of its children and sends a message, usually in the form of an exception, to its supervisor. Once an actor terminates, i.e. fails in a way which is not handled by a restart, stops itself or is stopped by its supervisor, it will free up its resources, draining all remaining messages from its mailbox into the system's dead letter mailbox. The mailbox is then replaced within the actor reference with a system mailbox, redirecting all new messages into the drain. Actors cannot be orphaned or attached to supervisors from the outside, which might otherwise catch them unawares.

When a parent receives the failure signal from its child, depending on the nature of the failure, the parent decides from following options:

Resume: Parent starts the child actor, keeping its internal state.
Restart: Parent starts the child actor by clearing its internal state.
Stop: Stop the child actor permanently.
Escalate: Escalate the failure by failing itself and propagating the failure to its parent.

Restart of the actor is carried by creating a new instance of the underlying Actor class and replacing the failed instance with the fresh one inside the child's ActorRef. The new actor then resumes processing its mailbox, without processing the message during which the failure occurred.

The supervision strategy is typically defined by the parent actor when it spawns a child actor. The default supervisor strategy is to stop the child in case of failure. There are two types of supervision strategies to supervise any actor, namely one-for-one strategy and one-for-all strategy. The one-for-one strategy applies the supervision directive to the failed child while one-for-all strategy applies it to all its siblings. The one-for-one strategy is the default supervision strategy. Below is an example One-for-One Strategy implementation.

import akka.actor.SupervisorStrategy.{Escalate, Restart, Stop}
import akka.actor.{Actor, OneForOneStrategy}

class Supervisor extends Actor {

 override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) {
      case _: ArithmeticException      => Resume
      case _: NullPointerException     => Restart
      case _: IllegalArgumentException => Stop
      case _: Exception                => Escalate
    }

  def receive = {
    case p: Props => sender() ! context.actorOf(p)
  }
}

Monitoring is used to tie one actor to another so that it may react to the other actor’s termination, in contrast to supervision which reacts to failure. Since actors creation and restarts are not visible outside their supervisors, the only state change available for monitoring is the transition from alive to dead. Lifecycle monitoring is implemented using a Terminated message to be received by the monitoring actor, where the default behavior is to throw a special DeathPactException if not otherwise handled. The Terminated messages can be listened by invoking ActorContext.watch(targetActorRef). Termination messages are delivered regardless of the order of termination of the actors. Monitor enables to recreate actors or schedule it at later time as a retry mechanism, thus providing an alternative to restarting the actor by the supervisor. Below example show the parent actor monitoring the child actor for termination using watch() method.

import akka.actor.Actor  
import akka.actor.ActorLogging  
import akka.actor.PoisonPill  
import akka.actor.Terminated
import akka.actor.SupervisorStrategy.Escalate

class DeathPactParentActor extends Actor with ActorLogging {

   override val supervisorStrategy = OneForOneStrategy() {
     case _: Exception => {
       log.info("The exception is ducked by the Parent Actor. Escalating to TopLevel Actor")
       Escalate
     }
   }
  
  def receive={
    case "start"=> {
      val child=context.actorOf(Props[DeathPactChildActor])
      context.watch(child) //Watches but doesnt handle terminated message. Throwing DeathPactException here.
      child ! "stop"
    }
    // case Terminated(_) => log.error(s"$actor died"")        // Terminated message not handled
  }
}

class DeathPactChildActor extends Actor with ActorLogging {  
  def receive = {
    case "stop"=> {
      log.info ("Actor going to be terminated")
      self ! PoisonPill
    }
  }
}

Creating Actors

Akka creates Actor instances using a factory spawn methods which return reference to the actor's instance. Props is a configuration class to specify immutable options for the creation of an Actor, which contains information about a creator, routing, deploy etc. Actors are created by passing a Props configuration object into the actorOf factory method which is available on ActorSystem and ActorContext. The call to actorOf returns an instance of ActorRef which is a handle to the actor instance and the only way to interact with it. It is recommended to provide factory methods on the companion object of each Actor which helps keeping the creation of suitable Props close to the actor definition. Actors are automatically started asynchronously when created. It is also recommended to declare one actor within another as it breaks actor encapsulation.

Actors are implemented by extending the Actor base trait and implementing the receive method.

The receive method defines a series of case statements determining which messages the Actor handles using standard Scala pattern matching. The Akka Actor receive message loop is exhaustive, with all the messages which the actor can accept is pattern matched, otherwise an unknown message is sent to akka.actor.UnhandledMessage of the ActorSystem's EventStream. The result of the receive method is a partial function object, which is stored within the actor as its "initial behavior".

import akka.actor.Actor
import akka.event.Logging
 
class StatusActor extends Actor {
  val log = Logging(context.system, this)
  def receive = {
    case "complete" => log.info("Task Completed!")
    case "pending" => log.info("Task Pending!")
    case _ => log.info("Task Status Unknown")
  }
}

Messages

Messages are mostly passed using Scala's Case class being immutable by default and easily able to integrate with pattern matching. Messages are sent to an Actor through one of the following Scala methods.

! means "fire-and-forget", e.g. send a message asynchronously and return immediately. Also known as tell.
? sends a message asynchronously and returns a Future representing a possible reply. Also known as ask.

Since ask has performance implications it is recommended to use tell unless really required. Message ordering is guaranteed on a per-sender basis. Below is example of sending tell message.

import akka.actor.{Props, ActorSystem}
 
object ActorDemo {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("demo-system")
    val props = Props[StatusActor]
    val statusActor = system.actorOf(props, "statusActor-1")
    statusActor ! "pending"
    statusActor ! "blocked"
    statusActor ! "complete"
    system.terminate()
  }
}

A message can be replied using sender() method which provides the ActorRef to the instance of sender Actor. To replying to a message we just need to use sender() ! ReplyMsg. If there is no sender i.e. a message which is sent without an actor or future context, the default sender is a 'dead-letter' actor reference. Messages can also be forwarded from one actor to another. In such case, address/reference of an Actor is maintained even though the message is going through a mediator. It is helpful when writing actors that work as routers, load-balancers, replicators etc.

import akka.actor.{Actor,ActorSystem, Props};  
class ParentActor extends Actor{  
  def receive = {  
    case message:String => println("Message recieved from "+sender.path.name+" massage: "+message);  
    val child = context.actorOf(Props[ChildActor],"ChildActor");  
    child ! message    // Message forwarded to child actor   
  }  
}  

class ChildActor extends Actor{  
  def receive ={  
    case message:String => println("Message recieved from "+sender.path.name+" massage: "+message);  
    println("Replying to "+sender().path.name);  
    sender()! "I got your message";  // Child Actor replying to parent actor
  }  
}  
  
object ActorExample{  
  def main(args:Array[String]){  
    val actorSystem = ActorSystem("ActorSystem");  
    val actor = actorSystem.actorOf(Props[ParentActor], "ParentActor");  
    actor ! "Hello";  
  }  
}

ActorContext

ActorContext has the contextual information for the actor and the current message. It is used for spawning child actors and supervision, watching other actors to receive a Terminated(otherActor) events, logging and request-response interactions using ask with other actors. ActorContext is also used for accessing self ActorRef using context.getSelf().

ActorRef

ActorRef is a reference or immutable handle to an actor within the ActorSystem. It is shared between actors as it allows other actors to send messages to the referenced actor. Each actor has access to its canonical (local) reference through the ActorContext.self field. ActorRef always represents an incarnation (path and UID) not just a given path, once an actor is stopped and new actor uses the same name as ActorRef, the original ActorRef will not point to the new actor. There are special types of actor references which behave like local actor references, e.g. PromiseActorRef, DeadLetterActorRef, EmptyLocalActorRef and DeadLetterActorRef.

ActorPath

Actors are created in a strictly hierarchical fashion. Hence there exists a unique sequence of actor names given by recursively following the supervision links between child and parent down towards the root of the actor system. ActorPath is a unique path to an actor that shows the creation path up through the actor tree to the root actor. An actor path consists of an anchor, which identifies the actor system, followed by the concatenation of the path elements, from root guardian to the designated actor; the path elements are the names of the traversed actors and are separated by slashes. ActorPath can be created without creating the actor itself, unlike ActorRef which requires a corresponding actor. ActorPath from terminated actor can be reused to a new incarnation of the actor. ActorPath are similar to file path, for example, "akka://my-sys/user/service-a/worker1".

ActorSelection

ActorSelection is another way to represent an actors similar to ActorRef. It is a logical view of a section of an ActorSystem's tree of Actors, allowing for broadcasting of messages to that section. ActorSelection points to the path (or multiple paths using wildcards) and is completely oblivious to which actor's incarnation is currently occupying it.

val selection: ActorSelection = context.actorSelection("/user/a/*/c/*")

val actorRef = system.actorSelection("/user/myActorName/").resolveOne()

Stopping Actor

Actors can be stopped by invoking the stop() method of either ActorContext or ActorSystem class. ActorContext is used to stop child actor and ActorSystem is used to stop the top level Actor. The actual termination of the actor is performed asynchronously. Some of other methods to stop the actor are PoisonPill, terminate() and gracefulStop().

In the below example the stop() method of ActorSystem passing the ActorRef is used to stop the top level Actor.

object ActorExample{  
  def main(args:Array[String]){  
    val actorSystem = ActorSystem("ActorSystem");  
    val actor = actorSystem.actorOf(Props[ActorExample], "RootActor");  
    actor ! "Hello"  
    actorSystem.stop(actor);
  }  
}

A child actor can be stopped by the parent actor using stop() method from its ActorContext and passing childActor's ActorRef as shown in the below example.

class ParentActor extends Actor{  
  def receive = {  
    case message:String => println("Message received " + message);  
    val childactor = context.actorOf(Props[ChildActor], "ChildActor");  
    context.stop(childactor);  
      
    case _ => println("Unknown message");  
  }  
}

The terminate method stops the guardian actor, which in turn recursively stops all its child actors.

Exceptions

Exceptions can occur while an actor is processing a message. In case when an exception occurs while the actor is processing a message taken out of mailbox, the corresponding message is lost. The mailbox and the remaining messages in it remain unaffected. In order to retry the processing of the message, the exception must be caught and handled to retry processing the message. If code within an actor throws an exception, that actor is suspended and the supervision process is started. Depending on the supervisor's decision the actor is resumed, restarted (wiping out its internal state and starting from scratch) or terminated.

Scheduler

Scheduler is a trait and extends to AnyRef. It is used to handle scheduled tasks and provides the facility to schedule messages. We can schedule sending of messages and execution of tasks. It creates new instance for each ActorSystem for scheduling tasks to happen at specific time. It returns a cancellable reference to the scheduled operation which can be cancelled by calling cancel method on the reference object. We can implement Scheduler by importing akka.actor.Scheduler package.

Logging

Akka also comes built-in with a logging utility for actors, and it can access it by simply adding the ActorLogging trait. An actor by implementing the trait akka.actor.ActorLogging can become a logging member. The Logging object provides error, warning, info, or debug methods for logging messages.

The Logging constructor's first parameter is any LoggingBus, specifically system.eventStream, while the second parameter is the source of the logging channel. Logging is performed asynchronously to ensure that logging has minimal performance impact. Log events are processed by an event handler actor that receives the log events in the same order they were emitted. By default log messages are printed to STDOUT, but it can also plug-in to a SLF4J logger or any custom logger. By default messages sent to dead letters are logged at INFO level Akka provides a additional configuration options for very low level debugging and customized logging.

When to use Akka

Akka is ideal for the concurrency models with a shared mutable state accessed by multiple threads which are required to be synchronized. It provides an elegant solution were multiple threads needs to communicate with each other asynchronously in order to provide concurrency. Akka actors use java threads internally and assign threads to only on-demand actors which has work to do and don't have any occupied threads. This prevents the application from running thousands of threads simultaneously adding overhead on most machines especially at peak loads. It enforces encapsulation of behavior and state within individual actors without resorting to locks. Akka actor passing messages between each other avoids blocking and the usage of inter-thread communication Java methods like wait(), notify() and notifyAll(). Akka handles errors more gracefully with the supervisor actor when informed of the child actor's failure, it could carry out retries by creating new child actors. Akka encourages push mindset, were entities react to signals, changing state, and send signals to each other to drive the system forward. Akka provides better performance than using native Java threads, even with Executors for standard implementation and fixed amount of threads.

Thursday, December 31, 2015

The Performance Problem

Nobody likes to wait. Nobody is willing to spend time on something if its slow. Performance is the most important aspect to be considered in any software development process after easy of use. Google Search, Youtube, Facebook or WhatsApp won't be popular if they were slow, no matter their content or user interface. Memory management is core in achieving the best performance. Despite the fact that memory and flash storage is getting cheaper by the passing day, badly implemented solutions can still run the system out of memory, thus degrading performance and ultimately crashing the system. The common misconception that performance should be focused in later stage of development or solution hardening process is a recipe for failure. Performance should be accounted for during the initial design and development which includes choosing the architecture, defining database structure, designing data flow and algorithms.

Measure and Identify Problems
Identify potential delays by sketching out the flow of data through the entire system. Chart where it enters, where it goes, and where it ends up. Mark the sections that you can test directly and note the sections which are out of your control. This flowchart will not offer guaranteed answers but it will be a road map for exploring. Creating a mock data that mirrors real world data instead of running successfully in random numbers helps to identify the areas impacted high load.

Relational Database
Every applications today is more complex and performs many more functions than ever before. Database is critical to such functionality and performance of any application.

Excessive Querying: The major problem with database arises when there are excessive database queries executed. When the applications access the database far too often it results in longer response times and unacceptable overhead on the database. Hibernate and other JPA implementations to some extent provide fine-grained tuning of database access by providing eager and lazy fetching options. Eager fetching reduces the number of calls that are made to the database, but those calls are more complex and slower to execute and they load more data into memory. Lazy fetching increases the number of calls that are made to the database, but each individual call is simple and fast and it reduces the memory requirement to only those objects your application actually uses. The expected behavior of the application would help to decide how to configure the persistence engine. The correlation between number of database calls versus number of executed business transactions helps to troubleshoot the excessive database query (N+1) problem.

Caching: Database calls are very expensive from the performance standpoint, hence caching is the preferred approach to optimize the performance of their applications as it’s much faster to read data from an in-memory cache than to make a database call across a network. Caching help to minimize the database load when the application load increases. Caches are stateful and it is important to retrieve a specific object requested. There are various levels such as level 2 cache which sits between the persistence layer and the database, a stand-alone distributed cache that holds arbitrary business objects. Hibernate supports level 2 cache which checks the cache for existence of the object before making a database call and updates the cache. One of the major limitation of caching is that caches are of fixed size. Hence when the cache gets full, the most least recently used objects preferably gets removed, which can result in a "miss" if the removed object is requested. The hit-miss ratio can be optimized by determining the objects to cache and configuring the cache size in order to take advantage of the performance benefits of caching without exhausting all the memory on the machine. Distributed caching provide multiple caches on different servers, with all the changes being propogated to all the members in the cache being updated. Consistency is an overhead for distributed caching and should be balanced with the business requirements. Cache also need to be updated frequently by expiring the objects in order to avoid reading the stale values.

Connection Pool: Database connections are relatively expensive to create, hence rather than creating connections on the fly, they should be created beforehand and used whenever needed to access the database. The database connections pool which contains multiple database connections, enable to execute concurrent queries againist the database and limits the load to the database. When the number of connections in the pool is less then business transactions will be forced to wait for a connection to become available before continuing to process. On the other hand when there are too many connections then they send a high number of requests to the database causing high load and making all the business transactions to suffer from slow database performance. Hence the database pool size should be tuned carefully.

Normalization
Normalization is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and eliminate inconsistent dependency. Normalized databases fair very well under conditions where the applications are write-intensive and the write-load is more than the read-load. Normalized tables have smaller foot-print and are small enough to get fit into the buffer. Updates are faster as there are no duplicates and inserts are faster as data is inserted at a single place. Selects are faster for single tables were the size is smaller to fit in the buffer and heavy duty group by or distinct queries can be avoided as no duplicates. Although fully normalized tables means more joins between tables are required to fetch the data. As a result the read operations suffer because the indexing strategies do not go well with table joins. When the table is denormalized the select queries are faster as all the data is present in the same table thus avoiding any joins. Also indexes can be used efficiently when querying on single table compared to join queries. When the read queries are more common than updates and the table is relatively stable with infrequent data changes then normalization does not help in such case. Normalization is used to save storage space, but as the price of storage hardware is becoming cheaper it does not offer significant savings. The best approach is to mix normalized and denormalized approaches together. Hence normalize the tables were number of update/insert operations are higher than Select queries and store the all columns which are read together very frequently into single table. When mixing normalization and denormalization, focus on denormalizing tables that are read intensive, while tables that are write intensive keep them normalized.

Indexing and SQL Queries
Indexing is an effective way to tune your SQL database. An index is a data structure that improves the speed of data retrieval operations on a database table by providing rapid random lookups and efficient access of ordered records. On the other hand indexes decreases the performance of DML queries (Insert, Delete, Update) as all indexes need to be modified after these operations. When creating indexes, estimate the number of unique values the column(s) will have for a particular field. If a column can potentially return thousands of rows with same value , which are then searched sequentially, it seldom help in speeding up the queries. Composite index contains more than one field and should be created if it is expected to run queries that will have multiple fields in the WHERE clause and all fields combined will give significantly less rows than the first field alone. Clustered index determines the physical order of data in a table and are particularly efficient on columns that are often searched for range of values.

Below are few of the performance guidelines for SQL queries:

A correlated subquery (nested query) is one which uses values from the parent query. Such query tends to run row-by-row, once for each row returned by the outer query, and thus decreases SQL query performance. It should be refactored as a join for better performance.
Select the specific columns which are required and "Select *" should be avoided, as it reduces the load on the resources to fetch the details of the columns.
Avoid using Temporary tables as it increases the query complexity. In case of a stored procedure with some data manipulation which cannot be handled by a single query, then temporary tables can be used as intermediaries in order to generate a final result.
Avoid Foreign keys constraints which ensure data integrity at the cost of performance. If performance is the primary goal then data integrity rules should be pushed to the application layer.
Many databases return the query execution plan for SELECT statements which is created by the optimizer. Such plan is very useful in fine tuning SQL queries. e.g. Query Execution Plan or Explain.
Avoid using functions or methods in the where clause.
Deleting or updating large amounts of data from huge tables when ran as a single transaction might require to kill or roll back the entire transaction. It takes a long time to complete and also block other transactions for their duration, essentially bottle-necking the system. Deleting or updating in smaller batches enhances concurrency as small batches are committed to disk and has a small number of rows to roll back on failure. On other hand single delete/update operation increases the number of transactions and decreases the efficiency.

Design
Performance should be considered at every step in designing the system. Below are some few design considerations to made while developing application services.

Avoid making multiple calls to the database especially inside the loop. Instead try to fetch all the required records beforehand and loop through them to find a match.
Decision to make the application services stateless instead of stateful does come with a performance price. In an effort to make the services stateless, the common data (e.g. user roles) required by multiple services need to be fetched every time adding overhead to the database. Hence such design decision for all the services to be stateless should be carefully considered.
Database calls are expensive compared to Network latency. Hence it is always preferred to reuse the data fetched from the database by either caching it or by sending it to the client in order to be sent back again for future processing. Although it purely depends on the size of the data and the complexity of the queries fetching the data from the database.
When the size of the data is too large to fetch or process in a single call then pagination should be applied. If the services are stateless and fetching the data involves multiple joins or queries, then pagination fails as the service still needs to fetch all the rows and determine the chunk which the client has requested for. In such cases, the database calls should be split into two. First fetch all the rows containing only the meta-data (especially unique ids) by which a chunk can be determined. Then another call is made to use the meta-data (unique ids) in order to fetch the entire data for the requested chunk.
When fetching a large chunk of records from a database/service takes a substantial toll on the performance, multiple threads can be spawned, each calling the database/service in small chunks and then merging the all the results into a final result.
When the number of records and size of data in the database increases exponentially then no matter the amount of indexing and tuning, the performance of the system will deter. In such cases high scalability database architectures such as clustered databases and distributed database processing frameworks such as Hadoop should be considered.

Algorithms
Algorithms are core building blocks inside any application. The complexity of the algorithm affects the performance but not the other way around. The Big O notation is widely used in Computer Science to describe the performance or complexity of an algorithm, by specifically describing the worst-case scenario. It measures the efficiency of an algorithm based on the time it takes for the algorithm to run as a function of the input size. It is an expression of how the execution time of a program scales with the input data. In Big O notation O(N), N could be the actual input, or the size of the input. When calculating the Big O complexity, constants are eliminated as we're looking at what happens as n gets arbitrarily large. Also the less significant terms are dropped such as O(n + n^2) becomes O(n^2), because the less significant terms quickly become less significant as n gets bigger. The statements, If-else cases, loops and nested loops in the code are analyzed to determine the Big O value for the function.
Amortized time is often used when stating algorithm complexity and it looks at an algorithm from the viewpoint of total running time rather than individual operations. When an operation occurs over million times then the total time taken is considered as opposed to worst-case or the best-case of that operation. If the operation is slow on few cases it is ignored while computing Big O, as long as such cases are rare enough for the slowness to be diluted away within a large number of executions. Hence amortised time essentially means the average time taken per operation, when operation occurs many times. It can be a constant, linear or logarithmic.

Below are the few best practices in designing better algorithms.

Avoid execution or computation of an expression whose result is invariant inside the loop, by moving it outside the loop.
Prefer iteration over recursion in order to avoid consuming a lot of stack frames if if it can be implemented using few local variables.
Don’t call expensive methods in an algorithms "leaf nodes", but cache the call instead, or avoid it if the method contract allows it.
Prefer to use Hashmap with O(1) complexity for element retrieval instead of List with O(n) where elements are read continuously from the data storage. Similarly use the appropriate data structure (Collection type) based on the kind of operations used frequently.

Java Coding Practices
Proper coding practices help to avoid the performance problems faced when the system is put through high load either during performance testing or in production.

Avoid using finalizers when possible in order to avoid delay in garbage collection.
Explicitly close resources (streams) without relying on the object finalizers.
Use StringBuffer /StringBuilder to form string rather than creating multiple (immutable) strings.
Use primitive types which are faster than their boxed primitive couterparts (wrapper classes), and avoid unintentional autoboxing.
When data is constant declare it as static and final, since final thus signaling the compiler that the value of this variable or the object referred to by this variable will never change could potentially allow for performance optimizations.
Create an object once during initialization and reuse this throughout the run.
Prefer local variables instead of instance variables which have faster read access for better performance.
The common misconception is that object creation is expensive and should be avoided. On the contrary, the creation and reclamation of small objects whose constructors do little explicit work is cheap, especially on modern JVM implementations, even though its non-trivial and has a measurable cost. Although creation of extremely heavyweight objects such as database connection is expensive and such objects should be reused by maintaining an object pool. Further creating lightweight short lived objects in Java especially in a loop is cheap (apart from the hidden cost that the GC will run more often).
Cache the values in a variable instead of reading repetitively in the loop. For example caching the array length in a local variable is faster than reading the value for every single loop iteration.
Pre-sizing collections such as ArrayLists when the estimated collection size is known during creation, improves performance by avoiding frequent size reallocation.
Prefer to make the variable fields in the class as public for direct access by outside classes rather than implementing getter and setter accessor methods unless the class fields needed to be encapsulated.
Declare the methods as static if it doesn’t need to access other instance methods or fields.
Prefer manual iteration instead of builtin for each loops for ArrayList which uses Iterator object for iteration. Although for arrays for each loops performance better than manual iterations.
Regular expressions should be avoided in the nested loops. When regular expressions is used in computation-intensive code sections, then the Pattern reference should be cached instead of compiling it everytime.
Prefer bitwise operations as compared to the arithmetic operations such as multiplication, division and modulus when there is extensive computations (e.g. cryptography). For example i * 8 can be replace by i << 3, i / 16 replaced by i >> 4 and i % 4 replaced by i & 3.

Memory
Garbage collection occurs as either a minor or major mark-sweep collection. When eden section is full a minor mark-sweep garbage collection is performed and all the surviving objects are tenured or copied to the Tenured Space. During the major garbage collection, the garbage collector performs a mark-sweep collection across the entire heap, and then performs a compaction. It freezes all the running threads in the JVM, and results in the entire young generation to be free with all the live objects being compacted into the old generation space, shrinking its size. The longer it takes to complete or the more often it executes, the application performance is impacted. The amount of time taken by major garbage collection depends on the size of the heap with 2-3 GB of heap takes 3-5 seconds while 30 GB of heap takes 30 seconds. The java command option –verbosegc logs the full garbage collection entries for monitoring. The Concurrent Mark Sweep (CMS) garbage collection strategy allows an additional thread which is constantly marking and sweeping objects, which can reduce the pause times for major garbage collections. In order to mitigate the major garbage collections, the heap should be sized in such a way that short-lived objects are given enough time to die. The size of young generation space should be a little less than half the size of the heap and the survivor ratios should be anywhere between 1/6th and 1/8th the size of the young generation space.

Memory Leaks
A memory leak occurs when memory acquired by a application for execution is never freed-up and the application inadvertently maintains a object references which it never intended to use again. The garbage collector finds the unused objects by traversing from root node through all nodes that are no longer being accessed or referenced and removes them freeing memory resources for the JVM. If the object is unintentionally referenced by other objects then it is excluded from garbage collection, as the garbage collector assumes that someone intended to use it at some point in the future. When this tends to occur in code that is frequently executed it causes the JVM to deplete its memory impacting performance and eventually exhaust its memory by throwing dreaded OutOfMemory error. Below are some of the best practices and common cases in order to avoid memory leaks.

Each variable should be declared with the narrowest possible scope, thus eliminating the variable when it falls out of scope.
When maintaining own memory using set of references, then the object references should be nulled out explicitly when they fell out of scope.
The key used for HashSet/HashMap should have proper equals() and hashCode() method implementations or else adding multiple elements in an infinite loop would cause the elements to expand causing leaks.
Non-static inner/anonymous classes which has an implicit reference to its surrounding class should be used carefully. When such inner class object is passed to a method which stores the references in cache/external object, the local object (having references to enclosing class) is not garbage collected even when its out of scope. Static inner classes should be used instead.
Check if the unused entries in the collections (especially static collections) are removed to avoid ever increasing objects. Prefer to use WeakHashMap when entries are added to the collection without any clean up. Entries in the WeakHashMap will be removed automatically when the map key object is no longer referenced elsewhere. Avoid using primitives (wrappers) and Strings as WeakHashMap keys as those objects are usually short lived and do not share the same lifespan as the actual target tracking objects.
Check when the event listener (callbacks) is registered but not unregistered after the class is not being used any longer.
When using connection pools, and when calling close() on the connection object, the connection returns back to the connection pool for reuse. It doesn't actually close the connection. Thus, the associated Statement and ResultSet objects remain in the memory. Hence, JDBC Statement and ResultSet objects must be explicitly closed in a finally block.
Usage of static classes should be minimized as they stay in memory for the lifetime of the application.
Avoid referencing objects from long-lasting (singleton) objects. If such usage cannot be avoided, use a weak reference, a type of object reference that does not prevent the object from being garbage collected.
Use of HttpSessions should be minimized and used only for state that cannot realistically be kept on the request object. Remove objects from HttpSession if they are no longer used.
ThreadLocal, a member field in the Thread class, and is useful to bind a state to a thread. Thread-local variables will not be removed by the garbage collector as long as the thread itself is alive. As threads are often pooled and thus kept alive virtually forever, the object might never be removed by the garbage collector.
A DOM Node object always belongs to a DOM Document. Even when removed from the document the node object retains a reference to its owning document. As long as the child object reference exists, neither the document nor any of the nodes it refers to will be removed.

Concurrency
Concurrency (aka multithreading) refers to executing several computations simultaneously and allows the application to accomplish more work in less time. In java concurrency is achieved by synchronization. Synchronization requires the thread, which is ready to execute a block of code, to acquire the object's lock leading to many performance issues.

Mutable shared objects are shared or accessible by multiple threads, but can also be changed by multiple threads. Ideally any objects that are shared between threads will be immutable, as immutable shared objects don't pose challenges to multithreaded code.

Deadlocks occur when two or more threads need multiple shared resources to complete their task and they access those resources in a different order or a different manner. When two or more threads each possess the lock for a resource, the other thread needs to complete its task and neither thread is willing to give up the lock that it has already obtained. Also in synchronized block, a thread must first obtain the lock for the code block before executing that code and, no other thread will be permitted to enter the code block while it has the lock. In such case the JVM will eventually exhaust all or most of its threads and the application will become slow although the CPU utilization appears underutilized. Thread dump helps to determine the root cause of deadlocks. Deadlocks can be avoided by making the application resources as immutable.

Gridlock: Thread synchronization is a powerful tool for protecting shared resources, but if major portions of the code are synchronized then it might be inadvertently single-threading the application. If the application has excessive synchronization or synchronization through a core functionality required by large number of business transactions, then the response times becomes slow, with very low CPU utilization as each of these threads reaches the synchronized code to go in waiting state. The impact of synchronized blocks in the code should be analyzed and redesigned to avoid synchronization.

Thread pool contains ready for execution threads which process the requests in the execution queue of the server. Creating and disposing of multiple threads is a common performance issue. If the application uses many threads for quicker response, it can be faster to create a pool of threads so they can be reused without being destroyed. This practice is best when the amount of computation done by each thread is small and the amount of time creating and destroying it can be larger than the amount of computation done inside of it. The size of thread pool directly impacts the performance of the application. If the size of thread pool is too small then the requests wait, while when the thread pool size is too large then many concurrently executing threads consumes the server's resources. Also too many threads causes more time to be spend on time context switching between threads causing threads to be starved. Hence the thread pool should be carefully tuned based on metrics of Thread pool utilization and CPU utilization.