Thursday, December 31, 2015

The Performance Problem

Nobody likes to wait. Nobody is willing to spend time on something if it's slow. After ease of use, performance is the most important aspect to consider in any software development process. Google Search, YouTube, Facebook or WhatsApp would not be popular if they were slow, no matter their content or user interface. Memory management is core to achieving the best performance. Even though memory and flash storage get cheaper by the day, a badly implemented solution can still run the system out of memory, degrading performance and ultimately crashing the system. The common misconception that performance should be addressed in a later stage of development, or during solution hardening, is a recipe for failure. Performance should be accounted for during the initial design and development, which includes choosing the architecture, defining the database structure, and designing the data flows and algorithms.

Measure and Identify Problems
Identify potential delays by sketching out the flow of data through the entire system. Chart where it enters, where it goes, and where it ends up. Mark the sections that you can test directly and note the sections that are out of your control. This flowchart will not offer guaranteed answers, but it serves as a road map for exploration. Creating mock data that mirrors real-world data, instead of merely running successfully on random values, helps to identify the areas most affected under high load.

Relational Database
Applications today are more complex and perform many more functions than ever before. The database is critical to the functionality and performance of any application.

Excessive Querying: The major database problem arises when excessive database queries are executed. When the application accesses the database far too often, it results in longer response times and unacceptable overhead on the database. Hibernate and other JPA implementations provide fine-grained tuning of database access through eager and lazy fetching options. Eager fetching reduces the number of calls made to the database, but those calls are more complex and slower to execute, and they load more data into memory. Lazy fetching increases the number of calls made to the database, but each individual call is simple and fast, and it reduces the memory requirement to only those objects the application actually uses. The expected behavior of the application helps decide how to configure the persistence engine. The correlation between the number of database calls and the number of executed business transactions helps to troubleshoot the excessive-query (N+1) problem.
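For example, with standard JPA annotations the fetching strategy is declared per relationship. A minimal sketch, using hypothetical Order/OrderLine/Customer entities:

    import java.util.List;
    import javax.persistence.*;

    @Entity
    public class Order {
        @Id
        private Long id;

        // Eager fetch: the customer is loaded together with the order in one
        // (heavier) query, trading memory and query cost for fewer round trips.
        @ManyToOne(fetch = FetchType.EAGER)
        private Customer customer;

        // Lazy fetch (the JPA default for collections): order lines are loaded
        // only when first accessed, keeping the initial query small.
        @OneToMany(mappedBy = "order", fetch = FetchType.LAZY)
        private List<OrderLine> lines;
    }

    @Entity
    class Customer { @Id private Long id; }

    @Entity
    class OrderLine { @Id private Long id; @ManyToOne private Order order; }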

Caching: Database calls are very expensive from a performance standpoint, so caching is a preferred approach to optimizing application performance: it is much faster to read data from an in-memory cache than to make a database call across the network. Caching helps to minimize the database load as the application load increases. Caches are stateful, and it is important that the specific object requested can be retrieved. Caches exist at various levels: a level 2 cache sits between the persistence layer and the database, while a stand-alone distributed cache holds arbitrary business objects. Hibernate supports a level 2 cache that checks for the existence of an object before making a database call and updates the cache afterwards. One major limitation of caching is that caches are of fixed size. When the cache fills up, the least recently used objects are typically evicted, which can result in a "miss" if a removed object is requested again. The hit-miss ratio can be optimized by choosing which objects to cache and configuring the cache size, so as to gain the performance benefits of caching without exhausting the memory on the machine. Distributed caching provides multiple caches on different servers, with all changes propagated to every member of the cache. Consistency is an overhead for distributed caching and should be balanced against the business requirements. A cache also needs to expire objects regularly in order to avoid reading stale values.
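As an illustration of fixed-size caching with least-recently-used eviction, below is a minimal in-process sketch built on java.util.LinkedHashMap; a real application would more likely rely on Hibernate's level 2 cache or a distributed cache product.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A fixed-capacity LRU cache: once full, the least recently used entry is
    // evicted, which turns a later request for that object into a "miss".
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public LruCache(int capacity) {
            super(16, 0.75f, true); // accessOrder = true orders entries by recency of use
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // evict the eldest entry once over capacity
        }
    }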

Connection Pool: Database connections are relatively expensive to create, so rather than creating them on the fly, they should be created beforehand and reused whenever the database is accessed. A database connection pool holds multiple connections, enabling concurrent queries against the database while limiting the load on it. When the pool has too few connections, business transactions are forced to wait for a connection to become available before they can proceed. On the other hand, too many connections send a high number of concurrent requests to the database, causing high load and making all business transactions suffer from slow database performance. Hence the connection pool size should be tuned carefully.
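A minimal pooling sketch using the open-source HikariCP library; the JDBC URL, credentials and pool size below are illustrative and must be tuned against observed load:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import java.sql.Connection;
    import java.sql.SQLException;

    public class PoolExample {
        public static void main(String[] args) throws SQLException {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:mysql://localhost:3306/appdb"); // hypothetical database
            config.setUsername("app");
            config.setPassword("secret");
            config.setMaximumPoolSize(10); // too small starves transactions, too large overloads the database

            HikariDataSource pool = new HikariDataSource(config);
            try (Connection con = pool.getConnection()) {
                // borrows a pre-created connection; close() returns it to the pool
            }
        }
    }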

Normalization
Normalization is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and eliminate inconsistent dependencies. Normalized databases fare very well when the application is write-intensive and the write load exceeds the read load. Normalized tables have a smaller footprint and are small enough to fit into the buffer. Updates are faster because there are no duplicates, and inserts are faster because data is inserted in a single place. Selects on a single table are faster since the table is small enough to fit in the buffer, and heavy GROUP BY or DISTINCT queries can be avoided since there are no duplicates. However, fully normalized tables mean more joins are required to fetch the data, so read operations suffer, because indexing strategies do not work well across table joins. When a table is denormalized, select queries are faster, as all the data is present in the same table and joins are avoided; indexes are also used more efficiently when querying a single table rather than joining. When reads are more common than updates and the table is relatively stable with infrequent data changes, normalization does not help. Normalization also saves storage space, but as storage hardware becomes cheaper, this no longer offers significant savings. The best approach is to mix normalized and denormalized schemas: normalize the tables where update/insert operations outnumber selects, and store the columns that are read together very frequently in a single table. When mixing the two, denormalize the tables that are read-intensive and keep the write-intensive tables normalized.
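To make the trade-off concrete, here is a sketch of the two read paths against a hypothetical order schema; the table and column names are illustrative only.

    public class SchemaReads {
        // Normalized schema: the read requires a join to assemble the data.
        static final String NORMALIZED =
            "SELECT o.id, o.total, c.name " +
            "FROM orders o JOIN customers c ON c.id = o.customer_id";

        // Denormalized read table: the frequently read columns are stored
        // together, so a single-table select (and its indexes) suffices.
        static final String DENORMALIZED =
            "SELECT id, total, customer_name FROM order_summary";
    }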

Indexing and SQL Queries
Indexing is an effective way to tune your SQL database. An index is a data structure that improves the speed of data retrieval operations on a database table by providing rapid random lookups and efficient access to ordered records. On the other hand, indexes decrease the performance of DML statements (INSERT, DELETE, UPDATE), as all indexes must be modified after these operations. When creating an index, estimate the number of unique values the column(s) will have. If a column can potentially return thousands of rows with the same value, which are then searched sequentially, an index on it seldom helps to speed up queries. A composite index contains more than one field and should be created when queries are expected to have multiple fields in the WHERE clause and all the fields combined yield significantly fewer rows than the first field alone. A clustered index determines the physical order of data in a table and is particularly efficient on columns that are often searched for ranges of values.
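For instance, a composite index for queries that filter on both last name and first name might be created as below; a sketch with a hypothetical customers table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class IndexSetup {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/appdb", "app", "secret");
                 Statement stmt = con.createStatement()) {
                // Worthwhile only if (last_name, first_name) together is far
                // more selective than last_name alone.
                stmt.execute("CREATE INDEX idx_customer_name ON customers (last_name, first_name)");
            }
        }
    }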

Below are a few performance guidelines for SQL queries:
  • A correlated subquery (nested query) is one that uses values from the parent query. Such a query tends to run row by row, once for each row returned by the outer query, and thus decreases SQL query performance. It should be refactored as a join for better performance.
  • Select only the specific columns required and avoid "SELECT *"; fetching fewer columns reduces the load on resources.
  • Avoid temporary tables, as they increase query complexity. When a stored procedure performs data manipulation that cannot be handled by a single query, temporary tables can be used as intermediaries to generate the final result.
  • Avoid foreign key constraints when performance is the primary goal; they ensure data integrity at the cost of performance, and the integrity rules can instead be pushed to the application layer.
  • Many databases return the query execution plan created by the optimizer for SELECT statements (e.g. via EXPLAIN). Such a plan is very useful in fine-tuning SQL queries.
  • Avoid using functions or methods on columns in the WHERE clause, as they can prevent the database from using indexes.
  • Deleting or updating large amounts of data in a huge table as a single transaction may force the entire transaction to be killed or rolled back; it takes a long time to complete and blocks other transactions for its duration, essentially bottlenecking the system. Deleting or updating in smaller batches enhances concurrency, as small batches are committed to disk quickly and leave only a small number of rows to roll back on failure (see the sketch after this list). On the other hand, issuing one transaction per row increases the number of transactions and decreases efficiency.
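A sketch of batched deletion over JDBC; the audit_log table is hypothetical, and the DELETE ... LIMIT syntax shown is MySQL-specific (other databases use TOP or ROWNUM):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class BatchedPurge {
        // Deletes old rows in fixed-size batches so that each transaction stays
        // small, commits quickly, and is cheap to roll back on failure.
        static void purge(Connection con) throws SQLException {
            con.setAutoCommit(false);
            String sql = "DELETE FROM audit_log WHERE created < NOW() - INTERVAL 90 DAY LIMIT 10000";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                int deleted;
                do {
                    deleted = ps.executeUpdate(); // removes at most one batch of rows
                    con.commit();                 // releases locks between batches
                } while (deleted > 0);
            }
        }
    }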

Design
Performance should be considered at every step in designing the system. Below are a few design considerations to make while developing application services.
  • Avoid making multiple calls to the database, especially inside a loop. Instead, fetch all the required records beforehand and loop through them to find a match.
  • The decision to make application services stateless instead of stateful comes with a performance price. To keep services stateless, common data (e.g. user roles) required by multiple services has to be fetched every time, adding overhead on the database. Hence the decision to make all services stateless should be weighed carefully.
  • Database calls are expensive compared to network latency. Hence it is often preferable to reuse the data fetched from the database, either by caching it or by sending it to the client so that it can be sent back for future processing. This depends, of course, on the size of the data and the complexity of the queries that fetch it.
  • When the data is too large to fetch or process in a single call, pagination should be applied. If the services are stateless and fetching the data involves multiple joins or queries, naive pagination fails, as the service still needs to fetch all the rows to determine the chunk the client requested. In such cases the database access should be split in two: first fetch only the metadata (especially the unique ids) for all rows, from which a chunk can be determined, then make a second call that uses those ids to fetch the full data for the requested chunk.
  • When fetching a large chunk of records from a database/service takes a substantial toll on performance, multiple threads can be spawned, each calling the database/service for a small chunk, and the results merged into a final result, as sketched after this list.
  • When the number of records and the size of data in the database grow exponentially, then no matter the amount of indexing and tuning, the performance of the system will deteriorate. In such cases, highly scalable database architectures such as clustered databases, and distributed processing frameworks such as Hadoop, should be considered.
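A sketch of the chunked parallel fetch mentioned above; fetchChunk() stands in for a hypothetical database/service call, and the pool size is illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelFetch {
        // Fetches a large result set in fixed-size chunks on a small thread pool
        // and merges the pieces in their original order.
        static List<Record> fetchAll(int totalRows, int chunkSize) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            try {
                List<Future<List<Record>>> futures = new ArrayList<>();
                for (int offset = 0; offset < totalRows; offset += chunkSize) {
                    final int start = offset;
                    futures.add(pool.submit(() -> fetchChunk(start, chunkSize)));
                }
                List<Record> merged = new ArrayList<>();
                for (Future<List<Record>> f : futures) {
                    merged.addAll(f.get()); // blocks until that chunk is ready
                }
                return merged;
            } finally {
                pool.shutdown();
            }
        }

        static List<Record> fetchChunk(int offset, int limit) { return new ArrayList<>(); } // stand-in
        static class Record { }
    }
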
Algorithms
Algorithms are core building blocks of any application, and their complexity drives performance. Big O notation is widely used in computer science to describe the performance or complexity of an algorithm, specifically its worst-case behavior. It measures the efficiency of an algorithm by how its running time grows as a function of the input size; it is an expression of how the execution time of a program scales with the input data. In the notation O(N), N can be the actual input or the size of the input. When calculating Big O complexity, constants are eliminated, since we look at what happens as n grows arbitrarily large. Less significant terms are also dropped, so O(n + n^2) becomes O(n^2), because they quickly become insignificant as n gets bigger. The statements, if-else branches, loops and nested loops in the code are analyzed to determine the Big O value of a function.
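For example, the two methods below illustrate how loop structure determines the Big O value, and why O(n + n^2) collapses to O(n^2):

    public class Complexity {
        // O(n): a single pass over the input.
        static int sum(int[] a) {
            int total = 0;
            for (int x : a) total += x;
            return total;
        }

        // O(n^2): nested loops over the same input. Running sum() first as well
        // would still be O(n + n^2) = O(n^2); the linear term is dropped.
        static boolean hasDuplicate(int[] a) {
            for (int i = 0; i < a.length; i++)
                for (int j = i + 1; j < a.length; j++)
                    if (a[i] == a[j]) return true;
            return false;
        }
    }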
Amortized time is often used when stating algorithm complexity; it looks at an algorithm from the viewpoint of total running time rather than individual operations. When an operation occurs millions of times, the total time taken is considered, as opposed to the worst case or best case of a single operation. If the operation is slow in a few cases, that slowness is ignored when computing Big O, as long as those cases are rare enough to be diluted across the large number of executions. Amortized time is thus essentially the average time per operation when the operation occurs many times; it can be constant, linear or logarithmic.
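The classic example is ArrayList.add(): most calls are constant time, but the occasional call that triggers a resize is O(n), giving amortized O(1) per insertion.

    import java.util.ArrayList;
    import java.util.List;

    public class AmortizedAdd {
        public static void main(String[] args) {
            List<Integer> list = new ArrayList<>(); // small backing array initially
            for (int i = 0; i < 1_000_000; i++) {
                // Most add() calls simply store the element in O(1). When the
                // backing array is full, it is copied into a larger one, an O(n)
                // operation. Spread across all insertions, the average cost per
                // add() remains constant: amortized O(1).
                list.add(i);
            }
        }
    }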



Below are a few best practices for designing better algorithms.
  • Avoid computing an expression inside a loop when its result is invariant; move it outside the loop.
  • Prefer iteration over recursion, in order to avoid consuming a lot of stack frames, if the logic can be implemented with a few local variables.
  • Don’t call expensive methods in an algorithm's "leaf nodes"; cache the result instead, or avoid the call if the method contract allows it.
  • Prefer a HashMap, with O(1) element retrieval, over a List, with O(n), when elements are read continuously from the data store (see the sketch after this list). Similarly, use the appropriate data structure (collection type) based on the operations performed most frequently.
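The sketch below contrasts the two lookups; the sample data is illustrative:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LookupChoice {
        public static void main(String[] args) {
            List<String> list = Arrays.asList("alice", "bob", "carol");
            Map<String, String> map = new HashMap<>();
            for (String name : list) map.put(name, name.toUpperCase());

            boolean inList = list.contains("carol");  // O(n): scans element by element
            boolean inMap = map.containsKey("carol"); // O(1) expected: one hash lookup

            System.out.println(inList + " " + inMap);
        }
    }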

Java Coding Practices
Proper coding practices help to avoid the performance problems that surface when the system is put under high load, either during performance testing or in production.
  • Avoid using finalizers when possible, as they delay garbage collection.
  • Explicitly close resources (streams) without relying on object finalizers.
  • Use StringBuilder (or StringBuffer where thread safety is required) to build strings, rather than creating multiple (immutable) String objects.
  • Use primitive types, which are faster than their boxed counterparts (wrapper classes), and avoid unintentional autoboxing.
  • When data is constant, declare it static and final; marking it final signals the compiler that the value of the variable, or the object it refers to, will never change, which can potentially allow performance optimizations.
  • Create an object once during initialization and reuse it throughout the run.
  • Prefer local variables over instance variables; their faster read access gives better performance.
  • A common misconception is that object creation is expensive and should be avoided. On the contrary, the creation and reclamation of small objects whose constructors do little explicit work is cheap, especially on modern JVM implementations, even though it is non-trivial and has a measurable cost. Creation of extremely heavyweight objects such as database connections, however, is expensive, and such objects should be reused by maintaining an object pool. Creating lightweight, short-lived objects in Java, especially in a loop, is cheap (apart from the hidden cost of the GC running more often).
  • Cache values in a variable instead of reading them repeatedly inside a loop. For example, caching the array length in a local variable is faster than reading it on every single loop iteration.
  • Pre-sizing collections such as ArrayList, when the estimated collection size is known at creation, improves performance by avoiding repeated resizing and reallocation.
  • Prefer making class fields public for direct access by outside classes, rather than implementing getter and setter accessor methods, unless the fields need to be encapsulated.
  • Declare a method static if it does not need to access instance methods or fields.
  • Prefer manual (indexed) iteration over the built-in for-each loop for an ArrayList, since for-each allocates an Iterator object. For arrays, however, the for-each loop performs as well as manual iteration.
  • Avoid regular expressions in nested loops. When a regular expression is used in computation-intensive code sections, cache the compiled Pattern reference instead of compiling it every time (see the sketch after this list).
  • Prefer bitwise operations over arithmetic operations such as multiplication, division and modulus in computation-heavy code (e.g. cryptography). For example, i * 8 can be replaced by i << 3, i / 16 by i >> 4, and i % 4 by i & 3 (for non-negative values of i).
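A sketch of the Pattern-caching advice above; the regular expression is a deliberately simplified, hypothetical e-mail check:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CachedPattern {
        // Compiled once and reused: Pattern.compile() is expensive, while
        // creating a Matcher from a precompiled Pattern is cheap.
        private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w.-]+");

        static boolean looksLikeEmail(String s) {
            Matcher m = EMAIL.matcher(s);
            return m.matches();
        }
    }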

Memory
Garbage collection occurs as either a minor or a major mark-sweep collection. When the eden space fills up, a minor mark-sweep collection is performed and the surviving objects are tenured, i.e. copied to the tenured space. During a major garbage collection, the collector performs a mark-sweep across the entire heap and then compacts it. It freezes all running threads in the JVM and leaves the entire young generation free, with all live objects compacted into the old generation space. The longer it takes to complete, or the more often it executes, the more the application's performance is impacted. The time taken by a major garbage collection depends on the size of the heap: roughly, a 2-3 GB heap takes 3-5 seconds, while a 30 GB heap can take 30 seconds. The java command option -verbose:gc logs the full garbage collection entries for monitoring. The Concurrent Mark Sweep (CMS) garbage collection strategy adds a thread that constantly marks and sweeps objects, which can reduce the pause times for major garbage collections. In order to mitigate major garbage collections, the heap should be sized so that short-lived objects are given enough time to die: the young generation should be a little less than half the size of the heap, and each survivor space anywhere between 1/6th and 1/8th the size of the young generation.
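For example, GC activity can be logged and the generation sizes set with standard HotSpot options; the values below are illustrative for a 4 GB heap (with -XX:SurvivorRatio=6 each survivor space is 1/8th of the young generation), and MyApp is a placeholder main class:

    java -verbose:gc -XX:+PrintGCDetails \
         -Xms4g -Xmx4g -Xmn1536m -XX:SurvivorRatio=6 \
         -XX:+UseConcMarkSweepGC MyApp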

Memory Leaks
A memory leak occurs when memory acquired by an application is never freed because the application inadvertently maintains references to objects it never intends to use again. The garbage collector traverses the object graph from the root nodes and removes every object that can no longer be reached, freeing memory resources for the JVM. If an object is unintentionally referenced by other objects, it is excluded from garbage collection, as the garbage collector assumes that someone intends to use it at some point in the future. When this occurs in code that is frequently executed, the JVM steadily depletes its memory, impacting performance, and eventually exhausts it, throwing the dreaded OutOfMemoryError. Below are some best practices and common cases for avoiding memory leaks.
  • Declare each variable with the narrowest possible scope, so that the reference is eliminated as soon as the variable falls out of scope.
  • When maintaining your own memory using a set of references, null out the object references explicitly once they fall out of scope.
  • The key used for a HashSet/HashMap should have proper equals() and hashCode() implementations; otherwise logically equal keys are treated as distinct, the collection keeps growing as elements are added, and a leak results.
  • Non-static inner/anonymous classes, which hold an implicit reference to their enclosing instance, should be used carefully. When such an inner-class object is passed to a method that stores the reference in a cache or external object, the local object (and, through it, the enclosing instance) is not garbage collected even when it is out of scope. Use static inner classes instead.
  • Check that unused entries are removed from collections (especially static collections) to avoid an ever-growing set of objects. Prefer a WeakHashMap when entries are added to a collection without any clean-up; its entries are removed automatically once the key object is no longer referenced elsewhere. Avoid using primitive wrappers and Strings as WeakHashMap keys, as those objects are usually short-lived and do not share the lifespan of the objects actually being tracked.
  • Check for event listeners (callbacks) that are registered but never unregistered after the class is no longer in use.
  • When using a connection pool, calling close() on a connection returns it to the pool for reuse; it does not actually close the connection, so the associated Statement and ResultSet objects remain in memory. Hence JDBC Statement and ResultSet objects must be explicitly closed in a finally block, or via try-with-resources (see the sketch after this list).
  • Minimize the use of static references, as they stay in memory for the lifetime of the application.
  • Avoid referencing objects from long-lived (singleton) objects. If such usage cannot be avoided, use a weak reference, a type of object reference that does not prevent its referent from being garbage collected.
  • Minimize the use of HttpSessions and reserve them for state that cannot realistically be kept on the request object. Remove objects from the HttpSession once they are no longer used.
  • ThreadLocal binds state to a thread (the values live in a member field of the Thread class). Thread-local variables are not removed by the garbage collector as long as the owning thread is alive, and since threads are often pooled and thus kept alive virtually forever, the value may never be collected; call remove() once the value is no longer needed.
  • A DOM Node object always belongs to a DOM Document. Even when removed from the document, the node retains a reference to its owning document. As long as that child reference exists, neither the document nor any of the nodes it refers to will be removed.
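A sketch of the JDBC clean-up advice above, using try-with-resources (Java 7+), which closes the ResultSet, Statement and pooled Connection even when an exception is thrown; the users table is hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class SafeQuery {
        static int countUsers(DataSource ds) throws SQLException {
            try (Connection con = ds.getConnection();
                 PreparedStatement ps = con.prepareStatement("SELECT COUNT(*) FROM users");
                 ResultSet rs = ps.executeQuery()) {
                // All three resources are closed in reverse order on exit,
                // so no Statement or ResultSet objects are left behind.
                return rs.next() ? rs.getInt(1) : 0;
            }
        }
    }
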
Concurrency
Concurrency (aka multithreading) refers to executing several computations simultaneously, allowing the application to accomplish more work in less time. In Java, safe access to shared state across threads is achieved through synchronization. Synchronization requires a thread that is ready to execute a block of code to first acquire the object's lock, which can lead to many performance issues.

Mutable shared objects are not only shared and accessible by multiple threads but can also be changed by them. Ideally, any object shared between threads should be immutable, as immutable shared objects pose no challenges to multithreaded code.
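A minimal sketch of such an immutable value (a hypothetical Price class): all fields are private and final, there are no setters, and the class itself is final, so instances can be shared freely between threads without synchronization.

    public final class Price {
        private final String currency;
        private final long cents;

        public Price(String currency, long cents) {
            this.currency = currency;
            this.cents = cents;
        }

        public String getCurrency() { return currency; }
        public long getCents() { return cents; }
    }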

Deadlocks occur when two or more threads need multiple shared resources to complete their task and acquire those resources in a different order or a different manner. Each thread then holds a lock that the other thread needs to complete its task, and neither is willing to give up the lock it has already obtained. Since a thread must obtain the lock of a synchronized block before executing it, and no other thread is permitted to enter the block in the meantime, the JVM can eventually exhaust all or most of its threads, and the application becomes slow even though the CPU appears underutilized. A thread dump helps to determine the root cause of deadlocks. Deadlocks can be avoided by making the shared resources immutable.
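Besides immutability, another common remedy is to always acquire locks in a single global order, so that two threads can never hold each other's lock. A sketch with hypothetical Account objects, ordered by id:

    public class TransferService {
        static class Account { long id; long balance; }

        // Locks are always taken in ascending id order, so two concurrent
        // transfers between the same accounts cannot deadlock.
        static void transfer(Account from, Account to, long amount) {
            Account first = from.id < to.id ? from : to;
            Account second = (first == from) ? to : from;
            synchronized (first) {
                synchronized (second) {
                    from.balance -= amount;
                    to.balance += amount;
                }
            }
        }
    }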

Gridlock: Thread synchronization is a powerful tool for protecting shared resources, but if major portions of the code are synchronized, the application may inadvertently be single-threaded. If the application synchronizes excessively, or synchronizes a core piece of functionality required by a large number of business transactions, response times become slow with very low CPU utilization, as each thread reaching the synchronized code goes into a waiting state. The impact of synchronized blocks in the code should be analyzed and the code redesigned to minimize synchronization.

A thread pool contains ready-to-execute threads that process the requests in the server's execution queue. Creating and disposing of many threads is a common performance problem: if the application uses many threads for quicker responses, it is faster to maintain a pool of threads so they can be reused rather than destroyed. This practice pays off when the amount of computation done by each thread is small, since the cost of creating and destroying a thread can exceed the work done inside it. The size of the thread pool directly impacts application performance: if it is too small, requests wait; if it is too large, the many concurrently executing threads consume the server's resources, and the extra time spent context switching between threads leaves them starved of CPU. Hence the thread pool should be tuned carefully, based on thread-pool utilization and CPU utilization metrics.
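A minimal sketch of a fixed-size pool using the standard ExecutorService; the pool size of 8 is illustrative and should be tuned against the metrics above:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class RequestPool {
        public static void main(String[] args) throws InterruptedException {
            // A bounded number of threads is created once and reused for every
            // request, instead of creating and destroying a thread per request.
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 100; i++) {
                final int requestId = i;
                pool.submit(() -> System.out.println("handled request " + requestId));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }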