Introduction
This page lists the various terms related to testing a software application. This isn't an exhaustive list but only contains terms that I consider important. For a more detailed list of different tests, see this article at Atlassian. Tests are an integral part of software development because they ensure that the developed application meets the business requirements, behaves in an expected manner, and does not have any unintended side effects. New code should be made available for public use only after it has been thoroughly tested to behave in the manner expected by the business. Cross-referencing the tests is extremely useful for identifying whether newly added code implements all desired business requirements and does not break an existing business feature. Tests also help with code review; a piece of code that is not tested is likely indicative of adding business features that are not required, or of coding in a way that does not meet accepted best practices. For new developers who are reading the code for the first time and trying to understand its intent, it is very helpful to cross-reference the code with the corresponding tests.
A test is code that executes "test cases" against the code of the web application to verify that it works as expected, both in positive cases (i.e., when things should work correctly) and in negative cases (i.e., when things should fail). Tests can be broadly categorized into unit tests, functional tests, integration tests, end-to-end tests and load tests. Note that except for unit tests, there may not be a whole lot of consensus on the definition of the other tests. My suggestion would be to focus on what a test does and why it is important, rather than on what it is named! The different test categories are also discussed later. Traditionally, tests were written after a developer had already added the software code. The philosophy of test-driven development, or TDD, reverses this idea and suggests adding tests first; it is discussed more later. Thankfully, there isn't much that can go wrong in writing tests as long as some of the basics of testing are covered. This book does not cover all of the terms because testing-related details can easily be obtained from the internet. Nonetheless, it is important to be aware of these terms and what they mean.
Confusion in test category names
Before even starting to compare different tests, readers should note that, as of now and to the best of my knowledge, the definitions given below for unit test, functional test, etc., differ from one text to another, and from one team to another! There is no naming standard for what constitutes the scope of a unit test, a functional test, etc. So, when reading about the various tests below, I request the readers to focus on the functionality provided by a certain group of tests and understand why it is important, and not to get hung up on whether it is given a proper name! I have tried my best to categorize the various types of tests based on their purpose, and so all of the tests are important in one way or another.
Unit test
See the article about unit testing on Wikipedia. When writing code for a software application, the code that achieves a certain business function is almost always spread across multiple classes or objects. A unit test is ideally designed to test whether a single public method in a single class behaves as expected when it is given a certain input and when all other code dependencies act in a particular manner. Rephrasing with technical terms, a unit test verifies the behavior of a single public method in a single class by "mocking" or "stubbing" the behavior of all other classes and methods that the method under test depends on. Side-note: (1) Various testing frameworks like Mockito in Java and unittest.mock in Python provide the ability to mock or stub an existing code behavior; (2) for details about mocking vs stubbing or spying, see the references on StackOverflow, Grails and here.
- A few things to keep in mind when writing unit tests (a short sketch combining these points follows this list):
- The AAA (Arrange, Act, Assert) pattern is a common way of writing unit tests. The Arrange section of a unit test method initializes objects and sets the value of the data that is passed to the method under test. The Act section invokes the method under test with the arranged parameters. The Assert section verifies that the method under test behaves as expected.
- Realize that when writing tests, one must not just test that an expected behavior is observed, but also that unexpected behaviors raise exceptions or disallow processing (i.e., negative tests).
- Unit test method names are conventionally kept long and descriptive so that the intent of the test is reflected in the test method name itself; this is unlike the convention for naming the methods of a class, which are kept short. This is a good post on StackOverflow that discusses the naming convention for unit tests.
- It is very likely that your language has a library that provides random fake data, for example, Faker in Python, Java Faker in Java and Faker in Javascript. As much as possible, these libraries should be used to prepare the random data that is then used in the test setup. Running tests using randomly generated data rather than hard-wired data is a great way to ensure that the tests are robust.
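To make these points concrete, here is a minimal pytest-style sketch that combines them. The `FileService` class, its repository dependency and the domain exception are all hypothetical; the repository is mocked with unittest.mock and Faker supplies random test data. Your own classes and framework will differ.

```python
from unittest.mock import Mock

import pytest
from faker import Faker  # third-party package: pip install Faker

fake = Faker()


class FileNotOwnedError(Exception):
    """Hypothetical domain exception raised when a user does not own a file."""


class FileService:
    """Hypothetical class under test; it delegates storage work to a repository."""

    def __init__(self, repository):
        self.repository = repository

    def get_file(self, user_id, file_id):
        if not self.repository.is_owner(user_id, file_id):
            raise FileNotOwnedError(file_id)
        return self.repository.load(file_id)


def test_get_file_returns_content_when_user_owns_the_file():
    # Arrange: stub the repository dependency so the service sees an owned file.
    repository = Mock()
    repository.is_owner.return_value = True
    repository.load.return_value = b"file-bytes"
    service = FileService(repository)
    user_id, file_id = fake.uuid4(), fake.uuid4()  # random, not hard-wired, data

    # Act: invoke the single public method under test.
    content = service.get_file(user_id, file_id)

    # Assert: check the result and the interaction with the mocked dependency.
    assert content == b"file-bytes"
    repository.load.assert_called_once_with(file_id)


def test_get_file_raises_when_user_does_not_own_the_file():
    # Negative test: the unexpected case must fail loudly.
    repository = Mock()
    repository.is_owner.return_value = False
    service = FileService(repository)

    with pytest.raises(FileNotOwnedError):
        service.get_file(fake.uuid4(), fake.uuid4())
```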
Functional test
See the article about functional testing on Wikipedia. A functional test verifies that when a user interacts with the web application, they see the expected response. So, this testing occurs at the request/response level. Hence, only the web server on which the software application is loaded and any external dependencies, like the database or external API calls, are mocked. Unlike a unit test, a functional test does not care about how the different classes involved in the request processing interact with each other, and it does not mock any internal web application code. Almost all web application development frameworks (like Spring for Java, Django for Python, etc.) also provide testing utilities to set up a mock server to which requests can be made as if done by a real user. The mock server processes the request and returns a response. The test verifies whether the returned response matches the expectation. Just like a unit test, each functional test should try to test a single business feature, and so the AAA (Arrange, Act, Assert) pattern can be used when writing functional tests. Also, like unit tests, the test method names are kept long and descriptive, and a fake data library should be used to create random test data.
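As a rough illustration, here is what a functional test can look like with Flask's built-in test client standing in for whatever framework you use. The endpoints, the in-memory `_files` store and the toy application defined inside the test module are all hypothetical; in a real project the application would come from your actual codebase.

```python
from flask import Flask, jsonify, request

# Hypothetical minimal application; in a real project this would be the real
# application factory, not a toy defined next to the tests.
app = Flask(__name__)
_files = {}  # stands in for a mocked / in-memory storage backend


@app.route("/files", methods=["POST"])
def upload_file():
    payload = request.get_json()
    _files[payload["name"]] = payload["content"]
    return jsonify({"name": payload["name"]}), 201


@app.route("/files/<name>", methods=["GET"])
def get_file(name):
    if name not in _files:
        return jsonify({"error": "not found"}), 404
    return jsonify({"name": name, "content": _files[name]})


def test_uploaded_file_can_be_retrieved_by_its_name():
    client = app.test_client()  # mock server: no network, request/response only

    # Arrange + Act: interact exactly as a user (or API client) would.
    created = client.post("/files", json={"name": "cv.docx", "content": "hello"})
    fetched = client.get("/files/cv.docx")

    # Assert on the response, not on how internal classes collaborated.
    assert created.status_code == 201
    assert fetched.status_code == 200
    assert fetched.get_json()["content"] == "hello"


def test_retrieving_a_missing_file_returns_404():
    client = app.test_client()
    response = client.get("/files/does-not-exist")
    assert response.status_code == 404
```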
Test data provider
Since functional tests aim to mimic a user interaction as closely as possible, it is necessary that the test data used in these tests (for example, data sent in the body of a mocked request, or data stored in the mocked database) is as close as possible to what it might be in a real use case. Hence, it is a good idea to use centralized data provider(s) that prepare test data in the proper form expected by the web application. For example, let's say that a user record in your application contains a user name which can be 20 characters maximum and an email which must be from your-application-domain.com. In this case, a provider for user test data is a method that returns a user entry having values that agree with the corresponding restrictions. In the future, if the application removes the restriction that the email must be from the your-application-domain.com domain, then this change will need to be reflected only in the test data provider for user, and all functional tests that rely on this provider will get updated!
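A minimal sketch of such a provider, assuming the hypothetical user-name and email restrictions from the example above and using Faker for randomness; the field names are invented for illustration.

```python
from faker import Faker  # pip install Faker

fake = Faker()


def provide_user(**overrides):
    """Centralized provider: every functional test that needs a user goes
    through this helper, so the application's validation rules live in
    exactly one place. Field names and constraints here are hypothetical."""
    user = {
        "user_name": fake.user_name()[:20],  # respect the 20-character limit
        "email": f"{fake.user_name()}@your-application-domain.com",
    }
    user.update(overrides)  # tests override only the fields they care about
    return user


# Usage in a test: ask for a valid user, tweak a single field when a test
# needs to exercise one specific rule.
valid_user = provide_user()
renamed_user = provide_user(user_name="alice")
```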
Background job test
A web application may involve background processing tasks that are executed in addition to handling web requests from users. If these tasks modify database entries made in some previous request, or prepare data without which future requests may fail, then they have an indirect interaction with the user and so they must be tested. For example, let's say that a web application allows a user to upload a docx resume, and the application automatically converts it into a pdf file at some later time. Since these background methods may run asynchronously, testing them may require developing custom methods to force the background task to run synchronously, after which the test assertions can be executed. It should also be ensured that the custom methods do not create any daemon threads or cause any memory leaks.
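As one concrete possibility, task-queue frameworks such as Celery offer an "eager" mode that forces tasks to run synchronously inside the test process. The sketch below assumes that setup; the task body and the `converted` store are hypothetical stand-ins for your real background job and storage.

```python
from celery import Celery  # pip install celery

# Force the task queue to run jobs synchronously inside the test process,
# so assertions can run right after the task "completes".
celery_app = Celery("tests")
celery_app.conf.task_always_eager = True       # run tasks in-process
celery_app.conf.task_eager_propagates = True   # re-raise task exceptions in tests

converted = {}  # stands in for the storage the real task would write to


@celery_app.task
def convert_to_pdf(file_name):
    # A real implementation would call a document converter; this is a stand-in.
    converted[file_name] = file_name.replace(".docx", ".pdf")
    return converted[file_name]


def test_uploaded_resume_is_converted_to_pdf():
    # Because of task_always_eager, .delay() blocks until the task finishes,
    # so the assertions below see the task's side effects.
    result = convert_to_pdf.delay("resume.docx")

    assert result.get() == "resume.pdf"
    assert converted["resume.docx"] == "resume.pdf"
```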
Server settings during functional test
It is a good practice to provide all web application configurations at runtime (reference: here, as one of the 12 factors of developing a web application, here). It is very likely that when the testing framework creates the mock server on which functional tests are run, it will attempt to look for configuration settings to use for the mock server. The suggestion here is to ensure that a configuration file is available that can be used by the mock server.
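A tiny sketch of what this can look like, assuming a hypothetical environment-variable based configuration; the setting names and values are invented for illustration.

```python
import os


def load_settings():
    """Hypothetical settings loader: every value the server needs is read from
    the environment at runtime (12-factor style), nothing is hard-coded."""
    return {
        "database_url": os.environ["DATABASE_URL"],
        "upload_bucket": os.environ.get("UPLOAD_BUCKET", "local-uploads"),
    }


def configure_test_environment():
    """Called from the test suite (e.g. a fixture) before the mock server is
    created, so the same settings mechanism points at test doubles."""
    os.environ["DATABASE_URL"] = "sqlite:///:memory:"  # in-memory test database
    os.environ["UPLOAD_BUCKET"] = "test-uploads"
```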
External API test
If the web application makes an API call to an external resource, either as part of request processing or from a background task, then it must be tested that the code for these external interactions behaves in an expected manner. Functional testing of external API calls or third party applications is achieved by mocking the external API used in the application (just as one may do in a unit test) and configuring the mock server to use these mocks during the test runs. Thus, even when executing functional tests, external API calls are tested in the same way as in a unit test. Note that it may also be the case that these third party API calls are made within some background tasks. In this case, the test must invoke the background task and verify the effects of the third party API call.
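For illustration, here is a hedged sketch using the `responses` library to intercept an outgoing call made with `requests`; the scanner endpoint and the `scan_file` helper are hypothetical.

```python
import requests
import responses  # pip install responses: intercepts outgoing `requests` calls


def scan_file(content):
    """Hypothetical application code: calls a third party virus-scanning API
    before a file is stored."""
    reply = requests.post("https://scanner.example.com/scan", json={"content": content})
    return reply.json()["clean"]


@responses.activate
def test_file_is_flagged_when_external_scanner_rejects_it():
    # The external API is mocked; no network call leaves the test process.
    responses.add(
        responses.POST,
        "https://scanner.example.com/scan",
        json={"clean": False},
        status=200,
    )

    assert scan_file("suspicious bytes") is False
```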
Integration test
An integration test can be seen as a functional test where no methods are mocked and testing is done only through the API available to a user. An integration test is run on a web application that is deployed on an actual server and connected to a real database. Any external API calls made are also real and not mocked. Thus, it tests whether the integration between the various components used by the web application works in the expected manner. This can be helpful in scenarios where, say, one of the external API calls fails because the corresponding API provider modified the API specifications. Since functional tests mock the behavior of external API calls, this change would not have been caught by functional tests. Integration tests are also preferred when it needs to be verified that a sequence of operations done in a workflow behaves in an expected manner. Although it is also possible to do a workflow verification in a functional test, that would require creating a lot of mocked test data, and managing it quickly becomes cumbersome! One drawback of integration tests is that they cannot verify actions done asynchronously by background tasks. Another drawback is that the data created in the database by integration tests may live for a long time and need manual deletion.
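A sketch of what such a test can look like, using plain `requests` against the deployed server; the QA base URL, endpoints and test credentials are hypothetical and would come from your environment.

```python
import os
import uuid

import requests  # real HTTP calls to a server deployed in the QA environment

# Hypothetical base URL of the quality-testing deployment, supplied at runtime.
BASE_URL = os.environ.get("QA_BASE_URL", "https://qa.example.com")


def test_owner_can_retrieve_an_uploaded_file():
    session = requests.Session()

    # Log in as a real test user that exists in the QA environment.
    session.post(f"{BASE_URL}/login", json={"user": "qa-user", "password": "qa-pass"})

    # Upload and then retrieve the file through the public API only; no mocks.
    name = f"integration-{uuid.uuid4()}.txt"
    upload = session.post(f"{BASE_URL}/files", json={"name": name, "content": "hi"})
    fetched = session.get(f"{BASE_URL}/files/{name}")

    assert upload.status_code == 201
    assert fetched.status_code == 200
    assert fetched.json()["content"] == "hi"
```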
End-to-end test
The simplest way to think of end-to-end testing is as an integration test executed manually over all steps of a business workflow, spanning multiple website domains. Additionally, if there are any asynchronous background task(s) involved, then the test is paused until the task completes, and is picked up thereafter. Any business-critical observations that are collected as logs or runtime metrics can also be cross-checked during an end-to-end test. Like integration tests, it is suggested to only focus on high-impact workflows in these tests.
Load test
See this article about load testing on Wikipedia. A load test involves simulating an actual request load from the expected count of different users to the system, and then verifying whether the system is able to serve requests within a desired latency. It is also expected that all requests are successfully processed and do not unexpectedly fail. Preparation for a load test involves creating a testing environment with a similar count and specification of servers as would run in the production environment; let's call these the "load test servers". A different group of servers is spun up to emulate users who will make requests to the "load test servers"; let's call these the "user servers". As part of the load test, multiple requests are simultaneously made from the user servers to the load test servers. It is determined how many requests per second the load test servers can handle before the response latency starts to increase. It is also analyzed whether the load test servers show any unexpected CPU and memory usage when they are loaded. An interesting feature of the load test is that, among all tests listed on this page, it is the only one that can identify concurrency-related bugs (like race conditions in code, deadlocks in database connections, memory leaks, etc.). While load tests can be skipped by a new business because it will likely not see a huge workload for some time, they are an absolute must for important web applications! Note that there are many more considerations that go into designing and analyzing a load test, but they are not covered here because they are outside the scope of this section. Readers are encouraged to explore more on load tests.
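As one example of tooling, a minimal script for Locust (a popular Python load-testing tool) could look like the following; the endpoints, task weights and credentials are hypothetical.

```python
# A minimal load-test sketch using Locust (pip install locust).
# Run with something like:  locust -f locustfile.py --host https://qa.example.com
from locust import HttpUser, task, between


class FileApiUser(HttpUser):
    wait_time = between(1, 3)  # each simulated user pauses 1-3 s between requests

    def on_start(self):
        # Each simulated user logs in once before issuing file requests.
        self.client.post("/login", json={"user": "load-user", "password": "load-pass"})

    @task(3)  # weight: retrieval happens three times as often as upload
    def view_file(self):
        self.client.get("/files/sample.txt")

    @task(1)
    def upload_file(self):
        self.client.post("/files", json={"name": "sample.txt", "content": "hello"})
```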
Comparison of tests
Test pyramid
Almost all modern software development processes use the agile methodology. Mike Cohn (reference: Wikipedia, company) is one of the contributors to the scrum software development method and one of the founders of the Scrum Alliance. In his book Succeeding With Agile, he describes the concept of a "test automation pyramid", describing the three levels of test automation, their relation and their relative importance (reference: here; Disclosure: I haven't fully read the book! I encourage the readers to also go over the book because it may contain additional information that I, and many other webpages that refer to the book, might have missed or overlooked). Side-note: Automated tests are an important part of the software development lifecycle because they prevent new code with unexpected and breaking behavior from making it into the application codebase. They comprise unit, functional and integration tests. Returning to the discussion, the book suggests that a codebase should have many more unit tests, relatively fewer integration or service tests, and even fewer UI tests. This suggestion to have more unit tests than UI tests is called the test pyramid. It is a great visual metaphor telling you to think about different layers of testing. It also tells you how much testing to do on each layer (reference: Martin Fowler's webpages and here).
Sample web application
For the purpose of the discussion in this section comparing various tests, let's consider the following web application. Let's say there is an application that allows saving, retrieving and deleting files, something like Box or Google Drive. Let's say that the application exposes REST endpoints that can be used to interact with it. A user can sign up if they are a new user, or they can log in if they already have an account. After logging in, they can upload a new file, and view or delete an existing file. A deleted file cannot be retrieved again. After they log out, they cannot upload, view or delete a file. For file operations, let's say that the user request initially goes to a class named "Controller", and from there it goes to a class named "BusinessService" which checks if the user owns the file. If the user does not own the file, then they are not allowed to view or delete it. If the user does own the file, then the request goes to another class called "Repository", which retrieves the file content from some storage, deletes the file, or adds the file to the storage.
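To keep the later comparison concrete, here is a rough, hypothetical sketch of the three classes; only the names mirror the description above, everything else is invented for illustration.

```python
class Repository:
    """Talks to the underlying file storage."""

    def __init__(self, storage):
        self.storage = storage  # e.g. a dict, a disk path, or a cloud bucket

    def add(self, file_id, content):
        self.storage[file_id] = content

    def load(self, file_id):
        return self.storage[file_id]

    def delete(self, file_id):
        del self.storage[file_id]


class BusinessService:
    """Enforces ownership rules before delegating to the repository."""

    def __init__(self, repository, ownership):
        self.repository = repository
        self.ownership = ownership  # maps file_id -> owning user_id

    def view(self, user_id, file_id):
        if self.ownership.get(file_id) != user_id:
            raise PermissionError("user does not own this file")
        return self.repository.load(file_id)


class Controller:
    """Translates REST requests into calls on BusinessService."""

    def __init__(self, service):
        self.service = service

    def get_file(self, request):
        return self.service.view(request["user_id"], request["file_id"])
```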
Unit test vs functional test
Unit tests are made over the classes used in the web application. For the example application above, unit tests can be made for the "Controller", "BusinessService" and "Repository" classes. When writing a unit test for "Controller", the "BusinessService" class is mocked because the latter is the only direct dependency used by the former. When writing a unit test for the "BusinessService" class, the "Repository" class is mocked because it is the only dependency. Finally, when writing a unit test for "Repository", the file storage is mocked. On the other hand, functional tests (also called service tests in the test pyramid) are made using the REST endpoints. Three functional tests can be made, one each for adding, retrieving and deleting a file (note: do not confuse a "service test" as defined in the test pyramid with a "BusinessService" unit test - the two are different! The only common thing is the use of the word "service" in their names). For functional tests, a mock server with an in-memory file storage is instantiated, and the REST requests are made against it. The actual "Controller", "BusinessService" and "Repository" classes are used by the mock server; unlike in the unit tests, these classes are not mocked.
The "test pyramid" suggests that there should be both unit and functional tests, but there must be a lot more functional tests than unit tests. My suggestion is the artifact exported by your codebase are concrete implementation of interfaces, then it should have more unit test and almost no functional test. However, if your are coding for a web application, then it should have more functional tests and almost no unit test. Note that this suggestion is different from that provided by the test pyramid, so, I strongly suggest discussing with your team on whether they want to use my suggestion or the one from test pyramid! However, from personal experience, I strongly favor going with my suggestion for the reasons mentioned below.
- When writing unit tests where all dependency classes are mocked, the test code becomes strongly coupled with the application code. If, for any reason, the application code is refactored, then all the test code will also break. For example, if it is identified that the breakdown of code into "Controller", "BusinessService" and "Repository" is unnecessarily burdensome, and that everything should be pulled into one single class, then the tests that were written assuming these three classes exist will break. Since functional tests are based only on the request to and response from the mock server, any code rearrangement will leave the tests intact.
- Since unit tests are strongly coupled to the application code, they can hinder code refactoring. Teams may not be willing to refactor the code because doing so would break all the tests; any application code refactor would also need a test code refactor. Even worse, it cannot be said with full certainty that no application behavior was unknowingly modified or broken by the refactor, because both the application and the testing code changed at the same time! On the other hand, since functional tests are decoupled from the application code, a refactor of the application code does not need to be accompanied by a change in the test code. Since the functional test code does not change on an application code refactor, one can be much more certain that the refactored application did not break any business functionality.
- Let's say we want to add a behavior where an "audit entry" is made whenever a file is deleted. As the application evolves, maybe the requirements change from creating an audit entry in the logs, to making it in the database, to making it only for certain classes of users and/or sensitive files! Let's say this was achieved in the application code by initially making a change in the "Repository" code, but later moving it to a method under the "BusinessService" class, and even later, by moving it to a second "AuditRepository" class. If using unit tests, then every time such a change is made, the unit tests undergo a major disruption: either the mocked dependencies of the test class change, or code is moved into/out of one test class from another. On the other hand, when using functional tests, only the corresponding test method in the functional test class needs to be changed. This is again because the functional test code is decoupled from the application code.
- One advantage of the unit test is that it can pinpoint exactly where an error is occurring. However, if the application code is developed such that any runtime exceptions are caught and logged, and the functional tests are configured properly to collect logs when they run, then the logged exception stacktrace can be used to identify the cause of the error! So, it is not that having functional tests precludes the ability to identify the source of errors. Both unit and functional tests can be automated and run in-memory, so there is again no drawback in using functional tests rather than unit tests.
- It is commonly asserted that since unit tests are more granular than functional tests, they test the application code more thoroughly than functional tests do. I personally disagree with this statement and have experienced that any line of application code that isn't related to logging / monitoring of the application can be tested just as easily with a functional test as with a unit test. Also, if there is any code that isn't covered by a functional test, then that is extra code added by the developer that is not needed by the business requirements.
- Since unit tests are designed around the classes used in the application code, it is not immediately clear which business processes are tested by a unit test. For example, each of the "Controller", "BusinessService" and "Repository" classes will have various methods, some of which are used in adding, retrieving or deleting a file. Hence, the unit test class for each of these classes will have test methods that relate to adding, retrieving or deleting a file. Thus, the tests relating to a single business feature (like retrieving a file) are scattered across different test classes. In contrast to this, all tests for a particular business feature are collected inside a single feature test class. Let's put this observation aside for some time and analyze another aspect of software development. In a software development team, the product managers drive the agenda to add / update features in the software application. Note that any discussions with product managers are done on a "feature" level and not on a code component level. Hence, it is more intuitive to develop feature tests based on business requirements, rather than first identifying the code components that would be needed to achieve a feature and then creating unit tests for those components. It is also easier and more intuitive to verify that all the business requirements have been met by looking at feature tests, rather than by looking at unit tests.
- I would assert that it is easier to have test-driven development, or TDD, when using feature tests rather than unit tests. With unit tests, one needs to first identify the components that will get used in developing a feature so that unit tests can be made. However, this initial identification need not be correct, and while writing the new code, additional code components may get identified that need to change in order to achieve the new business feature(s). This will break any unit tests that were written based only on the initial identification of the components involved, before the feature code was completed. Hence, having unit tests may, in some cases, preclude the ability to use TDD.
- While unit tests give the satisfaction of good code coverage and nothing more, having functional tests can provide an additional benefit for applications that expose RESTful services. The same code as used for the functional tests can also be used to perform integration tests on an actual server. This is also discussed in a later section.
- Based on the above discussion, I suggest that for web applications, one should primarily use functional tests rather than unit tests. It is also good to be cognizant of the trade-offs in preferring functional tests over unit tests.
- Since functional tests are based on REST endpoints, different functional tests may repeat testing of common behavior. For example, the functional test for retrieving an existing file and the one for retrieving a newly added file both exercise the "retrieve" functionality. Both tests also repeat the check that file retrieval is only allowed for logged-in users who own the file. To prevent writing the same test setup and assertion code in different functional tests, these common behaviors can be extracted into a common class / method and reused in various tests. Still, the repetition increases the total time needed to run the functional tests.
- Since unit tests are based on code components, requiring unit tests is a good way to enforce a certain code arrangement or design pattern. Thus, if the application isn't evolving fast, and the design choices have been settled and will not change in the future, then it may make sense to require unit tests to enforce that any new code conforms to the existing design choices. On the other hand, if the application is evolving fast and likely to get refactored many times in the future, then functional tests should be preferred over unit tests.
- Finally, also note the similarities between the unit and functional tests:
- Both are set up to run in-memory, and so, they are faster than other tests.
- Both are set up to run positive and negative test cases. Positive test cases are expected to succeed, and negative test cases are expected to fail and raise an exception.
- Both are generally configured to not run test assertions against log statements or application monitoring statistics.
- For both tests, it is also a convention to have long and descriptive method names so that the intent of the test is reflected in the test method name itself.
Functional test vs integration test
Integration tests are similar to functional tests in that they test the application only using the request and response. Similar test assertions as made for a functional test can also be used for an integration test. For example, both the functional and integration tests may have a test case to verify that a user who owns a file can view it. However, the difference lies in how this testing is performed. For a functional test, an in-memory test server is spun up, a test user profile is made in a test in-memory database, and a mock request is sent to the test server that is interpreted as if it is coming from the test user. For integration testing, the new code is deployed to a real server in a quality-testing environment (in the quality-testing environment, the server and application are not open to the public and are only available internally to the company), server requests are made to log in as a valid user whose details exist on the server, and an attempt is made to retrieve a file from the server. So, while a "similar" kind of testing is done by both functional and integration tests, the former employs some kind of mocks while the latter only uses real server calls through-and-through and does not use any mocks! For this reason, functional tests can mock the execution of asynchronous background tasks, but it is not possible to do so in integration tests. Also, functional tests are much faster than integration tests because the latter use network calls to a real server rather than in-memory server calls. Hence, it is preferred to use integration tests for testing only the positive use cases and not the negative ones.
Since integration tests are run using real server calls, it is easier to test an application workflow using integration tests than using functional tests. For example, consider the workflow: (i) user signs up, (ii) user logs in, (iii) user successfully uploads a file, (iv) user successfully retrieves the file, (v) user logs out, (vi) user is unable to view the file. If this sequence of operations is done using functional tests, then it will require maintaining a lot of mock data, which may become cumbersome. However, it is comparatively easy to do this test by simply sending a sequence of REST calls to a real server. Hence, for testing workflows, it is preferable to use integration tests over functional tests. That said, I would not suggest testing each and every edge case in a workflow using integration tests, else the overall duration of the integration test run becomes too long. I would rather suggest using integration tests for only testing workflows of business importance!
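Sketching the workflow above as an integration test, again with hypothetical endpoints and a QA base URL supplied at runtime; the expected status codes are illustrative rather than prescriptive.

```python
import os
import uuid

import requests

BASE_URL = os.environ.get("QA_BASE_URL", "https://qa.example.com")  # hypothetical


def test_signup_upload_retrieve_logout_workflow():
    session = requests.Session()
    email = f"wf-{uuid.uuid4()}@your-application-domain.com"

    # (i) sign up and (ii) log in
    session.post(f"{BASE_URL}/signup", json={"email": email, "password": "pw"})
    session.post(f"{BASE_URL}/login", json={"email": email, "password": "pw"})

    # (iii) upload and (iv) retrieve a file
    upload = session.post(f"{BASE_URL}/files", json={"name": "a.txt", "content": "x"})
    assert upload.status_code == 201
    assert session.get(f"{BASE_URL}/files/a.txt").status_code == 200

    # (v) log out, then (vi) verify the file can no longer be viewed
    session.post(f"{BASE_URL}/logout")
    assert session.get(f"{BASE_URL}/files/a.txt").status_code in (401, 403)
```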
Integration test vs end-to-end test
Integration tests (and also unit and functional tests) are executed as automated tests, are run against a single application's codebase, and run in the quality-testing environment. In contrast, end-to-end tests can span multiple applications, are almost always run manually and can be run in the production environment. As an example, let's say that after a file is uploaded, the application converts it into a pdf and then automatically publishes the file as a novel on some publishing website. The novel publishing website has a feature where an email is sent to the user after the novel is published. This workflow spans the file upload application, the third party novel publishing website and the email feature provided by the third party website. Let's also say that the conversion to pdf is done by an asynchronous task. This workflow likely cannot be tested as an integration test because of the use of asynchronous tasks and of features provided by the third party website in the workflow. However, since it is important to ensure that the workflow works in an expected manner before the feature is opened for public use, it is necessary to test it using end-to-end tests.
Load test vs others
Unlike the other tests, the primary objective of a load test is not to ensure that a web application feature works properly. Instead, a load test is set up and run to identify whether there is any unexpected system resource utilization (like unusually high CPU usage, memory leaks, etc.) when users start interacting with the application. It is also identified whether the system is able to respond to users within a required latency. Before load tests are run, it should already have been verified via the other tests that the application behaves as expected. That said, load tests are the only tests that can identify whether there are any concurrency issues in the application, like deadlocks, race conditions, etc.
Code coverage
A metric closely associated with code testing is "code coverage". See the article about code coverage on Wikipedia. It is the percentage of the overall source code of the application that is executed when a particular test suite is run. A program with high test coverage has more of its source code executed during testing, which suggests it has a lower chance of containing undetected software bugs compared to a program with low test coverage. However, just because a large percentage of the code is run, it does not imply that the code that ran is correct! This is why having a high code coverage does not necessarily mean that the application is bug-free, and that's why other tests are also run! That said, having a low code coverage does erode confidence in the ability of the code to be bug-free, and so software applications strive to maintain high code coverage. From experience, I find that aspiring for 85% or better code coverage is a good starting point. Code coverage further breaks into two parts: line coverage and branch coverage. Branch coverage includes tests covering branched code execution, like loops, if-else, switch-case, exceptions, etc.
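A small, hypothetical illustration of the difference between line and branch coverage: the single test below executes every line of `apply_discount`, so line coverage is 100%, but the branch where the `if` condition is false is never taken, so branch coverage remains incomplete (with coverage.py, this shows up when running with the `--branch` flag, e.g. `coverage run --branch -m pytest`).

```python
def apply_discount(price, is_member):
    # An `if` with no `else`: skipping the body is a branch of its own.
    if is_member:
        price = price - 10  # members get a flat discount
    return price


def test_members_get_a_discount():
    # Executes every line above, yet the "is_member is False" branch is never
    # taken, so line coverage is full while branch coverage is not.
    assert apply_discount(100, True) == 90
```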
Test-driven development, or TDD
An important term related to software testing is test-driven development, or TDD. TDD is a style of developing code and tests. It suggests that the task to add business features should first start with adding the test cases associated with the feature being developed, and then adding or modifying code just enough until all the newly added test cases pass. This idea contrasts with the traditional route of software development, where the application code is added or modified first, and the corresponding test cases are created later. The motivation for TDD comes from the observation that it is much easier to write functional and integration level tests based on the task requirements, because these tests are written against the user request and expected response. This allows for easy translation of product requirements into technical expectations for the new code. Also, once the tests are added, the count and type of failing test cases can be used to identify how much of the overall development effort is pending, or whether there's a particular thorny issue in the feature being developed that is slowing down the development. With TDD, since the first steps in feature development involve adding test cases, it counters the bias among managers and developers to push out a feature without testing it against all edge cases.
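A tiny, hypothetical illustration of the red-green cycle: the test is written first (and initially fails because `delete_file` does not exist yet), and the two functions below it are then added just to make it pass; the function names and the dict-based storage are invented for illustration.

```python
import pytest


def test_deleted_file_can_no_longer_be_retrieved():
    # Written first: this drives the implementation and fails until
    # delete_file and retrieve_file are added below.
    storage = {"report.pdf": b"data"}

    delete_file(storage, "report.pdf")

    with pytest.raises(KeyError):
        retrieve_file(storage, "report.pdf")


# Step two of the cycle: add just enough code to make the test pass.
def delete_file(storage, name):
    del storage[name]


def retrieve_file(storage, name):
    return storage[name]
```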
Despite its advantages, there are also cons to adopting TDD. The primary one is that if the same developer is writing the code and also the test cases, then there will always be a bias in the testing process, regardless of whether the code is developed first or the tests are made first. Hence, the best use of TDD comes if two developers can simultaneously work on a feature: one of them writing the code and the other writing the tests. This ensures that at a later time, when the code is run against the tests, the developer gets unbiased feedback by looking at failing tests if certain functionality has not been added. For this to happen, the coding requirements must have been solidified before any development begins. However, in the real world, this is not always assured. As development proceeds, the code requirements can get progressively clarified, new issues might be found, or new interactions between different business services may be identified that need handling. Maybe the structure of the request and response body isn't defined when the coding starts. It may also be the case that the feature being developed is experimental in nature and likely to change in the future. In this case, TDD doubly hurts because it takes developer time away from feature development, and later on, when the behavior is changed, the old tests need to be deleted and rewritten. By working on the application code before writing tests, a developer gets sufficient time to focus on the business feature being added, rather than trying to continuously change the tests. While TDD may be a good practice, it shouldn't be used as a silver bullet.
Static code analysis
Static program analysis is the analysis of computer software that is performed without actually executing the program. The code analysis is performed by an automated tool. This is in contrast with dynamic analysis, which is performed on programs while they are executing, like the various tests discussed above. Although they do not contribute to code coverage, a huge benefit of using static analysis tools is that they discover bugs before the code is released into production. Some examples of static analysis tools are formatters, linters and static vulnerability analyzers. Formatters automatically style code to improve readability and consistency without modifying how the code is executed. Linters look for code smells and defects, and can also detect some bugs. Static vulnerability analysis tools warn you if you are using a package version that has known security vulnerabilities. Their execution can be integrated within the software development workflow via pre-commit hooks, for example: running ESLint over Javascript code during a commit in Git, or configuring static analysis for Python.
Code review
"Code review" refers to the manual review of code by team members to identify if the requirements of the tickets have been met the developed code, if necessary tests have been added, if the code many have any bugs and if the code is written using acceptable design and coding practices used by the team. Wherever it is deemed, team members can add comments to request for answers from software developer who wrote the code. Unlike automated tests, this step is qualitative, non-deterministic and manual in nature. It also provides a quick venue for other developers to keep pace with changes occurring at a difference place in code, and to improve their coding skills through discussion with others. Although it is an important step, the code review process can suffer from a few drawbacks. For example, the peers may not be fully vested in the process and may accept the new code code without proper review. Or, maybe a valid opinion from a team meber is disregarded just because it is non-conforming to majority opinion. Or, maybe only the opinions coming from team lead is considered and views coming coming from other developers are disregarded. Even without these issues, it is likely that multiple back and forth discussions, comments, and code-changes during the review stage increase in the time-gap between initial code completion and its subsequent deployment. Note that in all these cases, it is not the code review process itself that is bad! Instead, the team must discuss on how to handle these issues to make the code review process more productive in achieving its goal.