Planning for Fault Tolerance
Fault tolerance, or "high availability," is critical to any successful business operation. To ensure that requests are processed in the event of failure, FME Server supports configuring fault tolerance throughout the multiple levels of an integrated system. FME Server provides fault tolerance in the following ways:
- Recovery: Restarting components and jobs when crashes occur. FME Server provides component and job recovery automatically; no additional planning is needed.
- Failover: Ensuring there is no single point of failure. Two different configurations can be used to achieve this: Active-Passive or Active-Active. Failover is the primary consideration when choosing an installation architecture.
About Recovery
Component Recovery
FME Server includes component recovery out of the box. Even on a single system, FME Server monitors and restarts components that fail, including the FME Engines and the FME Server Core. This is achieved through the FME Server Process Monitor. Because FME Server monitors its own components, it can maintain reliable uptime without outside intervention.
Job Recovery
FME Server can also restart a translation (job) when a crash occurs. FME Server resubmits the translation until it succeeds or a specified maximum number of attempts is reached. As a result, jobs that experience temporary issues, such as a network hiccup, are resubmitted and run again. Job recovery is configurable and can be turned off entirely. For more information, see Job Recovery.
Note: Resubmitted jobs may cause data duplication, such as when writing to database formats.
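The resubmission behavior, and the duplication risk in the note above, can be illustrated with a minimal Python sketch. This is not FME Server code: `JobCrashed`, `run_with_recovery`, and `flaky_writer` are hypothetical names used only for illustration, and the attempt limit stands in for whatever maximum is configured on the server.

```python
from typing import Callable, Tuple


class JobCrashed(Exception):
    """Hypothetical signal that a job ended in a crash."""


def run_with_recovery(job: Callable[[], str], max_attempts: int = 3) -> Tuple[str, int]:
    """Run a job, resubmitting after a crash, for at most max_attempts runs.

    Returns the job result and the number of attempts used.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except JobCrashed:
            if attempt == max_attempts:
                raise  # attempts exhausted: report the crash
    raise AssertionError("unreachable")


# A job that writes a row and then crashes on its first run only.
written_rows = []
run_count = {"n": 0}


def flaky_writer() -> str:
    run_count["n"] += 1
    written_rows.append("feature-1")  # the write happens before the crash
    if run_count["n"] == 1:
        raise JobCrashed("temporary network issue")
    return "ok"


result, attempts = run_with_recovery(flaky_writer, max_attempts=3)
```

In this sketch the job succeeds on its second attempt, but `written_rows` ends up holding the same row twice, because the first run had already written it before crashing. Writers that are not idempotent behave this way under resubmission.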
About Failover
The goal of a failover environment is to remove single points of failure so that a component can fail, but not take the system offline. FME Server supports two approaches to failover: Active-Passive and Active-Active.
Both approaches have advantages and disadvantages. We typically recommend the Active-Passive architecture, which meets the needs of most enterprises.
Active-Passive
With the Active-Passive failover approach, when the Active system fails, the Passive system takes over the capabilities of the failed system and assumes the Active role. The failed system, in turn, assumes the Passive role. The failed system can then be investigated while the new Active system provides continued operation of FME Server. Once the failed system is recovered, it remains in the Passive role until the Active system fails again.
Failover is achieved through a heartbeat monitor between the Active and Passive systems. The types of failures that typically cause failover are hardware or OS crashes, in which the primary system goes down completely.
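As a rough illustration of heartbeat-driven failover, the following Python sketch swaps roles when the Active system has been silent for too long. It is a simplified model, not FME Server's implementation: the `Node` class, the `check_failover` function, the system names, and the 5-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str              # "active" or "passive"
    last_heartbeat: float  # time, in seconds, of the last heartbeat received


def check_failover(active: Node, passive: Node, now: float, timeout: float = 5.0) -> bool:
    """Swap roles if the active node has been silent longer than timeout.

    Returns True when a failover took place.
    """
    if now - active.last_heartbeat <= timeout:
        return False  # heartbeat is recent enough; nothing to do
    active.role, passive.role = "passive", "active"
    return True


core_a = Node("core-a", "active", last_heartbeat=0.0)
core_b = Node("core-b", "passive", last_heartbeat=0.0)

early = check_failover(core_a, core_b, now=3.0)  # within the timeout
late = check_failover(core_a, core_b, now=6.0)   # heartbeat missed
```

After the second check, `core_b` holds the Active role and `core_a` is Passive, mirroring the role swap described above. Times are passed in explicitly so the sketch is deterministic; a real monitor would read a clock and receive heartbeats over the network.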
Any translations that are lost at the time of failover are resubmitted. These include jobs that failed due to loss of power on the machine hosting the FME Engines, as well as jobs that completed, but are still considered lost due to loss of power on the machine hosting the FME Server Core.
In the Active-Passive architecture, the FME Server Web Application Server and FME Server System Share files are separated physically. Fault tolerance for these components must be provided by the client. For more information, see Active-Passive Architecture.
Advantages of Active-Passive
- Publishing workspaces is a one-time task for the whole system.
- Job Recovery is built into the fault tolerance design.
Disadvantages of Active-Passive
- Requires multiple physical or virtual systems, as each component and its failover are on different systems. That is, a minimum of two FME Server Core systems, plus separate systems for the web application server, database, and file system.
Active-Active
The Active-Active failover architecture duplicates complete FME Server installations on separate servers. In other words, all components reside on the same system, and additional systems are configured similarly and provide similar functionality. A third-party load balancer directs incoming traffic to one of the available systems. When requests are directed to any of the systems, they are handled independently and only by one system. This approach works well with a cloud-based computing environment, such as Amazon Web Services, in which machines can be cloned easily to expand capacity.
For more information, see Active-Active Architecture.
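To make the load balancer's role concrete, here is a minimal Python sketch of round-robin routing that skips systems marked as down. It only models the idea: `RoundRobinBalancer` and the system names are hypothetical, and a real third-party load balancer would detect failures through its own health checks.

```python
class RoundRobinBalancer:
    """Route each request to exactly one healthy system, in rotation."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._index = 0

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        # Try each system at most once per request, starting after the last pick.
        for _ in range(len(self.servers)):
            server = self.servers[self._index % len(self.servers)]
            self._index += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy systems available")


balancer = RoundRobinBalancer(["fme-1", "fme-2", "fme-3"])
before_failure = [balancer.route() for _ in range(3)]

balancer.mark_down("fme-2")  # simulate one system failing
after_failure = [balancer.route() for _ in range(4)]
```

Before the failure, requests rotate across all three systems. Afterward they alternate between the two remaining systems, so the environment stays available with reduced capacity. Jobs that were already running on the failed system are not recovered by the balancer.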
Advantages of Active-Active
- Easy installation using the Express install option.
- Fewer machines are required to create a fault-tolerant environment.
- Additional throughput is achieved easily by adding more systems.
Disadvantages of Active-Active
- Requires administration of multiple FME Servers.
- Workspaces must be published to each system, either manually or through scripting, to keep parent and children in sync.
- No built-in job recovery. Any translations running on a system that fails are lost. They resume only when that system is brought back online, or they must be manually resubmitted on another system.
- Processing capacity is diminished when a system fails.
- May still require recovery/replication of the FME Server System Share for the entire environment.
- Schedules do not fail over; they must be manually restarted on another system.