WCCILevent - REDU/SEVERE - EventManager, evMain, Redundant peer recovery timeout - aborting recovery (active recovery)
This entry explains a log-message that can occur during startup in a redundant system when the recovery of the event-manager fails. The log-message is written to the PVSS_II.log-file.
WCCILevent (0), 2014.09.24 14:03:35.975, REDU, SEVERE, 54, Unexpected state, EventManager, evMain, Redundant peer recovery timeout - aborting recovery
Log-message with symbolic names:
WCCILevent (0), <TIMESTAMP>, REDU, SEVERE, 54, Unexpected state, EventManager, evMain, Redundant peer recovery timeout - aborting recovery
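To search a PVSS_II.log-file for this entry automatically, the comma-separated layout shown above can be split into fields. The following is only a minimal sketch: the field names (manager, timestamp, category, severity, code, text) are labels chosen here for illustration and are not official terms, and the timestamp format is inferred from the single example line above.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are illustrative labels, not official terminology.
    manager: str
    timestamp: datetime
    category: str
    severity: str
    code: int
    text: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Split one log line into its leading fields; return None if it does not fit."""
    # The first five fields are fixed, everything after them is kept as free text.
    parts = line.strip().split(", ", 5)
    if len(parts) < 6:
        return None
    manager, stamp, category, severity, code, text = parts
    try:
        # Format assumed from the example: 2014.09.24 14:03:35.975
        ts = datetime.strptime(stamp, "%Y.%m.%d %H:%M:%S.%f")
        return LogEntry(manager, ts, category, severity, int(code), text)
    except ValueError:
        return None


def is_recovery_timeout(entry: LogEntry) -> bool:
    """True for the REDU/SEVERE entry that reports the aborted active recovery."""
    return (
        entry.category == "REDU"
        and entry.severity == "SEVERE"
        and "Redundant peer recovery timeout" in entry.text
    )


sample = (
    "WCCILevent (0), 2014.09.24 14:03:35.975, REDU, SEVERE, 54, Unexpected state, "
    "EventManager, evMain, Redundant peer recovery timeout - aborting recovery"
)
entry = parse_log_line(sample)
print(entry is not None and is_recovery_timeout(entry))  # prints True
```

Lines that do not match the assumed layout, such as continuation lines or lines without a parsable timestamp, are returned as None and can simply be skipped.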
The log-message is written on the running system, which performs the active recovery, when the permitted recovery time is exceeded. The maximum recovery time is defined by the following config-entry in the [event]-section of the config.redu-file (value in seconds):
activeRecoveryTimeout = 3600
The timer starts when the recovery is initiated by the project on the other server, which is starting up. Within the timeout, the recovery of the database, the startup of the project and the recovery of the event-manager (exchange of buffered data) must all be completed.
The timeout can be reached when the database recovery and/or the project startup takes very long (slow network, insufficient read/write performance of the hard disk) or when a large amount of buffered data needs to be exchanged.
If you want to change the timeout, you have to do so in a config.redu-file stored in your project.
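To check which timeout a project's config.redu-file actually sets, the entry can be read with a few lines of script. This is a rough sketch under stated assumptions: the file is treated as plain text with [section] headers, key = value lines and # comments, and the function name read_active_recovery_timeout is hypothetical. If the entry is absent, the function returns None instead of guessing a default.

```python
from typing import Optional


def read_active_recovery_timeout(config_text: str) -> Optional[int]:
    """Return activeRecoveryTimeout (seconds) from the [event] section, or None."""
    section = None
    value = None
    for raw in config_text.splitlines():
        # Assumption: '#' starts a comment that runs to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "event" and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == "activeRecoveryTimeout":
                # If the key appears more than once, the last occurrence is kept.
                value = int(val.strip())
    return value


sample = """[event]
activeRecoveryTimeout = 3600
"""
print(read_active_recovery_timeout(sample))  # prints 3600
```

In practice the text would be read from the config.redu-file of the project, for example with pathlib.Path(...).read_text(); the path depends on the project and is not assumed here.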
If the timeout was reached, you'll see the following block of log-messages. The messages show that the local system is aborting the recovery and that the event-manager informs the data-manager:
WCCILevent (0), <TIMESTAMP>, REDU, SEVERE, 54, Unexpected state, EventManager, evMain, Redundant peer recovery timeout - aborting recovery
WCCILdata (0), <TIMESTAMP>, REDU, WARNING, 0, , Recovery request aborted from event.
On the other server-project you will see a corresponding block of log-messages. The first message shows that the recovery abort from the other server was received. The data-manager closes the connection to the data-manager on the other server and then stops, so that the recovery and startup can be initiated again:
WCCILdata (0), <TIMESTAMP>, REDU, WARNING, 0, , Recovery aborted from data.
WCCILdata (0), <TIMESTAMP>, SYS, INFO, 181, Closing connection to (SYS: 0 Data -num 0 CONN: 2)
WCCILdata (0), <TIMESTAMP>, REDU, WARNING, 54, Unexpected state, DataManager, passiveRecovery, Lost connection to other replica while receiving updates.
WCCILdata (0), <TIMESTAMP>, SYS, INFO, 39, Connection lost, MAN: (SYS: 0 Data -num 0 CONN: 2), Connection closed
WCCILdata (0), <TIMESTAMP>, SYS, INFO, 2, Manager Stop
The following FAQ-entry describes how to check the hardware performance for the recovery:
portal.etm.at/index.php