Notes and Recommendations

Redundancy Switch Detection - Time Optimization

In a redundant system, an active/passive redundancy switch can take place for the following reasons:

A manual state change triggered by the user (“preferred active”)
A manually triggered switch (“force active”)
A different error state calculated by the error state calculation
A network error where the connection to the redundant server is lost or a complete failure of one server occurs

The detection time for an automatic redundancy switch is related to the configuration of the error state calculation and the timeouts for the alive mechanism for manager connections.

The information about which kind of redundancy switch was executed is written to the internal DPs _ReduManager and _ReduManager_2.

To speed up the detection of changes to the calculated error state or of network/hardware-related switching, the following measures can be taken:

Automatic Error State Calculation

For a quick response and a short detection time when the error state differs, follow these rules:

Configure all relevant parameters for the error state calculation in the specific project, e.g. driver connections, PLC connections, distributed connections, …
Use appropriate timeouts for the alive connection check between driver and PLC.

Most drivers can define the connection timeout to the PLC or the communication partner (OPC server <--> OPC client).

For some drivers you may need to change additional settings if one parameter is modified, as they are related to each other.

If the default values for the timeouts are changed, you have to ensure by appropriate tests that the modified values have no negative impact on the functionality of your system.

E.g. if the network reliability is very low and/or the latency is very high, a timeout that is too short can cause a connection loss to be detected even though the network is still available.

For detailed information on the drivers used in your project, please take a look at the corresponding documentation.

Connection Loss / Server Failure

Alive messages are sent and received between the redundant server pair for the connection between the Event Manager and the Redundancy Manager.

The default value for the manager timeouts: [all sections] aliveTimeout = -10

If a negative value is used, alive messages are only sent to managers which are not running on the same machine.

A WinCC OA manager can detect a connection loss by

Reaching the aliveTimeout
Getting the information from the TCP stack (operating system) that the connection is lost

Which one occurs first depends on the situation and the location of the network error. E.g. if a cable is unplugged at one server the managers running on this machine will, most of the time, first receive the information from the TCP stack. The partner managers running on the redundant server will most of the time run into the alive timeout.

You can see which of the possibilities occurred in the log message written to the PVSS_II.log file.

If you want to have a quicker response to a lost network connection you can decrease the value for the aliveTimeout. The recommended minimum value is 3 seconds (aliveTimeout = -3).

In that case you have to take into account the quality of the network used for the connection between the redundant partners. After modifying the configuration, you have to ensure that there is no negative impact on the functionality of the redundant system.
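The rules for aliveTimeout described above can be summarized in a small sketch. The helper describe_alive_timeout is hypothetical and not part of WinCC OA. It only restates what the text says: the sign selects whether local managers are excluded, and the magnitude is the timeout in seconds. Two points are assumptions: that a positive value applies to managers on the same machine as well, and that 0 is treated as out of scope because the text does not cover it.

```python
DEFAULT_ALIVE_TIMEOUT = -10   # default value stated in the text
RECOMMENDED_MIN_SECONDS = 3   # recommended minimum stated in the text


def describe_alive_timeout(value: int = DEFAULT_ALIVE_TIMEOUT) -> dict:
    """Interpret an aliveTimeout value (hypothetical helper, not a WinCC OA API)."""
    if value == 0:
        # The text does not describe the meaning of 0, so this sketch rejects it.
        raise ValueError("aliveTimeout = 0 is not covered by this sketch")
    seconds = abs(value)
    return {
        "seconds": seconds,
        # Negative value: alive messages are only sent to managers
        # which are not running on the same machine.
        "remote_only": value < 0,
        "below_recommended_minimum": seconds < RECOMMENDED_MIN_SECONDS,
    }
```

A check like this could be run over a planned configuration before it is rolled out. It does not replace testing the modified timeout on the real network.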

Recovery Behavior - Details

The following steps are performed during the recovery of a redundant system with one WinCC OA server already started.

The server names SRV1 and SRV2 used in this description are symbolic names for the redundant servers.

The debug flag “-dbg redu” for the Data, Event and Redu Manager provides detailed information for the actions executed in a redundant system.

For the redundant system and the active/passive recovery procedure, timeouts are defined (activeRecoveryTimeout and passiveRecoveryTimeout). If a timeout is reached, the recovery procedure is stopped. Depending on the situation in which the recovery procedure is stopped, a server project is restarted automatically.

Initial situation

The project on SRV2 is running.
The project on SRV1 is stopped and should perform a recovery.

Starting Sequence

Start of project at SRV1
Data Manager at SRV1 tries to connect to the Data Manager at SRV2 (config entries data = "...$..." and event = "...$...")
If the connection to the Data Manager at SRV2 is not possible, the project at SRV1 will start without the recovery (normal startup procedure)
If the connection to the Data Manager at SRV2 is possible a check is made if the recovery is possible at SRV2

Recovery Requirements

The recovery is possible if

The Event and Redu Managers are running
The database is running in multiuser mode

In case the recovery cannot be performed:

The Data Manager at SRV1 will stop.
The Process Monitor (Pmon) will restart the Data Manager and another check is performed if the recovery is possible
If a recovery is possible, the Data Manager at SRV2 will forward the information to the Event Manager at SRV2
The Data Manager and Event Manager at SRV2 will start the active recovery.
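The starting sequence and the recovery requirements above amount to a small decision procedure. The following sketch models it. The names StartupAction and decide_startup are hypothetical and exist only for this illustration. The real check is performed internally by the Data Managers.

```python
from enum import Enum


class StartupAction(Enum):
    NORMAL_STARTUP = "SRV1 starts without recovery"
    RETRY = "Data Manager at SRV1 stops; Pmon restarts it and the check is repeated"
    RECOVERY = "Data Manager and Event Manager at SRV2 start the active recovery"


def decide_startup(connected_to_srv2: bool, event_running: bool,
                   redu_running: bool, db_multiuser: bool) -> StartupAction:
    """Model the decision taken when the project at SRV1 starts."""
    if not connected_to_srv2:
        # No connection to the Data Manager at SRV2: normal startup procedure.
        return StartupAction.NORMAL_STARTUP
    if event_running and redu_running and db_multiuser:
        # All recovery requirements at SRV2 are met.
        return StartupAction.RECOVERY
    # Connection exists, but the recovery cannot be performed yet.
    return StartupAction.RETRY
```

For example, decide_startup(True, True, False, True) yields StartupAction.RETRY, because the Redu Manager at SRV2 is not running yet.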

Active Recovery - Data Manager

For the active recovery the Data Manager

Stops the DataBG Manager if it is running (see chapter RDB archiving for additional information).
Sends the information for the active recovery to the archive managers (HDB archiving)
Closes the RAIMA database files

Active Recovery - Event Manager

For the active recovery the Event Manager

Switches into the buffer mode and buffers all data which must be sent to the Data Manager. Historical requests are not possible while the recovery is running.
The communication to other managers is not affected and reading current information, e.g. using a dpGet() is still possible.

Hotlink messages (for a dpConnect, dpQueryConnect*) are sent to those managers which are connected to the changed information.

When the active recovery in the Event Manager is started, the timeout [event] “activeRecoveryTimeout” (default 3600 seconds) is used.

Passive Recovery - Data Manager

The Data Manager at SRV1 is performing a passive recovery. It is waiting for the recovery of database files.

When the passive recovery in the Data Manager is started, the timeout [data] “passiveRecoveryTimeout” (default 1800 seconds) is used.

Recovery - Database

In detail, the database recovery works as described in the following sequence:

While the system is running in a redundant state, the Data Managers at SRV1 and SRV2 write the current time, in seconds since 1970.01.01 00:00:00, to the file <WinCC OA project>/db/wincc_oa/dbase.touch in the local database.
When the recovery is performed, the starting Data Manager (in our example SRV1) reads the information from the dbase.touch file, subtracts 10 minutes and sends this time (the "touch time") to the Data Manager of the running system (in our example SRV2)
The running Data Manager (SRV2) creates a “RAIMA directory list” of the directories in the RAIMA database that contain files modified after the "touch time".

The value-archive directories are excluded from this check.
If HDB archiving is used, the Data Manager checks which files in the Value Archive directories have been changed after the time "touch time" and creates a file list of modified VA files "VA file list".

This only works when the directories have the correct notation VA_<4 digit archive-number>.
For all files located in the directories of the “RAIMA directory list”, the md5-checksum is calculated
For every file in the “VA file list”, the md5-checksum is calculated
The complete list of files and their md5-checksums is sent to the starting Data Manager (SRV1)
The starting Data Manager (SRV1) checks if the files from the file list exist in the local system
Files which do not exist will be copied during the recovery; their names are added to the list of files which need to be copied ("Recovery file list").
If a file already exists, the md5-checksum is calculated (SRV1) and compared with the md5-checksum calculated on the other server (SRV2)
Files with a different md5-checksum will also be added to the list of files which need to be copied ("Recovery file list")
The Data Manager (SRV1) sends the "Recovery file list" to the running Data Manager (SRV2)
After receiving the "Recovery file list" the running Data Manager (SRV2) will start to copy the files to the starting Data Manager (SRV1).

Copying files is done by sending WinCC OA messages instead of copying files using OS commands.
When all files from the "Recovery file list" are copied, the Data Manager (SRV1) will perform the normal startup procedure.
The running Data Manager (SRV2) is still in passive recovery mode.
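The file selection in the sequence above can be illustrated with a short sketch. It is not the Data Manager implementation. All function names are hypothetical, and the checksum list is modeled as a plain dictionary that maps a relative file name to the md5-checksum calculated on the running server. The sketch shows the 10 minute margin of the "touch time", the VA_<4 digit archive-number> directory notation and how the "Recovery file list" follows from missing or differing files.

```python
import hashlib
import re
from pathlib import Path

TOUCH_MARGIN_SECONDS = 10 * 60            # 10 minutes are subtracted from dbase.touch
VA_DIR_PATTERN = re.compile(r"VA_\d{4}")  # VA_<4 digit archive-number>


def touch_time(dbase_touch_value: int) -> int:
    """Time (seconds since 1970) sent to the running Data Manager."""
    return dbase_touch_value - TOUCH_MARGIN_SECONDS


def is_va_directory(name: str) -> bool:
    """True if a directory name uses the notation required for the VA check."""
    return VA_DIR_PATTERN.fullmatch(name) is not None


def md5_of(path: Path) -> str:
    """Calculate the md5-checksum of a file in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recovery_file_list(remote_checksums: dict, local_root: Path) -> list:
    """Names of files that are missing locally or differ from the running server."""
    to_copy = []
    for relative_name, remote_md5 in remote_checksums.items():
        local_file = local_root / relative_name
        if not local_file.is_file() or md5_of(local_file) != remote_md5:
            to_copy.append(relative_name)
    return to_copy
```

Subtracting a margin from the stored time means that files changed shortly before the last write to dbase.touch are still included in the comparison. The checksum comparison then limits the files that are actually copied to those that are missing or different.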

Recovery Closure

The startup procedure at SRV1 will be continued
The Event Manager at SRV1 starts in passive recovery mode.

When the passive recovery in the Event Manager is started, the timeout [event] “passiveRecoveryTimeout” (default 300 seconds) is used.
When the Redu Manager is started at SRV1, it tries to connect to the Redu Manager at the running server SRV2
The Event Manager at SRV1 also tries to connect to the Event Manager at the running system SRV2
When the Redu Manager and Event Manager connections are established, the recovery of buffered data between the Event Managers on both servers is done
When the Event Manager recovery is finished, the system is running in redundant mode
SRV1 is running as passive host and SRV2 is running as active host as long as no priority system switch was requested
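The recovery timeouts mentioned in this description can be collected in one place. The following sketch reads them from an INI-style config text and falls back to the defaults stated above. It assumes that the relevant part of the project config file can be parsed by Python's configparser, which may not cover the full config syntax. The function name effective_recovery_timeouts is hypothetical.

```python
import configparser

# Defaults as stated in the text: (section, entry) -> seconds
DEFAULT_RECOVERY_TIMEOUTS = {
    ("event", "activeRecoveryTimeout"): 3600,
    ("event", "passiveRecoveryTimeout"): 300,
    ("data", "passiveRecoveryTimeout"): 1800,
}


def effective_recovery_timeouts(config_text: str) -> dict:
    """Return the recovery timeouts in effect for the given config text."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep the case of entry names
    parser.read_string(config_text)
    result = {}
    for (section, entry), default in DEFAULT_RECOVERY_TIMEOUTS.items():
        # A missing section or entry falls back to the documented default.
        result[(section, entry)] = parser.getint(section, entry, fallback=default)
    return result
```

Called with a text that only contains "[event]" and "activeRecoveryTimeout = 7200", the function returns 7200 for that entry and the defaults for the other two.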