StrongMail Admin Tasks

StrongMail is a lesser-known product that I have a ton of experience with. It offers business-class email solutions to companies; in other words, it lets you send out a lot of email in a short period of time. I’m only going to make this one post about StrongMail, but if you want me to keep talking about it, please post in the comments. This is essentially a brain dump of what I can think of right now.

To upgrade StrongMail to the latest minor release, run these commands on the StrongMail server:

# cd /data1/svcpack/
# ./sm-svcpack -c
New service pack available for installation:

Download and Install? [Y|N]
Please select [Y] :

Services are interrupted only while the update is being applied, which usually takes no more than 10 minutes. This only works on a box that is already on a 5.2 build; if you have a box on an older version, you need to contact your StrongMail AM to get onto a 5.2 build first. If you don’t keep up with the latest and greatest, you will most likely need to get an upgrade package from SM, but I thought this was a neat little trick.

StrongMail LiveUpdates…
Once a month, StrongMail releases a new LiveUpdate package. This package is very important for keeping the bounce and throttling rules up to date. The downside is that it requires restarting all of the services once the install is complete. Restarting StrongMail can take 30-45 minutes: StrongMail will report that everything restarted correctly, but it needs a “warm up” time. I have never gotten a good explanation of what goes on during this… Coincidentally, this “warm up” time is about how long it takes to write a report to StrongMail support about what’s going on.
To apply the updates, log in to the admin interface. In the left-hand navigation there is a link called “Live Updates”. Follow the instructions for an Online Installation.

Because of the restart, I try to perform this update only about once a quarter. We have random issues with our transactional mailings being deleted when SM gets restarted, and the behavior is very inconsistent. I have had the most luck NOT using the GUI and using the command line instead. I avoid the GUI at all costs. Basic tasks and monitoring can be done from it, but don’t depend on it fully…

We do not use Message Studio (they try to bundle and license it with everything)
All StrongMail installations start in the /data1 directory
StrongMail offers a “SHAC” (StrongMail High Availability Cluster) which I’ve had some good experience with. This uses DRBD and Linux HA for automatic failover. The /data1 directory is the one being replicated (you will see that it isn’t mounted on the non-active SM server).

Following a failover event, the logprocessor WILL NOT start properly and HAS to be restarted. This is a known bug. The only way to ensure that logprocessor is functioning properly following a failover from one node to another is to manually stop/start the logprocessor process using the following commands:

# /data1/strongmail/strongmail-mta/sm-server logprocessor stop
# /data1/strongmail/strongmail-mta/sm-server logprocessor start
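
Since this restart is needed after every failover, it can be worth wrapping in a small script. The following is only a sketch: the SM_SERVER variable and the restart_component function name are my own, and the pause between stop and start is an arbitrary guess rather than anything StrongMail documents.

```shell
#!/usr/bin/env bash
# Sketch: stop and then start one sm-server component (e.g. logprocessor).
# SM_SERVER defaults to the path used in this post; override it for testing.
SM_SERVER="${SM_SERVER:-/data1/strongmail/strongmail-mta/sm-server}"

restart_component() {
    local component="$1"
    local pause="${2:-5}"   # seconds between stop and start (assumed value)

    if [ -z "$component" ]; then
        echo "usage: restart_component <component> [pause-seconds]" >&2
        return 2
    fi

    # A failed stop is reported but does not prevent the start attempt.
    "$SM_SERVER" "$component" stop || echo "warning: stop returned non-zero" >&2
    sleep "$pause"
    "$SM_SERVER" "$component" start
}
```

After a failover, calling restart_component logprocessor runs the same two commands shown above, in order.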

There are 2 “master” scripts to start/stop/restart/status the StrongMail services:

[root@sm data1]# /data1/strongmail/strongmail-eas/sm-client status
-- StrongMail Client VERSION: --
smclient-scheduler: [ RUNNING ]
smclient-trackhttpd: [ RUNNING ]
smclient-httpd: [ RUNNING ]
logcollector: [ DISABLED ]
strongmail-logprocessor: [ RUNNING ]
strongmail-dataprocessor: [ RUNNING ]
strongmail-messageassemblyserver: [ RUNNING ]
strongmail-etlagent: [ RUNNING ]

…This is followed by a list of the active batch mailings and active transactional mailings.

[root@sm data1]# /data1/strongmail/strongmail-mta/sm-server

[root@sm data1]# /data1/strongmail/strongmail-mta/sm-server status
-- StrongMail Server VERSION:
smserver-named:                                            [  RUNNING  ]
goodmail-proxy:                                            [  DISABLED  ]
strongmail-server:                                         [  RUNNING  ]
smserver-logserver:                                        [  DISABLED  ]
strongmail-logprocessor:                                   [  RUNNING  ]
strongmail-dataprocessor:                                  [  RUNNING  ]
strongmail-messagequeue:                                   [  DISABLED  ]
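
For monitoring, this status output is easy to parse. Here is a rough sketch, assuming the "name: [ STATE ]" layout shown above stays stable across versions; the function name services_not_running is hypothetical. It reads status output on stdin and prints every service whose state is anything other than RUNNING or DISABLED.

```shell
#!/usr/bin/env bash
# Sketch: read `sm-server status` or `sm-client status` output on stdin and
# print each service whose state is neither RUNNING nor DISABLED.
services_not_running() {
    # Service lines look like:  strongmail-server:      [  RUNNING  ]
    # Header lines and anything else that does not match are ignored.
    sed -n 's/^\([A-Za-z0-9_-]*\):[[:space:]]*\[[[:space:]]*\([A-Za-z]*\)[[:space:]]*\][[:space:]]*$/\1 \2/p' |
    while read -r name state; do
        case "$state" in
            RUNNING|DISABLED) ;;          # expected states, stay quiet
            *) echo "$name $state" ;;     # anything else needs attention
        esac
    done
}
```

Usage would be something like: /data1/strongmail/strongmail-mta/sm-server status | services_not_running. No output means every service is either running or deliberately disabled.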

If space starts filling up, the first thing to check is that the log processor is running and cleaning up the logs correctly (/data1/strongmail/log). The next thing I would check is whether old batch mailings are being cleaned out properly (look for .db files in /data1/strongmail/data/databases other than the “Trans_<site>.db” files).

You will also want to make sure that old batch mailings are not hanging around longer than the retention policy allows. We currently have a 3-week retention policy, as defined in /data1/strongmail/config/strongmail-logprocessor.conf.
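
To see whether anything has outlived the retention window, something like the following can help. It is a sketch under assumptions: it treats a file’s modification time as the age of the mailing (which I have not confirmed with StrongMail), it relies on find supporting -maxdepth, and the function name list_stale_batch_dbs is made up. It only lists files and never deletes anything.

```shell
#!/usr/bin/env bash
# Sketch: list batch mailing databases older than the retention window.
# Files starting with "Trans_" are transactional and are always skipped.
list_stale_batch_dbs() {
    local dir="${1:-/data1/strongmail/data/databases}"
    local days="${2:-21}"   # 3-week retention, matching our config

    # -mtime +N matches files last modified more than N days ago.
    find "$dir" -maxdepth 1 -type f -name '*.db' ! -name 'Trans_*' \
        -mtime +"$days" | sort
}
```

Running list_stale_batch_dbs with no arguments checks the default database directory against a 21-day window; any output is a batch mailing that the cleanup should already have removed.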

Each batch mailing creates 3 files in 3 separate directories. If mailings aren’t being cleaned up properly you will see files in any of these directories. Prior to version 5.2, we had to clean these up ourselves via a bash script. This may have also been caused by how we were using the API to create the mailings.
/data1/strongmail/data/messages/<filename>.txt (and/or .html)
Files in these directories that start with “Trans_” are for transactional mailings and should not ever be removed.

Again, the most likely cause for them not being deleted is the dataprocessor process, which needs to be stopped and restarted. You must run a full stop on the process, kill any rogue dataprocessor processes, and then start the dataprocessor. In our case, the deletions then started at the time set in the dataprocessor config file (/data1/strongmail/config/strongmail-dataprocessor.conf).

I hope these tips help you to master StrongMail administration. Please leave any questions or suggestions on further StrongMail topics in the comments area.

Hard lesson from Dell’s PERC H700 Battery Write Cache

We ran into a huge issue recently… At about 9PM on a Friday night our primary MySQL database began slowing to a crawl. This is an extremely slow time of day for our web application so we were all quite confused. All of our tables live on a pretty beefy SAN and everything checked out clear there.

Lo and behold, the battery on our Dell controller (PERC H700 Integrated) had decided that it was time to re-learn its battery cycle.


 Sep 16 17:36:31 DB01 Server Administrator: Storage Service EventID: 2176 The controller battery Learn cycle has started.: Battery 0 Controller 0
 Sep 16 17:37:36 DB01 Server Administrator: Storage Service EventID: 2415 Controller battery is discharging: Battery 0 Controller 0
 Sep 16 17:37:36 DB01 Server Administrator: Storage Service EventID: 2248 The controller battery is executing a Learn cycle.: Battery 0 Controller 0
 Sep 16 18:37:52 DB01 Server Administrator: Storage Service EventID: 2278 The controller battery charge level is below a normal threshold.: Battery 0 Controller 0
 Sep 16 18:37:52 DB01 Server Administrator: Storage Service EventID: 2188 The controller write policy has been changed to Write Through.: Battery 0 Controller 0
 Sep 16 18:37:53 DB01 Server Administrator: Storage Service EventID: 2199 The virtual disk cache policy has changed.: Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC H700 Integrated)

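It turns out that this re-learning task happens by default on Dell servers every 90 days. While our data didn’t reside on local disk, our binlogs did. As the log above shows, the learn cycle drained the battery until the controller switched its write policy to Write Through, and that was apparently enough to bring MySQL to a crawl.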
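
If you want some warning next time, the two events worth alerting on are the learn cycle starting (EventID 2176) and the write policy changing (EventID 2188). A minimal sketch, assuming the log line format shown above; the function name learn_cycle_events is my own, and the event IDs are simply the ones from our log.

```shell
#!/usr/bin/env bash
# Sketch: read syslog lines on stdin and print the ones that signal a battery
# learn cycle starting (EventID 2176) or the controller write policy
# changing (EventID 2188).
learn_cycle_events() {
    grep -E 'Storage Service EventID: (2176|2188) '
}
```

Feed it whichever file your syslog writes Server Administrator messages to, for example learn_cycle_events < /var/log/messages (that path is an assumption and depends on your syslog configuration).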

The only way to change this behavior permanently is in the controller’s BIOS, where we can set it to only warn us that the battery needs to be checked. In the meantime, we can use some of Dell’s OpenManage tools to get more information on the status of the battery.

The first command here shows the battery’s status (with example output):

[root@svr /]$ omreport storage battery controller=0 battery=0
Battery 0 on Controller PERC H700 Integrated (Embedded)

Controller PERC H700 Integrated (Slot Embedded)
ID                        : 0
Status                    : Non-Critical
Name                      : Battery 0
State                     : Degraded
Recharge Count            : Not Applicable
Max Recharge Count        : Not Applicable
Predicted Capacity Status : Ready
Learn State               : Due
Next Learn Time           : 13 days 2 hours
Maximum Learn Delay       : 7 days 0 hours
Learn Mode                : Auto
[root@svr /]$
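
For scripting, the fields we care about most are Learn State and Next Learn Time. Here is a small sketch that pulls a single field out of that output, assuming the "Name : value" layout shown above; the function name battery_field is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read `omreport storage battery` output on stdin and print the value
# of one named field, e.g. "Learn State" or "Next Learn Time".
battery_field() {
    local field="$1"
    awk -F':' -v want="$field" '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)          # trim the field name
            if (key == want) {
                val = substr($0, index($0, ":") + 1)  # text after the first colon
                gsub(/^[ \t]+|[ \t]+$/, "", val)      # trim the value
                print val
            }
        }'
}
```

For example: omreport storage battery controller=0 battery=0 | battery_field "Learn State". A cron job could run this and send a warning whenever the answer is Due.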

We’re unable to disable the learn cycle altogether. However, we can push it back by 7 days with this command, which adds 7 days to the end of the learn cycle. The cycle should still be run, just preferably at a very off-peak time when we can monitor it.

 omconfig storage battery action=delaylearn controller=0 battery=0 days=7


The last command here forces write-back caching even if the battery is not available:

omconfig storage vdisk action=changepolicy writepolicy=fwb controller=0 vdisk=0

If you’re not much of a command-line guru, Dell’s OpenManage GUI can also show you the status.
If you see this first image, a battery learn cycle is about to begin.


If you see this second image, a battery learn cycle is currently in progress! You may see some of the symptoms that I mentioned above.


For some more info on this issue, I found a nice blog posting here: