When deploying software, especially custom-built software, things will go wrong, typically at the wrong time. This was a standard upgrade of our application system, which is quite complex. It required us to shut down and upgrade the database systems as well as 32 other machines.
Things go wrong no matter how much you prepare. This is one of those stories.
Part of our solution is a Kannel server with multiple SMPP binds (see Config 1) talking to an application server using the URLs configured in Kannel. The flow is as follows:
- An SMS message comes from a phone over the SMPP bind (SMSC).
- Kannel receives the message, decodes it, and sends it to the SMS-SERVICE (see Config 2).
- The SMS-SERVICE is defined as an HTTP GET (the get-url in Config 2).
- The response to this is then sent back over the SAME SMPP bind the message was received on.
    # Config 1.
    group = smsc
    smsc = smpp
    smsc-id = smpp_link1
    host = 192.168.0.2
    port = 1200
    receive-port = 0
    smsc-username = BINDID
    smsc-password = PASSWD
    max-pending-submits=100
    # 10=25+TPS, 30=75+TPS, 40=100 TPS
    system-type = ""
    log-file = "/opt/kannel/log/smpp_link1.log"
    log-level = 2
    source-addr-ton = 1
    source-addr-npi = 1
    dest-addr-ton = 1
    dest-addr-npi = 1
    bind-addr-ton = 1
    bind-addr-npi = 1
The link between the SMPP connection and the HTTP processing is the “smpp_link1” tag.
    # Config 2.
    # SMS SERVICE Default
    # there should be default always
    group = sms-service
    keyword = default
    catch-all = true
    omit-empty = true
    accept-x-kannel-headers = true
    accepted-smsc = "smpp_link1"
    max-messages = 0
    get-url=http://imps-smsproxy:9009/co?from=%p&to=%P&text=%a&originSMSC=%i
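To make the get-url concrete, here is a rough Python sketch of how the template in Config 2 turns into an actual HTTP GET. It assumes the usual meaning of the Kannel escape codes (%p sender, %P receiver, %a message text, %i smsc-id). The phone numbers and message text are made up for illustration, and Kannel's exact URL encoding (for example, how it encodes spaces) may differ from Python's.

```python
from urllib.parse import quote

# The get-url template from Config 2.
TEMPLATE = "http://imps-smsproxy:9009/co?from=%p&to=%P&text=%a&originSMSC=%i"


def expand_get_url(template, sender, receiver, text, smsc_id):
    """Approximate how a get-url template is filled in for one message.

    Assumed escape codes: %p = sender, %P = receiver,
    %a = full message text, %i = smsc-id of the receiving bind.
    """
    values = {"p": sender, "P": receiver, "a": text, "i": smsc_id}
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template) and template[i + 1] in values:
            # URL-encode the substituted value (Python's encoding, not Kannel's).
            out.append(quote(values[template[i + 1]], safe=""))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical message: sender, short code, and text are invented.
    print(expand_get_url(TEMPLATE, "+15550100", "12345", "hello world", "smpp_link1"))
```

The originSMSC parameter is what lets the application's reply find its way back over the same bind the message arrived on.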
The use case can be summed up as an “SMS to HTTP” gateway, at least for this part of the system. The rest of the system is not relevant to this discussion.
While we would love to have a “rolling upgrade”, the system is still in the process of being re-architected to support that. The build had been well tested, except for one feature that is quite difficult to test: the SMS portion from the carrier. The risk was considered acceptable, as there were no changes in that part of the code.
Unfortunately, a partial change had made it into the code: the new gateway required a specific HTTP header on the Kannel request, which the development team had not told us about. They had built a new custom branch of Kannel that added this header, but it was not yet complete!
While most of the services were functioning properly, this one feature – minor but important – was now broken.
Can we Fix it? Yes we can!
Reading the logs of our own applications, we could see that the application was expecting a specific header on the incoming request.
Our release of Kannel does not support adding new HTTP headers (nor did we find a way to do it in other releases). Installing the new release of Kannel that the dev team had built was not an option, as it was not yet complete. The only other option was a rollback, unless we could “fix it fast”.
Back in November I had attended Amazon’s re:Invent 2013 conference. One of the talks, “They don’t hug back”, was about distancing yourself from naming your infrastructure, and Martin Rhoads presented SmartStack, something he and Igor Serebryany from AirBnB had built to solve service discovery. While the whole thing is quite impressive, one item that I had been exploring as a quick solution was to use haproxy as a reverse proxy for service failover instead of setting up a fully redundant haproxy pair.
Hack and Hack some more
With the clock ticking down, we installed haproxy on our Kannel box and set up a configuration where it would accept connections on localhost, on the same port our application used, and forward each request to the same target Kannel was forwarding to, adding the new header to the HTTP request. Here is the original Kannel request (from Config 2):

    get-url=http://imps-smsproxy:9009/co?from=%p&to=%P&text=%a&originSMSC=%i
The haproxy configuration is a simple listen section (the reqadd line is what adds the request header we needed):
    listen smsproxy
        reqadd x-imps-instance:\ test
        stats enable
        stats uri /haproxy?stats
        bind 127.0.0.1:9009
        balance roundrobin
        server static imps-smsproxy:9009 check
HAProxy was now set up to accept and forward modified requests. We could also now test the fix without impacting the production system. We could even have set up a full parallel Kannel system to perform end-to-end testing. Once testing was successful, we configured Kannel to point to the HAProxy frontend on localhost that we defined in the listen section above.
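For readers who want to see the effect of the header injection without installing anything, here is a toy stand-in written in Python. It is not how HAProxy works internally and it is not what we ran in production. It only illustrates the idea: a local proxy accepts the GET, adds the x-imps-instance header, and forwards the request to the real target. The FakeApp server, the helper names, and the example path are all assumptions made up for this sketch.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SEEN_HEADERS = []  # headers received by the stand-in application

# Ignore any proxy settings in the environment; everything here is local.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class FakeApp(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the application server: records headers."""

    def do_GET(self):
        SEEN_HEADERS.append(self.headers)  # case-insensitive header mapping
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass


def make_proxy(upstream_port, header_name, header_value):
    """Build a handler that forwards GETs upstream with one extra header."""

    class AddHeaderProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # Toy behaviour: only the path is forwarded, plus the new header.
            url = f"http://127.0.0.1:{upstream_port}{self.path}"
            req = urllib.request.Request(url, headers={header_name: header_value})
            with OPENER.open(req) as resp:
                status = resp.status
                body = resp.read()
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass

    return AddHeaderProxy


def start(handler):
    server = HTTPServer(("127.0.0.1", 0), handler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    app = start(FakeApp)
    proxy = start(make_proxy(app.server_port, "x-imps-instance", "test"))
    url = f"http://127.0.0.1:{proxy.server_port}/co?from=1&to=2&text=hi&originSMSC=smpp_link1"
    with OPENER.open(url) as resp:
        reply = resp.read()
    proxy.shutdown()
    app.shutdown()
    return reply, SEEN_HEADERS[-1]


if __name__ == "__main__":
    reply, headers = demo()
    print(reply, headers.get("x-imps-instance"))
```

The client (Kannel, in our case) never knows the header was added. It only needs to be pointed at the local port instead of the real target.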
Kannel needs a restart to load a new configuration. The configuration can be reloaded, but our experience has taught us that it is better to fully stop and start the software. Once it was restarted, the software functioned properly with the new header. We had again saved the day.
Not just a hack
While I present this as a hack, it has become very useful. Here are some of the pros and cons:
- If we need to change the endpoint, we can do so without restarting Kannel. The HAProxy reload is more reliable, and it can be done without cutting existing connections. That is important when you are running at 300 msg/sec.
- HAProxy can be told to check if the target port is up or down, and we can fail over to a second system. This comes to us “free”, without additional development. The code is also robust and well tested.
- We now monitor HAProxy, and HAProxy can tell us if there are issues talking to our application. HAProxy also has a very good statistics page that we can pull into our performance graphs.
- HAProxy has robust logging of connection latency, which is amazing.
- Another component that can go wrong. This is true, but it is not that complex (though it can be). HAProxy is very efficient and fast, and easy to use in a simple configuration. Also, since it runs on the same node as Kannel, there is no real point in having a failover pair: if the node fails, no traffic will be generated anyway. (If you didn’t get that right away, it will come.)
- Another component to configure. This is also true, but we are only using it for one outbound connection. AirBnB has released the SmartStack tools to the open source community, so there are tools to manage this complexity. The biggest saving grace, though, is that this is not a complex configuration, unlike an incoming load balancer.
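As an illustration of the monitoring point above: HAProxy can export its statistics page as CSV (by appending ;csv to the stats URI), which makes it easy to feed into graphs or alerts. Below is a minimal Python sketch of parsing that output. The sample text is invented and heavily trimmed (a real export has many more columns), and the function names are my own, not part of any HAProxy tooling.

```python
import csv

# Made-up, trimmed sample of a stats CSV export. Real output has far more columns.
SAMPLE = """# pxname,svname,scur,stot,econ,status,
smsproxy,FRONTEND,3,1200,,OPEN,
smsproxy,static,3,1200,0,UP,
smsproxy,BACKEND,3,1200,0,UP,
"""


def parse_haproxy_stats(csv_text):
    """Turn a stats CSV export into a list of dicts keyed by column name."""
    lines = csv_text.strip().splitlines()
    # The header line starts with "# " and every line ends with a trailing comma.
    header = lines[0].lstrip("# ").rstrip(",").split(",")
    rows = []
    for record in csv.reader(lines[1:]):
        if record:
            rows.append(dict(zip(header, record)))
    return rows


def servers_not_up(rows):
    """Return (proxy, server) pairs for real servers whose status is not UP."""
    return [
        (row["pxname"], row["svname"])
        for row in rows
        if row["svname"] not in ("FRONTEND", "BACKEND")
        and not row["status"].startswith("UP")
    ]


if __name__ == "__main__":
    rows = parse_haproxy_stats(SAMPLE)
    print(servers_not_up(rows))
```

A check like servers_not_up is the kind of thing that can be wired into whatever alerting you already have. The counters (sessions, connection errors) are what end up in the graphs.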
Using tools in unconventional ways
There would have been many ways to fix the above problem (we eventually deployed the originally designed fix), but it shows that there is a lot of value not just in knowing how to deploy the tools you use, but also in understanding how they work, as you may end up using them in a way that is not quite what was in the book.
What superhero tales do you have to tell about saving the day in an odd way?