Covalence

Let's Share

Recovering From Downtime, or Must We Turn Back the Clock?

I saw this article earlier this week on Hedgeworld and, at first, noticed nothing surprising. Trading halts happen all the time. A particularly dramatic version even shows up in Michael Lewis' Flash Boys, where a network admin takes down the Nasdaq for an afternoon after rolling out a software update in-flight.

But then:

"As a result of the latest trading halt, all day and session orders, including so-called good-through-date orders with an Aug. 24 trade date were cancelled. All open orders that have been acknowledged remained working, CME said."

So the exchange goes down for four hours and orders get cancelled? I get that this is a common way to handle the situation, and that rewinding the trading day is at least a consistent way to ensure nobody gets preferential treatment because of how a system issue was handled. But I wonder if there's a better way.

From a pure software manufacturer's perspective, my customers would be livid if I told them they had to redo things they spent all day doing because of a mistake I made. If I were someone who made money on the exchange that morning, I too would be pretty upset.

Software defects happen despite everyone's best efforts to avoid them. One of the things I'm learning about is how to fix them transparently. I'd be curious to know if exchanges have more targeted ways to accommodate system failures, i.e., rapid detection of errors, the ability to rapidly notify participants of them, and the ability to automatically roll back to the point where the downtime began. It seems to me that this would be the next level of quality at an exchange, and one can only hope that seamless error management is a priority on the to-do list.
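
To make the "roll back to the point the downtime began" idea a bit more concrete, here is a toy sketch in Python. It is purely illustrative and says nothing about how CME or any real exchange works: the `Event` and `SessionLog` classes, the timestamps, and the participant names are all hypothetical. The assumption is that every action is logged with a timestamp, so that once a failure time is known, only the events at or after it are discarded and only the affected participants are notified.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    timestamp: float   # seconds since session open (hypothetical unit)
    participant: str
    action: str        # e.g. "new_order" or "cancel"


@dataclass
class SessionLog:
    events: list = field(default_factory=list)

    def record(self, event):
        self.events.append(event)

    def roll_back_to(self, failure_time):
        """Keep events before failure_time; return the discarded ones."""
        kept = [e for e in self.events if e.timestamp < failure_time]
        dropped = [e for e in self.events if e.timestamp >= failure_time]
        self.events = kept
        return dropped


def participants_to_notify(dropped):
    """Unique participants whose events were discarded, sorted by name."""
    return sorted({e.participant for e in dropped})


# Hypothetical session: the failure is detected as starting at t=30.
log = SessionLog()
log.record(Event(10.0, "alice", "new_order"))
log.record(Event(20.0, "bob", "new_order"))
log.record(Event(35.0, "alice", "cancel"))

dropped = log.roll_back_to(failure_time=30.0)
affected = participants_to_notify(dropped)
```

In this sketch, everything recorded before the failure survives, which is the opposite of wiping the whole day. The hard part in practice is presumably knowing the failure time precisely, and that is exactly where rapid detection comes in.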