Time Gentlemen Please!!!

Posted by Riomhaire Research on Wednesday, July 13, 2016

Do you know how well your application functions? Can you tell what the average time to handle a request is? The worst? Can you tell me how your application behaves over time? If not, then you have missed one of the most common “unwritten” requirements – that of instrumenting your code.

Hang on, I hear you say! You can profile the application, and unit tests can give you this information. That may be true, but it is never a substitute for building instrumentation into your code. You never know when it will come in handy, and anyway, testing your code on your development machine is rarely anything like running it in the real deployment environment.

For example, many years ago I developed some middleware for a large and very profitable international corporation. The piece of code was to mediate between the front end and the back-end billing/resource system. Development went well, and as usual I inserted instrumentation as I went along, before the product was released into the production environment under the watchful eye of the application management group. To help roll out the product there was a massive marketing campaign – TV, billboards, the whole nine yards – the first time I have ever seen anything I developed up on TV.

About a month later my line manager (a really nice lady) came to me and said: “Gary, there seems to be a problem with the system; I have just had my arse kicked by sales and marketing. The product is a great success, but there are a lot of unhappy customers who get ‘time out’ errors from the GUI. The GUI guys say it’s because the connection to your middleware is timing out, and application support says the 3 machines in the deployment cluster are fine. Could you investigate and get back to me?” Sound familiar? Talk about a bombshell and a potentially career-limiting move.

With the manager there I said, “Let’s look at the instrumentation and performance screens.” A couple of minutes later the situation became clear. Two of the three machines in the middleware cluster were “down” for the times and periods mentioned, but for different reasons: one machine was “up” and running but had no performance metrics; the other machine was not up at all! So in effect we were running at one-third capacity – no wonder we had problems.

We paid a quick visit to application support. To cut a long story short, it turned out the machine which was down was “being upgraded” – why that was done in the “busy period” and why it took a week would be another article. The machine that was “up” but had no statistics turned out not to be included in the cluster configuration at all, so it was not being used!

These were soon fixed and everything went smoothly – especially once I wrote a script to monitor the statistics and analyze the results, so that my manager would know on a daily/weekly/monthly basis how many actual transactions went through the system, how long they took, when they took place, what the system throughput was and what we were capable of supporting. As a result my manager did not need to ask sales and marketing how many sales went through – she already knew. And she did not need to contact application support unless the statistics told her that one or more of the servers was down – whereupon she went looking for application support.

I suppose the moral of this little story is: if you don’t instrument your code, you will never know you have a problem until someone with more power and clout than you comes looking for you with a very bad attitude and a big club embedded with nails.

As I said before: in many ways instrumentation is one of the major “unwritten” requirements in any project – along with a good logging strategy. The tough question is what do you instrument and what are the options?

As with most things in development there are many options and variants – you just need to choose what feels right for your project. A passionate and professional developer, after a few years in the minefield of software development, gets to know what will and, more importantly, what will not work in a given situation.

Ninety percent of software development involves something (a user or other software) invoking a method or command on some target object. Some of the things you can track from that are:

  1. How long the invocation took.
  2. The result of the invocation (success, error etc).
  3. The average invocation time of this method/command.
  4. The maximum time of any invocation.
  5. The minimum time of any invocation.
  6. The invocation/min/max/average times from 1-5 for the last n commands.
  7. Values 1-5 but broken down by success/error etc – can help if “success” times take a lot longer than the errors.
  8. Values 3-6 but over time.
  9. How much memory was used.
  10. How many times the method/command has run.
  11. Tallies on errors.
  12. Deviation counters – how many invocations took less than 1 second, less than 10 seconds, less than 100 seconds, less than 1000 seconds and so on, for all/success/error etc.
  13. Tallies by hour, day, week, month etc.
  14. Throughput calculation by the second, minute, hour or day, depending on the domain.
  15. Error rates by second, minute, hour or day, depending on the domain.
  16. Drop rates – in some systems requests can time out. You might need to track these.
  17. Uptime.
  18. Load – number of concurrent invocations etc.
  19. Memory usage (very very rough in Java).
  20. Data transfer – if you can calculate request and response sizes.

This list is by no means complete – there are many other options and variations.
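To make a few of the items above concrete, here is a minimal sketch in Java of a per-command statistics holder: run count, minimum/maximum/average time, success/error tallies and the “less than 1 second, less than 10 seconds...” deviation counters. The class name InvocationStats, its methods and the bucket limits are all made up for illustration; treat it as one possible shape under those assumptions, not a finished library.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical in-memory statistics for a single method/command. */
final class InvocationStats {
    enum Outcome { SUCCESS, ERROR }

    // Assumed bucket limits in milliseconds: <1 s, <10 s, <100 s, <1000 s, then "everything else".
    private static final long[] BUCKET_LIMITS_MS = {1_000, 10_000, 100_000, 1_000_000};

    private long count;
    private long totalNanos;
    private long minNanos = Long.MAX_VALUE;
    private long maxNanos;
    private final long[] buckets = new long[BUCKET_LIMITS_MS.length + 1];
    private final Map<Outcome, Long> tallies = new EnumMap<>(Outcome.class);

    /** Records one finished invocation and its outcome. */
    synchronized void record(long nanos, Outcome outcome) {
        count++;
        totalNanos += nanos;
        minNanos = Math.min(minNanos, nanos);
        maxNanos = Math.max(maxNanos, nanos);
        tallies.merge(outcome, 1L, Long::sum);

        // Find the first bucket whose limit the duration stays under.
        long millis = nanos / 1_000_000;
        int i = 0;
        while (i < BUCKET_LIMITS_MS.length && millis >= BUCKET_LIMITS_MS[i]) {
            i++;
        }
        buckets[i]++;
    }

    /** Times the call; an exception counts as ERROR and is rethrown to the caller. */
    <T> T time(Supplier<T> call) {
        long start = System.nanoTime();
        Outcome outcome = Outcome.ERROR;
        try {
            T result = call.get();
            outcome = Outcome.SUCCESS;
            return result;
        } finally {
            record(System.nanoTime() - start, outcome);
        }
    }

    synchronized long count() { return count; }
    synchronized long minNanos() { return count == 0 ? 0 : minNanos; }
    synchronized long maxNanos() { return maxNanos; }
    synchronized double averageNanos() { return count == 0 ? 0.0 : (double) totalNanos / count; }
    synchronized long tally(Outcome outcome) { return tallies.getOrDefault(outcome, 0L); }
    synchronized long bucket(int index) { return buckets[index]; }
}
```

You would typically keep one such object per method or command name and wrap the call site, for example stats.time(() -> billing.charge(order)), where billing and order stand in for whatever your own code invokes.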

From the above you can also work out which invocations were executing concurrently, which can help in tracking down the hog processes when certain “odd” things occur that depend on what is running and when. For example, take two processes that use a lock to access some resource. When run individually everything is fine. When they run concurrently one gets the lock and the other has to wait; if the process that has to wait has a hard time constraint then you might have issues.
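As a rough illustration of working out what ran concurrently: if each invocation is recorded with a start and end time, two invocations overlapped when each one started before the other finished. The sketch below assumes a hypothetical InvocationLog class with made-up method names, and timestamps supplied by the caller (for example from System.nanoTime()).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical log of finished invocations with their start and end times. */
final class InvocationLog {
    static final class Invocation {
        final String name;
        final long startNanos;
        final long endNanos;

        Invocation(String name, long startNanos, long endNanos) {
            this.name = name;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
        }

        /** Two intervals overlap when each one starts before the other ends. */
        boolean overlaps(Invocation other) {
            return startNanos < other.endNanos && other.startNanos < endNanos;
        }
    }

    private final List<Invocation> finished = new ArrayList<>();

    /** Stores one finished invocation and returns it so it can be queried later. */
    synchronized Invocation add(String name, long startNanos, long endNanos) {
        Invocation invocation = new Invocation(name, startNanos, endNanos);
        finished.add(invocation);
        return invocation;
    }

    /** Names of all other recorded invocations that were running at the same time as the target. */
    synchronized List<String> concurrentWith(Invocation target) {
        List<String> names = new ArrayList<>();
        for (Invocation other : finished) {
            if (other != target && other.overlaps(target)) {
                names.add(other.name);
            }
        }
        return names;
    }
}
```

When a time-constrained command turns up with an unusually long duration, asking the log what it overlapped with is a quick way to spot the lock holder. In a real system you would cap the log at the last n entries rather than let it grow forever.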

Another thing to do is make sure that all this is kept in an in-memory store for performance reasons – writing every measurement straight to a traditional DB back end such as Postgres could have quite an impact. This does not prevent you from building in a mechanism to flush to a backing store during quiet periods or every 5 minutes or so.
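One way this could look in Java, sketched under assumptions: counters live in a ConcurrentHashMap and a background ScheduledExecutorService hands a snapshot to a “sink” at a fixed interval. The class name FlushingCounters and the sink callback are hypothetical; in practice the sink might write to a file or a database table, which is deliberately left out here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Consumer;

/** Hypothetical in-memory counters that are periodically flushed to a backing store. */
final class FlushingCounters implements AutoCloseable {
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final Consumer<Map<String, Long>> sink;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "stats-flush");
                thread.setDaemon(true); // never keep the JVM alive just for statistics
                return thread;
            });

    FlushingCounters(Consumer<Map<String, Long>> sink, long period, TimeUnit unit) {
        this.sink = sink;
        // Swallow sink failures: an exception escaping a periodic task cancels all later runs.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                flush();
            } catch (RuntimeException e) {
                // A real system would log this; the totals stay in memory for the next flush.
            }
        }, period, period, unit);
    }

    /** Hot path: a cheap in-memory update, no I/O. */
    void increment(String key) {
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    /** Hands a snapshot of the cumulative totals to the sink. */
    synchronized void flush() {
        Map<String, Long> snapshot = new HashMap<>();
        counters.forEach((key, adder) -> snapshot.put(key, adder.sum()));
        if (!snapshot.isEmpty()) {
            sink.accept(snapshot);
        }
    }

    /** Stops the background flushing and writes one final snapshot. */
    @Override
    public void close() {
        scheduler.shutdown();
        flush();
    }
}
```

Something like new FlushingCounters(snapshot -> store.save(snapshot), 5, TimeUnit.MINUTES) would give the “every 5 minutes or so” behaviour, with store being whatever persistence you already have. The sketch flushes cumulative totals rather than resetting the counters, so a failed or skipped flush loses nothing.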