Thursday, June 3, 2010

squeezing into an ever shrinking maintenance window

There once was a down day in store
But the mill wanted just to roll more
For early on Monday
They'd lost nearly one day
Turned up bar had destroyed feed roll door

Rediscovered this comment from June 2009 on the frustration of seeing yet another planned maintenance day cancelled so the plant could stay operational.

This is one of the challenges of supporting computer systems in a market where plant throughput is valued higher than essential maintenance. After all, that's what weekends are for and your systems support personnel don't have a real life anyway, just a second life at work. In the words of one of our managers, "I know weekends are obnoxious but the work has to be done sometime (nudge, nudge, wink, wink)". Another subtle hint.

My wife refers to my periods away from home during the week as meeting up with my other wife/mistress. The server might be a lot like a nagging wife and the fans do whine a bit. She calls in the middle of the night when she hasn't received enough attention. A bit of remote fiddling seems to quieten her. She's getting on in years now. Perhaps her hardware has a seven year itch, coinciding with the time the manufacturers withdraw extended support for her.

There'll be other suitors knocking on the door soon, a fancy blade server might want my attention, or some high end virtualisation rig. Who knows, I might take on their advances. They promise increasing flexibility for essential maintenance tasks and that could mean more time to dedicate to my real wife and family. Just have to make the manager believe that not doing something will lead to a far more obnoxious outcome.

Maintenance periods shrink ever smaller.
The amount of maintenance grows ever bigger.
The only solution is to perform the maintenance continually as the window shrinks to nothing.

Thursday, March 18, 2010

sanity check - are my documents readable now?

I have worked with the same company since 1982.

I have office documents on my hard drive that have been around since 1985, from software as early as Lotus Symphony and 1-2-3, emails that I have preserved since I first had an email account in 1995, and our full system documentation, which dates from around 1989 when I converted it over a 3 month period from Wang word processor format.

But just how many are actually readable?

The more active documents get migrated through office application updates and through use, reuse, duplication, updates and a natural requirement to be widely read in different formats.

But seeing this site covering Document Freedom Day 2010 made me think a bit deeper.

Sure, I might never read some of the system documentation ever again, as it has become second nature to me, but I have vowed to retire some day so one of my clones will have to wade through this jungle of document flavours with a machete. Why? Because I am not only tasked with maintaining domain knowledge at work but with family history at home. Try explaining, in 20 years' time, that those digital memories of that special wedding day or baby's first steps can only be read by flying over to Russia and running them on some archaic 1990s hardware in a museum maintained by some aging, Jolt addicted geek retiree.

So we preserve document files, but how often does a business think to maintain some means to read them? Software vendors go out of their way to define their own proprietary formats that do nothing to solve this problem. Haven't they heard of backwards compatibility!!! Imagine the trouble this would cause the *NIX community in the management of operating system upgrades. PC users are too gullible about this, fixated on new features without fully realising the impact on their heritage file formats.

Hang on, I think I have the 5 1/4" Word 1.0 install disks somewhere ....

Sunday, March 14, 2010

get cosy with compiler flags

The compiler flags are your friends.

I say that from experience, having just integrated some sweet tracing libraries into our make environment and by chance activating advanced type checking for our FORTRAN compiles, all 100,000 lines of them.

All sorts of mismatches started appearing in the compilation logs. Applications that had been compiled, linked and used for the last 23 years now showed mismatches in the types being passed in subroutine calls. Some harmless, but others quite ready to stab you in the back on the day that you decided that wearing chainmail shirts to work was so old fashioned.

How about these:

Missing a required subroutine argument or supplying extra arguments.
Passing zero as an integer to a routine expecting a real.
Returning a real to the calling subroutine when an integer was expected.
Passing a real array but then treating the elements as if they were double precision.
Expecting no loss in precision when combining reals and double precision numbers.
Generally finding that multiple calls to the same subroutine are not consistent.
Misuses of global common modules by concurrent equivalencing using inconsistent types.

All of this revealed by a simple compiler flag.
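To see why a few of the mismatches listed above bite, here is a rough Python sketch that reinterprets raw bytes much as a FORTRAN routine does when caller and callee disagree on an argument's type. It is an illustration only, not our code: the values are made up and a little-endian layout is assumed.

```python
import struct

# Four single-precision values, as a caller might pass in a REAL array.
reals = [1.0, 2.0, 3.0, 4.0]
raw = struct.pack("<4f", *reals)  # 16 bytes of storage

# A callee that declares the same argument DOUBLE PRECISION sees those
# 16 bytes as two 8-byte values, neither of which matches the input.
as_doubles = struct.unpack("<2d", raw)

# Integer zero and real zero share a bit pattern, so that mismatch stays
# hidden. Integer one does not, and arrives as a tiny denormal number.
zero_as_real = struct.unpack("<f", struct.pack("<i", 0))[0]
one_as_real = struct.unpack("<f", struct.pack("<i", 1))[0]

# Mixing precisions: 0.1 stored as a real is not the double 0.1.
real_tenth = struct.unpack("<f", struct.pack("<f", 0.1))[0]

print(as_doubles)
print(zero_as_real, one_as_real)
print(real_tenth == 0.1, abs(real_tenth - 0.1))
```

The integer zero case is one of the harmless ones: it works by accident until somebody changes the constant.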

Get to know the flags that are available, not just for your FORTRAN compiles, but for any language. While it is true that activating one will likely increase the volume of output when you compile, it can save you from decades of thinking that the code is type consistent.

Better to curse your compiles than to have someone cursing your attention to detail a couple of decades down the track.

the hazards of cumulative programmers

Ever wonder how long an error or omission in a piece of code can go undetected?

I found one in a piece of FORTRAN code last week that had remained undetected for around 23 years. It was based around the assumption that a particular method of processing for a slab of steel was the same every time. But conditions do change and the result of a small change in incoming dimensions coupled with the use of a particular processing selection produced an unfavourable result in the final product. As usual, while it would be justifiable for me to assume that this was one programmer's shortcoming, the real story is more mundane.

Programmer A assumes that the product will always be processed in the same way and determines critical product parameters based on where he thinks the process starts and ends.

Programmers B, C, D, E, F and G work on the same 25000 lines of code over the next 23 years, making unrelated changes that all slightly alter the conditions that programmer A's coding solution held valid under, until such time that, in combination, they result in a truly unfavourable, fuel tanker, freight train collision cum explosion of a bad product outcome.

Scratch many tonnes of finished product and suddenly, programmer H, who has been left to cook with such spaghetti, is caught in the headlights of a large truck being driven by the angry production manager.

So programmer H has to sort it all out over the following week until the "aha" moment arrives. One line of missing code is all it took to wreak havoc. Inserting it is simple but satisfying and all but guarantees some future proofing against a repeat of this carnage.

Imagine an error like that bringing down a bird at the end of NASA's shuttle program. Unthinkable but possible. How many probes have now been lost due to such errors? And it ain't a short trip to Mars to reload either.

Monday, March 1, 2010

quote this

Quotes, quotes, quotes.

Frustration, SQL and crontab go together again.

The SQL script was running well but then I have to go and add a date to the crontab line.

Arrggghhh.

No execution. No errors. No output to file. Cron is usually quite nice and throws an error that finds its way via email to me. But not today.

What have I missed?

Aha. What was that thing Sandra Bullock said in the widely panned movie "The Net" ... "Escape the system". Well I would if I could or was willing to get paid less but this is the solution.

If you want to embed this sort of thing

date +%w

you have to escape the % character to avoid it being interpreted by cron

date +\%w
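For the curious, crontab(5) says that an unescaped % in the command field is turned into a newline, and everything after the first % is fed to the command as standard input. The Python sketch below is a simplified model of that rule, not cron's actual code. The function name split_cron_command is my own invention, and real cron implementations differ in the details.

```python
def split_cron_command(field):
    """Model how cron treats '%' in the command field (see crontab(5))."""
    parts, current, i = [], "", 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field) and field[i + 1] == "%":
            current += "%"  # escaped: a literal percent sign survives
            i += 2
        elif field[i] == "%":
            parts.append(current)  # unescaped: acts as a line break
            current = ""
            i += 1
        else:
            current += field[i]
            i += 1
    parts.append(current)
    command = parts[0]  # only this part is handed to the shell
    stdin = "\n".join(parts[1:]) if len(parts) > 1 else None
    return command, stdin


print(split_cron_command("date +%w"))
print(split_cron_command("date +\\%w"))
```

In this model the unescaped version hands the shell a command cut off at "date +", with the stray "w" sent along as input, while the escaped version arrives intact.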

The full line with my script name scrubbed to avoid potential embarrassment and with a few extra things thrown in for confusion or good measure, depending on your experience.

54 22 * * 1-6 csh -c "setenv ORACLE_HOME $ORACLE_BASE/product/8.1.6;setenv ORACLE_SID instance1;rm /tmp/datafile_`date +\%w`.dat;sqlplus user/password @ /myscripts/extractData.sql `/myscripts/lastWeek.pl` 07:00:00 > /outdir/datafile_`date +\%w`.dat"

The script lastWeek.pl returns the date 7 days back from today in the format dd/mm/yyyy.
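The Perl source is not shown here, so the following is only a hypothetical Python stand-in that does the same job as described: subtract seven days and print the result as dd/mm/yyyy. The function name last_week is mine.

```python
from datetime import date, timedelta


def last_week(today=None):
    """Return the date seven days before 'today' as dd/mm/yyyy."""
    today = today or date.today()
    return (today - timedelta(days=7)).strftime("%d/%m/%Y")


print(last_week())
```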

Sanity restored. Sleep ensuing.

Thursday, January 14, 2010

anyone for a date

Great Christmas holidays are oft followed by despair.

This year was no exception. Fancy discovering, a couple of days before a manufacturing plant starts up for the new year, that a bug hiding since 1998 has chosen to show itself. During a system upgrade in 1998, one developer had decided that handling a 2 digit year in an incoming file transfer would only require checking the first digit to determine which century it lay within. Consequently, with the new year rollover to 2010, we found ourselves back in 1910. A problem on a couple of counts.

One, it lay before the UNIX epoch of 1/1/1970 and two, it was out of range of our now legacy third party real time database.
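The original routine is not reproduced here, so the Python sketch below is only a guess at the shape of the 1998 logic, set beside a common windowing fix. Both function names and the pivot value of 70 are assumptions for illustration, not what we actually released.

```python
def century_by_first_digit(yy):
    """Guess at the 1998 logic: a leading '0' means 20xx, else 19xx."""
    return (2000 if yy[0] == "0" else 1900) + int(yy)


def century_by_pivot(yy, pivot=70):
    """Windowed fix: years below the pivot are 20xx, the rest 19xx."""
    value = int(yy)
    return (2000 if value < pivot else 1900) + value


for yy in ("98", "09", "10"):
    print(yy, century_by_first_digit(yy), century_by_pivot(yy))
```

The first-digit test holds up for every year from 2000 to 2009 and then sends "10" back to 1910. The pivot version only postpones the problem, in this case until the year 2070.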

An easy thing to fix, followed by a full system build and release on a Friday afternoon, all ready for a weekend startup.

But it has now exposed a bigger problem, the UNIX end of signed 32 bit date problem, which some say may become known as the UNIX millennium bug or the Friday the 13th bug, because the date rolls over to Fri Dec 13 20:45:52 1901 one second after the clock reaches Tue Jan 19 03:14:07 2038 UTC.

A potentially larger problem due to the total number of impacted systems across the planet, including many that are embedded.
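The arithmetic is easy to check. This Python sketch assumes the clock is a signed 32 bit count of seconds since 1/1/1970 UTC and wraps it by hand, since Python integers do not overflow on their own. The helper names to_int32 and utc_from_seconds are my own.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT32_MAX = 2**31 - 1


def to_int32(value):
    """Wrap an integer the way a signed 32 bit counter would."""
    return (value + 2**31) % 2**32 - 2**31


def utc_from_seconds(seconds):
    """Convert seconds since the epoch to a naive UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


last_good = utc_from_seconds(INT32_MAX)
after_wrap = utc_from_seconds(to_int32(INT32_MAX + 1))

print(last_good.strftime("%a %b %d %H:%M:%S %Y"))
print(after_wrap.strftime("%a %b %d %H:%M:%S %Y"))
```

The two lines printed are the last representable second in 2038 and the moment the counter lands on after wrapping, which falls in December 1901.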

There are many references to the problem, particularly at http://www.2038bug.com/, with some interesting predictions about the response, which I have extended with my own thoughts and descriptive language:
- lazy programmers wait till the last minute to fix the problem through lack of incentive or management pressure
- smart programmers use the problem to create another dot com frenzy and the sharemarket hits new highs
- the media blows the problem up into a doom and gloom scenario where even the most basic social structures will break down, all without the benefit of a zombie attack
- terrorists direct their attention to fostering the installation of as many bug prone systems as possible in the hope of the mother of all meltdowns.