Home for HMNL Enterprise Computing

Big Old Data

Ian Tree  28 October 2013 10:58:00

Just When Did Data Get Big?

It never did, of course. Data and the demands on processing it have always been bigger than the box of processing capacity that we had to put it in.

I was having a discussion the other evening with some fellow enterprise architects about the design challenges they faced in handling the Big Data needs of their respective organisations. Somewhat fuelled by the fact that one of them was picking up the drinks tab, I was forced to confront them with the fact that the problems they were grappling with were not the new high frontier of enterprise IT. We have been here before.

Back in the Day

"Back in the Day" in this case means back in the 1970s. In those days, financial and marketing organisations were crying out for more data to drive their strategic and operational systems, and technological development was increasing our capacity to process it. In the marketing arena we had to deal with ever-increasing volume and variability of data, and with demands to process it at ever-increasing velocity. The challenge was that there were only 24 processing hours in a day, every day brought a new operational schedule of processing that had to be completed, and projections of elapsed run time for the daily schedule were forever heading towards 36 hours within the next six months.

There was "NoSQL", sorry, I mean "No SQL", at that time. Ted Codd was still busy working on the theory, and IBM was toying with experimental designs for "System R". Even if SQL had been around, disk storage was so slow, so low in capacity per unit, and so hugely expensive that a disk-based database of the size needed was a dream too far.

The marketing master database was a multi-reel tape sequential file; some of these databases had quite sophisticated data structures, consisting of a sequentially serialised form of a hierarchic structure. Tapes and decks of cards would arrive from data preparation bureaus containing event records and update transactions that needed to be applied to the master database. The incoming transactions would be validated and then sorted into the sequence of the master database. Once the daily input was available, the "Daily Update Run" would commence: a copy-with-updates process that read the latest generation of the master database and wrote a new generation with the updates applied and new records inserted.

Once the daily update job had completed, a number of analytics and reporting jobs would be run, each requiring a sequential pass through the complete master database. After the analytics jobs came the extraction jobs, which pulled complex selections of subsets of the data; these extracts were used for generating direct marketing offers and other purposes, and again each of these jobs required a complete sequential pass through the master database.
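The copy-with-updates run described above is the classic sorted-merge update of a sequential master file. A minimal sketch in Python, assuming a hypothetical record layout of `(key, data)` pairs with both the master file and the day's transactions already sorted by key (matching transactions replace the master record; unmatched ones are inserts):

```python
def daily_update(master, transactions):
    """Read the old generation and write a new one with updates applied.

    Both inputs must be iterables of (key, data) pairs sorted by key.
    A transaction whose key matches a master record replaces that record;
    a transaction with a new key is inserted in sequence.
    """
    new_generation = []
    txns = iter(transactions)
    txn = next(txns, None)
    for key, data in master:
        # Insert any new records that sort before the current master record.
        while txn is not None and txn[0] < key:
            new_generation.append(txn)
            txn = next(txns, None)
        # Apply an update transaction that matches the master record.
        if txn is not None and txn[0] == key:
            new_generation.append(txn)
            txn = next(txns, None)
        else:
            new_generation.append((key, data))
    # Any remaining transactions are inserts beyond the last master record.
    while txn is not None:
        new_generation.append(txn)
        txn = next(txns, None)
    return new_generation
```

On tape, of course, both inputs and the output were streamed reel by reel rather than held in memory, but the single forward pass is the same: one read of the old generation, one read of the sorted transactions, one write of the new generation.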

Rising to the Challenge

We started out by partitioning the master database, first into two and later into four separate datasets; this allowed the daily update jobs to run in parallel. The analytic and reporting jobs were split along the same lines, and their respective outputs were combined in a new job to produce the final outputs: a kind of primitive Map-Reduce solution. The data extraction jobs were a little more challenging to partition, as they were built around some sophisticated fulfilment algorithms that dynamically adjusted record selection probabilities based on recent achievement history; these required additional work so that they would perform correctly on the partitioned master database. The changes made to the extraction algorithms also enabled us to run all of the required extractions in a single set of jobs, further reducing the elapsed run time for the complete schedule.
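The split-then-combine pattern above can be sketched in a few lines. This is a hypothetical illustration, not the original jobs: each partition gets the same sequential analytics pass (the "map" step), and a final job folds the partial tallies together (the "reduce" step):

```python
from collections import Counter

def analyse_partition(partition):
    """The 'map' step: one sequential pass over a single partition,
    tallying records by a hypothetical 'region' field."""
    return Counter(record["region"] for record in partition)

def combine(partials):
    """The 'reduce' step: fold per-partition tallies into one report."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# Two partitions of a toy master database.
partitions = [
    [{"region": "North"}, {"region": "South"}],
    [{"region": "North"}, {"region": "East"}],
]
report = combine(analyse_partition(p) for p in partitions)
```

The essential property, then as now, is that the per-partition work is independent, so the four passes could run concurrently and only the small combine step was serial.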

The Lesson

Don't get me wrong, I think that we are doing some fantastic things with Big Data these days and are finding new and creative solutions to many of the associated problems. However, I also think that we too readily dismiss some of the approaches and solutions that were applied in the past to equivalent classes of problems.

"Grey hair gets that colour because it leaks from the grey cells beneath."