Nov 8, 2009

At last - someone we understand ...


It is rare with all the background reading and research that I have to do that I read a contemporary book and say "Yes, this guy gets it" - notice that I say contemporary. There are plenty of good books about the day-to-day business intelligence problems that I see which were written over 20 years ago (The Mythical Man Month by Frederick P Brooks [1975] and Peopleware: Productive Projects & Teams by Tom DeMarco and Timothy Lister [1987] to name but two). So it was with immense pleasure that I read the opening chapters of Netezza Underground: The Unauthorized Tales of Derring-do and Adventures in Resilient Data Warehousing Solutions by David Birmingham [2008]. This book describes, in a hugely readable way, many of the same concepts and ideas that I have been espousing for the last 15 years about how to deliver a successsful data warehouse solution.

David describes the building of data warehouses as the need to think of terabytes, not transactional systems and a discussion of very-large-scale data systems. He says that the rules are different here and yet oddly the same:
  1. Everything is requirements driven
  2. Simplify and clarify
  3. Use correctly powered and scalable systems
  4. Governance
  5. Data management
  6. Strong architectural approach
  7. Building the environment with the expectation of change
  8. Testability
  9. Go Parallel
  10. Never do bulk inside a traditional RDBMS
I would love to drill into each of these in more detail but doing so would simply reprise an excellent book and probably breach copyright! And if the mention of Netezza as a vendor puts you off, then read the book anyway, because whilst the author is an avid enthusiast of the product (and some reviewers dislike the consequentially irreverent style), the approach, techniques and philosophy he describes provide much of the same advice that we have been giving our customers for so long and that can be applied to other technical solutions. Of course if you are interested in implementing a system we are here to help too!
Share/Save/Bookmark

Nov 4, 2009

Data Warehousing Appliances – Fad or Future?

This article was originally written for Conspectus Magazine in December 2006 and has been updated in November 2009 by the original author

Despite all the hype from vendors the basics of data warehousing have remained fundamentally unchanged – extract data from multiple source systems, reformat the information into an easy to query structure, load it into a dedicated database and add an effective user interface to allow users to query the information. The cost of this environment is substantial and directly relates to the complexity of the Extract, Transform and Load (ETL) process and the volume of data held in the system.

The complexity of the ETL process has two cost impacts: the first is in the cost of the initial design and development and is reasonably well understood. The second is the cost of changes over the lifetime of the system, for example if an organisation have four source systems and each system under goes a change once a quarter then the data warehouse support team have to modify and test an interface every three weeks, and all this without any changes in the users requirements. The volume of data also hits the bottom line, not only in the cost of storage but in the size and (more expensive) skills of team required to support it, especially as data explosion forces the business to enter the very large database arena where load time and user query performance are critical.

Against this background it is unsurprising that vendors are looking to compete by reducing storage, improving query times and simplify administration. Oracle have taken steps to enhance their core database engine with features such as Exadata that improve each of these areas and continue to develop their strategy, however more and more is built into the core of its flagship general purpose engine resulting in software that has many features not needed by a specific application. Sybase have taken the more radical step of creating an entirely new database engine called Sybase IQ that does away with some of the limitations required of a general purpose engine to produce a solution that is both much faster in load and user query performance and far more efficient in its disk usage than other general purpose databases. The other traditional database vendors have all upgraded their product suite to chase the Business Intelligance market

Into this market enters the data warehousing appliance vendors, a breed of dedicated hardware and software solutions designed to solve a business’ data warehousing woes. Such systems use low cost commodity components in large volumes with dedicated business intelligence engines to deliver radically faster load times whilst at the same time reducing the query times and simplifying the systems administration process.

The first hurdle for many organisations is that data warehousing appliances are to some extent proprietary and therefore going against a corporate policy of open systems to allow technology re-use, however a solution built on one of the current market leading platforms, Teradata, is no less so. In fact Teradata can be considered one of the original data warehouse appliances and it is the use of the low-cost commodity components and the ability to achieve massive parallelism by the new-comers that differentiates them.

The second hurdle is credibility – the promises of such large benefits (typically query performance of ten to fifty times faster whilst using three to six times less storage on a platform that only requires a small amount of systems administration support) will be doubted, especially by systems and database administrators who have had to work so hard to maintain the performance of the existing solution. Vendors such as Netezza have overcome this challenge with some key accounts by providing a system on the basis that if it meets agreed performance criteria it will be purchased and thus significantly reducing the risk to the purchasing company.

The final obstacle is migration and its associated cost: an existing solution that is built, for example, on an Oracle database, using Oracle Warehouse Builder and Oracle Discoverer is effectively proprietary and therefore more difficult, but not impossible, to migrate.  This is also a reason to review the existing data warehousing architecture now to ensure that when these and other new technologies come along the business will be able to take advantage of them. Companies will be able to get a clear competative advantage if, by architecting their Business Intelligence systems into functional components, they can quickly change and adopt bigger, faster, cheaper technologies.

Those organisations that have overcome the hurdles report that they are achieving the immediate huge performance gains for their queries without the need for tuning the database whilst lowering the disk footprint and reducing the support costs. The systems also continue to deliver benefit as the fast query times allow more complex data models to be queried, which in turn reduces the need for complex ETL to restructure the data. These changes to the data model and to reduce the complexity of the ETL can be made either as part of the migration project (which delivers the largest benefit quickly but at the greatest risk) or as part of the change management process for the source systems (which delivers benefit over a longer time frame but significantly reduces the risk).

Added to this is the emergence of MapReduce, originally develeoped by Google, it is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. This is becoming a must-have feature for appliance vendors handling very large data sets.

There are now a significant number of vendors working to produce some form of data warehousing appliance (Netezza, GreenPlum, AsterData, etc. to name but a few) and it is clear that appliances are going to form a key part of data warehouse architectures going forward, the risks of using a smaller vendor and a proprietary solution being outweighed by the business benefit of much more timely information at a significantly reduced cost. Note also that there will be market consolidation and some vendors will disappear.

For further information analyst Curt Monash is just one of a number of analysts who follow this subject and provides regular updates on the market.

Read the original version of the article
Share/Save/Bookmark

Nov 2, 2009

Creating graphics for BI Reports in Ruby On Rails

So your manager wants reports with sexy graphs - what are your options and what are the pros and cons

I have found eight usable packages with different considerations

OpenFlashChart

This is a series of libraries including one for Ruby that allow the generation of great looking charts that are very dynamic (see the demo on the website). Data is easily passed via JSON making it easy to integrate. The only downside is that it requires the client to have an Adobe Flash player installed. This is common enough for it not to be a problem for most people but it does limit the portability. Also as it generates a flash object rather than a JPG, PNG or GIF it means that this is suitable for online output but not useful in RSS feeds etc and when printed all the dynamic features are lost anyway.

Gruff

Gruff provides a simple and easy to use Ruby package that works well with Rails. It does require that ImageMagick/RMagick is installed on the server (HowTo for OSX) and this can be a pain for some people. The advantage is that it produces good looking, native SVG, PNG, JPG, GIF files so they are easily portable and printable

Scruffy


Scruffy is very similar to Gruff although appears to be less well used and not recently updated. It too requires ImageMagick/RMagick installed and whilst it is based on SVG is capable of producing other graph formats.

Sparklines


Sparklines are small graphs that are usually embedded within the text of a paragraph or placed inside tables. The standard output is a png file and it also requires ImageMagick/RMagick to be installed. This size of the graphs makes it useful for small graphs such as indicators rather than main report graphics.

gchartrb & Google Charts
See also http://www.infoq.com/articles/bass-google-charts-gchartrb


Google charts provide perhaps the easiest interface to creating a graph. All the parameters get bundled up in the URL and Google does all the work. Added to this is gchartrb that provides an easy and intuitive way to build the URL. All this means that there is nothing to install on the server side however it also means that to get access to your image you have to be connected to the internet and pass data which your company may regard as secure to the Google servers. This may make it impractical for use with BI data.

SVG::Graph


SVG:::Graph is a pure Ruby library for generating charts, which are a type of graph where the values of one axis are not scalar. The output is an SVG which might not be supported in all browsers or a PNG file

Flot


Flot is a pure Javascript plotting library for jQuery. It produces graphical plots of arbitrary datasets on-the-fly client-side. The package once again takes a JSON string as input and produces great looking graphs that can be dynamically updated

CSS Graph Helper


A Ruby on Rails helper for making simple graphs. The graphs use only CSS/HTML (and an optional gradient image). Great at creating quick simple online graphs but very limited in its capabilities

So:
If you are looking for a simple graph that is easy to generate then gchartrb is for you
If you are looking for dynamic and very attractive content then Flot or OpenFlashChart are your options
If you are looking for small KPI indicator type graphs consider Sparklines
If you are looking for good looking images that can be distributed consider Gruff, Scruffy and SVG::Graph
If you are happy with a quick and easy CSS based solution then look at CSS Graph Helper
Share/Save/Bookmark

Nov 1, 2009

Analysis by ETL ...

Performing source-system analysis by writing ETL is just about the most expensive way you can do it; but many organisations do exactly that. Why? Because they "can't afford" to do their analysis the cheaper way. No, this does not make sense.

This is how it works...

In most data warehousing projects, business analysts are separated from data profilers, data modellers, technical designers, ETL developers, and every other possible sub-division of the task you could imagine.

We sometimes find brilliant business analysts, but sadly, they are rare. More usually the business analysts don't like to get their hands dirty, and present the designers with works of elaborate fiction which have some resemblance to how the system might work in an idealised world. Usually their documents are at a "high level" (actually meaning vague and inaccurate).

Maybe we get some help from data profilers, who run the data profiling tool on a tiny sample of the data. (One tool, in the top right corner of Gartner's Magic Quadrant, limits samples to 999,999 records. I suggest that this is completely useless for a database table of a quarter of a billion rows, and such scale is usual in the sort of enterprise that would feel the need for a data warehouse.) The data profilers dutifully print out the verbose reports from the data profiling tool and the technical designers ignore them (not through malice, or even laziness, but because they know that these reports will not be much use).

The designers are more likely to add a further layer of invention to produce "Mapping Specifications" based on the way the system must work (surely!).

Then the ETL developers build their ETL jobs, generally pretty faithful to the mapping specifications, but often with some elaboration to take care of exceptional cases (most of which never happen in the real data) and with some "corrections" to the mapping specifications which (surely!) must be wrong on these points.

Now we could do with some test data. It is, of course, too expensive to build test data to cover all the test cases (which have not been defined until now) so work stops while negotiations begin to get extracts of real data from the operational systems.

Eventually the data arrives. The people providing it have carefully selected it to represent the cleanest view of the source system. Testing commences.

Actually, what happens now is that analysis commences. The ETL crashes into an idealised version of the real world and... crashes. The data is very unlike what was defined and it just doesn't work. We don't get an end-to-end run because the data formats are not accurate and even the extracts fail. We patch up the extracts, one defect at a time, until - eventually - the data gets as far as the staging area. Unfortunately, the project is now beginning to run late, so we don't have time to correct the mapping specifications. As for the source systems analysis - well those guys are long gone.

So we've stumbled through the extract and now we can start on the transformation. Well, if we couldn't get the source data formats right what chance have we got with the relationships which exist in the operational systems (which have been patched and enhanced over the years, with no regard to good database design, object-oriented principles, user interface good practice, or any discipline which might have given us a nice database to work with)?

Many months later, after many late nights and weekends of work by the ETL team, we finally have something which appears to work.

But this is on the sample data. When the system goes live, the ETL is exposed to the data quality errors, missing reference data, old test data, clever bits of encoding within fields (which the users have invented to overcome shortcomings in the system) and real data volumes.

Everything fails again, there are recriminations, more late nights and weekends. Many months later the system finally goes live and users start running reports. By this time the users have no confidence in the data warehouse - it has got a bit of a reputation - so they question everything which does not match what they had before, even if the new reports happen to be more accurate.

But some good has come out of this. Finally we have a team of people - the ETL developers - who actually understand how the source systems work and where data quality and design flaws exist. But the ETL developers have finished their work, so their contracts end and they walk out of the door taking the knowledge with them. No-one is going to pay for them to write it down.

This all sounds terribly cynical - doesn't it? But sadly it happens, over and over again. The tool vendors encourage this approach by the lie that their tool is the Silver Bullet (see The Mythical Man-Month by F.P. Brooks) that will solve all these ETL problems. The IT analysis and advice companies are accessories to the crime, re-enforcing the myths from the vendors.

It doesn't have to be this way. The application of realistic, sound, rigorous, good practices can overcome all of these problems, but this does require that managers understand how such projects work and don't just believe that pretending to be Alan Sugar or Gordon Ramsay is the best way to get results.

So what makes it better. First of all, we need people with skills across the range required for data warehousing projects. We want ETL developers who can talk to users, we want business analysts who understand the principles of good database design.

Then we carefully dig for real requirements. We don't just gather them. Quantity is not our goal. We don't want a spreadsheet with a list of 1,500 fields that appear on today's reports. We want to know what the users really need, what will save them the most time, what will improve the quality of the information they use. In the terminology of Agile development we want "stories" - the users' definition of what they need, why they want it, how important it is to them and the "acceptance criteria" which will enable us to know when we have delivered what they want.

The people doing the development must be involved at this stage. As they build working relationships with the users they can ask for clarification, they can show what they have discovered, and they can expose what they are building for the users to verify as they go along.

The team may use data profiling tools to find out the extreme edges of the system, the biggest and smallest values, the unusual patterns, the missing reference data, the orphaned transactions. The users will be able to explain some of the apparent strange values, and will discover why they have not always got the results they expected in the past. With the users' involvement, data quality issues can be fixed in the source systems. The data warehousing project is starting to deliver benefits before a line of ETL code has been written.

We may find that some master data management is required. Typically we will find different versions of the product catalogue in different applications. A project to sort out a single product catalogue will also save money in the business before a line of ETL code has been written.

With a proper understanding of the source systems the team can build a data model for the data warehouse which really models the work of the enterprise, not a fictional idealisation of it, and not a mirror of each operational system, but a process neutral data model which will be a valuable investment for years to come.

So, at this stage we have a robust target model (the data warehouse) and a good understanding of the source systems. We can now define the source to target mapping accurately, and, at the same time, build test cases to prove that the mappings work correctly. Yes, we build test cases, we do not just grab some real(ish) data and bash the ETL into it to see what happens. It does take a little time, but we automate our test runs so that the test case we build today can be run over and over again to keep on proving that the system is running correctly. The data volumes needed for this kind of unit testing are tiny, so test runs are completed quickly and can be run every time a new ETL module is added or an existing one is changed. (The ETL tools do not give much help in this, despite their 6-figure price tags.)

When we get a part of the data warehouse built successfully, then we can try some volume tests. Real data is useful here, but if it is not available, we use test data generators to ensure that our ETL and our data warehouse can handle real-world volumes (and a bit more). We may need to do some tuning. This may be as simple as adding some indexes, or it may cause us to rewrite chunks of our ETL code. Either way, we run our automated test suite, which now contains hundreds, maybe thousands of tests, to prove that we have not broken anything. Now the cost of building the tests is really paying us back.

At this stage the users can start running reports on the first part of the data warehouse. We can work with them to reconcile the results back to the source systems and to determine where any discrepancies lie. If we have done the preceding steps well, the new data warehouse will be right more often than not. When it is wrong, we can fix it and run our test suite to make sure we have not broken anything else. The users' involvement here builds confidence in the new data warehouse and smoothes the path to implementation.

So, if the first part of this article was cynical; is the second part hopelessly optimistic? No, it is not. We have done this and made it work, many times.

The keys to success are:
• people in the team with broad and deep skills
• involvement of the users from start to finish
• openness and honesty

The tools we use encourage these characteristics, but no tool is a silver bullet - without the buy-in of the team (users and developers) no tool will make a data warehousing project work.

To develop and manage documents, we use a wiki, and we do not lock down the security within the team (it is secure from competitors and any other possibly malicious access). Everyone in the team - users, managers, developers - can add to or change the wiki at any time, and everyone can see the current content and the history. Mistakes can always be fixed because the wiki keeps a complete history of changes, but the need to revert to history is very rare in our experience.

We use a tracking system for tasks, risks, issues, enhancements, test cases and defects. Again, everyone can see and comment on all of these. If a problem is brewing, then the team knows about it before it becomes a catastrophe.

We do not accept quick and dirty fixes. Technical debt is paid off as soon as possible, it is not allowed to accumulate. We are not afraid to "re-factor" to make our database and our code clean - we can easily prove that it is still working by running our automated tests.

Of course, we also use data modelling, data profiling, ETL, version control, and other tools, but we do not expect the tools to solve all our problems, and we choose free or low-cost tools where they do the job.

And, we do not do our analysis by writing ETL. We do analysis the much less expensive way, using our brains before we write all the code.
Share/Save/Bookmark

Oct 26, 2009

The project wiki - a cost reduction tool

Some readers will be familiar with TED: Ideas Worth Spreading a series of talks on just about everything worthwhile in Technology, Education and Design. I recently revisited Yochai Benklers talk on the new open-source economics from 2005 where he explains how collaborative projects like Wikipedia and Linux represent the next stage of human organisation.

The principle discussed is one it is more productive for a large number of people to work collaboratively on producing content than it is for a single individual to do so.

In Business Intelligence projects there are a number of formal documents that need to be produced to satisfy constraints external to the project (e.g. company business processes). Data Warehousing projects often fail to satisfy two masters, firstly failing to record information and knowledge pertinent to the on-going success of the project and secondly producing too many 'formal' documents that are time consumingly produced by key individuals but of little value to the project itself.

Data Management & Warehousing, my company, uses something called Project Services that combines Trac, SVN and the ideas of the Data Warehouse Documentation Roadmap to combine a Wiki, Version Control, Ticketing and Project/Team Website. Bringing these together and using them optimally is a micro-example of the concepts in Yochai Benklers talk.

A data warehouse team can quickly and efficiently build wiki pages that record much, if not all of the information that is internal to the project. This can include the business definition dictionary, various pieces of the requirements, architecture, analysis, standards and definitions, etc. What is more in this environment it is possible to have links between wiki pages and documents and between wiki pages and the source code itself. This means that users of the system can quickly and intuitively navigate through information and, where there is missing information or inaccuracies they can fix it themselves. In the case of project services there is also a ticketing system for task, risk, issue, enhancement tracking built in to complete the project management and governance aspects. Because this approach makes it easier for users they are more willing to contribute to the overall solution and to follow the required processes.

Implementing and using a tool such as project services is key to creating a BI-on-Rails solution because whilst the strict version control and issue management can be enforced it also allows agile processes to work and strong internal communications within the team.
Share/Save/Bookmark

Oct 22, 2009

Oh what a tangled web we weave, when first we practise to deceive!

The deception in our case is the accuracy of the data and the return on investment of the data warehouse solution. Too often projects make compromises in the implementation of a business intelligence solution that create massive cost and user dis-satisfaction downstream that lead to the failure of BI projects:

For example:

* In the project management setting un-realistic or un-deliverable targets, opting for tactical solutions that soon overtake the main project and the failure to communicate issues to senior management and business users that will affect the delivery timescales and costs.

* In the requirements phase believing that you know better that the user and not getting their sign off that they understand and agree with the requirements.

* In the analysis phase the failure to correctly identify the master sources of information and to do sufficient work to understand how the data is stored in the source system and what data is needed to meet the requirements.

* In the design phase the expediency of allowing the developers to 'get on with their job' by not thoroughly validating the design and checking the sources to ensure that the design meets criteria such as timeliness, reliability and accuracy.

* In the build phase by coding to the bear minimum to get the job done and not to a standard that ensures it will run successfully time after time.

* In the testing phase where as well as the standard data and volume data you test all the data boundary conditions that might occur.

* In the data quality aspects of the programme where data is fixed on the way in to the system or after it has been loaded because it is 'too difficult' to fix it back at the source system.

These are only a few examples but each of them makes the final solution more difficult (a tangled web) and therefore more costly to operate and maintain. If you are in a project where doing the right thing is compromised then remember that the project will have to pay for these 'efficiencies' later. Project managers should always strive to improve any project lifecycle and deliver on time. Too often, however, it becomes expedient to bury the truth and deliver anything to meet an arbitrary date that, in the long term, leaves the business user unsatisfied.

For more on this subject have a look at:

* Black Swans and White Elephants - RoI in Business Intelligence

* Getting to NO

* Technical Debt

By the way the quote is often mistakenly attributed to William Shakespeare but was written by Sir Walter Scott (1771 - 1832) in his poem Marmion - another data quality problem!
Share/Save/Bookmark

Oct 14, 2009

BI Convention over Configuration

The convention over configuration discussion when designing a data warehouse often leads to fanatical discussions by technical people over the 'best' approach to do things.

Convention can be used to define the standard way to design a data model (see Process Neutral Data Modelling) or to implement an ETL transformation (for example a reducing sets approach) amongst other things but inevitably someone will claim to be able to make an individual element 'better' bet by configuration.

For example if we have an ETL mapping and always approach handling data change in a certain way then unless we are really sure of the benefits we should stick to the convention. In our example a DBA offers to make a single ETL mapping faster. The DBA can halve the time - so what is the harm?

Firstly if the new algorithm is as complete but just faster then there is no harm. In fact I would encourage you to update your conventions and re-factor your code so that all of the mappings exploit the algorithm.

But what if the faster algorithm works because it omits some element? Maybe it doesn't need to check all the columns, perhaps there is an assumption of referential integrity, or it joins two steps together where other tables need the intermediate results.

This is where the cost creeps back in. In the future the omitted column is needed, referential integrity is no longer true, the intermediate step is required, etc. At best this causes un-necessary rework. At worse it causes silent corruption of the data that causes a crash or takes months to unravel. It also means that because it does not conform to the standard the person fixing the code has to work through the code to understand it first, and assuming they understand it perfectly, only then they can then fix it - all of this taking more resource time and cost.

The convention over configuration option does not mean that there is only one way to do something, nor does it mean that it is optimal for the individual case - it means that there are a defined set of algorithms for a given process and that these are collectively optimal for the solution - a case of looking at the whole picture rather than individual elements. In the process neutral data modelling technique it will often create a data model with over 1000 tables but there are 10 named approaches to the data modelling issue. How many ETL algorithms will you need? My guess is that it is in the same order of magnitude, about 100:1 against the total number or transformations. What is more the number of ETL transformation algorithms required drops dramatically if the data model is uniform because it is easy to describe a standard ETL algorithm for a standard data model.

Returning to the start of this article - your DBA can do this mapping better? But in what sense? Is it more cost-effective, more maintainable? Easier to understand, or cheaper to maintain? The decision to go the bespoke route for this one case should balance these factors and their costs against the necessity of the performance gain.

Experience tells me that in 99.9% of cases bespoke is not worthwhile in the long term.
Share/Save/Bookmark

Oct 13, 2009

We've got a new puppy - we've called him ETL!

Monday ...
Our house is full of optimism today - we've been out and spent an absolute fortune on a pedigree pup. We were really impressed by the salesman at the shop. He offered something to each family member. Mum (our family manager and my wife) was sold on the the cost efficiency. We won't need to install an alarm system to protect our house (we live at Data Quality Villa, on Single Version Of The Truth Avenue) and he can be taught to fetch and carry just about anything we need. Our daughter and oldest child Elizabeth-Ann, known as Busy-Ann to her friends, liked the way it was easy to train him to do exactly what she wanted. My son Dev is just grateful to have a new play thing. He's been talking about going on long hot summer days hacking through the country side - so that's him out of my hair. And me ? Thomas Archibald (T.Arch. to my friends), well I'm pleased that all the research I did on the web before hand has paid off - the salesman told me I had paid the right price for the right pup. I must admit that I enjoyed all those trips off to Dog Shows to evaluate different breeds and meet other owners - who would have thought you would have to travel to so many exotic places to see different dogs?

Tuesday ...
So Mum's a little bit upset this morning - I don't know why - I told here there would be a learning curve with this breed before he did anything - but that it would be worth it in the end. She said I was just looking for the latest accessory, but a dog is not just an accessory its for life! Anyway Dev is cleaning up the little accidents in the utility room (you know where all the machines are kept) and says that he is starting to get ETL to do what he wanted. He's mentioned that he has started to learn the same language at ETL - I don't know why they expected it to be as easy as dragging a toy or dropping a bone but there you go.

Wednesday ...
ETL is settling in well - he's taken over an entire corner in the utility room and when he lies in his bed paws and ears hang over the edge looking cute. Mum was moaning again about the little accidents. She says it's taking too much time but I don't see the problem - Dev and his girlfriend DeBrA clean them up so Mum really doesn't have to do anything to worry about. It was quite funny when I got home tonight, Dev had all his mates around, DeBrA was there along with the twins Si Admin and Nat Admin eating cold pizza, they were all chatting about how ETL just wasn't very quick at learning new things or at fetching things when he has learnt them. The good news is that ETL is finally fetching something - he brought me my fluffy slippers when I got home tonight.

Thursday ...
Mum is still unhappy about the wasted time looking after ETL - so I popped in to the shop where I got him and asked for some advice. They sent me to the local vets - who can cure anything the salesman said. I popped in on the way home and met a couple of the firms partners. 'SI Partners' is a big company with 'global reach' and a 'local presence' so I was delighted that they said they could help. I've had to take out an insurance policy with lots of small print about exemptions and things that aren't covered by the standard contract but that's normal for these sort of things. One of the partners is coming round tomorrow to do a review and let us know what we need to do.

Friday ...
Mum's still unhappy (how did we know that was coming?) but the partner came around early and set our minds to rest. We're missing the obvious, we didn't get the accessories - there are a complete range of toys, bells and whistles made especially for this breed! We got a leash to keep him under control when we bought him but we haven't used it (according to the partner that is why ETL doesn't have any discipline) and we didn't think the extras would be needed but there you go, the only downside is that the extras cost almost as much as the dog did. The partner is also sending around a couple of practice nurses to help keep an eye on things. It would have all been expensive but the for call-off insurance (he said something about the extras and the nursing not being covered but we will sort that out)

Saturday ...
Still not got ETL trained. Busy-Ann is unhappy because a simple change seems to cause so many problems. ETL could fetch my old slippers but doesn't seem to understand that they are now redundant as I got some new ones for my birthday and so I really want him to fetch my new ones. The old ones were furry and he seemed to like those but he doesn't get on with these new leather ones and has chewed the corners off the toes. It's pretty bad news as fetching the slippers is the only thing we've got him to do so far and I felt the toes were an integral part of the slipper!

Sunday ...
Didn't sleep too well last night. There was a lot of noise in the middle of the night and I went down stairs to find Dev howling at the moon in the garden. I asked him what he was up to and he said that since he had learnt ETLs language he found it easier to bark himself! Eventually I persuaded him to go to bed. I overslept because of the disturbance and when I got downstairs I found a note from my wife to say she had left and taken Dev with her. She said that since we got the dog our house had become the dirtiest on the street, not good for a house called 'Data Quality Villas'. She was going somewhere where they would listen to her and appreciate Dev for his new found skills. I don't know what the divorce settlement will be like but it is bound to be expensive. Busy-Ann has stayed with me but she won't have anything to do with ETL. Oh! the dog, that's right, that's why you are reading this - he's fine, but you would be too if you had round the clock nursing from SI practice, he's spread out too and has taken over the whole of the utility room. If I didn't know better I would say all those extras and no exercise is making him a little bloated and he is still fetching the wrong slippers.

Monday ...
New week, fresh start. I got a new girlfriend - you might think it is too soon but I was lonely - she's not permanent or anything and co-incidentally she's got a son called Dev too. They've moved in and are helping out with ETL. Between the nurses and Dev 2 (our little joke - he's the second Dev we've had in the house you see) ETL has at least got round to bringing me my new slippers! I've asked a few friends about their dogs. Most of them have pedigree dogs too, some have tried several breeds over the years, most have had about the same results. It just goes to show that it was nothing we did that was wrong, it's just how it is with dogs. There was one guy who had a simple mongrel which he called Michelle Script - a strange name if you ask me - but he was delighted. He said he started with lower expectations, looked after the dog well and got lots of love and devotion back - but I said that I couldn't be seen with a mongrel, what would my friends at the Analyst Club think? I've decided to have a look at other breeds again, I'll let you know how I get on.
Share/Save/Bookmark

Oct 11, 2009

Version Control on Rails

In my consultancy assignments, I usually work in large organisations that have outsourced IT departments and very formal processes for everything. One aspect of this is version control.

In such an organisation every document contains a long table near the beginning with a row for every change, who made the change and when it was made and of course there will be a version number for every change. Usually the version number has at least 2 parts.

Also, the file name will have the version number in it, such as: "Some Formal Document v1.3a.doc".

Traditionally (over the last 15 years or so) the files are stored in a Windows share. There will also be copies on local drives on various individuals' PCs. How often do we find different people have different versions and they all believe them to be the latest?

Now, many organisations are waking up to the benefits of version control systems. However, they usually do two stupid things. The first is that they buy a version control system and the second is that they use their expensive version control system as if it were a Windows share.

You don't need to buy a version control system. You almost certainly don't need to buy a "Software Configuration Management" system, nor do you need to buy an "Application Lifecycle Management" system. The best version control systems are free and they are called Subversion and CVS. Using these tools is easy and simple. Discipline is necessary to get the best out of them, but you don't need a massive amount of training. If you are really keen, you could read Pragmatic Version Control using Subversion. (You are really keen, aren't you?)

While you're reading the book you could minimise your chances of going off the rails by following the simple recommendations listed below:

  • Let the version control system do the work!
  • Remember that the version in the version control system is the master version; any file on your local disk is not the master until you check it in to the version control system.
  • Do not change the name of a file from one version to another; all versions of one file go in the same folder, under the same name (then you can see the change history).
  • If you do need to change the name of a file or folder, do it in the the version control system client, not on the working file on your local disk. (If you do it in the version control system it will change the working file for you and everything will stay in sync.)
  • Never, never, never put a version number, or date, or time in a filename - this is a version control system, not a Windows share!
  • Check in atomic units – one component per file - do not check in zip files or tar files
  • The version control system version information is trusted over any version information recorded inside a file - so do not record it inside the file.
So what has this got to do with Rails? Well, there is the DRY principle, to start with:

Every piece of knowledge in the system should be expressed in just one place.

The version control system keeps a log of your versions, so don't keep an incomplete and out-of-date one in the document - delete that big table from the front of your document - you really do not need it. And do not put the version number in the document or in the file name. This is just repeating yourself and if you do that you will end up with many conflicting versions of the truth.

Every time someone works with the wrong version of a document, they waste company resources. Don't tolerate it.
Share/Save/Bookmark

Oct 8, 2009

Agile development of a BI application

Wandering around the web I found a presentation by Soren Burkhart of Hawaii Business Consulting, LLC given at the Rails Conference in Portland, Oregon (May 2007) that describes a classic agile approach to developing a Business Intelligence application for the Office of Foreign Assets Control ("OFAC") of the US Department of the Treasury

It demonstrates the collecting of data from sources and manipulating it into a presentation format. Along the way he interacts with the customer to determine look and feel and builds test cases for now and to allow re-testing when the sources inevitably change. A great little demo of fast and maintainable (rather than quick and dirty) BI reporting development - all in just 45 minutes.

It is an enclosure to this article or you can download it from http://www.hawaiibcllc.com/BIWR.pdf
Share/Save/Bookmark

Can Business Intelligence Be Agile?


The agile manifesto and it's underlying principles set out four key values for developing software solutions:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

If an organisation is willing to accept these values for software development then how much more important should adoption of them be for the development of business intelligence solution upon which critical business decisions are made.

Furthermore business intelligence projects do not see themselves as 'developing software' because they use tools to generate ETL and tools to generate reports. Management teams also shy away from the idea that their business intelligence teams are developing software because they have long since moved away from in-house development to implementing packages. This also goes some way to explaining why management are so keen to buy Business Intelligence packages.

The reality is that ALL business intelligence projects are producing a massive amount of software. This software mainly consists of interfaces between systems (ETL) and interfaces with users (the reporting tools). Because the system is not aligned to a single business process individuals use the system in ad-hoc rather than structured ways. The BI system is also the recipient of information from all the source systems and therefore is constantly subjected to change that is outside it's control.

However the organisations business intelligence system is built the reality is that there will be millions of lines of code, massive amounts of change and demanding users doing things in unexpected ways.

The question that should be asked is not 'Can Business Intelligence Be Agile?' but 'Developing Business Intelligence solutions requires an agile process - can my company organise itself to successfully use agile methods to develop the required solution?'


Share/Save/Bookmark