- 论坛徽章:
- 0
|
By nisingh
(Open Source Analytics Evangelist) Posted 3/30/2006
In the next post, we'll start
doing some hands on. But before that we need to select a Data Warehouse
Platform. Don't worry, I am not going to nudge you towards some
commercial "Data Warehouse Solution" here! As promised, it will be open
source and free.
Since a Data Warehouse is essentially a
database, we could use any standard database for the purpose. But since
we aren't making just a toy to impress people with, let's stop for a
moment and look at what we need. So lets go over that...
Unlike
an OLTP system that needs to write very fast, we would be better off
with a database that can read lots of data faster because that is what
we would be doing most of the time. (See previous post
Database vs Data Warehouse
).
We
would also like the system to be production grade, robust, 24x7, and
common enough so that finding expertise is not impossible. Do note that
24x7 availability may not be as critical for a DW as it would be for
the OLTP servicing your bank's ATM or rerouting your flights.
Demonstrated
ability to handle huge (really HUGE) amounts of data is a must as well,
and it should allow high-speed bulk loading in batch jobs. And it would
be great if there was a way to split very huge tables into smaller ones
(like Oracle partitioning).
And then you would need transaction support. Right? Wrong.
Huh?
Well, look at it this way. If your OLTP system did not have
transactions, we'd be dead. But a DW is not a transaction oriented
system. It would be more efficient to use a non-transactional database
as you do away with the overheads required for enforcing transactions.
This is a debatable point, but a non-transactional database would be a
lot faster than a transactional database in simple read write tasks.
MySQL, configured right, makes an obvious and affordable choice for Data Warehousing.
MySQL's
non-transactional MyISAM db engine is one of the fastest around when it
comes to querying large amounts of data. You don't get transaction
support, and that's alright because you don't need transactions (and
the huge overheads they bring). And its bulk insert features (LOAD DATA
INFILE |
|