- Do not start with the installation of a Hadoop cluster -- the "how"
- Start by talking to business people to understand their problem -- the "why"
- Understand the data you must process
- Look at the volume -- very often it is not "that" big
- Then implement it, taking a simple approach -- for example, start with MongoDB + Apache Spark
The infamous "big data project"
You can meet the same person a few months later; the cluster is still sitting there, with no activity on it. I have even met consultants who told me they received calls from their customers asking what they should actually do with the cluster now that it is installed.
Start with the "Why" not with the "How"!
- You are an insurance company; your customers have little contact with your representatives, or their satisfaction is mediocre at best. You start to see some customer names appear in quotes coming from price comparison websites… you can guess that they are looking for a new insurer.
- Still in insurance: when your customers are close to retirement age, or have teenagers learning how to drive or moving to college, you know you have an opportunity to sell new contracts, or adapt existing ones to their new needs.
- In retail, you may want to look at all your customers and what they have ordered, and based on this recommend products to a customer who "looks" the same.
- Another very common use case these days: sentiment analysis of social networks, to see how your brand is perceived by your community.
Let's now talk about the "How"
Before jumping to the technology, you should continue to analyze the data:
- What is the structure of the data that I have to analyze?
- How big is my dataset?
- How much data do I have to ingest over a given period of time (minute, hour, day, ...)?
- Take Belgium: that is what, 11+ million people
- If you store a 50 KB object for each person, this represents:
- A full dataset of roughly 524 GB -- yes, not even a terabyte! (see the quick calculation below)
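To see where that number comes from, here is a quick back-of-the-envelope calculation in Python; the population and object size are simply the assumptions used above:

```python
# Back-of-the-envelope sizing: one 50 KB object per inhabitant of Belgium.
people = 11_000_000        # rough population of Belgium
object_size_kb = 50        # assumed size of one stored object, in KB

total_kb = people * object_size_kb
total_gb = total_kb / 1024 / 1024   # KB -> MB -> GB

print(f"{total_gb:.0f} GB")         # ~524 GB, well under a terabyte
```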
Any database will do the job, starting with MongoDB. I think it is really interesting to start this kind of project with a MongoDB cluster, not only because it will allow you to scale out as much as you need, but also because you will leverage the flexibility of the document model. This will allow you to store any type of data, and easily adapt the structure to new data or new requirements.
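To give a feel for that flexibility, here is a minimal sketch using the PyMongo driver; the database, collection, and field names are invented for the example, and the point is simply that documents with different shapes can live side by side:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client["insurance"]["customers"]   # hypothetical database and collection

# Two customers with different structures stored in the same collection:
customers.insert_one({
    "name": "Alice",
    "age": 54,
    "contracts": [{"type": "car", "premium": 420}],
})
customers.insert_one({
    "name": "Bob",
    "satisfaction": "medium",
    "price_comparison_quotes": 3,   # a field Alice's document simply does not have
})
```

No schema migration is needed when a new field shows up; you just start storing it.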
Storing the data is only one part of the equation. The other part is how you process it. Lately I have been playing a lot with Apache Spark. Spark provides a very powerful engine for large-scale data processing, and it is a lot simpler than writing MapReduce jobs. In addition, you can run Spark without Hadoop! This means you can connect Spark to your MongoDB, using the MongoDB Hadoop Connector, and to other data sources, and execute jobs directly against your main database.
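As an illustration, here is a minimal PySpark sketch of that setup. It assumes the MongoDB Hadoop Connector JARs are on Spark's classpath (for example via spark-submit --jars); the connection URI, collection, and field names are the same invented ones as in the PyMongo example:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("mongodb-spark-sketch")
sc = SparkContext(conf=conf)

# Read documents straight out of MongoDB through the mongo-hadoop connector.
mongo_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf={"mongo.input.uri": "mongodb://localhost:27017/insurance.customers"},
)

# Each element is a (key, document) pair, the document arriving as a Python dict;
# count customers per satisfaction level as a trivial example of a job.
satisfaction_counts = (
    mongo_rdd
    .map(lambda kv: (kv[1].get("satisfaction", "unknown"), 1))
    .reduceByKey(lambda a, b: a + b)
)

print(satisfaction_counts.collect())
```

The same job could combine other data sources as well; the point is that the processing runs directly against the database you are already using, with no Hadoop cluster in sight.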