In the last segment we looked at the first two stages of undertaking a predictive analytics project. We learned that finding a business question to address and identifying your data sources are critical first steps on the road to predictive success. Once you’ve figured out your data sources, determined that there’s enough information to support your business question, and identified where the data resides, you’re ready for the data preparation process.
Preparing Your Data for Analysis
It’s easy to assume that once you have the data, it’s ready to analyze. However, data comes in all shapes, sizes, and formats. There are mistakes, duplicates, anomalies, and missing values that all have to be accounted for if the predictive modeling is going to make any sense. A common rule of thumb in the predictive analytics industry is that preparing and cleansing the data accounts for 80% of the analyst’s time. Data preparation is crucial because the results of the predictive analysis can only be as good as the quality of the data behind them.
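To make those problems concrete, here is a minimal sketch in Python (standard library only) that profiles a small data set for duplicates, missing values, and anomalies. The sample records, the `profile` function, and the outlier threshold are all hypothetical and chosen for illustration. The anomaly check uses a modified z-score based on the median absolute deviation, which is one common approach among many, not a prescribed method.

```python
from collections import Counter
from statistics import median

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "region": "East", "sales": 120.0},
    {"id": 2, "region": "West", "sales": 135.0},
    {"id": 2, "region": "West", "sales": 135.0},   # exact duplicate
    {"id": 3, "region": "East", "sales": None},    # missing value
    {"id": 4, "region": "West", "sales": 128.0},
    {"id": 5, "region": "East", "sales": 9999.0},  # suspicious outlier
    {"id": 6, "region": "West", "sales": 131.0},
]


def profile(rows, numeric_field, threshold=3.5):
    """Count duplicate rows and missing values, and flag anomalies in one numeric field."""
    # Duplicates: rows whose full contents repeat an earlier row.
    row_counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values())

    # Missing values: how many times each field is None.
    missing = Counter(key for row in rows for key, value in row.items() if value is None)

    # Anomalies: modified z-score using the median absolute deviation (MAD).
    values = [row[numeric_field] for row in rows if row[numeric_field] is not None]
    center = median(values)
    mad = median(abs(value - center) for value in values)
    anomalies = []
    if mad > 0:  # with a MAD of zero the score is undefined, so nothing is flagged
        anomalies = [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

    return {"duplicates": duplicates, "missing": dict(missing), "anomalies": anomalies}


report = profile(records, "sales")
print(report)
```

A report like this only tells you where the problems are. Whether to drop, correct, or fill in the affected records still depends on your business question.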
Data sets may be incomplete, wrong, or inconsistent and therefore steps will need to be taken to ensure that the data is properly cleansed and validated. There are several key points you’ll need to address during the data preparation stage:
There are a number of data preparation and validation techniques that can be employed to ensure that your data sets are accurate. Here are some major ones to consider:
The data preparation process can be cumbersome and time-consuming, even for a seasoned analyst. Fortunately, there are many tools on the market that help automate data preparation and cleansing using graphical ETL (extract, transform, and load) capabilities.
For example, a tool like Datamartist can take some of the headache out of the data cleansing process. Its dashboard offers a variety of features that help you profile and visualize large sets of data from multiple sources. You can easily import tables or views, combine data sets, select data from multiple sources, use the calculation editor to transform the data, tie into local or remote databases, and more.
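If you want to see what the extract, transform, and load steps amount to underneath a graphical tool, here is a small hand-written sketch in Python using only the standard library. It is not how Datamartist or any particular product works. The CSV content, the column names, and the `sales` table are hypothetical, and dropping rows with a missing sales figure is just one possible cleansing rule.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or a database.
RAW_CSV = """customer_id,region,sales
1, east ,120.50
2,WEST,135
2,WEST,135
3,East,
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize text, convert types, drop incomplete and duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        sales = row["sales"].strip()
        if not sales:
            continue  # assumption: rows without a sales figure are discarded
        record = (int(row["customer_id"]), row["region"].strip().title(), float(sales))
        if record in seen:
            continue  # skip exact duplicates
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE sales (customer_id INTEGER, region TEXT, sales REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute(
    "SELECT customer_id, region, sales FROM sales ORDER BY customer_id"
).fetchall()
print(rows)
```

Keeping the three steps in separate functions mirrors the way graphical ETL tools present them, and it lets you change a cleansing rule without touching how the data is read or stored.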
When you look at predictive analytics charts and visualizations about future outcomes, it’s easy to forget the considerable effort that went into generating that work. Just remember, the value of those results is directly related to the quality of the data. The key takeaway is this: the data preparation phase is the most time-consuming, but it is central to ensuring the integrity of the analysis.
In the next part we’ll turn to the critical stage of predictive modeling, where the rigor and methodology of predictive analytics focus on deriving the best outcomes for your business objective.