Everything about Web and Network Monitoring

Predictive analytics for the small business – part 3

In the last segment we looked at the first two stages of undertaking a predictive analytics project. We learned that finding a business question to address and identifying your data sources are critical first steps on the road to predictive success. Once you’ve figured out your data sources, determined that there’s enough information to support your business question, and identified where the data resides, you’re ready for the data preparation process.


Preparing Your Data for Analysis


It’s easy to assume that once you have the data, it’s ready to analyze. In reality, data comes in all shapes, sizes, and formats. There are mistakes, duplicates, anomalies, and missing values that all have to be accounted for if the predictive modeling is going to make any sense. A common rule of thumb in the predictive analytics industry is that preparing and cleansing the data accounts for 80% of the analyst’s time. Data preparation is crucially important because the final results of the predictive analysis are only as good as the quality of the data.






Data sets may be incomplete, wrong, or inconsistent, so steps will need to be taken to ensure that the data is properly cleansed and validated. There are several key questions you’ll need to address during the data preparation stage:


  • What specific steps will be used to correct duplicate, wrong, or inconsistent values and to fill in missing ones?
  • Does the data contain format, keying, or reference errors that render some values invalid?
  • What will be your strategy for addressing outliers (values outside the statistical norm that may undermine the model’s accuracy)?
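To make the questions above concrete, here is a brief sketch of how each might be handled with pandas. The table, column names, and cleaning rules (median fill, 3-standard-deviation outlier flag) are illustrative assumptions, not a prescription:

```python
import pandas as pd
import numpy as np

# Hypothetical raw customer data with the kinds of problems described above
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "region": ["East", "West", "West", "east", None],        # inconsistent case, missing value
    "monthly_spend": [250.0, 99.5, 99.5, np.nan, 12000.0],   # missing value, possible outlier
})

# Duplicates: drop exact repeats of the same record
df = df.drop_duplicates()

# Inconsistent values: normalize text casing
df["region"] = df["region"].str.title()

# Missing values: fill numeric gaps with the median, flag missing categories
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())
df["region"] = df["region"].fillna("Unknown")

# Outliers: flag values more than 3 standard deviations from the mean
zscores = (df["monthly_spend"] - df["monthly_spend"].mean()) / df["monthly_spend"].std()
df["spend_outlier"] = zscores.abs() > 3
```

Whether to fill, flag, or drop a suspect value depends on the business question; the point is to decide the rule explicitly before modeling rather than letting bad values slip through.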






There are a number of data preparation and validation techniques that can be employed to ensure that your data sets are accurate. Here are some major ones to consider:


  • Use graphics and visualizations such as bar charts, pie charts, or Pareto charts to review the data distribution (e.g., irregularities, missing values).
  • Use descriptive statistics (mean, median, mode) to check whether the data makes sense. Scatterplots, for example, can be used to spot outliers and nonsense values.
  • Check for and delete duplicate data entries.
  • Make sure variables are of the appropriate type (numerical, string) and are measured at the right level (binary, ordinal, nominal, numeric).
  • Ensure that variable and value labels and headings are correct and free of typos.
  • Review all data variables to ensure that values and attributes are consistent across multiple data sets and files.
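Several of the checks above can be scripted rather than done by eye. A minimal sketch using pandas follows; the data set and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical data set with a nonsense value and a duplicate row
df = pd.DataFrame({
    "age": [34, 29, 41, 29, 300],  # 300 is an implausible age worth investigating
    "segment": ["SMB", "SMB", "Enterprise", "SMB", "Enterprise"],
})

# Descriptive statistics: count, mean, std, and min/max expose the implausible 300
print(df["age"].describe())
print("median:", df["age"].median(), "mode:", df["age"].mode().iloc[0])

# Check for duplicate entries
print("duplicate rows:", df.duplicated().sum())

# Verify variable types match expectations (numeric vs. string)
print(df.dtypes)

# Check value consistency: unexpected labels stand out in a frequency count
print(df["segment"].value_counts())
```

Running checks like these on every incoming file makes the review step repeatable, which matters when the same data sets are refreshed and re-analyzed over time.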


The data preparation process can be very cumbersome and time-consuming, even for the seasoned analyst. Fortunately, there are many tools on the market that help automate the preparation and cleansing process using graphical ETL (extract, transform, and load) capabilities.


For example, a resource like Datamartist can take some of the headache out of the data cleansing process. The dashboard offers a variety of features that help you profile and visualize large sets of data from multiple sources. You can easily import tables or views, combine data sets, select data from multiple sources, use the calculation editor to transform the data, tie into local or remote databases, and more.






When you look at predictive analytics charts and visualizations about future outcomes, it’s easy to forget the considerable effort that went into generating them. Just remember, the value of those results is directly related to the quality of the data. The one takeaway we’ve seen is this: the data preparation phase is the most time-consuming, but it is central to ensuring the integrity of the analysis.






In the next part we’ll turn to the critical stage of predictive modeling, where the rigor and methodology of predictive analytics really come into focus to derive the best outcomes for your business objective.



About Jeffrey Walker

Jeff is a business development consultant who specializes in helping businesses grow through technology innovations and solutions. He holds multiple master’s degrees from institutions such as Andrews University and Columbia University, and leverages this background toward empowering people in today’s digital world. He currently works as a research specialist for a Fortune 100 firm in Boston. When not writing on the latest technology trends, Jeff runs a robotics startup called virtupresence.com and oversees startuplabs.co, an emerging-market assistance company that helps businesses grow through innovation.