Collecting, matching and preparing data sets for analysis.
Before meaningful analysis can be applied to data, it must be collected and prepared. The source data is typically spread across multiple repositories in different business systems, so the first task is to connect to the different sources and pull the data together in a standardised way. The tools used to do this depend on where the data is available.
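As a rough illustration of what "pulling data together in a standardised way" involves, the sketch below renames fields from two hypothetical business systems (a CRM and an ERP, with made-up field names) into one common schema. Real ETL tools such as Talend do this with connectors and graphical mappings; this is only a minimal hand-rolled version of the idea.

```python
# Hypothetical raw records from two business systems with different schemas.
crm_row = {"CustomerName": "Acme", "Tel": "0123 456789"}
erp_row = {"name": "Globex", "phone_number": "0987 654321"}

# Mapping from each source's field names to a shared, standardised schema.
FIELD_MAPS = {
    "crm": {"CustomerName": "name", "Tel": "phone"},
    "erp": {"name": "name", "phone_number": "phone"},
}

def standardise(row, source):
    """Rename a record's fields to the common schema, tagging its origin."""
    mapping = FIELD_MAPS[source]
    record = {mapping[k]: v for k, v in row.items() if k in mapping}
    record["source"] = source
    return record

# Records from both systems now share one schema and can be analysed together.
unified = [standardise(crm_row, "crm"), standardise(erp_row, "erp")]
```

Keeping a `source` field on each record preserves provenance, which is useful later when matched records disagree and one system has to win.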
This is best achieved using ETL (Extract, Transform, Load) tools like Talend, which provide connectors and can be customised to connect to almost any data source. Where data is only available in web pages, a crawler like Apache Nutch can be used to collect it. If the data is in scanned documents, then an OCR (Optical Character Recognition) tool like Tesseract is needed to convert the images into text. Once collected, the different data sources need to be matched together to unify them, using a tool like Exorbyte MatchMaker.
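The matching step deserves a brief illustration: records that refer to the same entity rarely spell its name identically across systems. The toy example below uses Python's standard-library `difflib` to score string similarity and pick the closest candidate above a threshold. Dedicated matching products like Exorbyte MatchMaker use far more sophisticated techniques; the company names and the 0.6 threshold here are purely illustrative.

```python
from difflib import SequenceMatcher

# Toy records from two hypothetical source systems; names differ slightly.
crm_records = ["Acme Corporation", "Globex Ltd", "Initech Inc"]
billing_records = ["ACME Corp.", "Globex Limited", "Umbrella PLC"]

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how closely two strings match, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(name, candidates, threshold=0.6):
    """Return the closest candidate above the threshold, or None."""
    scored = [(similarity(name, c), c) for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

# "Initech Inc" has no plausible counterpart, so it stays unmatched.
matches = {r: best_match(r, billing_records) for r in crm_records}
```

Leaving unmatched records as `None` rather than forcing a best guess is deliberate: low-confidence matches are usually better routed to manual review than silently merged.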
For “Big Data”, where processing needs to be distributed across a whole cluster of machines, tools like Apache Hadoop or Apache Storm allow a scalable system to be set up to handle huge data loads.
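The programming model Hadoop distributes across a cluster can be sketched in plain Python: each data partition is mapped to intermediate key/value pairs independently, and the pairs are then grouped and reduced. The single-process word count below is only an illustration of that map/shuffle/reduce shape, not how jobs are actually submitted to Hadoop.

```python
from collections import defaultdict
from itertools import chain

# Toy "partitions": in a real cluster each list would live on a different node.
partitions = [
    ["big data needs scale", "data needs tools"],
    ["tools process data"],
]

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one partition."""
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by word and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each partition is mapped independently, then the results are combined.
mapped = chain.from_iterable(map_phase(p) for p in partitions)
word_counts = reduce_phase(mapped)
```

Because the map phase touches each partition in isolation, it scales horizontally: adding machines adds map capacity, which is the core appeal of the Hadoop batch model. Storm applies a similar decomposition to continuous streams rather than stored batches.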
Some of the tools we support include:
- Crawler, Apache Nutch
- ETL, Talend
- OCR, Tesseract
- Matching, Exorbyte MatchMaker
- Big Data Batching, Apache Hadoop
- Big Data Streaming, Apache Storm
We can help you set up the tools and processes needed to collect data, ready for analysis that can deliver new business insights.