Data Science

In the age of information technology, the interpretation of data is becoming ever more important. Large companies such as Google and Facebook use data mining techniques to extract useful information from huge amounts of data. At the Center of Expertise in Computer Vision we face similar challenges.

Computer vision is a field in which information is extracted from images. Usually the main goal is to automate visual inspection, for example license plate recognition on cars. As camera resolution increases and the demand for information extraction and interpretation grows, complexity increases as well. Because of this complexity, advanced techniques such as machine learning are needed to analyze image data and to build models that interpret it. The increasing resolution, frame rates, and variety of cameras require solutions to manage, and build models from, large amounts of (image) data, or Big Data. Feedback and visualization are needed to communicate with higher-level systems or to inform human operators. This triangle of data management, model building, and feedback is central to the field of Data Science.

The Center of Expertise in Computer Vision has identified Data Science as one of the important support fields of computer vision. We perform commercial and subsidized applied research projects in several areas of Data Science, usually, but not necessarily, using image data. Each semester we look for students willing to participate in these projects. Examples include [1]: optimization of sorting machine parameters, automatic antibiotic susceptibility testing, and comparison of inoculation methods. This article explains and breaks down Data Science further using a simple example application.

Data Science is a generic term for extracting knowledge from data. It is usually multidisciplinary and covers integrated solutions for knowledge extraction from data. Three parts can be distinguished:

Data acquisition and management: Sensor fusion, Feature Extraction, Big Data and Parallelization.

This first part of Data Science is closely related to Information Science. Data can come from a great variety of input sources. Typically, information is measured by sensors such as visual, environmental, and motion sensors. Sensors differ in the type of information they provide and the frequency at which they operate. This sensor information needs to be fused into common structures before anything useful can be done with it. The amount of raw data produced by cameras easily reaches multiple gigabytes. To extract information from image data efficiently (feature extraction), the extraction algorithms need to be parallelized. Big Data is the term for handling large amounts of data and applying knowledge modeling to it.
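As an illustration of fusing sensors that run at different frequencies, the sketch below attaches to each camera frame the sensor reading closest in time. It is a minimal example under stated assumptions, not a description of any particular project: the function name fuse_nearest, the 10 Hz camera, the 2 Hz temperature sensor, and all values are hypothetical.

```python
from bisect import bisect_left


def fuse_nearest(frame_times, sensor_times, sensor_values):
    """Attach to each frame timestamp the sensor reading closest in time.

    Assumes sensor_times is sorted ascending and not empty.
    """
    fused = []
    for t in frame_times:
        i = bisect_left(sensor_times, t)
        # Candidates: the reading just before t and the reading at or after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor_times)]
        nearest = min(candidates, key=lambda j: abs(sensor_times[j] - t))
        fused.append({"time": t, "sensor_value": sensor_values[nearest]})
    return fused


# Hypothetical data: camera frames at 10 Hz, temperature readings at 2 Hz.
frame_times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
temp_times = [0.0, 0.5]
temp_values = [20.0, 21.0]

records = fuse_nearest(frame_times, temp_times, temp_values)
for record in records:
    print(record)
```

Nearest-timestamp matching is only one possible fusion strategy; interpolating between readings would be an alternative when the slower sensor changes smoothly.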

Knowledge Modeling: Machine Learning and Pattern Recognition.

The second part of Data Science is closely related to Mathematics and Statistics. For intelligent systems to identify or recognize objects, or to make predictions based on data, machine learning or pattern recognition is used. The goal is to find a model for a specific task that relates input data to output data. The key is that this model is found by training on examples; typically no explicit model is specified. The main idea is that knowledge about the problem is captured by the trained model. For example, in license plate recognition a model is trained to relate images of digits to the known digits on license plates. After training, the goal is to read the digits of unknown license plates.
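The idea of learning a model from examples rather than specifying it explicitly can be sketched with a very simple classifier. The example below uses a nearest-centroid classifier, chosen here only for brevity; the tiny 3x3 binary "digit images", the labels, and the function names are hypothetical and far simpler than real license plate data.

```python
from collections import defaultdict


def train_centroids(examples):
    """Learn one mean feature vector (centroid) per label from (features, label) pairs."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(model, features):
    """Return the label whose centroid has the smallest squared Euclidean distance."""
    def distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(model, key=lambda label: distance(model[label]))


# Hypothetical training set: 3x3 binary images flattened row by row.
training = [
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], "1"),
    ([0, 1, 0, 0, 1, 0, 0, 1, 1], "1"),
    ([1, 1, 1, 0, 0, 1, 0, 0, 1], "7"),
    ([1, 1, 1, 0, 1, 1, 0, 0, 1], "7"),
]
model = train_centroids(training)

# An image that was not in the training set (a "7" with one pixel missing).
unknown = [1, 1, 0, 0, 0, 1, 0, 0, 1]
print(predict(model, unknown))
```

Nothing in the code says what a "7" looks like; that knowledge sits entirely in the centroids computed from the examples.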

Feedback: Reporting and Visualization.

The third part of Data Science is related to multiple fields, including Multimedia, Computer Vision, Information Science, Mathematics and Statistics. Data needs to be displayed or reported in a convenient way so that people can interpret it. For example, a rectangle is drawn around a recognized license plate. In addition, the performance of the license plate recognition is measured using metrics such as execution speed and accuracy. These results may need to be presented continuously on a website or another reporting medium.
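The two metrics mentioned above, accuracy and execution speed, can be collected with a few lines of code. This is a minimal sketch: the evaluate function, the stand-in recognizer, and the sample plates are all hypothetical, and a real evaluation would use actual images and a trained model.

```python
import time


def evaluate(recognize, samples):
    """Return accuracy and mean seconds per sample for a recognizer function.

    samples is a non-empty list of (input, expected_output) pairs.
    """
    correct = 0
    start = time.perf_counter()
    for image, expected in samples:
        if recognize(image) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(samples),
        "seconds_per_sample": elapsed / len(samples),
    }


def dummy_recognize(image):
    """Hypothetical stand-in for a recognizer: upper-cases a text 'image'."""
    return image.upper()


# Hypothetical samples; the last one is deliberately misread.
samples = [
    ("ab-12-cd", "AB-12-CD"),
    ("xy-99-zz", "XY-99-ZZ"),
    ("12-abc-3", "12-ABC-3"),
    ("q", "O"),
]

report = evaluate(dummy_recognize, samples)
print(f"accuracy: {report['accuracy']:.2f}")
print(f"seconds per sample: {report['seconds_per_sample']:.6f}")
```

A dictionary like report is easy to pass on to whatever reporting medium is used, such as a web page or a log file.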

[1] Publications of the Center of Expertise Computer Vision.