## Data Mining

Data mining is the process of searching through (typically large) data sets and extracting smaller subsets that reveal patterns or relationships that were not obvious before the extraction. As data sets grow larger, so does the need for automated tools. Usually, the purpose of data mining is some form of knowledge discovery, or at least confirmation of suspected knowledge.

Mining data for patterns and assessing their relevance usually begins with smaller data sets and is done manually. As the size of the data set grows, automated tools become necessary. Often, a database is used to collect and organize some basic attributes of the data. The obvious attributes are abstracted into “fields” (columns) in the database that can be used to sort and query the data.
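As a minimal sketch of this idea, the snippet below builds a small in-memory database whose columns are such abstracted fields, then sorts and queries them to surface a basic pattern. The table, its fields, and the data are invented for illustration.

```python
import sqlite3

# Hypothetical example: web-server requests with a few "obvious"
# attributes abstracted into fields (columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (path TEXT, status INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [("/home", 200, 512), ("/login", 404, 128),
     ("/home", 200, 530), ("/admin", 404, 64)],
)

# Sorting and querying on those fields reveals a simple pattern:
# which paths generate 404 errors, and how often?
rows = conn.execute(
    "SELECT path, COUNT(*) AS errors FROM requests "
    "WHERE status = 404 GROUP BY path ORDER BY errors DESC"
).fetchall()
print(rows)
```

The same query reformulated with different fields, filters, or sort orders is the manual starting point that the rest of this section builds on.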

Simple manual inspection and computing some basic statistics on the data can reveal patterns. Visual representations can make this easier and faster. The process continues by changing the point of view:

- Filtering to smaller data sets – patterns that were “lost” in the larger data set can become easier to see; the results also suggest further actions.
- Changing sort criteria – altering the sort order and sort filters changes the way the data looks.
- Changing the statistics being used.
- Changing the visual presentation approach being used.

Hybrid combinations of all of these techniques to change perspective can reveal new approaches to find patterns and make correlations. But at some point, these manual and visual techniques offer diminishing returns and a different level of automation is required.
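As a small illustration of the filtering and statistics techniques above (the data set and its regional split are invented), an aggregate statistic can hide a pattern that filtering to a smaller data set makes obvious:

```python
from statistics import mean

# Hypothetical daily sales figures tagged by region; the overall mean
# hides a regional difference that filtering makes visible.
sales = [
    {"region": "north", "amount": 100},
    {"region": "north", "amount": 110},
    {"region": "south", "amount": 300},
    {"region": "south", "amount": 290},
]

overall = mean(r["amount"] for r in sales)  # one statistic over everything

# Filtering to smaller data sets changes the point of view:
north = mean(r["amount"] for r in sales if r["region"] == "north")
south = mean(r["amount"] for r in sales if r["region"] == "south")

# Changing the sort criteria changes the way the data looks:
by_amount = sorted(sales, key=lambda r: r["amount"], reverse=True)

print(overall, north, south)  # 200.0 105 295
```

Combining such views by hand works for small data, but as the section notes, it stops scaling and more automated techniques take over.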

Mathematical and logical techniques developed in the field of artificial intelligence offer a more efficient and more accurate means of handling larger data sets and finding more elusive data patterns. These techniques include:

- Supervised Learning – using labeled data for training
  - Classification
  - Regression
- Unsupervised Learning – using unlabeled data
  - Clustering
  - Blind signal separation
- Reinforcement Learning – techniques that try to find a “policy” that will maximize some output value
  - Game theory
  - Markov Decision Processes – use discrete time steps to describe transitions from one state to another according to probability weightings
  - Monte Carlo Methods – tend to rely on random sampling to create models when there is significant uncertainty in the input data
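To make one of these concrete, below is a minimal sketch of k-means clustering, a common unsupervised-learning technique. It uses only the standard library; the points, the choice of k, and the iteration count are invented for illustration, and real implementations add convergence checks and better initialization.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups with plain k-means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive initialization: k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

# Two visually obvious groups of points; k-means recovers them without labels.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(pts, k=2)
print(clusters)
```

No labels are provided anywhere: the grouping emerges purely from the geometry of the data, which is what distinguishes this from the supervised techniques in the list.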

SEE ALSO:

**Statistical Data Mining Tutorials** – [autonlab.org]

**Free Data Mining Tutorial Booklet** – [twocrows.com]

**Mining of Massive Datasets** – [stanford.edu]