Data mining

Data mining is the practice of searching large stores of data for patterns. In the technical context of data warehousing the term is neutral. It also has a wider, pejorative usage that implies imposing patterns (and particularly causal relationships) on data where none exist.

Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1].

It is also known as knowledge-discovery in databases (KDD).

Used in the pejorative sense, "data mining" implies scanning the data for any relationship at all and, once one is found, devising an interesting explanation for it. The problem is that large data sets invariably contain some striking relationships that arise by chance and are peculiar to that data. Any conclusions reached this way are therefore likely to be highly suspect. Even so, some exploratory work is always required in applied statistical analysis to get a feel for the data, so the line between good statistical practice and data mining is not always clear.
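The point about chance relationships can be illustrated with a small simulation. The sketch below is an illustration only, not a method from the article: it fills a table with independent random numbers, so no real relationship exists, and then counts how many pairs of columns still look correlated. The function names, table size, and the 0.4 threshold are all assumptions chosen for the example.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_correlations(n_rows=30, n_columns=40, threshold=0.4, seed=0):
    """Count column pairs in pure noise whose |r| exceeds the threshold.

    Every column is drawn independently, so any pair that passes the
    threshold is a coincidence of this particular data set.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_rows)]
               for _ in range(n_columns)]
    pairs = list(itertools.combinations(range(n_columns), 2))
    hits = [(i, j) for i, j in pairs
            if abs(pearson(columns[i], columns[j])) > threshold]
    return len(hits), len(pairs)


if __name__ == "__main__":
    hits, total = count_chance_correlations()
    print(f"{hits} of {total} column pairs exceed the threshold by chance")
```

With 40 columns there are 780 pairs to compare, so even a threshold that a single pre-chosen pair would rarely pass is passed by some pairs when all of them are scanned.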

Here is an example. The insurance industry has found that people with poor credit records tend to be more likely to make car insurance claims, and insurers have modified their pricing accordingly. While this appears to be a statistically sound finding, politicians in the United States have questioned its legitimacy on the 'common-sense' grounds that how a person handles a credit card does not affect how they handle a car. A finding that is statistically legitimate may therefore still fail to hold up to public scrutiny.

A more significant danger is finding correlations that do not really exist. An example comes from the investment website The Motley Fool. In the late 1990s the website promoted a suggested investment portfolio known as the Foolish Four, which was based on a data mining analysis of trends in the stock market. Further research in the early 2000s showed that the correlations were an artifact of the particular data set used rather than a reflection of reality. This is one of many similar false findings linked to the stock market.
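How a pattern can be an artifact of one data set may be easier to see in code. The following sketch does not reproduce the Foolish Four analysis; it is a toy under stated assumptions. It invents many hypothetical trading rules whose returns are pure noise, picks the rule that looked best on the first half of the data, and then measures that same rule on the second half. All names and parameters are made up for the illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def best_rule_in_and_out_of_sample(n_rules=200, n_periods=60, seed=0):
    """Pick the best-looking noise rule in sample, then test it out of sample."""
    rng = random.Random(seed)
    # Every hypothetical rule has a true average return of zero.
    returns = [[rng.gauss(0.0, 1.0) for _ in range(2 * n_periods)]
               for _ in range(n_rules)]
    # "Mine" the first half: keep whichever rule looked best in hindsight.
    best = max(range(n_rules), key=lambda r: mean(returns[r][:n_periods]))
    in_sample = mean(returns[best][:n_periods])
    out_of_sample = mean(returns[best][n_periods:])
    return in_sample, out_of_sample


def average_over_trials(n_trials=20):
    """Average the in-sample and out-of-sample results over several seeds."""
    results = [best_rule_in_and_out_of_sample(seed=s) for s in range(n_trials)]
    avg_in = mean([r[0] for r in results])
    avg_out = mean([r[1] for r in results])
    return avg_in, avg_out


if __name__ == "__main__":
    avg_in, avg_out = average_over_trials()
    print(f"selected rule, in sample:     {avg_in:+.3f}")
    print(f"selected rule, out of sample: {avg_out:+.3f}")
```

The selected rule looks profitable on the data used to choose it because it was chosen for exactly that reason, and the apparent edge should shrink toward zero on data it has not seen. Holding back data for this kind of check is one common guard against mined results.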

There are also privacy concerns associated with data mining. For example, an employer with access to medical records could screen out people who have diabetes or have had a heart attack. Screening out such employees would cut insurance costs, but it creates ethical and legal problems.

There are many legitimate uses of data mining. For example, a database of all prescription drugs taken by people can be used to find combinations of drugs that cause adverse reactions. Since a given combination may occur in only 100 people, with the reaction appearing in 10 of them, a single case seen in isolation may not raise a red flag. Such a database could reveal these reactions and save lives. However, there is also huge potential for abuse of such a database.
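The prescription example can be sketched as a simple counting exercise. The code below is a minimal illustration, not a real pharmacovigilance method: the record layout, the drug names, and the flagging thresholds are all hypothetical. It reuses the figures from the text (100 people on a combination, 10 of whom react) alongside made-up background records.

```python
from collections import Counter
from itertools import combinations


def reaction_counts_by_pair(records):
    """Map each drug pair to (patients taking both, reactions among them)."""
    takers = Counter()
    reactions = Counter()
    for drugs, had_reaction in records:
        for pair in combinations(sorted(set(drugs)), 2):
            takers[pair] += 1
            if had_reaction:
                reactions[pair] += 1
    return {pair: (takers[pair], reactions[pair]) for pair in takers}


def flag_pairs(records, min_patients=50, min_rate_ratio=5.0):
    """Flag pairs whose reaction rate is well above the overall rate.

    Both thresholds are arbitrary assumptions for this sketch.
    """
    overall_rate = sum(1 for _, reacted in records if reacted) / len(records)
    flagged = []
    for pair, (n, k) in sorted(reaction_counts_by_pair(records).items()):
        if n >= min_patients and k / n >= min_rate_ratio * overall_rate:
            flagged.append((pair, n, k))
    return flagged


# Made-up records: (drugs taken, whether an adverse reaction was reported).
records = (
    [(("drug_a", "drug_b"), True)] * 10       # the combination from the text
    + [(("drug_a", "drug_b"), False)] * 90
    + [(("drug_a",), True)] * 50              # hypothetical background data
    + [(("drug_a",), False)] * 4950
    + [(("drug_b", "drug_c"), True)] * 50
    + [(("drug_b", "drug_c"), False)] * 4950
)

if __name__ == "__main__":
    for pair, n, k in flag_pairs(records):
        print(pair, f"{k} reactions in {n} patients")
```

No single doctor seeing one of the 10 cases would notice the pattern, but the aggregated counts make it stand out against the background rate. A flag of this kind is only a hypothesis to investigate, since scanning many pairs will also turn up chance associations.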

In short, data mining yields information that would not otherwise be available, but that information must be properly interpreted to be useful. When the data collected involves individual people, it also raises many questions of privacy, legality, and ethics.

References

[1] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, "Knowledge Discovery in Databases: An Overview", AI Magazine, Fall 1992, pp. 213-228.

Note: if you got here by looking for the rapper KDD, see KDD (rapper).

All Wikipedia text is available under the terms of the GNU Free Documentation License
