AN OVERVIEW OF KNOWLEDGE DISCOVERY IN DATABASE (KDD) PROCESS TOWARDS DATA MINING

by admin on July 2, 2011

1. INTRODUCTION

         Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing.

          The rapid emergence of electronic data management methods has led some to call recent times the “Information Age.” Powerful database systems for collecting and managing data are in use in virtually all large and mid-range companies; there is hardly a transaction that does not generate a computer record somewhere. Each year more operations are computerized, and all of them accumulate data on operations, activities, and performance. All these data hold valuable information, e.g., trends and patterns, which could be used to improve business decisions and optimize success.

          However, today’s databases contain so much data that it becomes almost impossible to manually analyze them for valuable decision-making information. In many cases, hundreds of independent attributes need to be simultaneously considered in order to accurately model system behavior. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 [1] (Piatetsky-Shapiro 1991)  to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields. In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

 2. THE INTERDISCIPLINARY NATURE OF KDD

         KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets. The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process.

         A natural question is, how is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported.

         The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge.

         Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets. Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation [10] (Shrager and Langley 1990) and causal modeling for the inference of causal models from data [11] (Spirtes, Glymour, and Scheines 1993). Statistics in particular has much in common with KDD. Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced.

        The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician’s “art” of hypothesis selection.
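
          The concern about searching long enough in random data can be illustrated with a small simulation. The following is a minimal sketch, not a method taken from the literature cited here: the function names, the sample size of 20 records, and the choice of Pearson correlation as the "pattern" are all assumptions made for illustration. It generates a purely random target attribute and reports the strongest correlation found among a number of equally random candidate attributes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_records, n_attributes, seed=0):
    """Strongest |correlation| between a random target and random attributes.

    Every value is independent noise, so any strong correlation found here
    is an artifact of searching, not a real pattern.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_records)]
    best = 0.0
    for _ in range(n_attributes):
        attribute = [rng.gauss(0, 1) for _ in range(n_records)]
        best = max(best, abs(pearson(attribute, target)))
    return best


# Same seed, so the single attribute tried in `few` is also the first one in `many`.
few = best_spurious_correlation(n_records=20, n_attributes=1)
many = best_spurious_correlation(n_records=20, n_attributes=1000)
print(f"best |r| after 1 attribute:     {few:.2f}")
print(f"best |r| after 1000 attributes: {many:.2f}")
```

          Because the search over 1000 attributes includes the single attribute tried first, its best correlation can only be equal or larger. With so few records it is typically large enough to look like a real relationship, even though none exists.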

         A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible. A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways:

(1) Data Cleaning

(2) Data Access.

 Data cleaning

         As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.
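
         The two cleaning issues named above, a single naming convention and a uniform representation of missing data, can be sketched as follows. The column aliases and the missing-value markers are hypothetical, chosen only for illustration; a real warehouse would derive them from its own source systems.

```python
# Hypothetical aliases used by two source systems for the same attributes.
CANONICAL_NAMES = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "amt": "amount",
    "Amount": "amount",
}

# Hypothetical markers that different systems use to mean "no value".
MISSING_MARKERS = {"", "NA", "N/A", "null", "-999"}


def clean_record(record):
    """Rename attributes to canonical names and map missing markers to None."""
    cleaned = {}
    for name, value in record.items():
        canonical = CANONICAL_NAMES.get(name, name)
        if value is None or str(value).strip() in MISSING_MARKERS:
            cleaned[canonical] = None
        else:
            cleaned[canonical] = value
    return cleaned


records = [
    {"cust_id": 17, "amt": "NA"},
    {"CustomerID": 42, "Amount": 19.5},
]
cleaned = [clean_record(r) for r in records]
print(cleaned)
```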

 Data access

        Uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline). Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, what else do we do with all the data? This is where opportunities for KDD naturally arise.

        A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by [12] Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.
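
        The kind of summary along several dimensions that OLAP tools provide can be sketched in a few lines. This is only an illustration of the idea of a roll-up, not the interface of any OLAP product; the `roll_up` function and the sales records are assumptions.

```python
from collections import defaultdict


def roll_up(facts, dimensions, measure):
    """Sum `measure` over all facts, grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


# Hypothetical fact table: one row per (region, product, quarter).
sales = [
    {"region": "East", "product": "A", "quarter": "Q1", "revenue": 100.0},
    {"region": "East", "product": "B", "quarter": "Q1", "revenue": 50.0},
    {"region": "West", "product": "A", "quarter": "Q1", "revenue": 70.0},
    {"region": "West", "product": "A", "quarter": "Q2", "revenue": 30.0},
]

by_region = roll_up(sales, ["region"], "revenue")
by_region_product = roll_up(sales, ["region", "product"], "revenue")
print(by_region)
print(by_region_product)
```

        The same fact table is summarized at two levels of detail simply by changing the list of dimensions, which is the interactive breakdown that the analyst drives by hand in OLAP and that KDD tools try to automate.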

 3. DATA MINING AND KNOWLEDGE DISCOVERY IN THE REAL WORLD

          A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

          In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images [2] (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (1 terabyte = 10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See [3] Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

          In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing

          In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week [4] (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15- percent increase in credit-card use. Another notable marketing application is market-basket analysis [5] (Agrawal et al. 1996) systems, which find patterns such as, “If customer bought X, he/she is also likely to buy Y and Z.” Such patterns are valuable to retailers.

 Investment

          Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market [6] (Hall, Mani, and Barr 1996).

 Fraud detection

          HNC Falcon and Nestor PRISM systems are used for monitoring credit card fraud, watching over millions of accounts. The FAIS system [7] (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money laundering activity.

 Manufacturing

           The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. Clustering methods are used to derive families of faults. CASSIOPEE received the European first prize for innovative applications.

 Telecommunications

          The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks [8] (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.
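
          The idea of locating frequently occurring alarm episodes can be illustrated with a deliberately simplified brute-force search. This is not the TASA algorithm: it only counts ordered pairs of different alarm types that occur within a fixed time window, and the function name, alarm names, window, and threshold are all assumptions.

```python
from collections import Counter


def frequent_pairs(events, window, min_count):
    """Count episodes "alarm A followed by alarm B within `window` time units".

    `events` is a list of (timestamp, alarm_type) tuples sorted by timestamp.
    Each occurrence of A contributes at most once to each pair (A, B).
    """
    counts = Counter()
    for i, (t_i, a_i) in enumerate(events):
        seen = set()
        for t_j, a_j in events[i + 1:]:
            if t_j - t_i > window:
                break  # events are sorted, so later ones are outside the window
            if a_j != a_i and a_j not in seen:
                seen.add(a_j)
                counts[(a_i, a_j)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical alarm stream.
alarms = [
    (0, "link_down"), (2, "high_error_rate"),
    (10, "link_down"), (11, "high_error_rate"),
    (20, "power_warn"),
    (30, "link_down"), (33, "high_error_rate"),
]
rules = frequent_pairs(alarms, window=5, min_count=3)
print(rules)
```

          Each surviving pair can be read as a candidate rule of the form "if A occurs, B tends to follow within the window". The pruning, grouping, and ordering of such candidates is the part of the work that the text attributes to TASA.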

 Data cleaning

           The MERGE-PURGE system was applied to the identification of duplicate welfare claims [9] (Hernandez and Stolfo 1995). It was used successfully on data from the Welfare Department of the State of Washington.

           In other areas, a well-publicized system is IBM’s ADVANCED SCOUT, a specialized data-mining system that helps National Basketball Association (NBA) coaches organize and interpret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Supersonics, which reached the NBA finals.

           Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, really successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources. For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like.

 4. KNOWLEDGE DISCOVERY AND DATA MINING

           This section provides an introduction to the area of knowledge discovery and to data mining tasks.

 The Knowledge Discovery Process

           There is still some confusion about the terms Knowledge Discovery in Databases (KDD) and data mining. Often these two terms are used interchangeably. We use the term KDD to denote the overall process of turning low-level data into high-level knowledge. A simple definition of KDD is as follows: Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. We also adopt the commonly used definition of data mining as the extraction of patterns or models from observed data. Although data mining is at the core of the knowledge discovery process, this step usually takes only a small part (estimated at 15% to 25%) of the overall effort. Hence data mining is just one step in the overall KDD process.

            Other steps involve, for example:

(1) Developing an understanding of the application domain and the goals of the data mining process

(2) Acquiring or selecting a target data set

(3) Integrating and checking the data set

(4) Data cleaning, preprocessing, and transformation

(5) Model development and hypothesis building

(6) Choosing suitable data mining algorithms

(7) Result interpretation and visualization

(8) Result testing and verification

(9) Using and maintaining the discovered knowledge.

 Data Mining Tasks

          At the core of the KDD process are the data mining methods for extracting patterns from data. These methods can have different goals, dependent on the intended outcome of the overall KDD process. It should also be noted that several methods with different goals may be applied successively to achieve a desired result. For example, to determine which customers are likely to buy a new product, a business analyst might need to first use clustering to segment the customer database, and then apply regression to predict buying behavior for each cluster. Most data mining goals fall under the following categories:

Data Processing

           Depending on the goals and requirements of the KDD process, analysts may select, filter, aggregate, sample, clean and/or transform data. Automating some of the most typical data processing tasks and integrating them seamlessly into the overall process may eliminate or at least greatly reduce the need for programming specialized routines and for data export/import, thus improving the analyst’s productivity.

 Prediction

          Given a data item and a predictive model, predict the value for a specific attribute of the data item. For example, given a predictive model of credit card transactions, predict the likelihood that a specific transaction is fraudulent.

 Regression

           Given a set of data items, regression is the analysis of the dependency of some attribute values upon the values of other attributes in the same item, and the automatic production of a model that can predict these attribute values for new records. For example, given a data set of credit card transactions, build a model that can predict the likelihood of fraudulence for new transactions.
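
           As a minimal sketch of regression in this sense, the following fits a straight line by ordinary least squares and uses it to predict the attribute value for a new record. The attribute names and values are hypothetical, and a single numeric predictor is assumed for brevity; the fraud example in the text would need a richer model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict the dependent attribute for a new record."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical records: (income, monthly spend), both in thousands.
income = [20.0, 30.0, 40.0, 50.0]
spend = [5.0, 7.0, 9.0, 11.0]

model = fit_line(income, spend)
prediction = predict(model, 60.0)
print(model, prediction)
```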

 Classification

           Given a set of predefined categorical classes, determine to which of these classes a specific data item belongs. For example, given classes of patients that correspond to medical treatment responses, identify the form of treatment to which a new patient is most likely to respond.
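
           One of the simplest ways to do this is a nearest-centroid classifier, sketched below. It is chosen here only because it is short. The text does not prescribe any particular classifier, and the two numeric features and the class labels are hypothetical.

```python
import math


def train_centroids(examples):
    """Compute the mean feature vector of each class.

    `examples` is a list of (feature_vector, class_label) pairs.
    """
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(centroids, features):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical patients described by two measurements, labeled by the
# treatment they responded to.
patients = [
    ([1.0, 1.0], "treatment_A"),
    ([1.5, 2.0], "treatment_A"),
    ([8.0, 8.0], "treatment_B"),
    ([9.0, 9.5], "treatment_B"),
]
centroids = train_centroids(patients)
new_patient_class = classify(centroids, [2.0, 2.0])
print(new_patient_class)
```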

 Clustering

           Given a set of data items, partition this set into a set of classes such that items with similar characteristics are grouped together. Clustering is best used for finding groups of items that are similar. For example, given a data set of customers, identify subgroups of customers that have a similar buying behavior.
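
           A common way to compute such a partition is the k-means procedure, sketched below under simplifying assumptions: the number of clusters and the starting centroids are supplied by the caller, and the customer data are hypothetical. The text itself does not single out any clustering algorithm.

```python
import math


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # Keep the old centroid if a cluster happens to be empty.
            [sum(coord) / len(cluster) for coord in zip(*cluster)] if cluster else list(old)
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


# Hypothetical customers: (visits per month, average basket value).
customers = [
    (1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]
initial = [customers[0], customers[3]]
centroids, clusters = k_means(customers, initial)
print(clusters)
```

           The result depends on the starting centroids, which is why they are passed in explicitly here rather than chosen at random.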

 Link Analysis (Associations)

           Given a set of data items, identify relationships between attributes and items such that the presence of one pattern implies the presence of another pattern. These relations may be associations between attributes within the same data item. The investigation of relationships between items over a period of time is also often referred to as ‘sequential pattern analysis’.
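
           For relationships of the form "the presence of X implies the presence of Y", two standard measures are support (the fraction of data items containing both X and Y) and confidence (the fraction of items containing X that also contain Y). The sketch below searches single-item rules by brute force; the baskets and thresholds are hypothetical, and practical systems use far more efficient search.

```python
from itertools import permutations


def association_rules(transactions, min_support, min_confidence):
    """Return (x, y, support, confidence) for every rule x -> y that
    meets both thresholds, considering single items only."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    rules = []
    for x, y in permutations(items, 2):
        both = sum(1 for t in transactions if x in t and y in t)
        has_x = sum(1 for t in transactions if x in t)
        support = both / n
        confidence = both / has_x if has_x else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk", "butter", "bread"},
]
rules = association_rules(baskets, min_support=0.5, min_confidence=0.9)
print(rules)
```

           Note that a rule and its reverse can differ: every basket with butter also contains bread, but not every basket with bread contains butter, so only the first direction passes the confidence threshold.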

 Model Visualization

           Visualization plays an important role in making the discovered knowledge understandable and interpretable by humans. Besides, the human eye-brain system itself still remains the best pattern-recognition device known. Visualization techniques may range from simple scatter plots and histogram plots over parallel coordinates to 3D movies.

 5. THE DATA-MINING STEP OF THE KDD PROCESS

          The data-mining component of the KDD process often involves repeated iterative application of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algorithms that incorporate these methods. The knowledge discovery goals are defined by the intended use of the system.

We can distinguish two types of goals:

 (1) Verification

 (2) Discovery.

          With verification, the system is limited to verifying the user’s hypothesis. With discovery, the system autonomously finds new patterns. We further subdivide the discovery goal into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presentation to a user in a human-understandable form.

           In this article, we are primarily concerned with discovery-oriented data mining. Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required.

Two primary mathematical formalisms are used in model fitting:

            (1)  Statistical

            (2) Logical.

          The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

           Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques.

6. RESEARCH AND APPLICATION CHALLENGES

We outline some of the current primary research and application challenges for KDD.

          This list is by no means exhaustive and is intended to give the reader a feel for the types of problem that KDD practitioners wrestle with.

 Larger databases

          Databases with hundreds of fields and tables, millions of records, and a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms, sampling, approximation, and massively parallel processing.
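
          As one example of sampling from a data set too large to hold in memory, the sketch below uses reservoir sampling, which draws a uniform random sample of fixed size in a single pass. The technique is introduced here for illustration and is not named in the text; the function name and parameters are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k records chosen uniformly at random from an iterable of
    unknown length, reading each record exactly once."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a random slot with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = record
    return sample


# A generator stands in for a table that is scanned once from disk.
records = (row_id for row_id in range(1_000_000))
sample = reservoir_sample(records, k=5, seed=42)
print(sample)
```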

 High dimensionality

          Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables), so the dimensionality of the problem is high. A high-dimensional data set creates problems by increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.

 Overfitting

           When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.
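
           Cross-validation, the first remedy listed above, can be sketched as follows: the data are split into k folds, and each fold in turn is held out for testing while the model is fitted on the rest. The helper names and the trivial "predict the mean" model are assumptions made to keep the example short.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]  # every k-th item, starting at `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validated_mse(xs, ys, fit, predict, k=5):
    """Mean squared error measured only on records the model did not see."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((predict(model, xs[i]) - ys[i]) ** 2 for i in test)
    return sum(errors) / len(errors)


def fit_mean(xs, ys):
    """Toy model: remember the mean of the training targets."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    """Toy model: always predict the remembered mean."""
    return model


xs = list(range(10))
ys = [3.0] * 10
mse = cross_validated_mse(xs, ys, fit_mean, predict_mean, k=5)
print(mse)
```

           Any pair of `fit` and `predict` functions can be plugged in, so competing models can be compared on held-out error rather than on how well they reproduce the training data.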

 Assessing statistical significance

           A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant.

           ... knowledge is important in all the steps of the KDD process. Bayesian approaches [13] (for example, Cheeseman [1990]) use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search [14] (for example, Simoudis, Livezey, and Kerber [1995]).

 Integration with other systems

            A standalone discovery system might not be very useful. Typical integration issues include integration with a database management system (for example, through a query interface), integration with spreadsheets and visualization tools, and accommodation of real-time sensor readings. Examples of integrated KDD systems are described by [14] Simoudis, Livezey, and Kerber (1995).

 7. CONCLUSION

           This article represents a step toward a common framework that we hope will ultimately provide a unifying vision of the common overall goals and methods used in KDD. We hope this will eventually lead to a better understanding of the variety of approaches in this multidisciplinary field and how they fit together.

 8. REFERENCES

[1] Piatetsky-Shapiro, G. 1991. Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop. AI Magazine 11(5): 68–70.

 [2] Fayyad, U. M.; Djorgovski, S. G.; and Weir, N. 1996. From Digitized Images to On-Line Catalogs: Data Mining a Sky Survey. AI Magazine 17(2): 51–66.

 [3] Fayyad, U. M.; Haussler, D.; and Stolorz, P. 1996. KDD for Science Data Analysis: Issues and Examples. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 50–56. Menlo Park, Calif.: American Association for Artificial Intelligence.

 [4] Berry, J. 1994. Database Marketing. Business Week, September 5, 56–62.

[5] Agrawal, R.; and Psaila, G. 1995. Active Data Mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 3–8. Menlo Park, Calif.: American Association for Artificial Intelligence.

[6] Hall, J.; Mani, G.; and Barr, D. 1996. Applying Computational Intelligence to the Investment Process. In Proceedings of CIFER-96: Computational Intelligence in Financial Engineering. Washington, D.C.: IEEE Computer Society.

 [7] Senator, T.; Goldberg, H. G.; Wooton, J.; Cottini, M. A.; Umarkhan, A. F.; Klinger, C. D.; Llamas, W. M.; Marrone, M. P.; and Wong, R. W. H. 1995. The Financial Crimes Enforcement Network AI System (FAIS): Identifying Potential Money Laundering from Reports of Large Cash Transactions. AI Magazine 16(4): 21–39.

 [8] Mannila, H.; Toivonen, H.; and Verkamo, A. I. 1995. Discovering Frequent Episodes in Sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 210–215. Menlo Park, Calif.: American Association for Artificial Intelligence.

 
