AI Seminar
-------------------------------
Tuesday, April 6th, 2004
4:00 pm - 5:30 pm
1504 G.G. Brown (Lee Iacocca Room)
**NOTE THE DIFFERENT LOCATION**

"Data Mining Large Databases: Scalability Considerations"

Dr. Usama Fayyad
President, DMX Group
----------------------------------

Data mining methods have grown in importance as data sets have grown larger and more numerous. Many of the fundamental problems in performing data mining tasks rely on statistical estimation and modeling. However, many of the computational advances in statistical analysis methods paid little attention to the problems of scaling to massive data sets (in contrast to much of the work on the database systems side, where scalability is a central theme). In this talk, we present several algorithms and considerations in the area of scaling data mining algorithms to large databases. The approaches fundamentally rely on the notion of decomposing algorithms into basic components that more easily lend themselves to scaling to large data. It turns out that most popular algorithms can be decomposed into components that need to be close to the data, and others that can operate over reduced forms or sufficient statistics of the data. The key to a good decomposition is to keep the components that need to "touch the data" simple and fast. In addition, it is important to consider the number of times an algorithm requires a scan of the data.

After covering a couple of illustrative examples of scaling algorithms to large databases, we consider the converse approach: can we utilize fundamental notions in data mining to help solve classical database problems such as indexing high-dimensional data and estimating query selectivity? The theme here is that database considerations are important in data mining, while statistical and data mining considerations play an important role in database systems.
We wrap up the discussion of databases with brief coverage of some work on integrating data mining into a major commercial database system (Microsoft SQL Server). We conclude the talk with a summary of the numerous remaining technical challenges facing the field of data mining.
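
To make the decomposition idea in the abstract concrete, here is a minimal sketch in Python. It is not taken from the talk: the names `SuffStats` and `fit_gaussian` are hypothetical, and fitting a one-dimensional Gaussian is only an assumed stand-in for "a popular algorithm". The part that touches the data is a simple accumulation done in a single scan, and the modeling step works only from the resulting sufficient statistics.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SuffStats:
    """Sufficient statistics for a 1-D Gaussian: count, sum, sum of squares.

    Hypothetical illustration only; not the speaker's implementation.
    """
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def update(self, chunk: Iterable[float]) -> None:
        # The only component that "touches the data": one cheap pass per chunk.
        for x in chunk:
            self.n += 1
            self.total += x
            self.total_sq += x * x

    def merge(self, other: "SuffStats") -> "SuffStats":
        # Statistics from separate partitions combine by simple addition,
        # so partitions can be scanned independently.
        return SuffStats(self.n + other.n,
                         self.total + other.total,
                         self.total_sq + other.total_sq)


def fit_gaussian(stats: SuffStats) -> Tuple[float, float]:
    """Estimate (mean, variance) from the statistics alone, without the data.

    Uses the naive sum-of-squares formula, which can lose precision on
    badly scaled data; it is kept here for brevity.
    """
    if stats.n == 0:
        raise ValueError("no data seen")
    mean = stats.total / stats.n
    variance = stats.total_sq / stats.n - mean * mean
    return mean, max(variance, 0.0)


if __name__ == "__main__":
    # Chunks stand in for pages or partitions read during a single scan.
    chunks = [[1.0, 2.0], [3.0, 4.0]]
    stats = SuffStats()
    for chunk in chunks:
        stats.update(chunk)
    print(fit_gaussian(stats))
```

In this sketch the data is read exactly once, and the memory used is constant regardless of the number of rows, which is the property the abstract points to when it mentions counting scans of the data.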