Authors: Mehmed Kantardzic
Although it has proved invaluable to the company and their clients in its current incarnation, additional features are being planned and implemented to extend the LifeSeq functionality into research areas such as
Although the LifeSeq database is an invaluable research resource, queries to the database often produce very large data sets that are difficult to analyze in text format. For this reason, Incyte developed the LifeSeq 3-D application that provides visualization of data sets, and also allows users to cluster or classify and display information about genes. The 3-D version has been developed using the Silicon Graphics MineSet tool. This version has customized functions that let researchers explore data from LifeSeq and discover novel genes within the context of targeted protein functions and tissue types.
Maine Medical Center (USA)
Maine Medical Center—a teaching hospital and the major community hospital for the Portland, Maine, area—has been named in the U.S. News and World Report Best Hospitals list twice in orthopedics and heart care. In order to improve quality of patient care in measurable ways, Maine Medical Center has used scorecards as key performance indicators. Using SAS, the hospital creates balanced scorecards that measure everything from staff hand washing compliance to whether a congestive heart patient is actually offered a flu vaccination. One hundred percent of heart failure patients are getting quality care as benchmarked by national organizations, and a medication error reduction process has improved by 35%.
http://www.sas.com/success/mainemedicalcenter.html
In November 2009, the Central Maine Medical Group (CMMG) announced the launch of a prevention and screening campaign called “Saving Lives Through Evidence-Based Medicine.” The initiative redesigns the way the group works as a team of providers, using data-mining techniques to make certain that each patient undergoes the necessary screening tests identified by the current medical literature. In particular, the data-mining process identifies patients at risk for an undetected health problem.
http://www.cmmc.org/news.taf
B.5 DATA MINING IN SCIENCE AND ENGINEERING
Enormous amounts of data have been generated in science and engineering, for example, in cosmology, molecular biology, and chemical engineering. In cosmology, advanced computational tools are needed to help astronomers understand the origin of large-scale cosmological structures as well as the formation and evolution of their astrophysical components (galaxies, quasars, and clusters). Over 3 terabytes of image data have been collected by the Digital Palomar Observatory Sky Survey, which contain on the order of 2 billion sky objects. It has been a challenging task for astronomers to catalog the entire data set, that is, a record of the sky location of each object and its corresponding classification such as a star or a galaxy. The Sky Image Cataloguing and Analysis Tool (SKICAT) has been developed to automate this task. The SKICAT system integrates methods from machine learning, image processing, classification, and databases, and it is reported to be able to classify objects, replacing visual classification, with high accuracy.
In molecular biology, recent technological advances are applied in such areas as molecular genetics, protein sequencing, and macromolecular structure determination, as was mentioned earlier. Artificial neural networks and some advanced statistical methods have shown particular promise in these applications. In chemical engineering, advanced models have been used to describe the interaction among various chemical processes, and new tools have been developed to visualize these structures and processes. Let us have a brief look at a few important cases of data-mining applications in engineering problems. Pavilion Technologies’ Process Insights, an application-development tool that combines neural networks, fuzzy logic, and statistical methods, has been successfully used by Eastman Kodak and other companies to develop chemical manufacturing and control applications that reduce waste, improve product quality, and increase plant throughput. Historical process data is used to build a predictive model of plant behavior, and this model is then used to change the control set points in the plant for optimization.
DataEngine is another data-mining tool that has been used in a wide range of engineering applications, especially in the process industry. The basic components of the tool are neural networks, fuzzy logic, and advanced graphical user interfaces. The tool has been applied to process analysis in the chemical, steel, and rubber industries, resulting in a saving in input materials and improvements in quality and productivity. Successful data-mining applications in some industrial complexes and engineering environments follow.
Boeing
To improve its manufacturing process, Boeing has successfully applied machine-learning algorithms to the discovery of informative and useful rules from its plant data. In particular, it has been found that it is more beneficial to seek concise predictive rules that cover small subsets of the data, rather than generate general decision trees. A variety of rules were extracted to predict such events as when a manufactured part is likely to fail inspection or when a delay will occur at a particular machine. These rules have been found to facilitate the identification of relatively rare but potentially important anomalies.
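The difference between a concise rule and a general model can be made concrete by measuring how much of the data a rule covers and how often it is right on that subset. The following is a minimal sketch under stated assumptions: the records, the field names (`machine`, `temp`, `failed`), and the rule itself are invented for illustration and are not taken from Boeing's data or methods.

```python
def evaluate_rule(records, condition, outcome_key="failed"):
    """Return (coverage, precision) of a rule over a list of record dicts.

    coverage  = fraction of all records the rule applies to
    precision = fraction of covered records where the outcome is True
    """
    covered = [r for r in records if condition(r)]
    coverage = len(covered) / len(records)
    precision = (
        sum(1 for r in covered if r[outcome_key]) / len(covered)
        if covered else 0.0
    )
    return coverage, precision


# Hypothetical inspection records (illustrative values only).
records = [
    {"machine": "M1", "temp": 180, "failed": False},
    {"machine": "M1", "temp": 190, "failed": False},
    {"machine": "M2", "temp": 205, "failed": False},
    {"machine": "M2", "temp": 185, "failed": False},
    {"machine": "M3", "temp": 210, "failed": True},
    {"machine": "M3", "temp": 215, "failed": True},
    {"machine": "M3", "temp": 180, "failed": False},
    {"machine": "M1", "temp": 195, "failed": True},
    {"machine": "M2", "temp": 175, "failed": False},
    {"machine": "M1", "temp": 185, "failed": False},
]

# Hypothetical concise rule: parts made on M3 above 200 degrees tend to fail.
def hot_m3(record):
    return record["machine"] == "M3" and record["temp"] > 200

coverage, precision = evaluate_rule(records, hot_m3)
base_rate = sum(1 for r in records if r["failed"]) / len(records)
print(f"coverage={coverage:.0%}, precision={precision:.0%}, base rate={base_rate:.0%}")
```

In this toy data the rule applies to only a small share of the records, yet it is far more reliable on that subset than the overall failure rate would suggest, which is the kind of rare but important pattern the paragraph describes.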
R.R. Donnelley
This is an interesting application of data-mining technology in printing press control. During rotogravure printing, grooves sometimes develop on the printing cylinder, ruining the final product. This phenomenon is known as banding. The printing company R.R. Donnelley hired a consultant for advice on how to reduce its banding problems, and at the same time used machine learning to create rules for determining the process parameters (e.g., the viscosity of the ink) to reduce banding. The learned rules were superior to the consultant’s advice in that they were more specific to the plant where the training data was collected, and they filled gaps in the consultant’s advice and thus were more complete. In fact, one learned rule contradicted the consultant’s advice and proved to be correct. The learned rules have been in everyday use in the Donnelley plant in Gallatin, Tennessee, for over a decade and have reduced the number of banding occurrences from 538 to 26.
Southern California Gas Company
The Southern California Gas Company is using SAS software as a strategic marketing tool. The company maintains a data mart called the Customer Marketing Information Database that contains internal billing and order data along with external demographic data. According to the company, it has saved hundreds of thousands of dollars by identifying and discarding ineffective marketing practices.
WebWatcher
Despite the best efforts of Web designers, we have all had the experience of not being able to find a certain Web page we want. A bad design for a commercial Web site obviously means the loss of customers. One challenge for the data-mining community has been the creation of “adaptive Web sites”: Web sites that automatically improve their organization and presentation by learning from user-access patterns. One early attempt is WebWatcher, an operational tour guide for the WWW. It learns to predict which links users will follow on a particular page, highlights those links along the way, and learns from experience to improve its advice-giving skills. The prediction is based on many previous access patterns and the current user’s stated interests. It has also been reported that Microsoft is to include in its electronic-commerce system a feature called Intelligent Cross Sell that can be used to analyze the activity of shoppers on a Web site and automatically adapt the site to each shopper’s preferences.
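The idea of learning from user-access patterns can be illustrated with a very small frequency model. This sketch is not WebWatcher's actual algorithm, which also takes the user's stated interests into account; it only counts which link was followed from each page in past sessions and suggests the most frequent ones. The session data and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_link_counts(sessions):
    """Count page-to-next-page transitions observed in past sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for page, next_page in zip(session, session[1:]):
            counts[page][next_page] += 1
    return counts


def suggest_links(counts, page, k=2):
    """Return up to k links most often followed from the given page."""
    return [link for link, _ in counts[page].most_common(k)]


# Hypothetical access logs: each list is one visitor's click path.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "reviews"],
    ["home", "support"],
    ["home", "products", "cart"],
]

counts = build_link_counts(sessions)
print(suggest_links(counts, "home"))           # links to highlight on "home"
print(suggest_links(counts, "products", k=1))  # single best link on "products"
```

A site could highlight the suggested links for new visitors and update the counts as further sessions arrive, which is one simple way to "learn from experience."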
AbitibiBowater Inc. (Canada)
AbitibiBowater Inc. is a pulp and paper manufacturer headquartered in Montreal, Quebec, Canada. The pulp and paper sector, a key component of the forest products industry, is a major contributor to Canada’s economy. In addition to market pulp, the sector produces newsprint, specialty papers, paperboard, building board, and other paper products. It is the largest industrial energy consumer, representing 23% of industrial energy consumption in Canada. AbitibiBowater Inc. used data-mining techniques to detect a period of high performance and to reduce energy consumption in the papermaking process. The analysis showed that temporarily lower consumption was caused by a reduced set point for chip preheating and by cleaning of the heating tower on the reject refiners. AbitibiBowater Inc. was then able to reproduce the process conditions required to maintain steam recovery. This has saved AbitibiBowater 200 gigajoules (see Note 1) daily, the equivalent of $600,000 a year [Heads Up CIPEC (Canadian Industry Program for Energy Conservation) newsletter, Aug. 15, 2009, Vol. XIII, No. 15].
eHarmony
Rather than matching prospective partners on the basis of their stated preferences, the eHarmony dating service uses statistical analysis to match them, based on a 29-parameter model derived from 5000 successful marriages. Competitors such as Perfectmatch use different models, such as the Jungian Myers-Briggs personality typing technique, to parameterize the individuals entered into their databases. It is worth observing that while the process of matching partners may amount to little more than data retrieval using some complex set of rules, the process of determining what these rules need to be often involves complex knowledge discovery and mining techniques.
The maintenance of military platforms
Another area where data-mining techniques offer promising gains in efficiency is in the maintenance of military platforms. Good, analytically based maintenance programs, of which the Amberley Ageing Aircraft Program for the F-111 is a good example, systematically analyze component failure statistics to identify components with wear-out or other failure-rate problems. These components can then be removed from the fleet and replaced with new or reengineered, and thus more reliable, components. This type of analysis is a simple rule-based approach, where the rule is simply the frequency of faults in specific components.
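A frequency-of-faults rule of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the fault log, the operating hours, and the threshold are made up, and the function name is hypothetical rather than part of any real maintenance system.

```python
from collections import Counter


def flag_components(failure_log, fleet_hours, threshold_per_1000h):
    """Return {component: fault rate} for components above the threshold.

    failure_log: iterable of component names, one entry per recorded fault.
    fleet_hours: total operating hours over which the log was collected.
    threshold_per_1000h: assumed cut-off, in faults per 1000 operating hours.
    """
    counts = Counter(failure_log)
    flagged = {}
    for component, n_faults in counts.items():
        rate = 1000.0 * n_faults / fleet_hours
        if rate > threshold_per_1000h:
            flagged[component] = rate
    return flagged


# Made-up fault log, not taken from any real maintenance program.
log = ["fuel pump", "actuator", "fuel pump", "wiring loom", "fuel pump", "actuator"]
flagged = flag_components(log, fleet_hours=2000, threshold_per_1000h=1.0)
print(flagged)
```

Only the component whose fault rate exceeds the assumed threshold is flagged as a candidate for replacement or reengineering; in practice the threshold would be set from engineering and cost considerations.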
B.6 PITFALLS OF DATA MINING
Despite the above and many other success stories often presented by vendors and consultants to show the benefits that data mining provides, this technology has several pitfalls. When used improperly, data mining can generate lots of “garbage.” As one professor from MIT pointed out: “Given enough time, enough attempts, and enough imagination, almost any conclusion can be teased out of any set of data.” David J. Leinweber, managing director of First Quadrant Corp. in Pasadena, California, gives an example of the pitfalls of data mining. Working with a United Nations data set, he found that historically, butter production in Bangladesh is the single best predictor of the Standard & Poor’s 500-stock index. This example is similar to another absurd correlation that is heard yearly around Super Bowl time: a win by the NFC team implies a rise in stock prices. Peter Coy, Business Week’s associate economics editor, warns of four pitfalls in data mining:
1. It is tempting to develop a theory to fit an oddity found in the data.
2. One can find evidence to support any preconception if you let the computer churn long enough.
3. A finding makes more sense if there is a plausible theory for it. But a beguiling story can disguise weaknesses in the data.
4. The more factors or features in a data set the computer considers, the more likely the program will find a relationship, valid or not.
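The fourth pitfall can be demonstrated with a small simulation. In the sketch below, both the target and every candidate feature are pure random noise, so any correlation found is spurious by construction. The sample size, feature counts, and seed are arbitrary assumptions chosen for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_samples, feature_counts, seed=0):
    """Track the best |r| between a noise target and growing sets of noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    results = {}
    best = 0.0
    examined = 0
    for limit in sorted(feature_counts):
        # Keep examining fresh random features until `limit` have been tried.
        while examined < limit:
            feature = [rng.gauss(0, 1) for _ in range(n_samples)]
            best = max(best, abs(pearson(feature, target)))
            examined += 1
        results[limit] = best
    return results


results = strongest_chance_correlation(n_samples=20, feature_counts=[1, 10, 100, 1000])
for count, r in results.items():
    print(f"{count:>5} random features -> strongest |r| = {r:.2f}")
```

Because each larger feature set includes the smaller ones, the strongest correlation found can only stay the same or grow as more features are considered, even though none of them has any real relationship to the target.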
It is crucial to realize that data mining can involve a great deal of planning and preparation. Just having a large amount of data alone is no guarantee of the success of a data-mining project. In the words of one senior product manager from Oracle: “Be prepared to generate a lot of garbage until you hit something that is actionable and meaningful for your business.”
This appendix is certainly not an inclusive list of all data-mining activities, but it does provide examples of how data-mining technology is employed today. We expect that new generations of data-mining tools and methodologies will increase and extend the spectrum of application domains.
Note
1. A gigajoule (GJ) is a metric term used for measuring energy use. For example, 1 GJ is equivalent to the amount of energy available from either 277.8 kWh of electricity, 26.1 m³ of natural gas, or 25.8 L of heating oil.