Data Mining

Author: Mehmed Kantardzic

The key ethical issues in mining personal data are that people are generally:

1. not aware that their personal information is being gathered,

2. do not know to what uses the data will be put, and/or

3. have not consented to such data collection or use.

In order to alleviate concerns about data privacy, a number of techniques have recently been proposed to perform data-mining tasks in a privacy-preserving way. These techniques are drawn from a wide array of related topics such as cryptography and information hiding. Most privacy-preserving data-mining methods apply a transformation that reduces the effectiveness of the underlying data when data-mining methods or algorithms are applied to them. In fact, there is a natural trade-off between privacy and accuracy, although this trade-off is affected by the particular algorithm that is used for privacy preservation. The key directions in the field of privacy-preserving data mining include:

  • Privacy-Preserving Data Publishing:
    These techniques study transformations of the data that preserve privacy, and concentrate on how the perturbed data can be used in conjunction with classical data-mining methods.
  • Changing the Results of Data-Mining Applications to Preserve Privacy:
    These techniques concentrate on the privacy of data-mining results, where some results are modified in order to preserve privacy. A classic example of such techniques is association-rule hiding, in which some of the association rules are suppressed in order to preserve privacy.
  • Cryptographic Methods for Distributed Privacy:
    If the data are distributed across multiple sites, a variety of cryptographic protocols may be used in order to communicate among the different sites so that secure function computation is possible without revealing sensitive information.
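The perturbation idea behind privacy-preserving data publishing can be sketched with the Laplace mechanism from differential privacy. This is an illustrative sketch, not a method prescribed in this chapter; the function name `perturb` and the parameters `epsilon` and `sensitivity` are our own assumptions.

```python
import math
import random

def perturb(values, epsilon=1.0, sensitivity=1.0):
    """Add Laplace noise scaled to sensitivity/epsilon to each value.

    Smaller epsilon means more noise: more privacy for individual
    records, but less accuracy for any mining algorithm run on the
    published data -- the privacy/accuracy trade-off described above.
    """
    scale = sensitivity / epsilon
    noisy = []
    for v in values:
        # Inverse-CDF sampling of Laplace(0, scale) from one uniform draw.
        u = random.random() - 0.5
        noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        noisy.append(v + noise)
    return noisy

# With a large epsilon the published values stay close to the originals;
# with a small epsilon they are heavily perturbed.
random.seed(7)
published = perturb([120.0, 95.0, 143.0], epsilon=0.1)
```

The same noisy values can then be fed to a classical data-mining method, which is the setting the first bullet above describes.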

Recent research trends propose that issues of privacy protection, currently viewed in terms of data access, be reconceptualized in terms of data use. From a technology perspective, this requires supplementing legal and technical mechanisms for access control with new mechanisms for transparency and accountability of data used in a data-mining process. Current technical responses to the impact of data mining on privacy have generally focused on limiting access to data at the point of collection or storage. Most effort has been put into the application of cryptographic and statistical techniques to construct finely tuned access-limiting mechanisms. Even if privacy-preserving data-mining techniques prove to be practical, they are unlikely to provide sufficient public assurance that data-mining inferences conform to legal restrictions. While privacy-preserving data-mining techniques are certainly necessary in some contexts, they are not a sufficient privacy protection without transparency and accountability.

In the long run, access restriction alone is not enough to protect privacy or to ensure reliable conclusions, and the best example of these challenges is the Web and Web-mining technology. As we leave the well-bounded world of enterprise databases and enter the open, unbounded world of the Web, data users need a new class of tools to verify that the results they see are based on data that are from trustworthy sources and are used according to agreed-upon institutional and legal requirements. The implications of data mining on digital social networks such as Facebook, Myspace, or Twitter may be enormous. Unless it is part of a public record designed for consumption by everyone or describes an activity observed by strangers, the stored information is rarely known outside our families, much less outside our social networks. An expectation that such information and potential derivatives will remain “private” on the Internet is no longer a reasonable assumption from the social-network perspective. One of the major contributors to these controversies is the absence of clear legal standards. Thirty years ago the lack of relevant law was understandable: The technologies were new; their capacity was largely unknown; and the types of legal issues they might raise were novel. Today, it is inexplicable and threatens to undermine both privacy and security. Hence, we must develop technical, legal, and policy foundations for transparency and accountability of large-scale mining across distributed heterogeneous data sources. Policy awareness is a property of the Semantic Web, still in development, that should provide users with accessible and understandable views of the policies associated with resources.

The following issues related to privacy concerns may assist in individual privacy protection during a data-mining process, and should be a part of the best data-mining practices:

  • Is there a clear description of the program’s collection of personal information, including how the collected information will serve the program’s purpose?
    In other words, be transparent early on about a data-mining project’s purpose. Clearly state up-front the business benefits that will be achieved by data mining. Provide notice of the combining of information from different sources. Companies like Walmart or Kroger store much of their business and customer data in large warehouses. Their customers are not told of the extent of the information that is accumulated on them, how long it will be kept, the uses to which the data will be put, or other users with which data will be shared.
  • Will information collected for one purpose be used for additional, secondary purposes in the future?
    Ensure that any new purpose of a project is consistent with the project’s original purpose. Maintain oversight of a data-mining project and create audit requirements.
  • Are privacy protections built into systems at an early developmental stage?
    Build in privacy considerations up-front, and bring in all stakeholders at the beginning, including privacy advocates to get input from them. Ensure the accuracy of data entry.
  • What type of action will be taken on the basis of information discovered through a data-mining process?
    Where appropriate, anonymize personal information. Limit the actions that may be taken as a result of unverified findings from data mining.
  • Is there an adequate feedback system that lets individuals review and correct the personal information collected and maintained about them, in order to avoid “false positives” in a data-mining program?
    Determine whether an individual should have a choice in the collection of information. Provide notice to individuals about use of their personal information. Create a system where individuals can ensure that any incorrect personal information can be corrected.
  • Are there proper disposal procedures for collected and derived personal information that has been determined to be irrelevant?
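The advice above to anonymize personal information where appropriate can be illustrated with a minimal pseudonymization sketch. The salted-hash approach, the `pseudonymize` helper, and the example field names are our own illustrative assumptions, not a technique specified by this chapter.

```python
import hashlib

def pseudonymize(record, identifier_fields, salt):
    """Replace direct identifiers with truncated salted SHA-256 digests.

    The salt must be kept secret and stored separately; without it,
    common identifiers (names, Social Security numbers) could be
    recovered by a dictionary attack. Non-identifier fields are left
    untouched so they remain usable for analysis.
    """
    out = dict(record)
    for field in identifier_fields:
        if field in out:
            digest = hashlib.sha256((salt + str(out[field])).encode("utf-8"))
            out[field] = digest.hexdigest()[:16]
    return out

record = {"name": "Jane Doe", "zip": "55401", "basket_total": 42.50}
masked = pseudonymize(record, ["name"], salt="keep-this-secret")
```

Because the hash is deterministic for a fixed salt, the same person maps to the same pseudonym across records, so mining on the pseudonymized data can still link a customer’s transactions without exposing the identity itself.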

Some observers suggest that the privacy issues presented by data mining will be resolved by technologies, not by law or policy. But even the best technological solutions will still require a legal framework in which to operate, and the absence of that framework may not only slow down their development and deployment, but make them entirely unworkable. Although there is no explicit right to privacy of personal data in the Constitution, legislation and court decisions on privacy are usually based on parts of the First, Fourth, Fifth, and Fourteenth Amendments. Except for health-care and financial organizations, and data collected from children, there is no law that governs the collection and use of personal data by commercial enterprises. Therefore, it is essentially up to each organization to decide how it will use the personal data it has accumulated on its customers. In early March 2005, hackers stole the personal information of 32,000 people from the databases of LexisNexis. The stolen data included Social Security numbers and financial information. Although the chief executive officer (CEO) of LexisNexis claimed that the information they collect is governed by the U.S. Fair Credit Reporting Act, members of Congress disagreed. As a result of this and other large-scale identity thefts in recent years, Congress is considering new laws specifying what personal data a company can collect and share. For example, Congress is considering a law to prohibit almost all sales of Social Security numbers.

At the same time, especially since 9/11, government agencies have been eager to experiment with the data-mining process as a way of nabbing criminals and terrorists. Although details of their operation often remain unknown, a number of such programs have come to light since 2001. The Department of Justice (DOJ), through the Federal Bureau of Investigation (FBI), has been collecting telephone logs, banking records, and other personal information regarding thousands of Americans not only in connection with counterterrorism efforts, but also in furtherance of ordinary law enforcement. A 2004 report by the Government Accountability Office (GAO) found 42 federal departments—including every cabinet-level agency that responded to the GAO’s survey—engaged in, or planning to engage in, 122 data-mining efforts involving personal information (U.S. General Accounting Office, Data Mining: Federal Efforts Cover a Wide Range of Uses [GAO-04-548], May 2004, pp. 27–64). Recently, the U.S. Government recognized that sensible regulation of data mining depends on understanding its many variants and its potential harms, and many of these data-mining programs have been reevaluated. In the United Kingdom, the problem is being addressed more comprehensively by the Foundation for Information Policy Research, an independent organization examining the interaction between information technology and society, with goals to identify technical developments with significant social impact, commission research into public policy alternatives, and promote public understanding and dialog between technologists and policy makers in the United Kingdom and Europe. It combines information technology researchers with people interested in social impacts, and uses a strong media presence to disseminate its arguments and educate the public.

There is one additional legal challenge related specifically to data mining. Today’s privacy laws and guidelines, where they exist, protect data that are explicit, confidential, and exchanged between databases. However, there is no legal or normative protection for data that are implicit, nonconfidential, and not exchanged. Data mining can reveal sensitive information that is derived from nonsensitive data and meta-data through the inference process. Information gathered in data mining usually consists of implicit patterns, models, or outliers in the data, and the application of privacy regulations written primarily for traditional, explicit data to such derived results is questionable.

In addition to data privacy issues, data mining raises other social concerns. For example, some researchers argue that data mining and the use of consumer profiles in some companies can actually exclude groups of customers from full participation in the marketplace and limit their access to information.

Good privacy protection not only can help build support for data mining and other tools to enhance security, it can also contribute to making those tools more effective. As technology designers, we should provide an information infrastructure that helps society to be more certain that data-mining power is used only in legally approved ways, and that the data that may give rise to consequences for individuals are based on inferences that are derived from accurate, approved, and legally available data. Future data-mining solutions reconciling any social issues must not only be applicable to the ever-changing technological environment, but also flexible with regard to specific contexts and disputes.

12.7 REVIEW QUESTIONS AND PROBLEMS

1. What are the benefits of modeling social networks with a graph structure? What kind of graphs would you use in this case?

2. For the given undirected graph G:

(a) compute the degree and variability parameters of the graph;

(b) find the adjacency matrix for the graph G;

(c) determine the binary code(G) for the graph;

(d) find the closeness parameter for each node of the graph; and

(e) what is the betweenness measure for node number 2?

3. For the graph given in Problem number 2, find the partial betweenness centrality using the modified graph starting with node number 5.

4. Give real-world examples for traditional analyses of temporal data (i.e., trends, cycles, seasonal patterns, outliers).

5. Given the temporal sequence S = {1 2 3 2 4 6 7 5 3 1 0 2}:

(a) find the PAA for four sections of the sequence;

(b) determine SAX values for the solution in (a) if (1) α = 3, (2) α = 4;

(c) find the PAA for three sections of the sequence; and

(d) determine SAX values for the solution in (c) if (1) α = 3, (2) α = 4.

6. Given the sequence S = {A B C B A A B A B C B A B A B B C B A C C}:

(a) Find the longest subsequence with frequency ≥ 3.

(b) Construct a finite-state automaton (FSA) for the subsequence found in (a).

7. Find the normalized contiguity matrix for the following table of U.S. cities:

Minneapolis | Chicago    | New York
Nashville   | Louisville | Charlotte

Assume that only neighboring cities (vertical and horizontal) in the table are close.

8. For the BN in Figure 12.38, determine:

(a) P(C, R, W)

(b) P(C, S, W)
