Data Mining of Personal Information

Data mining can be used to create data profiles of individuals from the anonymous data that is available on the Internet. While it is not perfect in developing a "factual and solid" representation of an individual, the vast amount of information available can provide a reasonably reliable approximation.


Before the advent of the Internet, the confidentiality of data stored in the information systems of various organizations (and individuals) was largely protected by inaccessibility, but networked computing has made the Web into the equivalent of a single data repository with a myriad of vulnerabilities.

Within IT, it is a "common secret" that personal data that should be protected and kept confidential is constantly leaking - whether through intentional misuse or carelessness and neglect - and once data has leaked, its secrecy cannot be recovered.

With the advent of Web 2.0, individuals themselves are contributing vast amounts of personal data to this repository: personal profiles on social networking systems, confidential files backed up to Web servers, etc. There is also the wealth of information an individual provides incrementally, without considering the aggregation of that data: purchases, Web sites visited, etc.

The last hurdle to assembling a dossier on an individual is the task of aggregating data about that individual from various sources and assembling it in a meaningful way. Data mining provides a partial solution for doing exactly this, and as it continues to progress, it will become an increasing threat to the privacy and confidentiality of personal information.


In a general sense, data mining is the practice of aggregating large amounts of data and searching it for patterns and trends that would not be evident by the consideration of a single record or datum. For that reason, the information produced is considered "new" knowledge.

In its simplest sense, it is a form of cross referencing: finding all the individuals who live in a certain state, have a certain income level, drive a certain type of car, practice a certain religion, etc. This produces a profile that can be used as a behavioral model (they can be predicted to buy product X, or vote for candidate Y). Naturally, this is based on inferential statistics, which is not perfectly accurate, but whose predictions hold to a certain "level of confidence."
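
The cross referencing described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's method: the records, field names, and thresholds are all invented, and the "prediction" is nothing more than the observed share of buyers within the matching segment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    state: str
    income: int
    car: str
    bought_x: bool  # hypothetical: whether this person bought "product X"


# Hypothetical records, invented purely for illustration.
PEOPLE = [
    Person("A", "TX", 95_000, "truck", True),
    Person("B", "TX", 40_000, "sedan", False),
    Person("C", "CA", 120_000, "truck", False),
    Person("D", "TX", 110_000, "truck", True),
    Person("E", "TX", 100_000, "truck", False),
]


def cross_reference(people, state, min_income, car):
    """Return the people who match every criterion at once."""
    return [
        p for p in people
        if p.state == state and p.income >= min_income and p.car == car
    ]


def purchase_rate(segment):
    """Share of the segment that bought product X (a crude estimate)."""
    if not segment:
        return 0.0
    return sum(p.bought_x for p in segment) / len(segment)


segment = cross_reference(PEOPLE, "TX", 90_000, "truck")
rate = purchase_rate(segment)
print([p.name for p in segment], round(rate, 2))
```

With only three people in the segment, the estimate carries very little confidence; the point of the passage is that the same calculation over a very large population becomes a usable behavioral model.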

Retrieving data from "free text" is far more difficult than retrieving it from structured data (databases with specific fields). The author devotes quite a bit of detail to demonstrating that it's not as difficult anymore: that documents of a specific kind have similar structures, which enables a model to be developed to extract information in a meaningful way.

Another gob of detail to illustrate a simple concept: while data on the internet appears as free text, it's also highly structured: Web pages, e-mail messages, conversation threads, all have very specific structural elements, and as users turn to Web publishing, the user-friendly tools are often database driven, so users provide data in ways that make it easy to parse and associate with other data.
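
To make the point concrete, here is a hedged sketch of how little effort it takes to pull structured fields out of an ordinary Web page, using Python's standard html.parser module. The sample page and the choice of fields (title, author, keywords) are assumptions made for illustration; real pages vary widely.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the page title and any <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            self.fields[attrs["name"]] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in more than one chunk, so accumulate it.
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


# Hypothetical page, invented for illustration.
PAGE = """<html><head><title>My weekend</title>
<meta name="author" content="J. Doe">
<meta name="keywords" content="fishing, camping">
</head><body><p>Free text goes here.</p></body></html>"""


def extract_fields(html):
    """Parse the page and return the structured fields found in it."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    fields = dict(parser.fields)
    if "title" in fields:
        fields["title"] = fields["title"].strip()
    return fields


print(extract_fields(PAGE))
```

The body of the page remains free text, but the author, title, and keywords come out as labeled fields that can be associated with other data.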

There have also been significant advances in the ability to retrieve data from audio, video, and images, along with growing popularity of "sharing" services (YouTube, Flickr) that enable individuals to publish such material.

The net result is that "information" is being digested into data, which can easily be processed.


The "Semantic Web" refers to the use of metadata that add layers of meaning to Web content, including

The net result of this is to show the associations between data that seem to have no inherent connection. Some of this information is visible to a person (a person's MySpace page links to an image on Flickr and a video on YouTube, and the implication is that they all relate to or were originated by the same individual), and it is increasingly possible for a software "intelligent agent" to make those same connections.

The emergence of XML and related standards also facilitates association: if a piece of information (article, comment, chat line) is associated with the name and e-mail address of its author, chances are that every site that does this creates an object called "author" with properties of "name" and "e-mail", facilitating coordination by a third party. Even if the ontology is not perfectly the same, there is sufficient similarity that the records can be coordinated by an aggregator.
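
A rough sketch of what such an aggregator might do follows. Everything here is assumed for illustration: the alias table, the field names, and the sample records are invented, and matching on a normalized e-mail address is only one of many possible ways to link records.

```python
# Hypothetical per-site field names mapped onto one shared vocabulary.
FIELD_ALIASES = {
    "name": "name",
    "author_name": "name",
    "username": "name",
    "e-mail": "email",
    "email": "email",
    "mail": "email",
}


def normalize(record):
    """Rename known fields to the shared vocabulary; keep the rest as-is."""
    out = {}
    for key, value in record.items():
        key = key.lower()
        out[FIELD_ALIASES.get(key, key)] = value
    if "email" in out:
        out["email"] = out["email"].strip().lower()
    return out


def aggregate(records):
    """Group normalized records by e-mail address, skipping those without one."""
    by_email = {}
    for record in records:
        rec = normalize(record)
        email = rec.get("email")
        if email is None:
            continue
        by_email.setdefault(email, []).append(rec)
    return by_email


# Hypothetical records from three sites with slightly different schemas.
RECORDS = [
    {"site": "blog", "author_name": "J. Doe", "e-mail": "JDoe@example.com"},
    {"site": "forum", "username": "jdoe", "mail": "jdoe@example.com"},
    {"site": "photos", "name": "Someone Else", "email": "other@example.com"},
]

for email, recs in aggregate(RECORDS).items():
    print(email, [r["site"] for r in recs])
```

The blog and forum records use different field names and different capitalization, yet end up under the same key, which is the "sufficient similarity" the passage describes.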


Aside from identifiable information that has been consciously and purposefully posted, there is a great quantity of information that is created and collected by the actions of an individual - in the log files that monitor system usage. In layman's terms, a record is made of everything a person has done online - every site they visit, every e-mail they send or receive.

These actions themselves can provide a profile that can indicate, with some confidence, a person's occupation, religion, gender, political beliefs, economic class, social behaviors, hobbies and interests, and other facts they may not knowingly wish to provide.

This profile can later be associated with a specific person, either through the login ID associated with the IP address that appears in the logs, or by capturing information entered into a Web form that reveals their identity.
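
The two steps described above (building an anonymous profile from log entries, then attaching an identity to it) can be sketched as follows. The log format, the site-to-interest mapping, the IP addresses, and the login table are all hypothetical, and a real profile would draw on far more signals than a count of visited sites.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from visited host to an interest category.
SITE_CATEGORY = {
    "fishing-gear.example": "fishing",
    "lake-reports.example": "fishing",
    "campaign-news.example": "politics",
}

# Hypothetical access-log entries: (client IP, host visited).
LOG = [
    ("203.0.113.7", "fishing-gear.example"),
    ("203.0.113.7", "lake-reports.example"),
    ("203.0.113.7", "campaign-news.example"),
    ("198.51.100.4", "campaign-news.example"),
]

# Hypothetical table linking an IP address to a login ID.
LOGINS = {"203.0.113.7": "user_1842"}


def build_profiles(log):
    """Count visits per interest category for each IP address."""
    profiles = defaultdict(Counter)
    for ip, host in log:
        category = SITE_CATEGORY.get(host)
        if category is not None:
            profiles[ip][category] += 1
    return profiles


def attach_identity(profiles, logins):
    """Re-key profiles by login ID where one is known, else keep the IP."""
    return {logins.get(ip, ip): counts for ip, counts in profiles.items()}


profiles = build_profiles(LOG)
identified = attach_identity(profiles, LOGINS)
for who, counts in identified.items():
    print(who, counts.most_common())
```

Until the second step runs, the profile is keyed only by IP address and is nominally anonymous; a single lookup table is enough to tie it to an account.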

It is conceded that this is a "quite fuzzy procedure", but given a large amount of data, it can develop a detailed and explicit profile that has a high level of confidence.


Various attempts are being made by legislative bodies to regulate the storage and processing of "Information relating to natural persons" in the name of privacy rights; however, such regulation is often based on specific and singular circumstances. Regulators attempt to govern the nature of data a site operator can collect from a visitor, restrict the operator's use of the data, and require the operator to keep the data confidential and protect it against breach. Regulation also deals with information that, in itself, can be considered sensitive in that it has an explicit relation to a person. Current legislation does not apply to the processing of anonymous data, or data that is provided by the user with no explicit expectation of privacy.

For example, a Web site that explicitly requests a person to enter their race is required to handle that piece of information with strictest confidence. A data miner who examines the sites a person has visited, the comments they have made in public forums, and the personal profile they have posted on a social networking site could easily come to possess the exact same datum (race) and would not be bound by the same legal covenants as to its use or safekeeping.

Neither does current legislation apply to actions taken based on anonymous profiles, such that it is possible for a bank (for example) to purchase information from a data miner and, based on the IP address of a site visitor, "flag" that user according to the miner's profile, perhaps refusing to provide access to an application form or offering products at a less favorable interest rate to a visitor known only by IP address, but suspected to be of a particular race.

The conclusion to which the author is heading is that "profile" data collected by miners and provided to site operators can have the same detrimental impact on privacy, liberty, and civil rights as the explicit personal information, and should be afforded the same protections under the law. (EN: though it seems that, in the example given, the act of discrimination is punishable under existing law, regardless of how the profile was obtained.)

He strays across a few other topics on his journey:


The author's suggestion in this article is that data mining should not be overlooked as a potential violation of privacy. Even though each piece of information was public knowledge, or at least publicly available, its aggregation and analysis yields highly sensitive information about individuals, which merits careful handling, and doubly so because of the potential for inaccuracy.

He stresses that he is not against data mining, and asserts it has a number of legitimate and beneficial uses; his position is merely that it should be regulated to the same extent as any other personal, private, and sensitive information.