Strategy for Information Markets/Collaborative Filtering

Although collaborative filtering has been used to sift through large data sets since the early 1990s, the practice has taken off with the massive data sets that result from the exponential growth of the internet. The once-experimental data-sifting method has now become a ubiquitous aspect of doing business online.

First, though - what is collaborative filtering, exactly? In their 1995 paper "Social Information Filtering: Algorithms for Automating 'Word of Mouth'," Upendra Shardanand and Pattie Maes offer this definition (using the term "social information filtering" in place of collaborative filtering):

"Social Information filtering exploits similarities between the tastes of different users to recommend (or advise against) items. It relies on the fact that people's tastes are not randomly distributed: there are general trends and patterns within the taste of a person and as well as between groups of people. Social Information filtering automates a process of 'word-of-mouth' recommendations. A significant difference is that instead of having to ask a couple friends about a few items, a social information filtering system can consider thousands of other people, and consider thousands of different items, all happening autonomously and automatically."

There are two main types of collaborative filtering: active and passive.

Active collaborative filtering
In effect, active collaborative filtering is word-of-mouth in the age of the information marketplace. Instead of neighbors chatting at the mailbox, everyone with an internet connection and an opinion is free to offer their views - good or bad - on just about any good or service offered. This kind of filtering is considered "active" because it requires active participation from those offering their opinions - as we shall see later, this is not always the case. A classic example of this method is the "Customer Reviews" feature offered by Amazon. Reviewers may post reviews, either with user accounts or anonymously, rating products on a scale of one to five stars. The results of these reviews are aggregated at the top of the section, with an average score across all contributors and a breakdown of how many 5-star, 4-star, 3-star, etc. ratings were given.
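
The aggregation step described above (an average score plus a per-star breakdown) can be sketched in a few lines of Python. This is only an illustration: the function name summarize_reviews and the sample ratings are made up for this example and do not describe how Amazon actually computes its summaries.

```python
from collections import Counter

def summarize_reviews(stars):
    """Return (average, breakdown) for a list of 1-5 star ratings.

    The breakdown maps each star level (5 down to 1) to how many
    reviewers chose it. Hypothetical helper, for illustration only.
    """
    counts = Counter(stars)
    breakdown = {level: counts.get(level, 0) for level in range(5, 0, -1)}
    average = sum(stars) / len(stars) if stars else None
    return average, breakdown

# Made-up ratings, not real review data.
average, breakdown = summarize_reviews([5, 4, 5, 3, 1, 5])
print(f"Average: {average:.1f} out of 5")
for level, count in breakdown.items():
    print(f"{level}-star: {count}")
```

A plain average like this is also what makes the biases discussed next matter: with only a handful of ratings, a single extreme score moves the summary a great deal.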

Active collaborative filtering is not perfect, though. There is often a "first-mover bias," in which the first person to rate a product skews the rest of the raters towards that person's score, be it positive or negative. There is also what might be called a "lone-mover bias," in which a relatively unpopular product (i.e. one with a handful of ratings at most) attracts only extreme ratings, either from those with a bone to pick or in the form of an "astroturf" rating by someone associated with the product. Such bottom-up ratings can also be driven by factors other than product quality or worth.

Passive collaborative filtering
In contrast to active filtering, passive collaborative filtering requires no participation on the part of the consumer. Instead, information is collected on all users as they navigate a given site or resource. This information is generally quantitative: page visits, article choices, frequency of visits, and other such metrics. If active collaborative filtering represents the collective wisdom of the crowd, passive collaborative filtering acts as an un-forgetting adviser, offering bits of advice based on past behavior.

A common example of this can also be seen on the Amazon website. When entering the site, users see a graphic of "recently viewed products," reflecting the user's browsing history on the site. Below this, the site features a similar widget displaying other products that the user might be interested in, based on this history. The site even offers rudimentary filtering for non-users - on every product page, Amazon also includes a "Customers Who Bought This Item Also Bought..." section, listing items in the same vein as the originally sought-after product. By using this passive filtering tool, Amazon hopes to encourage shoppers to purchase products that they otherwise would not, in effect by "speaking the language" of their interests.
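
One simple way to produce an "also bought" list is to count how often pairs of items appear in the same order. The sketch below assumes purchase histories are available as lists of item names; the function names and the sample baskets are hypothetical, and this is not a description of Amazon's actual system.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_also_bought(baskets):
    """Count, for every item, how often each other item shared a basket with it."""
    also_bought = defaultdict(Counter)
    for basket in baskets:
        # set() ignores duplicate copies of an item within one basket.
        for item, other in permutations(set(basket), 2):
            also_bought[item][other] += 1
    return also_bought

def recommend(also_bought, item, k=3):
    """Return up to k items most often bought together with `item`."""
    return [other for other, _ in also_bought.get(item, Counter()).most_common(k)]

# Made-up purchase histories, for illustration only.
baskets = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["stove", "lantern"],
]
also_bought = build_also_bought(baskets)
print(recommend(also_bought, "tent", k=1))
```

No shopper is asked for an opinion here; the recommendations fall out of recorded behavior alone, which is what makes the approach passive.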

Explicit and Implicit Filtering
(SOURCE: http://en.wikipedia.org/wiki/Collaborative_filtering#Explicit_versus_implicit_filtering)

Although the terms are less common, potential entrants into the information marketplace would be wise to also consider the difference between explicit and implicit forms of collaborative filtering. When explicit filtering systems are used, users directly input a rating for each product in question - stars for a movie, say. These ratings are sorted, aggregated, and churned out to show system-wide preferences. Implicit filtering uses a similar rating system, but rather than directly asking consumers to share their preferences, it infers preferences on the basis of observed behavior.

Both systems are used in Apple's iTunes system. Users can explicitly rate songs in their music library on a "star" system from one (worst) to five (best). Yet they can also order songs on the basis of play-count, a stat which is recorded implicitly. This example helps to highlight the different functions of the different types: although explicit filtering may indicate genuine preferences, implicit filtering represents more of an "adapted preference." One may truly enjoy a 15-minute song, but that song may not be played as much as a similarly enjoyed song of a more traditional length. Businesses should cater to both interests, taking into account what customers want and what they need.

Improving Collaborative Filtering
When filtering information, there are two perspectives to consider: user-centric and item-centric. User-centric filtering looks for users who share the same rating patterns as the query user and uses the ratings from those like-minded users to calculate a prediction for the query user. Item-centric filtering focuses on the items instead: it builds an item-to-item matrix to determine the relationships between pairs of items, then uses that matrix together with the data on the current user to infer his or her taste.
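
The two perspectives can be compared side by side in a short Python sketch. Everything here is an assumption made for illustration: the ratings are invented, cosine similarity is just one common choice of similarity measure, and the function names are hypothetical. Production systems typically add refinements such as mean-centering the ratings.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}. Missing items are unrated.
ratings = {
    "ann":  {"A": 5, "B": 3, "C": 4},
    "bob":  {"A": 4, "B": 2, "C": 5, "D": 4},
    "cara": {"A": 1, "B": 5, "D": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the keys two rating dicts share."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = sqrt(sum(u[k] ** 2 for k in shared))
    norm_v = sqrt(sum(v[k] ** 2 for k in shared))
    return dot / (norm_u * norm_v)

def predict_user_centric(ratings, user, item):
    """Average other users' ratings for `item`, weighted by user similarity."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = cosine(ratings[user], theirs)
        num += sim * theirs[item]
        den += abs(sim)
    return num / den if den else None

def item_similarity_matrix(ratings):
    """Build the item-to-item matrix from each item's column of user ratings."""
    columns = {}
    for user, theirs in ratings.items():
        for item, rating in theirs.items():
            columns.setdefault(item, {})[user] = rating
    items = sorted(columns)
    return {a: {b: cosine(columns[a], columns[b]) for b in items if b != a}
            for a in items}

def predict_item_centric(ratings, user, item):
    """Average the user's own ratings, weighted by item similarity to `item`."""
    sims = item_similarity_matrix(ratings).get(item, {})
    num = den = 0.0
    for rated, rating in ratings[user].items():
        sim = sims.get(rated, 0.0)
        num += sim * rating
        den += abs(sim)
    return num / den if den else None

user_based = predict_user_centric(ratings, "ann", "D")
item_based = predict_item_centric(ratings, "ann", "D")
print(f"user-centric: {user_based:.2f}, item-centric: {item_based:.2f}")
```

Both functions answer the same question (what would ann give item D?) but draw on different evidence: the first on what similar users thought of D, the second on what ann thought of items similar to D.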

The algorithm used to estimate each individual's preferences differs from company to company, depending on the information on hand, the size of the data set, and the type of filtering used. Collaborative filtering faces many challenges. The algorithms must cope with highly sparse data, scale with increasing numbers of users and items, make satisfactory recommendations in a short time, and deal with other problems such as synonymy, shilling attacks, gray sheep, and privacy protection. Synonymy refers to the tendency of the same or very similar items to have different names or entries. Gray sheep are users whose opinions do not consistently agree or disagree with any group of people and who thus do not benefit from collaborative filtering. Shilling attacks occur when people give large numbers of positive recommendations to their own materials and negative recommendations to their competitors'.

A company that places a great deal of faith in collaborative filtering is Netflix. The growing company depended on this process to better serve its customers. To improve the performance of its movie recommendations, Netflix launched a million-dollar prize challenge: the individual or group that could improve its collaborative filter the most would win.

Here are some links to tables that represent the user-item matrix. These are only very small examples of what Netflix or Amazon do for their collaborative filtering.

Image 1 Image 2 Image 3

In the images, each row represents a user and each column holds the users' ratings for a particular movie. The goal is to predict the rating a customer would assign to a movie once he or she has watched and evaluated it. The prediction is based on user profiles that indicate similar interests, as inferred from which movies were watched and how they were rated. In the example above, Jack and Tom share movie preferences similar to our own, so we predict our own rating for movie I4 from theirs: Jack and Tom voted 4 and 5 for this movie, respectively, and their average gives a prediction of 4.5.
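
The arithmetic behind the I4 prediction is an unweighted average of the like-minded users' ratings. A minimal sketch, using only the two ratings stated above (the function name is hypothetical, and choosing which users count as like-minded is taken as already done):

```python
def predict_from_neighbors(neighbor_ratings):
    """Unweighted mean of the ratings given by like-minded users."""
    if not neighbor_ratings:
        return None
    return sum(neighbor_ratings.values()) / len(neighbor_ratings)

# Jack's and Tom's ratings for movie I4, as given in the text.
i4_ratings = {"Jack": 4, "Tom": 5}
prediction = predict_from_neighbors(i4_ratings)
print(prediction)  # 4.5
```

With more than two neighbors, a system could weight each rating by how closely that user's profile matches ours rather than treating everyone equally.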

Privacy issues
Collaborative filtering has entered mainstream discussion as the proportion of internet users in the total population has grown. Its primary use is often referred to as targeted advertising. Starting from the assumption that someone who agreed to something in the past is likely to do so again in the future, collaborative filtering collects information about purchases, interests, and other personal details in order to tailor or direct relevant advertisements to the user more accurately. The practice has sparked waves of discussion and debate, largely because of the sheer volume of transaction information available through online resources. When someone visits a site or makes a purchase online, various pieces of information (geographic region, website subject, etc.) may be collected by a third party and amassed into data sets, which can then be analyzed to determine the specific preferences of users. Using these statistics, subsequent visitors to those sites are shown, or redirected to, advertisements considered more desirable given the aggregate preferences of previous visitors. While seemingly innocuous, this practice has drawn the ire of internet users who argue that actions on the internet, when performed in private settings (like the home), should be as protected from scrutiny under privacy laws as other private activities, such as how one dresses at home.