Data Science: An Introduction/The Impact of Data Science

 Data Science: An Introduction  Chapter 04: The Impact of Data Science



Chapter Summary
In this chapter we explore how data science has changed three different areas of our world: baseball, health, and robotics.

Moneyball
(This section is an edited version of the Moneyball Wikipedia page, from 3 October 2012.)

According to Wikipedia, Moneyball is a book by Michael Lewis, published in 2003, about the Oakland Athletics baseball team and its general manager Billy Beane. Its focus is the team's analytical, evidence-based, sabermetric approach to assembling a competitive baseball team, despite Oakland's disadvantaged revenue situation. A film based on the book, starring Brad Pitt, was released in 2011.



The central premise of Moneyball is that the collected wisdom of baseball insiders (including players, managers, coaches, scouts, and the front office) over the past century is subjective and often flawed. Statistics such as stolen bases, runs batted in, and batting average, typically used to gauge players, are relics of a 19th-century view of the game and of the statistics that were available at the time. The book argues that the Oakland A's front office took advantage of more analytical gauges of player performance to field a team that could compete successfully against richer competitors in Major League Baseball (MLB).

Rigorous statistical analysis had demonstrated that on-base percentage and slugging percentage are better indicators of offensive success, and the A's became convinced that these qualities were cheaper to obtain on the open market than more historically valued qualities such as speed and contact. These observations often flew in the face of conventional baseball wisdom and the beliefs of many baseball scouts and executives.
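To make the two statistics concrete, here is a minimal Python sketch that uses the standard definitions of on-base percentage (OBP) and slugging percentage (SLG). The two stat lines are invented for illustration and do not describe real players; the class and function names are likewise only illustrative.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Season totals for one hitter. All numbers used below are invented."""
    at_bats: int
    singles: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    hit_by_pitch: int
    sacrifice_flies: int

    @property
    def hits(self) -> int:
        return self.singles + self.doubles + self.triples + self.home_runs


def batting_average(p: BattingLine) -> float:
    # AVG = hits / at-bats; walks are ignored entirely.
    return p.hits / p.at_bats


def on_base_percentage(p: BattingLine) -> float:
    # OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
    times_on_base = p.hits + p.walks + p.hit_by_pitch
    opportunities = p.at_bats + p.walks + p.hit_by_pitch + p.sacrifice_flies
    return times_on_base / opportunities


def slugging_percentage(p: BattingLine) -> float:
    # SLG = total bases / at-bats
    total_bases = p.singles + 2 * p.doubles + 3 * p.triples + 4 * p.home_runs
    return total_bases / p.at_bats


# Two hypothetical hitters with identical hits but very different walk totals.
patient = BattingLine(500, 90, 25, 2, 13, walks=90, hit_by_pitch=5, sacrifice_flies=5)
free_swinger = BattingLine(500, 90, 25, 2, 13, walks=20, hit_by_pitch=0, sacrifice_flies=5)

for name, p in [("patient", patient), ("free swinger", free_swinger)]:
    print(f"{name:>12}: AVG {batting_average(p):.3f}  "
          f"OBP {on_base_percentage(p):.3f}  SLG {slugging_percentage(p):.3f}")
```

Both hypothetical hitters bat .260 and slug .396, yet the patient hitter reaches base at a .375 rate against .286 for the free swinger. A team that judged hitters by batting average alone would see no difference between them, which is the kind of gap in valuation the A's looked for.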


 * By re-evaluating the strategies that produce wins on the field, the 2002 Oakland Athletics, with approximately $41 million in salary, were competitive with larger market teams such as the New York Yankees, who spent over $125 million in payroll that same season. Because of the team's smaller revenues, Oakland is forced to find players undervalued by the market, and their system for finding value in undervalued players has proven itself thus far.

Several themes Lewis explored in the book include: insiders vs. outsiders (established traditionalists vs. upstart proponents of sabermetrics), the democratization of information causing a flattening of hierarchies, and "the ruthless drive for efficiency that capitalism demands." The book also touches on Oakland's underlying economic need to stay ahead of the curve; as other teams begin mirroring Beane's strategies to evaluate offensive talent, diminishing the Athletics' advantage, Oakland begins looking for other undervalued baseball skills such as defensive capabilities.

Moneyball also touches on the A's methods of prospect selection. Sabermetricians argue that a college baseball player's chance of MLB success is much higher than that of a traditional high school draft pick. Beane maintains that high draft picks spent on high school prospects, regardless of talent or physical potential as evaluated by traditional scouting, are riskier than picks spent on more polished college players. Lewis cites A's minor leaguer Jeremy Bonderman, drafted out of high school in 2001 over Beane's objections, as one example of precisely the type of draft pick Beane would avoid. Bonderman had all of the traditional "tools" that scouts look for, but thousands of such players have been signed by MLB organizations out of high school over the years and failed to develop. Lewis explores the A's approach to the 2002 MLB Draft, when the team had a nearly unprecedented run of early picks. The book documents Beane's often-tense discussions with his scouting staff (who favored traditional subjective evaluation of potential rather than objective sabermetrics), from the preparation for the draft through the draft itself, which defied all expectations and was considered at the time a wildly successful (if unorthodox) effort by Beane.


 * In addition, Moneyball traces the history of the sabermetric movement back to such people as Bill James (now a member of the Boston Red Sox front office) and Craig R. Wright. Lewis explores how James' seminal Baseball Abstract, an annual publication that was published from the late 1970s through the late 1980s, influenced many of the young, up-and-coming baseball minds that are now joining the ranks of baseball management.

Moneyball has made such an impact in professional baseball that the term itself has entered the lexicon of baseball. Teams which appear to value the concepts of sabermetrics are often said to be playing "Moneyball." Baseball traditionalists, in particular some scouts and media members, decry the sabermetric revolution and have disparaged Moneyball for emphasizing concepts of sabermetrics over more traditional methods of player evaluation. Nevertheless, the impact of Moneyball upon major league front offices is undeniable. Since the book's publication and success, Lewis has discussed plans for a sequel to Moneyball called Underdogs, revisiting the players and their relative success several years into their careers. When the New York Mets hired Sandy Alderson – Beane's predecessor and mentor with the A's – as their general manager after the 2010 season, and hired Beane's former associates Paul DePodesta and J.P. Ricciardi to the front office, the team became known as the "Moneyball Mets". Michael Lewis has acknowledged that the book's success may have negatively affected the Athletics' fortunes as other teams have accepted the use of sabermetrics, reducing the edge that Oakland received from using sabermetric-based evaluations.
 * In its wake, teams such as the New York Mets, New York Yankees, San Diego Padres, St. Louis Cardinals, Boston Red Sox, Washington Nationals, Arizona Diamondbacks, Cleveland Indians, and the Toronto Blue Jays have hired full-time sabermetric data scientists.

23 and Me
(This section is adapted from the company's Wikipedia article, from 3 October 2012.)

According to Wikipedia, 23andMe is a privately held personal genomics and biotechnology company based in Mountain View, California that provides rapid genetic testing. The company is named for the 23 pairs of chromosomes in a normal human cell. Its personal genome test kit was named "Invention of the Year" by Time magazine in 2008. The company was founded by Linda Avey and Anne Wojcicki after both recognized the need for a way to organize and study genetic data, the possibility for individual consumers to use the information, and the need for expertise to interpret the results.



23andMe began offering DNA testing services in November 2007. The results are posted online and allow the subscriber to view an assessment of inherited traits, genealogy, and possible congenital risk factors. Customers provide a 2.5 mL saliva sample, which is analyzed on an Illumina DNA microarray for 960,000 specific single-nucleotide polymorphisms (SNPs). An eventual goal is to provide whole genome sequencing.
 * In June 2011, 23andMe announced it had accumulated a database of more than 100,000 individuals.

The organization also provides testing for certain research initiatives, supplying confidential customer datasets to research foundations and partnering with them, with the goal of establishing genetic associations with specific illnesses and disorders. Google co-founder Sergey Brin (whose mother suffers from Parkinson's disease and who is married to 23andMe co-founder Anne Wojcicki) underwrote the cost of the company's Parkinson's disease Genetics Initiative to provide free testing for people suffering from the condition. An analysis of the results of research on Parkinson's disease comparing 23andMe with a National Institutes of Health (NIH) initiative suggested that the company's use of large amounts of computational power and data sets might offer comparable results, though in much less time.

The company gathers personal and social data from its subscribers via on-line surveys. The personal data includes a person's health history, their environmental history, and such things as the ability to smell certain odors. The social data includes family histories and the sorts of activities one participates in. The company employs a number of data scientists to work on this wealth of data—a million genetic variables and many hundreds of personal and social variables for over 100,000 people.
 * The company's data scientists are able to correlate and cluster certain personal and social behaviors with genetic markers. When these correlations are shown to be statistically significant, the data scientists report the results back to the subscribers, indicating that certain personal or social aspects of their lives may, indeed, have a genetic basis. They also publish results in scientific journals.
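
The kind of significance check behind such a report can be sketched as follows. This is a minimal illustration and not 23andMe's actual pipeline: the counts are invented, the marker and trait are hypothetical, and a plain chi-square test on a 2x2 table stands in for whatever methods the company really uses. Because roughly a million markers are tested, the sketch assumes a Bonferroni correction, so an association counts as significant only if its p-value falls below 0.05 divided by the number of markers.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

                      has trait   lacks trait
    carries marker        a            b
    lacks marker          c            d

    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one person")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail of chi-square equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


N_MARKERS = 1_000_000            # about a million genetic variables
ALPHA = 0.05
THRESHOLD = ALPHA / N_MARKERS    # Bonferroni-corrected cutoff

# Invented counts for 100,000 people: 30,000 carriers and 70,000 non-carriers.
tables = {
    "strong effect": (6600, 23400, 14000, 56000),   # 22.0% vs 20.0% with trait
    "modest effect": (6260, 23740, 14000, 56000),   # about 20.9% vs 20.0%
}

results = {}
for label, counts in tables.items():
    stat, p = chi_square_2x2(*counts)
    results[label] = (stat, p)
    verdict = "report" if p < THRESHOLD else "do not report"
    print(f"{label}: chi2 = {stat:.2f}, p = {p:.2e} -> {verdict}")
```

The modest effect would pass an uncorrected 0.05 cutoff but fails the corrected one. With a million markers, thousands of associations at that level would be expected by chance alone, which is why a stricter threshold is needed before reporting a result to subscribers.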

(If the instructor has a 23 and Me account, she could log on and project the website to show the results of the 23 and Me data scientists to the students in the class.)

Google's Driverless Car
(This section is an edited version of the Wikipedia article on Google's Driverless Car, from 3 October 2012.)

According to Wikipedia, Google's Driverless Car is a project led by Google engineer Sebastian Thrun, director of the Stanford Artificial Intelligence Laboratory and co-inventor of Google Street View. Thrun's team at Stanford created the robotic vehicle Stanley, which won the 2005 DARPA Grand Challenge and its $2 million prize from the United States Department of Defense.



The U.S. state of Nevada passed a law in June 2011 permitting the operation of driverless cars in Nevada. Google had been lobbying for driverless car laws. Google executives have not stated the precise reason they chose Nevada to be the maiden state for the driverless car. The Nevada law went into effect on March 1, 2012, and the Nevada Department of Motor Vehicles issued the first license for a self-driven car in May 2012. The license was issued to a Toyota Prius modified with Google's experimental driverless technology. In August 2012, the team announced that it had completed over 300,000 autonomous-driving miles accident-free, that it typically had about a dozen cars on the road at any given time, and that it was starting to test them with single drivers instead of pairs. As of September 2012, three U.S. states had passed laws permitting driverless cars: Nevada, Florida, and California.

The system combines information gathered from Google Street View with artificial intelligence software that integrates input from video cameras inside the car, a LIDAR sensor on top of the vehicle, radar sensors on the front of the vehicle, and a position sensor attached to one of the rear wheels that helps locate the car's position on the map. In 2009, Google obtained 3,500 miles of Street View images from driverless cars with minor human intervention. As of 2010, Google had tested several vehicles equipped with the system, driving 1,000 miles (1,609 km) without any human intervention, in addition to 140,000 miles (225,308 km) with occasional human intervention. Google expects that the increased accuracy of its automated driving system could help reduce the number of traffic-related injuries and deaths, while using energy and space on roadways more efficiently.

The project team has equipped a test fleet of at least eight vehicles, each accompanied in the driver's seat by one of a dozen drivers with unblemished driving records and in the passenger seat by one of Google's engineers. The car has traversed San Francisco's Lombard Street, famed for its steep hairpin turns, and has driven through city traffic. The vehicles have driven over the Golden Gate Bridge and on the Pacific Coast Highway, and have circled Lake Tahoe.

Google's driverless test cars have about $150,000 in equipment including a $70,000 lidar (laser radar) system. The system drives at the speed limit it has stored on its maps and maintains its distance from other vehicles using its system of sensors. The system provides an override that allows a human driver to take control of the car by stepping on the brake or turning the wheel, similar to cruise control systems already in cars.
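
The behavior described above (drive at the mapped speed limit, keep a distance from the vehicle ahead, and yield to the human on brake or steering input) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Snapshot` and `choose_speed` are hypothetical, and the two-second following gap is an assumed rule of thumb, not a documented parameter of Google's system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MPS_TO_MPH = 2.236936  # metres per second -> miles per hour


@dataclass
class Snapshot:
    """One moment of (hypothetical) inputs to the speed logic."""
    mapped_speed_limit_mph: float            # speed limit stored on the map
    gap_to_vehicle_ahead_m: Optional[float]  # None when the road ahead is clear
    brake_pressed: bool
    wheel_turned_by_driver: bool


def choose_speed(s: Snapshot, time_gap_s: float = 2.0) -> Tuple[str, Optional[float]]:
    """Return ("human", None) on override, else ("auto", target speed in mph)."""
    # Override: braking or steering hands control back to the human driver.
    if s.brake_pressed or s.wheel_turned_by_driver:
        return "human", None

    target = s.mapped_speed_limit_mph
    if s.gap_to_vehicle_ahead_m is not None:
        # Fastest speed at which the current gap is still time_gap_s seconds of travel.
        gap_limited_mph = s.gap_to_vehicle_ahead_m / time_gap_s * MPS_TO_MPH
        target = min(target, gap_limited_mph)
    return "auto", target


clear_road = choose_speed(Snapshot(65.0, None, False, False))
close_traffic = choose_speed(Snapshot(65.0, 20.0, False, False))
driver_brakes = choose_speed(Snapshot(65.0, 20.0, True, False))
print(clear_road, close_traffic, driver_brakes)
```

On a clear road the sketch targets the mapped limit; with a vehicle 20 metres ahead it slows to the speed at which that gap equals two seconds of travel; and any brake or steering input returns control to the human regardless of the other inputs.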

While Google had no immediate plans to commercially develop the system, the company hopes to develop a business which would market the system and the data behind it to automobile manufacturers. An attorney for the California Department of Motor Vehicles raised concerns that "The technology is ahead of the law in many areas," citing state laws that "all presume to have a human being operating the vehicle". According to The New York Times, policy makers and regulators have argued that new laws will be required if driverless vehicles are to become a reality because "the technology is now advancing so quickly that it is in danger of outstripping existing law, some of which dates back to the era of horse-drawn carriages".

In August 2011, a human-controlled Google driverless car was involved in the project's first crash near Google headquarters in Mountain View, CA. Google has stated that the car was being driven manually at the time of the accident. A second incident involved a Google driverless car being rear-ended while stopped at a stoplight.

CONSIDER THIS: In 2005 the DARPA Grand Challenge driverless car winner went 132 miles at an average of 19 miles per hour. Just 5 years later, the Google driverless car had gone hundreds of thousands of miles at the speed limit of 55 to 65 miles per hour. Did the discipline of artificial intelligence advance that much in 5 years? No. The difference was the data science. The Google data scientists made a 3-D wire mesh model of every street the driverless car was going to drive. In real time, the car's data science algorithms compared actual observations against the model (including the white stripes on the road) and made corrections accordingly.
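
The idea of comparing an observation against a stored model and correcting accordingly can be illustrated with a one-dimensional, Kalman-style update. This is a textbook technique chosen here for illustration, not a description of Google's algorithm; the function name and all numbers are assumptions.

```python
def correct_position(predicted: float, predicted_var: float,
                     observed: float, observed_var: float):
    """Blend a predicted position with a map-matched observation.

    predicted:     where the car thinks it is (for example, from wheel odometry)
    observed:      where matching a sensed road stripe to the map says it is
    *_var:         variance (uncertainty) of each value; smaller means more trusted
    Returns (corrected position, corrected variance).
    """
    gain = predicted_var / (predicted_var + observed_var)   # weight given to the observation
    corrected = predicted + gain * (observed - predicted)
    corrected_var = (1.0 - gain) * predicted_var
    return corrected, corrected_var


# Hypothetical numbers: lateral offset from the lane centre, in metres.
position, variance = correct_position(
    predicted=0.50, predicted_var=0.04,   # odometry has drifted and is uncertain
    observed=0.20, observed_var=0.01,     # stripe matched against the stored model
)
print(f"corrected offset: {position:.2f} m, variance: {variance:.3f}")
```

Because the map-matched observation is assumed to be four times more certain than the prediction, the corrected estimate moves most of the way toward it, and the remaining uncertainty is smaller than that of either input alone. The more detailed the stored model, the more trustworthy such observations become, which is the point the paragraph above makes about data.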

Assignment/Exercise
Get into groups of 4 or 5 students. Together, watch the movie Moneyball. While watching, take brief notes on how data science made a difference to the characters in the movie. After the movie is over, brainstorm as a group about other areas of life where data science could make a difference. Speculate on the arguments that opponents of data science might make against using it. Pick one area for further consideration. As a group, create a 4-slide presentation that introduces the area of life you picked, how data science would make a difference, what the counter-arguments are, and whether or not your group thinks, in the end, it would be a good idea to introduce data science into that area of life.

Copyright Notice


You are free:
 * to Share — to copy, distribute, display, and perform the work (pages from this wiki)
 * to Remix — to adapt or make derivative works

Under the following conditions:
 * Attribution — You must attribute this work to Wikibooks. You may not suggest that Wikibooks, in any way, endorses you or your use of this work.
 * Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
 * Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
 * Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
 * Other Rights — In no way are any of the following rights affected by the license:
 * Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
 * The author's moral rights;
 * Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.


 * Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the following web page.
 * http://creativecommons.org/licenses/by-nc-sa/3.0/