Open Social Scholarship Annotated Bibliography/Open Data

Category Overview
Open data concerns the availability and accessibility of research data to the public. Research data may include government, university, institutional, corporate, and educational materials (Bradley et al. 2009; Davies 2010; Stadler, Lehmann, Hoffner, and Auer 2012). The authors of the publications annotated here explore why certain researchers do not make their data publicly available and what motivates institutional attitudes towards open data (Murray-Rust 2008; Piwowar and Vision 2013). Authors are concerned with educating faculty about the importance of preserving research data and metadata, as well as the political implications of free data distribution as opposed to corporate or institutional holdings (Molloy 2011). Many resources address government data and government policies on openness (Davies 2010; Geiger and Lucke 2012; Janssen 2012; Janssen et al. 2012; Kalampokis et al. 2011; Shadbolt et al. 2012). Research in this category examines the effectiveness of open data policies within institutions and various strategies for data management (Bauer and Kaltenböck 2012; Gorlitz and Staab 2011). Generally, these articles argue that open data is beneficial only if it is published in ways that the public can use.

Annotations
Anokwa, Yaw, Carl Hartung, Waylon Brunette, Gaetano Borriello, and Adam Lerer. 2009. “Open Source Data Collection in the Developing World.” Computer 42 (10): 97–99. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5280663&tag=1.
 * Anokwa, Hartung, Brunette, Borriello, and Lerer, all members of the Open Data Kit (ODK) development team, present evidence for how the ODK can be used for accessible data collection in the developing world. The authors suggest that current services are inflexible, closed source, and based on closed standards. When this article was published in 2009, the ODK had not yet been developed as a tool, and the authors provide arguments for funding agencies to consider their proposal. As a case study, the authors discuss AMPATH Kenya, the most comprehensive initiative in the country to combat HIV. AMPATH opted to use the ODK to improve their methods of data collection and retrieval. The ODK research team argues that the ability to collect data is key to the success of many organizations in the developing world.

Auer, Soren, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. “DBpedia: A Nucleus for a Web of Open Data.” In The Semantic Web, edited by Karl Aberer, Key-Sun Choi, Natasha Noy, Dean Allemang, Kyung-Il Lee, Lyndon Nixon, Jennifer Golbeck, et al., 722–35. Busan, Korea: Springer. http://link.springer.com/chapter/10.1007/978-3-540-76298-0_52.
 * Auer et al. outline the basic premises and mission of DBpedia: a community effort to extract structured information from Wikipedia and make the data available for access on the web. DBpedia provides support for sophisticated queries against Wikipedia datasets. Infobox templates, categorization information, images, geo-coordinates, links to external sites, and various language editions of Wikipedia form the nucleus of information that is extracted and queried. DBpedia operates using three mechanisms: linked data, the SPARQL protocol, and downloadable RDF dumps. Their datasets can be accessed royalty-free under the GNU Free Documentation License. The authors then provide instructions on how to efficiently search through DBpedia to access relevant materials, a list of third-party user interfaces, and a catalogue of related work. In the future, the research team wishes to further automate the data extraction process to increase the currency of DBpedia’s dataset and to synchronize it with changes in Wikipedia.
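
As an illustration of the SPARQL access mechanism mentioned in this annotation, the following minimal Python sketch builds a query URL for a single DBpedia resource. It is not taken from Auer et al.: the endpoint address, the `format` parameter (a convention of some endpoint software rather than of the SPARQL protocol itself), and the function names are assumptions made for this example, and the code only constructs the request without sending it.

```python
from urllib.parse import urlencode

# Assumed address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"


def build_query(resource: str, limit: int = 10) -> str:
    """Return a SPARQL query listing properties and values of one resource."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<http://dbpedia.org/resource/{resource}> ?property ?value . "
        f"}} LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET request, as the SPARQL protocol allows."""
    # The "format" parameter is an assumption about the endpoint software.
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Build (but do not send) a request about the resource named "Busan".
    print(build_request_url(build_query("Busan")))
```

The resulting URL could be opened in a browser or passed to an HTTP client, provided the endpoint is reachable and accepts these parameters.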

Bauer, Florian, and Martin Kaltenböck. 2012. Linked Open Data: The Essentials. Vienna: edition mono/monochrom. https://www.reeep.org/sites/default/files/LOD-the-Essentials_0.pdf.
 * Bauer and Kaltenböck write a guide for administrators describing how to wisely manage and use linked open data. The guide provides basic definitions that clarify the differences between open data and linked open data. The authors expound on the industrial potential of using the linked approach and provide advice and examples on how to start a linked open data catalogue. Bauer and Kaltenböck select the reegle.info country profiles, UK legislation, and Open EI definitions as representative of larger linked open data trends. The authors articulate a vision that depicts how these tools can be used to create the semantic web of the future. The guide provides links to web resources and uses visual graphs to simplify the process of linking and cataloguing data.

** Bradley, Jean-Claude, Robert J. Lancashire, Andrew SID Lang, and Anthony J. Williams. 2009. “The Spectral Game: Leveraging Open Data and Crowd-Sourcing for Education.” Journal of Cheminformatics 1 (9): 1–10. http://link.springer.com/article/10.1186/1758-2946-1-9.
 * Bradley et al. use The Spectral Game to frame their discussion of leveraging open data and crowdsourcing techniques in education. The Spectral Game is designed to make the teaching of spectroscopy entertaining. It was created by combining open source spectral data, a spectrum-viewing tool, and appropriate workflows, and it delivers these resources through the game medium. The authors evaluate the game in an undergraduate organic chemistry class and argue that The Spectral Game demonstrates the importance of open data for remixing educational curriculum.

** Brown, Susan, and John Simpson. 2015. “An Entity By Any Other Name: Linked Open Data as a Basis for a Decentered, Dynamic Scholarly Publishing Ecology.” Scholarly and Research Communication 6 (2): n.p. http://src-online.ca/index.php/src/article/view/212.
 * Brown and Simpson propose that linked open data enables more easily navigable scholarly environments that permit better integration of research materials and greater interlinkage between individuals and institutions. They frame linked open data integration as an ecological problem in a complex system of parts and relationships. The different parts of the ecology co-evolve and change according to the relationships in the system. The authors suggest that tools are needed for establishing automated conditions; for evaluating the provenance, authority, and trustworthiness of linked open data resources; and for facilitating corrections and enhancements. They add that an ontology negotiation tool would be a most valuable contribution to the Semantic Web. Such a tool would represent an opportunity for collaboration between different sectors of the knowledge economy and would allow the Semantic Web to develop as an evolving space of knowledge production and dissemination.

Davies, Tim. 2010. “Open Data, Democracy and Public Sector Reform. A Look at Open Government Data Use from Data.gov.uk.” Open Data Impacts. http://www.opendataimpacts.net/report/wp-content/uploads/2010/08/How-is-open-government-data-being-used-in-practice.pdf.
 * Davies explores the use of open government data (OGD) from the United Kingdom website data.gov.uk. Davies begins with a theoretical discussion of open government data by arguing that the digital turn has undermined the government’s monopoly on data processing and interpretation. By contrast, the open data movement aspires to promote transparency and accountability by empowering citizens. In this exploratory case study, Davies details who uses OGD, how OGD is being used, and the potential implications OGD has for the public sector. This empirical study uses a variety of research methods and draws on survey, interview, and participant-observation data. Overall, Davies found that OGD was used by an overwhelmingly male audience with occupations in the private sector, public sector, and at academic institutions. The use of open data generally fell into five categories: data to fact, data to information, data to interface, data to data, and data to service. This study highlights real-world, practical uses of OGD and lays the groundwork for future research to test the adequacy and applicability of Davies’ typologies.

Di Noia, Tommaso, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, and Markus Zanker. 2012. “Linked Open Data to Support Content-Based Recommender Systems.” In Proceedings of the 8th International Conference on Semantic Systems, 1–8. New York: ACM. http://dl.acm.org/citation.cfm?id=2362501.
 * Di Noia, Mirizzi, Ostuni, Romito, and Zanker analyze the open data approach for supporting content-based recommender systems. The research team performs an evaluation of MovieLens: the historical dataset for movie recommender systems. The researchers link this data to DBpedia datasets and perform one-to-one mapping. Evidence shows that 298 of 3,952 mappings in MovieLens have no correspondence with DBpedia. Their content-based recommender system leverages the knowledge encoded in the semantic datasets of linked open data with DBpedia, Freebase, and LinkedMDB to collect metadata (such as actors, directors, or genres) on movies.

Geiger, Christian P., and Jorn Von Lucke. 2012. “Open Government and (Linked) (Open) (Government) (Data).” eJournal of eDemocracy and Open Government 4 (2): 265–78.
 * Geiger and Lucke explore free usage of stored public sector data. The authors state that it is not enough to simply put data online; data must be considered and weighed to determine whether and where it can be published. Geiger and Lucke describe different types of machine-readable and open formats for data. The open data movement currently faces difficulty with different national and international laws about access and transparency. The authors argue that a fair balance between the interests of individual authors, publishers, and the general public must be reached. Misinterpretation by third parties, as well as the structure and culture of the public sector, are further difficulties faced by open data directives. Administrations and individual actors should cooperate with each other to achieve sustainability of open government data communities.

Gorlitz, Olaf, and Steffen Staab. 2011. “Federated Data Management and Query Optimization for Linked Open Data.” In New Directions in Web Data Management 1, edited by A. Vakali and L.C. Jain, 109–37. Berlin: Springer. https://doi.org/10.1007/978-3-642-17551-0_5#page-1.
 * Gorlitz and Staab provide tips for federated data management and query optimization for linked open data. For the authors, complex queries are the only means of leveraging the full potential of linked open data. The authors argue that a federation infrastructure is necessary for linked open data, and they provide the architecture for their own model. The basic components of this model are a declarative query language, a data catalog, a query optimizer, a data protocol, result ranking, and provenance information. Data source federation combines the advantages of both centralized repositories and explorative query processing for efficient query execution and returning complete results. This model allows for transparent querying of distributed linked open data sources. The authors suggest that the SPARQL standard does not support all the requirements for efficiently processing federated queries. To improve query processing, the authors recommend focusing on join order optimization (the optimization of basic graph patterns).
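
To make the idea of join order optimization concrete, the following Python sketch orders the triple patterns of a basic graph pattern with a simple greedy heuristic: start with the most selective pattern, then prefer patterns that share a variable with those already chosen. This is a generic textbook-style heuristic, not the optimizer described by Gorlitz and Staab; the `TriplePattern` type, the prefixed names, and the cardinality estimates are hypothetical.

```python
from typing import List, NamedTuple, Set


class TriplePattern(NamedTuple):
    subject: str
    predicate: str
    obj: str
    estimated_rows: int  # hypothetical cardinality estimate from a data catalog


def variables(pattern: TriplePattern) -> Set[str]:
    """Collect the variable terms (those starting with '?') of a pattern."""
    terms = (pattern.subject, pattern.predicate, pattern.obj)
    return {term for term in terms if term.startswith("?")}


def greedy_join_order(patterns: List[TriplePattern]) -> List[TriplePattern]:
    """Order patterns so that selective, connected patterns are joined first."""
    remaining = list(patterns)
    ordered: List[TriplePattern] = []
    bound: Set[str] = set()
    while remaining:
        # Prefer patterns sharing a variable with earlier ones (no cross product).
        connected = [p for p in remaining if variables(p) & bound]
        candidates = connected or remaining
        best = min(candidates, key=lambda p: p.estimated_rows)
        ordered.append(best)
        bound |= variables(best)
        remaining.remove(best)
    return ordered


if __name__ == "__main__":
    query = [
        TriplePattern("?film", "dbo:director", "?director", 100_000),
        TriplePattern("?director", "dbo:birthPlace", "dbr:Vienna", 500),
        TriplePattern("?film", "rdfs:label", "?title", 2_000_000),
    ]
    for pattern in greedy_join_order(query):
        print(pattern.predicate, pattern.estimated_rows)
```

In a federated setting the estimates would have to come from statistics about the remote sources, which is one reason a data catalog appears among the components listed above.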

Gray, Jonathan. 2015. “Five Ways Open Data Can Boost Democracy around the World.” The Guardian. February 20, 2015. https://www.theguardian.com/public-leaders-network/2015/feb/20/open-data-day-fairer-taxes.
 * Gray provides information on the amount of public resource spending on goods and services per year and how open data can improve political standards. He argues that open data policies can help protect public resources and expenditures, control corporate lobbyists, fight pollution, and hold politicians accountable. He provides evidence of dozens of parliamentary monitoring websites, which are often built by civic hackers to track speeches and votes and to hold politicians accountable to voters. Open data is commensurate with democratic values for Gray, and incorporating such policies will allow for the development of increasingly open and accountable democracies worldwide.

Gurstein, Michael B. 2011. “Open Data: Empowering the Empowered or Effective Data Use for Everyone?” First Monday 16 (2). https://gurstein.wordpress.com/2010/09/02/open-data-empowering-the-empowered-or-effective-data-use-for-everyone/.
 * Gurstein is supportive of the open data project but maintains that the impact on poor and marginalized communities must be investigated. Policy should ensure that there is a wide basis of opportunity for effective data use. He uses Solly Benjamin’s research on the impact of digitization of land records in Bangalore as evidence of the potential for land surveyors, lawyers, and other high-ranking officials to exploit gaps in titles, take advantage of mistakes in documentation, and identify opportunities and targets for crimes. Gurstein creates a seven-point framework for making effective use of open data. This framework should be combined with training on computer/software use, accessible formatting of datasets, interpretation training, and a supportive advocacy network for the community.

Hartung, Carl, Adam Lerer, Yaw Anokwa, Clint Tseng, Waylon Brunette, and Gaetano Borriello. 2010. “Open Data Kit: Tools to Build Information Services for Developing Regions.” In ICTD ’10: Proceedings of the 4th ACM/IEEE International Conference on Information and Communication Technologies and Development. New York: ACM. http://dl.acm.org/citation.cfm?id=2369236.
 * Hartung, Lerer, Anokwa, Tseng, Brunette, and Borriello present the development of the Open Data Kit (ODK), which contains four tools: Collect, Aggregate, Voice, and Build. Collect renders complex application logic and supports the manipulation of data types. Aggregate provides a “click to deploy” server that supports data upload, storage, and transfer in the cloud. Voice renders application logic using automated phone prompts that the user responds to with the keypad. Build is a drag and drop application designer that generates logic used by the tools. The ODK was created to empower individuals and organizations and allow them to build services for distributing data in developing countries. The authors provide outlines of tool designs and charts of system architecture, a list of the drivers and clients employed by their program, and a list of organizations that support open source applications such as ODK. The tool uses a modular, extensible, and open source design to allow users to choose tools best suited for their own specific deployments.

Hausenblas, M., and M. Karnstedt. 2010. “Understanding Linked Open Data as a Web-Scale Database.” In 2010 Second International Conference on Advances in Databases, Knowledge, and Data Applications, 56–61. Menuires: IEEE Computer Society. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5477146.
 * Hausenblas and Karnstedt compare linked open data to relational, web-scale databases. The authors provide general linked data principles and reference the linked open data community project to open their discussion and pose questions regarding the steps that should be taken to migrate parts of the relational database. Hausenblas and Karnstedt maintain that a database perspective is required for linked open data to ensure its acceptance and realize a low-barrier adoption process. The linked open data community needs to create specialized web-scale database engines made for the requirements of the linked open data cloud.

International Council for Science. 2015. “Open Data in a Big Data World.” Paris: International Council for Science (ICSU), International Social Science Council (ISSC), The World Academy of Sciences (TWAS), InterAcademy Partnership (IAP). https://www.icsu.org/cms/2017/04/open-data-in-a-big-data-world-long.pdf.
 * The International Council for Science (ICSU), the InterAcademy Partnership (IAP), the World Academy of Sciences (TWAS), and the International Social Science Council (ISSC), in the accord from the first Science International meeting, address the opportunities and challenges of the data revolution in the realm of global science policy. They explain the digital revolution as a world-historical event, considering the amounts of data produced and their effect on the research industry. The authors characterize big data by volume, variety, velocity, and veracity. Another important element is linked data and its importance for the Semantic Web. The accord also gives several reasons for the open data imperative. For example, with respect to “self-correction,” the openness and transparency of relevant data allow testing and reproducibility; with respect to non-replicability, attempts at replication have proved rather unsuccessful, which again calls for transparency in the publication of data and metadata. The document also contains principles of open data, which include boundaries of openness, enabling practices, and responsibilities (of scientists; research institutions and universities; publishers; funding agencies; professional associations, scholarly societies, and academies; and libraries, archives, and repositories).

Jain, Prateek, Pascal Hitzler, Amit P. Sheth, Kunal Verma, and Peter Z. Yeh. 2010. “Ontology Alignment for Linked Open Data.” In The Semantic Web - ISWC 2010, edited by P.F. Patel-Schneider et al., 402–17. Berlin; Heidelberg: Springer. http://link.springer.com/chapter/10.1007/978-3-642-17746-0_26.
 * Jain et al. argue that the Linked Open Data project is a major step in realizing the early open access vision for the Semantic Web. The group discusses their findings on alignment systems available for linked open data and how they struggled to find systems that performed satisfactorily. The group suggests that the system they have developed, BLOOMS, outperforms contemporary state-of-the-art ontology alignment systems on linked open data schemas. BLOOMS uses the Wikipedia category hierarchy, pre-processes the input ontologies, and post-processes with the assistance of an Alignment API and a reasoner. The research group provides charts that detail precision and recall rates across several different ontology alignment programs for open data schemas. The authors then suggest that further inquiries should be made into partonomical relationships and disjointness on the linked open data cloud.
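
Precision and recall, the measures reported in such comparisons, can be computed from a set of proposed correspondences and a reference alignment. The Python sketch below shows the standard definitions only; the class pairs are invented for illustration and are not results from Jain et al.

```python
from typing import Set, Tuple

Correspondence = Tuple[str, str]  # (class in ontology A, class in ontology B)


def precision_recall(
    found: Set[Correspondence], reference: Set[Correspondence]
) -> Tuple[float, float]:
    """Return (precision, recall) of proposed mappings against a reference."""
    correct = found & reference
    # Precision: share of proposed correspondences that are correct.
    precision = len(correct) / len(found) if found else 0.0
    # Recall: share of reference correspondences that were found.
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical reference alignment and system output.
    reference = {
        ("a:Film", "b:Movie"),
        ("a:Person", "b:Agent"),
        ("a:City", "b:Place"),
        ("a:Band", "b:MusicGroup"),
    }
    found = {
        ("a:Film", "b:Movie"),
        ("a:City", "b:Place"),
        ("a:Film", "b:Place"),  # an incorrect correspondence
    }
    print(precision_recall(found, reference))
```

A system that proposes many correspondences tends to gain recall at the cost of precision, which is why alignment evaluations usually report both figures side by side.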

Janssen, Katleen. 2012. “Open Government Data and the Right to Information: Opportunities and Obstacles.” The Journal of Community Informatics 8 (2): n.p. http://www.ci-journal.net/index.php/ciej/article/view/952.
 * Janssen provides an overview of the current discussion on open government data and the right to information. She argues that the open government data movement has close ties with the Right to Information movement: both promote access to government information as a fundamental right and call for greater availability of data held by government bodies. Janssen argues that access to government information is a key component of any transparency and accountability process for government activities. Transparency results in better-informed citizens who can contribute to governmental processes and express meaningful views with regard to government policy. Janssen concludes that the two movements should be seen as complementary and argues that they can promote each other through legislation. For example, the European Commission’s 2011 open data strategy promoted open data as indispensable for a smart, sustainable, and inclusive economy, and as a strategy to increase accountability.

Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. 2012. “Benefits, Adoption Barriers and Myths of Open Data and Open Government.” Information Systems Management 29 (4): 258–68. http://www.tandfonline.com/doi/abs/10.1080/10580530.2012.716740.
 * Janssen, Charalabidis, and Zuiderwijk provide a political analogue to many of the barriers preventing true, open data publication. Open government demands that the government give up control and that the public sector undergo considerable transformation. The authors use systems theory to draw attention to the distinctions between systems that are open to their environment and systems that are closed. The authors deem several points of open access rhetoric as myth: that publicizing data will yield benefits, that all information should be unrestrictedly publicized, that publishing public data is the whole of the task, and that every constituent can make use of open data. Finally, the myth that open data will result in open government is refuted. Janssen, Charalabidis, and Zuiderwijk suggest that open data only becomes valuable through use and that research demands more inquiry into the conversion of public data into services of public value.

Johnson, Jeffrey Alan. 2014. “From Open Data to Information Justice.” Ethics and Information Technology 16 (4): 263–74. http://link.springer.com/article/10.1007/s10676-014-9351-8.
 * Johnson argues that scholarly discussions of information justice should subsume the question of open data. His article examines the embedding of social privilege in datasets, the different capabilities of data users, and the norms that data systems impose through disciplinary functions. For Johnson, open data has the potential to exacerbate rather than alleviate social injustices. Data sovereignty should trump open data, and active pro-social countermeasures need to be taken to ensure ethical practices. Johnson calls for information pluralism, which would embrace, rather than problematize, the messiness of data. He argues that an information justice movement is vital for drumming up the participation necessary to make information pluralism a reality. Johnson calls for further inquiry into how existing social structures are perpetuated, exacerbated, and mitigated by information systems.

Kalampokis, Evangelos, Efthimios Tambouris, and Konstantinos Tarabanis. 2011. “Open Government Data: A Stage Model.” In 10th IFIP WG 8.5 International Conference, EGOV 2011, 235–46. Delft, Netherlands: Springer. http://link.springer.com/chapter/10.1007/978-3-642-22878-0_20.
 * Kalampokis, Tambouris, and Tarabanis create a stage model for open government data in this article. For the authors, governments have a mandate to enable and facilitate data consumption by both citizens and businesses. A lack of information on available data poses considerable difficulty to the field. The objective of this article is to supplement existing eGovernment stage models by providing a roadmap for open government data re-use and enabling evaluation of relevant initiatives. The stage model is made up of four parts: aggregation of government data, integration of that data, integration of government data with formal nongovernment data, and integration of government data with formal and social nongovernment data. Public agencies are advised to make their data available online easily and quickly. The authors recommend that open government data initiatives be thoroughly studied to identify important data sets for each stage of the model.

Molloy, Jennifer. 2011. “The Open Knowledge Foundation: Open Data Means Better Science.” PLOS Biology 9 (12): 1–4. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001195.
 * Molloy stipulates that implementing open data allows for initiatives within science-related disciplines to provide new infrastructure that supports data archiving and the development of stronger data management policies. The author suggests that there is little value in making data open and accessible if it is not being used. Molloy provides evidence from a recent collaboration between the Open Data in Science working group, the Joint Information Services Council, and Semantic Web Applications and Tools for Life Sciences in creating collections of open publications and datasets through available bibliographic data and crowdsourced summaries of non-open content. An accessible open data approach within the sciences will allow the disciplines to generate a wealth of tools, apps, and datasets that will facilitate the discovery and re-circulation of data.

Murray-Rust, Peter. 2008. “Open Data in Science.” Serials Review 34 (1): 52–64. http://www.tandfonline.com/doi/abs/10.1080/00987913.2008.10765152.
 * Murray-Rust argues for the need for open access publishing initiatives in the sciences and provides an outline of several early initiatives in the field. He discusses concepts such as reuse, mash-up, community norms, and permission barriers. Most of the data filed in chemistry, for example, is published as a collection of facts, and open access publishing could help re-orient the method by which data in the sciences is collected. Murray-Rust provides paper extracts with structures from organic chemistry as examples of the types of data that could be openly distributed in the sciences. He concludes his observations with the argument that the sciences should adopt Open Notebook science in parallel with formal publications in order to achieve the goal of liberating old data.

Piwowar, Heather A., and Todd J. Vision. 2013. “Data Reuse and the Open Data Citation Advantage.” PeerJ 1:e175. https://doi.org/10.7717/peerj.175.
 * Piwowar and Vision argue that reusing data and opting for a data management policy that makes use of open citation are effective means of facilitating science. This type of policy allows these resources to circulate and contribute to discussions far beyond their original analysis. The authors discuss the advantages and challenges of making research publicly available. Piwowar and Vision conduct a small-scale manual review of citation contexts and use attribution, through mentions of data accession numbers, to explore patterns in data reuse on a larger scale. The researchers determine that data availability is associated with citation benefit and data reuse is a demonstrable component of citation benefit.

Shadbolt, Nigel, Kieron O’Hara, Tim Berners-Lee, Nicholas Gibbins, Hugh Glaser, Wendy Hall, and M.C. Schraefel. 2012. “Linked Open Government Data: Lessons from Data.gov.uk.” IEEE Intelligent Systems 27 (3): 16–24. http://eprints.soton.ac.uk/340564/.
 * Shadbolt, O’Hara, Berners-Lee, Gibbins, Glaser, Hall, and Schraefel present their findings from the data.gov.uk website and its approach to open data management. They argue that the top-down political culture creates a data monopoly. Transparency in the UK is focused on data.gov.uk, which is a public data catalogue with thousands of downloadable datasets under a permissive open government license. The adoption of open government data is important for the linked data web, which can enhance the data discovery processes. The authors suggest that geography provides an intuitive way of aligning datasets.

Stadler, Claus, Jens Lehmann, Konrad Hoffner, and Soren Auer. 2012. “LinkedGeoData: A Core for a Web of Spatial Open Data.” Semantic Web 0 (1): 1–22. https://doi.org/10.3233/SW-2011-0052.
 * Stadler, Lehmann, Hoffner, and Auer present their research on the collaborative development of a spatial data web: OpenStreetMap (OSM). They describe how their data is interlinked through the LinkedGeoData project and can be accessed via the Linked Data paradigm. They describe the makeup of OSM, which consists of nodes to represent geographic points with latitude and longitude, as well as the general architecture of the OSM model. The representational state transfer (REST) API gives full access to OSM’s nodes and can be combined with RDF/XML, N-Triples, or Turtle. The authors provide a list of browser tools that use LinkedGeoData.
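
To show what a geographic node looks like in one of the RDF serializations named above, the following Python sketch writes the latitude and longitude of a single point as N-Triples. It uses the W3C Basic Geo (WGS84) vocabulary for the two properties; the node URI, the coordinates, and the choice of the `xsd:double` datatype are assumptions for this example and do not reproduce the LinkedGeoData schema.

```python
# W3C Basic Geo (WGS84) vocabulary namespace and the XSD double datatype.
WGS84 = "http://www.w3.org/2003/01/geo/wgs84_pos#"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def node_to_ntriples(node_uri: str, lat: float, lon: float) -> str:
    """Serialize one geographic point as two N-Triples statements."""
    lines = []
    for prop, value in (("lat", lat), ("long", lon)):
        # Each statement has the form: <subject> <predicate> "literal"^^<type> .
        lines.append(f'<{node_uri}> <{WGS84}{prop}> "{value}"^^<{XSD_DOUBLE}> .')
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical node URI and coordinates, used only for illustration.
    print(node_to_ntriples("http://example.org/node/1", 51.3397, 12.3731))
```

Because N-Triples puts one complete statement on each line, output like this can be concatenated or split freely, which makes the format convenient for large data dumps.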

Vision, Todd J. 2010. “Open Data and the Social Contract of Scientific Publishing.” BioScience 60 (5): 330–31. http://bioscience.oxfordjournals.org/content/60/5/330.full.
 * Vision considers open data and the social contract of scientific publishing. He begins with an appraisal of the scientific enterprise’s effectiveness for providing scientists with a means to publish their findings and receive credit for their work. To improve upon these standards, however, Vision believes that data needs to be included in the arrangement, and that the printed page can no longer be the unit of measurement for attributing scholarship and research. Vision believes that publishers can assist this process by having journals require data archiving at the time of publication. Un-archived data files are often misplaced, corrupted, and rendered obsolete over time. Vision moves on to a discussion of Dryad: a tool that promotes data citations through assigning unique DOIs and compiling data in a shared repository. He concludes with the suggestion that permanent archives for research data would allow the social contract of publishing to give authors and their data their due.

Xu, Guan-Hua. 2007. “Open Access to Scientific Data: Promoting Science and Innovation.” Data Science Journal 6 (17): 21–25. https://www.jstage.jst.go.jp/article/dsj/6/0/6_0_OD21/_article.
 * Xu details open access policies for scientific data in China. In 2002, the Ministry of Science and Technology launched the Scientific Data Sharing program with 24 government agencies participating in this program. The government of China is expecting to make 80 percent of scientific research data available to the general public. Xu notes that relevant laws and regulations need to be established and authentic data resources need to be further integrated. The author maintains that China is committed to the policy of reform and making information more readily available, especially with regard to shared and open data.