Data Analysis in Astronomy

Introduction
An integral part of the scientific process is the design of experiments and the analysis of the resulting data. The astronomical case is different in practice in that there is little latitude in designing the experiments: we cannot, perhaps fortunately, make a star go supernova to test our models. In addition, the data obtained in astronomy are often what is called noise in other fields, be it the annoying cosmic ray hits or the background noise in telephone and antenna signals. As a result, we have to deal with situations in which the data are incomplete, inconclusive and often hidden. Dealing with these situations requires some amount of imagination and a willingness to try different approaches. Our goal in this book is to lay the foundation for the analysis and interpretation of the data, assuming some familiarity with computers and sufficient mathematical and programming ability to understand and implement basic statistical techniques.

Data analysis in astronomy has a long tradition, dating back to the very first astronomers, who used the positions of the Moon and stars to plan their planting and harvesting cycles. Models were created to explain the observations of the time and did a wonderful job of prediction within the limitations of those observations. As the observations became better, the models became correspondingly better, with the geocentric model of the universe replaced by the Copernican model. Kepler used the observations of Brahe to show that the orbits of the planets were ellipses with one focus at the Sun rather than the circles of antiquity.

Observations and data are still at the heart of modern astronomy but there are new challenges of different types, all of which boil down to the desire to extract more information from the data than is apparent to the eye. The ultimate goal of the astronomer is to observe the entire sky with high cadence and sensitivity: a goal which is getting closer with projects such as the Large Synoptic Survey Telescope (LSST) or Gaia. These data sets, which are readily available, encourage new modes of data analysis from the large scale statistical properties of white dwarfs in the Galaxy to the long term monitoring of individual stars in the search for exoplanets.

Our approach in this Wikibook is to present an introduction to data analysis using examples from astronomy. Each concept will be illustrated by programs in GDL and Python, and we encourage the reader to try the programs and to write their own. One cannot analyse scientific data without statistics, and we will introduce a selection of topics relating to astronomical data analysis. Data analytics and data mining have become corporate buzzwords, although they are often used without a proper understanding of the techniques or their limitations. Astronomical data analysis presents many of the same challenges as industry but in a more controlled environment, and the lessons learned in understanding the results may be applied to many other contexts.

Tools
There have been many tools designed for the analysis of astronomical data. These were largely defined by wavelength regions with optical astronomers using the Image Reduction and Analysis Facility (IRAF), radio astronomers using the Astronomical Image Processing System (AIPS) and ultraviolet astronomers using the Interactive Data Language (IDL). As data analysis has increased in importance, many astronomers have begun to use Python and are developing tools to replicate much of the functionality of the older languages. Also widely used is R, particularly for statistical purposes. There are many more tools available for the statistical analysis of data (for example see this list). The reader is encouraged to try out different tools/languages and choose the one most appropriate for the task at hand. In this work, we will give examples in GDL (note that GDL is source code compatible with IDL in most cases) and in Python, but only to illustrate the techniques.

Obtaining IDL and its clones
There are three IDL versions under which we have tested the examples given in this book. For the sake of consistency, and for historical reasons, we will use the term "IDL" throughout this book, with the understanding that the code presented here will work under any of the three variants.
 * 1) The Interactive Data Language (IDL) may be purchased for a number of operating systems from Harris Geospatial. The commercial product provides many features and enhancements which may not be available in the free versions.
 * 2) The GNU Data Language (GDL) is available from Sourceforge with binaries for many Linux and BSD systems as well as for Mac OS X. It should compile cleanly on most UNIX systems and under Cygwin for Windows. GDL may also be installed in a Linux system under VirtualBox.
 * 3) The Fawlty Data Language (FDL) is another actively developed clone of IDL and is available from here.

Starting IDL
Invoking IDL is dependent on the operating system, the variant of IDL and the local environment. The installation instructions should be consulted for details, including system variables which may control aspects of the program behavior. For example, my .bash_profile file contains the two lines (amongst other unrelated commands):

export GDL_STARTUP="/Users/jayanth/user/idluser/GDL_STARTUP"

export GDL_PATH="+/Users/jayanth/user/idluser/idllib:/opt/local/share/gnudatalanguage/lib:

The first of these instructs GDL to execute the commands in the GDL_STARTUP file as it starts up while the second gives the directories where GDL should look for library routines.

My startup file is as follows:

journal; Keep a log of all input and output to the command line. Default is to save in a file called gdljournal.pro.

defsysv,"!red",255; Define a system variable called !red. We will look at the application of this later.

defsysv,"!green",65535; Define a system variable called !green.

!quiet=1; Operate in "quiet" mode. Don't tell me which programs are being compiled.

Anything after the semicolon (;) will not be compiled by IDL and can be used for program documentation. Although my startup file contains only four lines, it could be more complex and might contain any commands that I run on every invocation of IDL.

Learning IDL
There are many resources on the web for learning IDL, both the basics and more advanced techniques. We will get the reader started here; more sophisticated techniques will be discussed with the applications below and are covered in other works. It is important to understand that IDL is an interpreted language; that is, each command is entered on a single line and executed as soon as it is entered. The philosophy of IDL is that it is a tool to explore data and, as such, it is an advantage to get immediate feedback from commands. We will discuss IDL functions and procedures later. Note that IDL commands are case insensitive.

We will not assume any prior knowledge of IDL in this work and will explain every command used in the examples. However, IDL is a rich language with a complexity and power that is beyond the scope of this book. We recommend that the dedicated reader explore the rich variety of resources available. A Slug's Guide to IDL is a good place to start with links to many other useful sites.

Python
Linux distributions come packaged with Python (version 2) by default. A newer version 3 of Python is available (however, not all supporting packages have been ported to the new version). In our examples here we will use Python 3 (to be future-proof and also with backward compatibility in mind). In addition to the regular Python interpreter, IPython is an option, particularly its Notebook interface.
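
Since a system may have more than one version of Python installed, it can help to confirm which interpreter is actually running before trying the examples. The following is a minimal sketch, assuming the examples are run under Python 3; the helper name running_python3 is hypothetical and is used here only for illustration.

```python
import sys


def running_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3 or later."""
    return version_info[0] >= 3


# Report the version of the interpreter that is executing this script.
print("Python version:", sys.version.split()[0])
print("Python 3 or later:", running_python3())
```

If the second line prints False, the script is being run by a Python 2 interpreter and the examples may need to be started with a different command.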

Terminology
The statistical inferences one makes from data depend on the mathematical tools of probability. Some background on probabilistic terms and concepts is, therefore, in order here. In science there are many situations in which the result of an experiment is not known a priori but is expected to be one of a set of possible outcomes, much like the toss of a coin or the roll of a die. For example, a newly discovered object in the sky may be one of the known astrophysical objects (asteroid, planet, star, galaxy, quasar) or an entirely new species of object; a signal obtained in a pixel of a CCD may be cosmic in origin or due to a random fluctuation of temperature within the CCD.

The procedure (for example, the observations made with a telescope, the tossing of a coin or the rolling of a die) that can be repeated many times is termed an "Experiment".

The entire gamut of possible outcomes is called our "Sample space".

The result of the experiment is an "Event".

The aim of the experiment is to find the likelihood that the "Event" is a given one of the outcomes in our "Sample space". This will have a "Statistical significance" and we will then form an "Inference" from our experiment.
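
The terms above can be made concrete with the roll of a die. The following is a minimal sketch in Python, assuming a fair six-sided die; the question asked (whether the roll is even), the function names and the number of trials are illustrative choices and not part of the definitions above.

```python
import random
from fractions import Fraction

# Sample space: every possible outcome of one roll of a fair six-sided die.
sample_space = [1, 2, 3, 4, 5, 6]


def is_even(outcome):
    """The question we ask of each result (an illustrative choice): is it even?"""
    return outcome % 2 == 0


# Exact probability when all outcomes in the sample space are equally likely.
favourable = [o for o in sample_space if is_even(o)]
exact_probability = Fraction(len(favourable), len(sample_space))


def run_experiment(n_trials, seed=0):
    """Repeat the experiment n_trials times; return the fraction of even results."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    hits = sum(is_even(rng.choice(sample_space)) for _ in range(n_trials))
    return hits / n_trials


observed_frequency = run_experiment(10_000)
print("exact probability :", exact_probability)
print("observed frequency:", observed_frequency)
```

Each call to rng.choice is one repetition of the experiment, and its result is the event. As the number of trials grows, the observed frequency should approach the exact probability; the size of the remaining difference is what a statement of statistical significance has to account for.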