
# 4 dataset_1d Tool

The dataset_1d tool is used to study the distribution of univariate datasets. If supplied with a simple vector containing N measurements of some quantity, dataset_1d will estimate the probability density function that describes the dataset, for example by computing a histogram of the data, and will allow the user to visualize and fit that density. Analysis may be restricted to a subset of the data by specifying a range of values to include or exclude. The following sections describe the specific capabilities of dataset_1d.

## 4.1 Getting Datasets into dataset_1d

• As described in Sections 3.3.2 & 3.3, Event Browser can send one or more named datasets into dataset_1d.

• The Load 1-D Dataset item in the File menu can be used to read a column of data from a FITS binary table or an ASCII file.

• From the IDL prompt you can load a vector of data into dataset_1d (Section 8).

## 4.2 Mode Droplist

A droplist just above the plot controls the basic mathematical entity that is plotted.
• Scatter Plot: The value of each 1-D datapoint is simply plotted against the datapoint's index. The plot symbol and color may be changed by pressing the Edit button.

• Density Function: A binned density function for the 1-D dataset is estimated and plotted. The Y-axis values are normalized by the binsize. Thus they represent the number of data points per unit of the quantity being measured, e.g. seconds, eV, pixels, rather than the number of datapoints falling in each bin which is how histograms are often displayed.

The line style, color, bin size, bin phase, and error bar presentation may be changed by pressing the Edit button. By default, the density function is simply a scaled histogram. An error (sigma) estimate for each bin is made based on simple Poisson counting statistics, i.e. the error on a bin with N events is $\sqrt{N}$. Bins with 0 events are arbitrarily assigned an error of 1.

You may, however, specify that the histogram should be smoothed. Note that a smoothed histogram made with a small binsize approximates the result obtained by the kernel smoothing method often recommended by statisticians in the field of non-parametric density estimation [Silverman1986]. Kernel smoothing avoids the spurious features often found in histograms, features that change dramatically when the phase of the bins is changed. Unfortunately, I do not know how to put error bars on a smoothed histogram.

• Distribution Function: A distribution function (the integral of the density) for the 1-D dataset is plotted.
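The Density Function and Distribution Function modes can be illustrated with a minimal Python sketch. This is not the dataset_1d source code; the function names `density_function` and `distribution_function` are hypothetical, and two details are assumptions on my part: the error bars are divided by the bin size along with the counts, and the distribution function is normalized to run from 0 to 1.

```python
import math

def density_function(data, bin_size, bin_phase=0.0):
    """Return (bin_centers, density, sigma) for a 1-D dataset.

    Bin edges sit at bin_phase + k * bin_size. Counts are divided by
    bin_size, so the density is in datapoints per unit of the measured
    quantity rather than datapoints per bin.
    """
    first = math.floor((min(data) - bin_phase) / bin_size)
    last = math.floor((max(data) - bin_phase) / bin_size)
    counts = [0] * (last - first + 1)
    for x in data:
        counts[math.floor((x - bin_phase) / bin_size) - first] += 1
    centers = [bin_phase + (first + k + 0.5) * bin_size
               for k in range(len(counts))]
    density = [n / bin_size for n in counts]
    # Poisson error sqrt(N) per bin; empty bins are assigned an error of 1.
    # Dividing the error by bin_size is an assumption of this sketch.
    sigma = [(math.sqrt(n) if n > 0 else 1.0) / bin_size for n in counts]
    return centers, density, sigma

def distribution_function(data):
    """Return (sorted_values, cumulative_fraction) for a 1-D dataset.

    Normalizing to a maximum of 1 is an assumption of this sketch.
    """
    xs = sorted(data)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# Example: the same four measurements binned with a bin size of 0.5.
values = [0.5, 1.5, 1.7, 3.2]
for center, rho, err in zip(*density_function(values, bin_size=0.5)):
    print(f"{center:5.2f}  {rho:5.2f} +/- {err:4.2f}")
```

Changing `bin_phase` shifts every bin edge by the same amount, which is a quick way to see how much the histogram's features depend on where the bins happen to fall.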

## 4.3 Axis & Title Controls

Axis ranges are specified by pan-and-zoom-style controls found below and to the left of the plot window.
• Pan buttons: These buttons pan the plot window left & right.

• Zoom-, Zoom+: These buttons change the range of the axis.

• Auto: Setting this button makes the axis range follow the data.

• Range: This button prompts you to choose the axis range by clicking the mouse on the plot.

• Center: This button prompts you to click on the plot location that should become the new center of the plot.

• X-edit/Y-edit: These buttons bring up a dialog box that lets you type values for the axis endpoints, choose a logarithmic style, and specify the margins to the left & right of the plot (where the Y-title goes).

When the 1-1 button is checked, a 1-1 aspect ratio is maintained and the Zoom, Center, and Range buttons affect both axes. For example, with 1-1 checked you can "zoom in" on an image feature either by pressing either Range button and selecting the corners of the region you want to display, or by pressing Center, clicking on the feature, and then pressing Zoom to scale.

The Titles button brings up a dialog box that lets you specify miscellaneous properties of the plot.

• window dimensions (the size of the plot window on the screen)

• titles

• date annotation

• marker positions: Two markers exist in the plot coordinate system, displayed by red plus signs. Their plot coordinates may be changed in this dialog box, and if world coordinates have been defined their positions in that system are displayed in this dialog box. The markers are used for defining regions of interest and for specifying positions used in various analyses (see below).

The Big Marker may also be moved with the left mouse button and the Small Marker may be moved with the right mouse button. Clicking the middle mouse button will display information about the nearest datapoint, density function bin, or distribution function sample. If world coordinates have been defined for the axes, the mouse position in those coordinates is displayed continuously as the mouse is moved.

## 4.4 Selected Dataset Droplist

The Univariate Analysis widget can analyze multiple 1-D datasets, plotting density functions for each on the same plot. Each dataset has a unique name; the selected dataset's name is shown in a droplist to the left of the mode droplist. Many controls pertain only to this selected dataset.

The File menu is used to print the display and to save the density and distribution functions to FITS files. Dialog boxes will appear to let you configure PostScript parameters and choose filenames.

## 4.6 Region-of-interest Controls

At the top of the widget are controls that let you specify an interval of data values that defines a region-of-interest (ROI). For example, to compute statistics on a range of data you would either:
1. Move the two markers to the ends of the range by clicking the left & right mouse buttons.

2. Press the Use Markers button.

3. Change the left-hand droplist from None to Stats.

or

1. Type in the range endpoints in the Edit dialog box.

2. Change the left-hand droplist from None to Stats.

The right-hand droplist may be used to exclude rather than include a range of data.

If the left-hand droplist is set to Filter, then datapoints falling outside the ROI are excluded from analysis. Don't forget, if you wish this filter to propagate to the Working Dataset you must press the Apply Filter button (Section 3.3.1).
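The include/exclude behavior of the region-of-interest controls can be sketched in a few lines of Python. The names `apply_roi` and `roi_stats` are hypothetical, treating the interval endpoints as inclusive is an assumption, and the particular statistics returned here (count, mean, standard deviation) are illustrative only; the tool's Stats display may report different quantities.

```python
import statistics

def apply_roi(data, low, high, exclude=False):
    """Keep datapoints inside [low, high], or outside it when exclude=True."""
    if low > high:
        low, high = high, low  # the two markers may be placed in either order
    return [x for x in data if (low <= x <= high) != exclude]

def roi_stats(data, low, high, exclude=False):
    """Summary statistics for the datapoints selected by the ROI."""
    selected = apply_roi(data, low, high, exclude)
    return {
        "count": len(selected),
        "mean": statistics.fmean(selected) if selected else None,
        "stdev": statistics.stdev(selected) if len(selected) > 1 else None,
    }

# Example: statistics inside the interval, then with the interval excluded.
measurements = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
print(roi_stats(measurements, 2.0, 4.0))
print(roi_stats(measurements, 2.0, 4.0, exclude=True))
```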

The selected dataset (optionally filtered) may be fit to a gaussian+polynomial probability model by direct application of the Maximum Likelihood Method, i.e. by directly maximizing the likelihood of the data (see Chapter 10 of [Bevington and Robinson1992]). One advantage of this method is that it does not use a density function estimated from the univariate dataset. Such density estimates invariably require choosing arbitrary parameters, such as histogram bin size and phase.

The menu item Fit Setup creates controls that allow you to specify the number of gaussian components and the order of the polynomial background component in the model, as well as supply initial values for the model parameters. If you press the button labeled "Mouse" you will be asked to click on the density plot to define the initial parameter values for the current gaussian component. Individual model parameters may be frozen at the value you supplied by changing the droplist next to the parameter from "free" to "fixed". Up to three gaussian components are allowed, although only the parameters for one gaussian are displayed at a time. A gaussian component will be used in the fit if its initial amplitude parameter is non-zero OR if its amplitude parameter is marked "free". Similarly, a polynomial term will be used in the fit if its initial coefficient parameter is non-zero OR if its coefficient parameter is marked "free".

To perform the fit, choose the Perform Fit menu item. Since the Maximum Likelihood Method evaluates the model at each datapoint, the fit will be slow for large datasets. We constrain the integral of the model over the range of the data to be equal to the number of datapoints. Thus, the amplitudes of the gaussian components plus the coefficients of the polynomial terms cannot all be free parameters. One of them is chosen to be a derived parameter so that the model integral comes out right. As a result, the number of free parameters, reported in the fit result message, is often one less than you expect.
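To make the derived-parameter idea concrete, here is a minimal Python sketch for the simplest case: one gaussian plus a constant (zeroth-order polynomial) background. It is not the tool's implementation. The function names are hypothetical, the gaussian "amplitude" is taken to be its area in datapoints (an assumption), the data are assumed to span a non-zero range, and the crude pattern search stands in for whatever optimizer the tool actually uses. The model has four parameters (amplitude, mean, sigma, background), but the background is derived from the constraint, so only three are free.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Unit-area gaussian density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gauss_cdf(x, mu, sigma):
    """Integral of gauss_pdf from -infinity to x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def derived_background(data, amplitude, mu, sigma):
    """Constant background fixed by the constraint integral(model) = N."""
    lo, hi = min(data), max(data)
    gauss_area = amplitude * (gauss_cdf(hi, mu, sigma) - gauss_cdf(lo, mu, sigma))
    return (len(data) - gauss_area) / (hi - lo)

def neg_log_likelihood(data, amplitude, mu, sigma):
    """-ln(likelihood), with the model evaluated at every datapoint."""
    n = len(data)
    background = derived_background(data, amplitude, mu, sigma)
    total = 0.0
    for x in data:
        model = amplitude * gauss_pdf(x, mu, sigma) + background
        if model <= 0.0:
            return math.inf  # a density must be positive at every datapoint
        total -= math.log(model / n)  # model / n integrates to 1 over the data range
    return total

def fit(data, start, steps, rounds=60):
    """Crude pattern search over (amplitude, mu, sigma); halves the steps when stuck."""
    params, steps = list(start), list(steps)
    best = neg_log_likelihood(data, *params)
    for _ in range(rounds):
        improved = False
        for i in range(len(params)):
            for sign in (1.0, -1.0):
                trial = list(params)
                trial[i] += sign * steps[i]
                if trial[2] <= 0.0:
                    continue  # sigma must stay positive
                value = neg_log_likelihood(data, *trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            steps = [s / 2.0 for s in steps]
    return params, best

# Example: a clump of values near 5 on top of a few scattered ones.
sample = [4.0, 4.5, 4.8, 5.0, 5.0, 5.2, 5.5, 6.0, 0.0, 2.0, 8.0, 10.0]
(amplitude, mu, sigma), value = fit(sample, (6.0, 4.0, 1.0), (1.0, 0.5, 0.25))
print(amplitude, mu, sigma, derived_background(sample, amplitude, mu, sigma))
```

Because `neg_log_likelihood` loops over every datapoint on every trial, the cost of a fit grows with the size of the dataset, which is why large datasets fit slowly.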

The Kolmogorov-Smirnov statistic is used to characterize the goodness-of-fit (see Section 14.3 of [Press1992] and Section 4.5.2 of [Babu and Feigelson1996]) by comparing the model probability density to the density you've estimated from the data (the plot you're looking at). Thus, the choices you've made in computing the density estimate (bin size, bin phase, smoothing) may slightly affect the KS statistic. Unfortunately, this fitting method does not conveniently produce estimates for the errors on the fit parameters.
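For readers unfamiliar with the statistic, here is a minimal Python sketch of the textbook one-sample Kolmogorov-Smirnov statistic. Note the difference from the tool: the text above says dataset_1d compares the model to the density you have estimated (binned and possibly smoothed), whereas this sketch uses the unbinned empirical distribution function, which is the standard definition. The name `ks_statistic` and the `model_cdf` argument are hypothetical.

```python
def ks_statistic(data, model_cdf):
    """Largest distance between the empirical and model distribution functions.

    model_cdf(x) must return the model's cumulative probability at x,
    running from 0 to 1 over the range of the data.
    """
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = model_cdf(x)
        # The empirical function steps from i/n to (i+1)/n at x,
        # so the model must be compared against both sides of the step.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# Example: three values tested against a uniform model on [0, 1].
def uniform_cdf(x):
    return min(max(x, 0.0), 1.0)

print(ks_statistic([0.1, 0.4, 0.7], uniform_cdf))
```

A small value means the model's distribution function tracks the data closely; the statistic by itself says nothing about the errors on the fit parameters.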

If you would prefer to perform a traditional least squares fit of your density function to a gaussian+polynomial model, then you must export the density function to the function_1d tool (see the next section). Least squares fitting is fast and will give you error estimates on the parameters, but you'll face the angst of choosing bin sizes, phases, and smoothing and worrying about what happens to the weighting of bins that have small numbers (including zero) of counts.

If desired, the density of the selected dataset and/or the model function may be exported to a function_1d tool for further analysis. For example, suppose you wanted to try fitting your dataset with both a single gaussian model and with a double gaussian model, and you wanted to produce a plot that shows both models and the density estimate (histogram) for the data. This cannot be done in the Univariate Analysis widget because only one fit exists at a time. However, by exporting all the functions you're working with (the density function plus the fit functions you create) to a function_1d tool (which knows how to work with multiple functions), you can create the plot you need.
1. Perform the first fit

2. Export the first fit - this creates a new function_1d tool

3. Change the number of gaussian components (Fit Setup) then perform the second fit

4. Export the second fit - it is added to the function_1d tool you recently created

5. Export the density function itself

6. In the function_1d tool, edit the function descriptions as desired, turn on the legend, adjust the plot symbols, line styles, and colors as desired, adjust the axes as desired, and print.

Patrick Broos
Penn State Department of Astronomy
2013-01-08