{ "metadata": { "name": "poissondemo" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simulating a Poisson distribution\n", "\n", "The Hubble ultra deep field (HUDF) is 200 arcseconds on a side and contains about 10,000 galaxies. If we divide it into images 4 arcseconds on a side, there are $50\\times 50=2,500$ such images, each one of which should contain on average 4 galaxies. Of course we will not have exactly 4 galaxies in each image. If we assume that each of the 10,000 galaxies has an equal chance of being in any of the 2,500 images, and ignore the possibility of a galaxy being split between two images, we have a situation where the probability distribution for the number of galaxies in a randomly chosen image is approximated by a Poisson distribution with a mean of 4. We can simulate this by going through for each of the 10,000 galaxies and randomly assigning it to one of the 2,500 images, without regard to how many other galaxies are already there. The number of galaxies in each image will then be, approximately, an independent Poisson random variable.\n", "\n", "If we simulate a data set following this procedure, we'll get 2,500 random values which should be approximately drawn from the same Poisson distribution, so the fraction of those 2,500 images with k galaxies should be approximately the probability for a Poisson random value with mean 4 to be k. It turns out 2,500 is still small enough that the actual frequencies will vary noticeable from the long-term average, so for the purposes of illustration, we'll scale up the number of images by a factor of 200. So we're pretending that the HUDF is $8\\times 10^6$ rather than $4\\times 10^4$ square arcseconds (with the same density), and there are 2,000,000 galaxies to distribute among 500,000 images." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline\n", "from __future__ import division\n", "rcParams['figure.figsize'] = (10.0, 8.0)\n", "rcParams['font.size'] = 14\n", "seed(20140219)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "num_galaxies = 10*1000*200; num_images = 2500*200; galaxies_per_image = num_galaxies/num_images" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print \"We have %g galaxies in %g images for an average density of %g galaxies per image\" % (num_galaxies,num_images,galaxies_per_image)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We randomly choose an image into which to place each galaxy; i.e., we generate 2 million random integers each between 0 and 499,999, using the `randint` function" ] }, { "cell_type": "code", "collapsed": false, "input": [ "?randint" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "image_for_galaxy = randint(0,num_images,num_galaxies)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the list of image indices, galaxy by galaxy:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "image_for_galaxy" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to count how many galaxies landed in each image. That seems like it might be tricky to code up, but if we realize we're counting all the 0s in the list above, then all the 1s, then all the 2s, etc., we can see that what we're really doing is making a histogram of these numbers, where each bin contains one integer. So we can let numpy's `histogram` function do the work for us." ] }, { "cell_type": "code", "collapsed": false, "input": [ "?histogram" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second argument of the function is a list of bins, which is actually the list of left and right edges of the bins, where the left edge is included and the right is not, so we use the numpy `range` function to give us a list of integers from 0 to 500,000." ] }, { "cell_type": "code", "collapsed": false, "input": [ "(galaxies_in_image,tmp) = histogram(image_for_galaxy,range(num_images+1))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, counting that took a little while. Now we can see the list of counts:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "galaxies_in_image" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can make a bar plot, image by image. Doing it for all half million images is a bit crazy, but we can zoom in on the first 50:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bar(tmp[:50],galaxies_in_image[:50]);\n", "xlabel('image index');\n", "ylabel('number of galaxies in image');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can make a traditional histogram: count the number of images with no galaxies, the number with one, etc, up to whatever the maximum happens to be:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "(nx,xx) = histogram(galaxies_in_image,range(min(galaxies_in_image),max(galaxies_in_image)+2))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second output is the left and right edges of the bins:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "xx" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This actually has one more element than we want, since it goes up to the right edge of the last bin, so we drop the last value so we have the right number of points to print." ] }, { "cell_type": "code", "collapsed": false, "input": [ "max(galaxies_in_image)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "x=xx[:-1]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the number of images containing each possible number of galaxies:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "nx" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we divide this by the number of images, we'll get the relative frequency of each value:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "px = nx / num_images; px" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compare this to the Poisson probability mass function\n", "$$\n", "p(x|\\mu) = \\frac{\\mu^x}{x!} e^{-\\mu}\n", "$$\n", "and plot the two of them:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from scipy.misc import factorial\n", "px_theor = galaxies_per_image**x / factorial(x) * exp(-galaxies_per_image)\n", "bar(x-0.5,px,1,color=[1,1,1],label='simulation');\n", "plot(x,px_theor,'ko',label='Poisson pmf');\n", "legend();\n", "xlabel('Number of galaxies in image');\n", "ylabel('Probability or frequency');\n", "xlim([-0.5,x[-1]+0.5])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see the relative frequencies in the simulation come very close to the theoretical Poisson probabilities." ] } ], "metadata": {} } ] }