NumPy is a very useful Python library for analyzing data, e.g., data about heights of players, game scores, weather, etc. One of the most frequent things we would like to do with data is to study its distribution, e.g., using a histogram. A histogram takes all your data elements, bins them into ranges, and shows you what these ranges are and how many data elements fall into each bin. histogramdd() is a NumPy function for creating histograms of multidimensional data: it can be applied to 1-dimensional (1D) data, 2-dimensional (2D) data, and in general data with any number of dimensions, hence the “dd” in its name.
numpy.histogramdd() in 1D
Here is a very simple example of histogramdd() in action:
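A minimal sketch of such a program is given below. The height values are hypothetical, chosen only to match the description in this post (8 players, shortest 6.1, tallest 6.7, with 6.1 appearing twice), and the variable names are assumptions as well.

```python
import numpy as np

# Hypothetical heights of 8 players (shortest 6.1, tallest 6.7)
heights = np.array([6.1, 6.5, 6.3, 6.1, 6.7, 6.35, 6.2, 6.5])

# With no extra arguments, histogramdd() uses 10 equal-width bins
counts, bins = np.histogramdd(heights)

print(counts)  # number of values that fall in each bin
print(bins)    # bin edges, one array per dimension
```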
There is a lot happening in the code above, so let us take it line by line. First, we import numpy and refer to it as “np” henceforth in this program. We use the np.array() function to create an array of heights of players. Note that the heights array has information about 8 players whose heights range from 6.1 (shortest) to 6.7 (tallest).
In the next line, we call histogramdd() to compute the histogram, passing the heights array as an argument. Without any additional arguments, histogramdd() uses 10 equal-width bins. There are many ways to bin a range of values, but we will not get into them in this blog post. Each bin receives some number of values (possibly zero) that fall into the range it represents. These counts are stored in the “counts” array returned by histogramdd().
When we run the program above, it prints the counts array first, followed by the bin edges.
The second part of the output shows the bin edges. histogramdd() returns them as a sequence with one array per dimension, so for 1D data it holds a single array. You should take the edges two at a time as you move along this array. Thus, the first bin is [6.1, 6.16). Note that the interval is closed on the left but open on the right. This means all values that are greater than or equal to 6.1 and strictly less than 6.16 will be binned here. Two values fall in this bin (see the counts array printed on the first line), namely 6.1 and 6.1 (see the data in the original program). Similarly, the second bin is [6.16, 6.22), the third bin is [6.22, 6.28), and so on. The last bin is [6.64, 6.7] (note that both ends are inclusive because we have reached the end of the range).
Here is a more user-friendly program that prints the bins, the number of elements in each, and the specific boundaries of each bin:
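One way such a program might look is sketched here. The heights are again hypothetical, and the exact wording of each printed line is an assumption; only the name bins_converted comes from the description in this post.

```python
import numpy as np

# Hypothetical heights of 8 players
heights = np.array([6.1, 6.5, 6.3, 6.1, 6.7, 6.35, 6.2, 6.5])
counts, bins = np.histogramdd(heights)

# bins holds one array of edges per dimension; take the only one
# and convert it to a plain Python list
bins_converted = bins[0].tolist()

lines = []
last = len(counts) - 1
for i in range(len(counts)):
    # Only the last bin is closed on the right
    closer = "]" if i == last else ")"
    lines.append(
        f"Bin {i + 1}: [{bins_converted[i]:.2f}, {bins_converted[i + 1]:.2f}{closer}"
        f" has {int(counts[i])} element(s)"
    )

print("\n".join(lines))
```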
Note that here we first convert the returned NumPy array of bin edges into a list (bins_converted) and then print the details of each bin in a for loop. We take care to check whether the right end of each bin is inclusive or exclusive and print the closing delimiter accordingly.
The output contains one line per bin.
You can add up all the element counts and see that they sum up to the size of the original array.
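This check is easy to automate. The snippet below is a small sketch using hypothetical height values (8 players, as in this post) that compares the sum of the counts with the size of the input array.

```python
import numpy as np

# Hypothetical heights of 8 players
heights = np.array([6.1, 6.5, 6.3, 6.1, 6.7, 6.35, 6.2, 6.5])
counts, bins = np.histogramdd(heights)

# Every value lands in exactly one bin, so the counts add up to the array size
total = int(counts.sum())
print(total, heights.size)
print(total == heights.size)
```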
numpy.histogramdd() with user-specified bin number
You can update the program to specify a particular number of bins, e.g., 5:
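A short sketch of this change, assuming the same hypothetical heights as before, passes the bin count through the bins argument:

```python
import numpy as np

# Hypothetical heights of 8 players
heights = np.array([6.1, 6.5, 6.3, 6.1, 6.7, 6.35, 6.2, 6.5])

# Ask for 5 equal-width bins instead of the default 10
counts, bins = np.histogramdd(heights, bins=5)

print(counts)  # 5 counts, one per bin
print(bins)    # 6 edges, from 6.1 to 6.7
```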
The output now shows five bins instead of ten.
Note that the bins have gotten bigger (because there are fewer of them) and thus contain more elements. In particular, note that there is no bin with zero elements.
numpy.histogramdd() with 1 bin
In the limiting case, you can specify bins=1 to put everything in a single bin. So if we update this one line as:
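For instance, with the hypothetical heights used for illustration (an assumption, since the values are not fixed by this post), the call could look like this:

```python
import numpy as np

# Hypothetical heights of 8 players
heights = np.array([6.1, 6.5, 6.3, 6.1, 6.7, 6.35, 6.2, 6.5])

# A single bin covers the whole range of the data
counts, bins = np.histogramdd(heights, bins=1)

print(counts)  # one count holding all 8 values
print(bins)    # two edges: the minimum and the maximum
```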
The output will show a single bin holding all 8 values.
Note that the single bin spans from the minimum value (6.1) to the maximum value (6.7), both inclusive.
numpy.histogramdd() with 2 dimensional data
Let us now explore the real power of histogramdd(), i.e., with multidimensional data. For ease of illustration, we will use 2D data, but the same approach works for higher dimensions as well.
For this example, we will simply generate random 2-dimensional data, as in the program below:
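A minimal sketch of such a program follows. The variable names and the wording of the printed lines are assumptions; the calls to randn() and histogramdd() with 2 bins follow the description in this post.

```python
import numpy as np

# 10 random points in 2 dimensions (values differ on every run)
data = np.random.randn(10, 2)
print(data)

# bins=2 means 2 bins along each dimension, giving a 2x2 grid of counts
counts, bins = np.histogramdd(data, bins=2)
print(counts)

# Each cell of counts pairs one bin from each dimension
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        x_lo, x_hi = bins[0][i], bins[0][i + 1]
        y_lo, y_hi = bins[1][j], bins[1][j + 1]
        print(f"x in [{x_lo:.3f}, {x_hi:.3f}], y in [{y_lo:.3f}, {y_hi:.3f}]: "
              f"{int(counts[i, j])} point(s)")
```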
In the above program, we first generate 2D random data using NumPy's random.randn() function, which draws samples from a standard normal distribution and takes as input the number of data points to generate (10) and the number of dimensions (2). In real life, these dimensions can be height and weight, temperature and humidity, price and discount, etc.
After generating this data, we print it, then call histogramdd() with bins set to 2, i.e., 2 bins along each dimension. After that, we pretty-print the results as before.
The output changes from run to run because the data is random; the discussion below refers to one sample run.
Note that the original array has both negative and positive values. As a result, the bins along both dimensions span negative and positive values. Remember that this is a 2D array, so the bins must be interpreted jointly. Thus the “first” bin would be [-1.84694171, -0.40196092) along the first dimension and [-1.28763895, -0.23910684) along the second dimension. The results show that there are 3 values that fall in this bin.
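The joint interpretation can be verified by hand. The sketch below selects the points that fall inside the first bin along both dimensions and compares their number with counts[0, 0]. It uses a seeded generator (an assumption, swapped in for randn() so that the run is repeatable), so its numbers will differ from the sample run discussed above.

```python
import numpy as np

# Seeded generator so the run is repeatable (swapped in for randn())
rng = np.random.default_rng(0)
data = rng.standard_normal((10, 2))

counts, bins = np.histogramdd(data, bins=2)
x_edges, y_edges = bins

# The "first" bin is half-open in both dimensions: [edge 0, edge 1)
in_first = (
    (data[:, 0] >= x_edges[0]) & (data[:, 0] < x_edges[1])
    & (data[:, 1] >= y_edges[0]) & (data[:, 1] < y_edges[1])
)

print(data[in_first])      # the points that landed in the first bin
print(int(counts[0, 0]))   # equals the number of rows printed above
```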
If you wish to use a different number of bins along each dimension, you should pass a list or array with one bin count per dimension to histogramdd(), like so:
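As a sketch, assuming the same kind of randomly generated 2D data as above, 2 bins along the first dimension and 3 along the second can be requested like this:

```python
import numpy as np

# 10 random points in 2 dimensions (values differ on every run)
data = np.random.randn(10, 2)

# One bin count per dimension: 2 bins for the first, 3 for the second
counts, bins = np.histogramdd(data, bins=[2, 3])

print(counts.shape)  # shape of the grid of counts
print(counts)
print(bins)          # 3 edges for the first dimension, 4 for the second
```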
As before, the exact output varies from run to run.
Note that the counts array is now a 2x3 array, i.e., 2 bins along the first dimension and 3 bins along the second dimension.
As you can see, numpy.histogramdd() is a very powerful tool for quickly analyzing a multidimensional dataset and understanding how the data points are distributed along each dimension. Histograms are very useful in finance, science, medicine, and engineering.
Kodeclik is an online coding academy for kids and teens to learn real-world programming. Kids are introduced to coding in a fun and exciting way and are challenged to reach higher levels with engaging, high-quality content.