A Design Specification for ``nan_policy``
=========================================

Many functions in `scipy.stats` have a parameter called ``nan_policy``
that determines how the function handles data that contains ``nan``. In
this section, we provide SciPy developer guidelines for how ``nan_policy``
is intended to be used, to ensure that as this parameter is added to new
functions, we maintain a consistent API.

The basic API
-------------

The parameter ``nan_policy`` accepts three possible strings: ``'omit'``,
``'raise'`` and ``'propagate'``. The meanings are:

* ``nan_policy='omit'``:
  Ignore occurrences of ``nan`` in the input. Do not generate a warning
  if the input contains ``nan`` (unless the equivalent input with the
  ``nan`` values removed would generate a warning). For example, for the
  simple case of a function that accepts a single array and returns a
  scalar (and ignoring the possible use of ``axis`` for the moment)::

      func([1.0, 3.0, np.nan, 5.0], nan_policy='omit')

  should behave the same as::

      func([1.0, 3.0, 5.0])

  More generally, for functions that return a scalar,
  ``func(a, nan_policy='omit')`` should behave the same as
  ``func(a[~np.isnan(a)])``.

  For functions that transform a vector to a new vector of the same
  size and for which each entry in the output array depends on more than
  just the corresponding value in the input array [#f1]_ (e.g.
  `scipy.stats.zscore`, `scipy.stats.boxcox` *when* ``lmbda`` *is None*),::

      y = func(a, nan_policy='omit')

  should behave the same as::

      nan_mask = np.isnan(a)
      y = np.empty(a.shape, dtype=np.float64)
      y[~nan_mask] = func(a[~nan_mask])
      y[nan_mask] = np.nan

  (In general, the dtype of ``y`` might depend on ``a`` and on the expected
  behavior of ``func``). In other words, a ``nan`` in the input gives a
  corresponding ``nan`` in the output, but the presence of that ``nan``
  does not affect the calculation of the non-``nan`` values.

  Unit tests for this property should be used to test functions that
  handle ``nan_policy``.

  For functions that return a scalar and that accept two or more
  arguments but whose values are not related (e.g. `scipy.stats.ansari`,
  `scipy.stats.f_oneway`), the same idea applies to each input array. So::

      func(a, b, nan_policy='omit')

  should behave the same as::

      func(a[~np.isnan(a)], b[~np.isnan(b)])

  For inputs with *related* or *paired* values (e.g. `scipy.stats.pearsonr`,
  `scipy.stats.ttest_rel`) the recommended behavior is to omit all the
  values for which any of the related values are ``nan``. For a function
  with two related array inputs, this means::

      y = func(a, b, nan_policy='omit')

  should behave the same as::

      hasnan = np.isnan(a) | np.isnan(b)  # Union of the isnan masks.
      y = func(a[~hasnan], b[~hasnan])

  The docstring for such a function should clearly state this behavior.

* ``nan_policy='raise'``:
  Raise a ``ValueError`` if the input contains ``nan``.

* ``nan_policy='propagate'``:
  Propagate the ``nan`` value to the output. Typically, this means just
  execute the function without checking for ``nan``, but see

      https://github.com/scipy/scipy/issues/7818

  for an example where that might lead to unexpected output.

``nan_policy`` combined with an ``axis`` parameter
--------------------------------------------------

There is nothing surprising here: the principle mentioned above still
applies when the function has an ``axis`` parameter. Suppose, for example,
``func`` reduces a 1-d array to a scalar, and handles n-d arrays as a
collection of 1-d arrays, with the ``axis`` parameter specifying the axis
along which the reduction is to be applied. If, say::

    func([1, 3, 4])     -> 10.0
    func([2, -3, 8, 2]) ->  4.2
    func([7, 8])        ->  9.5
    func([])            -> -inf

then::

    func([[  1, nan,   3,   4],
          [  2,  -3,   8,   2],
          [nan,   7, nan,   8],
          [nan, nan, nan, nan]], nan_policy='omit', axis=-1)

must give the result::

    np.array([10.0, 4.2, 9.5, -inf])

Edge cases
----------

A function that implements the ``nan_policy`` parameter should gracefully
handle the case where *all* the values in the input array(s) are ``nan``.
The basic principle described above still applies::

    func([nan, nan, nan], nan_policy='omit')

should behave the same as::

    func([])

In practice, when adding ``nan_policy`` to an existing function, it is
not unusual to find that the function doesn't already handle this case
in a well-defined manner, and some thought and design may have to be
applied to ensure that it works. The correct behavior (whether that be
to return ``nan``, return some other value, raise an exception, or something
else) will be determined on a case-by-case basis.

Why doesn't ``nan_policy`` also apply to ``inf``?
--------------------------------------------------

Although we learn in grade school that "infinity is not a number", the
floating point values ``nan`` and ``inf`` are qualitatively different.
The values ``inf`` and ``-inf`` act much more like regular floating
point values than ``nan``.

* One can compare ``inf`` to other floating point values and it behaves
  as expected, e.g. ``3 < inf`` is True.
* For the most part, arithmetic works "as expected" with ``inf``,
  e.g. ``inf + inf = inf``, ``-2*inf = -inf``, ``1/inf = 0``,
  etc.
* Many existing functions work "as expected" with ``inf``:
  ``np.log(inf) = inf``, ``np.exp(-inf) = 0``,
  ``np.array([1.0, -1.0, np.inf]).min() = -1.0``, etc.

So while ``nan`` almost always means "something went wrong" or "something
is missing", ``inf`` can in many cases be treated as a useful floating
point value.

It is also consistent with the NumPy ``nan`` functions to not ignore
``inf``::

    >>> np.nanmax([1, 2, 3, np.inf, np.nan])
    inf
    >>> np.nansum([1, 2, 3, np.inf, np.nan])
    inf
    >>> np.nanmean([8, -np.inf, 9, 1, np.nan])
    -inf

How *not* to implement ``nan_policy``
-------------------------------------

In the past (and possibly currently), some ``stats`` functions handled
``nan_policy`` by using a masked array to mask the ``nan`` values, and
then computing the result using the functions in the ``mstats`` subpackage.
The problem with this approach is that the masked array code might convert
``inf`` to a masked value, which we don't want to do (see above). It also
means that, if care is not taken, the return value will be a masked array,
which will likely be a surprise to the user if they passed in regular
arrays.

.. rubric:: Footnotes

.. [#f1] If an element of the output depends only on the corresponding
   element of the input (e.g. `numpy.sin`, `scipy.special.gamma`), then
   there is no need for a ``nan_policy`` parameter.
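
As a closing illustration of the three policies and of the masked-array
pitfall described under "How *not* to implement ``nan_policy``", here is a
minimal sketch for the simple case of a 1-d input and a scalar-valued
function. The helper name ``reduce_with_nan_policy`` and the error messages
are hypothetical and chosen only for this example; this is not how any
particular SciPy function is implemented. The sketch uses a plain boolean
mask from ``np.isnan``, so ``inf`` is left untouched, and contrasts that
with ``np.ma.masked_invalid``, which masks ``inf`` as well as ``nan``.

```python
import numpy as np


def reduce_with_nan_policy(func, a, nan_policy='propagate'):
    """Apply a 1-d reducer ``func`` to ``a`` according to ``nan_policy``.

    Hypothetical helper for illustration only; not part of SciPy.
    """
    if nan_policy not in ('propagate', 'raise', 'omit'):
        raise ValueError("nan_policy must be 'propagate', 'raise' or 'omit'")
    a = np.asarray(a, dtype=np.float64)
    nan_mask = np.isnan(a)  # True only for nan, never for inf
    if nan_policy == 'raise' and nan_mask.any():
        raise ValueError("The input contains nan values")
    if nan_policy == 'omit':
        a = a[~nan_mask]  # plain boolean indexing keeps inf in the data
    # 'propagate' falls through: func sees the data unchanged.
    return func(a)


data = [1.0, np.inf, np.nan, 5.0]

# 'omit' drops only the nan, so inf still takes part in the reduction.
omitted = reduce_with_nan_policy(np.max, data, nan_policy='omit')

# 'propagate' runs func on the raw data; np.max returns nan here.
propagated = reduce_with_nan_policy(np.max, data, nan_policy='propagate')

# For comparison: masked_invalid masks inf as well as nan.
masked = np.ma.masked_invalid(data)
```

In this sketch ``omitted`` is ``inf`` and ``propagated`` is ``nan``, while
``masked.mask`` flags both the ``inf`` and the ``nan`` entries, which is the
behavior the section above warns against. Handling of ``axis``, paired
inputs and all-``nan`` input is deliberately left out.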