pyspark.pandas.Series.nunique

Series.nunique(dropna: bool = True, approx: bool = False, rsd: float = 0.05) → int

Return number of unique elements in the object. Excludes NA values by default.

Parameters
dropna: bool, default True

Don’t include NaN in the count.

approx: bool, default False

If False, uses the exact algorithm and returns the exact number of unique values. If True, uses the HyperLogLog approximate algorithm, which is significantly faster for large amounts of data. Note: This parameter is specific to pandas-on-Spark and is not found in pandas.

rsd: float, default 0.05

Maximum relative standard deviation (estimation error) allowed in the HyperLogLog algorithm. Only used when approx is True. Note: Like approx, this parameter is specific to pandas-on-Spark and is not found in pandas.
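As a rough guide to what rsd trades off, the sketch below uses the textbook HyperLogLog error estimate, in which the relative standard error is about 1.04 / sqrt(m) for m registers. The helper name registers_for_rsd is hypothetical, and the constants are those of the textbook algorithm; Spark's internal implementation may size its sketch differently, so treat the numbers as an illustration only.

```python
import math


def registers_for_rsd(rsd: float) -> int:
    """Smallest power-of-two register count m with 1.04 / sqrt(m) <= rsd.

    Assumption: textbook HyperLogLog error formula, not Spark's internals.
    """
    if rsd <= 0:
        raise ValueError("rsd must be positive")
    needed = (1.04 / rsd) ** 2                         # minimum m implied by the formula
    precision = max(4, math.ceil(math.log2(needed)))   # bits used to index registers
    return 2 ** precision


for rsd in (0.05, 0.01):
    print(rsd, registers_for_rsd(rsd))
```

Under this model, tightening rsd from 0.05 to 0.01 needs many more registers, which is why a smaller rsd costs more memory and time.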

Returns
int

See also

DataFrame.nunique

Method nunique for DataFrame.

Series.count

Count non-NA/null observations in the Series.

Examples

>>> ps.Series([1, 2, 3, np.nan]).nunique()
3
>>> ps.Series([1, 2, 3, np.nan]).nunique(dropna=False)
4

On big data, we recommend using the approximate algorithm to speed up this function. The result will be very close to the exact unique count.

>>> ps.Series([1, 2, 3, np.nan]).nunique(approx=True)
3
>>> idx = ps.Index([1, 1, 2, None])
>>> idx
Float64Index([1.0, 1.0, 2.0, nan], dtype='float64')
>>> idx.nunique()
2
>>> idx.nunique(dropna=False)
3
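To illustrate why the approximate mode can stay close to the exact unique count, here is a minimal pure-Python sketch of a basic HyperLogLog estimator. It is not the pandas-on-Spark implementation: the function hll_estimate, the SHA-1 based hash, and the default precision of 12 are all assumptions chosen for illustration.

```python
import hashlib
import math


def hll_estimate(values, precision: int = 12) -> float:
    """Estimate the number of distinct values with a basic HyperLogLog.

    Illustrative only; assumes precision >= 7 so the bias constant applies.
    """
    m = 1 << precision                       # number of registers
    tail_bits = 64 - precision
    registers = [0] * m
    for value in values:
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash of the value
        index = h >> tail_bits                         # leading bits select a register
        rest = h & ((1 << tail_bits) - 1)              # remaining bits
        rank = tail_bits - rest.bit_length() + 1       # 1-based position of leftmost 1-bit
        registers[index] = max(registers[index], rank)
    alpha = 0.7213 / (1 + 1.079 / m)                   # bias-correction constant
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                       # small-range correction
        return m * math.log(m / zeros)
    return raw


# 50,000 distinct values, each appearing four times
data = [i % 50_000 for i in range(200_000)]
exact = len(set(data))
approx = hll_estimate(data)
print(exact, round(approx))
```

The estimator keeps only a fixed array of small counters regardless of input size, whereas the exact count has to track every distinct value it has seen.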