pyspark.sql.DataFrame.fillna
- DataFrame.fillna(value, subset=None)
Returns a new DataFrame in which null values are replaced with the given value.
DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
New in version 1.3.1.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- value: int, float, string, bool or dict
The value to replace null values with. If value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. Each replacement value must be an int, float, boolean, or string.
- subset: str, tuple or list, optional
Column names to consider. Columns in subset whose data type does not match the type of value are ignored. For example, if value is a string and subset contains a non-string column, the non-string column is simply ignored.
- Returns
DataFrame
DataFrame with replaced null values.
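The type-matching rule described under Parameters can be illustrated with a small pure-Python model. This is only a sketch of the documented behaviour for scalar values, not Spark's implementation: `kind_of`, `fill_nulls`, and the three type-group labels are hypothetical names introduced here, rows are plain dicts, and the dict form of value is not modelled. Unlike Spark, the sketch does not cast the replacement to the column's type.

```python
from typing import Any, Dict, Iterable, List, Optional, Union

Scalar = Union[int, float, str, bool]
Row = Dict[str, Any]


def kind_of(value: Scalar) -> str:
    """Return the coarse type group of a value (hypothetical labels)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "string"
    raise TypeError("this sketch handles int, float, str, and bool values only")


def fill_nulls(
    rows: List[Row],
    schema: Dict[str, str],
    value: Scalar,
    subset: Optional[Union[str, Iterable[str]]] = None,
) -> List[Row]:
    """Return new rows with None replaced by value in type-compatible columns."""
    if isinstance(subset, str):
        subset = [subset]
    candidates = list(schema) if subset is None else list(subset)
    value_kind = kind_of(value)
    # Columns whose type group differs from the value's are silently skipped.
    targets = {c for c in candidates if schema.get(c) == value_kind}
    return [
        {c: (value if v is None and c in targets else v) for c, v in row.items()}
        for row in rows
    ]


# Same data as the Examples section, expressed as plain dicts.
schema = {"age": "numeric", "height": "numeric", "name": "string", "bool": "boolean"}
rows = [
    {"age": 10, "height": 80.5, "name": "Alice", "bool": None},
    {"age": 5, "height": None, "name": "Bob", "bool": None},
    {"age": None, "height": None, "name": "Tom", "bool": None},
    {"age": None, "height": None, "name": None, "bool": True},
]

# A numeric value only touches the numeric columns.
filled_numeric = fill_nulls(rows, schema, 50)
# A string value skips "age" even though it is listed in subset.
filled_name = fill_nulls(rows, schema, "Spark", subset=["age", "name"])
```

The model returns new row dicts and leaves the input untouched, mirroring the fact that fillna returns a new DataFrame.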
Examples
>>> df = spark.createDataFrame([
...     (10, 80.5, "Alice", None),
...     (5, None, "Bob", None),
...     (None, None, "Tom", None),
...     (None, None, None, True)],
...     schema=["age", "height", "name", "bool"])
Example 1: Fill all null values with 50 for numeric columns.
>>> df.na.fill(50).show()
+---+------+-----+----+
|age|height| name|bool|
+---+------+-----+----+
| 10|  80.5|Alice|NULL|
|  5|  50.0|  Bob|NULL|
| 50|  50.0|  Tom|NULL|
| 50|  50.0| NULL|true|
+---+------+-----+----+
Example 2: Fill all null values with False for boolean columns.
>>> df.na.fill(False).show()
+----+------+-----+-----+
| age|height| name| bool|
+----+------+-----+-----+
|  10|  80.5|Alice|false|
|   5|  NULL|  Bob|false|
|NULL|  NULL|  Tom|false|
|NULL|  NULL| NULL| true|
+----+------+-----+-----+
Example 3: Fill null values with 50 and "unknown" in the 'age' and 'name' columns, respectively.
>>> df.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+----+
|age|height|   name|bool|
+---+------+-------+----+
| 10|  80.5|  Alice|NULL|
|  5|  NULL|    Bob|NULL|
| 50|  NULL|    Tom|NULL|
| 50|  NULL|unknown|true|
+---+------+-------+----+
Example 4: Fill all null values with "Spark" in the 'name' column.
>>> df.na.fill(value='Spark', subset='name').show()
+----+------+-----+----+
| age|height| name|bool|
+----+------+-----+----+
|  10|  80.5|Alice|NULL|
|   5|  NULL|  Bob|NULL|
|NULL|  NULL|  Tom|NULL|
|NULL|  NULL|Spark|true|
+----+------+-----+----+