Functions
General Functions
pyjanitor's general-purpose data cleaning functions.
add_columns
add_column(df, column_name, value, fill_remaining=False)
Add a column to the dataframe.
Intended to be the method-chaining alternative to:
df[column_name] = value
Note
This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
Examples:
Add a column of constant values to the dataframe.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=1)
a b c
0 0 a 1
1 1 b 1
2 2 c 1
Add a column of different values to the dataframe.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=list("efg"))
a b c
0 0 a e
1 1 b f
2 2 c g
Add a column using an iterator.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=range(4, 7))
a b c
0 0 a 4
1 1 b 5
2 2 c 6
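The `fill_remaining` flag recycles a shorter iterable R-style, as described in the parameter table below; a minimal sketch of the expected behaviour (this example is not from the original docs):
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_column(column_name="c", value=[1, 2], fill_remaining=True)
a b c
0 0 a 1
1 1 b 2
2 2 c 1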
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `str` | Name of the new column. Should be a string, in order for the column name to be compatible with the Feather binary format (this is a useful thing to have). | required |
| `value` | `Union[List[Any], Tuple[Any], Any]` | Either a single value, or a list/tuple of values. | required |
| `fill_remaining` | `bool` | If `value` is a tuple or list that is smaller than the number of rows in the DataFrame, repeat the list or tuple (R-style) to the end of the DataFrame. | `False` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If attempting to add a column that already exists. |
| `ValueError` | If `value` has more elements than the number of rows in the DataFrame. |
| `ValueError` | If attempting to add an iterable of values with a length not equal to the number of DataFrame rows. |
| `ValueError` | If `value` is not an iterable and `fill_remaining` is `True`. |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with an added column. |

Source code in `janitor/functions/add_columns.py`
add_columns(df, fill_remaining=False, **kwargs)
Add multiple columns to the dataframe.
This method does not mutate the original DataFrame.
Method to augment `add_column` with the ability to add multiple columns in one go. This replaces the need for multiple `add_column` calls.
Usage is through supplying kwargs where the key is the col name and the values correspond to the values of the new DataFrame column.
Values passed can be scalar or iterable (list, ndarray, etc.)
Note
This function will be deprecated in a 1.x release. Please use `pd.DataFrame.assign` instead.
Examples:
Inserting two more columns into a dataframe.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.add_columns(x=4, y=list("def"))
a b x y
0 0 a 4 d
1 1 b 4 e
2 2 c 4 f
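As the deprecation note above suggests, the same result can be obtained with plain pandas via `pd.DataFrame.assign`:
>>> df.assign(x=4, y=list("def"))
a b x y
0 0 a 4 d
1 1 b 4 e
2 2 c 4 f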
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `fill_remaining` | `bool` | If value is a tuple or list that is smaller than the number of rows in the DataFrame, repeat the list or tuple (R-style) to the end of the DataFrame. (Passed to `add_column`.) | `False` |
| `**kwargs` | `Any` | Column, value pairs, which are looped through in `add_column` calls. | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with added columns. |

Source code in `janitor/functions/add_columns.py`
also
Implementation source for chainable function `also`.
also(df, func, *args, **kwargs)
Run a function with side effects.
This function allows you to run an arbitrary function in the pyjanitor method chain. Doing so will let you do things like save the dataframe to disk midway while continuing to modify the dataframe afterwards.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = (
... pd.DataFrame({"a": [1, 2, 3], "b": list("abc")})
... .query("a > 1")
... .also(lambda df: print(f"DataFrame shape is: {df.shape}"))
... .rename_column(old_column_name="a", new_column_name="a_new")
... .also(lambda df: df.to_csv("midpoint.csv"))
... .also(
... lambda df: print(f"Columns: {df.columns}")
... )
... )
DataFrame shape is: (2, 2)
Columns: Index(['a_new', 'b'], dtype='object')
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `func` | `Callable` | A function you would like to run in the method chain. It should take one DataFrame object as a parameter and have no return. If there is a return, it will be ignored. | required |
| `*args` | `Any` | Optional arguments for `func`. | `()` |
| `**kwargs` | `Any` | Optional keyword arguments for `func`. | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | The input pandas DataFrame, unmodified. |

Source code in `janitor/functions/also.py`
bin_numeric
Implementation source for `bin_numeric`.
bin_numeric(df, from_column_name, to_column_name, bins=5, **kwargs)
Generate a new column that labels bins for a specified numeric column.
This method does not mutate the original DataFrame.
A wrapper around the pandas `cut()` function to bin data of one column, generating a new column with the results.
Examples:
Binning a numeric column with specific bin edges.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [3, 6, 9, 12, 15]})
>>> df.bin_numeric(
... from_column_name="a", to_column_name="a_binned",
... bins=[0, 5, 11, 15],
... )
a a_binned
0 3 (0, 5]
1 6 (5, 11]
2 9 (5, 11]
3 12 (11, 15]
4 15 (11, 15]
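Because extra keyword arguments are forwarded to `pd.cut`, bin labels can be attached as well; a small sketch (the `labels` argument belongs to `pd.cut`, not to `bin_numeric` itself):
>>> df.bin_numeric(
...     from_column_name="a", to_column_name="a_binned",
...     bins=[0, 5, 11, 15], labels=["low", "mid", "high"],
... )
a a_binned
0 3 low
1 6 mid
2 9 mid
3 12 high
4 15 high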
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `from_column_name` | `str` | The column whose data you want binned. | required |
| `to_column_name` | `str` | The new column to be created with the binned data. | required |
| `bins` | `Optional[Union[int, ScalarSequence, IntervalIndex]]` | The binning strategy to be utilized. Read the `pd.cut` documentation for more details. | `5` |
| `**kwargs` | `Any` | Additional kwargs to pass to `pd.cut`. | `{}` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If … |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame. |

Source code in `janitor/functions/bin_numeric.py`
case_when
Implementation source for `case_when`.
case_when(df, *args, default=None, column_name)
Create a column based on a condition or multiple conditions.
Similar to SQL and dplyr's `case_when`, with inspiration from pydatatable's `if_else` function.
If your scenario requires direct replacement of values, pandas' `replace` method or `map` method should be better suited and more efficient; if the conditions check whether a value is within a range of values, pandas' `cut` or `qcut` should be more efficient; `np.where`/`np.select` are also performant options.
This function relies on the `pd.Series.mask` method.
When multiple conditions are satisfied, the first one is used.
The variable `*args` parameter takes arguments of the form: `condition0`, `value0`, `condition1`, `value1`, ..., `default`. If `condition0` evaluates to `True`, then assign `value0` to `column_name`; if `condition1` evaluates to `True`, then assign `value1` to `column_name`, and so on. If none of the conditions evaluate to `True`, assign `default` to `column_name`.
This function can be likened to SQL's `case_when`:

    CASE WHEN condition0 THEN value0
         WHEN condition1 THEN value1
         --- more conditions
         ELSE default
    END AS column_name

compared to Python's `if-elif-else`:

    if condition0:
        value0
    elif condition1:
        value1
    # more elifs
    else:
        default
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {
... "a": [0, 0, 1, 2, "hi"],
... "b": [0, 3, 4, 5, "bye"],
... "c": [6, 7, 8, 9, "wait"],
... }
... )
>>> df
a b c
0 0 0 6
1 0 3 7
2 1 4 8
3 2 5 9
4 hi bye wait
>>> df.case_when(
... ((df.a == 0) & (df.b != 0)) | (df.c == "wait"), df.a,
... (df.b == 0) & (df.a == 0), "x",
... default = df.c,
... column_name = "value",
... )
a b c value
0 0 0 6 x
1 0 3 7 0
2 1 4 8 8
3 2 5 9 9
4 hi bye wait hi
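For comparison, the same result can be produced with `np.select`, one of the alternatives mentioned above (a sketch using plain pandas/numpy, not the pyjanitor API):
>>> import numpy as np
>>> conditions = [
...     ((df.a == 0) & (df.b != 0)) | (df.c == "wait"),
...     (df.b == 0) & (df.a == 0),
... ]
>>> # np.select evaluates conditions in order, like case_when
>>> df.assign(value=np.select(conditions, [df.a, "x"], default=df.c))
a b c value
0 0 0 6 x
1 0 3 7 0
2 1 4 8 8
3 2 5 9 9
4 hi bye wait hi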
Version Changed
- 0.24.0: Added `default` parameter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `*args` | `Any` | Variable argument of conditions and expected values. Takes the form `condition0, value0, condition1, value1, ...`. | `()` |
| `default` | `Any` | This is the element inserted in the output when all conditions evaluate to False. Can be scalar, 1-D array or callable. If callable, it should evaluate to a 1-D array. The 1-D array should be the same length as the DataFrame. | `None` |
| `column_name` | `str` | Name of column to assign results to. A new column is created if it does not already exist in the DataFrame. | required |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If condition/value fails to evaluate. |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame. |

Source code in `janitor/functions/case_when.py`
change_index_dtype
Implementation of the `change_index_dtype` function.
change_index_dtype(df, dtype, axis='index')
Cast an index to a specified dtype `dtype`.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> rng = np.random.default_rng(seed=0)
>>> np.random.seed(0)
>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
... 'foo', 'foo', 'qux', 'qux'],
... [1.0, 2.0, 1.0, 2.0,
... 1.0, 2.0, 1.0, 2.0]]))
>>> idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=['A', 'B'])
>>> df
A B
first second
bar 1.0 1.764052 0.400157
2.0 0.978738 2.240893
baz 1.0 1.867558 -0.977278
2.0 0.950088 -0.151357
foo 1.0 -0.103219 0.410599
2.0 0.144044 1.454274
qux 1.0 0.761038 0.121675
2.0 0.443863 0.333674
>>> outcome=df.change_index_dtype(dtype=str)
>>> outcome
A B
first second
bar 1.0 1.764052 0.400157
2.0 0.978738 2.240893
baz 1.0 1.867558 -0.977278
2.0 0.950088 -0.151357
foo 1.0 -0.103219 0.410599
2.0 0.144044 1.454274
qux 1.0 0.761038 0.121675
2.0 0.443863 0.333674
>>> outcome.index.dtypes
first object
second object
dtype: object
>>> outcome=df.change_index_dtype(dtype={'second':int})
>>> outcome
A B
first second
bar 1 1.764052 0.400157
2 0.978738 2.240893
baz 1 1.867558 -0.977278
2 0.950088 -0.151357
foo 1 -0.103219 0.410599
2 0.144044 1.454274
qux 1 0.761038 0.121675
2 0.443863 0.333674
>>> outcome.index.dtypes
first object
second int64
dtype: object
>>> outcome=df.change_index_dtype(dtype={0:'category',1:int})
>>> outcome
A B
first second
bar 1 1.764052 0.400157
2 0.978738 2.240893
baz 1 1.867558 -0.977278
2 0.950088 -0.151357
foo 1 -0.103219 0.410599
2 0.144044 1.454274
qux 1 0.761038 0.121675
2 0.443863 0.333674
>>> outcome.index.dtypes
first category
second int64
dtype: object
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `dtype` | | Use a str or dtype to cast the entire Index to the same type. Alternatively, use a dictionary to change the MultiIndex to new dtypes. | required |
| `axis` | `str` | Determines which axis to change the dtype(s). Should be either 'index' or 'columns'. | `'index'` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with new Index. |

Source code in `janitor/functions/change_index_dtype.py`
change_type
change_type(df, column_name, dtype, ignore_exception=False)
Change the type of a column.
This method does not mutate the original DataFrame.
Exceptions that are raised can be ignored. For example, if one has a mixed dtype column that has non-integer strings and integers, and you want to coerce everything to integers, you can optionally ignore the non-integer strings and replace them with `NaN` or keep the original value.
Intended to be the method-chaining alternative to:
df[col] = df[col].astype(dtype)
Note
This function will be deprecated in a 1.x release. Please use `pd.DataFrame.astype` instead.
Examples:
Change the type of a column.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df
col1 col2
0 0 m
1 1 5
2 2 True
>>> df.change_type(
... "col1", dtype=str,
... ).change_type(
... "col2", dtype=float, ignore_exception="fillna",
... )
col1 col2
0 0 NaN
1 1 5.0
2 2 1.0
Change the type of multiple columns. To change the type of all columns, please use `DataFrame.astype` instead.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": range(3), "col2": ["m", 5, True]})
>>> df.change_type(['col1', 'col2'], str)
col1 col2
0 0 m
1 1 5
2 2 True
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `Hashable \| list[Hashable] \| Index` | The column(s) in the dataframe. | required |
| `dtype` | `type` | The datatype to convert to. Should be one of the standard Python types, or a numpy datatype. | required |
| `ignore_exception` | `bool` | One of `False`, `"fillna"`, or `"keep_values"`. | `False` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If unknown option provided for `ignore_exception`. |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with changed column types. |

Source code in `janitor/functions/change_type.py`
clean_names
Functions for cleaning columns/index names and/or column values.
clean_names(df, axis='columns', column_names=None, strip_underscores=None, case_type='lower', remove_special=False, strip_accents=True, preserve_original_labels=True, enforce_string=True, truncate_limit=None)
Clean column/index names. It can also be applied to column values.
Takes all column names, converts them to lowercase, then replaces all spaces with underscores.
By default, column names are converted to string types. This can be switched off by passing in `enforce_string=False`.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {
... "Aloha": range(3),
... "Bell Chart": range(3),
... "Animals@#$%^": range(3)
... }
... )
>>> df
Aloha Bell Chart Animals@#$%^
0 0 0 0
1 1 1 1
2 2 2 2
>>> df.clean_names()
aloha bell_chart animals@#$%^
0 0 0 0
1 1 1 1
2 2 2 2
>>> df.clean_names(remove_special=True)
aloha bell_chart animals
0 0 0 0
1 1 1 1
2 2 2 2
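A sketch of the other casing and underscore options described under the parameters below (assuming `case_type="snake"` converts CamelCase names as documented, and `strip_underscores=True` removes outer underscores):
>>> pd.DataFrame({"CamelCaseColumn": [1]}).clean_names(case_type="snake")
camel_case_column
0 1
>>> pd.DataFrame({"_messy_name_": [1]}).clean_names(strip_underscores=True)
messy_name
0 1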
Version Changed
- 0.26.0: Added `axis` and `column_names` parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The pandas DataFrame object. | required |
| `axis` | `str` | Whether to clean the labels on the index or columns. If `None`, cleaning is applied to the values in `column_names` instead. | `'columns'` |
| `column_names` | `str \| list` | Clean the values in the given column(s). Applicable only if `axis` is `None`. | `None` |
| `strip_underscores` | `str \| bool` | Removes the outer underscores from all column names/values. Default None keeps outer underscores. Values can be either 'left', 'right' or 'both' or the respective shorthand 'l', 'r' and True. | `None` |
| `case_type` | `str` | Whether to make columns lower or uppercase. Current case may be preserved with 'preserve', while snake case conversion (from CamelCase or camelCase only) can be turned on using "snake". Default 'lower' makes all characters lowercase. | `'lower'` |
| `remove_special` | `bool` | Remove special characters from columns. Only letters, numbers and underscores are preserved. | `False` |
| `strip_accents` | `bool` | Whether or not to remove accents from columns names/values. | `True` |
| `preserve_original_labels` | `bool` | Preserve original names. This is later retrievable using `df.original_labels`. | `True` |
| `enforce_string` | `bool` | Whether or not to convert all column names/values to string type. Defaults to True, but can be turned off. Columns with >1 levels will not be converted by default. | `True` |
| `truncate_limit` | `int` | Truncates formatted column names/values to the specified length. Default None does not truncate. | `None` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If … |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame. |

Source code in `janitor/functions/clean_names.py`
coalesce
Function for performing coalesce.
coalesce(df, *column_names, target_column_name=None, default_value=None)
Coalesce two or more columns of data in order of column names provided.
Given the variable arguments of column names, `coalesce` finds and returns the first non-missing value from these columns, for every row in the input dataframe. If all the column values are null for a particular row, then the `default_value` will be filled in.
If `target_column_name` is not provided, the coalesced values are stored in the first column provided.
This method does not mutate the original DataFrame.
The `select` syntax can be used in `column_names`.
Examples:
Use `coalesce` with 3 columns, "a", "b" and "c".
>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> df = pd.DataFrame({
... "a": [np.nan, 1, np.nan],
... "b": [2, 3, np.nan],
... "c": [4, np.nan, np.nan],
... })
>>> df.coalesce("a", "b", "c")
a b c
0 2.0 2.0 4.0
1 1.0 3.0 NaN
2 NaN NaN NaN
Provide a target_column_name.
>>> df.coalesce("a", "b", "c", target_column_name="new_col")
a b c new_col
0 NaN 2.0 4.0 2.0
1 1.0 3.0 NaN 1.0
2 NaN NaN NaN NaN
Provide a default value.
>>> import pandas as pd
>>> import numpy as np
>>> import janitor
>>> df = pd.DataFrame({
... "a": [1, np.nan, np.nan],
... "b": [2, 3, np.nan],
... })
>>> df.coalesce(
... "a", "b",
... target_column_name="new_col",
... default_value=-1,
... )
a b new_col
0 1.0 2.0 1.0
1 NaN 3.0 3.0
2 NaN NaN -1.0
This is more syntactic diabetes! For R users, this should look familiar to dplyr's `coalesce` function; for Python users, the interface should be more intuitive than the `pandas.Series.combine_first` method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_names` | `Any` | A list of column names. | `()` |
| `target_column_name` | `Optional[str]` | The new column name after combining. If `None`, the first column in `column_names` is updated with the coalesced values. | `None` |
| `default_value` | `Optional[Union[int, float, str]]` | A scalar to replace any remaining nulls after coalescing. | `None` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If length of `column_names` is less than 2. |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with coalesced columns. |

Source code in `janitor/functions/coalesce.py`
collapse_levels
Implementation of the `collapse_levels` function.
collapse_levels(df, sep=None, glue=None, axis='columns')
Flatten multi-level index/column dataframe to a single level.
This method does not mutate the original DataFrame.
Given a DataFrame containing multi-level index/columns, flatten to single-level by string-joining the labels in each level.
After a `groupby` / `aggregate` operation where `.agg()` is passed a list of multiple aggregation functions, a multi-level DataFrame is returned with the name of the function applied in the second level.
It is sometimes convenient for later indexing to flatten out this multi-level configuration back into a single level. This function does this through a simple string-joining of all the names across different levels in a single column.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "class": ["bird", "bird", "bird", "mammal", "mammal"],
... "max_speed": [389, 389, 24, 80, 21],
... "type": ["falcon", "falcon", "parrot", "Lion", "Monkey"],
... })
>>> df
class max_speed type
0 bird 389 falcon
1 bird 389 falcon
2 bird 24 parrot
3 mammal 80 Lion
4 mammal 21 Monkey
>>> grouped_df = df.groupby("class")[['max_speed']].agg(["mean", "median"])
>>> grouped_df
max_speed
mean median
class
bird 267.333333 389.0
mammal 50.500000 50.5
>>> grouped_df.collapse_levels(sep="_")
max_speed_mean max_speed_median
class
bird 267.333333 389.0
mammal 50.500000 50.5
Before applying `.collapse_levels`, the `.agg` operation returns a multi-level column DataFrame whose columns are `(level 1, level 2)`:

    [("max_speed", "mean"), ("max_speed", "median")]

`.collapse_levels` then flattens the column MultiIndex into a single level index with names:

    ["max_speed_mean", "max_speed_median"]

For more control, a `glue` specification can be passed, where the names of the levels are used to control the output of the flattened index:
>>> (grouped_df
... .rename_axis(columns=['column_name', 'agg_name'])
... .collapse_levels(glue="{agg_name}_{column_name}")
... )
mean_max_speed median_max_speed
class
bird 267.333333 389.0
mammal 50.500000 50.5
Note that for `glue` to work, the keyword arguments in the glue specification should be the names of the levels in the MultiIndex.
Version Changed
- 0.27.0: Added `glue` and `axis` parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `sep` | `str` | String separator used to join the column level names. | `None` |
| `glue` | `str` | A specification on how the column levels should be combined. It allows for a more granular composition, and serves as an alternative to `sep`. | `None` |
| `axis` | `str` | Determines whether to collapse the levels on the index or columns. | `'columns'` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with single-level column index. |

Source code in `janitor/functions/collapse_levels.py`
complete
complete(df, *columns, sort=False, by=None, fill_value=None, explicit=True)
Complete a data frame with missing combinations of data.
It is modeled after tidyr's `complete` function. In a way, it is the inverse of `pd.dropna`, as it exposes implicitly missing rows.
The variable `columns` parameter can be a column name, a list of column names, or a pandas Index, Series, or DataFrame. If a pandas Index, Series, or DataFrame is passed, it should have a name or names that exist in `df`.
A callable can also be passed - the callable should evaluate to a pandas Index, Series, or DataFrame, and the names of the pandas object should exist in `df`.
A dictionary can also be passed - the values of the dictionary should be either a 1D array or a callable that evaluates to a 1D array, while the keys of the dictionary should exist in `df`.
User should ensure that the pandas object is unique and/or sorted - no checks are done to ensure uniqueness and/or sortedness.
If `by` is present, the DataFrame is completed per group. `by` should be a column name, or a list of column names.
Examples:
>>> import pandas as pd
>>> import janitor
>>> import numpy as np
>>> df = pd.DataFrame(
... {
... "Year": [1999, 2000, 2004, 1999, 2004],
... "Taxon": [
... "Saccharina",
... "Saccharina",
... "Saccharina",
... "Agarum",
... "Agarum",
... ],
... "Abundance": [4, 5, 2, 1, 8],
... }
... )
>>> df
Year Taxon Abundance
0 1999 Saccharina 4
1 2000 Saccharina 5
2 2004 Saccharina 2
3 1999 Agarum 1
4 2004 Agarum 8
Expose missing pairings of `Year` and `Taxon`:
>>> df.complete("Year", "Taxon", sort=True)
Year Taxon Abundance
0 1999 Agarum 1.0
1 1999 Saccharina 4.0
2 2000 Agarum NaN
3 2000 Saccharina 5.0
4 2004 Agarum 8.0
5 2004 Saccharina 2.0
Expose missing years from 1999 to 2004:
>>> index = pd.Index(range(1999,2005),name='Year')
>>> df.complete(index, "Taxon", sort=True)
Year Taxon Abundance
0 1999 Agarum 1.0
1 1999 Saccharina 4.0
2 2000 Agarum NaN
3 2000 Saccharina 5.0
4 2001 Agarum NaN
5 2001 Saccharina NaN
6 2002 Agarum NaN
7 2002 Saccharina NaN
8 2003 Agarum NaN
9 2003 Saccharina NaN
10 2004 Agarum 8.0
11 2004 Saccharina 2.0
A dictionary can be used as well:
>>> dictionary = {'Year':range(1999,2005)}
>>> df.complete(dictionary, "Taxon", sort=True)
Year Taxon Abundance
0 1999 Agarum 1.0
1 1999 Saccharina 4.0
2 2000 Agarum NaN
3 2000 Saccharina 5.0
4 2001 Agarum NaN
5 2001 Saccharina NaN
6 2002 Agarum NaN
7 2002 Saccharina NaN
8 2003 Agarum NaN
9 2003 Saccharina NaN
10 2004 Agarum 8.0
11 2004 Saccharina 2.0
Fill missing values:
>>> df = pd.DataFrame(
... dict(
... group=(1, 2, 1, 2),
... item_id=(1, 2, 2, 3),
... item_name=("a", "a", "b", "b"),
... value1=(1, np.nan, 3, 4),
... value2=range(4, 8),
... )
... )
>>> df
group item_id item_name value1 value2
0 1 1 a 1.0 4
1 2 2 a NaN 5
2 1 2 b 3.0 6
3 2 3 b 4.0 7
>>> df.complete(
... "group",
... ["item_id", "item_name"],
... fill_value={"value1": 0, "value2": 99},
... sort=True
... )
group item_id item_name value1 value2
0 1 1 a 1.0 4.0
1 1 2 a 0.0 99.0
2 1 2 b 3.0 6.0
3 1 3 b 0.0 99.0
4 2 1 a 0.0 99.0
5 2 2 a 0.0 5.0
6 2 2 b 0.0 99.0
7 2 3 b 4.0 7.0
Limit the fill to only implicit missing values by setting `explicit` to `False`:
>>> df.complete(
... "group",
... ["item_id", "item_name"],
... fill_value={"value1": 0, "value2": 99},
... explicit=False,
... sort=True
... )
group item_id item_name value1 value2
0 1 1 a 1.0 4.0
1 1 2 a 0.0 99.0
2 1 2 b 3.0 6.0
3 1 3 b 0.0 99.0
4 2 1 a 0.0 99.0
5 2 2 a NaN 5.0
6 2 2 b 0.0 99.0
7 2 3 b 4.0 7.0
Expose missing rows per group, using a callable:
>>> df = pd.DataFrame(
... {
... "state": ["CA", "CA", "HI", "HI", "HI", "NY", "NY"],
... "year": [2010, 2013, 2010, 2012, 2016, 2009, 2013],
... "value": [1, 3, 1, 2, 3, 2, 5],
... }
... )
>>> df
state year value
0 CA 2010 1
1 CA 2013 3
2 HI 2010 1
3 HI 2012 2
4 HI 2016 3
5 NY 2009 2
6 NY 2013 5
>>> def new_year_values(df):
... return pd.RangeIndex(start=df.year.min(), stop=df.year.max() + 1, name='year')
>>> df.complete(new_year_values, by='state',sort=True)
state year value
0 CA 2010 1.0
1 CA 2011 NaN
2 CA 2012 NaN
3 CA 2013 3.0
4 HI 2010 1.0
5 HI 2011 NaN
6 HI 2012 2.0
7 HI 2013 NaN
8 HI 2014 NaN
9 HI 2015 NaN
10 HI 2016 3.0
11 NY 2009 2.0
12 NY 2010 NaN
13 NY 2011 NaN
14 NY 2012 NaN
15 NY 2013 5.0
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `*columns` | `Any` | This refers to the columns to be completed. It could be a column name, a list of column names, or a pandas Index, Series, or DataFrame. It can also be a callable that gets evaluated to a pandas Index, Series, or DataFrame. It can also be a dictionary, where the values are either a 1D array or a callable that evaluates to a 1D array, while the keys of the dictionary should exist in `df`. | `()` |
| `sort` | `bool` | Sort DataFrame based on *columns. | `False` |
| `by` | `str \| list` | Label or list of labels to group by. The explicit missing rows are returned per group. | `None` |
| `fill_value` | `dict \| Any` | Scalar value to use instead of NaN for missing combinations. A dictionary, mapping column names to a scalar value, is also accepted. | `None` |
| `explicit` | `bool` | Determines if only implicitly missing values should be filled (`False`), or all nulls existing in the dataframe (`True`). | `True` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with explicit missing rows, if any. |

Source code in `janitor/functions/complete.py`
concatenate_columns
concatenate_columns(df, column_names, new_column_name, sep='-', ignore_empty=True)
Concatenates the set of columns into a single column.
Used to quickly generate an index based on a group of columns.
This method mutates the original DataFrame.
Examples:
Concatenate two columns row-wise.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [1, 3, 5], "b": list("xyz")})
>>> df
a b
0 1 x
1 3 y
2 5 z
>>> df.concatenate_columns(
... column_names=["a", "b"], new_column_name="m",
... )
a b m
0 1 x 1-x
1 3 y 3-y
2 5 z 5-z
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_names` | `List[Hashable]` | A list of columns to concatenate together. | required |
| `new_column_name` | `Hashable` | The name of the new column. | required |
| `sep` | `str` | The separator between each column's data. | `'-'` |
| `ignore_empty` | `bool` | Ignore null values, if they exist. | `True` |

Raises:

| Type | Description |
|---|---|
| `JanitorError` | If at least two columns are not provided within `column_names`. |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with concatenated columns. |

Source code in `janitor/functions/concatenate_columns.py`
conditional_join
conditional_join(df, right, *conditions, how='inner', df_columns=slice(None), right_columns=slice(None), keep='all', use_numba=False, indicator=False, force=False)
The conditional_join function operates similarly to `pd.merge`, but supports joins on inequality operators, or a combination of equi and non-equi joins.
Joins solely on equality are not supported. If the join is solely on equality, the `pd.merge` function covers that; if you are interested in nearest joins, asof joins, or rolling joins, then `pd.merge_asof` covers that. There is also pandas' IntervalIndex, which is efficient for range joins, especially if the intervals do not overlap.
Column selection in `df_columns` and `right_columns` is possible using the `select` syntax.
Performance might be improved by setting `use_numba` to `True` - this can be handy for equi joins that have lots of duplicated keys. This can also be handy for non-equi joins, where there are more than two join conditions, or there is significant overlap in the range join columns. This assumes that `numba` is installed. A noticeable performance improvement can be observed for range joins if both join columns from the right dataframe are monotonically increasing.
This function returns rows, if any, where values from `df` meet the condition(s) for values from `right`. The conditions are passed in as a variable argument of tuples, where the tuple is of the form `(left_on, right_on, op)`; `left_on` is the column label from `df`, `right_on` is the column label from `right`, while `op` is the operator.
For multiple conditions, the and (`&`) operator is used to combine the results of the individual conditions.
In some scenarios there might be performance gains if the less than join, or the greater than join condition, or the range condition is executed before the equi join - pass `force=True` to force this.
The operator can be any of `==`, `!=`, `<=`, `<`, `>=`, `>`. There is no optimisation for the `!=` operator.
The join is done only on the columns.
For non-equi joins, only numeric, timedelta and date columns are supported.
`inner`, `left`, `right` and `outer` joins are supported.
If the columns from `df` and `right` have nothing in common, a single index column is returned; else, a MultiIndex column is returned.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df1 = pd.DataFrame({"value_1": [2, 5, 7, 1, 3, 4]})
>>> df2 = pd.DataFrame({"value_2A": [0, 3, 7, 12, 0, 2, 3, 1],
... "value_2B": [1, 5, 9, 15, 1, 4, 6, 3],
... })
>>> df1
value_1
0 2
1 5
2 7
3 1
4 3
5 4
>>> df2
value_2A value_2B
0 0 1
1 3 5
2 7 9
3 12 15
4 0 1
5 2 4
6 3 6
7 1 3
>>> df1.conditional_join(
... df2,
... ("value_1", "value_2A", ">"),
... ("value_1", "value_2B", "<")
... )
value_1 value_2A value_2B
0 2 1 3
1 5 3 6
2 3 2 4
3 4 3 5
4 4 3 6
Select specific columns, after the join:
>>> df1.conditional_join(
... df2,
... ("value_1", "value_2A", ">"),
... ("value_1", "value_2B", "<"),
... right_columns='value_2B',
... how='left'
... )
value_1 value_2B
0 2 3.0
1 5 6.0
2 3 4.0
3 4 5.0
4 4 6.0
5 7 NaN
6 1 NaN
Rename columns, before the join:
>>> (df1
... .rename(columns={'value_1':'left_column'})
... .conditional_join(
... df2,
... ("left_column", "value_2A", ">"),
... ("left_column", "value_2B", "<"),
... right_columns='value_2B',
... how='outer')
... )
left_column value_2B
0 2.0 3.0
1 5.0 6.0
2 3.0 4.0
3 4.0 5.0
4 4.0 6.0
5 7.0 NaN
6 1.0 NaN
7 NaN 1.0
8 NaN 9.0
9 NaN 15.0
10 NaN 1.0
Get the first match:
>>> df1.conditional_join(
... df2,
... ("value_1", "value_2A", ">"),
... ("value_1", "value_2B", "<"),
... keep='first'
... )
value_1 value_2A value_2B
0 2 1 3
1 5 3 6
2 3 2 4
3 4 3 5
Get the last match:
>>> df1.conditional_join(
... df2,
... ("value_1", "value_2A", ">"),
... ("value_1", "value_2B", "<"),
... keep='last'
... )
value_1 value_2A value_2B
0 2 1 3
1 5 3 6
2 3 2 4
3 4 3 6
Add an indicator column:
>>> df1.conditional_join(
... df2,
... ("value_1", "value_2A", ">"),
... ("value_1", "value_2B", "<"),
... how='outer',
... indicator=True
... )
value_1 value_2A value_2B _merge
0 2.0 1.0 3.0 both
1 5.0 3.0 6.0 both
2 3.0 2.0 4.0 both
3 4.0 3.0 5.0 both
4 4.0 3.0 6.0 both
5 7.0 NaN NaN left_only
6 1.0 NaN NaN left_only
7 NaN 0.0 1.0 right_only
8 NaN 7.0 9.0 right_only
9 NaN 12.0 15.0 right_only
10 NaN 0.0 1.0 right_only
Version Changed
- 0.24.0: Added `df_columns`, `right_columns`, `keep` and `use_numba` parameters.
- 0.24.1: Added `indicator` parameter.
- 0.25.0: `col` class supported. Outer join supported. `sort_by_appearance` deprecated. Numba support for equi join.
- 0.27.0: Added support for timedelta dtype.
- 0.28.0: `col` class deprecated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `right` | `Union[DataFrame, Series]` | Named Series or DataFrame to join to. | required |
| `conditions` | `Any` | Variable argument of tuple(s) of the form `(left_on, right_on, op)`, where `left_on` is the column label from `df`, `right_on` is the column label from `right`, and `op` is the operator. | `()` |
| `how` | `Literal['inner', 'left', 'right', 'outer']` | Indicates the type of join to be performed. It can be one of `inner`, `left`, `right` or `outer`. | `'inner'` |
| `df_columns` | `Optional[Any]` | Columns to select from `df` in the final output dataframe. Column selection is possible using the `select` syntax. | `slice(None)` |
| `right_columns` | `Optional[Any]` | Columns to select from `right` in the final output dataframe. Column selection is possible using the `select` syntax. | `slice(None)` |
| `use_numba` | `bool` | Use numba, if installed, to accelerate the computation. | `False` |
| `keep` | `Literal['first', 'last', 'all']` | Choose whether to return the first match, last match or all matches. | `'all'` |
| `indicator` | `Optional[Union[bool, str]]` | If `True`, adds a column to the output DataFrame called `_merge` with information on the source of each row. A different column name can be provided as a string argument. | `False` |
| `force` | `bool` | If `True`, force the non-equi join conditions to execute before the equi join. | `False` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame of the two merged Pandas objects. |

Source code in `janitor/functions/conditional_join.py`
get_join_indices(df, right, conditions, keep='all', use_numba=False, force=False, return_ragged_arrays=False)
Convenience function to return the matching indices from an inner join.
New in version 0.27.0
Version Changed
- 0.29.0: Added support for ragged array indices.
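A hedged sketch of usage, reusing `df1` and `df2` from the `conditional_join` examples above, and assuming `get_join_indices` is importable from the top-level `janitor` namespace; each pair `(left_index[i], right_index[i])` identifies one matching row pair from the inner join:
>>> import janitor as jn
>>> left_index, right_index = jn.get_join_indices(
...     df1, df2,
...     [("value_1", "value_2A", ">"), ("value_1", "value_2B", "<")],
... )
>>> # rows of df1 and df2 that matched, aligned pairwise;
>>> # with a default RangeIndex, label and positional indexing coincide
>>> matched = df1.loc[left_index].reset_index(drop=True).join(
...     df2.loc[right_index].reset_index(drop=True)
... )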
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `right` | `Union[DataFrame, Series]` | Named Series or DataFrame to join to. | required |
| `conditions` | `list[tuple[str]]` | List of tuple(s) of the form `(left_on, right_on, op)`, where `left_on` is the column label from `df`, `right_on` is the column label from `right`, and `op` is the operator. | required |
| `use_numba` | `bool` | Use numba, if installed, to accelerate the computation. | `False` |
| `keep` | `Literal['first', 'last', 'all']` | Choose whether to return the first match, last match or all matches. | `'all'` |
| `force` | `bool` | If `True`, force the non-equi join conditions to execute before the equi join. | `False` |
| `return_ragged_arrays` | `bool` | If `True`, return the matching indices as ragged arrays where possible. | `False` |

Returns:

| Type | Description |
|---|---|
| `tuple[ndarray, ndarray]` | A tuple of indices for the rows in the dataframes that match. |

Source code in `janitor/functions/conditional_join.py`
convert_date
convert_excel_date(df, column_names)
Convert Excel's serial date format into Python datetime format.
This method does not mutate the original DataFrame.
Implementation is based on Stack Overflow.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [39690, 39690, 37118]})
>>> df
date
0 39690
1 39690
2 37118
>>> df.convert_excel_date('date')
date
0 2008-08-30
1 2008-08-30
2 2001-08-15
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_names` | `Union[Hashable, list]` | A column name, or a list of column names. | required |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with corrected dates. |

Source code in `janitor/functions/convert_date.py`
convert_matlab_date(df, column_names)
Convert Matlab's serial date number into Python datetime format.
Implementation is based on Stack Overflow.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [737125.0, 737124.815863, 737124.4985, 737124]})
>>> df
date
0 737125.000000
1 737124.815863
2 737124.498500
3 737124.000000
>>> df.convert_matlab_date('date')
date
0 2018-03-06 00:00:00.000000000
1 2018-03-05 19:34:50.563199671
2 2018-03-05 11:57:50.399998876
3 2018-03-05 00:00:00.000000000
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_names` | `Union[Hashable, list]` | A column name, or a list of column names. | required |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with corrected dates. |

Source code in `janitor/functions/convert_date.py`
convert_unix_date(df, column_name)
Convert unix epoch time into Python datetime format.
Note that this ignores the local time zone and converts all timestamps to naive datetime based on UTC!
This method mutates the original DataFrame.
Note
This function will be deprecated in a 1.x release. Please use `pd.to_datetime` instead.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})
>>> df
date
0 1651510462
1 53394822
2 1126233195
>>> df.convert_unix_date('date')
date
0 2022-05-02 16:54:22
1 1971-09-10 23:53:42
2 2005-09-09 02:33:15
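As the note above recommends, the plain-pandas equivalent uses `pd.to_datetime` with `unit="s"`:
>>> df = pd.DataFrame({"date": [1651510462, 53394822, 1126233195]})
>>> df.assign(date=pd.to_datetime(df["date"], unit="s"))
date
0 2022-05-02 16:54:22
1 1971-09-10 23:53:42
2 2005-09-09 02:33:15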
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `Hashable` | A column name. | required |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with corrected dates. |

Source code in `janitor/functions/convert_date.py`
count_cumulative_unique
Implementation of count_cumulative_unique.
count_cumulative_unique(df, column_name, dest_column_name, case_sensitive=True)
Generates a running total of cumulative unique values in a given column.
A new column will be created containing a running count of unique values in the specified column. If `case_sensitive` is `True`, then the case of any letters will matter (i.e., `a != A`); otherwise, the case of any letters will not matter.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "letters": list("aabABb"),
... "numbers": range(4, 10),
... })
>>> df
letters numbers
0 a 4
1 a 5
2 b 6
3 A 7
4 B 8
5 b 9
>>> df.count_cumulative_unique(
... column_name="letters",
... dest_column_name="letters_unique_count",
... )
letters numbers letters_unique_count
0 a 4 1
1 a 5 1
2 b 6 2
3 A 7 3
4 B 8 4
5 b 9 4
Cumulative counts, ignoring casing.
>>> df.count_cumulative_unique(
... column_name="letters",
... dest_column_name="letters_unique_count",
... case_sensitive=False,
... )
letters numbers letters_unique_count
0 a 4 1
1 a 5 1
2 b 6 2
3 A 7 2
4 B 8 2
5 b 9 2
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `Hashable` | Name of the column containing values from which a running count of unique values will be created. | required |
| `dest_column_name` | `str` | The name of the new column containing the cumulative count of unique values that will be created. | required |
| `case_sensitive` | `bool` | Whether or not uppercase and lowercase letters will be considered equal. Only valid with string-like columns. | `True` |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If … |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with a new column containing a cumulative count of unique values from another column. |

Source code in `janitor/functions/count_cumulative_unique.py`
currency_column_to_numeric
currency_column_to_numeric(df, column_name, cleaning_style=None, cast_non_numeric=None, fill_all_non_numeric=None, remove_non_numeric=False)
Convert currency column to numeric.
This method does not mutate the original DataFrame.
This method allows one to take a column containing currency values,
inadvertently imported as a string, and cast it as a float. This is
usually the case when reading CSV files that were modified in Excel.
Empty strings (i.e. `''`) are retained as `NaN` values.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a_col": [" 24.56", "-", "(12.12)", "1,000,000"],
... "d_col": ["", "foo", "1.23 dollars", "-1,000 yen"],
... })
>>> df
a_col d_col
0 24.56
1 - foo
2 (12.12) 1.23 dollars
3 1,000,000 -1,000 yen
The default cleaning style.
>>> df.currency_column_to_numeric("d_col")
a_col d_col
0 24.56 NaN
1 - NaN
2 (12.12) 1.23
3 1,000,000 -1000.00
The accounting cleaning style.
>>> df.currency_column_to_numeric("a_col", cleaning_style="accounting")
a_col d_col
0 24.56
1 0.00 foo
2 -12.12 1.23 dollars
3 1000000.00 -1,000 yen
Valid cleaning styles are:
- `None`: Default cleaning is applied. Empty strings are always retained as `NaN`. Numbers, `-`, `.` are extracted and the resulting string is cast to a float.
- `'accounting'`: Replaces numbers in parentheses with negatives, removes commas.
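A sketch of the coercion parameters described below; the `{"foo": 999}` mapping is purely illustrative:
>>> cleaned = df.currency_column_to_numeric(
...     "d_col", cast_non_numeric={"foo": 999},
... )
>>> # per the parameter description: 'foo' is coerced to 999,
>>> # '1.23 dollars' to 1.23, and '-1,000 yen' to -1000.0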
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The pandas DataFrame. | required |
| `column_name` | `str` | The column containing currency values to modify. | required |
| `cleaning_style` | `Optional[str]` | What style of cleaning to perform. | `None` |
| `cast_non_numeric` | `Optional[dict]` | A dict of how to coerce certain strings to numeric type. For example, if there are values of 'REORDER' in the DataFrame, `{'REORDER': 0}` will cast all instances of 'REORDER' to 0. | `None` |
| `fill_all_non_numeric` | `Optional[Union[float, int]]` | Similar to `cast_non_numeric`, but fills all non-numeric strings with the same given value. | `None` |
| `remove_non_numeric` | `bool` | If set to True, rows of `df` that contain non-numeric values in the `column_name` column will be removed. | `False` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `cleaning_style` is not one of the accepted styles. |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame. |

Source code in `janitor/functions/currency_column_to_numeric.py`
deconcatenate_column
Implementation of deconcatenating columns.
deconcatenate_column(df, column_name, sep=None, new_column_names=None, autoname=None, preserve_position=False)
De-concatenates a single column into multiple columns.
The column to de-concatenate can be either a collection (list, tuple, ...) which can be separated out with `pd.Series.tolist()`, or a string to slice based on `sep`.
To determine this behaviour automatically, the first element in the column specified is inspected. If it is a string, then `sep` must be specified. Else, the function assumes that it is an iterable type (e.g. `list` or `tuple`), and will attempt to deconcatenate by splitting the list.
Given a column with string values, this is the inverse of the `concatenate_columns` function.
Used to quickly split columns out of a single column.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"m": ["1-x", "2-y", "3-z"]})
>>> df
m
0 1-x
1 2-y
2 3-z
>>> df.deconcatenate_column("m", sep="-", autoname="col")
m col1 col2
0 1-x 1 x
1 2-y 2 y
2 3-z 3 z
The keyword argument `preserve_position` takes a `True` or `False` boolean that controls whether the `new_column_names` will take the original position of the to-be-deconcatenated `column_name`:
- When `preserve_position=False` (default), `df.columns` change from `[..., column_name, ...]` to `[..., column_name, ..., new_column_names]`. In other words, the deconcatenated new columns are appended to the right of the original dataframe and the original `column_name` is NOT dropped.
- When `preserve_position=True`, `df.columns` change from `[..., column_name, ...]` to `[..., new_column_names, ...]`. In other words, the deconcatenated new columns will REPLACE the original `column_name` at its original position, and `column_name` itself is dropped.

The keyword argument `autoname` accepts a base string and then automatically creates numbered column names based off the base string. For example, if `col` is passed in as the argument to `autoname`, and 4 columns are created, then the resulting columns will be named `col1, col2, col3, col4`. Numbering is always 1-indexed, not 0-indexed, in order to make the column names human-friendly.
This method does not mutate the original DataFrame.
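A sketch combining `new_column_names` with `preserve_position`; per the behaviour described above, the new columns replace `m` at its original position:
>>> df.deconcatenate_column(
...     "m", sep="-", new_column_names=["num", "ltr"], preserve_position=True,
... )
num ltr
0 1 x
1 2 y
2 3 z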
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `Hashable` | The column to split. | required |
| `sep` | `Optional[str]` | The separator delimiting the column's data. | `None` |
| `new_column_names` | `Optional[Union[List[str], Tuple[str]]]` | A list of new column names post-splitting. | `None` |
| `autoname` | `str` | A base name for automatically naming the new columns. Takes precedence over `new_column_names`. | `None` |
| `preserve_position` | `bool` | Boolean for whether or not to preserve original position of the column upon de-concatenation. | `False` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `column_name` is not present in the DataFrame. |
| `ValueError` | If `sep` is not provided and the column values are of type `str`. |
| `ValueError` | If either `new_column_names` or `autoname` is not supplied. |
| `JanitorError` | If incorrect number of names is provided within `new_column_names`. |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with a deconcatenated column. |

Source code in `janitor/functions/deconcatenate_column.py`
drop_constant_columns
Implementation of drop_constant_columns.
drop_constant_columns(df)
Finds and drops the constant columns from a Pandas DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> data_dict = {
... "a": [1, 1, 1],
... "b": [1, 2, 3],
... "c": [1, 1, 1],
... "d": ["rabbit", "leopard", "lion"],
... "e": ["Cambridge", "Shanghai", "Basel"]
... }
>>> df = pd.DataFrame(data_dict)
>>> df
a b c d e
0 1 1 1 rabbit Cambridge
1 1 2 1 leopard Shanghai
2 1 3 1 lion Basel
>>> df.drop_constant_columns()
b d e
0 1 rabbit Cambridge
1 2 leopard Shanghai
2 3 lion Basel
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input Pandas DataFrame. | required |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | The Pandas DataFrame with the constant columns dropped. |

Source code in `janitor/functions/drop_constant_columns.py`
drop_duplicate_columns
Implementation for `drop_duplicate_columns`.
drop_duplicate_columns(df, column_name, nth_index=0)
Remove a duplicated column specified by `column_name`.
Specifying `nth_index=0` will remove the first column, `nth_index=1` will remove the second column, and so on and so forth.
The corresponding tidyverse R code is: `select(-<column_name>_<nth_index + 1>)`
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a": range(2, 5),
... "b": range(3, 6),
... "A": range(4, 7),
... "a*": range(6, 9),
... }).clean_names(remove_special=True)
>>> df
a b a a
0 2 3 4 6
1 3 4 5 7
2 4 5 6 8
>>> df.drop_duplicate_columns(column_name="a", nth_index=1)
a b a
0 2 3 6
1 3 4 7
2 4 5 8
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `Hashable` | Name of duplicated columns. | required |
| `nth_index` | `int` | Among the duplicated columns, select the nth column to drop. | `0` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame. |

Source code in `janitor/functions/drop_duplicate_columns.py`
dropnotnull
Implementation source for `dropnotnull`.
dropnotnull(df, column_name)
Drop rows that do not have null values in the given column.
This method does not mutate the original DataFrame.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [1., np.NaN, 3.], "b": [None, "y", "z"]})
>>> df
a b
0 1.0 None
1 NaN y
2 3.0 z
>>> df.dropnotnull("a")
a b
1 NaN y
>>> df.dropnotnull("b")
a b
0 1.0 None
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | A pandas DataFrame. | required |
| `column_name` | `Hashable` | The column name to drop rows from. | required |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with dropped rows. |

Source code in `janitor/functions/dropnotnull.py`
encode_categorical
encode_categorical(df, column_names=None, **kwargs)
Encode the specified columns with Pandas' category dtype.
It is syntactic sugar around `pd.Categorical`.
This method does not mutate the original DataFrame.
Simply pass a string, or a sequence of column names to `column_names`; alternatively, you can pass kwargs, where the keys are the column names and the values can either be None, `sort`, `appearance` or a 1-D array-like object.
- None: column is cast to an unordered categorical.
- `sort`: column is cast to an ordered categorical, with the order defined by the sort-order of the categories.
- `appearance`: column is cast to an ordered categorical, with the order defined by the order of appearance in the original column.
- 1d-array-like object: column is cast to an ordered categorical, with the categories and order as specified in the input array.

`column_names` and `kwargs` parameters cannot be used at the same time.
Examples:
Using `column_names`:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "foo": ["b", "b", "a", "c", "b"],
... "bar": range(4, 9),
... })
>>> df
foo bar
0 b 4
1 b 5
2 a 6
3 c 7
4 b 8
>>> df.dtypes
foo object
bar int64
dtype: object
>>> enc_df = df.encode_categorical(column_names="foo")
>>> enc_df.dtypes
foo category
bar int64
dtype: object
>>> enc_df["foo"].cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> enc_df["foo"].cat.ordered
False
Using `kwargs` to specify an ordered categorical.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "foo": ["b", "b", "a", "c", "b"],
... "bar": range(4, 9),
... })
>>> df.dtypes
foo object
bar int64
dtype: object
>>> enc_df = df.encode_categorical(foo="appearance")
>>> enc_df.dtypes
foo category
bar int64
dtype: object
>>> enc_df["foo"].cat.categories
Index(['b', 'a', 'c'], dtype='object')
>>> enc_df["foo"].cat.ordered
True
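A sketch of the remaining kwargs option, passing an explicit 1-D array to fix both the categories and their order, per the bullet list above:
>>> enc_df = df.encode_categorical(foo=["c", "b", "a"])
>>> enc_df["foo"].cat.categories
Index(['c', 'b', 'a'], dtype='object')
>>> enc_df["foo"].cat.ordered
True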
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame object. | required |
column_names | Union[str, Iterable[str], Hashable] | A column name or an iterable (list or tuple) of column names. | None |
**kwargs | Any | A mapping from column name to either None, 'sort', 'appearance', or a 1-D array-like object. | {} |
Raises:
Type | Description |
---|---|
ValueError | If both column_names and kwargs are provided. |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame. |
Source code in janitor/functions/encode_categorical.py
expand_column
Implementation for expand_column.
expand_column(df, column_name, sep='|', concat=True)
Expand a categorical column with multiple labels into dummy-coded columns.
Super sugary syntax that wraps pandas.Series.str.get_dummies.
This method does not mutate the original DataFrame.
Examples:
Functional usage syntax:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {
... "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],
... "col2": [1, 2, 3, 4],
... }
... )
>>> df = janitor.expand_column(
... df,
... column_name="col1",
... sep=", " # note space in sep
... )
>>> df
col1 col2 A B C D E F
0 A, B 1 1 1 0 0 0 0
1 B, C, D 2 0 1 1 1 0 0
2 E, F 3 0 0 0 0 1 1
3 A, E, F 4 1 0 0 0 1 1
Method chaining syntax:
>>> import pandas as pd
>>> import janitor
>>> df = (
... pd.DataFrame(
... {
... "col1": ["A, B", "B, C, D", "E, F", "A, E, F"],
... "col2": [1, 2, 3, 4],
... }
... )
... .expand_column(
... column_name='col1',
... sep=', '
... )
... )
>>> df
col1 col2 A B C D E F
0 A, B 1 1 1 0 0 0 0
1 B, C, D 2 0 1 1 1 0 0
2 E, F 3 0 0 0 0 1 1
3 A, E, F 4 1 0 0 0 1 1
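If only the dummy-coded columns are needed, concat=False returns them without the original dataframe. A minimal sketch (hypothetical data; output is indicative):
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": ["A, B", "B, C"], "col2": [1, 2]})
>>> df.expand_column(column_name="col1", sep=", ", concat=False)
A B C
0 1 1 0
1 0 1 1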
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
column_name | Hashable | Which column to expand. | required |
sep | str | The delimiter, same as pandas.Series.str.get_dummies's sep. | '\|' |
concat | bool | Whether to return the expanded column concatenated to the original dataframe (True), or to return it standalone (False). | True |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with an expanded column. |
Source code in janitor/functions/expand_column.py
expand_grid
Implementation source for expand_grid.
cartesian_product(*inputs, sort=False)
Creates a DataFrame from a cartesian combination of all inputs.
Inspiration is from tidyr's expand_grid() function.
The input argument should be a pandas Index/Series/DataFrame, or a dictionary - the values of the dictionary should be a 1D array.
Examples:
>>> import pandas as pd
>>> import janitor as jn
>>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
>>> data = pd.Series([1, 2, 3], name='z')
>>> jn.cartesian_product(df, data)
x y z
0 1 2 1
1 1 2 2
2 1 2 3
3 2 1 1
4 2 1 2
5 2 1 3
cartesian_product also works with non-pandas objects:
>>> data = {"x": [1, 2, 3], "y": [1, 2]}
>>> jn.cartesian_product(data)
x y
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 3 2
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs | tuple | Variable arguments. The arguments should be a pandas Index/Series/DataFrame, or a dictionary, where the values in the dictionary are 1D arrays. | () |
sort | bool | If True, sort the output DataFrame. | False |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame. |
Source code in janitor/functions/expand_grid.py
expand(df, *columns, sort=False, by=None)
Creates a DataFrame from a cartesian combination of all inputs.
Inspiration is from tidyr's expand() function.
expand() is often useful with pd.merge to convert implicit missing values to explicit missing values - similar to complete.
It can also be used to figure out which combinations are missing (e.g., identify gaps in your DataFrame).
The variable columns parameter can be a column name, a list of column names, a pandas Index/Series/DataFrame, or a callable, which when applied to the DataFrame, evaluates to a pandas Index/Series/DataFrame.
A dictionary can also be passed to the variable columns parameter - the values of the dictionary should be either a 1D array or a callable that evaluates to a 1D array. The array should be unique; no check is done to verify this.
If by is present, the DataFrame is expanded per group. by should be a column name, or a list of column names.
Examples:
>>> import pandas as pd
>>> import janitor
>>> data = [{'type': 'apple', 'year': 2010, 'size': 'XS'},
... {'type': 'orange', 'year': 2010, 'size': 'S'},
... {'type': 'apple', 'year': 2012, 'size': 'M'},
... {'type': 'orange', 'year': 2010, 'size': 'S'},
... {'type': 'orange', 'year': 2011, 'size': 'S'},
... {'type': 'orange', 'year': 2012, 'size': 'M'}]
>>> df = pd.DataFrame(data)
>>> df
type year size
0 apple 2010 XS
1 orange 2010 S
2 apple 2012 M
3 orange 2010 S
4 orange 2011 S
5 orange 2012 M
Get unique observations:
>>> df.expand('type')
type
0 apple
1 orange
>>> df.expand('size')
size
0 XS
1 S
2 M
>>> df.expand('type', 'size')
type size
0 apple XS
1 apple S
2 apple M
3 orange XS
4 orange S
5 orange M
>>> df.expand('type','size','year')
type size year
0 apple XS 2010
1 apple XS 2012
2 apple XS 2011
3 apple S 2010
4 apple S 2012
5 apple S 2011
6 apple M 2010
7 apple M 2012
8 apple M 2011
9 orange XS 2010
10 orange XS 2012
11 orange XS 2011
12 orange S 2010
13 orange S 2012
14 orange S 2011
15 orange M 2010
16 orange M 2012
17 orange M 2011
Get observations that only occur in the data:
>>> df.expand(['type','size'])
type size
0 apple XS
1 orange S
2 apple M
3 orange M
>>> df.expand(['type','size','year'])
type size year
0 apple XS 2010
1 orange S 2010
2 apple M 2012
3 orange S 2011
4 orange M 2012
Expand the DataFrame to include new observations:
>>> df.expand('type','size',{'new_year':range(2010,2014)})
type size new_year
0 apple XS 2010
1 apple XS 2011
2 apple XS 2012
3 apple XS 2013
4 apple S 2010
5 apple S 2011
6 apple S 2012
7 apple S 2013
8 apple M 2010
9 apple M 2011
10 apple M 2012
11 apple M 2013
12 orange XS 2010
13 orange XS 2011
14 orange XS 2012
15 orange XS 2013
16 orange S 2010
17 orange S 2011
18 orange S 2012
19 orange S 2013
20 orange M 2010
21 orange M 2011
22 orange M 2012
23 orange M 2013
Filter for missing observations:
>>> combo = df.expand('type','size','year')
>>> anti_join = df.merge(combo, how='right', indicator=True)
>>> anti_join.query("_merge=='right_only'").drop(columns="_merge")
type year size
1 apple 2012 XS
2 apple 2011 XS
3 apple 2010 S
4 apple 2012 S
5 apple 2011 S
6 apple 2010 M
8 apple 2011 M
9 orange 2010 XS
10 orange 2012 XS
11 orange 2011 XS
14 orange 2012 S
16 orange 2010 M
18 orange 2011 M
Expand within each group, using by:
>>> df.expand('year','size',by='type')
year size
type
apple 2010 XS
apple 2010 M
apple 2012 XS
apple 2012 M
orange 2010 S
orange 2010 M
orange 2011 S
orange 2011 M
orange 2012 S
orange 2012 M
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
columns | tuple | Specification of columns to expand. It could be column labels, a list/tuple of column labels, or a pandas Index/Series/DataFrame. It can also be a callable; the callable will be applied to the entire DataFrame. The callable should return a pandas Series/Index/DataFrame. It can also be a dictionary, where the values are either a 1D array or a callable that evaluates to a 1D array. The array should be unique; no check is done to verify this. | () |
sort | bool | If True, sort the DataFrame. | False |
by | str \| list | Label or list of labels to group by. | None |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame. |
Source code in janitor/functions/expand_grid.py
expand_grid(df=None, df_key=None, *, others=None)
Creates a DataFrame from a cartesian combination of all inputs.
Note
This function will be deprecated in a 1.x release; use cartesian_product instead.
It is not restricted to a pandas DataFrame; it can work with any list-like structure that is 1 or 2 dimensional.
If method-chaining to a DataFrame, a string argument to the df_key parameter must be provided.
Data types are preserved in this function, including pandas' extension array dtypes.
The output will always be a DataFrame, usually with a MultiIndex column, with the keys of the others dictionary serving as the top-level columns.
If a pandas Series/DataFrame is passed, and has a labeled index, or a MultiIndex index, the index is discarded; the final DataFrame will have a RangeIndex.
The MultiIndexed DataFrame can be flattened using pyjanitor's collapse_levels method; the user can also decide to drop any of the levels, via pandas' droplevel method.
Examples:
>>> import pandas as pd
>>> import janitor as jn
>>> df = pd.DataFrame({"x": [1, 2], "y": [2, 1]})
>>> data = {"z": [1, 2, 3]}
>>> df.expand_grid(df_key="df", others=data)
df z
x y 0
0 1 2 1
1 1 2 2
2 1 2 3
3 2 1 1
4 2 1 2
5 2 1 3
expand_grid works with non-pandas objects:
>>> data = {"x": [1, 2, 3], "y": [1, 2]}
>>> jn.expand_grid(others=data)
x y
0 0
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 3 2
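As noted above, the MultiIndex columns can be flattened with collapse_levels. A minimal sketch (reusing the df dataframe defined above; output is indicative):
>>> df.expand_grid(df_key="df", others={"z": [1, 2, 3]}).collapse_levels()
df_x df_y z_0
0 1 2 1
1 1 2 2
2 1 2 3
3 2 1 1
4 2 1 2
5 2 1 3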
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | Optional[DataFrame] | A pandas DataFrame. | None |
df_key | Optional[str] | Name of key for the dataframe. It becomes part of the column names of the dataframe. | None |
others | Optional[dict] | A dictionary that contains the data to be combined with the dataframe. If no dataframe exists, all inputs in others will be combined to create a DataFrame. | None |
Raises:
Type | Description |
---|---|
KeyError | If there is a DataFrame and df_key is not provided. |
Returns:
Type | Description |
---|---|
Union[DataFrame, None] | A pandas DataFrame of the cartesian product. If others is not provided, None is returned. |
Source code in janitor/functions/expand_grid.py
explode_index
Implementation of the explode_index function.
explode_index(df, names_sep=None, names_pattern=None, axis='columns', level_names=None)
Explode a single index DataFrame into a MultiIndex DataFrame.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {'max_speed_mean': [267.3333333333333, 50.5],
... 'max_speed_median': [389.0, 50.5]})
>>> df
max_speed_mean max_speed_median
0 267.333333 389.0
1 50.500000 50.5
>>> df.explode_index(names_sep='_',axis='columns')
max
speed
mean median
0 267.333333 389.0
1 50.500000 50.5
>>> df.explode_index(names_pattern=r"(.+speed)_(.+)",axis='columns')
max_speed
mean median
0 267.333333 389.0
1 50.500000 50.5
>>> df.explode_index(
... names_pattern=r"(?P<measurement>.+speed)_(?P<aggregation>.+)",
... axis='columns'
... )
measurement max_speed
aggregation mean median
0 267.333333 389.0
1 50.500000 50.5
>>> df.explode_index(
... names_sep='_',
... axis='columns',
... level_names = ['min or max', 'measurement','aggregation']
... )
min or max max
measurement speed
aggregation mean median
0 267.333333 389.0
1 50.500000 50.5
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
names_sep | Union[str, None] | String or compiled regex used to split the column/index into levels. | None |
names_pattern | Union[str, None] | Regex to extract new levels from the column/index. | None |
axis | str | Either 'index' or 'columns'; determines which axis to explode. | 'columns' |
level_names | list | Names of the levels in the MultiIndex. | None |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with a MultiIndex. |
Source code in janitor/functions/explode_index.py
factorize_columns
Implementation of the factorize_columns function.
factorize_columns(df, column_names, suffix='_enc', **kwargs)
Converts labels into numerical data.
This method will create a new column with the string _enc appended after the original column's name. This can be overridden with the suffix parameter.
Internally, this method uses pandas' factorize method. It takes in an optional suffix and keyword arguments as well. An empty string as suffix will override the existing column, as shown in the sketch after the example below.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "foo": ["b", "b", "a", "c", "b"],
... "bar": range(4, 9),
... })
>>> df
foo bar
0 b 4
1 b 5
2 a 6
3 c 7
4 b 8
>>> df.factorize_columns(column_names="foo")
foo bar foo_enc
0 b 4 0
1 b 5 0
2 a 6 1
3 c 7 2
4 b 8 0
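A sketch of the empty-suffix behavior, which overrides the original column in the returned frame (output is indicative):
>>> df.factorize_columns(column_names="foo", suffix="")
foo bar
0 0 4
1 0 5
2 1 6
3 2 7
4 0 8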
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The pandas DataFrame object. | required |
column_names | Union[str, Iterable[str], Hashable] | A column name or an iterable (list or tuple) of column names. | required |
suffix | str | Suffix to be used for the new column. An empty string suffix means it will override the existing column. | '_enc' |
**kwargs | Any | Keyword arguments. It takes any of the keyword arguments that the pandas factorize method takes, such as sort and size_hint. | {} |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame. |
Source code in janitor/functions/factorize_columns.py
fill
fill_direction(df, **kwargs)
Provide a method-chainable function for filling missing values in selected columns.
It is a wrapper for pd.Series.ffill and pd.Series.bfill, and pairs the column name with one of up, down, updown, and downup.
Note
This function will be deprecated in a 1.x release. Please use pd.DataFrame.assign instead.
Examples:
>>> import pandas as pd
>>> import janitor as jn
>>> df = pd.DataFrame(
... {
... 'col1': [1, 2, 3, 4],
... 'col2': [None, 5, 6, 7],
... 'col3': [8, 9, 10, None],
... 'col4': [None, None, 11, None],
... 'col5': [None, 12, 13, None]
... }
... )
>>> df
col1 col2 col3 col4 col5
0 1 NaN 8.0 NaN NaN
1 2 5.0 9.0 NaN 12.0
2 3 6.0 10.0 11.0 13.0
3 4 7.0 NaN NaN NaN
>>> df.fill_direction(
... col2 = 'up',
... col3 = 'down',
... col4 = 'downup',
... col5 = 'updown'
... )
col1 col2 col3 col4 col5
0 1 5.0 8.0 11.0 12.0
1 2 5.0 9.0 11.0 12.0
2 3 6.0 10.0 11.0 13.0
3 4 7.0 10.0 11.0 13.0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
**kwargs | Any | Key-value pairs of columns and directions. Directions can be either down, up, updown (fill up then down) or downup (fill down then up). | {} |
Raises:
Type | Description |
---|---|
ValueError | If the direction supplied is not one of down, up, updown, or downup. |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with modified column(s). |
Source code in janitor/functions/fill.py
fill_empty(df, column_names, value)
Fill NaN values in specified columns with a given value.
Super sugary syntax that wraps pandas.DataFrame.fillna.
This method mutates the original DataFrame.
Note
This function will be deprecated in a 1.x release. Please use jn.impute instead.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {
... 'col1': [1, 2, 3],
... 'col2': [None, 4, None ],
... 'col3': [None, 5, 6]
... }
... )
>>> df
col1 col2 col3
0 1 NaN NaN
1 2 4.0 5.0
2 3 NaN 6.0
>>> df.fill_empty(column_names = 'col2', value = 0)
col1 col2 col3
0 1 0.0 NaN
1 2 4.0 5.0
2 3 0.0 6.0
>>> df.fill_empty(column_names = ['col2', 'col3'], value = 0)
col1 col2 col3
0 1 0.0 0.0
1 2 4.0 5.0
2 3 0.0 6.0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
column_names | Union[str, Iterable[str], Hashable] | A column name or an iterable (list or tuple) of column names. If a single column name is passed in, then only that column will be filled; if a list or tuple is passed in, then those columns will all be filled with the same value. | required |
value | Any | The value that replaces the NaN values. | required |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with NaN values filled. |
Source code in janitor/functions/fill.py
filter
filter_column_isin(df, column_name, iterable, complement=False)
Filter a dataframe for values in a column that exist in the given iterable.
This method does not mutate the original DataFrame.
Assumes exact matching; fuzzy matching not implemented.
Examples:
Filter the dataframe to retain rows for which names are exactly James or John.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"names": ["Jane", "Jeremy", "John"], "foo": list("xyz")})
>>> df
names foo
0 Jane x
1 Jeremy y
2 John z
>>> df.filter_column_isin(column_name="names", iterable=["James", "John"])
names foo
2 John z
This is the method-chaining alternative to:
df = df[df["names"].isin(["James", "John"])]
If complement=True, then we will only get rows for which the names are neither James nor John, as shown in the sketch below.
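A minimal sketch of the complement (reusing the dataframe above; output is indicative):
>>> df.filter_column_isin(column_name="names", iterable=["James", "John"], complement=True)
names foo
0 Jane x
1 Jeremy y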
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
column_name | Hashable | The column on which to filter. | required |
iterable | Iterable | An iterable. Could be a list, tuple, or another pandas Series. | required |
complement | bool | Whether to return the complement of the selection or not. | False |
Raises:
Type | Description |
---|---|
ValueError | If iterable does not have a length of 1 or greater. |
Returns:
Type | Description |
---|---|
DataFrame | A filtered pandas DataFrame. |
Source code in janitor/functions/filter.py
filter_date(df, column_name, start_date=None, end_date=None, years=None, months=None, days=None, column_date_options=None, format=None)
Filter a date-based column based on certain criteria.
This method does not mutate the original DataFrame.
Dates may be finicky and this function builds on top of the magic from
the pandas to_datetime
function that is able to parse dates well.
Additional options to parse the date type of your column may be found at the official pandas documentation.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a": range(5, 9),
... "dt": ["2021-11-12", "2021-12-15", "2022-01-03", "2022-01-09"],
... })
>>> df
a dt
0 5 2021-11-12
1 6 2021-12-15
2 7 2022-01-03
3 8 2022-01-09
>>> df.filter_date("dt", start_date="2021-12-01", end_date="2022-01-05")
a dt
1 6 2021-12-15
2 7 2022-01-03
>>> df.filter_date("dt", years=[2021], months=[12])
a dt
1 6 2021-12-15
Note
This method will cast your column to a Timestamp!
Note
The format parameter only affects the parsing of the start_date and end_date arguments. If there's an issue with the format of the DataFrame column being parsed, you would pass {'format': your_format} to column_date_options.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The dataframe to filter on. | required |
column_name | Hashable | The date-based column to filter on. | required |
start_date | Optional[date] | The beginning date to use to filter the DataFrame. | None |
end_date | Optional[date] | The end date to use to filter the DataFrame. | None |
years | Optional[List] | The years to use to filter the DataFrame. | None |
months | Optional[List] | The months to use to filter the DataFrame. | None |
days | Optional[List] | The days to use to filter the DataFrame. | None |
column_date_options | Optional[Dict] | Special options to use when parsing the date column in the original DataFrame. The options may be found at the official pandas documentation. | None |
format | Optional[str] | If you're using a format for start_date or end_date that is not recognized natively by pandas' to_datetime function, it may be supplied here. | None |
Returns:
Type | Description |
---|---|
DataFrame | A filtered pandas DataFrame. |
Source code in janitor/functions/filter.py
filter_on(df, criteria, complement=False)
Return a dataframe filtered on a particular criteria.
This method does not mutate the original DataFrame.
This is super-sugary syntax that wraps the pandas .query() API, enabling users to use strings to quickly specify filters for filtering their dataframe. The intent is that filter_on as a verb better matches the intent of a pandas user than the verb query.
This is intended to be the method-chaining equivalent of the following:
df = df[df["score"] < 3]
Note
This function will be deprecated in a 1.x release. Please use pd.DataFrame.query instead.
Examples:
Filter students who failed an exam (scored less than 50).
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "student_id": ["S1", "S2", "S3"],
... "score": [40, 60, 85],
... })
>>> df
student_id score
0 S1 40
1 S2 60
2 S3 85
>>> df.filter_on("score < 50", complement=False)
student_id score
0 S1 40
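Setting complement=True retains the rows that fail the criteria instead. A minimal sketch (reusing the dataframe above; output is indicative):
>>> df.filter_on("score < 50", complement=True)
student_id score
1 S2 60
2 S3 85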
Credit to Brant Peterson for the name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
criteria | str | A filtering criteria that returns an array or Series of booleans, on which pandas can filter. | required |
complement | bool | Whether to return the complement of the filter or not. If set to True, then the rows for which the criteria is False are retained instead. | False |
Returns:
Type | Description |
---|---|
DataFrame | A filtered pandas DataFrame. |
Source code in janitor/functions/filter.py
filter_string(df, column_name, search_string, complement=False, case=True, flags=0, na=None, regex=True)
Filter a string-based column according to whether it contains a substring.
This is super sugary syntax that builds on top of pandas.Series.str.contains.
It is meant to be the method-chaining equivalent of the following:
df = df[df[column_name].str.contains(search_string)]
This method does not mutate the original DataFrame.
Examples:
Retain rows whose column values contain a particular substring.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": range(3, 6), "b": ["bear", "peeL", "sail"]})
>>> df
a b
0 3 bear
1 4 peeL
2 5 sail
>>> df.filter_string(column_name="b", search_string="ee")
a b
1 4 peeL
>>> df.filter_string(column_name="b", search_string="L", case=False)
a b
1 4 peeL
2 5 sail
Filter for names that do not contain '.' (with regex mode disabled).
>>> import pandas as pd
>>> import janitor
>>> df = pd.Series(["JoseChen", "Brian.Salvi"], name="Name").to_frame()
>>> df
Name
0 JoseChen
1 Brian.Salvi
>>> df.filter_string(column_name="Name", search_string=".", regex=False, complement=True)
Name
0 JoseChen
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
column_name | Hashable | The column to filter. The column should contain strings. | required |
search_string | str | A regex pattern or a (sub-)string to search. | required |
complement | bool | Whether to return the complement of the filter or not. If set to True, then the rows for which the string search fails are retained instead. | False |
case | bool | If True, case sensitive. | True |
flags | int | Flags to pass through to the re module, e.g. re.IGNORECASE. | 0 |
na | Any | Fill value for missing values. The default depends on the dtype of the array. For object-dtype, numpy.nan is used; for the nullable StringDtype, pandas.NA is used. | None |
regex | bool | If True, assumes search_string is a regular expression; if False, treats search_string as a literal string. | True |
Returns:
Type | Description |
---|---|
DataFrame | A filtered pandas DataFrame. |
Source code in janitor/functions/filter.py
find_replace
Implementation for find_replace.
find_replace(df, match='exact', **mappings)
Perform a find-and-replace action on provided columns.
Note
This function will be deprecated in a 1.x release. Please use pd.DataFrame.replace instead.
Depending on use case, users can choose either exact, full-value matching, or regular-expression-based fuzzy matching (hence allowing substring matching in the latter case). For strings, the matching is always case sensitive.
Examples:
For instance, given a DataFrame containing orders at a coffee shop:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "customer": ["Mary", "Tom", "Lila"],
... "order": ["ice coffee", "lemonade", "regular coffee"]
... })
>>> df
customer order
0 Mary ice coffee
1 Tom lemonade
2 Lila regular coffee
Our task is to replace the values ice coffee and regular coffee in the order column with latte.
Example 1 - exact matching (functional usage):
>>> df = janitor.find_replace(
... df,
... match="exact",
... order={"ice coffee": "latte", "regular coffee": "latte"},
... )
>>> df
customer order
0 Mary latte
1 Tom lemonade
2 Lila latte
Example 1 - exact matching (method chaining):
>>> df = df.find_replace(
... match="exact",
... order={"ice coffee": "latte", "regular coffee": "latte"},
... )
>>> df
customer order
0 Mary latte
1 Tom lemonade
2 Lila latte
Example 2 - Regular-expression-based matching (functional usage):
>>> df = janitor.find_replace(
... df,
... match='regex',
... order={'coffee$': 'latte'},
... )
>>> df
customer order
0 Mary latte
1 Tom lemonade
2 Lila latte
Example 2 - Regular-expression-based matching (method chaining usage):
>>> df = df.find_replace(
... match='regex',
... order={'coffee$': 'latte'},
... )
>>> df
customer order
0 Mary latte
1 Tom lemonade
2 Lila latte
To perform a find and replace on the entire DataFrame, pandas' df.replace() function provides the appropriate functionality. You can find more detail in the replace docs.
This function only works with column names that have no spaces or punctuation in them. For example, a column name item_name would work with find_replace, because it is a contiguous string that can be parsed correctly, but item name would not be parsed correctly by the Python interpreter.
If you have column names that might not be compatible, we recommend calling clean_names() as the first method call. If, for whatever reason, that is not possible, then _find_replace is available as a function that you can do a pandas pipe call on.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
match | str | Whether or not to perform an exact match. Valid values are "exact" or "regex". | 'exact' |
**mappings | Any | Keyword arguments corresponding to column names that have dictionaries passed in indicating what to find (keys) and what to replace with (values). | {} |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with replaced values. |
Source code in janitor/functions/find_replace.py
flag_nulls
Implementation source for flag_nulls.
flag_nulls(df, column_name='null_flag', columns=None)
Creates a new column to indicate whether you have null values in a given row.
If the columns parameter is not set, this looks across the entire DataFrame; otherwise, it looks only in the columns you set.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a": ["w", "x", None, "z"], "b": [5, None, 7, 8],
... })
>>> df.flag_nulls()
a b null_flag
0 w 5.0 0
1 x NaN 1
2 None 7.0 1
3 z 8.0 0
>>> df.flag_nulls(columns="b")
a b null_flag
0 w 5.0 0
1 x NaN 1
2 None 7.0 0
3 z 8.0 0
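The flag column name can also be customized; a minimal sketch (reusing the dataframe above; output is indicative):
>>> df.flag_nulls(column_name="missing")
a b missing
0 w 5.0 0
1 x NaN 1
2 None 7.0 1
3 z 8.0 0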
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input pandas DataFrame. | required |
column_name | Optional[Hashable] | Name for the output column. | 'null_flag' |
columns | Optional[Union[str, Iterable[str], Hashable]] | List of columns to look at for finding null values. If you only want to look at one column, you can simply give its name. If set to None (default), all DataFrame columns are used. | None |
Raises:
Type | Description |
---|---|
ValueError | If column_name is already present in the DataFrame. |
ValueError | If any column within columns is not present in the DataFrame. |
Returns:
Type | Description |
---|---|
DataFrame | Input dataframe with the null flag column. |
Source code in janitor/functions/flag_nulls.py
get_dupes
Implementation of the get_dupes function.
get_dupes(df, column_names=None)
Return all duplicate rows.
This method does not mutate the original DataFrame.
Examples:
Method chaining syntax:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "item": ["shoe", "shoe", "bag", "shoe", "bag"],
... "quantity": [100, 100, 75, 200, 75],
... })
>>> df
item quantity
0 shoe 100
1 shoe 100
2 bag 75
3 shoe 200
4 bag 75
>>> df.get_dupes()
item quantity
0 shoe 100
1 shoe 100
2 bag 75
4 bag 75
Optional column_names usage:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "item": ["shoe", "shoe", "bag", "shoe", "bag"],
... "quantity": [100, 100, 75, 200, 75],
... })
>>> df
item quantity
0 shoe 100
1 shoe 100
2 bag 75
3 shoe 200
4 bag 75
>>> df.get_dupes(column_names=["item"])
item quantity
0 shoe 100
1 shoe 100
2 bag 75
3 shoe 200
4 bag 75
>>> df.get_dupes(column_names=["quantity"])
item quantity
0 shoe 100
1 shoe 100
2 bag 75
4 bag 75
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The pandas DataFrame object. | required |
column_names | Optional[Union[str, Iterable[str], Hashable]] | A column name or an iterable (list or tuple) of column names. Following the pandas API, this only considers certain columns for identifying duplicates. Defaults to using all columns. | None |
Returns:
Type | Description |
---|---|
DataFrame | The duplicate rows, as a pandas DataFrame. |
Source code in janitor/functions/get_dupes.py
groupby_agg
Implementation source for groupby_agg.
groupby_agg(df, by, new_column_name, agg_column_name, agg, dropna=True)
Shortcut for assigning a groupby-transform to a new column.
This method does not mutate the original DataFrame.
Intended to be the method-chaining equivalent of:
df = df.assign(...=df.groupby(...)[...].transform(...))
Note
This function will be deprecated in a 1.x release. Please use jn.transform_column instead.
Examples:
Basic usage.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "item": ["shoe", "shoe", "bag", "shoe", "bag"],
... "quantity": [100, 120, 75, 200, 25],
... })
>>> df.groupby_agg(
... by="item",
... agg="mean",
... agg_column_name="quantity",
... new_column_name="avg_quantity",
... )
item quantity avg_quantity
0 shoe 100 140.0
1 shoe 120 140.0
2 bag 75 50.0
3 shoe 200 140.0
4 bag 25 50.0
Set dropna=False to compute the aggregation, treating the null values in the by column as an isolated "group".
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "x": ["a", "a", None, "b"], "y": [9, 9, 9, 9],
... })
>>> df.groupby_agg(
... by="x",
... agg="count",
... agg_column_name="y",
... new_column_name="y_count",
... dropna=False,
... )
x y y_count
0 a 9 2
1 a 9 2
2 None 9 1
3 b 9 1
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
by | Union[List, Callable, str] | Column(s) to group by on; will be passed into DataFrame.groupby. | required |
new_column_name | str | Name of the aggregation output column. | required |
agg_column_name | str | Name of the column to aggregate over. | required |
agg | Union[Callable, str] | How to aggregate. | required |
dropna | bool | Whether or not to include null values, if present in the by column(s). | True |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame. |
Source code in janitor/functions/groupby_agg.py
groupby_topk
Implementation of the groupby_topk function.
groupby_topk(df, by, column, k, dropna=True, ascending=True, ignore_index=True)
Return the top k rows from a groupby of a set of columns.
Returns a DataFrame that has the top k values per column, grouped by by. Under the hood it uses nlargest/nsmallest for numeric columns, which avoids sorting the entire dataframe and is usually more performant. For non-numeric columns, pd.sort_values is used.
No sorting is done on the by column(s); the order is maintained in the final output.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {
... "age": [20, 23, 22, 43, 21],
... "id": [1, 4, 6, 2, 5],
... "result": ["pass", "pass", "fail", "pass", "fail"],
... }
... )
>>> df
age id result
0 20 1 pass
1 23 4 pass
2 22 6 fail
3 43 2 pass
4 21 5 fail
Ascending top 3:
>>> df.groupby_topk(by="result", column="age", k=3)
age id result
0 20 1 pass
1 23 4 pass
2 43 2 pass
3 21 5 fail
4 22 6 fail
Descending top 2:
>>> df.groupby_topk(
... by="result", column="age", k=2, ascending=False, ignore_index=False
... )
age id result
3 43 2 pass
1 23 4 pass
2 22 6 fail
4 21 5 fail
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
by | Union[list, Hashable] | Column name(s) to group the input DataFrame by. | required |
column | Hashable | Name of the column that determines the top k rows per group. | required |
k | int | Number of top rows to return for each group. | required |
dropna | bool | If True, rows with null values in the by column(s) are dropped before computing the top k rows. | True |
ascending | bool | If True, the top k rows are the smallest values; if False, the largest. | True |
ignore_index | bool | If True, the original index is reset in the output; if False, it is retained. | True |
Raises:
Type | Description |
---|---|
ValueError | If k is less than 1. |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with the top k rows per group. |
Source code in janitor/functions/groupby_topk.py
impute
Implementation of the impute function.
impute(df, column_names, value=None, statistic_column_name=None)
Method-chainable imputation of values in a column.
This method does not mutate the original DataFrame.
Underneath the hood, this function calls the .fillna() method available to every pandas.Series object.
Either one of value or statistic_column_name should be provided.
If value is provided, then all null values in the selected column will take on the value provided.
If statistic_column_name is provided, then all null values in the selected column(s) will take on the summary statistic value of other non-null values.
Column selection in column_names is possible using the select syntax.
Currently supported statistics include:
- mean (also aliased by average)
- median
- mode
- minimum (also aliased by min)
- maximum (also aliased by max)
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a": [1, 2, 3],
... "sales": np.nan,
... "score": [np.nan, 3, 2],
... })
>>> df
a sales score
0 1 NaN NaN
1 2 NaN 3.0
2 3 NaN 2.0
Imputing null values with 0 (using the value
parameter):
>>> df.impute(column_names="sales", value=0.0)
a sales score
0 1 0.0 NaN
1 2 0.0 3.0
2 3 0.0 2.0
Imputing null values with median (using the statistic_column_name
parameter):
>>> df.impute(column_names="score", statistic_column_name="median")
a sales score
0 1 NaN 2.5
1 2 NaN 3.0
2 3 NaN 2.0
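A sketch with another supported statistic, minimum (reusing the dataframe above; output is indicative):
>>> df.impute(column_names="score", statistic_column_name="minimum")
a sales score
0 1 NaN 2.0
1 2 NaN 3.0
2 3 NaN 2.0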
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
column_names | Any | The name of the column(s) on which to impute values. | required |
value | Optional[Any] | The value used for imputation, passed into .fillna. | None |
statistic_column_name | Optional[str] | The column statistic to impute. | None |
Raises:
Type | Description |
---|---|
ValueError | If both value and statistic_column_name are provided. |
KeyError | If statistic_column_name is not one of mean, average, median, mode, minimum, min, maximum, or max. |
Returns:
Type | Description |
---|---|
DataFrame | An imputed pandas DataFrame. |
Source code in janitor/functions/impute.py
jitter
Implementation of the jitter function.
jitter(df, column_name, dest_column_name, scale, clip=None, random_state=None)
Adds Gaussian noise (jitter) to the values of a column.
A new column will be created containing the values of the original column
with Gaussian noise added.
For each value in the column, a Gaussian distribution is created having a location (mean) equal to the value and a scale (standard deviation) equal to scale. A random value is then sampled from this distribution, which is the jittered value.
If a tuple is supplied for clip, then any values of the new column less than clip[0] will be set to clip[0], and any values greater than clip[1] will be set to clip[1].
Additionally, if a numeric value is supplied for random_state, this value will be used to set the random seed used for sampling.
NaN values are ignored in this method.
This method mutates the original DataFrame.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [3, 4, 5, np.nan]})
>>> df
a
0 3.0
1 4.0
2 5.0
3 NaN
>>> df.jitter("a", dest_column_name="a_jit", scale=1, random_state=42)
a a_jit
0 3.0 3.496714
1 4.0 3.861736
2 5.0 5.647689
3 NaN NaN
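A sketch of clipping the jittered values (reusing the dataframe above; output is indicative):
>>> df.jitter("a", dest_column_name="a_jit", scale=1, clip=(3.5, 5.5), random_state=42)
a a_jit
0 3.0 3.500000
1 4.0 3.861736
2 5.0 5.500000
3 NaN NaN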
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
column_name | Hashable | Name of the column containing values to add Gaussian jitter to. | required |
dest_column_name | str | The name of the new column containing the jittered values that will be created. | required |
scale | number | A positive value multiplied by the original column value to determine the scale (standard deviation) of the Gaussian distribution to sample from. (A value of zero results in no jittering.) | required |
clip | Optional[Iterable[number]] | An iterable of two values (minimum and maximum) to clip the jittered values to; defaults to None. | None |
random_state | Optional[number] | An integer or 1-d array value used to set the random seed; defaults to None. | None |
Raises:
Type | Description |
---|---|
TypeError | If column_name is not a numeric column. |
ValueError | If scale is not a numerical value greater than 0. |
ValueError | If clip is not an iterable of length 2. |
ValueError | If clip[0] is greater than clip[1]. |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with a new column containing Gaussian-jittered values from another column. |
Source code in janitor/functions/jitter.py
join_apply
Implementation of the join_apply function.
join_apply(df, func, new_column_name)
Join the result of applying a function across dataframe rows.
This method does not mutate the original DataFrame.
This is a convenience function that allows us to apply arbitrary functions that take any combination of information from any of the columns. The only requirement is that the function signature takes in a row from the DataFrame.
Examples:
Sum the result of two columns into a new column.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a":[1, 2, 3], "b": [2, 3, 4]})
>>> df
a b
0 1 2
1 2 3
2 3 4
>>> df.join_apply(
... func=lambda x: 2 * x["a"] + x["b"],
... new_column_name="2a+b",
... )
a b 2a+b
0 1 2 4
1 2 3 7
2 3 4 10
Incorporating conditionals in func.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [20, 30, 40]})
>>> df
a b
0 1 20
1 2 30
2 3 40
>>> def take_a_if_even(x):
... if x["a"] % 2 == 0:
... return x["a"]
... else:
... return x["b"]
>>> df.join_apply(take_a_if_even, "a_if_even")
a b a_if_even
0 1 20 20
1 2 30 2
2 3 40 40
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
func | Callable | A function that is applied elementwise across all rows of the DataFrame. | required |
new_column_name | str | Name of the resulting column. | required |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with the new column appended. |
Source code in janitor/functions/join_apply.py
label_encode
Implementation of the label_encode function.
label_encode(df, column_names)
Convert labels into numerical data.
This method will create a new column with the string _enc appended after the original column's name. Consider this to be syntactic sugar. This function uses the factorize pandas function under the hood.
This method behaves differently from encode_categorical. This method creates a new column of numeric data; encode_categorical replaces the dtype of the original column with a categorical dtype.
This method mutates the original DataFrame.
Note
This function will be deprecated in a 1.x release. Please use factorize_columns instead.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "foo": ["b", "b", "a", "c", "b"],
... "bar": range(4, 9),
... })
>>> df
foo bar
0 b 4
1 b 5
2 a 6
3 c 7
4 b 8
>>> df.label_encode(column_names="foo")
foo bar foo_enc
0 b 4 0
1 b 5 0
2 a 6 1
3 c 7 2
4 b 8 0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The pandas DataFrame object. | required |
column_names | Union[str, Iterable[str], Hashable] | A column name or an iterable (list or tuple) of column names. | required |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame. |
Source code in janitor/functions/label_encode.py
limit_column_characters
Implementation of limit_column_characters.
limit_column_characters(df, column_length, col_separator='_')
Truncate column sizes to a specific length.
This method mutates the original DataFrame.
All column names are truncated to the given length; a given separator character plus an index is appended to duplicate truncated names, except for the first occurrence of each distinct name.
Examples:
>>> import pandas as pd
>>> import janitor
>>> data_dict = {
... "really_long_name": [9, 8, 7],
... "another_really_long_name": [2, 4, 6],
... "another_really_longer_name": list("xyz"),
... "this_is_getting_out_of_hand": list("pqr"),
... }
>>> df = pd.DataFrame(data_dict)
>>> df
really_long_name another_really_long_name another_really_longer_name this_is_getting_out_of_hand
0 9 2 x p
1 8 4 y q
2 7 6 z r
>>> df.limit_column_characters(7)
really_ another another_1 this_is
0 9 2 x p
1 8 4 y q
2 7 6 z r
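A sketch using a custom separator for the duplicate-name index (recreating the dataframe, since this method mutates it; output is indicative):
>>> df = pd.DataFrame(data_dict)
>>> df.limit_column_characters(7, col_separator=".")
really_ another another.1 this_is
0 9 2 x p
1 8 4 y q
2 7 6 z r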
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
column_length | int | Character length at which to truncate all columns. The column separator value and number for duplicate column names do not contribute. Therefore, if all columns are truncated to 10 characters, the first distinct column will be 10 characters and the remaining will be 12 characters (assuming a column separator of one character). | required |
col_separator | str | The separator to use for counting distinct column values, for example, '_' or '.'. | '_' |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with truncated column lengths. |
Source code in janitor/functions/limit_column_characters.py
min_max_scale
min_max_scale(df, feature_range=(0, 1), column_name=None, jointly=False)
Scales DataFrame to between a minimum and maximum value.
One can optionally set a new target minimum and maximum value using the feature_range keyword argument.
If column_name is specified, then only that column (or those columns) of data is scaled. Otherwise, the entire dataframe is scaled.
If jointly is True, the selected columns (or the entire dataframe) are recognized as one block and scaled jointly. Otherwise, each column of data is scaled separately.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
>>> df.min_max_scale()
a b
0 0.0 0.0
1 1.0 1.0
>>> df.min_max_scale(jointly=True)
a b
0 0.5 0.0
1 1.0 0.5
Setting custom minimum and maximum.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
>>> df.min_max_scale(feature_range=(0, 100))
a b
0 0.0 0.0
1 100.0 100.0
>>> df.min_max_scale(feature_range=(0, 100), jointly=True)
a b
0 50.0 0.0
1 100.0 50.0
Apply min-max to the selected columns.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1], 'c': [1, 0]})
>>> df.min_max_scale(
... feature_range=(0, 100),
... column_name=["a", "c"],
... )
a b c
0 0.0 0 100.0
1 100.0 1 0.0
>>> df.min_max_scale(
... feature_range=(0, 100),
... column_name=["a", "c"],
... jointly=True,
... )
a b c
0 50.0 0 50.0
1 100.0 1 0.0
>>> df.min_max_scale(feature_range=(0, 100), column_name='a')
a b c
0 0.0 0 1
1 100.0 1 0
The aforementioned example might be applied to something like scaling the isoelectric points of amino acids. While technically they range from approx 3-10, we can also think of them on the pH scale which ranges from 1 to 14. Hence, 3 gets scaled not to 0 but approx. 0.15 instead, while 10 gets scaled to approx. 0.69 instead.
Version Changed
- 0.24.0
  - Deleted old_min, old_max, new_min, and new_max options.
  - Added feature_range and jointly options.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A pandas DataFrame. | required |
feature_range | tuple[int \| float, int \| float] | Desired range of transformed data. | (0, 1) |
column_name | str \| int \| list[str \| int] \| Index | The column(s) on which to perform scaling. | None |
jointly | bool | Scale the selected data as a whole if True. | False |
Raises:
Type | Description |
---|---|
ValueError | If feature_range is not a tuple. |
ValueError | If the length of feature_range is not equal to two. |
ValueError | If the elements of feature_range are not numbers. |
ValueError | If feature_range[1] is less than or equal to feature_range[0]. |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with scaled data. |
Source code in janitor/functions/min_max_scale.py
move
Implementation of move.
move(df, source, target=None, position='before', axis=0)
Changes the positions of rows or columns in the dataframe.
It uses the select syntax, making it easy to move blocks of rows or columns at once.
This operation does not reset the index of the dataframe; the user must do so explicitly.
The dataframe must have unique column names or indices.
Examples:
Move a row:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [2, 4, 6, 8], "b": list("wxyz")})
>>> df
a b
0 2 w
1 4 x
2 6 y
3 8 z
>>> df.move(source=0, target=3, position="before", axis=0)
a b
1 4 x
2 6 y
0 2 w
3 8 z
Move a column:
>>> import pandas as pd
>>> import janitor
>>> data = [{"a": 1, "b": 1, "c": 1,
... "d": "a", "e": "a","f": "a"}]
>>> df = pd.DataFrame(data)
>>> df
a b c d e f
0 1 1 1 a a a
>>> df.move(source="a", target="c", position="after", axis=1)
b c a d e f
0 1 1 1 a a a
>>> df.move(source="f", target="b", position="before", axis=1)
a f b c d e
0 1 a 1 1 a a
>>> df.move(source="a", target=None, position="after", axis=1)
b c d e f a
0 1 1 a a a 1
Move columns:
>>> from pandas.api.types import is_numeric_dtype, is_string_dtype
>>> df.move(source=is_string_dtype, target=None, position="before", axis=1)
d e f a b c
0 a a a 1 1 1
>>> df.move(source=is_numeric_dtype, target=None, position="after", axis=1)
d e f a b c
0 a a a 1 1 1
>>> df.move(source = ["d", "f"], target=is_numeric_dtype, position="before", axis=1)
d f a b c e
0 a a 1 1 1 a
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The pandas DataFrame object. | required |
source | Any | Columns or rows to move. | required |
target | Any | Columns or rows to move adjacent to. If target is None and position is 'before', source is moved to the beginning; if target is None and position is 'after', source is moved to the end. | None |
position | str | Specifies the destination of the columns/rows. Values can be either 'before' or 'after'. | 'before' |
axis | int | Axis along which the function is applied. 0 to move along the index, 1 to move along the columns. | 0 |
Raises:
Type | Description |
---|---|
ValueError | If position is not 'before' or 'after'. |
ValueError | If axis is not 0 or 1. |
Returns:
Type | Description |
---|---|
DataFrame | The dataframe with the Series moved. |
Source code in janitor/functions/move.py
pivot
pivot_longer(df, index=None, column_names=None, names_to=None, values_to='value', column_level=None, names_sep=None, names_pattern=None, names_transform=None, dropna=False, sort_by_appearance=False, ignore_index=True)
Unpivots a DataFrame from wide to long format.
This method does not mutate the original DataFrame.
It is modeled after the pivot_longer
function in R's tidyr package,
and also takes inspiration from R's data.table package.
This function is useful to massage a DataFrame into a format where one or more columns are considered measured variables, and all other columns are considered as identifier variables.
All measured variables are unpivoted (and typically duplicated) along the row axis.
Column selection in index and column_names is possible using the select syntax.
For more granular control on the unpivoting, have a look at pivot_longer_spec.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {
... "Sepal.Length": [5.1, 5.9],
... "Sepal.Width": [3.5, 3.0],
... "Petal.Length": [1.4, 5.1],
... "Petal.Width": [0.2, 1.8],
... "Species": ["setosa", "virginica"],
... }
... )
>>> df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica
Replicate pandas' melt:
>>> df.pivot_longer(index = 'Species')
Species variable value
0 setosa Sepal.Length 5.1
1 virginica Sepal.Length 5.9
2 setosa Sepal.Width 3.5
3 virginica Sepal.Width 3.0
4 setosa Petal.Length 1.4
5 virginica Petal.Length 5.1
6 setosa Petal.Width 0.2
7 virginica Petal.Width 1.8
Convenient, flexible column selection in the index via the select syntax:
>>> from pandas.api.types import is_string_dtype
>>> df.pivot_longer(index = is_string_dtype)
Species variable value
0 setosa Sepal.Length 5.1
1 virginica Sepal.Length 5.9
2 setosa Sepal.Width 3.5
3 virginica Sepal.Width 3.0
4 setosa Petal.Length 1.4
5 virginica Petal.Length 5.1
6 setosa Petal.Width 0.2
7 virginica Petal.Width 1.8
Split the column labels into individual columns:
>>> df.pivot_longer(
... index = 'Species',
... names_to = ('part', 'dimension'),
... names_sep = '.',
... sort_by_appearance = True,
... )
Species part dimension value
0 setosa Sepal Length 5.1
1 setosa Sepal Width 3.5
2 setosa Petal Length 1.4
3 setosa Petal Width 0.2
4 virginica Sepal Length 5.9
5 virginica Sepal Width 3.0
6 virginica Petal Length 5.1
7 virginica Petal Width 1.8
Retain parts of the column names as headers:
>>> df.pivot_longer(
... index = 'Species',
... names_to = ('part', '.value'),
... names_sep = '.',
... sort_by_appearance = True,
... )
Species part Length Width
0 setosa Sepal 5.1 3.5
1 setosa Petal 1.4 0.2
2 virginica Sepal 5.9 3.0
3 virginica Petal 5.1 1.8
Split the column labels based on regex:
>>> df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
>>> df
id new_sp_m5564 newrel_f65
0 1 2 3
>>> df.pivot_longer(
... index = 'id',
... names_to = ('diagnosis', 'gender', 'age'),
... names_pattern = r"new_?(.+)_(.)(\d+)",
... )
id diagnosis gender age value
0 1 sp m 5564 2
1 1 rel f 65 3
Split the column labels for the above dataframe using named groups in names_pattern:
>>> df.pivot_longer(
... index = 'id',
... names_pattern = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)",
... )
id diagnosis gender age value
0 1 sp m 5564 2
1 1 rel f 65 3
Convert the dtypes of specific columns with names_transform:
>>> result = (df
... .pivot_longer(
... index = 'id',
... names_to = ('diagnosis', 'gender', 'age'),
... names_pattern = r"new_?(.+)_(.)(\d+)",
... names_transform = {'gender': 'category', 'age':'int'})
... )
>>> result.dtypes
id int64
diagnosis object
gender category
age int64
value int64
dtype: object
Use multiple .value
to reshape the dataframe:
>>> df = pd.DataFrame(
... [
... {
... "x_1_mean": 10,
... "x_2_mean": 20,
... "y_1_mean": 30,
... "y_2_mean": 40,
... "unit": 50,
... }
... ]
... )
>>> df
x_1_mean x_2_mean y_1_mean y_2_mean unit
0 10 20 30 40 50
>>> df.pivot_longer(
... index="unit",
... names_to=(".value", "time", ".value"),
... names_pattern=r"(x|y)_([0-9])(_mean)",
... )
unit time x_mean y_mean
0 50 1 10 30
1 50 2 20 40
Replicate the above with named groups in names_pattern
- use _
instead of .value
:
>>> df.pivot_longer(
... index="unit",
... names_pattern=r"(?P<_>x|y)_(?P<time>[0-9])(?P<__>_mean)",
... )
unit time x_mean y_mean
0 50 1 10 30
1 50 2 20 40
Convenient, flexible column selection in the column_names
via
the select
syntax:
>>> df.pivot_longer(
... column_names="*mean",
... names_to=(".value", "time", ".value"),
... names_pattern=r"(x|y)_([0-9])(_mean)",
... )
unit time x_mean y_mean
0 50 1 10 30
1 50 2 20 40
>>> df.pivot_longer(
... column_names=slice("x_1_mean", "y_2_mean"),
... names_to=(".value", "time", ".value"),
... names_pattern=r"(x|y)_([0-9])(_mean)",
... )
unit time x_mean y_mean
0 50 1 10 30
1 50 2 20 40
Reshape the dataframe by passing a sequence to names_pattern
:
>>> df = pd.DataFrame({'hr1': [514, 573],
... 'hr2': [545, 526],
... 'team': ['Red Sox', 'Yankees'],
... 'year1': [2007, 2007],
... 'year2': [2008, 2008]})
>>> df
hr1 hr2 team year1 year2
0 514 545 Red Sox 2007 2008
1 573 526 Yankees 2007 2008
>>> df.pivot_longer(
... index = 'team',
... names_to = ['year', 'hr'],
... names_pattern = ['year', 'hr']
... )
team hr year
0 Red Sox 514 2007
1 Yankees 573 2007
2 Red Sox 545 2008
3 Yankees 526 2008
Reshape the above dataframe by passing a dictionary to names_pattern
:
>>> df.pivot_longer(
... index = 'team',
... names_pattern = {"year":"year", "hr":"hr"}
... )
team hr year
0 Red Sox 514 2007
1 Yankees 573 2007
2 Red Sox 545 2008
3 Yankees 526 2008
Multiple values_to:
>>> df = pd.DataFrame(
... {
... "City": ["Houston", "Austin", "Hoover"],
... "State": ["Texas", "Texas", "Alabama"],
... "Name": ["Aria", "Penelope", "Niko"],
... "Mango": [4, 10, 90],
... "Orange": [10, 8, 14],
... "Watermelon": [40, 99, 43],
... "Gin": [16, 200, 34],
... "Vodka": [20, 33, 18],
... },
... )
>>> df
City State Name Mango Orange Watermelon Gin Vodka
0 Houston Texas Aria 4 10 40 16 20
1 Austin Texas Penelope 10 8 99 200 33
2 Hoover Alabama Niko 90 14 43 34 18
>>> df.pivot_longer(
... index=["City", "State"],
... column_names=slice("Mango", "Vodka"),
... names_to=("Fruit", "Drink"),
... values_to=("Pounds", "Ounces"),
... names_pattern=["M|O|W", "G|V"],
... )
City State Fruit Drink Pounds Ounces
0 Houston Texas Mango Gin 4 16.0
1 Austin Texas Mango Gin 10 200.0
2 Hoover Alabama Mango Gin 90 34.0
3 Houston Texas Orange Vodka 10 20.0
4 Austin Texas Orange Vodka 8 33.0
5 Hoover Alabama Orange Vodka 14 18.0
6 Houston Texas Watermelon None 40 NaN
7 Austin Texas Watermelon None 99 NaN
8 Hoover Alabama Watermelon None 43 NaN
Replicate the above transformation with a nested dictionary passed to names_pattern
- the outer keys in the names_pattern
dictionary are passed to names_to
,
while the inner keys are passed to values_to
:
>>> df.pivot_longer(
... index=["City", "State"],
... column_names=slice("Mango", "Vodka"),
... names_pattern={
... "Fruit": {"Pounds": "M|O|W"},
... "Drink": {"Ounces": "G|V"},
... },
... )
City State Fruit Drink Pounds Ounces
0 Houston Texas Mango Gin 4 16.0
1 Austin Texas Mango Gin 10 200.0
2 Hoover Alabama Mango Gin 90 34.0
3 Houston Texas Orange Vodka 10 20.0
4 Austin Texas Orange Vodka 8 33.0
5 Hoover Alabama Orange Vodka 14 18.0
6 Houston Texas Watermelon None 40 NaN
7 Austin Texas Watermelon None 99 NaN
8 Hoover Alabama Watermelon None 43 NaN
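The dropna parameter (described below) drops null entries from the values column(s) after unpivoting. A minimal sketch, assuming the default column ordering:
>>> tmp = pd.DataFrame({"id": [1, 2], "x": [1.0, None]})
>>> tmp.pivot_longer(index="id", dropna=True)
   id variable  value
0   1        x    1.0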
Version Changed
- 0.24.0
    - Added dropna parameter.
- 0.24.1
    - names_pattern can accept a dictionary.
    - Named groups supported in names_pattern.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
index
|
list | tuple | str | Pattern
|
Name(s) of columns to use as identifier variables.
Should be either a single column name, or a list/tuple of
column names.
|
None
|
column_names
|
list | tuple | str | Pattern
|
Name(s) of columns to unpivot. Should be either
a single column name or a list/tuple of column names.
|
None
|
names_to
|
list | tuple | str
|
Name of new column as a string that will contain
what were previously the column names in |
None
|
values_to
|
str
|
Name of new column as a string that will contain what
were previously the values of the columns in |
'value'
|
column_level
|
int | str
|
If columns are a MultiIndex, then use this level to
unpivot the DataFrame. Provided for compatibility with pandas' melt,
and applies only if neither |
None
|
names_sep
|
str | Pattern
|
Determines how the column name is broken up, if
|
None
|
names_pattern
|
list | tuple | str | Pattern
|
Determines how the column name is broken up.
It can be a regular expression containing matching groups.
Under the hood it is processed with pandas' |
None
|
names_transform
|
str | Callable | dict
|
Use this option to change the types of columns that
have been transformed to rows. This does not apply to the values columns.
Accepts any argument that is acceptable by |
None
|
dropna
|
bool
|
Determines whether or not to drop nulls
from the values columns. Default is |
False
|
sort_by_appearance
|
bool
|
Boolean value that determines
the final look of the DataFrame. If |
False
|
ignore_index
|
bool
|
If |
True
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame that has been unpivoted from wide to long format. |
Source code in janitor/functions/pivot.py
pivot_longer_spec(df, spec, sort_by_appearance=False, ignore_index=True, dropna=False, df_columns_is_unique=True)
A declarative interface to pivot a DataFrame from wide to long form,
where you describe how the data will be unpivoted,
using a DataFrame. This gives you, the user,
more control over unpivoting, where you create a “spec”
data frame that describes exactly how data stored
in the column names becomes variables.
It can come in handy for situations where
pivot_longer
seems inadequate for the transformation.
New in version 0.28.0
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {
... "Sepal.Length": [5.1, 5.9],
... "Sepal.Width": [3.5, 3.0],
... "Petal.Length": [1.4, 5.1],
... "Petal.Width": [0.2, 1.8],
... "Species": ["setosa", "virginica"],
... }
... )
>>> df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica
>>> spec = {'.name':['Sepal.Length','Petal.Length',
... 'Sepal.Width','Petal.Width'],
... '.value':['Length','Length','Width','Width'],
... 'part':['Sepal','Petal','Sepal','Petal']}
>>> spec = pd.DataFrame(spec)
>>> spec
.name .value part
0 Sepal.Length Length Sepal
1 Petal.Length Length Petal
2 Sepal.Width Width Sepal
3 Petal.Width Width Petal
>>> pivot_longer_spec(df=df,spec=spec)
Species part Length Width
0 setosa Sepal 5.1 3.5
1 virginica Sepal 5.9 3.0
2 setosa Petal 1.4 0.2
3 virginica Petal 5.1 1.8
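Since the spec is just a DataFrame, it can also be built programmatically instead of by hand. A small sketch that derives the spec above by splitting the measured column labels:
>>> names = ['Sepal.Length', 'Petal.Length', 'Sepal.Width', 'Petal.Width']
>>> parts = [label.split('.') for label in names]
>>> spec = pd.DataFrame({'.name': names,
...                      '.value': [value for _, value in parts],
...                      'part': [part for part, _ in parts]})
>>> spec
          .name .value   part
0  Sepal.Length Length  Sepal
1  Petal.Length Length  Petal
2   Sepal.Width  Width  Sepal
3   Petal.Width  Width  Petal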
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
The source DataFrame to unpivot. |
required |
spec
|
DataFrame
|
A specification DataFrame. At a minimum, the spec DataFrame must have a '.name' and a '.value' columns. The '.name' column should contain the columns in the source DataFrame that will be transformed to long form. The '.value' column gives the name of the column(s) that the values in the source DataFrame will go into. Additional columns in spec should be named to match columns in the long format of the dataset and contain values corresponding to columns pivoted from the wide format. Note that these additional columns should not already exist in the source DataFrame. |
required |
sort_by_appearance
|
bool
|
Boolean value that determines
the final look of the DataFrame. If |
False
|
ignore_index
|
bool
|
If |
True
|
dropna
|
bool
|
Determines whether or not to drop nulls
from the values columns. Default is |
False
|
df_columns_is_unique
|
bool
|
Boolean value to indicate if the source
DataFrame's columns are unique. Default is |
True
|
Raises:
Type | Description |
---|---|
KeyError
|
If '.name' or '.value' is missing from the spec's columns. |
ValueError
|
If the spec's columns are not unique, or the labels in spec['.name'] are not unique. |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame. |
Source code in janitor/functions/pivot.py
pivot_wider(df, index=None, names_from=None, values_from=None, flatten_levels=True, names_sep='_', names_glue=None, reset_index=True, names_expand=False, index_expand=False)
Reshapes data from long to wide form.
Note
This function will be deprecated in a 1.x release.
Please use pd.DataFrame.pivot
instead.
The number of columns is increased, while the number of rows is decreased.
It is the inverse of the pivot_longer
method, and is a wrapper around the pd.DataFrame.pivot
method.
This method does not mutate the original DataFrame.
Column selection in index
, names_from
and values_from
is possible using the
select
syntax.
A ValueError is raised if the combination
of the index
and names_from
is not unique.
By default, values from values_from
are always
at the top level if the columns are not flattened.
If flattened, the values from values_from
are usually
at the start of each label in the columns.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = [{'dep': 5.5, 'step': 1, 'a': 20, 'b': 30},
... {'dep': 5.5, 'step': 2, 'a': 25, 'b': 37},
... {'dep': 6.1, 'step': 1, 'a': 22, 'b': 19},
... {'dep': 6.1, 'step': 2, 'a': 18, 'b': 29}]
>>> df = pd.DataFrame(df)
>>> df
dep step a b
0 5.5 1 20 30
1 5.5 2 25 37
2 6.1 1 22 19
3 6.1 2 18 29
Pivot and flatten columns:
>>> df.pivot_wider(
... index = "dep",
... names_from = "step",
... )
dep a_1 a_2 b_1 b_2
0 5.5 20 25 30 37
1 6.1 22 18 19 29
Modify columns with names_sep
:
>>> df.pivot_wider(
... index = "dep",
... names_from = "step",
... names_sep = "",
... )
dep a1 a2 b1 b2
0 5.5 20 25 30 37
1 6.1 22 18 19 29
Modify columns with names_glue
:
>>> df.pivot_wider(
... index = "dep",
... names_from = "step",
... names_glue = "{_value}_step{step}",
... )
dep a_step1 a_step2 b_step1 b_step2
0 5.5 20 25 30 37
1 6.1 22 18 19 29
Expand columns to expose implicit missing values - this applies only to categorical columns:
>>> weekdays = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
>>> daily = pd.DataFrame(
... {
... "day": pd.Categorical(
... values=("Tue", "Thu", "Fri", "Mon"), categories=weekdays
... ),
... "value": (2, 3, 1, 5),
... },
... index=[0, 0, 0, 0],
... )
>>> daily
day value
0 Tue 2
0 Thu 3
0 Fri 1
0 Mon 5
>>> daily.pivot_wider(names_from='day', values_from='value')
Tue Thu Fri Mon
0 2 3 1 5
>>> (daily
... .pivot_wider(
... names_from='day',
... values_from='value',
... names_expand=True)
... )
Mon Tue Wed Thu Fri Sat Sun
0 5 2 NaN 3 1 NaN NaN
Expand the index to expose implicit missing values - this applies only to categorical columns:
>>> daily = daily.assign(letter = list('ABBA'))
>>> daily
day value letter
0 Tue 2 A
0 Thu 3 B
0 Fri 1 B
0 Mon 5 A
>>> daily.pivot_wider(index='day',names_from='letter',values_from='value')
day A B
0 Tue 2.0 NaN
1 Thu NaN 3.0
2 Fri NaN 1.0
3 Mon 5.0 NaN
>>> (daily
... .pivot_wider(
... index='day',
... names_from='letter',
... values_from='value',
... index_expand=True)
... )
day A B
0 Mon 5.0 NaN
1 Tue 2.0 NaN
2 Wed NaN NaN
3 Thu NaN 3.0
4 Fri NaN 1.0
5 Sat NaN NaN
6 Sun NaN NaN
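Per the deprecation note above, the same reshaping can be expressed in plain pandas. A sketch of the first example using pd.DataFrame.pivot plus manual column flattening (the intermediate names here are illustrative, not part of any API):
>>> data = pd.DataFrame({'dep': [5.5, 5.5, 6.1, 6.1], 'step': [1, 2, 1, 2],
...                      'a': [20, 25, 22, 18], 'b': [30, 37, 19, 29]})
>>> wide = data.pivot(index='dep', columns='step')
>>> wide.columns = [f"{value}_{step}" for value, step in wide.columns]
>>> wide.reset_index()
   dep  a_1  a_2  b_1  b_2
0  5.5   20   25   30   37
1  6.1   22   18   19   29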
Version Changed
- 0.24.0
    - Added reset_index, names_expand and index_expand parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
index
|
list | str
|
Name(s) of columns to use as identifier variables.
It should be either a single column name, or a list of column names.
If |
None
|
names_from
|
list | str
|
Name(s) of column(s) to use to make the new DataFrame's columns. Should be either a single column name, or a list of column names. |
None
|
values_from
|
list | str
|
Name(s) of column(s) that will be used for populating
the new DataFrame's values.
If |
None
|
flatten_levels
|
bool
|
If |
True
|
names_sep
|
str
|
If |
'_'
|
names_glue
|
str
|
A string to control the output of the flattened columns.
It offers more flexibility in creating custom column names,
and uses python's |
None
|
reset_index
|
bool
|
Determines whether to restore |
True
|
names_expand
|
bool
|
Expand columns to show all the categories.
Applies only if |
False
|
index_expand
|
bool
|
Expand the index to show all the categories.
Applies only if |
False
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame that has been unpivoted from long to wide form. |
Source code in janitor/functions/pivot.py
process_text
Implementation source for process_text
.
process_text(df, column_name, string_function, **kwargs)
Apply a Pandas string method to an existing column.
This function makes string cleaning easy within a method chain: simply pass the name of the string method, along with any keyword arguments, to the function.
This modifies an existing column; it does not create a new column;
new columns can be created via pyjanitor's
transform_columns
.
A list of all the string methods in Pandas can be found in the pandas documentation.
Note
This function will be deprecated in a 1.x release.
Please use jn.transform_column
instead.
Examples:
>>> import pandas as pd
>>> import janitor
>>> import re
>>> df = pd.DataFrame({"text": ["Ragnar", "sammywemmy", "ginger"],
... "code": [1, 2, 3]})
>>> df
text code
0 Ragnar 1
1 sammywemmy 2
2 ginger 3
>>> df.process_text(column_name="text", string_function="lower")
text code
0 ragnar 1
1 sammywemmy 2
2 ginger 3
For string methods with parameters, simply pass the keyword arguments:
>>> df.process_text(
... column_name="text",
... string_function="extract",
... pat=r"(ag)",
... expand=False,
... flags=re.IGNORECASE,
... )
text code
0 ag 1
1 NaN 2
2 NaN 3
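Per the deprecation note, the first example above can be replicated with jn.transform_column. A sketch, passing a column-wise function:
>>> df.transform_column(
...     column_name="text",
...     function=lambda s: s.str.lower(),
...     elementwise=False,
... )
         text  code
0      ragnar     1
1  sammywemmy     2
2      ginger     3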
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
column_name
|
str
|
String column to be operated on. |
required |
string_function
|
str
|
pandas string method to be applied. |
required |
**kwargs
|
Any
|
Keyword arguments for parameters of the |
{}
|
Raises:
Type | Description |
---|---|
KeyError
|
If |
ValueError
|
If the text function returns a DataFrame, instead of a Series. |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with modified column. |
Source code in janitor/functions/process_text.py
remove_columns
Implementation of remove_columns.
remove_columns(df, column_names)
Remove the set of columns specified in column_names
.
This method does not mutate the original DataFrame.
Intended to be the method-chaining alternative to del df[col]
.
Note
This function will be deprecated in a 1.x release.
Kindly use pd.DataFrame.drop
instead.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": [2, 4, 6], "b": [1, 3, 5], "c": [7, 8, 9]})
>>> df
a b c
0 2 1 7
1 4 3 8
2 6 5 9
>>> df.remove_columns(column_names=['a', 'c'])
b
0 1
1 3
2 5
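Per the deprecation note, the plain-pandas equivalent is pd.DataFrame.drop:
>>> df.drop(columns=['a', 'c'])
   b
0  1
1  3
2  5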
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
column_names
|
Union[str, Iterable[str], Hashable]
|
The columns to remove. |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame. |
Source code in janitor/functions/remove_columns.py
remove_empty
Implementation of remove_empty.
remove_empty(df, reset_index=True)
Drop all rows and columns that are completely null.
This method does not mutate the original DataFrame.
Implementation is inspired from StackOverflow.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a": [1, np.nan, 2],
... "b": [3, np.nan, 4],
... "c": [np.nan, np.nan, np.nan],
... })
>>> df
a b c
0 1.0 3.0 NaN
1 NaN NaN NaN
2 2.0 4.0 NaN
>>> df.remove_empty()
a b
0 1.0 3.0
1 2.0 4.0
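Pass reset_index=False to keep the original row labels of the surviving rows:
>>> df.remove_empty(reset_index=False)
     a    b
0  1.0  3.0
2  2.0  4.0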
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
The pandas DataFrame object. |
required |
reset_index
|
bool
|
Determines if the index is reset. |
True
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame. |
Source code in janitor/functions/remove_empty.py
rename_columns
rename_column(df, old_column_name, new_column_name)
Rename a single column.
This method does not mutate the original DataFrame.
Note
This function will be deprecated in a 1.x release.
Please use pd.DataFrame.rename
instead.
This is just syntactic sugar/a convenience function for renaming one column at a time.
If you are convinced that there are multiple columns in need of changing,
then use the pandas.DataFrame.rename
method.
Examples:
Change the name of column 'a' to 'a_new'.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("abc")})
>>> df.rename_column(old_column_name='a', new_column_name='a_new')
a_new b
0 0 a
1 1 b
2 2 c
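Per the deprecation note, the plain-pandas equivalent is:
>>> df.rename(columns={'a': 'a_new'})
   a_new  b
0      0  a
1      1  b
2      2  c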
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
The pandas DataFrame object. |
required |
old_column_name
|
str
|
The old column name. |
required |
new_column_name
|
str
|
The new column name. |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with renamed columns. |
Source code in janitor/functions/rename_columns.py
rename_columns(df, new_column_names=None, function=None)
Rename columns.
This method does not mutate the original DataFrame.
Note
This function will be deprecated in a 1.x release.
Please use pd.DataFrame.rename
instead.
One of the new_column_names
or function
are a required parameter.
If both are provided, then new_column_names
takes priority and function
is never executed.
Examples:
Rename columns using a dictionary which maps old names to new names.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})
>>> df
a b
0 0 x
1 1 y
2 2 z
>>> df.rename_columns(new_column_names={"a": "a_new", "b": "b_new"})
a_new b_new
0 0 x
1 1 y
2 2 z
Rename columns using a generic callable.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": list(range(3)), "b": list("xyz")})
>>> df.rename_columns(function=str.upper)
A B
0 0 x
1 1 y
2 2 z
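Per the deprecation note, both examples reduce to pd.DataFrame.rename, which accepts either a mapping or a callable:
>>> df.rename(columns=str.upper)
   A  B
0  0  x
1  1  y
2  2  z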
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
The pandas DataFrame object. |
required |
new_column_names
|
Union[Dict, None]
|
A dictionary of old and new column names. |
None
|
function
|
Callable
|
A function which should be applied to all the columns. |
None
|
Raises:
Type | Description |
---|---|
ValueError
|
If both |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with renamed columns. |
Source code in janitor/functions/rename_columns.py
reorder_columns
Implementation source for reorder_columns
.
reorder_columns(df, column_order)
Reorder DataFrame columns by specifying desired order as list of col names.
Columns not specified retain their order and follow after the columns specified
in column_order
.
All columns specified within the column_order
list must be present within df
.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"col1": [1, 1, 1], "col2": [2, 2, 2], "col3": [3, 3, 3]})
>>> df
col1 col2 col3
0 1 2 3
1 1 2 3
2 1 2 3
>>> df.reorder_columns(['col3', 'col1'])
col3 col1 col2
0 3 1 2
1 3 1 2
2 3 1 2
Notice that the column order of df
is now col3
, col1
, col2
.
Internally, this function uses DataFrame.reindex
with copy=False
to avoid unnecessary data duplication.
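A rough sketch of that internal approach in plain pandas (the cols variable is illustrative):
>>> cols = ['col3', 'col1'] + [c for c in df.columns if c not in ('col3', 'col1')]
>>> df.reindex(columns=cols, copy=False)
   col3  col1  col2
0     3     1     2
1     3     1     2
2     3     1     2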
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
|
required |
column_order
|
Union[Iterable[str], Index, Hashable]
|
A list of column names or Pandas |
required |
Raises:
Type | Description |
---|---|
IndexError
|
If a column within |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with reordered columns. |
Source code in janitor/functions/reorder_columns.py
round_to_fraction
Implementation of round_to_fraction
round_to_fraction(df, column_name, denominator, digits=np.inf)
Round all values in a column to a fraction.
This method mutates the original DataFrame.
Taken from the R package.
Also, optionally round to a specified number of digits.
Examples:
Round numeric column to the nearest 1/4 value.
>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a1": [1.263, 2.499, np.nan],
... "a2": ["x", "y", "z"],
... })
>>> df
a1 a2
0 1.263 x
1 2.499 y
2 NaN z
>>> df.round_to_fraction("a1", denominator=4)
a1 a2
0 1.25 x
1 2.50 y
2 NaN z
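Optionally round again to a fixed number of digits. A sketch rounding to the nearest 1/3 and then to two decimal places (the DataFrame is re-created here because this method mutates its input):
>>> df = pd.DataFrame({"a1": [1.263, 2.499, np.nan], "a2": ["x", "y", "z"]})
>>> df.round_to_fraction("a1", denominator=3, digits=2)
     a1 a2
0  1.33  x
1  2.33  y
2   NaN  z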
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
column_name
|
Hashable
|
Name of column to round to fraction. |
required |
denominator
|
float
|
The denominator of the fraction for rounding. Must be a positive number. |
required |
digits
|
float
|
The number of digits for rounding after rounding to the fraction. Default is np.inf (i.e. no subsequent rounding). |
inf
|
Raises:
Type | Description |
---|---|
ValueError
|
If |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with a column's values rounded. |
Source code in janitor/functions/round_to_fraction.py
row_to_names
Implementation of the row_to_names
function.
row_to_names(df, row_numbers=0, remove_rows=False, remove_rows_above=False, reset_index=False)
Elevates a row, or rows, to be the column names of a DataFrame.
This method does not mutate the original DataFrame.
Contains options to remove the elevated row(s) from the DataFrame, along with the rows above the selected row.
Examples:
Replace column names with the first row and reset the index.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a": ["nums", 6, 9],
... "b": ["chars", "x", "y"],
... })
>>> df
a b
0 nums chars
1 6 x
2 9 y
>>> df.row_to_names(0, remove_rows=True, reset_index=True)
nums chars
0 6 x
1 9 y
>>> df.row_to_names([0,1], remove_rows=True, reset_index=True)
nums chars
6 x
0 9 y
Remove rows above the elevated row and the elevated row itself.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a": ["bla1", "nums", 6, 9],
... "b": ["bla2", "chars", "x", "y"],
... })
>>> df
a b
0 bla1 bla2
1 nums chars
2 6 x
3 9 y
>>> df.row_to_names(1, remove_rows=True, remove_rows_above=True, reset_index=True)
nums chars
0 6 x
1 9 y
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
row_numbers
|
int | list | slice
|
Position of the row(s) containing the variable names. It can be an integer, a list or a slice. Defaults to 0 (first row). |
0
|
remove_rows
|
bool
|
Whether the row(s) should be removed from the DataFrame. |
False
|
remove_rows_above
|
bool
|
Whether the row(s) above the selected row should be removed from the DataFrame. |
False
|
reset_index
|
bool
|
Whether the index should be reset on the returning DataFrame. |
False
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with set column names. |
Source code in janitor/functions/row_to_names.py
select
DropLabel
dataclass
Helper class for removing labels within the select
syntax.
label
can be any of the types supported in the select
,
select_rows
and select_columns
functions.
An array of integers not matching the labels is returned.
New in version 0.24.0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label
|
Any
|
Label(s) to be dropped from the index. |
required |
Source code in janitor/functions/select.py
get_columns(group, label)
Helper function for selecting columns on a grouped object,
using the
select
syntax.
New in version 0.25.0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group
|
DataFrameGroupBy | SeriesGroupBy
|
A Pandas GroupBy object. |
required |
label
|
Any
|
column(s) to select. |
required |
Returns:
Type | Description |
---|---|
DataFrameGroupBy | SeriesGroupBy
|
A pandas groupby object. |
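A hedged sketch of usage; the import path and example frame are illustrative assumptions:
>>> import pandas as pd
>>> import janitor
>>> from janitor.functions.select import get_columns
>>> df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3], "y": [4, 5, 6]})
>>> sub = get_columns(df.groupby("g"), "x")  # select column 'x' on the grouped object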
Source code in janitor/functions/select.py
get_index_labels(arg, df, axis)
Convenience function to get actual labels from column/index
New in version 0.25.0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
arg
|
Any
|
Valid inputs include: an exact column name to look for,
a shell-style glob string (e.g. |
required |
df
|
DataFrame
|
The pandas DataFrame object. |
required |
axis
|
Literal['index', 'columns']
|
Should be either |
required |
Returns:
Type | Description |
---|---|
Index
|
A pandas Index. |
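A hedged sketch; the import path and example frame are illustrative assumptions:
>>> import pandas as pd
>>> import janitor
>>> from janitor.functions.select import get_index_labels
>>> df = pd.DataFrame({"col1": [1], "col2": [2], "other": [3]})
>>> labels = get_index_labels("col*", df, axis="columns")
>>> list(labels)
['col1', 'col2']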
Source code in janitor/functions/select.py
select(df, *args, index=None, columns=None, axis='columns', invert=False)
Method-chainable selection of rows and columns.
It accepts a string, shell-like glob strings (*string*)
,
regex, slice, array-like object, or a list of the previous options.
Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.
This method does not mutate the original DataFrame.
Selection can be inverted with the DropLabel
class.
Optional ability to invert selection of index/columns available as well.
New in version 0.24.0
Note
The preferred option when selecting columns or rows in a Pandas DataFrame
is with .loc
or .iloc
methods, as they are generally performant.
select
is primarily for convenience.
Version Changed
- 0.26.0
    - Added variable args, invert and axis parameters.
    - rows keyword deprecated in favour of index.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df
max_speed shield
cobra 1 2
viper 4 5
sidewinder 7 8
>>> df.select(index='cobra', columns='shield')
shield
cobra 2
Labels can be dropped with the DropLabel
class:
>>> df.select(index=DropLabel('cobra'))
max_speed shield
viper 4 5
sidewinder 7 8
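Selection can also be inverted; a sketch, assuming invert applies to whichever axis arguments are supplied:
>>> df.select(columns='shield', invert=True)
            max_speed
cobra               1
viper               4
sidewinder          7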
More examples can be found in the
select_columns
section.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
*args
|
tuple
|
Valid inputs include: an exact index name to look for,
a shell-style glob string (e.g. |
()
|
index
|
Any
|
Valid inputs include: an exact label to look for,
a shell-style glob string (e.g. |
None
|
columns
|
Any
|
Valid inputs include: an exact label to look for,
a shell-style glob string (e.g. |
None
|
invert
|
bool
|
Whether or not to invert the selection. This will result in the selection of the complement of the rows/columns provided. |
False
|
axis
|
str
|
Whether the selection should be on the index('index'), or columns('columns'). Applicable only for the variable args parameter. |
'columns'
|
Raises:
Type | Description |
---|---|
ValueError
|
If args and index/columns are provided. |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with the specified rows and/or columns selected. |
Source code in janitor/functions/select.py
select_columns(df, *args, invert=False)
Method-chainable selection of columns.
It accepts a string, shell-like glob strings (*string*)
,
regex, slice, array-like object, or a list of the previous options.
Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.
This method does not mutate the original DataFrame.
Optional ability to invert selection of columns available as well.
Note
The preferred option when selecting columns or rows in a Pandas DataFrame
is with .loc
or .iloc
methods.
select_columns
is primarily for convenience.
Note
This function will be deprecated in a 1.x release.
Please use jn.select
instead.
Examples:
>>> import pandas as pd
>>> import janitor
>>> from numpy import nan
>>> pd.set_option("display.max_columns", None)
>>> pd.set_option("display.expand_frame_repr", False)
>>> pd.set_option("max_colwidth", None)
>>> data = {'name': ['Cheetah','Owl monkey','Mountain beaver',
... 'Greater short-tailed shrew','Cow'],
... 'genus': ['Acinonyx', 'Aotus', 'Aplodontia', 'Blarina', 'Bos'],
... 'vore': ['carni', 'omni', 'herbi', 'omni', 'herbi'],
... 'order': ['Carnivora','Primates','Rodentia','Soricomorpha','Artiodactyla'],
... 'conservation': ['lc', nan, 'nt', 'lc', 'domesticated'],
... 'sleep_total': [12.1, 17.0, 14.4, 14.9, 4.0],
... 'sleep_rem': [nan, 1.8, 2.4, 2.3, 0.7],
... 'sleep_cycle': [nan, nan, nan, 0.133333333, 0.666666667],
... 'awake': [11.9, 7.0, 9.6, 9.1, 20.0],
... 'brainwt': [nan, 0.0155, nan, 0.00029, 0.423],
... 'bodywt': [50.0, 0.48, 1.35, 0.019, 600.0]}
>>> df = pd.DataFrame(data)
>>> df
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
0 Cheetah Acinonyx carni Carnivora lc 12.1 NaN NaN 11.9 NaN 50.000
1 Owl monkey Aotus omni Primates NaN 17.0 1.8 NaN 7.0 0.01550 0.480
2 Mountain beaver Aplodontia herbi Rodentia nt 14.4 2.4 NaN 9.6 NaN 1.350
3 Greater short-tailed shrew Blarina omni Soricomorpha lc 14.9 2.3 0.133333 9.1 0.00029 0.019
4 Cow Bos herbi Artiodactyla domesticated 4.0 0.7 0.666667 20.0 0.42300 600.000
Explicit label selection:
>>> df.select_columns('name', 'order')
name order
0 Cheetah Carnivora
1 Owl monkey Primates
2 Mountain beaver Rodentia
3 Greater short-tailed shrew Soricomorpha
4 Cow Artiodactyla
Selection via globbing:
>>> df.select_columns("sleep*", "*wt")
sleep_total sleep_rem sleep_cycle brainwt bodywt
0 12.1 NaN NaN NaN 50.000
1 17.0 1.8 NaN 0.01550 0.480
2 14.4 2.4 NaN NaN 1.350
3 14.9 2.3 0.133333 0.00029 0.019
4 4.0 0.7 0.666667 0.42300 600.000
Selection via regex:
>>> import re
>>> df.select_columns(re.compile(r"o.+er"))
order conservation
0 Carnivora lc
1 Primates NaN
2 Rodentia nt
3 Soricomorpha lc
4 Artiodactyla domesticated
Selection via slicing:
>>> df.select_columns(slice('name','order'), slice('sleep_total','sleep_cycle'))
name genus vore order sleep_total sleep_rem sleep_cycle
0 Cheetah Acinonyx carni Carnivora 12.1 NaN NaN
1 Owl monkey Aotus omni Primates 17.0 1.8 NaN
2 Mountain beaver Aplodontia herbi Rodentia 14.4 2.4 NaN
3 Greater short-tailed shrew Blarina omni Soricomorpha 14.9 2.3 0.133333
4 Cow Bos herbi Artiodactyla 4.0 0.7 0.666667
Selection via callable:
>>> from pandas.api.types import is_numeric_dtype
>>> df.select_columns(is_numeric_dtype)
sleep_total sleep_rem sleep_cycle awake brainwt bodywt
0 12.1 NaN NaN 11.9 NaN 50.000
1 17.0 1.8 NaN 7.0 0.01550 0.480
2 14.4 2.4 NaN 9.6 NaN 1.350
3 14.9 2.3 0.133333 9.1 0.00029 0.019
4 4.0 0.7 0.666667 20.0 0.42300 600.000
>>> df.select_columns(lambda f: f.isna().any())
conservation sleep_rem sleep_cycle brainwt
0 lc NaN NaN NaN
1 NaN 1.8 NaN 0.01550
2 nt 2.4 NaN NaN
3 lc 2.3 0.133333 0.00029
4 domesticated 0.7 0.666667 0.42300
Exclude columns with the invert
parameter:
>>> df.select_columns(is_numeric_dtype, invert=True)
name genus vore order conservation
0 Cheetah Acinonyx carni Carnivora lc
1 Owl monkey Aotus omni Primates NaN
2 Mountain beaver Aplodontia herbi Rodentia nt
3 Greater short-tailed shrew Blarina omni Soricomorpha lc
4 Cow Bos herbi Artiodactyla domesticated
Exclude columns with the DropLabel
class:
>>> from janitor import DropLabel
>>> df.select_columns(DropLabel(slice("name", "awake")), "conservation")
brainwt bodywt conservation
0 NaN 50.000 lc
1 0.01550 0.480 NaN
2 NaN 1.350 nt
3 0.00029 0.019 lc
4 0.42300 600.000 domesticated
Selection on MultiIndex columns:
>>> d = {'num_legs': [4, 4, 2, 2],
... 'num_wings': [0, 0, 2, 2],
... 'class': ['mammal', 'mammal', 'mammal', 'bird'],
... 'animal': ['cat', 'dog', 'bat', 'penguin'],
... 'locomotion': ['walks', 'walks', 'flies', 'walks']}
>>> df = pd.DataFrame(data=d)
>>> df = df.set_index(['class', 'animal', 'locomotion']).T
>>> df
class mammal bird
animal cat dog bat penguin
locomotion walks walks flies walks
num_legs 4 4 2 2
num_wings 0 0 2 2
Selection with a scalar:
>>> df.select_columns('mammal')
class mammal
animal cat dog bat
locomotion walks walks flies
num_legs 4 4 2
num_wings 0 0 2
Selection with a tuple:
>>> df.select_columns(('mammal','bat'))
class mammal
animal bat
locomotion flies
num_legs 2
num_wings 2
Selection within a level is possible with a dictionary, where the key is either a level name or number:
>>> df.select_columns({'animal':'cat'})
class mammal
animal cat
locomotion walks
num_legs 4
num_wings 0
>>> df.select_columns({1:["bat", "cat"]})
class mammal
animal bat cat
locomotion flies walks
num_legs 2 4
num_wings 2 0
Selection on multiple levels:
>>> df.select_columns({"class":"mammal", "locomotion":"flies"})
class mammal
animal bat
locomotion flies
num_legs 2
num_wings 2
Selection with a regex on a level:
>>> df.select_columns({"animal":re.compile(".+t$")})
class mammal
animal cat bat
locomotion walks flies
num_legs 4 2
num_wings 0 2
Selection with a callable on a level:
>>> df.select_columns({"animal":lambda f: f.str.endswith('t')})
class mammal
animal cat bat
locomotion walks flies
num_legs 4 2
num_wings 0 2
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
*args
|
Any
|
Valid inputs include: an exact column name to look for,
a shell-style glob string (e.g. |
()
|
invert
|
bool
|
Whether or not to invert the selection. This will result in the selection of the complement of the columns provided. |
False
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with the specified columns selected. |
Source code in janitor/functions/select.py
select_rows(df, *args, invert=False)
Method-chainable selection of rows.
It accepts a string, shell-like glob strings (*string*)
,
regex, slice, array-like object, or a list of the previous options.
Selection on a MultiIndex on a level, or multiple levels, is possible with a dictionary.
This method does not mutate the original DataFrame.
Optional ability to invert selection of rows available as well.
New in version 0.24.0
Note
The preferred option when selecting columns or rows in a Pandas DataFrame
is with .loc
or .iloc
methods, as they are generally performant.
select_rows
is primarily for convenience.
Note
This function will be deprecated in a 1.x release.
Please use jn.select
instead.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = {"col1": [1, 2], "foo": [3, 4], "col2": [5, 6]}
>>> df = pd.DataFrame.from_dict(df, orient='index')
>>> df
0 1
col1 1 2
foo 3 4
col2 5 6
>>> df.select_rows("col*")
0 1
col1 1 2
col2 5 6
More examples can be found in the
select_columns
section.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
*args
|
Any
|
Valid inputs include: an exact index name to look for,
a shell-style glob string (e.g. |
()
|
invert
|
bool
|
Whether or not to invert the selection. This will result in the selection of the complement of the rows provided. |
False
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with the specified rows selected. |
Source code in janitor/functions/select.py
shuffle
Implementation of shuffle
functions.
shuffle(df, random_state=None, reset_index=True)
Shuffle the rows of the DataFrame.
This method does not mutate the original DataFrame.
Super-sugary syntax! Underneath the hood, we use df.sample(frac=1)
,
with the option to set the random state.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "col1": range(5),
... "col2": list("abcde"),
... })
>>> df
col1 col2
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
>>> df.shuffle(random_state=42)
col1 col2
0 1 b
1 4 e
2 2 c
3 0 a
4 3 d
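As the note above says, this is sugar over df.sample(frac=1); the same result in plain pandas:
>>> df.sample(frac=1, random_state=42).reset_index(drop=True)
   col1 col2
0     1    b
1     4    e
2     2    c
3     0    a
4     3    d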
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
random_state
|
Any
|
If provided, set a seed for the random number
generator. Passed to |
None
|
reset_index
|
bool
|
If True, reset the dataframe index to the default RangeIndex. |
True
|
Returns:
Type | Description |
---|---|
DataFrame
|
A shuffled pandas DataFrame. |
Source code in janitor/functions/shuffle.py
sort_column_value_order
Implementation of the sort_column_value_order
function.
sort_column_value_order(df, column, column_value_order, columns=None)
This function adds precedence to certain values in a specified column, then sorts based on that column and any other specified columns.
Examples:
>>> import pandas as pd
>>> import janitor
>>> import numpy as np
>>> company_sales = {
... "SalesMonth": ["Jan", "Feb", "Feb", "Mar", "April"],
... "Company1": [150.0, 200.0, 200.0, 300.0, 400.0],
... "Company2": [180.0, 250.0, 250.0, np.nan, 500.0],
... "Company3": [400.0, 500.0, 500.0, 600.0, 675.0],
... }
>>> df = pd.DataFrame.from_dict(company_sales)
>>> df
SalesMonth Company1 Company2 Company3
0 Jan 150.0 180.0 400.0
1 Feb 200.0 250.0 500.0
2 Feb 200.0 250.0 500.0
3 Mar 300.0 NaN 600.0
4 April 400.0 500.0 675.0
>>> df.sort_column_value_order(
... "SalesMonth",
... {"April": 1, "Mar": 2, "Feb": 3, "Jan": 4}
... )
SalesMonth Company1 Company2 Company3
4 April 400.0 500.0 675.0
3 Mar 300.0 NaN 600.0
1 Feb 200.0 250.0 500.0
2 Feb 200.0 250.0 500.0
0 Jan 150.0 180.0 400.0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
The pandas DataFrame being manipulated. |
required |
column
|
str
|
Name of the column (as a string) whose values receive precedence for sorting. |
required |
column_value_order
|
dict
|
Dictionary of values that will represent precedence of the values in the specified column |
required |
columns
|
str
|
An optional list of additional columns to sort by. |
None
|
Raises:
Type | Description |
---|---|
ValueError
|
If the chosen column name is not in the
DataFrame, or if |
Returns:
Type | Description |
---|---|
DataFrame
|
A sorted pandas DataFrame. |
Source code in janitor/functions/sort_column_value_order.py
sort_naturally
Implementation of the sort_naturally
function.
sort_naturally(df, column_name, **natsorted_kwargs)
Sort a DataFrame by a column using natural sorting.
Natural sorting is distinct from
the default lexicographical sorting provided by pandas
.
For example, given the following list of items:
["A1", "A11", "A3", "A2", "A10"]
Lexicographical sorting would give us:
["A1", "A10", "A11", "A2", "A3"]
By contrast, "natural" sorting would give us:
["A1", "A2", "A3", "A10", "A11"]
This function thus provides natural sorting on a single column of a dataframe.
To accomplish this, we do a natural sort on the unique values that are present in the dataframe. Then, we reconstitute the entire dataframe in the naturally sorted order.
Natural sorting is provided by the Python package natsort.
All keyword arguments to natsort
should be provided
after the column name to sort by is provided.
They are passed through to the natsorted
function.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame(
... {
... "Well": ["A21", "A3", "A21", "B2", "B51", "B12"],
... "Value": [1, 2, 13, 3, 4, 7],
... }
... )
>>> df
Well Value
0 A21 1
1 A3 2
2 A21 13
3 B2 3
4 B51 4
5 B12 7
>>> df.sort_naturally("Well")
Well Value
1 A3 2
0 A21 1
2 A21 13
3 B2 3
5 B12 7
4 B51 4
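Keyword arguments are forwarded to natsorted; for instance, natsort's reverse flag (an assumption here) would sort in descending natural order:
>>> descending = df.sort_naturally("Well", reverse=True)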
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
column_name
|
str
|
The column on which natural sorting should take place. |
required |
**natsorted_kwargs
|
Any
|
Keyword arguments to be passed
to natsort's |
{}
|
Returns:
Type | Description |
---|---|
DataFrame
|
A sorted pandas DataFrame. |
Source code in janitor/functions/sort_naturally.py
take_first
Implementation of take_first function.
take_first(df, subset, by, ascending=True)
Take the first row within each group specified by subset
.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [0, 1, 2, 3]})
>>> df
a b
0 x 0
1 x 1
2 y 2
3 y 3
>>> df.take_first(subset="a", by="b")
a b
0 x 0
2 y 2
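A rough plain-pandas sketch of the same operation (assuming a sort followed by de-duplication):
>>> df.sort_values(by="b", ascending=True).drop_duplicates(subset="a", keep="first")
   a  b
0  x  0
2  y  2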
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
subset
|
Union[Hashable, Iterable[Hashable]]
|
Column(s) defining the group. |
required |
by
|
Hashable
|
Column to sort by. |
required |
ascending
|
bool
|
Whether or not to sort in ascending order, |
True
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame. |
Source code in janitor/functions/take_first.py
then
Implementation source for then
.
then(df, func)
Add an arbitrary function to run in the pyjanitor
method chain.
This method does not mutate the original DataFrame.
Note
This function will be deprecated in a 1.x release.
Please use pd.DataFrame.pipe
instead.
Examples:
A trivial example using a lambda func
.
>>> import pandas as pd
>>> import janitor
>>> (pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]})
... .then(lambda df: df * 2))
a b
0 2 14
1 4 16
2 6 18
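Per the deprecation note, the same chain with pd.DataFrame.pipe:
>>> (pd.DataFrame({"a": [1, 2, 3], "b": [7, 8, 9]})
...  .pipe(lambda df: df * 2))
   a   b
0  2  14
1  4  16
2  6  18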
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
func
|
Callable
|
A function you would like to run in the method chain. It should take one parameter and return one parameter, each being the DataFrame object. After that, do whatever you want in the middle. Go crazy. |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame. |
Source code in janitor/functions/then.py
to_datetime
Implementation source for to_datetime
.
to_datetime(df, column_name, **kwargs)
Convert column to a datetime type, in-place.
Intended to be the method-chaining equivalent of:
df[column_name] = pd.to_datetime(df[column_name], **kwargs)
This method mutates the original DataFrame.
Note
This function will be deprecated in a 1.x release.
Please use jn.transform_column
instead.
Examples:
Converting a string column to datetime type with custom format.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'date': ['20200101', '20200202', '20200303']})
>>> df
date
0 20200101
1 20200202
2 20200303
>>> df.to_datetime('date', format='%Y%m%d')
date
0 2020-01-01
1 2020-02-02
2 2020-03-03
Read the pandas documentation for to_datetime
for more information.
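Per the deprecation note, a sketch of the same conversion via jn.transform_column with a column-wise function (the DataFrame is re-created because to_datetime mutates its input):
>>> df = pd.DataFrame({'date': ['20200101', '20200202', '20200303']})
>>> df.transform_column(
...     column_name='date',
...     function=lambda s: pd.to_datetime(s, format='%Y%m%d'),
...     elementwise=False,
... )
        date
0 2020-01-01
1 2020-02-02
2 2020-03-03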
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
column_name
|
Hashable
|
Column name. |
required |
**kwargs
|
Any
|
Provide any kwargs that |
{}
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with updated datetime data. |
Source code in janitor/functions/to_datetime.py
toset
Implementation of the toset
function.
toset(series)
Return a set of the values.
Note
This function will be deprecated in a 1.x release.
Please use set(df[column])
instead.
The elements of the returned set are scalar types: Python scalars (for str, int, float) or pandas scalars (for Timestamp/Timedelta/Interval/Period).
Examples:
>>> import pandas as pd
>>> import janitor
>>> s = pd.Series([1, 2, 3, 5, 5], index=["a", "b", "c", "d", "e"])
>>> s
a 1
b 2
c 3
d 5
e 5
dtype: int64
>>> s.toset()
{1, 2, 3, 5}
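Per the deprecation note, plain Python achieves the same:
>>> set(s)
{1, 2, 3, 5}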
Parameters:
Name | Type | Description | Default |
---|---|---|---|
series
|
Series
|
A pandas series. |
required |
Returns:
Type | Description |
---|---|
Set
|
A set of values. |
Source code in janitor/functions/toset.py
transform_columns
transform_column(df, column_name, function, dest_column_name=None, elementwise=True)
Transform the given column using the provided function.
Meant to be the method-chaining equivalent of:
df[dest_column_name] = df[column_name].apply(function)
Functions can be applied in one of two ways:
- Element-wise (default;
elementwise=True
). Then, the individual column elements will be passed in as the first argument offunction
. - Column-wise (
elementwise=False
). Then,function
is expected to take in a pandas Series and return a sequence that is of identical length to the original.
If dest_column_name
is provided, then the transformation result is stored
in that column. Otherwise, the transformed result is stored under the name
of the original column.
This method does not mutate the original DataFrame.
Examples:
Transform a column in-place with an element-wise function.
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "a": [2, 3, 4],
... "b": ["area", "pyjanitor", "grapefruit"],
... })
>>> df
a b
0 2 area
1 3 pyjanitor
2 4 grapefruit
>>> df.transform_column(
... column_name="a",
... function=lambda x: x**2 - 1,
... )
a b
0 3 area
1 8 pyjanitor
2 15 grapefruit
Examples:
Transform a column in-place with an column-wise function.
>>> df.transform_column(
... column_name="b",
... function=lambda srs: srs.str[:5],
... elementwise=False,
... )
a b
0 2 area
1 3 pyjan
2 4 grape
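To keep the original column and store the result elsewhere, pass dest_column_name (the destination name here is illustrative):
>>> df.transform_column(
...     column_name="a",
...     function=lambda x: x**2 - 1,
...     dest_column_name="a_squared_minus_one",
... )
   a           b  a_squared_minus_one
0  2        area                    3
1  3   pyjanitor                    8
2  4  grapefruit                   15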
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
column_name
|
Hashable
|
The column to transform. |
required |
function
|
Callable
|
A function to apply on the column. |
required |
dest_column_name
|
Optional[str]
|
The column name to store the transformation result in. Defaults to None, which will result in the original column name being overwritten. If a name is provided here, then a new column with the transformed values will be created. |
None
|
elementwise
|
bool
|
Whether to apply the function elementwise or not.
If |
True
|
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with a transformed column. |
Source code in janitor/functions/transform_columns.py
transform_columns(df, column_names, function, suffix=None, elementwise=True, new_column_names=None)
Transform multiple columns through the same transformation.
This method does not mutate the original DataFrame.
Super syntactic sugar!
Essentially wraps transform_column
and calls it repeatedly over all column names provided.
User can optionally supply either a suffix to create a new set of columns
with the specified suffix, or provide a dictionary mapping each original
column name in column_names
to its corresponding new column name.
Note that all column names must be strings.
Examples:
log10 transform a list of columns, replacing original columns.
>>> import numpy as np
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "col1": [5, 10, 15],
... "col2": [3, 6, 9],
... "col3": [10, 100, 1_000],
... })
>>> df
col1 col2 col3
0 5 3 10
1 10 6 100
2 15 9 1000
>>> df.transform_columns(["col1", "col2", "col3"], np.log10)
col1 col2 col3
0 0.698970 0.477121 1.0
1 1.000000 0.778151 2.0
2 1.176091 0.954243 3.0
Using the suffix
parameter to create new columns.
>>> df.transform_columns(["col1", "col3"], np.log10, suffix="_log")
col1 col2 col3 col1_log col3_log
0 5 3 10 0.698970 1.0
1 10 6 100 1.000000 2.0
2 15 9 1000 1.176091 3.0
Using the new_column_names
parameter to create new columns.
>>> df.transform_columns(
... ["col1", "col3"],
... np.log10,
... new_column_names={"col1": "transform1"},
... )
col1 col2 col3 transform1
0 5 3 1.0 0.698970
1 10 6 2.0 1.000000
2 15 9 3.0 1.176091
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df
|
DataFrame
|
A pandas DataFrame. |
required |
column_names
|
Union[List[str], Tuple[str]]
|
An iterable of columns to transform. |
required |
function
|
Callable
|
A function to apply on each column. |
required |
suffix
|
Optional[str]
|
Suffix to use when creating new columns to hold the transformed values. |
None
|
elementwise
|
bool
|
Passed on to |
True
|
new_column_names
|
Optional[Dict[str, str]]
|
An explicit mapping of old column names in
|
None
|
Raises:
Type | Description |
---|---|
ValueError
|
If both |
Returns:
Type | Description |
---|---|
DataFrame
|
A pandas DataFrame with transformed columns. |
Source code in janitor/functions/transform_columns.py
truncate_datetime
Implementation of the truncate_datetime
family of functions.
truncate_datetime_dataframe(df, datepart)
Truncate times down to a user-specified precision of year, month, day, hour, minute, or second.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({
... "foo": ["xxxx", "yyyy", "zzzz"],
... "dt": pd.date_range("2020-03-11", periods=3, freq="15H"),
... })
>>> df
foo dt
0 xxxx 2020-03-11 00:00:00
1 yyyy 2020-03-11 15:00:00
2 zzzz 2020-03-12 06:00:00
>>> df.truncate_datetime_dataframe("day")
foo dt
0 xxxx 2020-03-11
1 yyyy 2020-03-11
2 zzzz 2020-03-12
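Any of the supported precisions can be used; a sketch truncating the same frame to month, where all values collapse to the first of the month:
>>> df.truncate_datetime_dataframe("month")
    foo         dt
0  xxxx 2020-03-01
1  yyyy 2020-03-01
2  zzzz 2020-03-01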
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The pandas DataFrame on which to truncate the datetimes. | required |
datepart | str | Truncation precision: one of YEAR, MONTH, DAY, HOUR, MINUTE, or SECOND. (The string is automatically upper-cased, so any casing works.) | required |
Raises:
Type | Description |
---|---|
ValueError | If an invalid datepart precision is passed in. |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame with all valid datetimes truncated down to the specified precision. |
Source code in janitor/functions/truncate_datetime.py
update_where
Function for updating values based on other column values.
update_where(df, conditions, target_column_name, target_val)
Add multiple conditions to update a column in the dataframe.
This method does not mutate the original DataFrame.
Examples:
>>> import pandas as pd
>>> import janitor
>>> data = {
... "a": [1, 2, 3, 4],
... "b": [5, 6, 7, 8],
... "c": [0, 0, 0, 0],
... }
>>> df = pd.DataFrame(data)
>>> df
a b c
0 1 5 0
1 2 6 0
2 3 7 0
3 4 8 0
>>> df.update_where(
... conditions = (df.a > 2) & (df.b < 8),
... target_column_name = 'c',
... target_val = 10
... )
a b c
0 1 5 0
1 2 6 0
2 3 7 10
3 4 8 0
>>> df.update_where( # supports pandas *query* style string expressions
... conditions = "a > 2 and b < 8",
... target_column_name = 'c',
... target_val = 10
... )
a b c
0 1 5 0
1 2 6 0
2 3 7 10
3 4 8 0
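For reference, an equivalent update can be sketched in plain pandas using boolean indexing on a copy (a non-chaining alternative, not the function's implementation):
>>> df2 = df.copy()  # update_where does not mutate, so work on a copy here too
>>> df2.loc[(df2.a > 2) & (df2.b < 8), "c"] = 10
>>> df2
a b c
0 1 5 0
1 2 6 0
2 3 7 10
3 4 8 0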
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The pandas DataFrame object. | required |
conditions | Any | Conditions used to update a target column and target value. | required |
target_column_name | Hashable | Column to be updated. If the column does not exist in the DataFrame, a new column will be created; note that entries that do not get set in the new column will be null. | required |
target_val | Any | Value to be updated. | required |
Raises:
Type | Description |
---|---|
ValueError | If conditions does not return a boolean array-like data structure. |
Returns:
Type | Description |
---|---|
DataFrame | A pandas DataFrame. |
Source code in janitor/functions/update_where.py
utils
Utility functions shared across the functions submodule.
patterns(regex_pattern)
This function converts a string into a compiled regular expression that can be used to select columns in the index or column_names arguments of the pivot_longer function.
Warning
This function is deprecated. Kindly use re.compile instead.
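As a sketch of the suggested replacement, a pattern compiled with the standard library can be passed directly (the pattern string below is illustrative):
>>> import re
>>> re.compile(r"^ht\d$")  # equivalent to the deprecated janitor.patterns(r"^ht\d$")
re.compile('^ht\\d$')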
Parameters:
Name | Type | Description | Default |
---|---|---|---|
regex_pattern | Union[str, Pattern] | String to be converted to a compiled regular expression. | required |
Returns:
Type | Description |
---|---|
Pattern | A compiled regular expression from the provided regex_pattern. |
Source code in janitor/functions/utils.py
unionize_dataframe_categories(*dataframes, column_names=None)
Given a group of dataframes that contain some categorical columns: for each categorical column present, find all the possible categories across all the dataframes that have that column, then update each dataframe's corresponding column with a new categorical object that contains the original data but carries labels for all the possible categories from every dataframe. This is useful when concatenating a list of dataframes that all share the same categorical columns into one dataframe.
If, for a given categorical column, the input dataframes do not all have at least one instance of all the possible categories, pandas will change the output dtype of that column from category to object, losing the dramatic speed gains of the categorical format.
Examples:
Usage example for concatenation of categorical column-containing dataframes:
Instead of:
concatenated_df = pd.concat([df1, df2, df3], ignore_index=True)
which can result in a category -> object conversion, use:
unionized_dataframes = unionize_dataframe_categories(df1, df2, df3)
concatenated_df = pd.concat(unionized_dataframes, ignore_index=True)
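A minimal, self-contained sketch (the column and category values here are illustrative, and this assumes the function is importable from the top-level janitor namespace, as the example above suggests):
>>> import pandas as pd
>>> import janitor
>>> df1 = pd.DataFrame({"fruit": pd.Categorical(["apple", "pear"])})
>>> df2 = pd.DataFrame({"fruit": pd.Categorical(["banana"])})
>>> fixed = janitor.unionize_dataframe_categories(df1, df2)
>>> combined = pd.concat(fixed, ignore_index=True)
>>> combined["fruit"].dtype.name  # the category dtype survives the concat
'category'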
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*dataframes | Any | The dataframes you wish to unionize the categorical objects for. | () |
column_names | Optional[Iterable[CategoricalDtype]] | If supplied, only unionize this subset of columns. | None |
Raises:
Type | Description |
---|---|
TypeError | If any of the inputs are not pandas DataFrames. |
Returns:
Type | Description |
---|---|
List[DataFrame] | A list of the category-unioned dataframes in the same order they were provided. |
Source code in janitor/functions/utils.py