Mutating data
A mutator is a function that takes a list of pandas series and a probability between zero and one, and returns a list of mutated series. The probability sets the maximum share of rows within each series that should be mutated.
If a mutator cannot reach the requested percentage, most likely because some values in a series do not satisfy the preconditions imposed by the mutator, it emits a GeckoWarning.
Gecko comes with a number of built-in mutators, which are described on this page.
They are exposed in Gecko's mutator module.
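To make the contract described above concrete, here is a minimal, dependency-free sketch of a function that behaves like a mutator. It is not part of Gecko and makes several assumptions: plain lists stand in for pandas series, uppercase_rows is a hypothetical name, and a standard Python warning stands in for GeckoWarning.

```python
import random
import warnings


def uppercase_rows(columns, probability, rand):
    """Uppercase at most `probability` of the rows in a single column."""
    (values,) = columns  # this sketch handles exactly one column
    # Precondition: a value can only be mutated if uppercasing changes it.
    eligible = [i for i, value in enumerate(values) if value != value.upper()]
    wanted = int(probability * len(values))
    if len(eligible) < wanted:
        warnings.warn(
            f"desired probability of {probability} cannot be met since only "
            f"{len(eligible)} of {len(values)} rows can be mutated"
        )
    chosen = set(rand.sample(eligible, min(wanted, len(eligible))))
    return [[v.upper() if i in chosen else v for i, v in enumerate(values)]]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = uppercase_rows([["Foobar", "foobaz", "FOOBAT"]], 1.0, random.Random(0))

print(result)       # [['FOOBAR', 'FOOBAZ', 'FOOBAT']]
print(len(caught))  # 1, because "FOOBAT" is already uppercase
```

The last value cannot be mutated, so only two of three rows change and a warning is raised instead of an error.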
Available mutators
Keyboard typos
One of the most common sources of typos is adjacent keys on a keyboard. Gecko supports loading keyboard layouts and applying typos based on them. Currently, keyboard layouts must be provided as an XML file from the Unicode CLDR repository. Gecko parses these files and determines all neighboring keys of each key, as well as their variants with and without Shift pressed.
Warning
The Unicode CLDR keyboard specification has been undergoing a major redesign since release 44. Support will be added as soon as the specification is finalized. For now, please retrieve CLDR keyboard files from a release tagged 43 or earlier. The examples in this documentation use files from CLDR release 43.
Download the German keyboard layout from the CLDR repository.
The corresponding mutator is called with_cldr_keymap_file.
Point the mutator to the file you just downloaded.
In the following example, one character in each word is substituted by another neighboring character on the German keyboard.
import pandas as pd
import numpy as np
from gecko import mutator
rng = np.random.default_rng(3141)
kb_mutator = mutator.with_cldr_keymap_file(
    "./de-t-k0-windows.xml",
    rng=rng
)
srs = pd.Series(["apple", "banana", "clementine"])
print(kb_mutator([srs], 1.0))
# => [["aople", "ganana", "clementkne"]]
By default, this mutator considers all possible neighboring keys for each key. If you want to constrain typos to a certain set of characters, you can pass an optional string of characters or a list of characters to this mutator. One such example is to limit the mutator to digits when manipulating a series of numbers that are broken up by non-digit characters. The following snippet avoids the substitution of hyphens by specifying that only digits may be manipulated.
import pandas as pd
import numpy as np
import string
from gecko import mutator
rng = np.random.default_rng(2718)
kb_mutator = mutator.with_cldr_keymap_file(
    "./de-t-k0-windows.xml",
    charset=string.digits,
    rng=rng
)
srs = pd.Series(["123-456-789", "727-727-727", "294-753-618"])
print(kb_mutator([srs], 1.0))
# => [["123-457-789", "717-727-727", "295-753-618"]]
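The idea of neighboring keys and of restricting typos to a character set can be sketched without Gecko or a CLDR file. The layout below is a hand-written, simplified German key grid (an assumption made for illustration, with no Shift variants), and build_neighbors and keyboard_typo are hypothetical helpers, not Gecko functions.

```python
import random
import string

# Simplified German layout: one string per keyboard row, no Shift variants.
ROWS = ["1234567890", "qwertzuiopü", "asdfghjklöä", "yxcvbnm"]


def build_neighbors(rows):
    """Map each key to the keys directly or diagonally next to it."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            adjacent = set()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) == (0, 0) or not 0 <= rr < len(rows):
                        continue
                    if 0 <= cc < len(rows[rr]):
                        adjacent.add(rows[rr][cc])
            neighbors[key] = adjacent
    return neighbors


def keyboard_typo(word, neighbors, rand, charset=None):
    """Replace one character with a neighboring key, optionally limited to charset."""
    candidates = []
    for i, char in enumerate(word):
        if charset is not None and char not in charset:
            continue  # characters outside the charset are never touched
        options = sorted(
            n for n in neighbors.get(char, ()) if charset is None or n in charset
        )
        if options:
            candidates.append((i, options))
    if not candidates:
        return word  # nothing can be mutated
    i, options = rand.choice(candidates)
    return word[:i] + rand.choice(options) + word[i + 1:]


NEIGHBORS = build_neighbors(ROWS)
print(sorted(NEIGHBORS["b"]))  # ['f', 'g', 'h', 'n', 'v']
print(keyboard_typo("123-456-789", NEIGHBORS, random.Random(1), charset=string.digits))
```

With the charset limited to digits, the hyphens are never selected for substitution and never appear as replacements.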
Phonetic errors
Phonetic errors are among the most challenging error sources to model. These are words that sound the same but are written differently.
In German, for example, "ß" can almost always be replaced with "ss" and still have the word that it's in sound the same. Whether one writes "Straße" or "Strasse" does not matter as far as pronunciation is concerned. The same holds for "dt" and "tt" at the end of a word, since both reflect a hard "t" sound. One can derive rules from similarly sounding character sequences.
Gecko offers a method for modelling these rules and introducing phonetic errors based on them.
A phonetic rule in Gecko consists of a source pattern ("ß", "dt"), a target pattern ("ss", "tt") and positional flags.
The flags determine whether this rule applies at the start (^), in the middle (_) or at the end ($) of a word.
These flags can be freely combined.
The absence of a positional flag implies that a rule can be applied anywhere in a string.
Taking the example from above, a suitable rule table could look like this.
Gecko exposes the with_phonetic_replacement_table function to handle these types of tables.
The call signature is similar to that of with_replacement_table.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(8844167)
phonetic_mutator = mutator.with_phonetic_replacement_table(
    "./phonetic-rules-de.csv",
    source_column="source",
    target_column="target",
    flags_column="flags",
    rng=rng,
)
srs = pd.Series(["straße", "stadt", "schießen"])
print(phonetic_mutator([srs], 1.0))
# => [["strasse", "statt", "schiessen"]]
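How the positional flags restrict a rule can be illustrated with a few lines of plain Python. This is only a sketch of the semantics described above, not Gecko's implementation. apply_phonetic_rule is a hypothetical helper, and the two rules used (ß to ss anywhere, dt to tt at the end of a word) are taken from the prose rather than from the rule file itself.

```python
def apply_phonetic_rule(word, source, target, flags=""):
    """Replace the first occurrence of source that sits at an allowed position."""
    allowed = flags or "^_$"  # no flags: the rule may apply anywhere
    start = word.find(source)
    while start != -1:
        end = start + len(source)
        at_start, at_end = start == 0, end == len(word)
        in_middle = not at_start and not at_end
        if (
            (at_start and "^" in allowed)
            or (at_end and "$" in allowed)
            or (in_middle and "_" in allowed)
        ):
            return word[:start] + target + word[end:]
        start = word.find(source, start + 1)
    return word  # the rule does not apply to this word


print(apply_phonetic_rule("straße", "ß", "ss"))          # strasse
print(apply_phonetic_rule("stadt", "dt", "tt", "$"))     # statt
print(apply_phonetic_rule("stadtrat", "dt", "tt", "$"))  # stadtrat
```

The last call leaves the word unchanged because "dt" occurs in the middle of the word while the rule is restricted to the end.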
Missing values
A textual representation of a "missing value" is sometimes used to clearly indicate that a blank or an empty value is to
be interpreted as a missing piece of information.
In datasets sourced from large databases, this "missing value" might consist of characters that do not adhere to a table
or column schema.
A simple example would be ###_MISSING_### in place of a person's date of birth, since it does not conform to any
common date format and consists entirely of letters and special characters.
Gecko provides the function with_missing_value, which replaces values within a series with a representative
"missing value".
In the following example, 50% of all rows will be converted to a missing value.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(2905)
missing_mutator = mutator.with_missing_value("###_MISSING_###", rng=rng)
srs = pd.Series(["apple", "banana", "clementine"])
print(missing_mutator([srs], 0.5))
# => [["apple", "###_MISSING_###", "###_MISSING_###"]]
Edit errors
Edit errors are caused by a set of operations on single characters within a word. There are commonly four operations that can induce these types of errors: insertion and deletion of a single character, substitution of a character with a different one, and transposition of two adjacent characters.
Gecko provides mutators for each of these operations. For insertions and substitutions, it is possible to define a set of characters to choose from.
import string
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(8080)
srs = pd.Series(["apple", "banana", "clementine"])
insert_mutator = mutator.with_insert(charset=string.ascii_lowercase, rng=rng)
print(insert_mutator([srs], 1.0))
# => [["appmle", "bananai", "clemenjtine"]]
delete_mutator = mutator.with_delete(rng=rng)
print(delete_mutator([srs], 1.0))
# => [["appl", "anana", "clemntine"]]
substitute_mutator = mutator.with_substitute(charset=string.ascii_lowercase, rng=rng)
print(substitute_mutator([srs], 1.0))
# => [["apyle", "sanana", "clemeneine"]]
transpose_mutator = mutator.with_transpose(rng=rng)
print(transpose_mutator([srs], 1.0))
# => [["paple", "banaan", "cleemntine"]]
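The four operations themselves are simple string manipulations. The following sketch spells them out in plain Python with explicit positions instead of random ones. The helper names are hypothetical, and the positions are chosen so that the results match the first value of each output above.

```python
def insert_at(word, index, char):
    """Insert char before the character at index."""
    return word[:index] + char + word[index:]


def delete_at(word, index):
    """Remove the character at index."""
    return word[:index] + word[index + 1:]


def substitute_at(word, index, char):
    """Replace the character at index with char."""
    return word[:index] + char + word[index + 1:]


def transpose_at(word, index):
    """Swap the characters at index and index + 1."""
    return word[:index] + word[index + 1] + word[index] + word[index + 2:]


print(insert_at("apple", 3, "m"))      # appmle
print(delete_at("apple", 4))           # appl
print(substitute_at("apple", 2, "y"))  # apyle
print(transpose_at("apple", 0))        # paple
```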
Categorical errors
Sometimes an attribute can only take on a set number of values.
For example, if you have a "gender" column in your dataset, and it can only take on m for male, f for female and o for other, it wouldn't make sense for a mutated record to contain anything else except these three options.
Gecko offers the with_categorical_values function for this purpose.
It sources all possible options from a column in a CSV file and then applies random replacements respecting the limited
available options.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(22)
srs = pd.Series(["f", "m", "f", "f", "o", "m", "o", "o"])
categorical_mutator = mutator.with_categorical_values(
    "./gender.csv",  # CSV file containing "gender" column with "f", "m" and "o" as possible values
    value_column="gender",
    rng=rng,
)
print(categorical_mutator([srs], 1.0))
# => [["m", "o", "m", "m", "m", "f", "m", "f"]]
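The contents of gender.csv are only described in a comment above. As a sketch, such a file could be created with Python's csv module and used to pick a different category for each value. The file layout (a single "gender" column) is assumed from that comment, and replace_category is a hypothetical helper that only mirrors the idea of a categorical replacement.

```python
import csv
import random
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "gender.csv"
    # Assumed file contents: one "gender" column holding the three allowed values.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([["gender"], ["f"], ["m"], ["o"]])
    with csv_path.open(newline="", encoding="utf-8") as f:
        categories = sorted({row["gender"] for row in csv.DictReader(f)})


def replace_category(value, categories, rand):
    """Pick a different value from the same set of categories."""
    return rand.choice([c for c in categories if c != value])


rand = random.Random(22)
values = ["f", "m", "f", "f", "o", "m", "o", "o"]
mutated = [replace_category(v, categories, rand) for v in values]
print(categories)  # ['f', 'm', 'o']
print(mutated)
```

Every mutated value stays within the three categories and differs from the value it replaces.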
Value permutations
Certain types of information are easily confused with others.
This is particularly true for names, where the differentiation between given and last names in a non-native language is
challenging to get right.
The with_permute function handles this exact use case.
It simply swaps the values between series that are passed into it.
In this example, 50% of all rows are permuted at random.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(1955104)
srs_given_name = pd.Series(["Max", "Jane", "Jan"])
srs_last_name = pd.Series(["Mustermann", "Doe", "Jansen"])
permute_mutator = mutator.with_permute(rng=rng)
print(permute_mutator([srs_given_name, srs_last_name], 0.5))
# => [["Max", "Doe", "Jan"],
# ["Mustermann", "Jane", "Jansen"]]
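Swapping values between series amounts to exchanging the entries of selected rows. A minimal sketch with plain lists, using the hypothetical helper swap_rows and a fixed row selection instead of a random one, reproduces the output shown above.

```python
def swap_rows(left, right, rows):
    """Swap the values of two equally long columns at the given row indices."""
    left, right = list(left), list(right)  # copy so the inputs stay untouched
    for i in rows:
        left[i], right[i] = right[i], left[i]
    return [left, right]


given = ["Max", "Jane", "Jan"]
last = ["Mustermann", "Doe", "Jansen"]
print(swap_rows(given, last, rows=[1]))
# [['Max', 'Doe', 'Jan'], ['Mustermann', 'Jane', 'Jansen']]
```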
Common replacements
Various other error sources, such as optical character recognition (OCR) errors, can be modeled using simple replacement tables. These tables have a source and a target column, defining mappings between character sequences.
The with_replacement_table function achieves just that.
Suppose you have the following CSV file with common OCR errors.
You can use this file in the same way as with many other generation and mutation functions in Gecko.
Specifying the inline flag ensures that replacements are performed within words.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(6379)
srs = pd.Series(["kick 0", "step 1", "go 2", "run 5"])
replacement_mutator = mutator.with_replacement_table(
    "./ocr.csv",
    inline=True,
    rng=rng,
)
print(replacement_mutator([srs], 1.0))
# => ["lcick 0", "step |", "go z", "run s"]
To only replace whole words, leave out the inline flag or set it to False.
One use case is to replace names that sound or seem similar.
Assuming the table shown above, one could perform randomized replacements using this mutator like so.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(6379)
srs = pd.Series(["Jan", "Jann", "Juan"])
replacement_mutator = mutator.with_replacement_table(
    "./given-names.csv",
    rng=rng,
)
print(replacement_mutator([srs], 1.0))
# => GeckoWarning: with_replacement_table: desired probability of 1.0 cannot be met since percentage of rows that could possibly be mutated is 0.6666666666666666
# => ["Jann", "Juan", "Juan"]
Note how "Juan" is not replaced since it is only present in the "target" column, not the "source" column.
By default, this mutator only considers replacement from the "source" to the "target" column.
If it should also consider reverse replacements, set the reverse flag.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(6379)
srs = pd.Series(["Jan", "Jann", "Juan"])
replacement_mutator = mutator.with_replacement_table(
    "./given-names.csv",
    reverse=True,
    rng=rng,
)
print(replacement_mutator([srs], 1.0))
# => ["Jann", "Johann", "Jann"]
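The difference between inline, whole-word and reverse replacements can be sketched in plain Python. The rows below are an assumption: they are inferred from the example outputs in this section and are not a copy of the CSV files, which may contain more entries. replace_from_table is a hypothetical helper. Unlike Gecko, which chooses replacements at random, it simply applies the first matching row.

```python
def replace_from_table(value, rows, inline=False, reverse=False):
    """Apply the first matching (source, target) row to value."""
    if reverse:
        # Also allow replacements from the target back to the source.
        rows = rows + [(target, source) for source, target in rows]
    for source, target in rows:
        if inline and source in value:
            return value.replace(source, target, 1)  # first occurrence only
        if not inline and value == source:
            return target
    return value  # no row applies


# Rows inferred from the example outputs in this section (assumption).
ocr_rows = [("k", "lc"), ("1", "|"), ("2", "z"), ("5", "s")]
name_rows = [("Jan", "Jann"), ("Jann", "Juan")]

ocr_values = ["kick 0", "step 1", "go 2", "run 5"]
print([replace_from_table(v, ocr_rows, inline=True) for v in ocr_values])
# ['lcick 0', 'step |', 'go z', 'run s']
print([replace_from_table(v, name_rows) for v in ["Jan", "Jann", "Juan"]])
# ['Jann', 'Juan', 'Juan']
print(replace_from_table("Juan", name_rows, reverse=True))
# Jann
```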
Regex replacements
Where the phonetic and generic replacement mutators do not fit the bill, replacements using regular expressions might
come in handy.
with_regex_replacement_table supports the application of mutations based on regular expressions.
This mutator works off of CSV files which contain the regular expression patterns to look for and the substitutions
to perform as columns.
Warning
Before using this mutator, make sure that with_phonetic_replacement_table and with_replacement_table are not
suitable for your use case.
These functions are more optimised, whereas with_regex_replacement_table has to perform
replacements on a mostly row-by-row basis, which impacts performance.
Let's assume that you want to perform mutations on a column containing dates where the digits of certain days should be flipped. A CSV file that is capable of these mutations could look as follows.
A mutator using the CSV file above would look for dates that have "10", "20" or "30" in their "day" field and flip the digits to "01", "02" and "03" respectively. This is done by placing a capture group around the "day" field in the regular expression. Since it is the first capture group, once a row matches, Gecko will look up the substitution in the column labelled "1" in the CSV file. This also works when using named capture groups, in which case Gecko will use the name of the capture group to look up substitutions.
Substitutions may also reference named capture groups. Suppose you want to flip the least significant digit of the "day" and "month" field under certain conditions. A CSV file capable of performing this type of substitution looks as follows.
with_regex_replacement_table works much like its "phonetic" and "common" siblings in that it requires a path to a CSV
file as shown above and the name of the column containing the regex patterns to look for.
The columns containing the substitution values are inferred at runtime.
In the following snippet, the second example using named capture groups to flip the digits in the day field is shown.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(0x2321)
srs = pd.Series(["2020-01-30", "2020-01-20", "2020-01-10"])
regex_mutator = mutator.with_regex_replacement_table(
    "./dob-day-digit-flip.csv",
    pattern_column="pattern",
    rng=rng
)
print(regex_mutator([srs], 1.0))
# => ["2020-01-03", "2020-01-02", "2020-01-01"]
It is also possible to define a column that contains regex flags.
Currently, Gecko supports the ASCII and IGNORECASE flags, which can be applied by adding a and i respectively
to the flag column.
In the following snippet, case-insensitive matching will be performed. This causes all rows of the input series to be modified.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(0xCAFED00D)
srs = pd.Series(["foobar", "Foobar", "fOoBaR"])
regex_mutator = mutator.with_regex_replacement_table(
    "./foobar.csv",
    pattern_column="pattern",
    flags_column="flags",
    rng=rng
)
print(regex_mutator([srs], 1.0))
# => ["foobaz", "Foobaz", "fOoBaz"]
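The lookup of substitutions by capture group can be illustrated with Python's re module. The patterns below are assumptions chosen to reproduce the outputs in this section; they are not taken from the CSV files. substitute_groups is a hypothetical helper, not a Gecko function. Its substitutions argument plays the role of the CSV columns named after capture groups.

```python
import re


def substitute_groups(value, pattern, substitutions, flags=0):
    """Replace the text matched by each listed capture group."""
    match = re.search(pattern, value, flags)
    if match is None:
        return value  # pattern does not match, value stays untouched
    spans = []
    for name, replacement in substitutions.items():
        group = int(name) if name.isdigit() else name  # "1" or a group name
        start, end = match.span(group)
        if start != -1:  # skip groups that did not take part in the match
            spans.append((start, end, replacement))
    # Replace from right to left so earlier positions stay valid.
    for start, end, replacement in sorted(spans, reverse=True):
        value = value[:start] + replacement + value[end:]
    return value


print(substitute_groups("2020-01-30", r"\d{4}-\d{2}-(30)", {"1": "03"}))
# 2020-01-03
print(substitute_groups("2020-01-10", r"\d{4}-\d{2}-(?P<day>10)", {"day": "01"}))
# 2020-01-01
print(substitute_groups("fOoBaR", r"fooba(r)", {"1": "z"}, re.IGNORECASE))
# fOoBaz
```

Without the IGNORECASE flag, the last pattern would not match "fOoBaR" and the value would be returned unchanged.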
Case conversions
During data entry or normalization, it may occur that text is converted to all lowercase or uppercase, by accident or on
purpose.
with_lowercase and with_uppercase handle these use cases.
In the following example, Gecko outputs warnings because some values in the original series are already all lowercase or all uppercase.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(9_0000_8479_1716)
srs = pd.Series(["Foobar", "foobaz", "FOOBAT"])
lowercase_mutator = mutator.with_lowercase(rng=rng)
uppercase_mutator = mutator.with_uppercase(rng=rng)
print(lowercase_mutator([srs], 1.0))
# => GeckoWarning: with_lowercase: desired probability of 1.0 cannot be met since percentage of rows that could possibly be mutated is 0.6666666666666666
# => ["foobar", "foobaz", "foobat"]
print(uppercase_mutator([srs], 1.0))
# => GeckoWarning: with_uppercase: desired probability of 1.0 cannot be met since percentage of rows that could possibly be mutated is 0.6666666666666666
# => ["FOOBAR", "FOOBAZ", "FOOBAT"]
Date and time offsets
Date and time information is prone to errors where single fields are offset by a few units.
This error source is implemented in the with_datetime_offset function.
It requires the maximum amount by which a field can be offset, the unit of the offset and the format of the data to mutate, expressed
using Python's format codes for datetime objects.
It is possible to apply offsets in units of days (d), hours (h), minutes (m) and seconds (s).
import numpy as np
import pandas as pd
from gecko import mutator
srs = pd.Series(pd.date_range("2020-01-01", "2020-01-31", freq="D"))
rng = np.random.default_rng(0xffd8)
datetime_mutator = mutator.with_datetime_offset(
    max_delta=5, unit="d", dt_format="%Y-%m-%d", rng=rng
)
print(datetime_mutator([srs], 1.0))
# => ["2019-12-30", "2019-12-29", ..., "2020-02-02", "2020-01-29"]
When applying offsets, the offset applied to a single field might affect another field. For example, subtracting a day from January 1st, 2020 wraps around to December 31st, 2019. If this is not desired, Gecko offers an extra flag that prevents these wraparounds at the cost of leaving affected rows untouched. Note how the first and last entries in the output of the snippet below remain unchanged when compared to the previous snippet. Gecko outputs a warning to reflect this.
import numpy as np
import pandas as pd
from gecko import mutator
srs = pd.Series(pd.date_range("2020-01-01", "2020-01-31", freq="D"))
rng = np.random.default_rng(0xffd8)
datetime_mutator = mutator.with_datetime_offset(
    max_delta=5, unit="d", dt_format="%Y-%m-%d", prevent_wraparound=True, rng=rng
)
print(datetime_mutator([srs], 1.0))
# => GeckoWarning: with_datetime_offset: desired probability of 1.0 cannot be met since percentage of rows that could possibly be mutated is 0.9032258064516129
# => ["2020-01-01", "2020-01-02", ..., "2020-01-30", "2020-01-29"]
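What a wraparound means can be shown with Python's datetime module. The following sketch is not Gecko's implementation. offset_datetime is a hypothetical helper, and it treats an offset as a wraparound whenever any field larger than the offset unit changes, which is one plausible reading of the behaviour described above.

```python
from datetime import datetime, timedelta

_TIMEDELTA_ARGS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
# Fields that must stay the same for an offset in the given unit (assumption).
_LARGER_FIELDS = {
    "d": ("year", "month"),
    "h": ("year", "month", "day"),
    "m": ("year", "month", "day", "hour"),
    "s": ("year", "month", "day", "hour", "minute"),
}


def offset_datetime(value, delta, unit, dt_format, prevent_wraparound=False):
    """Shift a formatted date by delta units, optionally refusing wraparounds."""
    original = datetime.strptime(value, dt_format)
    shifted = original + timedelta(**{_TIMEDELTA_ARGS[unit]: delta})
    if prevent_wraparound:
        for field in _LARGER_FIELDS[unit]:
            if getattr(original, field) != getattr(shifted, field):
                return value  # leave the row untouched
    return shifted.strftime(dt_format)


print(offset_datetime("2020-01-01", -1, "d", "%Y-%m-%d"))
# 2019-12-31
print(offset_datetime("2020-01-01", -1, "d", "%Y-%m-%d", prevent_wraparound=True))
# 2020-01-01
print(offset_datetime("2020-01-15", 3, "d", "%Y-%m-%d", prevent_wraparound=True))
# 2020-01-18
```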
Repeated values
Erroneous copy-paste operations may yield an unwanted duplication of values.
This is implemented in Gecko's with_repeat mutator.
By default, it appends values with a space, but a custom joining character can be defined as well.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(1_7117_4268)
srs = pd.Series(["foo", "bar", "baz"])
repeat_mutator = mutator.with_repeat(rng=rng)
repeat_mutator_no_space = mutator.with_repeat(join_with="")
print(repeat_mutator([srs], 1.0))
# => ["foo foo", "bar bar", "baz baz"]
print(repeat_mutator_no_space([srs], 1.0))
# => ["foofoo", "barbar", "bazbaz"]
Using generators
with_generator can leverage Gecko's generators to prepend, append or replace data.
For instance, this can be used for emulating compound names for persons who have more than one given or last name.
By default, this function adds a space when prepending or appending generated data, but this can be customized.
import numpy as np
import pandas as pd
from gecko import mutator
def generate_foobar_suffix(rand: np.random.Generator):
    def _generate(count: int) -> list[pd.Series]:
        return [pd.Series(rand.choice(("bar", "baz", "bat"), size=count))]
    return _generate
srs = pd.Series(["foo"] * 100)
rng = np.random.default_rng(0x25504446)
gen_prepend_mutator = mutator.with_generator(generate_foobar_suffix(rng), "prepend")
print(gen_prepend_mutator([srs], 1.0))
# => ["bat foo", "bar foo", ..., "baz foo", "baz foo"]
gen_replace_mutator = mutator.with_generator(generate_foobar_suffix(rng), "replace")
print(gen_replace_mutator([srs], 1.0))
# => ["bar", "bar", ..., "baz", "bat"]
gen_append_mutator = mutator.with_generator(generate_foobar_suffix(rng), "append", join_with="")
print(gen_append_mutator([srs], 1.0))
# => ["foobat", "foobat", ..., "foobat", "foobaz"]
# {} can be used as a placeholder for generated values
gen_append_placeholder_mutator = mutator.with_generator(generate_foobar_suffix(rng), "append", join_with=" ({})")
print(gen_append_placeholder_mutator([srs], 1.0))
# => ["foo (bar)", "foo (baz)", ..., "foo (bat)", "foo (bar)"]
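How the mode and the join_with argument shape a single value can be sketched as follows. combine is a hypothetical helper written to match the outputs above. The handling of the {} placeholder when prepending is an assumption, since the examples only show it when appending.

```python
def combine(original, generated, mode, join_with=" "):
    """Combine an original value with a generated one."""
    if mode == "replace":
        return generated
    if "{}" in join_with:
        # The placeholder marks where the generated value goes.
        generated, join_with = join_with.format(generated), ""
    if mode == "prepend":
        return generated + join_with + original
    if mode == "append":
        return original + join_with + generated
    raise ValueError(f"unknown mode: {mode}")


print(combine("foo", "bat", "prepend"))               # bat foo
print(combine("foo", "bar", "replace"))               # bar
print(combine("foo", "bat", "append", join_with=""))  # foobat
print(combine("foo", "bar", "append", " ({})"))       # foo (bar)
```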
Grouped mutators
When applying mutators that are mutually exclusive, with_group can be used.
It can take a list of mutators or a list of weighted mutators as arguments.
When providing a list of mutators, all mutators are applied with equal probability.
When using weighted mutators, each mutator is applied with its assigned probability.
import numpy as np
import pandas as pd
from gecko import mutator
rng = np.random.default_rng(123)
srs = pd.Series(["a"] * 100)
equal_prob_mutator = mutator.with_group([
    mutator.with_insert(rng=rng),
    mutator.with_delete(rng=rng),
], rng=rng)
(srs_mut_1,) = equal_prob_mutator([srs], 1.0)
print(srs_mut_1.str.len().value_counts())
# => { 0: 44, 2: 56 }
# no more single character values remain
weighted_prob_mutator = mutator.with_group([
    (.25, mutator.with_insert(rng=rng)),
    (.25, mutator.with_delete(rng=rng)),
], rng=rng)
(srs_mut_2,) = weighted_prob_mutator([srs], 1.0)
print(srs_mut_2.str.len().value_counts())
# => { 0: 25, 1: 51, 2: 24 }
# half of the original single character values remain
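Mutual exclusivity means that each row is handed to at most one mutator of the group. A rough sketch of such an assignment, using mutator names instead of real mutators, could look like this. assign_mutators is a hypothetical helper, and drawing one weighted choice per row is an assumption about how the selection might work.

```python
import random
from collections import Counter


def assign_mutators(row_count, weighted_names, rand):
    """Assign each row to exactly one mutator name, or to None (untouched)."""
    names = [name for _, name in weighted_names]
    weights = [weight for weight, _ in weighted_names]
    remaining = 1.0 - sum(weights)
    if remaining > 0:
        names.append(None)  # None marks rows that stay untouched
        weights.append(remaining)
    return rand.choices(names, weights=weights, k=row_count)


weighted = assign_mutators(100, [(0.25, "insert"), (0.25, "delete")], random.Random(123))
equal = assign_mutators(100, [(0.5, "insert"), (0.5, "delete")], random.Random(123))
print(Counter(weighted))  # roughly a quarter each for insert and delete
print(Counter(equal))     # every row is assigned to insert or delete
```

Because each row receives a single label, no row is both inserted into and deleted from, which matches the length counts in the example above.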
Multiple mutators
Using mutate_data_frame, you can apply multiple mutators on many columns at once.
It is possible to set probabilities for each mutator, as well as to define multiple mutators per column.
import string
import numpy as np
import pandas as pd
from gecko import mutator
df = pd.DataFrame(
    {
        "fruit": ["apple", "banana", "orange"],
        "type": ["elstar", "cavendish", "mandarin"],
        "weight_in_grams": ["241.0", "195.6", "71.1"],
        "amount": ["3", "5", "6"],
        "grade": ["B", "C", "B"],
    }
)
rng = np.random.default_rng(25565)
df_mutated = mutator.mutate_data_frame(df, [
    (("fruit", "type"), (.5, mutator.with_permute())),  # (1)
    ("grade", [  # (2)
        mutator.with_substitute(charset=string.ascii_uppercase, rng=rng),
    ]),
    ("amount", [  # (3)
        (.8, mutator.with_insert(charset=string.digits, rng=rng)),
        (.2, mutator.with_delete(rng=rng))
    ])
])
print(df_mutated)
# => [["fruit", "type", "weight_in_grams", "amount", "grade"],
# ["elstar", "apple", "241.0", "53", "M"],
# ["cavendish", "banana", "195.6", "59", "Q"],
# ["mandarin", "orange", "71.1", "68", "V"]]
1. You can assign a probability to a mutator for a column. In this case, the permutation mutator will be applied to 50% of all records. The remaining 50% remain untouched.
2. You can assign multiple mutators to a column. In this case, all specified mutators will be applied to all rows.
3. You can assign probabilities to multiple mutators for a column. In this case, the insertion and deletion mutators are applied to 80% and 20% of all records respectively.