Mutating data

A mutator is a function that takes a list of pandas series and a probability between zero and one, and returns a list of mutated series. The probability sets the maximum percentage of rows within each series that should be mutated.

If a mutator cannot reach the requested percentage, it emits a GeckoWarning. This usually happens when some values in a series do not satisfy the preconditions imposed by the mutator.
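To make the contract above concrete, here is a minimal sketch of a hand-written function with the same shape. It does not use Gecko itself: `make_uppercase_mutator` is a hypothetical name, the row selection is only one possible reading of "maximum percentage", and a plain Python warning stands in for GeckoWarning.

```python
import warnings

import numpy as np
import pandas as pd


def make_uppercase_mutator(rng: np.random.Generator):
    """Build a toy mutator that uppercases a share of rows (illustrative only)."""

    def _mutate(srs_lst: list[pd.Series], p: float) -> list[pd.Series]:
        out = []
        for srs in srs_lst:
            # Precondition: only rows that are not already uppercase can change.
            can_change = (srs != srs.str.upper()).to_numpy()
            candidates = srs.index[can_change].to_numpy()
            wanted = int(len(srs) * p)
            if wanted > len(candidates):
                # Stand-in for the GeckoWarning described above.
                warnings.warn(
                    f"requested share {p} cannot be met, only "
                    f"{len(candidates) / len(srs)} of rows can be mutated"
                )
                wanted = len(candidates)
            chosen = rng.choice(candidates, size=wanted, replace=False)
            srs_out = srs.copy()
            srs_out.loc[chosen] = srs_out.loc[chosen].str.upper()
            out.append(srs_out)
        return out

    return _mutate


original = pd.Series(["foo", "BAR", "baz", "qux"])
toy_mutator = make_uppercase_mutator(np.random.default_rng(0))
(result,) = toy_mutator([original], 0.5)
print(result.tolist())
```

The factory pattern mirrors how the built-in mutators on this page are created first and called with a list of series afterwards.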

Gecko comes with a number of built-in mutators, which are described on this page. They are exposed in Gecko's mutator module.

Available mutators

Keyboard typos

Adjacent keys on a keyboard are one of the most common sources of typos. Gecko supports loading keyboard layouts and applying typos based on them. Currently, keyboard layouts must be provided as an XML file from the Unicode CLDR repository. Gecko parses these files and determines all neighboring keys of each key, as well as their variants with and without Shift pressed.
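The idea of neighboring keys can be sketched without any layout file. The snippet below is not Gecko's parser: `ROWS` is a hypothetical three-row excerpt of a German keyboard, and the sketch ignores row stagger and Shift variants, both of which a real CLDR layout would cover.

```python
# Hypothetical excerpt of a German keyboard; real layouts come from CLDR files.
ROWS = ["qwertzuiop", "asdfghjkl", "yxcvbnm"]


def neighbors(rows: list[str]) -> dict[str, set[str]]:
    """Map each key to the keys directly left, right, above and below it."""
    result: dict[str, set[str]] = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            adjacent = set()
            for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
                rr, cc = r + dr, c + dc
                # Skip positions that fall outside the grid.
                if 0 <= rr < len(rows) and 0 <= cc < len(rows[rr]):
                    adjacent.add(rows[rr][cc])
            result[key] = adjacent
    return result


key_neighbors = neighbors(ROWS)
print(sorted(key_neighbors["g"]))
# => ['b', 'f', 'h', 't']
```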

Warning

The Unicode CLDR keyboard specification has been undergoing a major redesign since release 44. Support will be added as soon as the specification is finalized. For now, please retrieve CLDR keyboard files from a release tagged 43 or earlier. The examples in this documentation use files from CLDR release 43.

Download the German keyboard layout from the CLDR repository. The corresponding mutator is called with_cldr_keymap_file. Point the mutator to the file you just downloaded. In the following example, one character in each word is substituted by another neighboring character on the German keyboard.

import pandas as pd
import numpy as np

from gecko import mutator

rng = np.random.default_rng(3141)
kb_mutator = mutator.with_cldr_keymap_file(
    "./de-t-k0-windows.xml",
    rng=rng
)
srs = pd.Series(["apple", "banana", "clementine"])
print(kb_mutator([srs], 1.0))
# => [["aople", "ganana", "clementkne"]]

By default, this mutator considers all possible neighboring keys for each key. If you want to constrain typos to a certain set of characters, you can pass an optional string of characters or a list of characters to this mutator. One such example is to limit the mutator to digits when manipulating a series of numbers that are broken up by non-digit characters. The following snippet avoids the substitution of hyphens by specifying that only digits may be manipulated.

import pandas as pd
import numpy as np
import string

from gecko import mutator

rng = np.random.default_rng(2718)
kb_mutator = mutator.with_cldr_keymap_file(
    "./de-t-k0-windows.xml",
    charset=string.digits,
    rng=rng
)
srs = pd.Series(["123-456-789", "727-727-727", "294-753-618"])
print(kb_mutator([srs], 1.0))
# => [["123-457-789", "717-727-727", "295-753-618"]]

Phonetic errors

Phonetic errors are one of the most challenging error sources to model. They occur when words sound the same but are written differently.

In German, for example, "ß" can almost always be replaced with "ss" without changing how the word sounds. Whether one writes "Straße" or "Strasse" does not matter as far as pronunciation is concerned. The same holds for "dt" and "tt" at the end of a word, since both reflect a hard "t" sound. One can derive rules from such similar-sounding character sequences.

Gecko offers a method for modelling these rules and introducing phonetic errors based on them. A phonetic rule in Gecko consists of a source pattern ("ß", "dt"), a target pattern ("ss", "tt") and positional flags. The flags determine whether this rule applies at the start (^), in the middle (_) or the end ($) of a word. These flags can be freely combined. The absence of a positional flag implies that a rule can be applied anywhere in a string. Taking the example from above, a suitable rule table could look like this.

CSV

source,target,flags
ß,ss,
dt,tt,$

Table

Source	Target	Flags
ß	ss
dt	tt	$
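How the positional flags restrict a rule can be sketched in plain Python. This is an illustration under assumptions, not Gecko's implementation: `apply_rule` is a hypothetical helper, and replacing only the first eligible occurrence is a choice made for this sketch.

```python
def apply_rule(word: str, source: str, target: str, flags: str = "") -> str:
    """Replace the first occurrence of source at a position the flags allow."""
    start = word.find(source)
    while start != -1:
        end = start + len(source)
        # Classify where this occurrence sits within the word.
        position = set()
        if start == 0:
            position.add("^")
        if end == len(word):
            position.add("$")
        if not position:
            position.add("_")
        # No flags means the rule applies anywhere.
        if not flags or position & set(flags):
            return word[:start] + target + word[end:]
        start = word.find(source, start + 1)
    return word


print(apply_rule("straße", "ß", "ss"))
# => strasse
print(apply_rule("stadt", "dt", "tt", "$"))
# => statt
```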

Gecko exposes the with_phonetic_replacement_table function to handle these types of tables. The call signature is similar to that of with_replacement_table.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(8844167)

phonetic_mutator = mutator.with_phonetic_replacement_table(
    "./phonetic-rules-de.csv",
    source_column="source",
    target_column="target",
    flags_column="flags",
    rng=rng,
)

srs = pd.Series(["straße", "stadt", "schießen"])
print(phonetic_mutator([srs], 1.0))
# => [["strasse", "statt", "schiessen"]]

Missing values

A textual representation of a "missing value" is sometimes used to clearly indicate that a blank or an empty value is to be interpreted as a missing piece of information. In datasets sourced from large databases, this "missing value" might consist of characters that do not adhere to a table or column schema. A simple example would be ###_MISSING_### in place of a person's date of birth, since it does not conform to any common date format and consists entirely of letters and special characters.

Gecko provides the function with_missing_value which replaces values within a series with a representative "missing value". In the following example, 50% of all rows will be converted to a missing value.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(2905)

missing_mutator = mutator.with_missing_value("###_MISSING_###", rng=rng)
srs = pd.Series(["apple", "banana", "clementine"])
print(missing_mutator([srs], 0.5))
# => [["apple", "###_MISSING_###", "###_MISSING_###"]]
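The effect of this mutator can be approximated with pandas alone. The sketch below is not Gecko's code: `replace_with_missing` is a hypothetical helper, and deciding each row independently with probability p is an assumption, since this page does not describe how rows are selected.

```python
import numpy as np
import pandas as pd


def replace_with_missing(
    srs: pd.Series, p: float, value: str, rng: np.random.Generator
) -> pd.Series:
    """Overwrite rows with a missing-value marker, each with probability p."""
    # Draw one uniform number in [0, 1) per row and compare it against p.
    selected = rng.random(len(srs)) < p
    srs_out = srs.copy()
    srs_out[selected] = value
    return srs_out


fruits = pd.Series(["apple", "banana", "clementine"])
half_missing = replace_with_missing(fruits, 0.5, "###_MISSING_###", np.random.default_rng(2905))
print(half_missing.tolist())
```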

Edit errors

Edit errors are caused by a set of operations on single characters within a word. There are commonly four operations that can induce these types of errors: insertion and deletion of a single character, substitution of a character with a different one, and transposition of two adjacent characters.

Gecko provides mutators for each of these operations. For insertions and substitutions, it is possible to define a set of characters to choose from.

import string

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(8080)
srs = pd.Series(["apple", "banana", "clementine"])

insert_mutator = mutator.with_insert(charset=string.ascii_lowercase, rng=rng)
print(insert_mutator([srs], 1.0))
# => [["appmle", "bananai", "clemenjtine"]]

delete_mutator = mutator.with_delete(rng=rng)
print(delete_mutator([srs], 1.0))
# => [["appl", "anana", "clemntine"]]

substitute_mutator = mutator.with_substitute(charset=string.ascii_lowercase, rng=rng)
print(substitute_mutator([srs], 1.0))
# => [["apyle", "sanana", "clemeneine"]]

transpose_mutator = mutator.with_transpose(rng=rng)
print(transpose_mutator([srs], 1.0))
# => [["paple", "banaan", "cleemntine"]]
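For reference, the four edit operations can be written as plain string functions. These helpers are hypothetical and take an explicit position, whereas Gecko's mutators pick positions and characters at random.

```python
def insert_at(word: str, pos: int, char: str) -> str:
    """Insert char before the character at index pos."""
    return word[:pos] + char + word[pos:]


def delete_at(word: str, pos: int) -> str:
    """Remove the character at index pos."""
    return word[:pos] + word[pos + 1:]


def substitute_at(word: str, pos: int, char: str) -> str:
    """Replace the character at index pos with char."""
    return word[:pos] + char + word[pos + 1:]


def transpose_at(word: str, pos: int) -> str:
    """Swap the characters at index pos and pos + 1."""
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


print(insert_at("apple", 3, "m"))      # => appmle
print(delete_at("apple", 4))           # => appl
print(substitute_at("apple", 2, "y"))  # => apyle
print(transpose_at("apple", 0))        # => paple
```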

Categorical errors

Sometimes an attribute can only take on a set number of values. For example, if you have a "gender" column in your dataset, and it can only take on m for male, f for female and o for other, it wouldn't make sense for a mutated record to contain anything else except these three options.

Gecko offers the with_categorical_values function for this purpose. It sources all possible options from a column in a CSV file and then applies random replacements respecting the limited available options.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(22)
srs = pd.Series(["f", "m", "f", "f", "o", "m", "o", "o"])

categorical_mutator = mutator.with_categorical_values(
    "./gender.csv",  # CSV file containing "gender" column with "f", "m" and "o" as possible values
    value_column="gender",
    rng=rng,
)

print(categorical_mutator([srs], 1.0))
# => [["m", "o", "m", "m", "m", "f", "m", "f"]]
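The constraint behind categorical replacements can be sketched as follows. `swap_category` is a hypothetical helper rather than Gecko's implementation, and it assumes that a mutated value must always differ from the original one, which matches the sample output above.

```python
import numpy as np
import pandas as pd


def swap_category(
    srs: pd.Series, categories: list[str], rng: np.random.Generator
) -> pd.Series:
    """Replace every value with a different value from the same category set."""

    def _pick(value: str) -> str:
        # Exclude the current value so that the row actually changes.
        others = [c for c in categories if c != value]
        return str(rng.choice(others))

    return srs.map(_pick)


genders = pd.Series(["f", "m", "f", "f", "o", "m", "o", "o"])
swapped = swap_category(genders, ["f", "m", "o"], np.random.default_rng(22))
print(swapped.tolist())
```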

Value permutations

Certain types of information are easily confused with others. This is particularly true for names, where the differentiation between given and last names in a non-native language is challenging to get right. The with_permute function handles this exact use case. It simply swaps the values between series that are passed into it. In this example, 50% of all rows are permuted at random.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(1955104)

srs_given_name = pd.Series(["Max", "Jane", "Jan"])
srs_last_name = pd.Series(["Mustermann", "Doe", "Jansen"])

permute_mutator = mutator.with_permute(rng=rng)
print(permute_mutator([srs_given_name, srs_last_name], 0.5))
# => [["Max", "Doe", "Jan"],
#       ["Mustermann", "Jane", "Jansen"]]
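The swap itself is simple to express with pandas. In this sketch the rows to permute are passed in explicitly through the hypothetical `swap_rows` helper, whereas Gecko selects them at random.

```python
import pandas as pd


def swap_rows(
    srs_a: pd.Series, srs_b: pd.Series, rows: list[int]
) -> tuple[pd.Series, pd.Series]:
    """Exchange the values of two series at the given row positions."""
    a_out, b_out = srs_a.copy(), srs_b.copy()
    a_out.iloc[rows] = srs_b.iloc[rows].to_numpy()
    b_out.iloc[rows] = srs_a.iloc[rows].to_numpy()
    return a_out, b_out


given = pd.Series(["Max", "Jane", "Jan"])
last = pd.Series(["Mustermann", "Doe", "Jansen"])
given_out, last_out = swap_rows(given, last, [1])
print(given_out.tolist())  # => ['Max', 'Doe', 'Jan']
print(last_out.tolist())   # => ['Mustermann', 'Jane', 'Jansen']
```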

Common replacements

Various other error sources, such as optical character recognition (OCR) errors, can be modeled using simple replacement tables. These tables have a source and a target column, defining mappings between character sequences.

The with_replacement_table function achieves just that. Suppose you have the following CSV file with common OCR errors.

k,lc
5,s
2,z
1,|

You can use this file in the same way as with many other generation and mutation functions in Gecko. Specifying the inline flag ensures that replacements are performed within words.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(6379)
srs = pd.Series(["kick 0", "step 1", "go 2", "run 5"])

replacement_mutator = mutator.with_replacement_table(
    "./ocr.csv",
    inline=True,
    rng=rng,
)

print(replacement_mutator([srs], 1.0))
# => ["lcick 0", "step |", "go z", "run s"]

To only replace whole words, leave out the inline flag or set it to False. One use case is to replace names that sound or seem similar.

CSV

source,target
Jan,Jann
Jan,Jean
Jan,John
Jan,Juan
Jann,Jean
Jann,Johann
Jann,John
Jann,Juan

Table

Source	Target
Jan	Jann
Jan	Jean
Jan	John
Jan	Juan
Jann	Jean
Jann	Johann
Jann	John
Jann	Juan

Assuming the table shown above, one could perform randomized replacements using this mutator like so.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(6379)
srs = pd.Series(["Jan", "Jann", "Juan"])

replacement_mutator = mutator.with_replacement_table(
    "./given-names.csv",
    rng=rng,
)

print(replacement_mutator([srs], 1.0))
# => GeckoWarning: with_replacement_table: desired probability of 1.0 cannot be met since percentage of rows that could possibly be mutated is 0.6666666666666666
# => ["Jann", "Juan", "Juan"]

Note how "Juan" is not replaced since it is only present in the "target" column, not the "source" column. By default, this mutator only considers replacement from the "source" to the "target" column. If it should also consider reverse replacements, set the reverse flag.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(6379)
srs = pd.Series(["Jan", "Jann", "Juan"])

replacement_mutator = mutator.with_replacement_table(
    "./given-names.csv",
    reverse=True,
    rng=rng,
)

print(replacement_mutator([srs], 1.0))
# => ["Jann", "Johann", "Jann"]
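Why "Juan" only becomes replaceable with the reverse flag can be seen by building the lookup by hand. This is a sketch of the idea, not Gecko's internals: `PAIRS` restates the given-names table and `build_lookup` is a hypothetical helper.

```python
from collections import defaultdict

# The rows of the given-names table as (source, target) pairs.
PAIRS = [
    ("Jan", "Jann"), ("Jan", "Jean"), ("Jan", "John"), ("Jan", "Juan"),
    ("Jann", "Jean"), ("Jann", "Johann"), ("Jann", "John"), ("Jann", "Juan"),
]


def build_lookup(pairs: list[tuple[str, str]], reverse: bool = False) -> dict[str, list[str]]:
    """Collect the possible replacements for every value that can be replaced."""
    lookup: dict[str, list[str]] = defaultdict(list)
    for source, target in pairs:
        lookup[source].append(target)
        if reverse:
            # Also allow going from the target back to the source.
            lookup[target].append(source)
    return dict(lookup)


values = ["Jan", "Jann", "Juan"]
forward = build_lookup(PAIRS)
both = build_lookup(PAIRS, reverse=True)

# Share of rows that have at least one replacement candidate.
share = sum(value in forward for value in values) / len(values)
print(share)                 # => 0.6666666666666666
print(sorted(both["Juan"]))  # => ['Jan', 'Jann']
```

The share of two thirds corresponds to the value reported in the warning above.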

Regex replacements

Where the phonetic and generic replacement mutators do not fit the bill, replacements using regular expressions might come in handy. with_regex_replacement_table supports the application of mutations based on regular expressions. This mutator works off of CSV files which contain the regular expression patterns to look for and the substitutions to perform as columns.

Warning

Before using this mutator, make sure that with_phonetic_replacement_table and with_replacement_table are not suitable for your use case. These functions are more optimised, whereas with_regex_replacement_table has to perform replacements on a mostly row-by-row basis which impacts performance.

Let's assume that you want to perform mutations on a column containing dates where the digits of certain days should be flipped. A CSV file that is capable of these mutations could look as follows.

CSV

pattern,1
"\d{4}-\d{2}-(30)","03"
"\d{4}-\d{2}-(20)","02"
"\d{4}-\d{2}-(10)","01"

Table

Pattern	1
`\d{4}-\d{2}-(30)`	`03`
`\d{4}-\d{2}-(20)`	`02`
`\d{4}-\d{2}-(10)`	`01`

A mutator using the CSV file above would look for dates that have "10", "20" or "30" in their "day" field and flip the digits to "01", "02" and "03" respectively. This is done by placing a capture group around the "day" field in the regular expression. Since it is the first capture group, once a row matches, Gecko will look up the substitution in the column labelled "1" in the CSV file. This also works with named capture groups, in which case Gecko will use the name of the capture group to look up substitutions.

CSV

pattern,day
"\d{4}-\d{2}-(?P<day>30)","03"
"\d{4}-\d{2}-(?P<day>20)","02"
"\d{4}-\d{2}-(?P<day>10)","01"

Table

Pattern	Day
`\d{4}-\d{2}-(?P<day>30)`	`03`
`\d{4}-\d{2}-(?P<day>20)`	`02`
`\d{4}-\d{2}-(?P<day>10)`	`01`
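The lookup by group number or group name can be sketched with Python's re module. `replace_groups` is a hypothetical helper, not Gecko's implementation, and using `re.search` to decide whether a row matches is an assumption.

```python
import re


def replace_groups(value: str, pattern: str, substitutions: dict[str, str]) -> str:
    """Replace the text of each capture group named in substitutions."""
    match = re.search(pattern, value)
    if match is None:
        return value
    spans = []
    for group, replacement in substitutions.items():
        # Column labels such as "1" refer to numbered groups, others to named groups.
        key = int(group) if group.isdigit() else group
        start, end = match.span(key)
        if start != -1:  # -1 means the group did not take part in the match
            spans.append((start, end, replacement))
    # Work from right to left so that earlier spans keep their positions.
    for start, end, replacement in sorted(spans, reverse=True):
        value = value[:start] + replacement + value[end:]
    return value


print(replace_groups("2020-01-30", r"\d{4}-\d{2}-(30)", {"1": "03"}))
# => 2020-01-03
print(replace_groups("2020-01-20", r"\d{4}-\d{2}-(?P<day>20)", {"day": "02"}))
# => 2020-01-02
```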

Substitutions may also reference named capture groups. Suppose you want to flip the least significant digit of the "day" and "month" field under certain conditions. A CSV file capable of performing this type of substitution looks as follows.

CSV

pattern,month,day
"\d{4}-0(?P<month>[1-8])-[0-2](?P<day>[1-8])","(?P<day>)","(?P<month>)"

Table

Pattern	Month	Day
`\d{4}-0(?P<month>[1-8])-[0-2](?P<day>[1-8])`	`(?P<day>)`	`(?P<month>)`

with_regex_replacement_table works much like its "phonetic" and "common" siblings in that it requires a path to a CSV file as shown above and the name of the column containing the regex patterns to look for. The columns containing the substitution values are inferred at runtime. The following snippet uses the second example, where named capture groups flip the digits in the day field.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(0x2321)
srs = pd.Series(["2020-01-30", "2020-01-20", "2020-01-10"])

regex_mutator = mutator.with_regex_replacement_table(
    "./dob-day-digit-flip.csv",
    pattern_column="pattern",
    rng=rng
)

print(regex_mutator([srs], 1.0))
# => ["2020-01-03", "2020-01-02", "2020-01-01"]

It is also possible to define a column that contains regex flags. At the time of writing, Gecko supports the ASCII and IGNORECASE flags, which can be applied by adding a and i respectively to the flag column.

CSV

pattern,suffix,flags
"fooba(?P<suffix>r)","z","i"

Table

Pattern	Suffix	Flags
`fooba(?P<suffix>r)`	`z`	`i`

In the following snippet, case-insensitive matching will be performed. This causes all rows of the input series to be modified.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(0xCAFED00D)
srs = pd.Series(["foobar", "Foobar", "fOoBaR"])

regex_mutator = mutator.with_regex_replacement_table(
    "./foobar.csv",
    pattern_column="pattern",
    flags_column="flags",
    rng=rng
)

print(regex_mutator([srs], 1.0))
# => ["foobaz", "Foobaz", "fOoBaz"]
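In terms of Python's re module, the two flag letters correspond to `re.ASCII` and `re.IGNORECASE`. The sketch below shows one way such a flag column could be translated. `FLAG_LOOKUP` and `compile_with_flags` are hypothetical names, not part of Gecko.

```python
import re

# Assumed mapping from flag letters to Python regex flags.
FLAG_LOOKUP = {"a": re.ASCII, "i": re.IGNORECASE}


def compile_with_flags(pattern: str, flags: str = "") -> re.Pattern:
    """Compile a pattern, combining one regex flag per letter in flags."""
    combined = 0
    for letter in flags:
        combined |= FLAG_LOOKUP[letter]
    return re.compile(pattern, combined)


strict = compile_with_flags(r"fooba(?P<suffix>r)")
relaxed = compile_with_flags(r"fooba(?P<suffix>r)", "i")

print(strict.search("fOoBaR") is None)           # => True
print(relaxed.search("fOoBaR").span("suffix"))   # => (5, 6)
```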

Case conversions

During data entry or normalization, text may be converted to all lowercase or uppercase, by accident or on purpose. with_lowercase and with_uppercase handle these use cases. In the example below, Gecko outputs warnings because some values in the original series are already all lowercase or all uppercase.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(9_0000_8479_1716)
srs = pd.Series(["Foobar", "foobaz", "FOOBAT"])

lowercase_mutator = mutator.with_lowercase(rng=rng)
uppercase_mutator = mutator.with_uppercase(rng=rng)

print(lowercase_mutator([srs], 1.0))
# => GeckoWarning: with_lowercase: desired probability of 1.0 cannot be met since percentage of rows that could possibly be mutated is 0.6666666666666666
# => ["foobar", "foobaz", "foobat"]

print(uppercase_mutator([srs], 1.0))
# => GeckoWarning: with_uppercase: desired probability of 1.0 cannot be met since percentage of rows that could possibly be mutated is 0.6666666666666666
# => ["FOOBAR", "FOOBAZ", "FOOBAT"]
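The share reported in these warnings can be reproduced with pandas alone: a row can only be mutated if converting it actually changes the value. This check is a sketch of that reasoning, not Gecko's own code.

```python
import pandas as pd

srs = pd.Series(["Foobar", "foobaz", "FOOBAT"])

# A row is eligible only if the conversion would change it.
lowercase_share = (srs != srs.str.lower()).mean()
uppercase_share = (srs != srs.str.upper()).mean()

# Two of three rows are eligible in both cases.
print(lowercase_share, uppercase_share)
```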

Date and time offsets

Date and time information is prone to errors where single fields are offset by a couple of units. This error source is implemented in the with_datetime_offset function. It requires the maximum amount by which a time unit can be offset, and the format of the data to mutate, expressed using Python's format codes for datetime objects. Offsets can be applied in units of days (d), hours (h), minutes (m) and seconds (s).

import numpy as np
import pandas as pd

from gecko import mutator

srs = pd.Series(pd.date_range("2020-01-01", "2020-01-31", freq="D"))
rng = np.random.default_rng(0xffd8)

datetime_mutator = mutator.with_datetime_offset(
   max_delta=5, unit="d", dt_format="%Y-%m-%d", rng=rng
)

print(datetime_mutator([srs], 1.0))
# => ["2019-12-30", "2019-12-29", ..., "2020-02-02", "2020-01-29"]

When applying offsets, the offset applied to a single field might affect another field. For example, subtracting a day from January 1st, 2020 wraps around to December 31st, 2019. If this is not desired, Gecko offers an extra flag that prevents these types of wraparounds at the cost of leaving affected rows untouched. Note how entries that wrapped around in the previous snippet, such as the first one, keep their original value in the output of the snippet below. Gecko outputs a warning to reflect this.

import numpy as np
import pandas as pd

from gecko import mutator

srs = pd.Series(pd.date_range("2020-01-01", "2020-01-31", freq="D"))
rng = np.random.default_rng(0xffd8)

datetime_mutator = mutator.with_datetime_offset(
   max_delta=5, unit="d", dt_format="%Y-%m-%d", prevent_wraparound=True, rng=rng
)

print(datetime_mutator([srs], 1.0))
# => GeckoWarning: with_datetime_offset: desired probability of 1.0 cannot be met since percentage of rows that could possibly be mutated is 0.9032258064516129
# => ["2020-01-01", "2020-01-02", ..., "2020-01-30", "2020-01-29"]
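The wraparound check can be illustrated with the standard library. `offset_days` is a hypothetical helper that handles day offsets only, and treating any change of month or year as a wraparound is an assumption made for this sketch.

```python
from datetime import datetime, timedelta


def offset_days(
    value: str,
    delta: int,
    dt_format: str = "%Y-%m-%d",
    prevent_wraparound: bool = False,
) -> str:
    """Shift a formatted date by delta days, optionally refusing wraparounds."""
    original = datetime.strptime(value, dt_format)
    shifted = original + timedelta(days=delta)
    # A day offset wraps around when it changes the month or the year.
    wrapped = (shifted.year, shifted.month) != (original.year, original.month)
    if prevent_wraparound and wrapped:
        return value
    return shifted.strftime(dt_format)


print(offset_days("2020-01-01", -2))
# => 2019-12-30
print(offset_days("2020-01-01", -2, prevent_wraparound=True))
# => 2020-01-01
```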

Repeated values

Erroneous copy-paste operations may yield an unwanted duplication of values. This is implemented in Gecko's with_repeat mutator. By default, it appends values with a space, but a custom joining character can be defined as well.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(1_7117_4268)
srs = pd.Series(["foo", "bar", "baz"])

repeat_mutator = mutator.with_repeat(rng=rng)
repeat_mutator_no_space = mutator.with_repeat(join_with="")

print(repeat_mutator([srs], 1.0))
# => ["foo foo", "bar bar", "baz baz"]

print(repeat_mutator_no_space([srs], 1.0))
# => ["foofoo", "barbar", "bazbaz"]

Using generators

with_generator can leverage Gecko's generators to prepend, append or replace data. For instance, this can be used to emulate compound names for persons who have more than one given or last name. By default, this function adds a space when prepending or appending generated data, but this can be customized.

import numpy as np
import pandas as pd

from gecko import mutator

def generate_foobar_suffix(rand: np.random.Generator):
    def _generate(count: int) -> list[pd.Series]:
        return [pd.Series(rand.choice(("bar", "baz", "bat"), size=count))]

    return _generate

srs = pd.Series(["foo"] * 100)
rng = np.random.default_rng(0x25504446)

gen_prepend_mutator = mutator.with_generator(generate_foobar_suffix(rng), "prepend")
print(gen_prepend_mutator([srs], 1.0))
# => ["bat foo", "bar foo", ..., "baz foo", "baz foo"]

gen_replace_mutator = mutator.with_generator(generate_foobar_suffix(rng), "replace")
print(gen_replace_mutator([srs], 1.0))
# => ["bar", "bar", ..., "baz", "bat"]

gen_append_mutator = mutator.with_generator(generate_foobar_suffix(rng), "append", join_with="")
print(gen_append_mutator([srs], 1.0))
# => ["foobat", "foobat", ..., "foobat", "foobaz"]

# {} can be used as a placeholder for generated values
gen_append_placeholder_mutator = mutator.with_generator(generate_foobar_suffix(rng), "append", join_with=" ({})")
print(gen_append_placeholder_mutator([srs], 1.0))
# => ["foo (bar)", "foo (baz)", ..., "foo (bat)", "foo (bar)"]
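The role of join_with when appending, as suggested by the outputs above, can be sketched as a small string function. `append_generated` is a hypothetical helper: the way Gecko handles the placeholder internally, and how it behaves when prepending, is not covered here.

```python
def append_generated(value: str, generated: str, join_with: str = " ") -> str:
    """Append a generated value, honouring a {} placeholder in join_with."""
    if "{}" in join_with:
        # The placeholder marks where the generated value goes.
        return value + join_with.format(generated)
    return value + join_with + generated


print(append_generated("foo", "bar"))           # => foo bar
print(append_generated("foo", "bat", ""))       # => foobat
print(append_generated("foo", "bar", " ({})"))  # => foo (bar)
```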

Grouped mutators

When applying mutators that are mutually exclusive, with_group can be used. It can take a list of mutators or a list of weighted mutators as arguments. When providing a list of mutators, all mutators are applied with equal probability. When using weighted mutators, each mutator is applied with its assigned probability.

import numpy as np
import pandas as pd

from gecko import mutator

rng = np.random.default_rng(123)
srs = pd.Series(["a"] * 100)

equal_prob_mutator = mutator.with_group([
   mutator.with_insert(rng=rng),
   mutator.with_delete(rng=rng),
], rng=rng)

(srs_mut_1,) = equal_prob_mutator([srs], 1.0)
print(srs_mut_1.str.len().value_counts())
# => { 0: 44, 2: 56 }
# no more single character values remain

weighted_prob_mutator = mutator.with_group([
   (.25, mutator.with_insert(rng=rng)),
   (.25, mutator.with_delete(rng=rng)),
], rng=rng)

(srs_mut_2,) = weighted_prob_mutator([srs], 1.0)
print(srs_mut_2.str.len().value_counts())
# => { 0: 25, 1: 51, 2: 24 }
# half of the original single character values remain
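One way to picture a weighted group is as a per-row draw that assigns each row to exactly one mutator or to none. The sketch below is not Gecko's implementation: `assign_rows` is a hypothetical helper and -1 is an arbitrary marker for rows that stay untouched.

```python
import numpy as np


def assign_rows(
    weights: list[float], count: int, rng: np.random.Generator
) -> np.ndarray:
    """Return, per row, the index of the mutator to apply or -1 for none."""
    # Whatever probability mass is left over means "leave the row untouched".
    remainder = max(0.0, 1.0 - sum(weights))
    options = list(range(len(weights))) + [-1]
    return rng.choice(options, size=count, p=[*weights, remainder])


labels = assign_rows([0.25, 0.25], 100, np.random.default_rng(123))
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Because every row receives exactly one label, the mutators in a group can never be applied to the same row.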

Multiple mutators

Using mutate_data_frame, you can apply multiple mutators on many columns at once. It is possible to set probabilities for each mutator, as well as to define multiple mutators per column.

import string

import numpy as np
import pandas as pd

from gecko import mutator

df = pd.DataFrame(
    {
        "fruit": ["apple", "banana", "orange"],
        "type": ["elstar", "cavendish", "mandarin"],
        "weight_in_grams": ["241.0", "195.6", "71.1"],
        "amount": ["3", "5", "6"],
        "grade": ["B", "C", "B"],
    }
)

rng = np.random.default_rng(25565)

df_mutated = mutator.mutate_data_frame(df, [
    (("fruit", "type"), (.5, mutator.with_permute())),  # (1)!
    ("grade", [  # (2)!
        mutator.with_substitute(charset=string.ascii_uppercase, rng=rng),
    ]),
    ("amount", [  # (3)!
        (.8, mutator.with_insert(charset=string.digits, rng=rng)),
        (.2, mutator.with_delete(rng=rng))
    ])
])

print(df_mutated)
# => [["fruit", "type", "weight_in_grams", "amount", "grade"],
#       ["elstar", "apple", "241.0", "53", "M"],
#       ["cavendish", "banana", "195.6", "59", "Q"],
#       ["mandarin", "orange", "71.1", "68", "V"]]

(1) You can assign probabilities to a mutator for a column. In this case, the permutation mutator will be applied to 50% of all records. The remaining 50% remain untouched.
(2) You can assign multiple mutators to a column. In this case, all specified mutators will be applied to all rows.
(3) You can assign probabilities to multiple mutators for a column. In this case, the insertion and deletion mutators are applied to 80% and 20% of all records respectively.