Title: | Tidy Output from Regular Expression Matching |
---|---|
Description: | Wrappers on 'regexpr' and 'gregexpr' to return the match results in tidy data frames. |
Authors: | Gábor Csárdi [aut, cre], Matthew Lincoln [ctb], Posit Software, PBC [cph, fnd] |
Maintainer: | Gábor Csárdi <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.1.2.9000 |
Built: | 2024-12-09 06:15:23 UTC |
Source: | https://github.com/r-lib/rematch2 |
Taking a data frame and a column name as input, this function runs
re_match() on that column and binds the results as new columns to the
original table, returning a tibble::tibble(). This makes it friendly for
pipe-oriented programming with magrittr.
bind_re_match(df, from, ..., keep_match = FALSE)

bind_re_match_(df, from, ..., keep_match = FALSE)
df | A data frame. |
from | Name of column to use as input for re_match(). |
... | Arguments (including pattern) to pass to re_match(). |
keep_match | Should the column .match be kept in the results? Defaults to FALSE. |
bind_re_match_(): Standard-evaluation version that takes a quoted column name.
If named capture groups would result in multiple columns with the same
name, tibble::repair_names() is called on the resulting table.
The standard-evaluation version, bind_re_match_(), is suitable for
programming.
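As a hedged illustration of programmatic use, the sketch below passes the column name as a string chosen at run time. The helper name `split_make_model` and its arguments are hypothetical; the regular expression is the one used in the example for this function.

```r
library(rematch2)

# Hypothetical helper: the column to parse is only known at run time,
# so it is passed as a string to the standard-evaluation version.
split_make_model <- function(df, column) {
  bind_re_match_(df, column, "^(?<make>\\w+) ?(?<model>.+)?$")
}

cars <- tibble::rownames_to_column(mtcars)
parsed <- split_make_model(cars, "rowname")

# The named capture groups appear as new columns next to the original ones
names(parsed)
```

Because the column name is an ordinary character value here, it can come from a function argument, a configuration file, or a loop over several columns.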
match_cars <- tibble::rownames_to_column(mtcars)
bind_re_match(match_cars, rowname, "^(?<make>\\w+) ?(?<model>.+)?$")
Match a regular expression to a string, and return matches, match positions,
and capture groups. This function is like its re_match() counterpart, except
that it returns match and capture group start and end positions in addition
to the matched values.
re_exec(text, pattern, perl = TRUE, ...)

## S3 method for class 'rematch_records'
x$name

## S3 method for class 'rematch_allrecords'
x$name
text | Character vector. |
pattern | A regular expression. See base::regex. |
perl | Logical; should Perl-compatible regular expressions be used? Defaults to TRUE; setting it to FALSE disables capture groups. |
... | Additional arguments to pass to base::regexpr(). |
x | Object returned by re_exec() or re_exec_all(). |
name | One of match, start or end. |
A tidy data frame (see Section “Tidy Data”). Match record entries are length-one vectors that are set to NA if there is no match.
The return value is a tidy data frame where each row corresponds to an
element of the input character vector text. The values from text appear for
reference in the .text character column. All other columns are list columns
containing the match data. The .match column contains the match information
for full regular expression matches, while the other columns correspond to
capture groups, if there are any and PCRE matching is enabled with
perl = TRUE (this is on by default). If capture groups are named, the
corresponding columns bear those names.
Each match data column is a list of match records, one for each element in
text. A match record is a named list with entries match, start and end, which
are respectively the matching (sub)string, the start position, and the end
position (using one-based indexing).
To make it easier to extract matching substrings or positions, a special $
operator is defined on match columns, both for the .match column and for the
columns corresponding to the capture groups. See the examples below.
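The following is a small sketch of how the $ operator and the NA convention for non-matches fit together. The input strings and the group name num are made up for illustration.

```r
library(rematch2)

# The second element contains no digits, so it yields no match
res <- re_exec(c("abc 123", "no digits"), "(?<num>[0-9]+)")

# $ on a match column collects one entry per input element
res$num$match   # matched substrings, NA where there is no match
res$num$start   # one-based start positions
res$num$end     # one-based end positions
```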
base::regexpr(), which this function wraps.
Other tidy regular expression matching: re_exec_all(), re_match_all(), re_match()
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  " Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)

# Match first occurrence
pos <- re_exec(notables, name_rex)
pos

# Custom $ to extract matches and positions
pos$first$match
pos$first$start
pos$first$end
Match a regular expression to a string, and return matches, match positions,
and capture groups. This function is like its re_match_all() counterpart,
except that it returns match and capture group start and end positions in
addition to the matched values.
re_exec_all(text, pattern, perl = TRUE, ...)
text | Character vector. |
pattern | A regular expression. See base::regex. |
perl | Logical; should Perl-compatible regular expressions be used? Defaults to TRUE; setting it to FALSE disables capture groups. |
... | Additional arguments to pass to base::gregexpr(). |
A tidy data frame (see Section “Tidy Data”). The entries of the match records in the list columns are vectors with one element per match found in the corresponding text element.
The return value is a tidy data frame where each row corresponds to an
element of the input character vector text. The values from text appear for
reference in the .text character column. All other columns are list columns
containing the match data. The .match column contains the match information
for full regular expression matches, while the other columns correspond to
capture groups, if there are any and PCRE matching is enabled with
perl = TRUE (this is on by default). If capture groups are named, the
corresponding columns bear those names.
Each match data column is a list of match records, one for each element in
text. A match record is a named list with entries match, start and end, which
are respectively the matching (sub)string, the start position, and the end
position (using one-based indexing).
To make it easier to extract matching substrings or positions, a special $
operator is defined on match columns, both for the .match column and for the
columns corresponding to the capture groups. See the examples below.
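As a rough sketch of how this differs from re_exec(), the $ operator on a re_exec_all() column yields one vector per input element rather than one value. The input strings and the group name num below are made up for illustration.

```r
library(rematch2)

allpos <- re_exec_all(c("a1b22", "c333"), "(?<num>[0-9]+)")

# For re_exec_all() results, $ gives one vector per element of text
m <- allpos$num$match
lengths(m)   # number of matches found in each input string
m[[1]]       # all matches found in the first string
```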
base::gregexpr(), which this function wraps.
Other tidy regular expression matching: re_exec(), re_match_all(), re_match()
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  " Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)

# All occurrences
allpos <- re_exec_all(notables, name_rex)
allpos

# Custom $ to extract matches and positions
allpos$first$match
allpos$first$start
allpos$first$end
re_match() wraps base::regexpr() and returns the match results in a
convenient data frame. The data frame has one column for each capture group
if perl = TRUE, and one final column called .match for the matching
(sub)string. The columns of the capture groups are named if the groups
themselves are named.
re_match(text, pattern, perl = TRUE, ...)
text | Character vector. |
pattern | A regular expression. See base::regex. |
perl | Logical; should Perl-compatible regular expressions be used? Defaults to TRUE; setting it to FALSE disables capture groups. |
... | Additional arguments to pass to base::regexpr(). |
A data frame of character vectors: one column per capture group, named if
the group was named, and additional columns for the input text and the first
matching (sub)string. Each row corresponds to an element in the text vector.
re_match() uses PCRE-compatible regular expressions by default (i.e.
perl = TRUE in base::regexpr()). You can switch this off, but if you do so,
capture groups will no longer be reported, as they are only supported by PCRE.
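A short sketch of the difference, using a reduced version of the dates example from this page. The variable names with_pcre and without_pcre are only illustrative.

```r
library(rematch2)

dates <- c("2016-04-20", "not a date")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"

with_pcre <- re_match(dates, isodate)                   # perl = TRUE (default)
without_pcre <- re_match(dates, isodate, perl = FALSE)

ncol(with_pcre)       # capture group columns plus .text and .match
names(without_pcre)   # capture group columns are not reported
```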
Other tidy regular expression matching: re_exec_all(), re_exec(), re_match_all()
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)

# The same with named groups
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
This function is a thin wrapper on the base R function base::gregexpr(),
and extracts the matching (sub)strings as a data frame. It extracts all
matches, and potentially their capture groups as well.
re_match_all(text, pattern, perl = TRUE, ...)
text | Character vector. |
pattern | A regular expression. See base::regex. |
perl | Logical; should Perl-compatible regular expressions be used? Defaults to TRUE; setting it to FALSE disables capture groups. |
... | Additional arguments to pass to base::gregexpr(). |
A tidy data frame (see Section “Tidy Data”). The list columns contain character vectors with as many entries as there are matches for each input element.
The return value is a tidy data frame where each row corresponds to an
element of the input character vector text. The values from text appear for
reference in the .text character column. All other columns are list columns
containing the match data. The .match column contains the match information
for full regular expression matches, while the other columns correspond to
capture groups, if there are any and PCRE matching is enabled with
perl = TRUE (this is on by default). If capture groups are named, the
corresponding columns bear those names.
Each match data column is a list of match records, one for each element in
text. A match record is a named list with entries match, start and end, which
are respectively the matching (sub)string, the start position, and the end
position (using one-based indexing).
If the input text character vector has length zero, base::regexpr() is
called instead of base::gregexpr(), because the latter cannot extract the
number and names of the capture groups in this case.
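A minimal sketch of the zero-length case, reusing the name pattern from the examples on this page. It assumes, as the note above suggests, that the capture group columns are still present when there is nothing to match.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)

# Zero-length input: no rows, but the columns can still be inspected
empty <- re_match_all(character(), name_rex)
nrow(empty)
names(empty)
```

This matters when downstream code selects capture group columns by name, since that code would otherwise fail on empty input.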
Other tidy regular expression matching: re_exec_all(), re_exec(), re_match()
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  " Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
re_match_all(notables, name_rex)