---
title: "URL Validation"
author: "Jim Hester"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{URL Validation}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

Consider the task of correctly [validating a URL](https://mathiasbynens.be/demo/url-regex).
Two conclusions can be drawn from that page.

1. Validating URLs requires complex regular expressions.
2. Creating a correct regular expression is hard! (Only 1 out of 13 regexes was valid for all cases.)

Because of this, you may be tempted to simply copy the best regex you can find ([gist](https://gist.github.com/dperini/729294)).

The problem is that while you can copy it now, what happens later when you find a case that is not handled correctly? Can you correctly interpret and modify this?

```{r url_parsing_stock, eval=FALSE}
"^(?:(?:http(?:s)?|ftp)://)(?:\\S+(?::(?:\\S)*)?@)?(?:(?:[a-z0-9\u00a1-\uffff](?:-)*)*(?:[a-z0-9\u00a1-\uffff])+)(?:\\.(?:[a-z0-9\u00a1-\uffff](?:-)*)*(?:[a-z0-9\u00a1-\uffff])+)*(?:\\.(?:[a-z0-9\u00a1-\uffff]){2,})(?::(?:\\d){2,5})?(?:/(?:\\S)*)?$"
```

However, if you re-create the regex with `rex`, it is much easier to understand and to modify later if needed.

```{r url_parsing_url}
library(rex)
library(magrittr)

valid_chars <- rex(except_some_of(".", "/", " ", "-"))

re <- rex(
  start,

  # protocol identifier (required) + ://
  group(list("http", maybe("s")) %or% "ftp", "://"),

  # user:pass authentication (optional)
  maybe(non_spaces,
    maybe(":", zero_or_more(non_space)),
    "@"),

  # host name
  group(zero_or_more(valid_chars, zero_or_more("-")), one_or_more(valid_chars)),

  # domain name
  zero_or_more(".", zero_or_more(valid_chars, zero_or_more("-")), one_or_more(valid_chars)),

  # TLD identifier
  group(".", valid_chars %>% at_least(2)),

  # server port number (optional)
  maybe(":", digit %>% between(2, 5)),

  # resource path (optional)
  maybe("/", non_space %>% zero_or_more()),

  end
)
```

We can then validate that it correctly identifies both good and bad URLs.
(_IP address validation removed_)

```{r url_parsing_validate}
good <- c(
  "http://foo.com/blah_blah",
  "http://foo.com/blah_blah/",
  "http://foo.com/blah_blah_(wikipedia)",
  "http://foo.com/blah_blah_(wikipedia)_(again)",
  "http://www.example.com/wpstyle/?p=364",
  "https://www.example.com/foo/?bar=baz&inga=42&quux",
  "http://✪df.ws/123",
  "http://userid:password@example.com:8080",
  "http://userid:password@example.com:8080/",
  "http://userid@example.com",
  "http://userid@example.com/",
  "http://userid@example.com:8080",
  "http://userid@example.com:8080/",
  "http://userid:password@example.com",
  "http://userid:password@example.com/",
  "http://➡.ws/䨹",
  "http://⌘.ws",
  "http://⌘.ws/",
  "http://foo.com/blah_(wikipedia)#cite-1",
  "http://foo.com/blah_(wikipedia)_blah#cite-1",
  "http://foo.com/unicode_(✪)_in_parens",
  "http://foo.com/(something)?after=parens",
  "http://☺.damowmow.com/",
  "http://code.google.com/events/#&product=browser",
  "http://j.mp",
  "ftp://foo.bar/baz",
  "http://foo.bar/?q=Test%20URL-encoded%20stuff",
  "http://مثال.إختبار",
  "http://例子.测试",
  "http://-.~_!$&'()*+,;=:%40:80%2f::::::@example.com",
  "http://1337.net",
  "http://a.b-c.de",
  "http://223.255.255.254")

bad <- c(
  "http://",
  "http://.",
  "http://..",
  "http://../",
  "http://?",
  "http://??",
  "http://??/",
  "http://#",
  "http://##",
  "http://##/",
  "http://foo.bar?q=Spaces should be encoded",
  "//",
  "//a",
  "///a",
  "///",
  "http:///a",
  "foo.com",
  "rdar://1234",
  "h://test",
  "http:// shouldfail.com",
  ":// should fail",
  "http://foo.bar/foo(bar)baz quux",
  "ftps://foo.bar/",
  "http://-error-.invalid/",
  "http://-a.b.co",
  "http://a.b-.co",
  "http://0.0.0.0",
  "http://3628126748",
  "http://.www.foo.bar/",
  "http://www.foo.bar./",
  "http://.www.foo.bar./")

all(grepl(re, good) == TRUE)

all(grepl(re, bad) == FALSE)
```

You can now see the power and expressiveness of building regular expressions with `rex`!
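
If one of these checks ever fails, `all()` only tells you that something went wrong, not which URL caused it. The following is a minimal sketch of one way to find the offending inputs. The helper `misclassified()` and the pattern `toy` are hypothetical names introduced here for illustration and are not part of `rex`; the toy pattern is deliberately much looser than `re`.

```{r url_parsing_diagnose}
# Hypothetical helper (not part of rex): return the inputs whose match
# result differs from the expected one.
misclassified <- function(pattern, urls, should_match) {
  urls[grepl(pattern, urls) != should_match]
}

# Toy pattern for illustration only: it checks nothing beyond the protocol.
toy <- "^https?://"

# Expected to match, but the toy pattern does not know about ftp.
misclassified(toy, c("http://foo.com", "ftp://foo.bar/baz"), TRUE)

# Expected to be rejected, but the toy pattern accepts a bare protocol.
misclassified(toy, c("http://", "foo.com"), FALSE)
```

Calling `misclassified(re, good, TRUE)` and `misclassified(re, bad, FALSE)` in the same way should each return an empty character vector whenever the two `all()` checks pass, and otherwise point at the part of the `rex` expression that may need adjusting.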