
Regex - fundamental rules

I have no idea why the first post turned into a 404, so I am posting it again.
Creating a proper regex is pretty simple.
Step 1: open your preferred editor
Step 2: let a cat play on your keyboard

Joking aside, here are some notes and things I have collected from Stack Overflow, YouTube, books, or wherever Google took me.

One example:

^(\d{10}|\d{12}|\d{6}-\d{4}|\d{8}-\d{4}|\d{8} \d{4}|\d{6} \d{4})   // Swedish ID number, e.g. 9901189876
// 897876-1865 also matches, so this only checks length and format (not highly accurate)
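A rough sketch of the same format-only check in JavaScript. It adds a `$` anchor, which is an assumption about the intent (the pattern above has none), so that extra trailing digits are rejected. The name `looksLikeSwedishId` is made up for this example, and no date or checksum validation is attempted.

```javascript
// Format-only check: 10 or 12 digits, optionally split before the last 4
// by a hyphen or a space. Assumption: the whole string must match, hence `$`.
const swedishIdFormat = /^(\d{10}|\d{12}|\d{6}[- ]\d{4}|\d{8}[- ]\d{4})$/;

// Hypothetical helper name, not from the original notes.
function looksLikeSwedishId(input) {
  return swedishIdFormat.test(input);
}

console.log(looksLikeSwedishId('9901189876'));   // true
console.log(looksLikeSwedishId('897876-1865'));  // true: month "78" is not checked
console.log(looksLikeSwedishId('99011898761'));  // false: 11 digits
```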

Some very basic stuff

^: anchor to the beginning of the string
$: anchor to the end of the string
[...]: character class; matches one character from a set or range
?: match zero or one time
*: match zero or more times
+: match one or more times
\t \f \n \r: whitespace characters, all matched by \s
\s, \d, \w: shorthand character classes
? * + {n}: quantifiers
(...|...): grouping and alternation
i / g / m: modifiers (flags)
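To make the basics above a bit more concrete, here is a small sketch that exercises anchors, an optional token, and grouping with a bounded quantifier. The variable names and sample strings are invented for illustration.

```javascript
// ^ and $ anchor the match to the whole string.
const wholeNumber = /^\d+$/;
console.log(wholeNumber.test('2024'));     // true
console.log(wholeNumber.test('2024 AD'));  // false

// ? makes the preceding token optional.
const colour = /^colou?r$/;
console.log(colour.test('color'));   // true
console.log(colour.test('colour'));  // true

// (...|...) groups alternatives; {2,3} bounds how often the group repeats.
const repeated = /^(ab|cd){2,3}$/;
console.log(repeated.test('abcd'));  // true
console.log(repeated.test('ab'));    // false (only one repetition)
```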

JavaScript regex quick reference:

Seriously, I doubt anyone can read and remember all of this at the beginning. The best way is to memorize a few simple, basic ones. Then, whenever you need a specific match in practice, look the rest up again; repetition is what makes this magic stuff stick.

[abx-z] One character of: a, b, or the range x-z
[^abx-z] One character except: a, b, or the range x-z
a|b a or b
a? Zero or one a's (greedy)
a?? Zero or one a's (lazy)
a* Zero or more a's (greedy)
a*? Zero or more a's (lazy)
a+ One or more a's (greedy)
a+? One or more a's (lazy)
a{4} Exactly 4 a's
a{4,8} Between (inclusive) 4 and 8 a's
a{9,} 9 or more a's
(?=...) A positive lookahead
(?!...) A negative lookahead
(?:...) A non-capturing group
(...) A capturing group
^ Beginning of the string
$ End of the string
\d A digit (same as [0-9])
\D A non-digit (same as [^0-9])
\w A word character (same as [_a-zA-Z0-9])
\W A non-word character (same as [^_a-zA-Z0-9])
\s A whitespace character
\S A non-whitespace character
\b A word boundary
\B A non-word boundary
\n A newline
\t A tab
\cY The control character Ctrl+Y (Y is a letter A-Z)
\xYY The character with the hex code YY
\uYYYY The character with the hex code YYYY
. Any character except line terminators (any character at all with the s flag)
\Y A backreference to the Y'th captured group

i ignore case
g global
m multiline
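A short sketch of how the three flags and a backreference behave in practice. The sample strings are invented for illustration.

```javascript
// g: return every match instead of only the first; i: ignore case.
const caseSensitive = 'Anna'.match(/a/g);   // ['a']
const ignoreCase = 'Anna'.match(/a/gi);     // ['A', 'a']

// m: ^ and $ also match at the start and end of each line.
const withoutM = /^b$/.test('a\nb');        // false
const withM = /^b$/m.test('a\nb');          // true

// \1 refers back to whatever group 1 captured: here, a doubled character.
const doubled = 'hello book'.match(/(\w)\1/g);  // ['ll', 'oo']

console.log(caseSensitive, ignoreCase, withoutM, withM, doubled);
```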

Several good websites for testing regex patterns

1. regexr Recommended!🎇🎇🎇
2. regextester
3. regex101
4. regexper
5. for PHP
6. Too sophisticated for now though very good for sure
7. some cheatsheet
8. One article about group capture in Chinese

9. RegexEgg - good website
10. Some article about greedy mode


Laziness is only enabled for the quantifier with ?.
Other quantifiers remain greedy.

For instance:

alert( "123 456".match(/\d+ \d+?/) ); // 123 4

The pattern \d+ tries to match as many digits as it can (greedy mode), so it finds 123 and stops, because the next character is a space ' '.
Then there's a space in the pattern, it matches.
Then there's \d+?. The quantifier is in lazy mode, so it finds one digit 4 and tries to check if the rest of the pattern matches from there.
…But there's nothing in the pattern after \d+?.
The lazy mode doesn't repeat anything without a need. The pattern finished, so we're done. We have a match 123 4.
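The difference is easiest to see when something follows the quantifier. A minimal sketch, with a sample string made up for this example:

```javascript
const text = '"a" and "b"';

// Greedy: .+ grabs as much as possible, then backtracks to the LAST quote.
const greedy = text.match(/".+"/)[0];   // '"a" and "b"'

// Lazy: .+? grabs as little as possible, stopping at the FIRST closing quote.
const lazy = text.match(/".+?"/)[0];    // '"a"'

// The example from above: only the second quantifier is lazy.
const mixed = '123 456'.match(/\d+ \d+?/)[0];  // '123 4'

console.log(greedy, lazy, mixed);
```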


/^[a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð ,.'-]+$/

⬆ covers the letters most likely to appear in a name ⬆
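A possible alternative, not taken from the original notes: engines that support Unicode property escapes (the `u` flag, ES2018) can use `\p{L}` for "any letter" instead of listing every accented character by hand. This is a sketch that swaps in a different technique; it also accepts scripts the hand-written list does not cover, which may or may not be what you want.

```javascript
// \p{L} matches any Unicode letter; the u flag is required for \p{...}.
// Space, comma, dot, apostrophe and hyphen are allowed as in the pattern above.
const nameCharacters = /^[\p{L} ,.'-]+$/u;

console.log(nameCharacters.test('Åsa Öberg'));     // true
console.log(nameCharacters.test("O'Neil-Smith"));  // true
console.log(nameCharacters.test('R2-D2'));         // false: digits are not letters
```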


One article from Juejin

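1.) Pattern matching with capturing parentheses (x)
Pattern matching, also called string matching: given a set P of specific strings and an arbitrary string T, find every position in T where a string from P occurs. P is called the set of pattern strings, each element of P is a pattern string (or keyword), and T is the text. All characters are drawn from a finite symbol set Σ, called the alphabet or character set.
2.) Pattern matching with non-capturing parentheses (?:x)
3.) Lookahead assertion x(?=y)
4.) Lookbehind assertion (?<=y)x
5.) Negative lookahead x(?!y)
6.) Negative lookbehind (?<!y)x
7.) Character sets and negated character sets [xyz] / [^xyz]
8.) Word boundary / non-word boundary \b / \B
9.) Whitespace / non-whitespace characters \s / \S
10.) Word / non-word characters \w / \W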
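The definition of pattern matching in item 1 (find every position in a text T where a string from a set P occurs) can be sketched with a zero-width lookahead, so that overlapping occurrences are reported too. The function names `escapeRegExp` and `findOccurrences` are hypothetical, and this is a naive per-pattern scan rather than an efficient multi-pattern algorithm.

```javascript
// Escape regex metacharacters so a pattern string is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return every { pattern, index } where a pattern from `patterns` occurs in `text`.
function findOccurrences(patterns, text) {
  const hits = [];
  for (const p of patterns) {
    if (p === '') continue; // an empty pattern would match everywhere
    // (?=...) is zero-width, so matches may overlap.
    const re = new RegExp(`(?=${escapeRegExp(p)})`, 'g');
    for (const m of text.matchAll(re)) {
      hits.push({ pattern: p, index: m.index });
    }
  }
  return hits.sort((a, b) => a.index - b.index);
}

console.log(findOccurrences(['aa'], 'aaa'));       // occurrences at index 0 and 1
console.log(findOccurrences(['ab', 'b'], 'abab')); // ab@0, b@1, ab@2, b@3
```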

1.) Capturing Groups

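Capturing groups are mainly used to match a class of strings and remember what was matched. Early versions of JavaScript did not support named capture groups (ES2018 added the (?<name>...) syntax). They are typically used for replacement. (test/replace)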

let str = 'xuxi is xuxi is'
let reg = /(xuxi) (is) \1 \2/g
reg.test(str) // true (1)
str.replace(reg, '$1 $2') // xuxi is (2)

( ) => capturing parentheses (capture groups)
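A small sketch of reading captured groups back, both by number and by name. Named groups use the `(?<name>...)` syntax from ES2018, so this assumes a reasonably modern engine; the date string is an invented example.

```javascript
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = '2024-03-15'.match(isoDate);

// Groups are available by position...
const byNumber = [match[1], match[2], match[3]];  // ['2024', '03', '15']
// ...and, for named groups, by name.
const byName = match.groups.year;                 // '2024'

// In a replacement string, $1..$n and $<name> refer to the captures.
const swappedNumbered = '2024-03-15'.replace(isoDate, '$3/$2/$1');
const swappedNamed = '2024-03-15'.replace(isoDate, '$<day>/$<month>/$<year>');

console.log(byNumber, byName, swappedNumbered, swappedNamed);
```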

2.) Non-Capturing Groups(?:x)

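Mainly used to match a class of strings without remembering the match. Useful for checking whether a certain kind of substring exists in a string. (test if exist)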
e.g.

let str = 'xuxixuxi'
let reg = /(?:xuxi){1,2}/g
reg.test(str) // true (1)
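The practical difference from a capturing group shows up in the match result. A minimal sketch using the same string:

```javascript
const str = 'xuxixuxi';

// Non-capturing: the result holds only the full match.
const nonCapturing = str.match(/(?:xuxi){1,2}/);
// Capturing: the result also holds the last repetition of the group.
const capturing = str.match(/(xuxi){1,2}/);

console.log(nonCapturing.length); // 1
console.log(capturing.length);    // 2
console.log(capturing[1]);        // 'xuxi'
```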

Lookahead and lookbehind assertions come in four forms:
(?=pattern) zero-width positive lookahead assertion
(?!pattern) zero-width negative lookahead assertion
(?<!pattern) zero-width negative lookbehind assertion
(?<=pattern) zero-width positive lookbehind assertion

3.) Positive Lookahead Assertion: x(?=y)

Lookahead: matches 'x' only if 'x' is followed by 'y'.

e.g.

let str = '王者融化'
let reg = /王(?=者)/
reg.test(str) // true (1)
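Because the lookahead is zero-width, the text it checks is not part of the match. A quick sketch with an invented CSS-like string:

```javascript
// Match numbers only when they are followed by "px"; "px" itself is not consumed.
const pixelValues = '100px 20em 5px'.match(/\d+(?=px)/g);

console.log(pixelValues); // ['100', '5']
```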

4.) Positive Lookbehind Assertion: (?<=y)x

Lookbehind: matches 'x' only if 'x' is preceded by 'y'.

e.g.

let str = 'xuxiA'
let reg = /(?<=xuxi)A/
reg.test(str) // true (1)
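A sketch of lookbehind used for extraction. Lookbehind is a newer addition to JavaScript than lookahead, so older engines may reject this syntax; the sample string is made up.

```javascript
// Match digits only when they come right after a dollar sign.
const dollarAmounts = 'cost: $42, qty: 7'.match(/(?<=\$)\d+/g);

console.log(dollarAmounts); // ['42']
```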

5.) Negative Lookahead Assertion: x(?!y)

Negative lookahead: matches 'x' only if 'x' is not followed by 'y'.

e.g.

let str = '3.1415'
let reg = /\d+(?!\.)/
reg.exec(str) // [1415] (1)

6.) Negative Lookbehind Assertion: (?<!y)x

Negative lookbehind: matches 'x' only if 'x' is not preceded by 'y'.

e.g.

let str = '3.1415'
let reg = /(?<!\.)\d+/
reg.exec(str) // [3] (1)
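One pitfall with the two negative assertions: a greedy `\d+` can backtrack and match part of a number, so it often helps to exclude digits in the assertion as well. A sketch with invented sample strings:

```javascript
// Negative lookahead: numbers NOT followed by "px".
// Adding \d to the assertion stops "10" (from "100px") from sneaking through.
const notPixels = '100px 20em 5px'.match(/\d+(?!px|\d)/g);

// Negative lookbehind: numbers NOT preceded by "$".
// Adding \d stops "2" (from "$42") from matching on its own.
const notDollars = 'cost: $42, qty: 7'.match(/(?<![$\d])\d+/g);

console.log(notPixels);  // ['20']
console.log(notDollars); // ['7']
```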

7.) Character sets and negated character sets: [xyz] / [^xyz]

e.g.

let str = 'abcd'
let reg1 = /[a-c]+/
let reg2 = /[^d]$/
reg1.test(str) // true (1)
reg2.test(str) // false (2)

8.) Word Boundaries & Non-Word Boundaries (\b \B)

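\b matches a word boundary: a position where a word character is not followed by another word character, or is not preceded by one, for example between a letter and a space. Note that the boundary itself is not included in the match; in other words, a matched word boundary has length 0.

\B matches a non-word boundary, that is, any of the following positions:
(1) before the first character of the string, if it is a non-word character
(2) after the last character of the string, if it is a non-word character
(3) between two word characters
(4) between two non-word characters
(5) an empty string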

e.g.

let str = 'xuxi'
let reg1 = /xi\b/
let reg2 = /xu\B/
reg1.exec(str) // [xi] (1)
reg2.exec(str) // [xu] (2)

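In (1) the word boundary after xi, which is the end of the string, is matched. In (2) xu is followed by a non-word boundary (another word character comes next), so it is matched as well.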
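A common use of `\b` is whole-word matching, while `\B` finds a word buried inside another. A minimal sketch with an invented sentence:

```javascript
const sentence = 'cat concat cats';

// \bcat\b: "cat" as a whole word only.
const wholeWords = sentence.match(/\bcat\b/g);

// \Bcat: "cat" that does NOT start at a word boundary, i.e. inside "concat".
const embedded = sentence.replace(/\Bcat/g, 'CAT');

console.log(wholeWords); // ['cat']
console.log(embedded);   // 'cat conCAT cats'
```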

9.) Whitespace & Non-Whitespace (\s \S)

\s: matches a whitespace character, including space, tab, form feed and line feed.
\S: matches a non-whitespace character.

e.g.

let str = 'xuxi is'
let reg1 = /.*\s/g
let reg2 = /\S\w*/g
reg1.exec(str) // ["xuxi "] (1), note the trailing space
reg2.exec(str) // [xuxi] (2)

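(1) matches "xuxi " (everything up to and including the whitespace character), while (2) matches "xuxi" (a non-whitespace character followed by word characters).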
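In everyday code, `\s` and `\S` are often used to tokenize text. A short sketch with a made-up input string:

```javascript
const messy = '  regex \t is\nfun ';

// Split on runs of whitespace (after trimming the ends).
const bySplit = messy.trim().split(/\s+/);

// Or collect runs of non-whitespace directly.
const byMatch = messy.match(/\S+/g);

console.log(bySplit); // ['regex', 'is', 'fun']
console.log(byMatch); // ['regex', 'is', 'fun']
```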

10.) Word Character / Non-Word Character (\w \W)

\w: matches a word character (a letter, digit or underscore). Equivalent to [A-Za-z0-9_].
\W: matches a non-word character. Equivalent to [^A-Za-z0-9_].
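Since `\w` is equivalent to `[A-Za-z0-9_]`, it does not match accented or non-Latin letters. The sketch below shows this and one possible workaround using Unicode property escapes (`u` flag, ES2018); the workaround is an addition for illustration, not part of the article.

```javascript
const asciiWord = /^\w+$/;
// Assumption: "word" here means any Unicode letter, any Unicode number, or underscore.
const unicodeWord = /^[\p{L}\p{N}_]+$/u;

console.log(asciiWord.test('user_42'));        // true
console.log(asciiWord.test('na\u00efve'));     // false: "ï" is outside [A-Za-z0-9_]
console.log(unicodeWord.test('na\u00efve'));   // true
```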


10 Examples:

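1.) Remove a given tag (and its content) from a string

function trimTag(tagName, htmlStr) {
  // [^>]* covers attributes in the opening tag; [\s\S]*? lazily matches the body, newlines included
  let reg = new RegExp(`<${tagName}(\\s[^>]*)?>[\\s\\S]*?<\\/${tagName}>`, "g")
  return htmlStr.replace(reg, '')
}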

2.) Convert kebab-case to camelCase

// kebab-case to camelCase; flag = 0 gives lower camelCase, 1 gives upper camelCase
function toCamelCase(str, flag = 0) {
  if (flag) {
    return str[0].toUpperCase() + str.slice(1).replace(/-(\w)/g, ($0, $1) => $1.toUpperCase())
  } else {
    return str.replace(/-(\w)/g, ($0, $1) => $1.toUpperCase())
  }
}

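3.) Implement a simple template engine
Implementing a template engine relies heavily on regular expressions. If you are interested, read the article "Implementing a simple template engine" directly.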
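To give a flavour of the idea, here is a very small sketch of placeholder substitution with a regex. It is not the implementation from the referenced article; the `{{ name }}` syntax and the function name `render` are assumptions made for this example.

```javascript
// Replace {{ key }} placeholders with values from `data`.
// Unknown keys are left untouched so that missing data is easy to spot.
function render(template, data) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(data, key) ? String(data[key]) : match
  );
}

console.log(render('Hello, {{ name }}!', { name: 'Ada' }));   // 'Hello, Ada!'
console.log(render('{{a}} + {{b}} = {{sum}}', { a: 1, b: 2 })); // '1 + 2 = {{sum}}'
```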

4.) Remove all whitespace from a string

function trimAll(str) {
  // \s+ is enough; \s* would also match the empty string between every character
  return str.replace(/\s+/g, "")
}

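5.) Validate input against a specified format

function numCheck(str, specialNum) {
  if (str.indexOf(',') > -1) {
    // every comma-separated item must pass the check on its own
    return str.split(',').every(item => numCheck(item, specialNum));
  } else {
    // valid when the separator specialNum occurs exactly once
    return str.split(specialNum).length === 2;
  }
}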

6.) Remove fields with empty values from a URL parameter string

const trimParmas = (parmaStr:string = '') => {
  return parmaStr.replace(/((\w*?)=&|(&\w*?=)$)/g, '')
}

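7.) Convert a browser query string into a parameter object

function unParams(params = '?a=1&b=2&c=3') {
  let obj = {}
  // replace is used only to visit each key=value pair; its return value is ignored
  params && params.replace(/((\w*)=([\.a-z0-9A-Z]*)?)?/g, (m,a,b,c) => {
    if(b || c) obj[b] = c
  })
  return obj
}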
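For comparison, a sketch that does the same job without a regex, using the built-in `URLSearchParams` (available in modern browsers and Node.js). The function name `parseQuery` is made up. Unlike the regex version, this also decodes percent-encoded values, and for repeated keys the last value wins.

```javascript
// URLSearchParams ignores a leading "?" and yields [key, value] pairs.
function parseQuery(query) {
  return Object.fromEntries(new URLSearchParams(query));
}

console.log(parseQuery('?a=1&b=2&c=3')); // { a: '1', b: '2', c: '3' }
console.log(parseQuery('q=a%20b'));      // { q: 'a b' }
```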

8.) Compute the byte length of a string

/**
 * Compute the byte length of a string
 * @param str
 * @desc a Chinese character counts as 2 bytes, an English character as 1 byte
 */
function computeStringByte(str) {
  let size = 0,
    reg = /[\u4e00-\u9fa5]/ // tests whether a character is Chinese
  for (let i = 0; i < str.length; i++) {
    if (reg.test(str[i])) {
      size += 2
    } else {
      size += 1
    }
  }
  return size
}
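The "2 bytes per Chinese character" rule above is a counting convention, not the size in every encoding. If the actual UTF-8 size is what you need, a sketch using the built-in `TextEncoder` looks like this (the function name `utf8ByteLength` is made up for the example):

```javascript
// TextEncoder always encodes to UTF-8; the resulting Uint8Array length is the byte count.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

console.log(utf8ByteLength('abc'));  // 3
console.log(utf8ByteLength('中文')); // 6: each of these characters takes 3 bytes in UTF-8
```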

9.) Check whether a string contains Chinese characters

function hasCn(str) {
  // no g flag needed: test() only has to find one match
  let reg = /[\u4e00-\u9fa5]/
  return reg.test(str)
}

10.) Implement search suggestions

function searchLink(keyword) {
  // mock data returned by the backend
  let list = ['abc', 'ab', 'a', 'bcd', 'edf', 'abd'];
  let reg = new RegExp(keyword, 'i');
  return list.filter(item => reg.test(item))
}
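One thing to watch when building a regex from user input: characters such as `.` or `(` are treated as regex syntax, and an unbalanced `(` makes `new RegExp` throw. A hedged sketch of the same search with the keyword escaped first; `escapeKeyword` and `safeSearch` are hypothetical names, and the list is passed in as a parameter here rather than mocked inside the function.

```javascript
// Escape regex metacharacters so the keyword is matched literally.
function escapeKeyword(keyword) {
  return keyword.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Case-insensitive substring search over `list`.
function safeSearch(keyword, list) {
  const reg = new RegExp(escapeKeyword(keyword), 'i');
  return list.filter(item => reg.test(item));
}

console.log(safeSearch('AB', ['abc', 'ab', 'a', 'bcd']));  // ['abc', 'ab']
console.log(safeSearch('a.c', ['abc', 'a.c']));            // ['a.c']
console.log(safeSearch('(', ['(x)', 'x']));                // ['(x)']
```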