Regex – Find and Replace or Delete the String Between Any Two Characters
(Regex Match All Characters Between Two Specified Characters)
Regex can be used to select everything between the specified characters. This can be useful for things like extracting contents of parentheses like (abc) or for extracting folder names from a file path (e.g. C:/documents/work/).
A regular expression that matches all characters between two specified characters makes use of look-ahead (?=…) and look-behind (?<=…) statements to isolate the string, and then uses the dot character . to select all contents between the delimiters.
An expression that matches everything between a and b is:
/(?<=a).*(?=b)/g
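As a minimal sketch, the expression can be tried out in Python. Python has no /…/g literal, so here the pattern is written as a raw string and re.findall stands in for the global modifier. The sample strings and variable names are assumptions made for illustration only.

```python
import re

# Same pattern as /(?<=a).*(?=b)/g, written as a Python raw string.
between_a_and_b = re.compile(r"(?<=a).*(?=b)")

# Hypothetical sample: "a", some content, then "b".
matches = between_a_and_b.findall("a123b")
print(matches)  # ['123']

# The same idea with parentheses as delimiters; they must be escaped
# because ( and ) are special characters in a regular expression.
in_parens = re.findall(r"(?<=\().*(?=\))", "(abc)")
print(in_parens)  # ['abc']
```

In both cases the delimiters themselves are left out of the result.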
Let’s discuss how it works:
How it Works
The expression starts with a positive look-behind (?<=…) which ensures that the matched string is preceded by whatever is in the place of …. In this case, we want to ensure that the letter a directly precedes the matched string.
/(?<=a)/
Look-aheads and look-behinds are assertions, which means that they only check whether a certain condition is true at the current position. Their contents (a in this case) are not included in the match.
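To see that a look-behind consumes no characters, here is a small Python sketch. The sample text "xab" is an assumption chosen only for illustration.

```python
import re

text = "xab"  # hypothetical sample text

# The look-behind on its own matches a position, not a character:
# the empty spot right after the "a".
m = re.search(r"(?<=a)", text)
assertion_text = m.group(0)
assertion_span = m.span()
print(repr(assertion_text), assertion_span)  # '' (2, 2)

# Adding a dot after the look-behind matches the character that
# follows the "a", and the "a" itself stays out of the match.
after_a = re.search(r"(?<=a).", text).group(0)
print(after_a)  # b
```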
After the a character, we want to match any character. This is denoted by the dot symbol . which matches any character except a newline character. On its own, the dot symbol matches only a single character, so we add the zero-or-more quantifier * after it to match zero or more of any character.
/(?<=a).*/
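The difference between the dot alone and the dot with the * quantifier can be sketched in Python as follows. The sample strings are assumptions for illustration.

```python
import re

text = "a123\n456"  # hypothetical sample containing a newline

# A single dot matches exactly one character after the "a".
one_char = re.search(r"(?<=a).", text).group(0)
print(one_char)  # 1

# With *, the dot repeats up to (but not across) the newline.
rest_of_line = re.search(r"(?<=a).*", text).group(0)
print(rest_of_line)  # 123

# "Zero or more" also allows an empty match when nothing follows "a".
nothing_left = re.search(r"(?<=a).*", "a").group(0)
print(repr(nothing_left))  # ''
```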
We want to stop matching when we encounter a b character. This is specified by a positive look-ahead (?=…). This will ensure that the matched string is directly followed by whatever is in the place of ….
In this case, we use the character b inside the positive look-ahead:
/(?<=a).*(?=b)/
Finally, to return every instance of this match and not just the first, we include the global modifier g at the very end of the expression:
/(?<=a).*(?=b)/g
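Python has no g modifier, so as a rough equivalent this sketch compares re.search (first match only) with re.findall (every match). The two-line sample text is an assumption; because the dot does not cross newlines, each line yields its own match.

```python
import re

pattern = re.compile(r"(?<=a).*(?=b)")
text = "a1b\na2b"  # hypothetical two-line sample

# Comparable to leaving out the g modifier: only the first match.
first_only = pattern.search(text).group(0)
print(first_only)  # 1

# Comparable to adding the g modifier: every match in the text.
all_matches = pattern.findall(text)
print(all_matches)  # ['1', '2']
```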
Match All Characters Greedy vs. Lazy
The following expression will match as many characters between a and b as it can. This is because the zero-or-more quantifier * is greedy.
/(?<=a).*(?=b)/g
Applied to the text another baby bathtub, this produces a single match:
nother baby bathtu
Notice how the match runs past three b characters and only stops right before the last b.
However, if we add the lazy indicator ? after the zero-or-more quantifier, the quantifier becomes lazy and matches as few characters as possible.
/(?<=a).*?(?=b)/g
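On the same text, another baby bathtub, each match now stops before the nearest b. The first match is "nother " (with its trailing space), and a later match, "thtu", comes from bathtub.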
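The greedy and lazy behaviour can be compared with a short Python sketch. To keep the comparison simple, re.search is used on the sample text so that only the first match of each pattern is shown. The second sample, with parentheses as delimiters, is an assumption added for illustration.

```python
import re

text = "another baby bathtub"

# Greedy: runs to the last "b" in the text.
greedy_first = re.search(r"(?<=a).*(?=b)", text).group(0)
print(repr(greedy_first))  # 'nother baby bathtu'

# Lazy: stops right before the first "b" it can reach.
lazy_first = re.search(r"(?<=a).*?(?=b)", text).group(0)
print(repr(lazy_first))  # 'nother '

# Hypothetical sample with several delimited groups.
pairs = "(abc) and (def)"
greedy_all = re.findall(r"(?<=\().*(?=\))", pairs)
lazy_all = re.findall(r"(?<=\().*?(?=\))", pairs)
print(greedy_all)  # ['abc) and (def']
print(lazy_all)    # ['abc', 'def']
```

The lazy version is usually the one to reach for when the text contains several delimited groups and each should be returned separately.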
Regex Match All Including Newline Characters
The expression above will match all characters between the two specified characters, except the newline character. To include the newline character in the match, we have several options.
This can be done by including the dotall modifier s (also called the single-line modifier) at the end, which treats the entire input text as a single line and therefore also matches newline characters.
/(?<=a).*(?=b)/gs
Some flavours of regex allow turning on the dotall modifier inside the expression using (?s):
/(?s)(?<=a).*(?=b)/g
If the dotall modifier is not available in your flavour of regex, you can replace the dot symbol . with the character class [\s\S]. This matches all whitespace characters \s (which include spaces, tabs, newlines, etc.) and all non-whitespace characters \S (which include letters, numbers, punctuation, etc.).
/(?<=a)[\s\S]*(?=b)/g
The square brackets indicate that we can match any of the characters in any order, and the zero-or-more quantifier * works just as before.
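The three options can be compared in a Python sketch, where the dotall modifier is spelled re.DOTALL (or the inline flag (?s)). The two-line sample text is an assumption for illustration.

```python
import re

text = "a line one\nline two b"  # hypothetical sample spanning two lines

# Without dotall, the dot cannot cross the newline, so nothing matches.
plain = re.findall(r"(?<=a).*(?=b)", text)
print(plain)  # []

# Option 1: the dotall flag, the counterpart of the s modifier.
with_flag = re.findall(r"(?<=a).*(?=b)", text, flags=re.DOTALL)

# Option 2: the inline (?s) switch at the start of the pattern.
with_inline = re.findall(r"(?s)(?<=a).*(?=b)", text)

# Option 3: [\s\S] in place of the dot, with no flag needed.
with_class = re.findall(r"(?<=a)[\s\S]*(?=b)", text)

print(with_flag)  # [' line one\nline two ']
print(with_flag == with_inline == with_class)  # True
```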
Match All Between Two Characters Without Lookarounds
Some flavours of regex do not support look-aheads and look-behinds at all. In these cases, we can use the following expression.
/a(.*)b/g
Here we use the dot symbol . together with the zero-or-more quantifier * to match zero or more of any character. These are enclosed in parentheses () to capture the contents so that they can be returned for later use.
Finally, this entire expression is sandwiched between the two characters we want to have matched, a and b in this case.
Note that this expression will return the a and b together with the contents between them. However, the contents without a and b will be contained in the first capture group returned.
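The difference between the full match and the first capture group can be sketched in Python. The sample text is an assumption for illustration.

```python
import re

text = "xa123by"  # hypothetical sample text

m = re.search(r"a(.*)b", text)

# The full match includes both delimiters.
full_match = m.group(0)
print(full_match)  # a123b

# The first capture group holds only the contents between them.
contents = m.group(1)
print(contents)  # 123

# With a single capture group, re.findall returns the group contents
# directly rather than the full matches.
all_contents = re.findall(r"a(.*)b", text)
print(all_contents)  # ['123']
```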
All of the modifications above can be used on this expression. For example, newline characters can be included with:
/a([\s\S]*)b/g
Or the zero-or-more quantifier can be made lazy using the lazy indicator ?:
/a(.*?)b/g
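A short Python sketch of these variants, applied to the capture-group form. The sample strings are assumptions for illustration.

```python
import re

text = "a1b a2b"  # hypothetical sample with two delimited groups

# Greedy capture: one match spanning from the first "a" to the last "b".
greedy = re.findall(r"a(.*)b", text)
print(greedy)  # ['1b a2']

# Lazy capture: one match per delimited group.
lazy = re.findall(r"a(.*?)b", text)
print(lazy)  # ['1', '2']

# [\s\S] lets the captured contents span a newline.
multiline = re.findall(r"a([\s\S]*)b", "a1\n2b")
print(multiline)  # ['1\n2']
```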
Which Flags to Use
To extract all matches from the piece of text, and not just the first match, be sure to include the global modifier g at the end of the expression:
/(?<=a).*(?=b)/g
Since we are working with text here, you can also add the case-insensitive modifier i so that the delimiters a and b are matched regardless of their case.
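In Python the i modifier corresponds to re.IGNORECASE. The following sketch, using an assumed upper-case sample, shows its effect on the delimiters.

```python
import re

text = "A123B"  # hypothetical sample with upper-case delimiters

# Case-sensitive: lower-case "a" and "b" are not found in the text.
case_sensitive = re.findall(r"(?<=a).*(?=b)", text)
print(case_sensitive)  # []

# Case-insensitive: "A" and "B" now count as the delimiters.
case_insensitive = re.findall(r"(?<=a).*(?=b)", text, flags=re.IGNORECASE)
print(case_insensitive)  # ['123']
```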
Source: https://regexland.com/all-between-specified-characters/