C# Regex.Match
Regex. Patterns are everywhere. In text, we often discover, and must process, textual patterns. A regular expression describes a pattern of text to search for.
A class, Regex, handles regular expressions. We specify patterns as string arguments. Methods (like Match and Replace) are available.
Match.
This program introduces the Regex class. We use its constructor and the
Match method, and then handle the returned Match object.
Namespace: All these types are found in the System.Text.RegularExpressions namespace.
Pattern: The Regex uses a pattern that indicates one or more digits. The characters "55" match this pattern.
Success: The returned Match object has a bool property called Success. If it equals true, we found a match.
Based on:
.NET 4.5
Program that uses Match, Regex: C#
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex regex = new Regex(@"\d+");
        Match match = regex.Match("Dot 55 Perls");
        if (match.Success)
        {
            Console.WriteLine(match.Value);
        }
    }
}
Output
55
Static method.
Here we match parts of a string (a file name in a directory path). We
only accept ranges of characters and some punctuation. On Success, we
access the group.
Static: We use the Regex.Match static method. It is also possible to call Match upon a Regex object.
Success: We test the result of Match with the Success property. When true, a match occurred and we can access its Value or Groups.
Groups: Index 0 of this collection holds the entire match, so the first captured group is found at index 1, not zero. This is important to remember.
Program that uses Regex.Match: C#
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // First we see the input string.
        string input = "/content/alternate-1.aspx";

        // Here we call Regex.Match.
        Match match = Regex.Match(input, @"content/([A-Za-z0-9\-]+)\.aspx$",
            RegexOptions.IgnoreCase);

        // Here we check the Match instance.
        if (match.Success)
        {
            // Finally, we get the Group value and display it.
            string key = match.Groups[1].Value;
            Console.WriteLine(key);
        }
    }
}
Output
alternate-1
Pattern details
@"              This starts a verbatim string literal.
content/        The group must follow this string.
[A-Za-z0-9\-]+  One or more alphanumeric characters or hyphens.
(...)           A separate group.
\.aspx          This must come after the group.
$               Matches the end of the string.
NextMatch.
More than one match may be found. We can call the NextMatch method to
search for a match that comes after the current one in the text.
NextMatch can be used in a loop.
Here: We match all the digits in the input string (4 and 5). Two matches occur, so we use NextMatch to get the second one.
Return: NextMatch returns another Match object. It does not modify the current one, so we assign the result to a variable.
Program that uses NextMatch: C#
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string value = "4 AND 5";

        // Get first match.
        Match match = Regex.Match(value, @"\d");
        if (match.Success)
        {
            Console.WriteLine(match.Value);
        }

        // Get second match.
        match = match.NextMatch();
        if (match.Success)
        {
            Console.WriteLine(match.Value);
        }
    }
}
Output
4
5
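Since NextMatch can be used in a loop, the two separate checks above could be folded into a while-loop that runs until Success is false. The following is only a sketch: the NextMatchLoop class and its AllMatches method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class NextMatchLoop
{
    // Collects every match by calling NextMatch until Success is false.
    public static List<string> AllMatches(string input, string pattern)
    {
        List<string> results = new List<string>();
        Match match = Regex.Match(input, pattern);
        while (match.Success)
        {
            results.Add(match.Value);
            // NextMatch returns a new Match object, so we reassign.
            match = match.NextMatch();
        }
        return results;
    }
}

class Program
{
    static void Main()
    {
        // Prints 4, then 5, for the same input used above.
        foreach (string value in NextMatchLoop.AllMatches("4 AND 5", @"\d"))
        {
            Console.WriteLine(value);
        }
    }
}
```

The loop ends on its own when no further match exists, so the number of matches does not need to be known in advance.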
Preprocess.
Sometimes we can preprocess strings before using Match() on them. This
can be faster and clearer. Experiment. I found using ToLower to
normalize chars was a good choice.
Program that uses ToLower, Match: C#
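using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // This is the input string.
        string input = "/content/alternate-1.aspx";

        // Here we lowercase our input first.
        input = input.ToLower();

        // The input is lowercase now, so the pattern needs only a-z
        // and no RegexOptions.IgnoreCase argument.
        Match match = Regex.Match(input, @"content/([a-z0-9\-]+)\.aspx$");
        if (match.Success)
        {
            Console.WriteLine(match.Groups[1].Value);
        }
    }
}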
Static.
Often a Regex instance object is faster than the static Regex.Match.
For performance, we should usually use an instance object. It can be
shared throughout an entire project.
Sometimes: We only need to call Match once in a program's execution. A Regex object does not help here.
Class: Here a static class stores an instance Regex that can be used project-wide. We initialize it inline.
Program that uses static Regex: C#
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // The input string again.
        string input = "/content/alternate-1.aspx";

        // This calls the static method specified.
        Console.WriteLine(RegexUtil.MatchKey(input));
    }
}

static class RegexUtil
{
    static Regex _regex = new Regex(@"/content/([a-z0-9\-]+)\.aspx$");

    /// <summary>
    /// This returns the key that is matched within the input.
    /// </summary>
    static public string MatchKey(string input)
    {
        Match match = _regex.Match(input.ToLower());
        if (match.Success)
        {
            return match.Groups[1].Value;
        }
        else
        {
            return null;
        }
    }
}
Output
alternate-1
Numbers.
A common requirement is extracting a number from a string. We can do
this with Regex.Match. To get further numbers, consider Matches() or
NextMatch.
Digits: We extract a group of digit characters and access the Value string representation of that number.
Parse: To parse the number, use int.Parse or int.TryParse on the Value here. This will convert it to an int.
Program that matches numbers: C#
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // ... Input string.
        string input = "Dot Net 100 Perls";

        // ... One or more digits.
        Match m = Regex.Match(input, @"\d+");

        // ... Write value.
        Console.WriteLine(m.Value);
    }
}
Output
100
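The parsing step mentioned above could look like the following sketch. It assumes int.TryParse is preferred so that a missing or overly long number does not throw; the NumberExtractor class and TryGetFirstNumber method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class NumberExtractor
{
    // Returns true and sets number when a run of digits is found and parses as an int.
    public static bool TryGetFirstNumber(string input, out int number)
    {
        number = 0;
        Match m = Regex.Match(input, @"\d+");
        return m.Success && int.TryParse(m.Value, out number);
    }
}

class Program
{
    static void Main()
    {
        int number;
        if (NumberExtractor.TryGetFirstNumber("Dot Net 100 Perls", out number))
        {
            // The value is now an int, so arithmetic works on it.
            Console.WriteLine(number + 1);
        }
    }
}
```

With this input the program prints 101, which shows the match was converted from a string to an int.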
Value, length, index.
A Match object, returned by Regex.Match, has a Value, Length and Index.
These describe the matched text (a substring of the input).
Value: This is the matched text, represented as a separate string. This is a substring of the original input.
Length: This is the length of the Value string. Here, the Length of "Axxxxy" is 6.
Index: The index where the matched text begins within the input string. The character "A" starts at index 4 here.
Program that shows value, length, index: C#
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Match m = Regex.Match("123 Axxxxy", @"A.*y");
        if (m.Success)
        {
            Console.WriteLine("Value  = " + m.Value);
            Console.WriteLine("Length = " + m.Length);
            Console.WriteLine("Index  = " + m.Index);
        }
    }
}
Output
Value = Axxxxy
Length = 6
Index = 4
IsMatch, Matches.
In the .NET Framework, there are some other matching methods. Matches()
returns multiple Match objects at once. And IsMatch tells us whether a
match exists.
Star: The star is also known as a Kleene closure in language theory. It matches zero or more repetitions, while the plus requires at least one. It is important to know the difference between the star and the plus.
Words: With Regex we can count words in strings. We compare this method with Microsoft Word's implementation.
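As a rough illustration of the methods in this section, the sketch below calls IsMatch and Matches on one input, and then contrasts the star with the plus on an empty string. The MatchDemo class and its CountMatches method are hypothetical names, and the input strings are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class MatchDemo
{
    // Matches returns a MatchCollection; Count tells us how many matches exist.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }
}

class Program
{
    static void Main()
    {
        string input = "10 apples, 20 pears";

        // IsMatch only reports whether at least one match exists.
        Console.WriteLine(Regex.IsMatch(input, @"\d+"));

        // Matches returns all the Match objects at once.
        foreach (Match m in Regex.Matches(input, @"\d+"))
        {
            Console.WriteLine(m.Value);
        }
        Console.WriteLine(MatchDemo.CountMatches(input, @"\d+"));

        // Star accepts zero digits, so the empty string matches.
        Console.WriteLine(Regex.IsMatch("", @"^\d*$"));
        // Plus requires at least one digit, so the empty string does not match.
        Console.WriteLine(Regex.IsMatch("", @"^\d+$"));
    }
}
```

IsMatch is the natural choice when only a yes-or-no answer is needed, because no Match objects have to be inspected.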
Replace.
Sometimes we need to replace a pattern of text with some other text.
Regex.Replace helps. We can replace patterns with a string, or with a
value determined by a MatchEvaluator.
Replace: We use the Replace method, with strings and MatchEvaluators, to replace text. We replace spaces, numbers and position-based parts.
Spaces: Whitespace isn't actually white. But it is often not needed for future processing of data.
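The two forms of Regex.Replace described here, one with a replacement string and one with a MatchEvaluator, might be sketched as follows. The ReplaceDemo class, its method names and the sample inputs are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class ReplaceDemo
{
    // String replacement: each run of whitespace becomes a single space.
    public static string CollapseSpaces(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }

    // MatchEvaluator replacement: the lambda computes the text for each match.
    // Assumes the numbers are small enough for int.Parse.
    public static string DoubleNumbers(string input)
    {
        return Regex.Replace(input, @"\d+",
            m => (int.Parse(m.Value) * 2).ToString());
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(ReplaceDemo.CollapseSpaces("a   b \t c"));
        Console.WriteLine(ReplaceDemo.DoubleNumbers("2 and 10"));
    }
}
```

The first call prints "a b c" and the second prints "4 and 20". A MatchEvaluator is useful whenever the replacement depends on the matched text itself.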
Split.
Do you need to extract substrings that contain only certain characters
(certain digits, letters)? Split() separates the input at each match of
a delimiter pattern and returns a string array that contains the
substrings in between.
Numbers: We can handle certain character types, such as numbers, with the Split method. This is powerful. It handles many variations.
Caution: The Split method in Regex is more powerful than the one on the string type. But it may be slower in common cases.
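One way to pull the numbers out of a string with Split is to treat every run of non-digit characters as the delimiter. This is a sketch under that assumption; the SplitDemo class and DigitRuns method are hypothetical names, and the empty entries that Split can produce at the ends are filtered out by hand.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class SplitDemo
{
    // Splits on runs of non-digits (\D+) and keeps the non-empty pieces.
    public static List<string> DigitRuns(string input)
    {
        List<string> runs = new List<string>();
        foreach (string part in Regex.Split(input, @"\D+"))
        {
            // Split yields an empty string when the input starts or ends with a delimiter.
            if (part.Length > 0)
            {
                runs.Add(part);
            }
        }
        return runs;
    }
}

class Program
{
    static void Main()
    {
        // Prints 10, 20 and 40 on separate lines.
        foreach (string run in SplitDemo.DigitRuns("10 cats, 20 dogs, 40 fish"))
        {
            Console.WriteLine(run);
        }
    }
}
```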
Escape.
This method can change a user input to a valid Regex pattern. It
assumes no metacharacters were intended. The input string should be only
literal characters.
Note: With Escape, we don't get out of jail free, but we do change the representation of certain characters in a string.
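To show why escaping matters, the sketch below searches for a user-supplied string that happens to contain metacharacters. The EscapeDemo class, the ContainsLiteral method and the sample strings are hypothetical and only serve the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class EscapeDemo
{
    // Treats the literal argument as plain text by escaping it first.
    public static bool ContainsLiteral(string text, string literal)
    {
        return Regex.IsMatch(text, Regex.Escape(literal));
    }
}

class Program
{
    static void Main()
    {
        string userInput = "1.5+2";

        // Escape places a backslash before the period and the plus.
        Console.WriteLine(Regex.Escape(userInput));

        // Unescaped, the period matches any character and the plus repeats the 5.
        Console.WriteLine(Regex.IsMatch("1x5552", userInput));

        // Escaped, only the literal characters match.
        Console.WriteLine(EscapeDemo.ContainsLiteral("1x5552", userInput));
        Console.WriteLine(EscapeDemo.ContainsLiteral("total: 1.5+2", userInput));
    }
}
```

The program prints 1\.5\+2, then True, False and True. The second line is the accidental match that Escape prevents.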
Unescape.
The term "unescape" means to do the reverse of escape. It returns
character representations to a non-escaped form. This method is rarely
useful.
Files.
We often need to process text files. The Regex type, and its methods,
are used for this. But we need to combine a file input type, like
StreamReader, with the Regex code.
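A minimal sketch of that combination follows. It reads a file line by line with StreamReader and tests each line with a cached Regex. The FileMatcher class and CountLinesWithDigits method are hypothetical names, and the program writes its own temporary file so that it can run anywhere.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class FileMatcher
{
    static readonly Regex _digits = new Regex(@"\d+");

    // Counts the lines in the file that contain at least one digit.
    public static int CountLinesWithDigits(string path)
    {
        int count = 0;
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (_digits.IsMatch(line))
                {
                    count++;
                }
            }
        }
        return count;
    }
}

class Program
{
    static void Main()
    {
        // Create a small temporary file for the demonstration.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new string[] { "no numbers", "line 42", "7 items" });

        // Two of the three lines contain digits.
        Console.WriteLine(FileMatcher.CountLinesWithDigits(path));

        File.Delete(path);
    }
}
```

Reading one line at a time keeps memory use low on large files, and the Regex instance is created only once.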
HTML.
Regex can be used to process or extract parts of HTML strings. There
are problems with this approach. But it works in many situations.
Title, P: We focus on title and P elements. These are common tags in HTML pages.
Remove HTML: We also remove all HTML tags. Please be cautious with this approach. It does not work on many HTML pages.
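For simple, well-formed pages, extracting the title element might look like the sketch below. This is not a general HTML parser: it assumes a single title element with no nested markup, and the HtmlTitle class and GetTitle method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class HtmlTitle
{
    // Returns the text inside the first title element, or null if none is found.
    public static string GetTitle(string html)
    {
        // IgnoreCase accepts <TITLE>; Singleline lets the dot match newlines.
        Match m = Regex.Match(html, @"<title\b[^>]*>\s*(.*?)\s*</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }
}

class Program
{
    static void Main()
    {
        string html = "<html><head><TITLE> Dot Net Perls </TITLE></head></html>";
        Console.WriteLine(HtmlTitle.GetTitle(html));
    }
}
```

The program prints "Dot Net Perls". Pages with comments, scripts or malformed tags can defeat a pattern like this one, which is the reason for the caution above.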
RegexOptions.
With the Regex type, the RegexOptions enum is used to modify method behavior. Often I find the IgnoreCase value helpful.
IgnoreCase: Lowercase and uppercase letters are distinct in the Regex text language. IgnoreCase changes this.
Multiline: We can change how the Regex type acts upon newlines with the RegexOptions enum. This is often useful.
Program that uses RegexOptions.IgnoreCase: C#
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string value = "TEST";

        // ... This ignores the case of the "TE" characters.
        if (Regex.IsMatch(value, "te..", RegexOptions.IgnoreCase))
        {
            Console.WriteLine(true);
        }
    }
}
Output
True
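The Multiline option mentioned in this section can be sketched in a similar way. With it, the ^ anchor matches at the start of every line rather than only at the start of the whole string. The MultilineDemo class, the CountLineStarts method and the two-line input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class MultilineDemo
{
    // Counts how many times a word appears directly after the ^ anchor.
    public static int CountLineStarts(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"^\w+", options).Count;
    }
}

class Program
{
    static void Main()
    {
        string input = "first line\nsecond line";

        // Without Multiline, ^ matches only at the very beginning: 1 match.
        Console.WriteLine(MultilineDemo.CountLineStarts(input, RegexOptions.None));

        // With Multiline, ^ also matches after each newline: 2 matches.
        Console.WriteLine(MultilineDemo.CountLineStarts(input, RegexOptions.Multiline));
    }
}
```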
Is Regex fast?
This question is a topic of great worldwide concern. Sadly Regex often
results in slower code than imperative loops. But we can optimize Regex
usage.
1. Compile. Using the RegexOptions.Compiled argument to a Regex instance will make it execute faster. This however has a startup penalty.
2. Replace with loop. Some Regex method calls can be replaced with a loop. The loop is often much faster.
3. Use static fields. You can cache a Regex instance as a static field. An example is provided above.
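The three tips can be combined in one small sketch: a compiled Regex cached in a static field, next to a hand-written loop that performs the same check. The DigitCheck class and its method names are hypothetical, and no timings are claimed here; measure on your own data before choosing one.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original programs.
static class DigitCheck
{
    // Tips 1 and 3: compiled once and cached in a static field.
    // \z anchors at the true end of the string, so a trailing newline is rejected.
    static readonly Regex _allDigits =
        new Regex(@"^[0-9]+\z", RegexOptions.Compiled);

    public static bool IsAllDigitsRegex(string s)
    {
        return _allDigits.IsMatch(s);
    }

    // Tip 2: the same test written as a loop over the characters.
    public static bool IsAllDigitsLoop(string s)
    {
        if (s.Length == 0)
        {
            return false;
        }
        foreach (char c in s)
        {
            if (c < '0' || c > '9')
            {
                return false;
            }
        }
        return true;
    }
}

class Program
{
    static void Main()
    {
        // Both methods agree on each input: True, True, False, False.
        Console.WriteLine(DigitCheck.IsAllDigitsRegex("12345"));
        Console.WriteLine(DigitCheck.IsAllDigitsLoop("12345"));
        Console.WriteLine(DigitCheck.IsAllDigitsRegex("12a45"));
        Console.WriteLine(DigitCheck.IsAllDigitsLoop("12a45"));
    }
}
```

The loop is longer to write and read, so it tends to pay off only for simple patterns on hot paths.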
Research.
A regular expression can describe any "regular" language. These are
languages that can be recognized with a finite number of states: the
matcher needs only a limited amount of memory, however long the input is.
Caution: Some languages, like HTML, are not regular languages. This means you cannot fully parse them with traditional regular expressions.
Automaton: A regular expression is based on finite state machines. These automata encode states and possible transitions to new states.
Operators.
Regular expressions use compiler theory. With a compiler, we transform
regular expressions (like Regex patterns) into tiny programs that mess
with text.
These expressions are commonly used to describe patterns. Regular
expressions are built from single characters, using union,
concatenation, and the Kleene closure, or any-number-of, operator.
Compilers: Principles, Techniques and Tools
A summary.
Regular expressions are a concise way to process text data. This comes
at a cost. For performance, we can rewrite Regex calls with low-level
char methods.