Regular Expressions, Part III (Phone Number – basic)

As I promised, here is another discussion about regular expressions, and more specifically, phone numbers. In the first part of this, I’m going to discuss the HackerRank Python Regex problem on validating phone numbers with regular expressions, but as it’s a very simplistic case, I’m going to expand on it.

Because it’s pretty simplistic, I’m simply going to show you the code for the HackerRank problem.

^[789]\d{9}$

Since it’s so simplistic, I figure I’ll just go over the whole thing. First, notice the ^ and $ anchors. Again, these pin the pattern to the start and the end of the input, so the match has to cover the whole thing rather than just a piece of it. I’ll come back to why that matters once we’ve seen the other pieces…

The next piece is [789]. This simply tells us that we are looking for a 7, 8, or 9 as the first digit.

Next is \d{9}, which has 2 parts: the first, \d, means we are looking for any digit, and the second, {9}, means we are looking for exactly 9 of them.

All that combined, along with the anchors I mentioned before, means that the input we are using the regex on has to be exactly 10 digits, starting with a 7, 8, or 9. Nothing else will generate a match.
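Here is a minimal sketch of that pattern in Python, using the standard re module. The function name is_valid_mobile is my own, made up for illustration; it isn’t something HackerRank supplies.

```python
import re

# The pattern from above: a 7, 8, or 9 followed by exactly nine more digits.
PATTERN = re.compile(r'^[789]\d{9}$')

def is_valid_mobile(text):
    """Return True if text is 10 digits starting with 7, 8, or 9."""
    return PATTERN.match(text) is not None

print(is_valid_mobile('9876543210'))   # True
print(is_valid_mobile('1234567890'))   # False: starts with 1
print(is_valid_mobile('98765432100'))  # False: 11 digits
print(is_valid_mobile('987654321'))    # False: only 9 digits
```

One Python quirk worth knowing: $ will also match just before a single trailing newline, so if stray newlines matter to you, re.fullmatch is the stricter choice.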

Now, suppose we didn’t care about starting with a 7, 8, or 9. The first thing we would do is drop the [789] and change {9} to {10}. Simple enough!
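As a quick sketch of that change in Python (the name ANY_TEN_DIGITS is just something I picked for the example):

```python
import re

# Any 10 digits at all: [789] dropped, {9} bumped up to {10}.
ANY_TEN_DIGITS = re.compile(r'^\d{10}$')

for candidate in ['1234567890', '9876543210', '123456789', '12345678901']:
    # Prints the candidate and whether the whole thing is exactly 10 digits.
    print(candidate, bool(ANY_TEN_DIGITS.match(candidate)))
```

The first two candidates should match, while the 9-digit and 11-digit ones should not.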

However, let’s make this a bit more realistic. The pattern so far pays no attention to formatting… it doesn’t capture the real rules for a North American phone number. So let’s set those rules out (and from here on, I’m simply going to refer to North American phone numbers as phone numbers).

A little education for us here. Phone numbers are divided into 3 sections. The first 3 digits are referred to as the area code. The next 3 digits are referred to as the Central Office code, frequently called the exchange, and then the last 4 digits are the subscriber number. Now there are some rules for the area code and the exchange.

For both the area code and exchange, the first digit must be between 2 and 9 (thus no 0’s or 1’s). The next 2 digits can be anything; however, for the exchange, they can’t both be 1 (as those are for special services such as 911 or 411, etc). For simplicity, I’m going to leave it at that, and not worry about the extra formatting or extensions and such… this is more for a basic understanding…

We know we will use our anchors still, so ^$ go in immediately. Also, the last 4 digits are simple as well. I’m going to use parentheses in the regex to simply group everything, so we know which sections are for which. We know that \d{4} will work for our last four digits, so we’ll add those.

So we currently have ^(\d{4})$. Now, the area code is pretty simple. It’s 3 digits, but the first one can’t be 0 or 1, so that means [2-9]\d{2}, so we’ll add that in the front, giving us ^([2-9]\d{2})(\d{4})$. Now comes the hard part.
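Just to see the grouping at work, here is a small Python sketch of the pattern as it stands right now. Keep in mind it is only a work in progress: with no exchange yet, it matches 7 digits (area code plus subscriber number). The name PARTIAL and the sample numbers are mine.

```python
import re

# Work in progress: area code and subscriber number only, no exchange yet.
PARTIAL = re.compile(r'^([2-9]\d{2})(\d{4})$')

match = PARTIAL.match('5126789')
print(match.groups())  # ('512', '6789')

# An area code starting with 0 or 1 is rejected.
print(PARTIAL.match('1126789'))  # None
```

Each pair of parentheses becomes its own group, which is what lets us pull the sections back out afterwards.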

The exchange is a bit more difficult, because it can’t be #11 (where # stands for any digit). It can be #1#, or ##1, so it has to be broken down into multiple pieces.
The first piece is if the 2nd digit is a 1. That would be [2-9]1[02-9]. Remember the first digit can’t be 0 or 1 (always, from the above rule). We already stated that in this piece, the second digit is 1, which means the third digit can’t be 1, thus the [02-9].
Now the next piece will be the 3rd digit being 1. The nice thing is all the other rules apply, so all we need to do is switch the 2nd and 3rd numbers, giving us [2-9][02-9]1.
Now the last piece is the one that makes logical sense. If we’ve covered the case where the 2nd digit is 1, and the case where the 3rd digit is 1, then all we need to do is cover the case where neither digit is 1. We already have that rule [02-9], so all we do is put that in both spots, giving us [2-9][02-9][02-9]. Those are all 3 pieces. Now we just put them together.

If you remember back in the Roman Numerals post, we talked about alternation, or the pipe |. That is basically a way to say this or that, and would look like this|that. All we have to do is combine the 3 pieces, putting that alternation between each piece, giving us [2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9][02-9].
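If you want some reassurance that the three pieces really cover everything except #11, one way is to brute-force all 1000 three-digit codes in Python and compare the regex against the rule written in plain code. This is only a sketch; EXCHANGE and rule_allows are names I made up for it.

```python
import re

# The three alternatives for the exchange, joined with the pipe.
EXCHANGE = re.compile(r'[2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9][02-9]')

def rule_allows(code):
    """The rule in plain Python: first digit 2-9, last two digits not '11'."""
    return code[0] in '23456789' and code[1:] != '11'

# Collect every 3-digit code where the regex and the plain rule disagree.
mismatches = [
    code
    for code in (f'{n:03d}' for n in range(1000))
    if bool(EXCHANGE.fullmatch(code)) != rule_allows(code)
]
print(mismatches)  # []
```

I used fullmatch here because the bare alternation has no anchors of its own yet; those come back when it is dropped into the full pattern.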

Which gives us our final result of ^([2-9]\d{2})([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9][02-9])(\d{4})$. All done.

For the record, I tested the above on regexr.com, adding the following phone numbers: 5123456789 and 5123114567. It only matched the first one! Yay!
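The same check can be reproduced in Python with the re module. This is just a sketch of how I’d do it; the name PHONE is mine, and the pattern is split across lines only to keep it readable.

```python
import re

# The final pattern: area code, exchange, subscriber number.
PHONE = re.compile(
    r'^([2-9]\d{2})'
    r'([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9][02-9])'
    r'(\d{4})$'
)

for number in ['5123456789', '5123114567']:
    match = PHONE.match(number)
    print(number, match.groups() if match else 'no match')
# 5123456789 ('512', '345', '6789')
# 5123114567 no match
```

The second number fails because its exchange, 311, ends in 11.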

Until next time!

Never B Flat, Sometimes B Sharp, Always B Natural