Regular Expressions, Part III (Phone Number – basic)

As I promised, here is another discussion about regular expressions, and more specifically, phone numbers. In the first part of this, I’m going to discuss the HackerRank Python Regex problem on validating phone numbers with regular expressions, but as it’s a very simplistic case, I’m going to expand on it.

Because it’s pretty simplistic, I’m simply going to show you the code for the HackerRank problem.

^[789]\d{9}$

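Since it’s so simple, I figure I’ll just go over the whole thing. First, notice the ^ and $ anchors. As in the Roman numerals post, ^ pins the match to the start of the input and $ pins it to the end, so the pattern has to account for the entire string and not just a part of it. I’ll come back to why that matters in a moment…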

The next piece is [789]. This simply tells us that we are looking for a 7, 8, or 9 as the first digit.

Next is \d{9} which is 2 parts, the first \d meaning we are looking for any digit, and the second {9} meaning we are looking for exactly 9 of them.

All that combined, along with the anchors I mentioned before, means that the input we are using regex on must be exactly 10 digits, starting with a 7, 8, or 9. Nothing else will generate a match.
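To see what the anchors actually buy us, here is a small Python sketch. The function name is_valid_mobile and the sample strings are my own inventions for illustration, and the unanchored pattern is there only for comparison.

```python
import re

# The pattern from this post: a 7, 8, or 9 followed by exactly nine more digits.
ANCHORED = re.compile(r"^[789]\d{9}$")
# The same pattern with the anchors removed, for comparison only.
UNANCHORED = re.compile(r"[789]\d{9}")


def is_valid_mobile(text):
    """Return True when the anchored pattern matches the input (hypothetical helper)."""
    return ANCHORED.search(text) is not None


for sample in ["9876543210", "1234567890", "98765432100", "call 9876543210 now"]:
    # Print the sample, the anchored result, and the unanchored result.
    print(sample, is_valid_mobile(sample), UNANCHORED.search(sample) is not None)
```

Without the anchors, the last two samples would count as matches because a valid-looking run of digits sits somewhere inside them. One Python detail to keep in mind: $ also matches just before a trailing newline, so re.fullmatch is the stricter choice if that matters for your input.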

Now, suppose we didn’t care about starting with a 7, 8, or 9. The first thing we would do is drop the [789] and change {9} to {10}. Simple enough!

However, let’s make this a bit more realistic. The pattern so far makes no allowance for formatting rules… it doesn’t capture the real rules for a North American phone number. So let’s set those rules out (and from here on, I’m simply going to refer to North American phone numbers as phone numbers).

A little education for us here. Phone numbers are divided into 3 sections. The first 3 digits are referred to as the area code. The next 3 digits are referred to as the Central Office code, frequently called the exchange, and then the last 4 digits are the subscriber number. Now there are some rules for the area code and the exchange.

For both the area code and exchange, the first digit must be between 2 and 9 (thus no 0’s or 1’s). The next 2 digits can be anything, however, for the exchange, they can’t be 1 and 1 (as those are for special services such as 911 or 411, etc). For simplicity, I’m going to leave it at that, and not worry about the extra formatting or extensions and such… this is more for a basic understanding…

We know we will use our anchors still, so ^$ go in immediately. Also, the last 4 digits are simple as well. I’m going to use parentheses in the regex to simply group everything, so we know which sections are for which. We know that \d{4} will work for our last four digits, so we’ll add those.

So we currently have ^(\d{4})$. Now, the area code is pretty simple. It’s 3 digits, but can’t be 0 or 1, so that means [2-9]\d{2}, so we’ll add that in the front, giving us ^([2-9]\d{2})(\d{4})$. Now comes the hard part.

The exchange is a bit more difficult, because it can’t be #11. It can be #1#, or ##1, so it has to be broken down into multiple pieces.
The first piece is if the 2nd digit is a 1. That would be [2-9]1[02-9]. Remember the first digit can’t be 0 or 1 (always, from the above rule). We already stated that in this piece, the second digit is 1, which means the third digit can’t be 1, thus the [02-9].
Now the next piece will be the 3rd digit being 1. The nice thing is all the other rules apply, so all we need to do is switch the 2nd and 3rd numbers, giving us [2-9][02-9]1.
Now the last piece is the one that makes logical sense. If we’ve covered the case where the 2nd digit is 1, and the case where the 3rd digit is 1, then all we need to do is cover the case where neither digit is 1. We already have that rule [02-9], so all we do is put that in both spots, giving us [2-9][02-9][02-9]. There are all 3 pieces. Now we just put them together.

If you remember back in the Roman Numerals post, we talked about alternation, or the pipe |. That is basically a way to say this or that, and would look like this|that. All we have to do is combine the 3 pieces, putting that alternation between each piece, giving us [2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9][02-9].
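Since there are only 1000 possible three-digit codes, the three-piece alternation can be checked against the plain-English rule by brute force. This is just a sketch: the helper names are made up, and allowed_by_rules encodes only the two rules described in this post (first digit 2-9, last two digits not 11).

```python
import re

# The three alternatives for the exchange, joined with alternation.
EXCHANGE = re.compile(r"[2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9][02-9]")


def matches_exchange(code):
    """True when the entire 3-character string matches the alternation."""
    return EXCHANGE.fullmatch(code) is not None


def allowed_by_rules(code):
    """The rule from the post, written without regex (hypothetical helper)."""
    return code[0] in "23456789" and code[1:] != "11"


all_codes = [f"{n:03d}" for n in range(1000)]
# Codes where the regex and the plain rule disagree.
mismatches = [c for c in all_codes if matches_exchange(c) != allowed_by_rules(c)]
accepted = [c for c in all_codes if matches_exchange(c)]
print(len(accepted), mismatches)
```

If the list of mismatches comes back empty, the three pieces cover every allowed exchange and nothing else.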

Which gives us our final result of ^([2-9]\d{2})([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9][02-9])(\d{4})$. All done.

For the record, I tested the above on regexr.com, adding the following phone numbers: 5123456789 and 5123114567. It only matched the first one! Yay!
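The same check can be sketched in Python, which also shows what the parentheses are for. The function name split_phone is my own, not anything from HackerRank or regexr.

```python
import re

PHONE = re.compile(
    r"^([2-9]\d{2})"                                  # area code
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9][02-9])"  # exchange, never #11
    r"(\d{4})$"                                       # subscriber number
)


def split_phone(number):
    """Return (area, exchange, subscriber), or None when the number is rejected."""
    match = PHONE.match(number)
    return match.groups() if match else None


print(split_phone("5123456789"))  # ('512', '345', '6789')
print(split_phone("5123114567"))  # None, because the exchange is 311
```

Each set of parentheses becomes one entry in the tuple, so the grouping gives us the three sections for free.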

Until next time!

Never B Flat, Sometimes B Sharp, Always B Natural

Regular Expressions, Part II (Roman Numerals)

So, lately with HackerRank, I’ve been going over their regex section in Python. I thought I’d review some of what I’ve done, for my own sake to review later on down the road!

Here are some of the regex problems I’ve solved… For the record, I’ve modified the original post of this to what it is now – before, I just listed the regex expression, and what the whole expression did… Now, however, I’m explaining what each bit does for my own understanding! I’m also reducing how many I do per post so there will be more posts about it as I go into a deeper detail for them…

The one that I will be discussing today is a regular expression to validate roman numerals.

For roman numerals, the expression is:
^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$

We start off with ^ which is called an anchor and it is used to say we are starting from the beginning of the string we are comparing…

Next is M{0,3} which tells us we are expecting the letter M between 0 and 3 times. Which covers any thousands digit.

The next piece is (CM|CD|D?C{0,3}) and it involves our first grouping. The parentheses define our grouping, which can be used later to pull out this specific information. Inside, we have letters between pipes (|), which is called alternation. It’s saying we are next looking for CM or CD or D?C{0,3}. In that last alternative, the question mark means that the D is optional, and the {0,3} applies to the C, which, like before, means we are looking for 0-3 of them… so it matches an optional D followed by up to 3 C’s. That whole grouping says we are looking for CM (900), CD (400), or an optional D (500) with up to 3 C’s (100 each)… which covers the hundreds portion of any number over 99.

The piece after this is (XC|XL|L?X{0,3}) and it is also a grouping (which again, we can pull from later). This section covers our 10’s digit, and just like the last grouping, is looking for XC (90) or XL (40) or L (50) and/or up to 3 X’s (10 each).

The last piece is (IX|IV|V?I{0,3})$ and covers our 1’s digit. It’s also grouped so we can pull out this information later. Just like the prior 2 groups, it’s looking for IX (9) or IV (4) or V (5) and/or up to 3 I’s (1 each). The last piece, the $, is referring to the end of the line.

So if we put this all together, we are saying in the line we pass in, if there is a character that isn’t an M, D, C, L, X, V or I, this will fail. Also, if proper characters aren’t in the proper order, this will fail, and if there is anything after the lowest number, it will fail.
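Here is a quick Python sketch of pulling the groups back out. The function name roman_parts and the sample numerals are my own choices for illustration.

```python
import re

ROMAN = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")


def roman_parts(text):
    """Return the (hundreds, tens, ones) groups, or None if the text is rejected."""
    match = ROMAN.match(text)
    return match.groups() if match else None


print(roman_parts("MCMXCIV"))  # ('CM', 'XC', 'IV')
print(roman_parts("MMXXIV"))   # ('', 'XX', 'IV')
print(roman_parts("IIII"))     # None
print(roman_parts("IC"))       # None
```

The M’s are not inside parentheses, so they don’t show up in the groups. Also worth knowing: every piece of the expression is allowed to match nothing, so an empty string passes too. If that matters, check for an empty input separately.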

All in all, a pretty neat expression!

Next time, I’ll go over phone numbers.

Never B Flat, Sometimes B Sharp, Always B Natural

PyCharm… and my regex learning…

So I installed PyCharm today… I know I’m a VIM user, and I could use VIM to write my Python code very easily, but I’ve learned, both from past experience (i.e. an older job where I wrote PHP via VIM) and my current experience (i.e. my current job where I write JavaScript/HTML5 via Visual Studio) that there are some features that an IDE provides that are incredibly useful.

For me, 2 of the most useful features are the debugger (being able to see my Python results immediately, and having clues about bad Pythonic form) and being able to jump around based on a function (meaning if I’m making a function call in the code, I can highlight the function call itself, press a key, and jump to the definition of the function)… very useful!

I’m sure there are lots of different ways VIM can do things as well, but so far, I’m happy with PyCharm and will continue to use it!

On a different note, I’m working on learning regex. Regex is one of those things that has popped up now and then since I took programming in college, and I’ve really only seen it as something very minor on the periphery of coding… But I’m realizing its usefulness more and more, so I started working on it…

My first big breakthrough was on the gaming program I’m working on… Of course, when making an RPG game manager, you are going to need dice rolled, and I know that Python has the randint function to simply give me a single random die roll, but I wanted to be able to handle all kinds of die situations… one of them was for rolling stats for an NPC… so I would pass in ‘6@4d6!1’ into the die roll parser… (which basically means roll 4d6, drop the lowest result, and do that 6 times, giving me all 6 individual results)

In my JavaScript version, I was using a complex set of ifs while looping through each character in the string, to determine if there was an @ symbol, etc… Then I learned about grouping in regex, and was able to use ‘(\d*)@*(\d+)[Dd](\d+)([\+\-\*/!\^]*)(\d*)’ to group everything I needed in 1 statement… now I had groups that had the values I needed instead of trying to parse individual characters from the string…
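As a rough sketch of what those groups look like in Python: the function parse_dice, the dictionary keys, the default of 1 for a missing repeat count, and the use of fullmatch are all my assumptions for this example, not the code from the actual game manager. It only parses the string and does not roll anything.

```python
import re

# The grouping expression from above, as a raw string.
DICE = re.compile(r"(\d*)@*(\d+)[Dd](\d+)([\+\-\*/!\^]*)(\d*)")


def parse_dice(spec):
    """Split a dice spec such as '6@4d6!1' into named parts (hypothetical helper)."""
    match = DICE.fullmatch(spec)
    if match is None:
        return None
    repeat, count, sides, operator, operand = match.groups()
    return {
        "repeat": int(repeat) if repeat else 1,      # assumed default: roll once
        "count": int(count),                         # number of dice
        "sides": int(sides),                         # sides per die
        "operator": operator,                        # e.g. '!' or '+', may be ''
        "operand": int(operand) if operand else None,
    }


print(parse_dice("6@4d6!1"))
print(parse_dice("2d10+3"))
```

One quirk I noticed while playing with it: because the @ is optional, a spec like ‘12d6’ with no @ gets split as 1 in the first group and 2 in the second, so specs without an @ are only safe with a single-digit dice count.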

I’ve seen the light!

Whole Lotta Love – Led Zeppelin
Never B Flat, Sometimes B Sharp, Always B Natural