Regular Expressions, Part II (Roman Numerals)

So, lately with HackerRank, I’ve been going over their regex section in python. I thought I’d review some of what I’ve done, for my own sake to review later on down the road!

Here are some of the regex problems I’ve solved… For the record, I’ve modified the original post of this to what it is now – before, I just listed the regex expression, and what the whole expression did… Now, however, I’m explaining what each bit does for my own understanding! I’m also reducing how many I do per post so there will be more posts about it as I go into a deeper detail for them…

The one that I will be discussing today is a regular expression to validate roman numerals.

For roman numerals, the expression is:
^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$

We start of with ^ which is called an anchor and it is used to say we are starting from the beginning of the string we are comparing…

Next is M{0,3} which tells us we are expecting the letter M between 0 and 3 times. Which covers any thousands digit.

The next piece is (CM|CD|D?C{0,3}) and it involves our first grouping. The parenthesis defines our grouping which can be used to later to pull out this specific information. We next have letters between pipes (|), which is called alternation. It’s saying we are next looking for CM or CD or D?C, which the question mark here means that the D is optional – meaning we are looking for C alone, or DC together, and that is followed by {0,3} which, like before, means we are looking for 0-3 of… so that whole grouping says we are for CM (900), CD (400), or D (500) with up to 3 C’s… which covers hundreds portion of any number over 99.

The piece after this is (XC|XL|L?X{0,3}) and it is also a grouping (which again, we can pull from later). This section covers our 10’s digit, and just like the last grouping, is looking for XC (90) or XL (40) or L (50) and/or up to 3 X’s (10 each).

The last piece is (IX|IV|V?I{0,3})$ and covers our 1’s digit. It’s also grouped so we can pull out this information later. Just like the prior 2 groups, it’s looking for IX (9) or IV (4) or V (5) and/or up to 3 I’s (1 each). The last piece, the $, is referring to the end of the line.

So if we put this all together, we are saying in the line we pass in, if there is a character that isn’t an M, D, C, L, X or I, this will fail. Also, if proper characters aren’t in the proper order, this will fail, and if there is anything after the lowest number, it will fail.

All in all, a pretty neat expression!

Next time, I’ll go over phone numbers.

Never B Flat, Sometimes B Sharp, Always B Natural