Regular expressions for string parsing
It is always a tedious job to parse a string when you need to search or replace within really long string data. At such times, we need to find the special case or position of the data we want and then write conditional loops to get at it. It is, however, possible to parse a string much more easily by finding a general expression that describes the data together with the elements around it. Even I wasn’t aware of this until I completed the course “Using python to access web data” on Coursera. I recommend following the course whenever possible, since it covers basic and advanced operations on web services using the powerful language Python. You can refer to the official website, which itself contains loads of tutorials, but I would recommend this set of slides, which is sufficient for daily needs like parsing lengthy HTML or PHP.
Coming to regular expressions, they are a small language in their own right. In computing, a regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means of matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.
Important aspects:
- Before you can use regular expressions in a program, you must import the library using “import re”.
- You can use “re.search()” to see if a string matches a regular expression, similar to using the find() method for strings.
- You can use “re.findall()” to extract the portions of a string that match your regular expression, similar to a combination of “find()” and slicing: var[5:10].
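To make the comparison in the points above concrete, here is a minimal sketch that puts the plain string methods next to their regex counterparts. The sample line and the e-mail pattern are made up for illustration and are not taken from the course or the slides.

```python
import re

# Made-up sample text, used only for illustration
line = "From: alice@example.com"

# find() returns an index (or -1), re.search() returns a match object (or None)
print(line.find("From:") >= 0)
print(re.search("^From:", line) is not None)

# find() plus slicing: locate the first space and keep everything after it
start = line.find(" ") + 1
sliced = line[start:]
print(sliced)

# re.findall(): describe what the wanted text looks like instead of where it is
found = re.findall(r"\S+@\S+", line)
print(found)
```

Note that find() with slicing gives back a single string, while re.findall() gives back a list, even when there is only one match.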
Examples:
Matching and extracting numbers:
import re
x = "My 2 favorite numbers are 19 and 42"
# [0-9]+ matches one or more consecutive digits
y = re.findall('[0-9]+', x)
print(y)
# Output: ['2', '19', '42']
A useful property of re.findall() is that it returns a list of all the matching strings, which makes it very handy when the data is lengthy.
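Since the result is an ordinary Python list, it can be counted, indexed, or converted like any other list. The short sketch below reuses the sentence from the example above. The variable names numbers, nothing and ints are only illustrative.

```python
import re

x = "My 2 favorite numbers are 19 and 42"

# Every match comes back as a string inside one list
numbers = re.findall("[0-9]+", x)
print(len(numbers))

# When nothing matches, the result is an empty list, not an error
nothing = re.findall("[0-9]+", "no digits here")
print(nothing)

# The matches are strings, so convert them before doing arithmetic
ints = [int(n) for n in numbers]
print(sum(ints))
```

Because an empty list is returned when nothing matches, it is worth checking the length of the list before indexing into it.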
I found this very useful when I wanted to get the CGPA / SGPA of my batch mates from the institute website. First, I fetched the long string data containing the information using the urllib library built into Python. I will be posting a tutorial for that soon. For now, let's assume that I had the data stored as a string in a Python variable. Click here to view the structure of the data I received. It is about 70000 lines of data, and I am sure that with normal string parsing it would have taken hours of coding to get the required information. Hence, I resorted to regex. The only task was to find one occurrence of the required information, in my case the SGPA. A general expression for the data in the vicinity of the SGPA was then worked out and written as a regex (taking help from the slides I mentioned). Here is the relevant code for the string parsing:
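import re

# data is assumed to already hold the fetched page as one string
# Text of the first <td> cell that follows the "<b>Name" label
name = re.findall("<b>Name.*?<td>(.*?)</td>", data)
print(name[0])
# Every number (a run of digits and dots) that closes a cell after "SGPA"
sgpa_lst = re.findall("SGPA.*?([0-9.]+)</td>", data)
print(sgpa_lst)
# Same pattern for CGPA, printed only if one was found
cgpa_lst = re.findall("CGPA.*?([0-9.]+)</td>", data)
if cgpa_lst:
    print(cgpa_lst[0])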
Look at the expressions above: a single line of regex is enough to fetch each piece of required data.
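To try these expressions without the institute page, here is a small self-contained sketch. The HTML in data is a hypothetical stand-in that I made up. The real page is far longer and its markup may differ. The number pattern is written here as [0-9.]+, meaning any run of digits and dots.

```python
import re

# Hypothetical stand-in for the fetched page; the real markup may differ
data = (
    "<tr><td><b>Name</b></td><td>A. Student</td></tr>"
    "<tr><td>SGPA</td><td>8.21</td></tr>"
    "<tr><td>SGPA</td><td>7.95</td></tr>"
    "<tr><td>CGPA</td><td>8.08</td></tr>"
)

# .*? skips as little text as possible, and the parentheses mark
# the only part of each match that findall() returns
name = re.findall("<b>Name.*?<td>(.*?)</td>", data)
sgpa_lst = re.findall("SGPA.*?([0-9.]+)</td>", data)
cgpa_lst = re.findall("CGPA.*?([0-9.]+)</td>", data)

print(name[0])
print(sgpa_lst)
if cgpa_lst:
    print(cgpa_lst[0])
```

One thing to keep in mind: by default the dot does not match a newline, so if a label and its value sit on different lines of the page, the pattern would need the re.DOTALL flag passed to re.findall().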
Though some might say that XML parsing would have been a better alternative, it would then have been necessary to understand the tree arrangement, which is a tedious job for lengthy data.
I will end my small discussion of regex here by again pointing to the powerful PDF tutorial mentioned at the beginning. I have just tried to show the usefulness of regex, which comes from how short the code is and how closely the expression mirrors the general pattern of the data being parsed. Please comment if you need help or want to suggest edits.