RegEx Help Needed

Last post 07-07-2009 12:10 AM by imran_ku07. 6 replies.

  • RegEx Help Needed

    07-04-2009, 1:57 PM
    • Member
      21 point Member
    • ddelella
    • Member since 08-17-2007, 6:47 PM
    • Cincinnati, OH
    • Posts 61
    <td valign="top">
    <img src="/images/b.gif" width="1" height="6">
    <br>
    <a href="/title/tt0133093/" onclick="(new Image()).src='/rg/find-title-1/title_popular/images/b.gif?link=/title/tt0133093/';">The Matrix</a> (1999) (V)     
    </td>


    I need some help creating a regular expression that finds elements matching the markup above. The string starts with a td with valign="top" and ends with the closing td tag. The img and br tags are always there but are irrelevant. For the anchor tag, I need the value of the href in a group named "url", the innerHTML of the anchor in a group named "title", and the 4 digits in the first set of () after the anchor tag in a group named "year". I have a RegEx that picks up the anchor tag and the value for the year, but it is too vague and also picks up some extra values. If possible, I would like to pick up only anchors with "/title_popular/" or "/title_approx/" in the onclick statement. Below is my current RegEx; any help would be great. For those who recognize the markup above, it is for scraping the HTML of the IMDb search. The result of this RegEx should be 20 items, 4 popular and 16 approximate, when searching http://www.imdb.com/find?q=matrix&s=tt.

    Dim myRegex As New Regex("<a\s+(?:(?:\w+\s*=\s*)(?:\w+|""[^""]*""|'[^']*'))*?\s*href\s*=\s*(?<url>\w+|""[^""]*""|'[^']*')(?:(?:\s+\w+\s*=\s*)(?:\w+|""[^""]*""|'[^']*'))*?>(?<title>.+?)</a>(?<year>.+?)</td>")

  • Re: RegEx Help Needed

    07-05-2009, 12:46 AM
    • All-Star
      61,128 point All-Star
    • TATWORTH
    • Member since 02-04-2003, 8:34 AM
    • England
    • Posts 11,990
    • TrustedFriends-MVPs

    Try using the HTML Agility pack from http://www.codeplex.com/htmlagilitypack

  • Re: RegEx Help Needed

    07-06-2009, 12:18 AM
    • Member
      21 point Member
    • ddelella
    • Member since 08-17-2007, 6:47 PM
    • Cincinnati, OH
    • Posts 61

    There are many reasons not to use the Agility Pack. The code has not been updated in a long time, and it is slow when traversing HTML that has few or no identifiers. If the HTML were structured so that I could easily identify the objects it might make sense, but when there is no easy way to find the cells I need, it is easier with the RegEx.

    The above RegEx runs very fast and its results are about 95% accurate. There are 2 or 3 extra items in the list that should not be there. I added some extra code to the original to check for title_popular and title_approx only, and found a small issue with what is showing in the Matches collection:

    This is what shows when I look at myMatch.Value:

    <a href="/title/tt1074193/" onclick="(new Image()).src='/rg/find-title-16/title_substring/images/b.gif?link=/title/tt1074193/';">Decoded: The Making of &#x27;The Matrix Reloaded&#x27;</a> (2003) (TV)     </td>

    This is what is showing in the view source from IE8:

    <a href="/title/tt0410519/" onclick="(new Image()).src='/rg/find-title-16/title_approx/images/b.gif?link=/title/tt0410519/';">The Matrix Recalibrated</a> (2003) (TV)     </td>

    There are some weird discrepancies: 1) title_approx became title_substring, and 2) characters in the HTML came back escaped, showing as &#x27;. Does anyone have any idea what in the RegEx could be causing these issues, or could it be a problem with the WebClient object?

     

  • Re: RegEx Help Needed

    07-06-2009, 12:53 AM
    • All-Star
      17,207 point All-Star
    • imran_ku07
    • Member since 06-04-2008, 9:21 AM
    • KARACHI, PAKISTAN
    • Posts 3,144

    Try this:

    // Requires: using System.Text.RegularExpressions;
    // Sample input: three anchors, of which only the first and third should be kept.
    string sr = "<a href=\"/title/tt0133093/1\" onclick=\"(new Image()).src='/rg/find-title-1/title_popular/images/b.gif?link=/title/tt0133093/';\">The Matrix1</a> (1991) (V) ";
    sr += "<a href=\"/title/tt0133093/2\" onclick=\"(new Image()).src='/rg/find-title-1/title_abc/images/b.gif?link=/title/tt0133093/';\">The Matrix2</a> (1992) (V) ";
    sr += "<a href=\"/title/tt0133093/3\" onclick=\"(new Image()).src='/rg/find-title-1/title_approx/images/b.gif?link=/title/tt0133093/';\">The Matrix3</a> (1993) (V) ";

    // Captures the href (url), the remaining attributes (All), the anchor text (title)
    // and the first run of digits after an opening parenthesis (year).
    string Pattern = "<a\\s+href=['\"](?<url>[^'\"]*)['\"](?<All>[^>]*)>(?<title>[^<]*)<\\s*/\\s*a\\s*>[^\\(]*\\((?<year>\\d+)";
    MatchCollection m = Regex.Matches(sr, Pattern, RegexOptions.IgnoreCase);
    foreach (Match mm in m)
    {
        // Keep only anchors whose remaining attributes mention one of the two wanted lists.
        string temp = mm.Groups["All"].Value.ToLower();
        if (temp.Contains("title_popular") || temp.Contains("title_approx"))
        {
            Response.Write(mm.Groups["url"].Value);
            Response.Write(mm.Groups["title"].Value);
            Response.Write(mm.Groups["year"].Value);
        }
    }
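
    As a variation on the approach above, the check for title_popular / title_approx could probably be folded into the pattern itself, with the match anchored on the enclosing td as the original question asked. This is only a sketch: the names ImdbRowParser and ParseRows are made up for illustration, and the pattern assumes the markup looks exactly like the sample cell in the first post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImdbRowParser
{
    // Assumption: every result cell has the shape shown in the first post:
    // <td valign="top"> img, br, <a href="..." onclick="...">Title</a> (1999) ... </td>
    private static readonly Regex RowPattern = new Regex(
        @"<td\s+valign\s*=\s*[""']top[""']\s*>" +        // opening cell
        @"(?:(?!</td>).)*?" +                            // skip img/br without leaving this cell
        @"<a\s+href\s*=\s*[""'](?<url>[^""']*)[""']" +   // href value -> url
        @"[^>]*?/title_(?:popular|approx)/[^>]*>" +      // remaining attributes must name a wanted list
        @"(?<title>[^<]*)</a>" +                         // anchor text -> title
        @"[^(<]*\((?<year>\d{4})\)" +                    // first parenthesised 4 digits -> year
        @"(?:(?!</td>).)*?</td>",                        // rest of the cell up to its closing tag
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (url, title, year) entry per matching cell.
    public static List<(string Url, string Title, int Year)> ParseRows(string html)
    {
        var rows = new List<(string Url, string Title, int Year)>();
        foreach (Match m in RowPattern.Matches(html))
        {
            rows.Add((m.Groups["url"].Value,
                      m.Groups["title"].Value.Trim(),
                      int.Parse(m.Groups["year"].Value)));
        }
        return rows;
    }

    public static void Main()
    {
        // Sample cell copied from the first post.
        string html =
            "<td valign=\"top\">\n" +
            "<img src=\"/images/b.gif\" width=\"1\" height=\"6\">\n" +
            "<br>\n" +
            "<a href=\"/title/tt0133093/\" onclick=\"(new Image()).src='/rg/find-title-1/title_popular/images/b.gif?link=/title/tt0133093/';\">The Matrix</a> (1999) (V)\n" +
            "</td>";

        foreach (var row in ParseRows(html))
        {
            Console.WriteLine(row.Url + " | " + row.Title + " | " + row.Year);
        }
    }
}
```

    The (?:(?!</td>).)*? pieces keep the match from running into a neighbouring cell, so a cell whose anchor points at some other list (title_substring, for example) is skipped rather than merged with the next one. If IMDb changes its markup again, the pattern would need to be adjusted.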



  • Re: RegEx Help Needed

    07-06-2009, 8:23 AM
    • Member
      21 point Member
    • ddelella
    • Member since 08-17-2007, 6:47 PM
    • Cincinnati, OH
    • Posts 61

    The string pattern produces an ArgumentException when trying to parse.  The exact message is:

    parsing "<a\\s+href=['"](?<url>[^'"]*)['"](?<All>[^>]*)>(?<title>[^<]*)<[^\\(]*\\((?\\d" mce_href="file://\\s*/\\s*a\\s*>[^\\(]*\\((?\\d">\\s*/\\s*a\\s*>[^\\(]*\\((?<year>\\d+)" - Not enough )'s.
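
    One possible explanation, which nobody in this thread confirms: the doubled backslashes in the quoted message suggest the C# string literal was pasted into VB unchanged. In a regular C# literal "\\(" reaches the regex engine as \( (a literal parenthesis), but VB strings have no backslash escapes, so the engine would receive \\( (a literal backslash followed by an unclosed group). The sketch below imitates that in C# by using a verbatim string to stand in for the VB literal; EscapeDemo and TryCompile are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Returns null if the pattern compiles, otherwise the parser's error message.
    public static string TryCompile(string pattern)
    {
        try
        {
            new Regex(pattern);
            return null;
        }
        catch (ArgumentException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        // Regular C# literal: the compiler turns each \\ into a single backslash,
        // so the regex engine receives \((?<year>\d+)
        string asWrittenInCSharp = "\\((?<year>\\d+)";

        // Verbatim literal standing in for a VB string, where backslashes are
        // ordinary characters: the engine receives \\((?<year>\\d+)
        string asSeenFromVb = @"\\((?<year>\\d+)";

        Console.WriteLine(TryCompile(asWrittenInCSharp) ?? "compiled");
        Console.WriteLine(TryCompile(asSeenFromVb) ?? "compiled");
    }
}
```

    If that is what happened, halving the backslashes in the VB source (\s, \(, \d) and doubling any embedded double quotes should let the pattern parse.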

  • Re: RegEx Help Needed

    07-06-2009, 8:32 AM
    • Member
      21 point Member
    • ddelella
    • Member since 08-17-2007, 6:47 PM
    • Cincinnati, OH
    • Posts 61
            Dim myClient As New WebClient
            Dim myHTML As String = myClient.DownloadString("http://www.imdb.com/find?q=matrix&s=tt")
            Response.Write(myHTML)


    Okay, the problem looks to be at least partly on IMDb's side. Running the above code in the page load gets me the same output my RegEx is producing; however, going to imdb.com and searching on matrix now yields a totally different page than it did before. Instead of 16 approximate matches and title_approx in the HTML, it has title_substring and shows 24 approximate matches. The latter seems to be correct, which means the above is working. I appreciate the suggestion on query strings, but now I just need to find out why the characters are being pulled back escaped from the WebClient object. Thanks.
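
    On the escaped characters: DownloadString returns the markup as the server sent it and does not HTML-encode anything itself, so the &#x27; sequences are most likely already in the page source. They are HTML character references for an apostrophe and can be decoded after the title has been captured. A minimal sketch, assuming .NET 4 or later where System.Net.WebUtility is available (older ASP.NET code would use System.Web.HttpUtility.HtmlDecode instead); EntityDemo and DecodeTitle are hypothetical names.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Decodes HTML character references such as &#x27; or &amp; in a captured title.
    public static string DecodeTitle(string rawTitle)
    {
        return WebUtility.HtmlDecode(rawTitle);
    }

    public static void Main()
    {
        // Title text as it appeared in myMatch.Value earlier in the thread.
        string raw = "Decoded: The Making of &#x27;The Matrix Reloaded&#x27;";
        Console.WriteLine(DecodeTitle(raw));
    }
}
```

    Decoding only the captured "title" group, rather than the whole page before matching, keeps the RegEx working against the raw markup.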

  • Re: RegEx Help Needed

    07-07-2009, 12:10 AM
    Answer
    • All-Star
      17,207 point All-Star
    • imran_ku07
    • Member since 06-04-2008, 9:21 AM
    • KARACHI, PAKISTAN
    • Posts 3,144

    This pattern is different from the one I gave you.

    ddelella:

    It is always better to use the WebRequest class.

     
