Spiders

title says it all

Spiders

Postby dgemily on Fri Oct 14, 2005 7:00 pm

I have been playing with spiders :lol: I will try to post my spider developments here. Maybe it would be nice to have a place to centralize everybody's spider developments...
maybe a sticky thread, or I can, if you want, build a specific topic (with www, with the aim of duplicating the contents on the xlobby web site) and then add everybody's spiders to it... what do you think? do you have another idea?
-----------------------------------------
Update: 25 January 2006

Spiders Mobygames

22 spiders, one per platform:
- gameboy
- gba
- gb color
- game cube
- game gear
- genesis
- intellivision
- neo geo
- neo geo cd
- nes
- n gage
- nintendo 64
- nintendo ds
- playstation
- playstation 2
- psp
- sega master system
- snes
- xbox
- saturn

fields built: Released, publisher, developer, rate, platform, genre, plot, and the coverart.

if you need one spider for another platform, let me know ;)

- clips - album - yahoo.fr.zip
- clips - artist - yahoo.com.zip
- clips - artist - yahoo.fr.zip
- clips - title - fnac.com.zip
- clips - title - yahoo.com.zip

- dvd - allocine.fr.zip
- dvd - amazon.fr.zip
- dvd - buy.com 2.zip
- dvd - buy.com.zip
- dvd - moviecovers.fr.com.zip
- dvd - amazon.co.uk.zip

- mangas - animeka.zip

- music - amazon.fr.zip
- music - buy.com 2.zip
- music - buy.com.zip

- serietv - allocine.fr.zip
- serietv - serie-theque.fr.zip
- serietv - serieslive.fr.zip
Last edited by dgemily on Wed Jan 25, 2006 6:19 pm, edited 1 time in total.
dgemily
 
Posts: 793
Joined: Thu May 13, 2004 6:24 am
Location: Paris, France

Postby dgemily on Mon Nov 14, 2005 7:53 pm

Update :roll:
dgemily
 
Posts: 793
Joined: Thu May 13, 2004 6:24 am
Location: Paris, France

Postby Naylia on Mon Nov 14, 2005 8:50 pm

sweet...i think spiders are next for me now that i just got my playlist rename working...onward and upward we go!
Naylia
 
Posts: 530
Joined: Tue Oct 19, 2004 7:50 pm
Location: Boston, MA

Postby dgemily on Mon Nov 14, 2005 10:48 pm

Naylia wrote:onward and upward we go!


Image
dgemily
 
Posts: 793
Joined: Thu May 13, 2004 6:24 am
Location: Paris, France

Postby tswhite70 on Fri Dec 30, 2005 5:17 am

I've been playing around with spiders recently, and while I'm still not sure I totally understand regex, I think I'm getting there. So below is a spider for DVDTown.com (personally I prefer their plots over IMDB's - they also have good-quality coverart).

***EDIT*** DVDTown changed their site structure in Nov. 2006. The new site is no longer capable of being spidered by Xlobby.

I've also created a TV episode spider using epgguide.com and tv.com, the code can be found here: http://www.xlobby.com/forum/viewtopic.php?p=24631#24631

Hope somebody finds these useful!
tsw
Last edited by tswhite70 on Sun Apr 22, 2007 6:56 pm, edited 2 times in total.
tswhite70
 
Posts: 318
Joined: Tue Jan 06, 2004 3:44 pm
Location: Houston, Tx

Learning How to Build Spiders

Postby tswhite70 on Wed Jan 18, 2006 12:00 am

I've been working on spiders for the last month or so and thought I would write up a little blurb regarding what I've learned.

The spiders use Regular Expressions (regex) to parse through a text file and return a result. There is lots of info on the web about regex - just google it - but I did find a tutorial site that is pretty well laid out for learning the basics: http://www.regular-expressions.info/. This should help explain some of the cryptic symbols being used in the spiders.

Xlobby uses the .NET flavor of regex so you can also look at MS for flavor specific info http://msdn.microsoft.com/library/en-us/cpgenref/html/cpconRegularExpressionsLanguageElements.asp.

I also found a great little tool for testing your regex, which makes life a whole lot easier than blindly trying to figure out why your expression isn't producing a result in Xlobby. Expresso is free: http://www.codeproject.com/dotnet/Expresso.asp (you do have to register with the site to download it). If you use Expresso to test your regex, make sure to set the following options in the upper right of the window: Ignore Case = On, Singleline = On; uncheck all the other options.

Your best bet for writing your own spiders is to look at the ones already in existence and step through them, comparing them to the URL source to see how they work. Spiders are located in the Xlobby\Spiders directory. Because Xlobby doesn't provide a way to view what's going on inside a spider, they can be maddening to troubleshoot, but with a little work it's not that bad. Expresso can be a huge help here, as you will already know that your regex is correct. Be aware that because of the nature of the regex engine you can easily put it into an exponential recursive search of your document, where it makes millions of comparisons trying to match your expression - this can hang Xlobby, requiring you to kill the Xlobby process and restart.

I'll walk through the spider posted above for DVDTown.com as an example of how to build your own. DVDTown contains DVD information and cover art for both new and old releases (over 17,000 at this time). They also write pretty good reviews of DVDs, which include sound and video quality ratings.

***EDIT*** DVDTown changed their site structure in Nov. 2006. The new site is no longer capable of being spidered by Xlobby. So the example below will no longer result in a workable spider, but hopefully the methodology of creating the spider can still be useful.

Going to the main link http://www.dvdtown.com, you'll notice a search box at the top of the page, which is exactly what we need to get started. Entering "Troy" in the search box yields a Search Results page with the first 10 of 32 results for the search. The URL of the Search Results page is what we are initially interested in:

http://www.dvdtown.com/search/index.php ... y&x=17&y=8

It looks pretty ugly, but with a little trial and error we find that we only need the first section up to the ?, plus the &query=moviename section, to actually search the page. So our search URL looks like this:

http://www.dvdtown.com/search/index.php?&query=troy

Open Notepad and save a new spider as "dvd - dvdtown.comTEST.txt" into the Xlobby\Spiders directory. On the first line of this new spider we are going to input the search URL discovered above with the search variable to be passed from Xlobby:

url=http://www.dvdtown.com/search/index.php?&query=%searchstring%

This line tells Xlobby to substitute the Display variable of the item you have selected in the database into the URL in place of %searchstring%. Xlobby will download the source code from the resulting URL and use it for the regex search on the next line of the spider.
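As a rough sketch of what that substitution step does (outside Xlobby - whether Xlobby URL-encodes the search string is not documented, so the quote() call here is my assumption):

```python
import urllib.parse

def build_search_url(template: str, searchstring: str) -> str:
    # Substitute the %searchstring% placeholder into the URL template,
    # URL-encoding the query text (the encoding is an assumption).
    return template.replace("%searchstring%", urllib.parse.quote(searchstring))

url = build_search_url(
    "http://www.dvdtown.com/search/index.php?&query=%searchstring%", "troy")
```

For a plain search term like "troy" this yields the same URL we built by hand above.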

Now we need to get the source from the search URL and design our regex for the "results=" line of the spider. Run a search of DVDTown for "Troy" and choose "View\Source" in IE to see the source code for the page. Holding the mouse pointer over the details links in the search results web page shows us that we are looking for a link with the word "discdetails" in it. Use Find in Notepad to search for "discdetails" in the page source. Double-check that you found the proper line for the link to the disc details; looking at the source, you'll notice that each unique result listing has the same format for the line, which is what we want.

In this particular case, the first line we find with discdetails is:

<a href="http://dvdtown.mondosearch.com/BTCLog.dll?query=troy&LOGTYPE=C&USERID=DVDTown&URL=http://www.dvdtown.com/discdetails/troywidescreen/12965/"><img src="/media/coverart/small/12965.jpg" alt="Troy (Widescreen)" border="0" width="70" height="100" style="padding-right: 5px;"></a>

Now we need to parse the line. There are two variables we want to find for each result: the URL to the disc-details page, and the title of the link for display in the spider selection results box. We'll break the line up into four sections to make it easier to understand.

Section1: <a href="http://dvdtown.mondosearch.com/BTCLog.dll?query=troy&LOGTYPE=C&USERID=DVDTown&URL=
We don't actually need the first part of this line for anything other than to help make sure we are matching the correct line in the source. We do need to make any result-specific parts of the line generic in the regex. For instance, in "query=troy&", "troy" is result-specific, so our regex needs to be "query=.*?&". The ".*?" tells the regex engine to match any number of characters between the previous match ("=" in this case) and the next explicit match ("&"). This will now match "query=troy&" or "query=Xlobby Rules <img source=xlobby.jpg>&". Notice that we do need to be careful with the use of ".*?" to make sure we get the right matches. So now we have as our section 1 regex:
<a href="http://dvdtown.mondosearch.com/BTCLog.dll?query=.*?&LOGTYPE=C&USERID=DVDTown&URL=

Next, let's back up and look at "BTCLog.dll?query" - notice the question mark; we just used one in the last expression as a regex symbol. The question mark has special meaning in regex, and we don't want it in our expression unless intended - it'll muck things up. There are several ways to handle it; we'll look at two: escape it with "\?", or generalize it with a ".". Escaping the ? as "\?" tells the regex to look for a literal "?" while avoiding its special meaning. Generalizing it with a "." tells the regex to match any character in that position. So a regex of "BTCLog.dll\?query" will only match "BTCLog.dll?query" and nothing else, while a regex of "BTCLog.dll.query" will match "BTCLog.dll?query" or "BTCLog.dllXquery" or "BTCLog.dll!query", etc. You'll notice that there is already a "." in this section of code. That "." is interpreted by the regex engine as a special character matching any character in that position - so the regex "BTCLog.dll" matches "BTCLog.dll" or "BTCLogXdll". This is not typically a problem since we are matching the whole URL, but we want to be aware of it. To match the literal "." we can escape it as "\."

Our regex now looks like:
<a href="http://dvdtown.mondosearch.com/BTCLog.dll\?query=.*?&LOGTYPE=C&USERID=DVDTown&URL=
We can clean this up a lot more, but this will work so let's move on...
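If you want to convince yourself of the lazy-match and escaping behavior without firing up Expresso, here's a small check using Python's re module (same behavior as the .NET engine for these features):

```python
import re

text = "BTCLog.dll?query=troy&LOGTYPE=C&"

# Lazy ".*?" stops at the first "&"; greedy ".*" runs to the last one.
lazy = re.search(r"query=.*?&", text).group()
greedy = re.search(r"query=.*&", text).group()

# Escaping the "?" matches a literal question mark instead of making
# the previous token optional.
literal = re.search(r"BTCLog\.dll\?query", text).group()
```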

Section 2:
http://www.dvdtown.com/discdetails/troy ... een/12965/">
This is actually the URL that we want Xlobby to fetch. Again we'll apply ".*?" to generalize the result-specific sections of the source. Looking at the text, we can see that "http://www.dvdtown.com/discdetails/" is static regardless of the result, while "troywidescreen" is specific and needs to be generalized along with "12965". Applying ".*?" to those sections, we get:
http://www.dvdtown.com/discdetails/.*?/.*?/">

Because the "> at the end of the string is unique in ending the url we can easily shorten this to:
http://www.dvdtown.com/discdetails/.*?
and get the same result.
Now we need to specify the regex named group that will collect the URL. .NET allows named capture groups of the form (?<groupname>); everything inside the parentheses can be referenced by the group name. In this case the group we want is (?<url>). We want the entire URL, starting from the "h" of http to the last "/" of the URL but excluding the ":
(?<url>http://www.dvdtown.com/discdetails/.*?)">
Looking back at our original text the regex will find the following value in the regex group <url>: http://www.dvdtown.com/discdetails/troy ... een/12965/

So far our entire regex looks like:
<a href="http://dvdtown.mondosearch.com/BTCLog.dll\?query=.*?&LOGTYPE=C&USERID=DVDTown&URL=(?<url>http://www.dvdtown.com/discdetails/.*?)">
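Here's the regex so far, checked against the sample line in Python's re module (Python spells .NET's (?<name>...) as (?P<name>...); I've also escaped the literal dots in the domain names, which the post leaves loose):

```python
import re

line = ('<a href="http://dvdtown.mondosearch.com/BTCLog.dll?query=troy'
        '&LOGTYPE=C&USERID=DVDTown'
        '&URL=http://www.dvdtown.com/discdetails/troywidescreen/12965/">')

pattern = (r'<a href="http://dvdtown\.mondosearch\.com/BTCLog\.dll\?query=.*?'
           r'&LOGTYPE=C&USERID=DVDTown&URL='
           r'(?P<url>http://www\.dvdtown\.com/discdetails/.*?)">')

# The lazy group expands just far enough to reach the closing ">,
# capturing the full disc-details URL.
detail_url = re.search(pattern, line).group("url")
```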

Section3, display name:
<img src="/media/coverart/small/12965.jpg" alt="Troy (Widescreen)"
Again we'll use ".*?" to generalize the result-specific sections, and for this one we need the regex named group (?<display>) to show the result in the spider window and allow selection. "12965" is specific, as are "Troy" and "Widescreen":
<img src="/media/coverart/small/.*?.jpg" alt=".*? (.*?)"

We don't really care about the text in the img src area, though, and in this case we don't need it, so let's get rid of it:
<img src=".*?" alt=".*? (.*?)"

Let's add our named group to collect everything between the alt=" and the closing ":
<img src=".*?" alt="(?<display>.*? (.*?))"
We have a problem, though: the () characters that were around "Widescreen" are special characters in regex, so we need to escape them with \:
<img src=".*?" alt="(?<display>.*? \(.*?\))"
This will work for any movies that have (Widescreen) or (Fullscreen) but not for those that don't have the parenthesis, so we really want to get rid of them and just match everything between the quotes:
<img src=".*?" alt="(?<display>.*?)"

If we wanted to exclude the parentheses and everything inside them from our result, we could make them a non-capturing optional group. So we enclose that section in (), escape the enclosed () with \, generalize the text with .*?, make the match optional with ? after the closing parenthesis, and tell the regex not to remember the group with "?:" just inside the first parenthesis. If you followed all that, here is our resulting regex:
<img src=".*?" alt="(?<display>.*?)(?:\(.*?\))?"
This will return "Troy" whether the source says "Troy (Widescreen)" or just "Troy" (note the lazy match leaves a trailing space before the parentheses that you may want to trim). In this instance, though, we really want the information in the parentheses, if it's available, to help differentiate our results, so we'll leave the regex as:
<img src=".*?" alt="(?<display>.*?)"
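The optional non-capturing group variant can be checked in Python's re module like this (named groups become (?P<name>); the \s* before the optional group is my addition, since the bare pattern from the post captures "Troy " with a trailing space):

```python
import re

# Optional non-capturing group strips "(Widescreen)" etc. from the title;
# \s* soaks up the space between the title and the parentheses.
pattern = r'alt="(?P<display>.*?)\s*(?:\(.*?\))?"'

with_paren = re.search(pattern, 'alt="Troy (Widescreen)"').group("display")
plain = re.search(pattern, 'alt="Troy"').group("display")
```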

So far our entire regex looks like:
<a href="http://dvdtown.mondosearch.com/BTCLog.dll\?query=.*?&LOGTYPE=C&USERID=DVDTown&URL=(?<url>http://www.dvdtown.com/discdetails/.*?)"><img src=".*?" alt="(?<display>.*?)"

Section4, finish off the regex:
border="0" width="70" height="100" style="padding-right: 5px;"></a>
We don't really care about this stuff, so we'll generalize the whole thing except for the trailing </a>, which we know ends the line:
.*?</a>

Leaving us with our entire results regex of:
<a href="http://dvdtown.mondosearch.com/BTCLog.dll\?query=.*?&LOGTYPE=C&USERID=DVDTown&URL=(?<url>http://www.dvdtown.com/discdetails/.*?)"><img src=".*?" alt="(?<display>.*?)".*?</a>

We'll add "results=" to the front of this, and it becomes the second line of our spider. If you've downloaded Expresso for testing, just paste the regex above into the "Regular Expression" box, paste the web page source from the search results page into "Sample Input Data", and click "Find Matches" (make sure the Ignore Pattern WS box is not checked in the options on the right). You'll see the matches in the results pane; click a match to see the group names returned for it.

You should now be able to save your spider and run it in Xlobby. NOTE: If you are copying the example regex statements into Notepad make sure to delete any trailing spaces at the end of the line!
It should show the results in the results pane. Of course, clicking on a result won't show anything yet, so let's add to our spider so we can see some info. To do this we need to take one of the URLs from our result match (?<url>) and get the source for it. The first piece of info we'll try to get is the movie title (?<title>). Looking at the details page of the first match for Troy, we see the title of the movie next to "DVD Details:". Open the HTML source for this page in Notepad and search for "DVD Details:". The first match is in the HTML title portion of the source; the second match is down in the content, so we'll use that one - the entire line is:
<tr><td colspan="15"><h1>DVD Details: Troy <img src="/media/logos/video/widescreen.gif" alt="Widescreen version which preserves the original theatrical aspect ratio approx." border="0" /></h1></td></tr>

This is the only place where "<h1>DVD Details:" shows up in the source, so we'll use that as the start of our regex; the end of the title is denoted by the start of "<img src", so we'll use that as the end of our regex. "Troy" is result-specific, so we want to make it generic with ".*?":
<h1>DVD Details: .*? <img src
We want the result in the group name "title":
<h1>DVD Details: (?<title>.*?) <img src

Notice I left the whitespace between ":" and "(" in place - it is actually part of the pattern match, and it keeps our title from having a leading space. Doing a little more checking of the DVDTown website, we find that not all titles are followed by "<img src"; some do not have a graphic and are just followed by "</h1>". We need to account for both possibilities, and the easiest way to do this is to notice that in both cases the character immediately following the title is "<". So we'll change the regex to the following to pick up both cases; add these lines to your spider:

//Title - the double slash makes this a comment
<h1>DVD Details: (?<title>.*?) <

Save your spider and try it again in Xlobby; when clicking on a result you should now see "title: Troy" in the info box. We can actually name the info anything we want - if our regex were (?<BigBoy>.*?) it would show up in the info as "BigBoy: Troy". One more comment about whitespace: since we are using it as part of our pattern matching, we need to be aware of any unintended whitespace in the regex. It's very easy to leave a space in the middle of an expression, and especially at the end - this will throw the whole thing off if we aren't careful.
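The title step can be checked outside Xlobby the same way as before; here's a Python sketch against an abbreviated copy of the details-page line (named group written as (?P<title>)):

```python
import re

line = ('<tr><td colspan="15"><h1>DVD Details: Troy '
        '<img src="/media/logos/video/widescreen.gif" /></h1></td></tr>')

# The space after the colon sits outside the group, so the captured
# title has no leading space; the lazy group stops at " <".
title = re.search(r"<h1>DVD Details: (?P<title>.*?) <", line).group("title")
```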

Let's skip ahead and write the regex for the actors so we can see how the (?<variable>) designation works. Looking at our details web page, we see that the actors are listed after "Starring:", so let's search our HTML source for that - what we end up with is a whole bunch of HTML with the actors' names interspersed in between:
Starring:</h2></td>
<td width="40%" valign="top">
<a class="cast" href="/search/index.php?type=cast&query=Brad%20Pitt">Brad Pitt</a><br /><a class="cast" href="/search/index.php?type=cast&query=Eric%20Bana">Eric Bana</a><br /><a class="cast" href="/search/index.php?type=cast&query=Orlando%20Bloom">Orlando Bloom</a><br /><a class="cast" href="/search/index.php?type=cast&query=Peter%20O'Toole">Peter O'Toole</a><br /><a class="cast" href="/search/index.php?type=cast&query=Sean%20Bean">Sean Bean</a><br /><br />
</td>
<td width="40%" valign="top">
<a class="cast" href="/search/index.php?type=cast&query=Brendan%20Gleeson">Brendan Gleeson</a><br /><a class="cast" href="/search/index.php?type=cast&query=Brian%20Cox">Brian Cox</a><br /><a class="cast" href="/search/index.php?type=cast&query=Eric%20Bana">Eric Bana</a><br /><a class="cast" href="/search/index.php?type=cast&query=Diane%20Kruger">Diane Kruger</a><br /><a class="cast" href="/search/index.php?type=cast&query="></a><br />
</td>
</tr>


Xlobby allows you to save a section of the HTML source as a variable; the next regex command then works recursively on that variable to find all matches in it and concatenate them with "," in between. So we want this whole section of text caught in the regex for the variable - we don't really care what's in it right now, so to get the whole thing we specify:
Starring:</h2>.*?</tr>
You'll notice, if you're using Expresso, that you need the Singleline option checked, which allows the .*? to match end-of-line and newline characters.
Let's add the variable group name to the match:
Starring:</h2>(?<variable>.*?)</tr>

Now Xlobby will use the source contained in (?<variable>) to perform recursive matches using the next regex, which we want to match the actor's name. Looking at our source, each actor's name is formatted as follows:
<a class="cast" href="/search/index.php?type=cast&query=Brad%20Pitt">Brad Pitt</a><br />
We are only interested in the ">Brad Pitt<" part, so we'll generalize most of the query and use the group name "actors" to match the name:
<a class="cast".*?">(?<actors>.*?)</a><br />
If you run this in Expresso against the entire HTML source, you'll notice you end up with actors, blank results, and directors. But since Xlobby only applies the actors regex to the source captured in (?<variable>), we'll get just the actors plus a blank trailing match that leaves a "," at the end of our result set. This last blank match occurs because DVDTown puts the actors in a table, so blank entries in the table appear in the source as:
<a class="cast" href="/search/index.php?type=cast&query="></a><br />
You can see this at the end of the variable section of code. The ".*?" in the actors regex group matches everything between the ">" and "<", even nothing. We'll get rid of that by using the \b regex token, which defines a word boundary:
<a class="cast".*?">(?<actors>\b.*?)</a><br />
Now we are only matching if there is a word boundary directly following the ">".

So put the following lines into your spider and you should see the actors show up as "actors: name1,name2,name3,..."
//Actors - using variable, matches will be made with next regex
Starring:</h2>(?<variable>.*?)</tr>
<a class="cast".*?">(?<actors>\b.*?)</a><br />
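As a sanity check outside Xlobby, here's the two-stage variable match sketched in Python's re module (named groups become (?P<name>); re.findall stands in for Xlobby's recursive matching, re.S plays the role of the Singleline option, and the HTML snippet is abbreviated from the source above):

```python
import re

source = (
    'Starring:</h2></td>\n'
    '<td><a class="cast" href="/search/index.php?type=cast&query=Brad%20Pitt">'
    'Brad Pitt</a><br />'
    '<a class="cast" href="/search/index.php?type=cast&query=Eric%20Bana">'
    'Eric Bana</a><br />'
    '<a class="cast" href="/search/index.php?type=cast&query="></a><br /></td>\n'
    '</tr>')

# Stage 1: capture everything between Starring:</h2> and </tr>.
variable = re.search(r"Starring:</h2>(?P<variable>.*?)</tr>",
                     source, re.S).group("variable")

# Stage 2: find every actor inside the captured block; \b rejects the
# empty entry from the blank table cell.
actors = re.findall(r'<a class="cast".*?">(?P<actors>\b.*?)</a><br />', variable)
result = ",".join(actors)
```

Note how the blank `query=">` entry is dropped by the word boundary, so the joined result has no trailing comma.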

Next we'll do the coverart and show the string-replace ability. Looking at our movie detail web page, we see on the right a small coverart image with the text "Cover art - click to zoom" below it. Search the HTML source for that and we find the following line:
<a class=MotdLink href=/coverart/troywidescreen/12965/><img style="border-color : #000000;" src="/media/coverart/medium/12965.jpg" width="150" border="1" alt="troywidescreen" /></a><br /><h4>Cover art - Click to zoom</h4><br />

The small coverart image shown on the details page is found in the tag src="/media/coverart/medium/12965.jpg". The URL for the large coverart web page is the initial href, "href=/coverart/troywidescreen/12965/>". Go ahead and click the coverart link on the details page. Now we see the large coverart, and we want to check the URL of the image. You can right-click the image and choose Properties to see the URL:
http://www.dvdtown.com/media/coverart/big/12965.jpg
This is ultimately the URL that we want to pass to Xlobby in the (?<coverart>) group name so that it can download the large image for us as our coverart.

While it's not readily apparent, it turns out that in this particular case the URL for the small image on the details page is identical to the url of the large image on the coverart page except for the word "medium".

details pg: /media/coverart/medium/12965.jpg
cover pg: /media/coverart/big/12965.jpg

This gives us an opportunity to use the replace functionality that Xlobby has for spiders. Note that Xlobby's replace appears to be a string-replace method, not the regex.replace method (at least I haven't been able to get it to work using expressions - it would be great if we had access to regex replace in addition to, or in place of, the string replace!).

Back to getting the coverart: we now know that we can use the URL of the medium image as our base reference, so we want to get that URL into the group name (?<coverart>). We start with our reference string, generalize all the result-specific sections, and add the group name:
src="(?<coverart>/media/coverart/medium/.*?.jpg)"

This is the only place on the web page where we have a link to a medium coverart image so that's all we need for our regex. Ok, so we have the url correct except for the word "medium". We always want to replace the word "medium" with the word "big" so we specify a replace operation in our spider in the form "replace=groupname:phrase:replacementphrase":
replace=coverart:medium:big

Whenever Xlobby tries to access the coverart URL, it will perform the replacement and then download the resulting URL. Two more things to consider about replace: first, if the word "medium" appears anywhere else in the URL it will also get replaced, so we need to be careful. Second, replace only seems to work on coverart in Xlobby - I haven't been able to get it to work on normal info group names.

So here are the lines to add to your spider for getting coverart:
//Coverart - using replace to modify url
src="(?<coverart>/media/coverart/medium/.*?.jpg)"
replace=coverart:medium:big
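The capture-then-string-replace behavior can be sketched in Python like this (named group written as (?P<coverart>); I've escaped the "." before "jpg", which the spider line leaves loose, and str.replace mirrors what Xlobby's replace= appears to do):

```python
import re

detail_source = ('<img style="border-color : #000000;" '
                 'src="/media/coverart/medium/12965.jpg" width="150" />')

# Capture the medium-cover path from the details page...
coverart = re.search(r'src="(?P<coverart>/media/coverart/medium/.*?\.jpg)"',
                     detail_source).group("coverart")

# ...then swap "medium" for "big", as replace=coverart:medium:big does.
big = coverart.replace("medium", "big")
```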

The last thing to show is the use of the (?<url>) group name. Xlobby allows you to regex a new URL by using the url group name; the regex immediately following the url regex works on the source from that URL. Confusing? Let's see it in action with a movie review. Looking at our details web page, we see a link to the movie review titled "DVD Review" in the upper selection bar on the page. Searching our source for "DVD Review" yields:
<h3><a class="motdlink" href="/review/troywidescreen/12965/2573">DVD Review</a></h3>

So we want the review url put into (?<url>) with the result specific info generalized:
<h3><a class="motdlink" href="(?<url>/review/.*?/.*?/.*?)">DVD Review</a></h3>
Let's go ahead and get rid of the redundant "/.*?" sections we have:
<h3><a class="motdlink" href="(?<url>/review/.*?)">DVD Review</a></h3>

Now that we have the url for the review, let's get the source for that web page and see where the review starts:
<h2><small style="font-weight: normal; text-align: left;"><b>By <a class="" href="/mytown/users/profile.php?id=12">John J. Puccio</a>
(December 21, 2004)<br /><br /></b></small></h2>

Let's go ahead and clean this up:
<h2><small style="font-weight: normal; text-align: left;"><b>.*?</b></small></h2>

And where the review ends:
<!-- Ratings -->
Putting the two together and capturing everything in between into the regex group name "review":
<h2><small style="font-weight: normal; text-align: left;"><b>.*?</b></small></h2>(?<review>.*?)<!-- Ratings -->

So add the following to your spider:
//Review - use url tag for next regex
<h3><a class="motdlink" href="(?<url>/review/.*?)">DVD Review</a></h3>
<h2><small style="font-weight: normal; text-align: left;"><b>.*?</b></small></h2>(?<review>.*?)<!-- Ratings -->


Sadly, you'll notice this doesn't work. So let's troubleshoot it - we have two regexes involved in this data and we don't know which one is broken. Since Xlobby treats the group name (?<url>) as a special instance, we can simply rename it so that the info collected by the regex is shown. Changing (?<url>/review/.*?) to (?<urltest>/review/.*?) shows us that the returned URL has some session information attached to it that we weren't seeing when accessing the site via IE. Instead of getting "review/troywidescreen/12965/2573" back as our URL, we're getting something like "review/troywidescreen/12965/2573?PHPSESSID=9bb6290401f99199c7326f7e2bec377b". So we'll change our regex to quit matching at the question mark, which we must escape as "\?":
<h3><a class="motdlink" href="(?<url>/review/.*?)\?.*?">DVD Review</a></h3>
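The session-id fix can be verified in Python against a line with the PHPSESSID attached (named group written as (?P<url>)):

```python
import re

href = ('<h3><a class="motdlink" href="/review/troywidescreen/12965/2573'
        '?PHPSESSID=9bb6290401f99199c7326f7e2bec377b">DVD Review</a></h3>')

# The lazy group stops at the escaped "?", so the session id is dropped
# and only the clean review path is captured.
url = re.search(r'href="(?P<url>/review/.*?)\?.*?">DVD Review',
                href).group("url")
```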

This same technique can be used with the variable group name to see what's going into your variable before the next regex searches it. The text returned in the review group name is going to be full of HTML tags that Xlobby probably won't like, but you get the idea of using the (?<url>) tag.

Hope this helps those that are interested in spiders...

Sorry for the long post! :)

good luck,
tsw
Last edited by tswhite70 on Sun Apr 22, 2007 6:58 pm, edited 13 times in total.
tswhite70
 
Posts: 318
Joined: Tue Jan 06, 2004 3:44 pm
Location: Houston, Tx

Coverart Spider for Impawards.com

Postby tswhite70 on Thu Jan 19, 2006 11:23 pm

I wrote this spider for another project, but it turns out that http://www.impawards.com has a lot of cool coverart. The site contains high-quality scans of the actual movie posters, so you don't have to settle for the boring DVD coverart if you don't want to. For instance, Troy has 14 different posters available. In searching the site I've also found a fair number of non-English posters - some Spanish, French, Belgian, Japanese, etc.

http://home.comcast.net/~twhite644/spiders/UpdatedDVDSpiders.zip

updated: 01/29/2006
updated: 2/11/2006 - seems impawards has switched to Google Adsense for their search engine...
updated: 12/03/2006 - Google changed their html link structure
updated: 2/2/2007 - Google changed their html link structure

Note: I don't know why I had to do the replace on "posters" but I couldn't get the spider to work without it...

good luck,
tsw
Last edited by tswhite70 on Tue Nov 06, 2007 5:49 pm, edited 5 times in total.
tswhite70
 
Posts: 318
Joined: Tue Jan 06, 2004 3:44 pm
Location: Houston, Tx

Postby iGotNoTime on Mon Jan 23, 2006 6:40 pm

I have been trying all sorts of these formatting rules, and when I try to use the spider I am making, it seems to search the site, but I never get anything in the results. Could different sites require different formatting in the spider code?

The site I am trying to get the spider for is http://www.ethaicd.com, do you have any advice at all?

BTW thanks for your great work thus far, I am sure it has helped many many people. I can only imagine the time you have put into it so far!

Edited: I forgot to put in what I have so far....
Code: Select all
url=http://ethaicd.com/list.php?%searchstring%
results=<a class="text2bold" href="(?<url>/cdimage/.*?">(?<display>.*?)</a>
//find coverart <url> tag will open that url and use next regex
"javascript:photo_opener\('(?<coverart>.*?)&
iGotNoTime
 
Posts: 24
Joined: Mon Jan 23, 2006 5:14 pm

Ethaicd.com coverart spider

Postby tswhite70 on Mon Jan 23, 2006 7:43 pm

The code below should work, but I haven't had a chance to check it with XL yet. The results= section of the code needs to point to the URL for the individual CD, so that you can use that URL to find the coverart for the CD - sort of an intermediate step for the spider. Your code was trying to set the URL in results= to that of the coverart image on the search page; you were just missing the intermediate step.


url=http://ethaicd.com/list.php?keyword=%searchstring%
results=<TD width="25%"><A href="(?<url>/show.php.*?)"><IMG alt="(?<display>.*?)" src=".*?" border=0 onContext="return false" onContextMenu="return false"></A> <BR>
//coverart
<IMG alt=".*?" src="(?<coverart>/cdimage/.*?)" onContext="return false" onContextMenu="return false">

replace=coverart:cdimage:cdimage


I had the initial URL wrong (it was missing keyword=), and it turns out I needed the replace (though I have no idea why that makes it work)...

You can download the spider here:
http://home.comcast.net/~twhite644/spid ... cd.com.zip


good luck,
tsw
Last edited by tswhite70 on Tue Nov 06, 2007 5:49 pm, edited 5 times in total.
tswhite70
 
Posts: 318
Joined: Tue Jan 06, 2004 3:44 pm
Location: Houston, Tx

Postby iGotNoTime on Mon Jan 23, 2006 7:53 pm

Still comes up blank. It would have been very cool though. :(

Thanks for the help anyway. It works great using the default spiders, but most of my media is not found on those locations. 90% of the stuff I don't have info or covers for is found on that website.
iGotNoTime
 
Posts: 24
Joined: Mon Jan 23, 2006 5:14 pm

Postby tswhite70 on Tue Jan 24, 2006 1:12 am

I fixed the Ethaicd spider, see post above...
tsw
tswhite70
 
Posts: 318
Joined: Tue Jan 06, 2004 3:44 pm
Location: Houston, Tx

Postby iGotNoTime on Tue Jan 24, 2006 1:57 am

Very cool - more progress than I would have made on my own! It now shows some search results at least, but doesn't show any info or the album cover images the site hosts. Thank you very much for the extra attention; I don't think I ever would have got this far on my own.
iGotNoTime
 
Posts: 24
Joined: Mon Jan 23, 2006 5:14 pm

Postby tswhite70 on Tue Jan 24, 2006 5:08 am

Are you using the spider out of the zip file or did you copy the text from the post into your spider? If you copied the text, you may have a trailing space on one or more of the lines messing things up. Try the zip file. I've tried the zip file on both .NET 2.0 and pre 2.0 versions of Xlobby successfully. Searching for "Good" I get about 20 results, all of them have coverart. I'm not pulling any info with the spider, just coverart.

If you are still having trouble, give me an example of what you are searching for (the actual CD name) so I can make sure that there isn't something weird going on based on the result set. One last thought, are you using the spider through the DB editor or via the skin itself?

tsw
tswhite70
 
Posts: 318
Joined: Tue Jan 06, 2004 3:44 pm
Location: Houston, Tx

Postby iGotNoTime on Tue Jan 24, 2006 12:03 pm

I feel so stupid. I didn't see the zip, I just copied the code again. Works flawlessly, thank you very very much. I didn't mean to sound like I was begging for help; I am eager to learn and am even reading manuals at work trying to catch up to you guys. This was simply beyond me at this point and I was hoping for a fast fix. You gave it to me. Thanks so much.

If you ever need any databases or code messed up send me an email and I am certain I could help you out. :P
iGotNoTime
 
Posts: 24
Joined: Mon Jan 23, 2006 5:14 pm

Postby S Pittaway on Wed Jan 25, 2006 4:43 pm

is there any chance of someone knocking up a dvd spider that uses amazon.co.uk instead of amazon.com?

Half the time I end up dragging dvd covers from there instead of using the spider (I don't get a match on .com)
S Pittaway
 
Posts: 651
Joined: Wed Jan 25, 2006 11:08 am
Location: Manchester, England
