How do i go to next page on this site

jinnydar
Posts: 2
Joined: 31 Dec 2022, 05:25

How do i go to next page on this site

Post by jinnydar » 31 Dec 2022, 05:31

How do I download the images from all pages of this site?

https://www.houzz.com/photos/gray-laund ... 753~a_88-8
https://www.houzz.com/photos/gray-laund ... _88-8?pg=2
https://www.houzz.com/photos/gray-laund ... _88-8?pg=3.... and so on.

I can't get it to go to the next page when using "current page only". If I use "current directory and deeper" instead, it also downloads from other parts of the site.

What custom parser or Included URL string do I need to add?

Maksym
Site Admin
Posts: 2077
Joined: 02 Mar 2009, 17:02

Re: How do i go to next page on this site

Post by Maksym » 02 Jan 2023, 15:14

Well, downloading all images from all pages on the website is pretty easy and can be done with the generic template. But it looks like you want all images from all pages of the "Gray Laundry Room Ideas" thread. Right?

In cases like this, when I want my project to crawl only specific website pages, I do it like this:

1. I set the exploration type to [ Entire website ]

2. In the [ Filters - Excluded URLs ] I add a Regular Expression that excludes the entire domain of the Starting address, so that Extreme Picture Finder crawls only the addresses I allow in the [ Included URLs ]. Like this:

^https?://(www\.)?houzz\.com

This excludes all addresses that start with the following:

http://www.houzz.com
http://houzz.com
https://www.houzz.com
https://houzz.com

which is, basically, any address that belongs to the domain of the Starting address, but not to its sub-domains. And that's very important, because images are often located on a sub-domain, for example:

https://cdn.domain.com
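To see what that expression does and does not catch, here is a small Python sketch. It assumes Python's `re` module behaves like the program's regex engine for this simple pattern, and that the pattern is searched for inside the address rather than required to match the whole address. The `cdn.houzz.com` address is made up just to show a sub-domain.

```python
import re

# Pattern copied from the post above.
EXCLUDE_DOMAIN = re.compile(r"^https?://(www\.)?houzz\.com")

def is_excluded(url: str) -> bool:
    """Return True when the address starts with the Starting address domain."""
    return EXCLUDE_DOMAIN.search(url) is not None

urls = [
    "http://www.houzz.com",
    "http://houzz.com",
    "https://www.houzz.com",
    "https://houzz.com",
    "https://cdn.houzz.com/picture.jpg",  # hypothetical sub-domain address
]
for url in urls:
    print(url, "->", "excluded" if is_excluded(url) else "not excluded")
```

The four domain addresses are reported as excluded, while the sub-domain address is not, so images stored there can still be downloaded.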

Another way of doing it would be excluding all addresses by adding

.

(a single dot) to the Excluded URLs, but in this case you would need to add Regular Expressions for all possible image locations to the Included URLs. If you only exclude the domain of your Starting address and leave the [ Download files from external sites ] box checked, then all images located on sub-domains or any other external domains will be downloaded automatically.

Also, it's a good idea to exclude thumbnail URLs by adding something they all have in common. Usually it's a folder, like "/thumbs", but here it's this:

-w\d+-h\d+

to get rid of all URLs like

-w150-h150

which are the thumbnails on houzz.com.
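A quick way to check the thumbnail expression outside the program is a few lines of Python. This is only a sketch: the two image addresses are invented for illustration, and it assumes the pattern is searched for anywhere inside the address.

```python
import re

# Pattern copied from the post above: "-w<digits>-h<digits>".
THUMBNAIL = re.compile(r"-w\d+-h\d+")

def is_thumbnail(url: str) -> bool:
    """Return True when the address contains a width/height suffix."""
    return THUMBNAIL.search(url) is not None

# Both addresses are hypothetical examples, not real houzz.com URLs.
print(is_thumbnail("https://cdn.example.com/photo-w150-h150.jpg"))
print(is_thumbnail("https://cdn.example.com/photo-full.jpg"))
```

The first call prints True and the second prints False, because only the first address contains the width and height suffix.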

3. Then I add Regular Expressions that cover all the pages I want Extreme Picture Finder to crawl in the project. I try to keep them as simple and as unique as possible. In this project I need all pages of the "Gray Laundry Room Ideas" thread:

~a_88-8\?pg=\d+$

And then the photo pages:

/photos/[^/\?]+~\d+$
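The two Included URLs expressions can be tried out like this. It is a rough Python sketch, not how Extreme Picture Finder works internally. The full addresses in the original post are shortened, so every URL below is a made-up stand-in with the same ending, and `classify` is just a helper name chosen for this example.

```python
import re
from typing import Optional

# Patterns copied from the post above.
THREAD_PAGE = re.compile(r"~a_88-8\?pg=\d+$")
PHOTO_PAGE = re.compile(r"/photos/[^/\?]+~\d+$")

def classify(url: str) -> Optional[str]:
    """Return which Included URLs pattern matches the address, if any."""
    if THREAD_PAGE.search(url):
        return "thread page"
    if PHOTO_PAGE.search(url):
        return "photo page"
    return None

# Hypothetical addresses that only imitate the shape of the real ones.
samples = [
    "https://www.houzz.com/photos/example-ideas~a_88-8?pg=2",
    "https://www.houzz.com/photos/example-photo-title~123456",
    "https://www.houzz.com/professionals/example",
]
for url in samples:
    print(url, "->", classify(url))
```

The `$` at the end of both patterns matters: an address with anything after the page number or the photo number is not matched.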

And that would have done it if the photo pages didn't have links to other photo pages that do not belong to the same thread. But they do, under the "Similar ideas" heading and others. And this is where [ Excluded Page Parts ] come in handy.

4. I add Regular Expressions to the [ Excluded Page Parts ] to prevent Extreme Picture Finder from crawling links that match Regular Expressions in the [ Included URLs ] but do not match my project needs. [ Excluded Page Parts ] allow Extreme Picture Finder to ignore certain parts of the HTML page source. Normally I prefer to leave "unignored" only the parts that have the content or the links I need. Yes, this requires studying the HTML source code of the page (or pages). So, in the source of the photo page I found the main image tag and then added unique HTML elements that surround that image. This is what I ended up with:

1.
From: (leave empty)
To: <div class="view-photo-image-pane">

2.
From: data-compid="vp-fullscreen"
To: (leave empty)

If you leave the [ From ] field empty, then Extreme Picture Finder will ignore everything from the beginning of the page to the text matching the Regular Expression in the [ To ] field. If you leave the [ To ] field empty, then Extreme Picture Finder will ignore everything from the text matching the Regular Expression in the [ From ] field to the end of the page.
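The From/To rules described above can be imitated in Python roughly like this. It is a sketch under assumptions: `ignore_part` is a name invented for this example, the sample page is a tiny made-up fragment, and the matched From and To text is treated as part of the ignored region, which may differ from what the program really does.

```python
import re

def ignore_part(html: str, from_re: str, to_re: str) -> str:
    """Cut one ignored region out of html and return the rest."""
    start, search_pos = 0, 0
    if from_re:  # an empty From means "start at the beginning of the page"
        m = re.search(from_re, html)
        if m is None:
            return html  # rule does not apply to this page
        start, search_pos = m.start(), m.end()
    end = len(html)
    if to_re:  # an empty To means "ignore up to the end of the page"
        m = re.compile(to_re).search(html, search_pos)
        if m is None:
            return html
        end = m.end()
    return html[:start] + html[end:]

# Made-up page fragment with links before and after the main image.
page = (
    '<html><a href="/photos/other~111">Menu</a>'
    '<div class="view-photo-image-pane"><img src="main.jpg"></div>'
    '<div data-compid="vp-fullscreen"></div>'
    '<a href="/photos/similar~222">Similar ideas</a></html>'
)
# The two rules from the post above, as (From, To) pairs.
rules = [
    ("", '<div class="view-photo-image-pane">'),
    ('data-compid="vp-fullscreen"', ""),
]
kept = page
for from_re, to_re in rules:
    kept = ignore_part(kept, from_re, to_re)
print(kept)
```

Only the part with the main image survives, so the links before it and the "Similar ideas" links after it are never seen by the crawler.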

That's it. Here is the resulting project:

houzz.com - gray laundry room ideas

Right-click the above link and select [ Save link as... ]. Then double-click the saved file to add it to Extreme Picture Finder.

jinnydar
Posts: 2
Joined: 31 Dec 2022, 05:25

Re: How do i go to next page on this site

Post by jinnydar » 02 Jan 2023, 19:57

It worked perfectly
