Limiting Extreme Picture Finder exploration

Maksym
Site Admin
Posts: 2230
Joined: 02 Mar 2009, 17:02

Post by Maksym »

I've been asked many times to explain how to limit Extreme Picture Finder's exploration so that it downloads content only from a certain part of a website. Here is the first post on this topic. I'm going to talk only about the options available in the [ Regular site ] section, because the [ TGP ] section is very limited and was designed only for a specific type of page.

Exploration mode

The exploration mode is selected in the [ Site exploration -> Regular site ] section of the project properties, or when you create a new project and click the [ Next > ] button several times.

The following exploration modes (or types) are currently available:

1. Entire website. Extreme Picture Finder will look for [ Target Files ] on all addresses that belong to the domain of the Starting URL. Even if the Starting URL is a folder or a page other than the home page of the website, the program will still try to crawl all pages of the website, including the home page and any other page that it finds on the domain of the Starting URL. For example, if you use this Starting URL:

https://example.com/folder1/folder2/page.html

Extreme Picture Finder will look for the [ Target Files ] on pages like

https://example.com
https://example.com/folder1/index.html
https://example.com/another-folder1/
https://example.com/category.php?id=2


In this mode the program will not crawl pages located on any sub-domain of the Starting URL's domain or on other domains:

https://forum.example.com

or

https://media.example.com

but it will download [ Target Files ] from a sub-domain or any other domain if the [ Download files from external sites ] box is checked and there are direct links to those [ Target Files ].
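
To make the rule above easier to picture, here is a minimal Python sketch of it. This is only my illustration of the behavior described in this post, not the program's actual code: the function names are made up, and comparing host names exactly (so that even "www.example.com" counts as a different host from "example.com") is an assumption.

```python
from urllib.parse import urlsplit


def may_crawl_page(start_url: str, page_url: str) -> bool:
    """Entire website mode: crawl a page only if it is on the Starting URL's host.

    Assumption: hosts are compared exactly, so any sub-domain is out of scope.
    """
    return urlsplit(start_url).hostname == urlsplit(page_url).hostname


def may_download_file(start_url: str, file_url: str, download_external: bool) -> bool:
    """A directly linked target file is allowed from the same host, or from
    anywhere when the 'Download files from external sites' box is checked."""
    return download_external or may_crawl_page(start_url, file_url)


start = "https://example.com/folder1/folder2/page.html"
in_scope = may_crawl_page(start, "https://example.com/category.php?id=2")
out_of_scope = may_crawl_page(start, "https://forum.example.com")
```

With this sketch, `in_scope` is true and `out_of_scope` is false, matching the examples above.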

2. Current directory and deeper. When this mode is selected, Extreme Picture Finder will look for the [ Target Files ] only on URLs that are located in the same folder as your Starting URL or deeper in the folder structure of the website. So, for example, if you use this Starting URL:

https://example.com/folder1/page.html

Extreme Picture Finder will explore addresses like this:

https://example.com/folder1/page-2.html
https://example.com/folder1/folder2/index.html


And addresses like this will not be crawled:

https://example.com
https://example.com/page.html
https://example.com/other-folder/


because they are not in the sub-folders of the Starting URL.

And for Starting URLs that do not end with a file name or a slash, like this:

https://example.com/somethinghere

Extreme Picture Finder will crawl all URLs that begin with the Starting URL in this mode:

https://example.com/somethinghere/index.html
https://example.com/somethinghere/folder/folder2/
https://example.com/somethinghere?name=value


In this mode [ Target Files ] will be saved from any part of example.com if they are linked directly from the "allowed" pages, and from external domains if the [ Download files from external sites ] box is checked. So, the "sub-folder" limit applies only to the website pages in this mode, not to the [ Target Files ].
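
The page-scope rule of this mode can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions, not the program's real logic: the function names are hypothetical, and "the last path segment contains a dot" is just a simple heuristic I use here to decide whether the Starting URL ends with a file name.

```python
from urllib.parse import urlsplit


def scope_prefix(start_url: str) -> str:
    """Return the URL prefix that every crawled page must begin with."""
    parts = urlsplit(start_url)
    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    if "." in last_segment:
        # Heuristic (assumption): a dot means a file name, so drop it
        # and keep the folder part, including the trailing slash.
        path = path[: len(path) - len(last_segment)]
    # No file name and no trailing slash: the Starting URL itself is the prefix.
    return f"{parts.scheme}://{parts.netloc}{path}"


def in_current_directory_or_deeper(start_url: str, page_url: str) -> bool:
    """True when page_url begins with the prefix derived from start_url."""
    return page_url.startswith(scope_prefix(start_url))


start = "https://example.com/folder1/page.html"
allowed = in_current_directory_or_deeper(start, "https://example.com/folder1/folder2/index.html")
blocked = in_current_directory_or_deeper(start, "https://example.com/other-folder/")
```

Here `allowed` is true and `blocked` is false. Remember that this check applies to pages only; directly linked [ Target Files ] are not restricted by it.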

3. Current page only. Extreme Picture Finder will download only your Starting URL and all [ Target Files ] that are linked directly from the Starting URL. If the [ Download files from external sites ] box is checked, then the [ Target Files ] from external domains that are linked directly from the Starting URL will be downloaded as well.

An important thing to note here is the meaning of the term "linked directly". A direct link means that there is a link to the [ Target File ] in the HTML text of a page. For example, if you want to download JPG images and your [ Target File ] is "*.jpg", then direct links should look like this:

<a href="https://example.com/images/image.jpg">Image</a>

or this:

<img src="/assets/image.jpg">
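
If you want to check a page for direct links yourself, a rough Python sketch like the one below does the job. It only illustrates the idea of a "direct link" and is not how Extreme Picture Finder works internally: the class and function names are made up, only `<a href>` and `<img src>` are inspected, and the host `imagehost.example` is a placeholder.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> and <img src="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative addresses such as "/assets/image.jpg".
                self.links.append(urljoin(self.base_url, value))


def direct_target_links(html: str, base_url: str, mask: str = "*.jpg"):
    """Return the links whose URL path matches the target file mask."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links
            if fnmatchcase(urlsplit(u).path.lower(), mask)]


page = (
    '<a href="https://example.com/images/image.jpg">Image</a>'
    '<img src="/assets/image.jpg">'
    '<a href="https://imagehost.example/view/123">Image page</a>'
)
found = direct_target_links(page, "https://example.com/gallery/page.html")
```

`found` contains the two JPG addresses but not the link to the image page, because that link points to another HTML page and not to an image file.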

You can use the [ View page source ] pop-up menu item in any browser (or press Ctrl + U on your keyboard) to see the HTML text of any page. And for those smart-ass websites that try to prevent you from viewing their page source using the standard method, you can always do this: simply add "view-source:" in front of the page address in the browser address bar, like this:

view-source:https://example.com/page.html

OK, once again: if a page contains links to "image pages" on external image hosts, the images will not be downloaded in any of the above modes, because those are links to image pages on external domains, not direct links to the image files. You will need to use the next exploration mode to do the job or use the [ Included URLs ] filters to add those image pages to the exploration.

4. Follow all links, limit only the exploration depth. In this mode Extreme Picture Finder will look for the [ Target Files ] on any pages it can find after downloading your Starting URL, not limiting itself to the domain of the Starting URL, so it's important to set the correct [ Exploration depth ] in this mode. Otherwise, the project will never stop and you won't get the results you wanted. This mode is useful when the [ Target Files ] are located one or two pages away from the domain of the Starting URL. A good example would be a forum with a lot of links to the image pages located on external image hosts.

[ Exploration depth ] limit

All of the above exploration modes, except for [ Current page only ], can be additionally limited with the [ Exploration depth ] setting. It tells Extreme Picture Finder how many levels of pages it is allowed to explore, counting the Starting URL as the first level, and in combination with the exploration mode it can limit the exploration area substantially. For example:

An [ Exploration depth ] of 1 allows Extreme Picture Finder to download only the Starting URL and all Target Files linked directly from the Starting URL. Very much like the [ Current page only ] mode.

An [ Exploration depth ] of 2 lets Extreme Picture Finder download your Starting URL and all Target Files linked directly from it, as well as all non-target-file links (links to other pages) found on the Starting URL and all Target Files linked directly from those pages.

An [ Exploration depth ] of 3 allows Extreme Picture Finder to follow the links found on the Starting URL, then all non-target-file links found on the pages from the previous step, and then stop.
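
Here is a small Python sketch of how a depth limit like this behaves. It is only a model under my own assumptions, not the program's code: the website is replaced by a made-up dictionary of page links, the function name is hypothetical, and the Starting URL is counted as level 1 to match the examples above.

```python
from collections import deque


def pages_within_depth(links: dict, start: str, depth: int) -> list:
    """Breadth-first walk that stops following links at the given depth.

    links maps each page to the pages it links to (a toy stand-in for a site).
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 1)])  # the Starting URL is level 1
    while queue:
        page, level = queue.popleft()
        if level >= depth:
            continue  # limit reached: do not follow links from this page
        for linked_page in links.get(page, []):
            if linked_page not in seen:
                seen.add(linked_page)
                order.append(linked_page)
                queue.append((linked_page, level + 1))
    return order


# Hypothetical site: start links to a and b, a links to c, c links to d.
site = {"start": ["a", "b"], "a": ["c"], "b": [], "c": ["d"]}
depth_1 = pages_within_depth(site, "start", 1)  # ['start']
depth_2 = pages_within_depth(site, "start", 2)  # ['start', 'a', 'b']
depth_3 = pages_within_depth(site, "start", 3)  # ['start', 'a', 'b', 'c']
```

Page "d" is never reached with a depth of 3, because it is four levels away from the Starting URL.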

OK, that's it for the first post. If anyone has questions, please post them as replies to this thread. And if anyone is interested, I'll explain more advanced topics: how to use [ URL filters ] and [ Excluded page parts ] to get precise results. Most of the recent generic templates rely completely on those two with the exploration mode of [ Entire website ].
hecramsey
Posts: 6
Joined: 01 Jan 2023, 02:25

Re: Limiting Extreme Picture Finder exploration

Post by hecramsey »

Great, thanks, this is perfect. Is this part of the online documentation? I was thinking I had to write regex. Ugh. Just for the future, if you build more docs, what would be great is use cases: simple task = single simple template as an example. I can put a few together,
i.e. start page > gallery page > download every .jpg where the link contains "myfavoritepic" = my simple template
start page > login > go to every link except where the link contains "some keyword" > 2 layers deep = my less simple template.

Re: Limiting Extreme Picture Finder exploration

Post by Maksym »

I'm planning to add more articles like this to the documentation. Right now this article is only here.

Re: Limiting Extreme Picture Finder exploration

Post by hecramsey »

great thanks this is great.

Re: Limiting Extreme Picture Finder exploration

Post by Maksym »

The exploration mode of [ Current directory and deeper ] should work in this case.

Re: Limiting Extreme Picture Finder exploration

Post by Maksym »

Here is a real-life project setting explained. A good case study of more advanced techniques where only specific pages of a huge website are crawled to get the target files:

hancinema.net picture galleries of a selected company