Site Exploration

Amino
Posts: 21
Joined: 17 Jan 2024, 16:14

Site Exploration

Post by Amino »

I'm trying to grab the images from this page https://www.warhammer.com/en-GB/shop/wa ... s/tyranids and from the pages it links to. To me this seems like a simple request, but it does not seem to work unless I check the whole site, and the whole site has over 5000 images that are not needed.

Any help?
Maksym
Site Admin
Posts: 2246
Joined: 02 Mar 2009, 17:02

Re: Site Exploration

Post by Maksym »

The shops are never easy. You cannot use the exploration type [ Current directory and deeper ] because product pages are outside the folder of your [ Starting address ]:

warhammer.com/en-GB/shop/warhammer-40000/xenos-armies/tyranids

warhammer.com/en-GB/shop/tyranids-psychophage-2023

So what I do in this case is use the exploration type [ Entire website ], then exclude all pages of the website by adding

^https?://(www\.)?warhammer\.com

to the [ Excluded URLs ], so that Extreme Picture Finder crawls only addresses that match Regular Expressions in the [ Included URLs ]. And then I populate the [ Included URLs ] with the pages I need to crawl. For example, the RE for your product pages would be:

warhammer\.com/en-GB/shop/[^/#]+$
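A quick way to sanity-check these two expressions outside the program is a minimal Python sketch like the one below. The `would_crawl` helper and its precedence rule (an address matching [ Included URLs ] is crawled even though [ Excluded URLs ] matches everything) are assumptions that model the behaviour described above, not Extreme Picture Finder's actual code.

```python
import re

# The two expressions quoted in this post.
EXCLUDED = re.compile(r"^https?://(www\.)?warhammer\.com")
INCLUDED = re.compile(r"warhammer\.com/en-GB/shop/[^/#]+$")


def would_crawl(url: str) -> bool:
    """Hypothetical model: Included URLs win over Excluded URLs."""
    if INCLUDED.search(url):
        return True
    return EXCLUDED.search(url) is None


category = "https://www.warhammer.com/en-GB/shop/warhammer-40000/xenos-armies/tyranids"
product = "https://www.warhammer.com/en-GB/shop/tyranids-psychophage-2023"

# Print the verdict for the category page and for one product page.
for url in (category, product):
    print(would_crawl(url), url)
```

Under this model the product page is accepted, while the category page is rejected because everything after `/shop/` must be a single path segment. That is consistent with using the category page only as the [ Starting address ].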

Also, I recommend using [ Excluded page parts ] to prevent Extreme Picture Finder from going from one product page to another - you want it to visit product pages only from the category page, right? So, you will need to open the page source of the product page (Ctrl + U in your browser), identify the parts that have links to other products, and "cut them out" using the appropriate Regular Expressions. But I prefer to "cut everything out" except for the part that has the full-size image addresses. I suggest testing your [ Excluded page parts ] before running the project - use the [ Test... ] button under the list to make sure that links to the full-size images are not cut out.
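The idea of "cutting out" a page part can be tried on a small sample before touching the project. The sketch below is only an illustration: the HTML, the `related` class name, and the example.com addresses are all hypothetical, and the real product page source will look different.

```python
import re

# Hypothetical product-page source; the real markup on warhammer.com will differ.
PAGE = """
<div class="gallery">
  <img src="https://example.com/images/full/product-a.jpg">
</div>
<section class="related">
  <a href="https://example.com/shop/other-product">Other product</a>
</section>
"""

# Hypothetical "excluded page part": the whole related-products section.
CUT_RELATED = re.compile(r'<section class="related">.*?</section>', re.DOTALL)

# Simple pattern that picks up href and src attribute values.
LINK = re.compile(r'(?:href|src)="([^"]+)"')


def links_after_cut(source: str) -> list[str]:
    """Remove the excluded part, then list the links that remain."""
    remaining = CUT_RELATED.sub("", source)
    return LINK.findall(remaining)


print(links_after_cut(PAGE))
```

Only the full-size image address should survive the cut. If it disappears as well, the expression is too greedy, which is the same thing the [ Test... ] button is meant to reveal.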

Post by Amino »

I'm gonna give that a try.

I know that the images are stored at https://www.warhammer.com/app/resources ... g/product/. Is there a way to simply have it search the main page and only pull images from the resources links?

Post by Maksym »

Can I have an example of the "resource link"? How can I find or generate such link(s) using the page source of the "main page"?

Post by Amino »

The website structure is pretty much like this:
Main search page - https://www.warhammer.com/en-US/shop/wa ... s/tyranids
Sub page - https://www.warhammer.com/en-US/shop/ty ... e4fa9dcc53
- https://www.warhammer.com/en-US/shop/ty ... e4fa9dcc53
Images are stored -
https://www.warhammer.com/app/resources ... =920&h=948
https://www.warhammer.com/app/resources ... =920&h=948

Each image I want to grab is stored in the product folder of the website, and the links to them would be found on the main page.

Post by Maksym »

I don't understand. Where in the source of

https://warhammer.com/en-US/shop/warhammer-40000/xenos-armies/tyranids

can I find links to product images? I don't even see the links to the product pages. I did a small investigation and found both product and image links can be found in the response of the [ Fetch/XHR ] request with this URL:

https://m5ziqznq2h-dsn.algolia.net/1/indexes/prod-lazarus-product-en-us/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.20.0)%3B%20Browser

But you cannot see the content of this URL by simply opening it in your browser. This means you will need to use Extreme Picture Finder's built-in browser to get all the necessary request headers, such as "X-Algolia-Api-Key" and "X-Algolia-Application-Id" (at the very least). This trick is used in several templates, for example, the ones for Instagram and Twitter.
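To show why opening the URL alone is not enough, here is a minimal Python sketch of such a request. It only builds the request and does not send it. The host and index name come from the URL above (the `x-algolia-agent` query parameter is left out); the application ID, API key, request body, and the sample response shape are placeholders and assumptions, since none of them appear in this thread.

```python
import json
import urllib.request

# Host and index name are taken from the URL quoted above.
ALGOLIA_URL = (
    "https://m5ziqznq2h-dsn.algolia.net"
    "/1/indexes/prod-lazarus-product-en-us/query"
)


def build_query_request(app_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the two Algolia headers."""
    return urllib.request.Request(
        ALGOLIA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-Api-Key": api_key,
            "Content-Type": "application/json",
        },
    )


def find_image_links(node) -> list[str]:
    """Walk decoded JSON and collect every string that looks like a JPG address."""
    if isinstance(node, str):
        return [node] if ".jpg" in node.lower() else []
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        return [link for child in node for link in find_image_links(child)]
    return []


# Placeholders: copy the real values from the request headers shown by the browser.
# The body {"query": ""} is an assumed placeholder, not the site's real request body.
req = build_query_request("YOUR_APP_ID", "YOUR_API_KEY", {"query": ""})
print(req.get_method(), req.full_url)

# Hypothetical response shape, for illustration only.
sample = {"hits": [{"name": "Example", "images": ["/path/to/example.jpg"]}]}
print(find_image_links(sample))
```

The recursive walk is used because the real field names in the response are not known here. Once you have seen an actual response, it is simpler to read the image fields directly.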

Image

As I told you in the very first reply: "The shops are never easy."

Post by Amino »

I'll keep trying on my end to get the results I want. I just don't understand why it's not grabbing the resources from the page. They show up in the source and in the network requests, for example 60010199067_WH40kUltimateStarterSet1.jpg?fm=webp&w=360&h=371. They are just JPGs shrunk down in size and displayed as WebP. It is indeed a pain. I might just use downloadthemall, as it grabs them. I'm just trying to understand the software better.
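For reference, the resource name above can be taken apart with a short Python sketch. It only splits the string quoted in this post. Whether the server returns the original full-size JPG when the query string is dropped is an assumption that has not been verified in this thread.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

# File name and query string as quoted above; the host and path are omitted
# because the thread only shows them in truncated form.
resource = "60010199067_WH40kUltimateStarterSet1.jpg?fm=webp&w=360&h=371"

parts = urlsplit(resource)

# The query carries the conversion options: output format, width, and height.
options = parse_qs(parts.query)
print(options)

# Drop the query string to get the bare file name.
bare = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
print(bare)
```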

Post by Maksym »

Can you show me a part of the source code of a CATEGORY page (you want to start from the category page, right?) where that image link is shown?

If you want to start from a product page - then just create a project with the exploration type of [ Current page only ] and your product page as a Starting Address and you're good to go.

Post by Maksym »

There is another option. I forgot about it. You can use the [ Address list ] tab of the built-in browser to get the links of all images shown on the category page. Take a look:

Image

And then you can apply a filter using a field at the bottom of the window and click the [ Select all ] button or simply manually check the addresses you want Extreme Picture Finder to download.

Post by Maksym »

You can also use this to get the addresses of the product pages and then allow Extreme Picture Finder to find product images on those pages. You can automate this process using the following section of the project properties:

Image