I'm trying to grab images from this page https://www.warhammer.com/en-GB/shop/wa ... s/tyranids and from the pages it links to. To me this seems like a simple request, but it does not seem to work unless I crawl the whole site, and the whole site has 5000+ images that I don't need.
Any help?
Site Exploration
-
- Site Admin
- Posts: 2246
- Joined: 02 Mar 2009, 17:02
Re: Site Exploration
The shops are never easy. You cannot use the exploration type [ Current directory and deeper ] because product pages are outside the folder of your [ Starting address ]:
warhammer.com/en-GB/shop/warhammer-40000/xenos-armies/tyranids
warhammer.com/en-GB/shop/tyranids-psychophage-2023
So, what I do in this case is use the exploration type [ Entire website ], then exclude all pages of the website by adding
^https?://(www\.)?warhammer\.com
to the [ Excluded URLs ], so that Extreme Picture Finder crawls only addresses that match Regular Expressions in the [ Included URLs ]. And then I populate the [ Included URLs ] with the pages I need to crawl. For example, the RE for your product pages would be:
warhammer\.com/en-GB/shop/[^/#]+$
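To see how the two expressions interact, here is a rough Python sketch. It is not how Extreme Picture Finder is implemented; it only assumes, as described above, that an address matching an [ Included URLs ] expression is crawled even if it also matches an [ Excluded URLs ] expression. The function name `should_crawl` is made up for illustration.

```python
import re

# The two expressions from this post.
EXCLUDED = [re.compile(r"^https?://(www\.)?warhammer\.com")]
INCLUDED = [re.compile(r"warhammer\.com/en-GB/shop/[^/#]+$")]


def should_crawl(url: str) -> bool:
    """Assumed precedence: an include match wins over an exclude match."""
    if any(p.search(url) for p in INCLUDED):
        return True
    return not any(p.search(url) for p in EXCLUDED)


category = "https://www.warhammer.com/en-GB/shop/warhammer-40000/xenos-armies/tyranids"
product = "https://www.warhammer.com/en-GB/shop/tyranids-psychophage-2023"

print(should_crawl(product))   # True: one path segment after /shop/
print(should_crawl(category))  # False: extra slashes after /shop/, so only the exclude matches
```

The category page fails the include expression because `[^/#]+$` allows no further slashes after `/shop/`. That is fine here, since the category page is the [ Starting address ] itself.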
Also, I recommend using [ Excluded page parts ] to prevent Extreme Picture Finder from going from one product page to another - you want it to visit product pages only from the category page, right? So, you will need to open the page source of the product page (Ctrl + U in your browser), identify the parts that have links to other products, and "cut them out" using the appropriate Regular Expressions. But I prefer to "cut everything out" except for the part that has the full-size image addresses. I suggest testing your [ Excluded page parts ] before running the project - use the [ Test... ] button under the list to make sure that links to the full-size images are not cut out.
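A minimal Python sketch of the "cut everything out" idea, assuming that whatever an [ Excluded page parts ] expression matches is removed from the page source before links are collected. The HTML and the `gallery` / `related` markers are invented for illustration; the real product page source will look different.

```python
import re

# Hypothetical product page source (not the real markup).
PAGE = """<html><body>
<nav><a href="/en-GB/shop/other-product-1">Other</a></nav>
<div id="gallery"><img src="/img/full-size-1.jpg"></div>
<div id="related"><a href="/en-GB/shop/other-product-2">Related</a></div>
</body></html>"""

# Cut everything before the gallery and everything from the related block on.
EXCLUDED_PAGE_PARTS = [
    r'^.*?(?=<div id="gallery">)',
    r'<div id="related">.*$',
]


def cut_parts(source: str, patterns: list[str]) -> str:
    """Remove every part of the source matched by one of the patterns."""
    for pattern in patterns:
        source = re.sub(pattern, "", source, flags=re.DOTALL)
    return source


remaining = cut_parts(PAGE, EXCLUDED_PAGE_PARTS)
links = re.findall(r'(?:href|src)="([^"]+)"', remaining)
print(links)  # ['/img/full-size-1.jpg']
```

Only the image address survives, so the links to other products are never followed. This is the same check the [ Test... ] button lets you do on the real page.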
-
- Posts: 21
- Joined: 17 Jan 2024, 16:14
Re: Site Exploration
I'm gonna give that a try.
I know that the images are stored at https://www.warhammer.com/app/resources ... g/product/. Is there a way to simply have it search the main page and only pull images from the resources links?
-
- Site Admin
- Posts: 2246
- Joined: 02 Mar 2009, 17:02
Re: Site Exploration
Can I have an example of the "resource link"? How can I find or generate such link(s) using the page source of the "main page"?
-
- Posts: 21
- Joined: 17 Jan 2024, 16:14
Re: Site Exploration
Website structure is pretty much like this
Main search page - https://www.warhammer.com/en-US/shop/wa ... s/tyranids
Sub page - https://www.warhammer.com/en-US/shop/ty ... e4fa9dcc53
- https://www.warhammer.com/en-US/shop/ty ... e4fa9dcc53
Images are stored -
https://www.warhammer.com/app/resources ... =920&h=948
https://www.warhammer.com/app/resources ... =920&h=948
Each image I want to grab is stored in the product folder of the website, and the links to them would be found on the main page.
-
- Site Admin
- Posts: 2246
- Joined: 02 Mar 2009, 17:02
Re: Site Exploration
I don't understand. Where in the source of
https://warhammer.com/en-US/shop/warhammer-40000/xenos-armies/tyranids
can I find links to product images? I don't even see the links to the product pages. I did a small investigation and found that both product and image links are in the response of the [ Fetch/XHR ] request with this URL:
https://m5ziqznq2h-dsn.algolia.net/1/indexes/prod-lazarus-product-en-us/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.20.0)%3B%20Browser
But you cannot see the content of this URL by simply opening it in your browser. This means you will need to use Extreme Picture Finder's built-in browser to get all the necessary request headers, such as "X-Algolia-Api-Key" and "X-Algolia-Application-Id" (at the very least). This trick is used in several templates, for example, the ones for Instagram and Twitter.
As I told you in the very first reply: "The shops are never easy."
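For anyone who wants to see what such a request looks like outside the program, here is a hedged Python sketch. The host and index name are taken from the URL above. The application ID and API key are NOT known here and have to be copied from the request headers shown by the browser. The request body is also an assumption (Algolia's generic `{"params": ...}` form); the body the site actually sends may differ. The function names are made up for illustration.

```python
import json
import urllib.request

# Host and index name come from the Fetch/XHR URL quoted in this post.
ALGOLIA_URL = (
    "https://m5ziqznq2h-dsn.algolia.net"
    "/1/indexes/prod-lazarus-product-en-us/query"
)


def build_request(app_id: str, api_key: str, params: str = "query=&hitsPerPage=20"):
    """Build (but do not send) the POST request with the two required headers."""
    body = json.dumps({"params": params}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        ALGOLIA_URL,
        data=body,
        method="POST",
        headers={
            "X-Algolia-Application-Id": app_id,  # copy from the browser
            "X-Algolia-Api-Key": api_key,        # copy from the browser
            "Content-Type": "application/json",
        },
    )


def fetch_hits(app_id: str, api_key: str) -> list:
    """Send the request and return the 'hits' list from the JSON response."""
    with urllib.request.urlopen(build_request(app_id, api_key)) as response:
        return json.load(response).get("hits", [])
```

Calling `fetch_hits("<application id>", "<api key>")` with the values copied from the browser should return the list of hits. Which fields inside each hit hold the product and image addresses has to be checked in the actual response.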
-
- Posts: 21
- Joined: 17 Jan 2024, 16:14
Re: Site Exploration
I'll keep trying on my end to get the results I want. I just don't understand why it's not grabbing the resources from the page: they show up in the source and in the network tab, e.g. 60010199067_WH40kUltimateStarterSet1.jpg?fm=webp&w=360&h=371. They are just JPGs shrunk down in size and displayed as WebP. It is indeed a pain. I might just use DownThemAll, as it grabs them; I'm just trying to understand the software better.
-
- Site Admin
- Posts: 2246
- Joined: 02 Mar 2009, 17:02
Re: Site Exploration
Can you show me a part of the source code of a CATEGORY page (you want to start from the category page, right?) where that image link is shown?
If you want to start from a product page - then just create a project with the exploration type of [ Current page only ] and your product page as a Starting Address and you're good to go.
-
- Site Admin
- Posts: 2246
- Joined: 02 Mar 2009, 17:02
Re: Site Exploration
There is another option. I forgot about it. You can use the [ Address list ] tab of the built-in browser to get the links of all images shown on the category page. Take a look:
And then you can apply a filter using a field at the bottom of the window and click the [ Select all ] button or simply manually check the addresses you want Extreme Picture Finder to download.
-
- Site Admin
- Posts: 2246
- Joined: 02 Mar 2009, 17:02
Re: Site Exploration
You can also use this to get the addresses of the product pages and then allow Extreme Picture Finder to find product images on those pages. You can automate this process using the following section of the project properties: