[RFC] Metascraper for e-commerce

Open Kikobeats opened this issue 4 years ago • 4 comments

The idea behind this issue is to determine what kind of data can be extracted and normalized across e-commerce URLs.

Examples of e-commerce sites:

  • https://www.garnier.ca
  • https://www.lorealparis.ca
  • https://www.kerastase.com

(not an exhaustive list; we need a lot more!)

Kikobeats avatar May 24 '21 21:05 Kikobeats

Thanks for helping to build easier e-commerce data extraction.

Overall, the e-commerce sites I've tested that use ld+json tend to expose brand, product name, and sku information consistently and predictably. Sites that opt for structured microdata without ld+json are less consistent in how they represent brand information: some use an element with itemprop="name|brand|sku", some nest it inside an itemtype="http://schema.org/Thing" element, and others follow patterns I have yet to discover.
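As a rough illustration of the ld+json case, here is a minimal sketch that pulls name, brand, and sku out of `@type: 'Product'` objects. The helper names (`flattenNodes`, `isProduct`, `extractProducts`) are hypothetical and not part of metascraper, and the regex-based script matching is a simplification; a real implementation would more likely use a proper HTML parser.

```javascript
// Sketch only: helper names are hypothetical, not part of any library.
// Matches <script type="application/ld+json"> blocks in raw HTML.
const LD_JSON_RE = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi

// JSON-LD may be a single object, an array, or a wrapper holding an @graph array.
function flattenNodes (value) {
  if (Array.isArray(value)) return value.flatMap(flattenNodes)
  if (value && typeof value === 'object') {
    return value['@graph'] ? [value, ...flattenNodes(value['@graph'])] : [value]
  }
  return []
}

// @type can be a string or an array of strings.
function isProduct (node) {
  const type = node['@type']
  return Array.isArray(type) ? type.includes('Product') : type === 'Product'
}

function extractProducts (html) {
  const products = []
  for (const match of html.matchAll(LD_JSON_RE)) {
    let parsed
    try {
      parsed = JSON.parse(match[1])
    } catch (_) {
      continue // skip malformed ld+json blocks
    }
    for (const node of flattenNodes(parsed).filter(isProduct)) {
      const brand = node.brand
      products.push({
        name: node.name,
        // brand may be a plain string or a nested Brand/Organization object
        brand: typeof brand === 'object' && brand !== null ? brand.name : brand,
        sku: node.sku
      })
    }
  }
  return products
}
```

Normalizing `brand` to a string is one possible design choice; it assumes callers only need the brand's display name.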

As of today, critical e-commerce data I'm seeking include product name, product brand, and product sku. In the near future, I may have a need for product pricing, variants, and accessories as defined in https://schema.org/Product.

Some data-gathering strategies I intend to use for products include:

  • [x] parse and return data from ld+json objects that use schema.org @type: 'Product'
  • [ ] Come up with schema.org microdata parsing and fallback strategies to cover as many e-commerce sites as possible, since some websites do not structure their data consistently
  • [ ] (feature request) conditionally retry page parsing every second, up to 5 seconds, if no products can be found. Some e-commerce sites use client-side rendering and take a while to display ld+json or microdata
  • [ ] (feature request) have an option to parse page elements and return their innerText so that redundant inner HTML gets excluded
  • [ ] parse and return multiple products based on offers https://schema.org/offers
  • [ ] Support RDFa parsing, though I have yet to come across a site that uses RDFa, so this could be a low priority
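The retry idea in the list above could look something like the following sketch. `retryUntilFound` and its `parse` argument are hypothetical names; `parse` is assumed to be any async function that re-reads the rendered page and resolves to an array of products.

```javascript
// Sketch only: `retryUntilFound` and `parse` are hypothetical names.
// Polls `parse` until it yields at least one product or the time budget runs out.
async function retryUntilFound (parse, { intervalMs = 1000, timeoutMs = 5000 } = {}) {
  const deadline = Date.now() + timeoutMs
  while (true) {
    const products = await parse()
    if (products.length > 0) return products
    // Give up if waiting another interval would exceed the deadline.
    if (Date.now() + intervalMs > deadline) return []
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
}
```

With the defaults this retries once per second for up to 5 seconds, matching the numbers in the feature request; both values are parameters so callers can tune them per site.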

Based on current Microlink features, I am able to extract product data using the prerender and waitForTimeout options. Here is a working demo: https://runkit.com/theetrain/microlink-mql-product-data/1.0.0

Product pages I have tested:

  • https://www.walmart.com/ip/Miracle-Gro-Garden-Soil-Vegetables-and-Herbs-1-5-cu-ft/46928865?athcpid=46928865&athpgid=athenaHomepage&athcgid=dealspage-home-2524396&athznid=ItemCarouselType_BestInDeals&athieid=v1&athstid=CS020&athguid=466001f5-9a18a716-46880cef9f15260d&athancid=null&athena=true
  • https://www.garnier.ca/en-ca/about-our-brands/hair-care/fructis/hair-treats/garnier-fructis-nourishing-treat-with-coconut-extract-400-ml
  • https://www.kerastase.ca/en/collections/nutritive/3474636721832.html
  • https://www.lorealparis.ca/en-ca/excellence-creme/excellence-creme-f-medium-brown
  • https://www.staples.ca/products/2735027-en-brother-tn760-black-toner-cartridge-high-yield
  • https://thelionchain.com/collections/exclusive-promotions/products/the-gold-edition-trap-set
  • https://shop.3dtotal.com/anatomy-figure/3dtotal-anatomy-3-piece-set-of-animal-figures
  • https://hellostella.myshopify.com/collections/rustic-stella/products/highland-fingering-posy
  • https://www.toysrus.ca/en/Hot-Wheels-Sky-Crash-Tower-Track-Set/242C6973.html
  • https://www.homedepot.com/p/RYOBI-18-Volt-ONE-Cordless-AirStrike-18-Gauge-Brad-Nailer-Tool-Only-with-Sample-Nails-P320/203810823?MERCH=REC-_-pnf-_-312306957-_-203810823-_-N&
  • https://thewhiteelephantdesigns.com/collections/the-baby-shop/products/chicken-dress

theetrain avatar May 28 '21 23:05 theetrain

Has this moved anywhere in the past few years? Or are you using add-ons like https://github.com/samirrayani/metascraper-shopping?

Very keen to know more about this.

adentranter avatar Feb 08 '24 21:02 adentranter

https://github.com/zbicin/metascraper-shopping might have some of the goods that you are looking for.

adentranter avatar Feb 08 '24 21:02 adentranter