It all started back in 2008 with the launch of the t-shirt search engine Teenormous.com. At the time of the launch, Teenormous consisted of just under 15,000 t-shirts, offered by roughly a half-dozen merchants.
Over the next few months, more tees and merchants were added to the site and the inventory quickly grew. As Teenormous approached 50,000 shirts, updating the site became increasingly painful and time-consuming. Doing everything by hand was quickly turning into a full-time job and was no longer a viable option. Something had to change.
Enter automation. Over the past 8 years, I’ve been refining the way I manage Teenormous, as well as 4 additional niche search engines I own, through a series of automated procedures. These techniques have allowed me to manage over 300,000 products, across 5 sites, nearly on auto-pilot.
I’ve decided to share some of these techniques in the hope that some of this may prove useful to you as well.
For the sake of this post, I’ll focus on my t-shirt search engine Teenormous, since it is the largest and oldest of the niche search engines.
Here is an overview of the major moving parts within the system:
All of the shirts on Teenormous come from one of two sources:
The data feeds are in the form of either a custom Teenormous-specific feed format, or an affiliate feed format from one of the big affiliate networks such as Commission Junction or ShareASale.
For each type of feed, I’ve developed a custom feed processor to parse the feed and extract the data into a format that we can use.
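As a rough sketch of what these feed processors look like, here is one parser per format, each emitting a common internal record. The field names, delimiters, and the `process_feed` dispatcher are all illustrative assumptions, not the actual Teenormous code:

```python
import csv
import io

def parse_teenormous_feed(text):
    """Parse the custom Teenormous-style feed (assumed tab-delimited here)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{"title": r["title"], "price": float(r["price"]), "url": r["url"]}
            for r in reader]

def parse_affiliate_feed(text):
    """Parse an affiliate-network feed (assumed CSV with different headers)."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"title": r["NAME"], "price": float(r["PRICE"]), "url": r["BUYURL"]}
            for r in reader]

# One processor per feed type, selected by the merchant's configured format.
PROCESSORS = {
    "teenormous": parse_teenormous_feed,
    "affiliate": parse_affiliate_feed,
}

def process_feed(feed_format, text):
    return PROCESSORS[feed_format](text)
```

The key design point is that every parser funnels into the same record shape, so everything downstream is format-agnostic.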
Some merchants are unable to create a feed for their inventory. For these sites, we crawl the merchant’s store (with their permission) to obtain any t-shirt information needed.
Each web crawler requires some custom code and logic to obtain as much of the relevant tee information as possible. This data is then stored in a Teenormous feed format so it can fit right into the same process that other data feed items flow through.
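A minimal sketch of one such crawler's extraction logic is below. The HTML patterns are invented for illustration (each real merchant needs its own), and `to_feed_row` shows the idea of serializing into the same Teenormous feed format the feed processors consume:

```python
import re

def extract_tee(html):
    """Pull title and price out of one product page (merchant-specific patterns)."""
    title = re.search(r'<h1 class="product-title">(.*?)</h1>', html)
    price = re.search(r'<span class="price">\$([\d.]+)</span>', html)
    if not title or not price:
        return None  # page layout probably changed; log it and investigate
    return {"title": title.group(1), "price": float(price.group(1))}

def to_feed_row(item, url):
    """Serialize a crawled item into the same (assumed tab-delimited) feed format."""
    return f'{item["title"]}\t{item["price"]}\t{url}'
```

Because crawled items end up in the same feed format, the rest of the pipeline never needs to know whether an item came from a feed or a crawl.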
The purpose of the processing pipeline is as follows:
This processing pipeline is designed to run in parallel. Running multiple jobs at the same time reduced the total processing time by a factor of 10 compared to running them sequentially. This alone cut the total update time from 12+ hours to just a couple of hours on average.
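Since each merchant's feed is an independent unit of work, the parallelism can be as simple as a worker pool mapping over merchants. This is an illustrative sketch, not the production code; `process_merchant` stands in for the real pipeline stages:

```python
from concurrent.futures import ThreadPoolExecutor

def process_merchant(merchant):
    # parse -> validate -> extract attributes -> build images -> site feed
    return f"{merchant}: done"

def run_pipeline(merchants, workers=8):
    """Process every merchant's feed concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_merchant, merchants))
```

For CPU-heavy work like image processing, a process pool (or a farm of worker machines, as described later) would be the better fit; a thread pool is shown here just to keep the sketch simple.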
When you are dealing with a lot of data from a lot of different sources, you can be certain there will be a never-ending stream of data issues. To make matters worse, this is a moving target, since merchants routinely put bad data in their feeds.
As a result, the validation step is one of the most important parts of this pipeline. Validation makes sure that the required fields are there, such as title, price, image(s), and description.
In addition, this step will kick out any items that don’t belong on the site, such as non-shirt items. The process of determining which items are valid is not an exact science though. For example, hoodies are allowed on the site, but these cute little hoodie bears at one point snuck in as well:
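A hedged sketch of what this validation step might look like is below. The required fields come from the post; the keyword blocklist is purely illustrative (and, as the hoodie bears show, the real rules are fuzzier than this):

```python
REQUIRED_FIELDS = ("title", "price", "image_url", "description")

# Words suggesting an item isn't apparel at all -- an assumed, illustrative list.
BANNED_KEYWORDS = ("mug", "poster", "sticker", "plush")

def validate(item):
    """Return a list of problems; an empty list means the item passes."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if not item.get(f)]
    title = item.get("title", "").lower()
    if any(word in title for word in BANNED_KEYWORDS):
        problems.append("not a shirt")
    return problems
```

Returning a list of problems (rather than a bare pass/fail) makes it easy to log exactly why an item was kicked out.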
The data process phase is responsible for determining the correct product attributes for a given item. These include:
These attributes are used to build the search facets on the site:
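As one possible sketch of attribute extraction, simple keyword rules over the title and description can derive facet values. The specific attributes and keyword lists below (color, style) are assumptions for illustration; the real Teenormous rules are surely more involved:

```python
COLORS = ("black", "white", "red", "blue", "green")
STYLES = {"hoodie": "Hoodie", "tank": "Tank Top", "tee": "T-Shirt"}

def extract_attributes(item):
    """Derive facet attributes from an item's text (illustrative rules only)."""
    text = f'{item["title"]} {item.get("description", "")}'.lower()
    color = next((c for c in COLORS if c in text), None)
    style = next((v for k, v in STYLES.items() if k in text), "T-Shirt")
    return {"color": color, "style": style}
```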
The final step in the process is to create the product images. These images will be cropped and sized to fit within the layout on Teenormous so all of the varied t-shirt images feel like they belong together on the site.
These images are uploaded to an Amazon CloudFront-enabled S3 bucket. This allows the images to be served immediately after they are created, without needing to move them anywhere else. Using a CDN such as CloudFront results in these images being served much faster, since each visitor fetches a copy of the image from a server geographically closer to them.
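The core of the image step is the crop math: fitting an arbitrary merchant image into a fixed thumbnail aspect ratio. The thumbnail dimensions below are assumptions (the post doesn't give the real layout size); the function computes a center-crop box matching the target ratio:

```python
THUMB_W, THUMB_H = 300, 350  # assumed thumbnail size, not the real layout values

def crop_box(width, height, target_w=THUMB_W, target_h=THUMB_H):
    """Return a (left, top, right, bottom) center-crop box at the target ratio."""
    target_ratio = target_w / target_h
    if width / height > target_ratio:
        new_w = int(height * target_ratio)   # too wide: trim the sides
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    new_h = int(width / target_ratio)        # too tall: trim top and bottom
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)
```

In production, a box like this would be fed to an image library's crop call and the result uploaded to the CloudFront-backed S3 bucket (Pillow and boto3 would be the usual tools); those pieces are omitted to keep the sketch dependency-free.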
If all went well, the final destination for a good item in the feed is the Site Feed. The Site Feed is a subset of the initial feed that contains all of the good items that have been processed and are ready to be imported into Teenormous. Any invalid items, non-tee items, or items without images will not be in this feed.
The Site Feed for each merchant is placed in an Amazon S3 bucket and the processing job is complete. Teenormous will then check this bucket for updated feeds and update the data for the merchant if it has changed. Since each site feed contains all good items for a merchant, the import process will also hide any items that are not in the feed. These items are typically out of stock or no longer for sale, so it is important to remove them from the site so shoppers don’t see them.
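The import step boils down to a set difference between what's in the Site Feed and what's currently live. This is a hypothetical sketch of that logic (the function and field names are assumptions):

```python
def plan_import(site_feed_ids, live_ids):
    """Decide what to upsert and what to hide for one merchant's import."""
    site_feed_ids, live_ids = set(site_feed_ids), set(live_ids)
    return {
        "upsert": sorted(site_feed_ids),           # insert new, update existing
        "hide": sorted(live_ids - site_feed_ids),  # gone from the feed: out of stock
    }
```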
Separating the processing of feeds and web crawling from the actual site updates is a great way to divide and conquer the major responsibilities for keeping a site up to date. The processing can be done using a farm of worker machines, while the site updates can stay simple and focused on just doing simple inserts and updates.
If all goes well, this process runs like a well-oiled machine and I can just kick back and enjoy the profits. However, if there is one thing I've learned over the years, it is that, as Murphy's law says, whatever can go wrong, will.
Here are just a few highlights of the types of issues that can go wrong within this process, and how I handle them.
In the t-shirt space, there have been a LOT of merchants that come and go. When this happens, oftentimes the merchant just shuts down without a word.
To account for this, I use an automated merchant check that will crawl the most recent items in a merchant’s feed and report back any failing results so we can look into it.
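One way to sketch that merchant check: fetch the product URL for each recent item and collect anything that fails. The `fetch` callable is injected here so the check is easy to test; in production it would be an HTTP GET returning a status code. This is an illustrative shape, not the real check:

```python
def check_merchant(recent_items, fetch):
    """Return the URLs of recent items that no longer resolve to a live page."""
    failures = []
    for item in recent_items:
        try:
            status = fetch(item["url"])
        except OSError:       # connection refused, DNS gone, etc.
            status = None
        if status != 200:
            failures.append(item["url"])
    return failures
```

A non-empty result would trigger the "look into it" report mentioned above.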
If a merchant updates their site, such as with a new theme, this can break a web crawler, and oftentimes a feed processor as well if the merchant forgets to update their feed to reflect the new structure.
These types of errors are usually logged during the processing pipeline. If this is a web crawler issue, the fix involves manually modifying the crawling code to reflect the new structure. If this is a feed issue, the merchant is notified, and in some cases, temporarily hidden from the site until the issue is fixed.
Missing or bad images occur when a merchant specifies an image location in their feed that leads to a missing or invalid image URL. This is most likely just an oversight on the merchant's part. However, it happens more often than you'd think.
When this occurs, the image processing portion of the pipeline creates an error notification so this can be looked into. The items with bad images will be skipped during the process and the merchant will be notified as needed.
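A common flavor of "invalid image" is a URL that returns something, just not an image (an HTML error page, for example). One cheap way to catch this, shown here as an illustrative sketch, is to sniff the first bytes of the response against well-known image signatures:

```python
# Magic-byte signatures for the common web image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def sniff_image(data):
    """Return the detected image type, or None if the bytes aren't an image."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return None
```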
An incomplete feed typically occurs when a merchant does not properly build their feed and, as a result, it contains only a fraction of the items that should be there.
To work around this, a site import on Teenormous is halted if the feed contains fewer than 60% of the number of items currently on the site. If the decrease turns out to be legitimate, the feed can be force-imported to override this check.
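That guard condenses to a one-line comparison. The 60% threshold is the one described above; the function and parameter names are assumptions:

```python
def should_import(feed_count, live_count, threshold=0.60, force=False):
    """Halt the import if the feed has shrunk suspiciously, unless forced."""
    if force or live_count == 0:
        return True  # nothing live yet, or the operator confirmed the drop
    return feed_count >= live_count * threshold
```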
Last but not least, a stale or outdated feed is probably the most common, and most annoying, issue to deal with on Teenormous. This typically occurs because a merchant is not generating their feed automatically and they forget to update it, or it is simply too tedious, so they put it off until someone complains.
To help detect this scenario on Teenormous, we track when each feed was last changed and imported into the site. If this goes beyond a certain number of days, we reach out to the merchant to address it.
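That staleness check could look something like the sketch below. The 14-day cutoff is an assumption (the post doesn't state the actual threshold), as are the function and variable names:

```python
from datetime import datetime, timedelta

def stale_merchants(last_changed, now=None, max_age_days=14):
    """Return merchants whose feed hasn't changed in more than max_age_days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return sorted(m for m, ts in last_changed.items() if ts < cutoff)
```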
In summary, when dealing with a large number of products from a wide variety of sources, a little automation goes a long way. Automation is a living, breathing process that will grow over time based on the ever-changing needs of your company. I hope you found some of these experiences helpful.