One million (small web) screenshots


Background

Last month I came across onemillionscreenshots.com and was pleasantly surprised at how well it worked as a tool for discovery. We all know the adage about judging book covers, but here… it just kinda works. Skip over the sites with bright, flashy colors begging for attention and instead seek out the negative space in between.

The one nitpick I have, though, is how they sourced the websites. They used the most popular websites from Common Crawl, which is fine, but not really what I’m interested in…

There are of course exceptions, but the relationship between popularity and quality is loose. McDonald’s isn’t popular because it serves the best cheeseburger; it’s popular because it serves a cheeseburger that meets the minimum level of satisfaction for the maximum number of people. It’s a profit-maximizing local minimum on the cheeseburger landscape.

This isn’t limited to just food either: the NYT Best Sellers list, the Spotify Top 50, and Amazon review volume are other good examples. For me, what’s “popular” has become a filter for what to avoid. Lucky for us though, there’s a corner of the internet where substance still outweighs click-through rates. A place that’s largely immune to the corrosive influence of monetization. It’s called the small web, and it’s a beautiful place.

A small web variant

The timing of this couldn’t have been better. I’m currently working on a couple of tools specifically focused on small web discovery/recommendation and happen to already have most of the data required to pull this off. I just needed to take some screenshots, sooo… you’re welcome, armchairhacker!


[full screen version: screenshots.nry.me]

Technical details

Because I plan on discussing how I gathered the domains in the near future, I’ll skip it for now (it’s pretty interesting). Suffice it to say though, once the domains are available, capturing the screenshots is trivial (a quick sketch follows the list below). And once those are ready, we have a fairly well-worn path to follow:

1. generate visual embeddings
2. dimensionality reduction
3. assignment
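
To give a sense of just how trivial the capture step is, a single screenshot boils down to a handful of lines with headless Chromium driven by Playwright. Playwright is just one of several workable options here (not a statement about my exact setup), and the viewport size is a placeholder.

    from playwright.sync_api import sync_playwright

    def capture(url, out_path):
        # render the page in headless Chromium and save a fixed-size screenshot
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page(viewport={"width": 1280, "height": 800})
            page.goto(url, timeout=15_000)   # timeout in ms
            page.screenshot(path=out_path)
            browser.close()

    capture("https://example.com", "example.png")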

I find the last two steps particularly repetitive so I decided to combine them this time via self-organizing maps (SOMs). I tried using SOMs a few years ago to help solve a TSP problem (well, actually the exact opposite…) but ended up going in a different direction. Anyway, despite their trivial implementation, they can be extremely useful. A bare-bones SOM clocks in at about 10 lines with torch.

    @torch.no_grad()
    def som_step(alpha, sigma):
    
        # pick random training sample
        sample_index = np.random.randint(0, x.shape[0])
        D_t = torch.from_numpy(x[sample_index]).to(DEVICE)
    
        # find the bmu using cosine similarity
        bmu_flat_idx = torch.argmax(torch.nn.CosineSimilarity(dim=2)(D_t, W).flatten())
    
        # convert flat index to 2d coordinates (i, j)
        u_i = bmu_flat_idx // W.shape[1]
        u_j = bmu_flat_idx  % W.shape[1]
    
        # compute the l2  distance between the bmu and all other neurons
        dists_u = torch.sqrt((W_i - u_i)**2 + (W_j - u_j)**2)
    
        # apply neighborhood smoothing (theta)
        theta = torch.exp(-(dists_u / sigma)**2)
    
        # update the weights in-place
        W.add_((theta * alpha).unsqueeze(2) * (D_t - W))
    
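The step function above leans on a handful of globals (x, W, W_i, W_j, DEVICE). A minimal setup might look like the following; the grid size, embedding dimension, and embeddings file name are placeholders, not my actual values.

    import numpy as np
    import torch

    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

    GRID_H, GRID_W, EMB_DIM = 128, 128, 64                    # placeholder sizes
    x = np.load("embeddings.npy").astype(np.float32)          # (N, EMB_DIM) visual embeddings
    W = torch.rand(GRID_H, GRID_W, EMB_DIM, device=DEVICE)    # SOM node weights

    # fixed (i, j) grid coordinates of every node, used by the neighborhood function
    W_i, W_j = torch.meshgrid(
        torch.arange(GRID_H, dtype=torch.float32, device=DEVICE),
        torch.arange(GRID_W, dtype=torch.float32, device=DEVICE),
        indexing="ij",
    )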

At their core, most SOMs have two elements: a monotonically decreasing learning rate and a neighborhood function with an influence (radius) that is also monotonically decreasing. During training, each step consists of the following (a minimal training loop is sketched after the list):

1. Randomly select a training sample.
2. Compare this training sample against all nodes in the SOM. The node with the smallest (quantization) error becomes the BMU (best matching unit).
3. Update the SOM node weights proportional to how far away from the BMU they are. Nodes closer to the BMU become more like the training sample.
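
Wiring it together is just a loop around som_step with both schedules shrinking over time. The decay rule and step count below are illustrative (and reuse GRID_H from the setup sketch above as the starting radius); the only real requirement is that alpha and sigma decrease monotonically.

    N_STEPS = 50_000
    ALPHA_START, ALPHA_END = 0.5, 0.01          # learning rate
    SIGMA_START, SIGMA_END = GRID_H / 2, 0.5    # neighborhood radius

    for t in range(N_STEPS):
        frac = t / N_STEPS
        # exponential decay from *_START down to *_END
        alpha = ALPHA_START * (ALPHA_END / ALPHA_START) ** frac
        sigma = SIGMA_START * (SIGMA_END / SIGMA_START) ** frac
        som_step(alpha, sigma)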

There are numerous modifications that can be made, but that’s basically it! If I’ve piqued your interest, I highly recommend the book Self-Organizing Maps by Teuvo Kohonen; it’s a fairly quick read and covers the core aspects of SOMs.

With dimensionality reduction and assignment resolved, we just need the visual embeddings now. I started with the brand new DINOv3 model, but was left rather disappointed. The progression of Meta’s self-supervised vision transformers has been truly incredible, but the latent space captures waaay more information than what I actually need. I just want to encode the high-level aesthetic details of webpage screenshots. Because of this, I fell back on an old friend: the triplet loss on top of a small encoder. The resulting output dimension of 64 afforded ample room for describing the visual range while maintaining a considerably smaller footprint.
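
In case “a small encoder” sounds vague, the shape of the thing is roughly this. The layer sizes below are placeholders; the only detail that matches the text is the 64-dimensional output, and the L2 normalization is an assumption on my part so that cosine similarity behaves nicely downstream.

    import torch.nn as nn
    import torch.nn.functional as F

    class ScreenshotEncoder(nn.Module):
        def __init__(self, out_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(128, out_dim),
            )

        def forward(self, x):
            # l2-normalize the embedding (an assumption, not gospel)
            return F.normalize(self.net(x), dim=1)

    encoder = ScreenshotEncoder()
    triplet = nn.TripletMarginLoss(margin=0.2)
    # loss = triplet(encoder(anchor), encoder(positive), encoder(negative))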

This got me 90% of the way there, but it was still lacking the visual layout I had envisioned. I wanted a stronger correlation with color at the expense of visual similarity. To achieve this, I had to manually enforce the bias by training two SOMs in parallel: one SOM operated on the encoder output (visual), the second on the color distribution, and the two were linked using the following:

When the quantization error is low, the BMU pulling force is dominated by the visual similarity. As quantization error increases, the pulling force due to visual similarity wanes and is slowly overpowered by the pulling force from the color distribution. In essence, the color distribution controls the macro placement while the visual similarity controls the micro placement. The only controllable hyperparameter with this approach is the threshold at which the crossover occurs.
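
In code, the linkage boils down to something like the sketch below. This is an illustration of the idea rather than a verbatim excerpt: W_vis and W_col are the two weight grids, v and c are one sample’s visual embedding and color distribution, threshold is the crossover hyperparameter, and the smooth sigmoid handoff is just one convenient way to implement it.

    cos = torch.nn.CosineSimilarity(dim=2)

    sim_v = cos(v, W_vis)                  # visual similarity map over the grid
    sim_c = cos(c, W_col)                  # color similarity map over the grid

    q_err = 1.0 - sim_v.max()              # visual quantization error for this sample
    w = torch.sigmoid((q_err - threshold) / 0.05)   # ~0: visual dominates, ~1: color dominates

    score = (1.0 - w) * sim_v + w * sim_c  # blended pulling force
    bmu_flat_idx = torch.argmax(score.flatten())
    # ...the neighborhood update then proceeds as in som_step, applied to both grids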

I didn’t spend much time trying to find the optimal point; it’s currently peak fall and, well, I’d much rather be outside. A quick look at the overall quantization error (below left) and the U-matrix (below right) was sufficient.
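
For anyone who hasn’t run into it before, the U-matrix is nothing exotic: just the mean distance between each node’s weights and those of its immediate grid neighbours, so cluster boundaries show up as ridges. A quick version:

    def u_matrix(W):
        # mean l2 distance from every node to its 4-connected neighbours
        dist = lambda a, b: torch.linalg.norm(a - b, dim=-1)
        u = torch.zeros(W.shape[0], W.shape[1], device=W.device)
        n = torch.zeros_like(u)
        u[1:, :]  += dist(W[1:, :],  W[:-1, :]); n[1:, :]  += 1
        u[:-1, :] += dist(W[:-1, :], W[1:, :]);  n[:-1, :] += 1
        u[:, 1:]  += dist(W[:, 1:],  W[:, :-1]); n[:, 1:]  += 1
        u[:, :-1] += dist(W[:, :-1], W[:, 1:]);  n[:, :-1] += 1
        return u / n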

There’s still a lot of cruft that slipped in (substack, medium.com, linkedin, etc…), but overall I’d say it’s not too bad for a first pass. In the time since generating this initial map I’ve already crawled an additional ~250k new domains, so I suppose this means I’ll be doing an update. What I do know for certain though is that self-organizing maps have earned a coveted spot in my heart, reserved for things that are simple to the point of being elegant and yet deceptively powerful (the others, of course, being panel methods, LBM, Metropolis-Hastings, and the bicycle).