
If you've reached this page because you just want to scrape prnt.sc (or Lightshot, to give it its official name) then look no further than the GitHub link above. But you should be warned that the legality of web scraping is questionable, and some sites may not be happy if you bombard them with many requests for images.

It's well known that sequential IDs are a bad thing on websites. They allow scrapers to easily traverse every item on the site, and if those items contain sensitive information then that's particularly bad, because a scraper can collect a massive amount of information with next to no effort. Instead, IDs should be long, random and unpredictable.

prnt.sc instead has IDs of the following structure: bghwg3. That's a six-character alphanumeric string, which a computer can generate all variations of in seconds. What makes this worse is that these IDs are sequential, which means the image with ID abcdef was uploaded after abcdee, which was uploaded after abcded, abcdec, abcdeb… and so on. You can test this: upload an image to prnt.sc, then try increasing the last number/letter of the resulting URL by one and you'll get the image that was uploaded by someone else right after you uploaded your own image. This alone should not be possible on any website, and it can be combined with other techniques to extract information far more efficiently.

There were two stages to the scraper: a collection stage and a processing stage.

Collection

As mentioned above, this stage is really simple. Pick a starting code, scrape the HTML to find the image URL, download it, generate the next code, scrape, download, next code…

It's all well and good having a collection of 1,000,000 random screenshots people have taken without realising anyone other than a few trusted individuals would see them, but how do we actually do anything useful with all of this data? Well, we can eyeball it first and see what sort of images people upload. There were a few main types of images that people uploaded, ordered by frequency:

These three types of image are fairly easy to distinguish, and being able to separate them automatically would be very useful. After manually categorising two or three thousand images, I was able to train a CNN to classify any prnt.sc image into one of these three categories with pretty decent accuracy.

But we can go further: what about the content of text images? Well, we can identify images containing text with decent accuracy, so we can reliably run OCR on these images to extract any text from them. If we have some associated text for every image then we can index and search these images pretty efficiently, looking for particular keywords that may be of interest.
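The collection loop described above can be sketched in a few lines. This is a minimal sketch under my own assumptions, not the project's actual scraper: `next_code` is a hypothetical helper name, and the actual fetching and HTML parsing are left as comments, since the point is only how sequential base-36 codes can be enumerated and turned into page URLs.

```python
import string

# prnt.sc codes look like base-36 numbers: digits 0-9, then letters a-z.
ALPHABET = string.digits + string.ascii_lowercase


def next_code(code: str) -> str:
    """Return the code that sequentially follows `code`, e.g. 'bghwg3' -> 'bghwg4'."""
    n = int(code, 36) + 1  # Python's int() accepts base-36 with this digit order
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(ALPHABET[r])
    # Re-pad with leading zeros so the length is preserved (e.g. '0009' -> '000a').
    return "".join(reversed(out)).rjust(len(code), "0")


def scrape(start: str, count: int):
    """Yield (code, page URL) pairs for `count` sequential codes from `start`."""
    code = start
    for _ in range(count):
        url = f"https://prnt.sc/{code}"
        # Here a real scraper would fetch `url`, parse the HTML for the <img>
        # tag, and download the image; network access is omitted in this sketch.
        yield code, url
        code = next_code(code)
```

Because the IDs are just counters in disguise, incrementing them is pure arithmetic; no guessing or brute force is needed.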
