Know-it-all AI is reading nonstop (and filing) the entire contents of the World Wide Web

© Flickr/cc-licence/Bob West

  • "Universal Factoid Answering System" or "The Digital Panopticon"?
  • Diffbot scraping everything into a well-stuffed global Knowledge Graph
  • Graph rebuilt every four or five days. Holds more than 10 trillion facts, and growing 
  • Between 100 and 150 million new entities added every month

The daily "Download" from the splendid MIT Technology Review of Cambridge, Massachusetts is always a terrific read. One of the latest issues reveals that Diffbot, a company in Mountain View, in the heart of Silicon Valley, California, is in the process of building the biggest ever "knowledge graph" by using AI, image recognition and natural language processing to scan and archive billions of web pages.

It does so, day in, day out, patiently and endlessly scraping the sum of what the world knows and does from the web into an all-encompassing knowledge base. This intelligent identification of data allows it to be extracted from any and all websites, in any and all languages, without the need to write any code. It sounds too good, or perhaps too scary, to be true, but it is. After all, Google already does it, if only for its most popular search queries - so far.

What Diffbot's scraper technology does is "visually parse" a web page for important elements and turn them into a structured, machine-readable format. The company has been compiling the "Knowledge Graph" since 2015. It now encapsulates well over two billion entries on subject areas including people, companies, articles, products, discussions and more or less any other category you might care to name. It also has at least 10 trillion "facts" in its memory. Diffbot reads everything and, as it does so, it classifies the content into a huge interconnected lattice of three-part relationships of subject, verb and object. These relationships can then be queried and extracted.
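To make the subject, verb and object idea concrete, here is a minimal sketch in Python. It is purely illustrative and is not Diffbot's data model or API: the `Triple` type, the `query` helper and the verb names are invented for this example, and the three facts are simply taken from this article.

```python
from collections import namedtuple

# Hypothetical three-part "fact": subject, verb (predicate) and object.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# A tiny hand-written graph; a real one would hold trillions of these.
FACTS = [
    Triple("Diffbot", "is_located_in", "Mountain View"),
    Triple("Mike Tung", "is_ceo_of", "Diffbot"),
    Triple("Mountain View", "is_located_in", "California"),
]


def query(facts, subject=None, predicate=None, obj=None):
    """Return every triple that matches all the fields that were given."""
    return [
        t for t in facts
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Who or what is connected to Diffbot as an object?"
about_diffbot = query(FACTS, obj="Diffbot")

# "Which location facts do we hold?"
locations = query(FACTS, predicate="is_located_in")
```

Leaving a field as `None` acts as a wildcard, which is what lets the same small function answer quite different questions about the same set of facts.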

Knowledge graphs have been around since the dawning of the modern concept of AI. The actual term "Artificial Intelligence" was coined by the US computer scientist John McCarthy back in the mid-1950s, and he is regarded as being the father of the discipline. Back in those days knowledge graphs were small and partial because they were composed, ordered, managed and adapted by hand, a lengthy, complex and expensive process. The only way to build a modern, dynamic and endlessly expanding knowledge graph is to automate it fully.

400 paying customers. Researchers get free access

This is how Diffbot's AI works: using what the MIT Download calls "a super-charged version of the Chrome web browser", the AI instantly scans every pixel of web pages and then uses image recognition algorithms to slot the page into a matrix of 20 different types, such as article, discussion, event, image, video and so on. It then identifies the key elements on the page, including headline, author, product description, price or whatever else and uses natural language processing to extract facts from any text. The process is continuous and unrelenting and the knowledge graph is rebuilt from the ground up every four or five days. As it is, it grows

The MIT report says Diffbot's AI adds between 100 million and 150 million new entities a month as fresh people, companies, products, subjects and categories appear across the web. Machine-learning algorithms then come into play to meld the new data with what already exists on any subject, person, product and so on. As the sum of knowledge grows, more and more new servers have to be added to Diffbot's server farm and data centre. It is a constant iterative process as the knowledge graph expands.
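The "melding" step can be pictured with a small Python sketch. This is an assumption-laden toy, not Diffbot's method: it matches records on a normalised name instead of using machine learning, and the `merge_entities` function and the record fields are hypothetical names chosen for illustration.

```python
def merge_entities(graph, incoming, key="name"):
    """Fold newly scraped records into the existing graph.

    Records are matched on a lower-cased, trimmed name (a crude stand-in
    for learned entity matching). Existing values win on conflict.
    """
    for record in incoming:
        entity_id = record[key].strip().lower()
        existing = graph.setdefault(entity_id, {})
        for field, value in record.items():
            # Only fill in fields the graph does not already hold.
            existing.setdefault(field, value)
    return graph


graph = {}

# First crawl: one entity with one known attribute.
merge_entities(graph, [{"name": "Diffbot", "location": "Mountain View"}])

# Later crawl: a differently written name for the same entity, plus a new one.
merge_entities(graph, [
    {"name": "diffbot ", "ceo": "Mike Tung"},
    {"name": "MIT Technology Review", "location": "Cambridge, Massachusetts"},
])
```

After the second crawl the graph holds two entities rather than three, because the two spellings of Diffbot collapse into a single record that now carries both the location and the CEO.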

So, does Diffbot make any money, and if so, how? The answer is, "Yes, it does". The AI is used by the likes of Amazon, Bing, Cisco, eBay and the NASDAQ, which are among its more than 400 paying customers. Last year Diffbot made a profit of US$5 million. It will make considerably more this year.

Meanwhile, bona fide non-commercial researchers get free access to the Knowledge Graph. Customer interactions with Diffbot are currently via code, but the company's CEO, Mike Tung, says a natural language interface is under development.

He adds that the eventual plan is to provide an AI "Universal Factoid Question Answering System" able to answer any question from anyone, anywhere - and provide a list of sources to prove the answer is correct. A bit like Mr. Memory in Alfred Hitchcock's film, "The 39 Steps", but doing it in just 1 Step, if you see what I mean.
