Four rich, pretrained machine learning APIs bring the smarts behind Google to your apps
In the 2016 Google Founder’s Letter, CEO Sundar Pichai cited Google’s long-term investment in machine learning and AI. “It’s what allows you to use your voice to search for information,” he explained, “to translate the web from one language to another, to filter the spam from your inbox, to search for ‘hugs’ in your photos and actually pull up pictures of people hugging … to solve many of the problems we encounter in daily life. It’s what has allowed us to build products that get better over time, making them increasingly useful and helpful.”
In addition to using machine learning for its own products, Google has released several applied machine learning services — for vision, speech, natural language, and translation — and has open-sourced its TensorFlow scalable machine learning package. An additional service based on TensorFlow, the Cloud Machine Learning Platform, is still in a closed alpha test phase. I hope to review the Cloud Machine Learning Platform and TensorFlow later this year.
All four Google machine learning APIs are managed by the Google Cloud Platform Console, and all have RESTful interfaces; some also have RPC interfaces. There are three authentication options; which one to use depends on the API and the use case.
Although it’s easy enough to construct REST client calls in any language that supports HTTP requests and responses, Google supplies client libraries for C#, Dart, Go, Java, JavaScript (browser), Node.js, Objective-C, PHP, Python, and Ruby, depending on the API. I did most of my experimentation in Python, and I used the supplied HTML forms for constructing and testing REST calls.
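For a taste of the client-library pattern, here is a minimal Python sketch (mine, not Google's sample verbatim): discovery.build() constructs a service object for a named API and version, with the Translate API standing in as the simplest example. The API key is a placeholder.

from googleapiclient import discovery

# Build a Translate v2 client authenticated with an API key
# (create your own key in the Google Cloud Platform Console).
service = discovery.build('translate', 'v2', developerKey='YOUR_API_KEY')
result = service.translations().list(target='de', q=['Hello, world']).execute()
print(result['translations'][0]['translatedText'])  # e.g. 'Hallo, Welt'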
Google Cloud Natural Language API
Natural language processing is a big part of the “secret sauce” that makes Google Search popular. Ask “where should I visit in China,” and Google Search will parse enough of your intent to show you articles about popular travel destinations in mainland China at the top of your results list.
It will also show you related queries, such as “visit china visa” and “is it safe to visit china.” Note that the natural-language processing has extracted the verb “visit” and the object “China,” and distinguished the country China from the Republic of China (Taiwan) and from bone china crockery. It has used syntax parsing and entity identification to find popular “nearby” queries in its historical database.
Ratings are an important aspect of the Google Play store. Millions of ratings on a scale of one to five are accompanied by reviews — for example, “Awesome app!” with a five-star rating, or “Very buggy and hard to tell who’s in picture” with a two-star rating of the very same app. Think about these reviews as a great data set to use to train a natural-language-processing neural network for sentiment analysis.
The Cloud Natural Language API, currently in open beta, gives you access to Google’s entity recognition, sentiment analysis, and text annotation (syntax analysis) engines for text. Entity recognition and text annotations are supported in English, Spanish, and Japanese; sentiment analysis is supported only in English. You can embed text in your API call or read a text file from a Google Cloud Storage bucket.

The entity recognition service identifies persons, organizations, locations, and other items mentioned in text. The sentiment analysis service looks at a block of text and decides to what extent it is positive or negative, and it estimates an intensity or magnitude of sentiment. Text annotations analyze parts of speech and provide dependency parse trees for the relationships between words.
As you can see in the figure above, the entity recognition service tends to find entities that have Wikipedia articles, and it returns the URIs of those articles. The basic Python pattern is sketched below.
Haven OnDemand includes Graph Analysis services (also in preview) trained against English Wikipedia; these are similar to Google entity recognition. IBM Watson Concept Expansion and Concept Insights are similar to Google text annotations and entity recognition. The Microsoft Cortana Entity Linking API is similar to Google Entity Recognition, its Linguistic Analysis API is similar to Google text annotations, and its Text Analytics API includes sentiment analysis, key phrase extraction, and topic detection for English text.

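Here is a minimal version of that pattern; it is my own sketch, assuming the v1beta1 REST surface and Application Default Credentials, and the sample sentence is arbitrary.

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def get_service():
    # Build an authenticated Natural Language client using
    # Application Default Credentials.
    credentials = GoogleCredentials.get_application_default()
    return discovery.build('language', 'v1beta1', credentials=credentials)

def analyze_entities(text):
    body = {
        'document': {'type': 'PLAIN_TEXT', 'content': text},
        'encodingType': 'UTF8',
    }
    return get_service().documents().analyzeEntities(body=body).execute()

response = analyze_entities('The Brooklyn Bridge crosses the East River.')
for entity in response.get('entities', []):
    # Entities often carry the URI of a matching Wikipedia article.
    print(entity['name'], entity['type'],
          entity.get('metadata', {}).get('wikipedia_url'))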
The get_service() call incorporates authentication into the request.
Google Cloud Speech API
“Google, how old is the Brooklyn Bridge?”
Most Android smartphone users and people who search Google by voice in Chrome are familiar with that pattern. The Google Cloud Speech API, currently in open beta, gives companies that want to voice-enable their own sites and apps access to the engine behind the voice transcription in Google Now, Google voice search, and Google Translate.
The Google Cloud Speech API provides speech-to-text conversion; it doesn’t do text-to-speech. It handles some 80 languages and variants, and that selection is heavy on variants, including nine localizations of English from Australia to the United States, 18 localizations of Spanish, and 15 localizations of Arabic.
There is no automatic language detection in the API; you need to set the language code accurately for the speaker (rather than the location) to get good recognition. For example, a South African or Zimbabwean with a strong accent, living in the United States and speaking English, is more likely to get good recognition using the en-ZA language code than the en-US code.
That’s consistent with the experience people have not only with Google voice search, but also with Apple Siri, Microsoft Cortana, and apps using third-party recognition engines such as Nuance NDEV. If you’re writing an app that uses the Cloud Speech API, you’ll probably want to default to the system language code but offer an interface for changing the language code for the app.
The Google Cloud Speech API has both synchronous and asynchronous batch APIs for transcribing stored audio and complete utterances, and a streaming API to recognize speech live. It handles long-form audio in batch, along with short utterances, and offers both REST (nonstreaming only) and RPC APIs.
You can embed your audio in your service call or point to a Google Cloud Storage bucket that contains an audio file. In addition to a language code, the recognition configuration that accompanies the audio specifies the audio encoding, the sample rate, the maximum number of alternatives to return, whether a profanity filter should be used, and a speech context.
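A minimal synchronous request looks something like this sketch, assuming the v1beta1 surface; the bucket and file names are hypothetical.

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def transcribe_gcs(gcs_uri, language_code='en-US'):
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('speech', 'v1beta1', credentials=credentials)
    body = {
        'config': {
            'encoding': 'FLAC',        # FLAC is the recommended encoding
            'sampleRate': 16000,       # must match the recording
            'languageCode': language_code,
            'maxAlternatives': 1,
            'profanityFilter': False,
        },
        'audio': {'uri': gcs_uri},     # or 'content': base64-encoded audio
    }
    response = service.speech().syncrecognize(body=body).execute()
    for result in response.get('results', []):
        print(result['alternatives'][0]['transcript'])

transcribe_gcs('gs://my-bucket/utterance.flac')  # hypothetical bucket and file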

Cloud Speech takes word hints that expand its already large vocabulary and increase the likelihood of correct recognition of expected words. It also does command recognition. The optional speech context contains a list of up to 50 phrases with as many as 100 characters each. You can use this for voice-controlled games, and you can combine it with the Cloud Natural Language API.
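For instance, a hypothetical speech context for a voice-controlled game might look like the dict below, which would be merged into the recognition config sketched earlier.

speech_context = {
    'phrases': [                       # up to 50 phrases,
        'attack the dragon',           # each up to 100 characters
        'open the treasure chest',
        'cast fireball',
    ],
}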
Supported audio encodings include FLAC (recommended), LINEAR16, MULAW, AMR, and AMR_WB. Note that lossy music formats such as AAC and MP3 are not supported because the recognition accuracy suffers from the compression. Only mono audio is supported.
As you can imagine, Cloud Speech builds on the very large training sets Google has gleaned from serving voice search, which implies it has learned a wide range of regional variations. The spoken U.S. English language alone includes many diverse dialects, from a Georgia drawl to New England dropped R’s (“Pahk the cah”), to distinctive Lawn Guyland (“Eyoo gawt it”) and Philadelphia (“D’youse want wudder?”) accents. Cloud Speech has also learned to handle noise from, for example, passing cars; in fact, Google recommends that you not try to filter noise out of the audio before sending it for recognition.
HPE Haven OnDemand can recognize 21 languages and variants, including both broadband and telephony quality data sets for the most common languages — for example, Telephony Latin American Spanish. Haven OnDemand can extract audio from video as well as audio files, but does not support synchronous or live recognition.
IBM Watson can recognize eight languages and variants in broadband quality, as well as six in telephony quality. On the standard plan, using telephony models is twice as expensive as using broadband models. The transcription of incoming audio is continuously sent back to the client with minimal delay, and it is corrected as more speech is heard.
Microsoft Bing Speech Recognition supports 28 languages and variants. Real-time streaming is supported on Android, iOS, and Windows when you use the appropriate client library. If you train a Language Understanding Intelligent Service (LUIS) model, you can also receive structured information about the recognized speech to parse the intent of the speaker and drive further actions by the app.
Google Cloud Translate API
The Google Translate website and app, along with the Google Website Translator gadget, have been popular for years. In the early days, bilingual human translators would often roar with laughter at Google’s attempts at machine translation. Over the years, however, human translators have had the opportunity to correct the mistakes made by the machine translator, and many of the corrections have been incorporated into the translation corpus. As a result, Google’s machine translations have improved considerably, although the quality still varies from one language pair to another.
Google Translate API is a paid enterprise service for translating large amounts of text. It supports 90 languages, yielding thousands of language pairs, though not every pair is supported. You can, however, query the API for a list of supported ISO 639-1 language codes in JSON format, and for a list of supported targets for any given source language.
If you don’t know the source language, you can leave out the source language code and the API will try to detect it. Language detection costs the same $20 per million characters as translation.
Beyond that, the Translate API is straightforward: Supply the source and target language codes, as many source strings as you wish, and your API key. You can optionally specify the output format (HTML or plain text), request pretty printing (indentations and line breaks), or supply a callback function, as in the sketch below.
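Here is a sketch of the raw REST form of a translation call, plus the companion detection call; the API key and sample strings are placeholders.

import requests

API_KEY = 'YOUR_API_KEY'
BASE = 'https://www.googleapis.com/language/translate/v2'

# Translate with an explicit source language and plain-text output:
resp = requests.get(BASE, params={
    'key': API_KEY, 'source': 'en', 'target': 'es',
    'q': 'Machine translation has improved considerably.',
    'format': 'text',
})
print(resp.json()['data']['translations'][0]['translatedText'])

# Omit the source language and ask the API to detect it instead:
resp = requests.get(BASE + '/detect',
                    params={'key': API_KEY, 'q': 'Wie geht es dir?'})
print(resp.json()['data']['detections'][0][0]['language'])  # e.g. 'de'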

In many cases, the translation will be of high quality. In others, it will fail spectacularly.
By way of comparison, Haven OnDemand can currently identify 85 languages and perform sentiment analysis on them, but cannot translate them. Azure Cognitive Services can detect sentiment and key phrases in four languages, but cannot do translations either, though Bing translations are almost as common on the web as Google translations.

Google Cloud Vision API
The Google Cloud Vision API is a trained machine learning service for categorizing images and extracting various features. It can classify images into thousands of pretrained categories, ranging from generic objects and animals (such as a cat), to general conditions (for example, dusk), to specific landmarks (the Eiffel Tower, the Grand Canyon), and it can identify general properties of an image, such as its dominant colors. It can isolate areas that are faces, then apply geometric analysis (facial orientation and landmarks) and emotional analysis to those faces, although it does not recognize faces as belonging to specific people. The Vision API can also read and extract text from images in some 10 languages, identify product logos, and detect adult, violent, and medical content.
You can construct a JSON request for the Cloud Vision API that either contains the image (in base64 format) or points to the image (in a Google Cloud Storage bucket). The request also needs to contain a list of the features you want to extract, along with the maximum number of items to return for each feature. You can request processing of multiple images in one call, but you’d risk running into the total size limitation.
I managed to run into the size limit for single images on the first JPEG I tried on the service. I naively took a high-quality APS-C DSLR JPEG I had exported from my Lightroom catalog and tried getting a label for it using Python code (checked out from GitHub) from the Google Label Detection tutorial. After struggling with and solving some authentication issues, I got a mysterious Error 400 with “Request Admission Denied.” An email query to my contacts at Google got me the suggestion to look at the Best Practices for the Vision service; as it happens, my file was 6MB, and the limit is 4MB. I generated another version with a lower JPEG quality that was less than 4MB, and this time got a correct label back from the service.
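A minimal label detection call looks something like this sketch, assuming the v1 images:annotate surface; the file name is hypothetical.

import base64
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def label_image(path, max_results=5):
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('vision', 'v1', credentials=credentials)
    with open(path, 'rb') as f:
        # Base64-encode the image; keep the request under the 4MB limit.
        content = base64.b64encode(f.read()).decode('ascii')
    body = {'requests': [{
        'image': {'content': content},
        'features': [{'type': 'LABEL_DETECTION', 'maxResults': max_results}],
    }]}
    response = service.images().annotate(body=body).execute()
    for label in response['responses'][0].get('labelAnnotations', []):
        print(label['description'], label['score'])

label_image('photo.jpg')  # hypothetical file name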
Google suggests several applications for the Vision API. One is to catalog your image collection because not everyone faithfully adds keyword tags to all their images, and not every cloud photo service retains EXIF data in uploaded images. Another is to detect and moderate offensive content in images. (No, I did not try to test that myself. Google has plenty of experience filtering offensive material from image searches.)

Further suggested applications include finding your logo in images on social media, detecting emotions from faces in those images, and automatically extracting text from selected images. If you wanted to get wild and crazy, you might pipe all of the non-English text retrieved into the Cloud Translate service, then analyze the sentiment of all of the OCR’d text using the Cloud Natural Language API.
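Here is a hedged sketch of that pipeline, with the Vision OCR step stubbed out as an input string. The API key is a placeholder, and because sentiment analysis is English-only, the translation step comes first.

import requests

API_KEY = 'YOUR_API_KEY'

def translate_to_english(text):
    resp = requests.get(
        'https://www.googleapis.com/language/translate/v2',
        params={'key': API_KEY, 'target': 'en', 'q': text, 'format': 'text'})
    return resp.json()['data']['translations'][0]['translatedText']

def sentiment(text):
    # Cloud Natural Language v1beta1 sentiment analysis over REST.
    resp = requests.post(
        'https://language.googleapis.com/v1beta1/documents:analyzeSentiment',
        params={'key': API_KEY},
        json={'document': {'type': 'PLAIN_TEXT', 'content': text}})
    return resp.json()['documentSentiment']

ocr_text = 'Das Essen war ausgezeichnet.'  # pretend this came from Vision OCR
print(sentiment(translate_to_english(ocr_text)))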
Among the competition, Haven OnDemand offers four image analysis services: bar-code recognition, face detection, corporate logo recognition, and OCR. The Google Cloud Vision API doesn’t do bar-code recognition, but it returns more information about detected faces, recognizes many more kinds of items than just logos, and has a more mature OCR implementation. (HPE’s OCR is still in preview.)
IBM Bluemix offers a Watson Visual Recognition service that does general classification, face detection, text extraction (English-only, beta), and visual training and tagging. Google Cloud Vision is better at the first three (and offers more capabilities in those areas), but doesn’t do training. Visual training is something you can do with Google TensorFlow now and should be able to do with the Cloud Machine Learning Platform when it is available to the public.
Microsoft Azure Cognitive Services has Face and Emotion APIs that are currently in preview. The Face API does face detection, verification, identification, grouping, and similar face searching; the Emotion API classifies the mood of faces detected by the Face API. These two APIs together provide a subset of the capabilities of Google Cloud Vision.
Machine learning at your service
As we’ve seen, the four Google applied machine learning APIs discussed — the beta natural-language processing and speech-to-text APIs and the production language translation and vision classification APIs — are based on engines that have long histories of production use at Google, with millions of requests served for consumer-facing services. In most cases, a given feature of the Cloud Machine Learning APIs will perform as well as or better than competitive APIs from HPE, IBM, and Microsoft, and will have more options.
For example, Google Cloud Speech (in beta) transcribes more than 80 languages and variants; its nearest competitor, Microsoft Bing Speech Recognition, supports 28 languages and variants. The accuracy? Well, it’ll depend as much on the conditions as on the service, but my experience with Google voice search and Cortana gives a slight nod to Google.
Nevertheless, as the car ads say in fine print, your mileage may vary. If you’re considering using natural-language processing, speech-to-text, translation, or vision APIs, the Google Cloud Machine Learning services are worth testing in your application and on your data.
The dollar cost of trying them out is minimal because of the free monthly service allowances. The effort to try them out is fairly low. I learned to use all four APIs and all three kinds of authentication over a weekend, and I had only one glitch, which I could have avoided had I read the best-practices documentation before trying the Vision API.
Google Cloud Machine Learning pricing
Cloud Natural Language API: Priced per feature per thousand records, ranging from 25 cents to $2 depending on feature and quantity; first 50,000 records per month are free
Cloud Speech API: 0.6 cents per 15 seconds; first 60 minutes per month are free (see the worked example after this list)
Cloud Translate API: $20 per 1 million characters of text for translation, plus $20 per 1 million characters of text for language detection
Cloud Vision API: Priced per feature and prices drop with increased usage; prices range from 60 cents per thousand features to $5 per thousand features; first 1,000 requests per month are free
Cloud storage (per gigabyte per month): Standard Storage 2.6 cents, Durable Reduced Availability (DRA) Storage 2 cents, Nearline Storage 1 cent
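Worked example: at the Cloud Speech rate, transcribing one hour of audio beyond the free tier comes to 240 fifteen-second increments, or 240 × $0.006 = $1.44.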
This story, “First look: Google Cloud Machine Learning soars,” was originally published by InfoWorld.