Feeding the machine: We give an AI some headlines and see what it does – Ars Technica

Enlarge / Turning the lens on ourselves, as it were. Is Our Machine Learning?View more stories

There's a moment in any foray into new technological territory when you realize you may have embarked on a Sisyphean task. Staring at the multitude of options available to take on the project, you research your options, read the documentation, and start to workonly to find that actually just defining the problem may be more work than finding the actual solution.

Reader, this is where I found myself two weeks into this adventure in machine learning. I familiarized myself with the data, the tools, and the known approaches to problems with this kind of data, and I tried several approaches to solving what on the surface seemed to be a simple machine-learning problem: based on past performance, could we predict whether any given Ars headline will be a winner in an A/B test?

Things have not been going particularly well. In fact, as I finished this piece, my most recent attempt showed that our algorithm was about as accurate as a coin flip.

But at least that was a start. And in the process of getting there, I learned a great deal about the data cleansing and pre-processing that goes into any machine-learning project.

Our data source is a log of the outcomes from 5,500-plus headline A/B tests over the past five yearsthat's about as long as Ars has been doing this sort of headline shootout for each story that gets posted. Since we have labels for all this data (that is, we know whether it won or lost its A/B test), this would appear to be a supervised learning problem. All I really needed to do to prepare the data was to make sure it was properly formatted for the model I chose to use to create our algorithm.

I am not a data scientist, so I wasn't going to be building my own model any time this decade. Luckily, AWS provides a number of pre-built models suitable to the task of processing text and designed specifically to work within the confines of the Amazon cloud. There are also third-party models, such as Hugging Face, that can be used within the SageMaker universe. Each model seems to need data fed to it in a particular way.

The choice of the model in this case comes down largely to the approach we'll take to the problem. Initially, I saw two possible approaches to training an algorithm to get a probability of any given headline's success:

The second approach is much more difficult, and there's one overarching concern with either of these methods that makes the second even less tenable: 5,500 tests, with 11,000 headlines, is not a lot of data to work with in the grand AI/ML scheme of things.

So I opted for binary classification for my first attempt, because it seemed the most likely to succeed. It also meant the only data point I needed for each headline (besides the headline itself) is whether it won or lost the A/B test. I took my source data and reformatted it into a comma-separated value file with two columns: titles in one, and "yes" or "no" in the other. I also used a script to remove all the HTML markup from headlines (mostlya few HTML tags for italics). With the data cut down almost all the way to essentials, I uploaded it into SageMaker Studio so I could use Python tools for the rest of the preparation.

Next, I needed to choose the model type and prepare the data. Again, much of data preparation depends on the model type the data will be fed into. Different types of natural language processing models (and problems) require different levels of data preparation.

After that comes tokenization. AWS tech evangelist Julien Simon explains it thusly: Data processing first needs to replace words with tokens, individual tokens. A token is a machine-readable number that stands in for a string of characters. So 'ransomware would be word one, he said, crooks would be word two, setup would be word three so a sentence then becomes a sequence of tokens, and you can feed that to a deep-learning model and let it learn which ones are the good ones, which ones are the bad ones.

Depending on the particular problem, you may want to jettison some of the data. For example, if we were trying to do something like sentiment analysis (that is, determining if a given Ars headline was positive or negative in tone) or grouping headlines by what they were about, I would probably want to trim down the data to the most relevant content by removing "stop words"common words that are important for grammatical structurebut don't tell you what the text is actually saying (like most articles).

However, in this case, the stop words were potentially important parts of the dataafter all, we're looking for structures of headlines that attract attention. So I opted to keep all the words. And in my first attempt at training, I decided to use BlazingText, a text processing model that AWS demonstrates in a similar classification problem to the one we're attempting. BlazingText requires the "label" datathe data that calls out a particular bit of text's classificationto be prefaced with "__label__". And instead of a comma-delimited file, the label data and the text to be processed are put in a single line in a text file, like so:

Another part of data preprocessing for supervised training ML is splitting the data into two sets: one for training the algorithm, and one for validation of its results. The training data set is usually the larger set. Validation data generally is created from around 10 to 20 percent of the total data.

There has been a great deal of research into what is actually the right amount of validation datasome of that research suggests that the sweet spot relates more to the number of parameters in the model being used to create the algorithm rather than the overall size of the data. In this case, given that there was relatively little data to be processed by the model, I figured my validation data would be 10 percent.

In some cases, you might want to hold back another small pool of data to test the algorithm after it's validated. But our plan here is to eventually use live Ars headlines to test, so I skipped that step.

To do my final data preparation, I used a Jupyter notebookan interactive web interface to a Python instanceto turn my two-column CSV into a data structure and process it. Python has some decent data manipulation and data science-specific toolkits that make these tasks fairly straightforward, and I used two in particular here:

Heres a chunk of the code in the notebook that I used to create my training and validation sets from our CSV data:

I started by using pandas to import the data structure from the CSV created from the initially cleaned and formatted data, calling the resulting object "dataset." Using the dataset.head() command gave me a look at the headers for each column that had been brought in from the CSV, along with a peek at some of the data.

The pandas module allowed me to bulk-add the string "__label__" to all the values in the label column as required by BlazingText, and I used a lambda function to process the headlines and force all the words to lower case. Finally, I used the sklearn module to split the data into the two files I would feed to BlazingText.

Follow this link:

Feeding the machine: We give an AI some headlines and see what it does - Ars Technica

Chinese national arrested and charged with stealing AI trade secrets from Google - NPR - March 8th, 2024 [March 8th, 2024]
President Biden Calls for Ban on AI Voice Impersonations During State of the Union - Variety - March 8th, 2024 [March 8th, 2024]
Revolutionize Your Business with AWS Generative AI Competency Partners | Amazon Web Services - AWS Blog - March 8th, 2024 [March 8th, 2024]
Broadcom Expects AI Demand to Help Offset Weakness Elsewhere - Yahoo Finance - March 8th, 2024 [March 8th, 2024]
Micron Hits Record High With Analysts Calling It an 'Under-Appreciated AI Beneficiary' - Investopedia - March 8th, 2024 [March 8th, 2024]
The Adams administration quietly hired its first AI czar. Who is he? - City & State New York - March 8th, 2024 [March 8th, 2024]
AI likely to increase energy use and accelerate climate misinformation report - The Guardian - March 8th, 2024 [March 8th, 2024]
This Artificial Intelligence (AI) Stock Could Double, and It Is Way Cheaper Than Nvidia - Yahoo Finance - March 8th, 2024 [March 8th, 2024]
Fake images made to show Trump with Black supporters highlight concerns around AI and elections - The Associated Press - March 8th, 2024 [March 8th, 2024]
Artificial intelligence and illusions of understanding in scientific research - Nature.com - March 8th, 2024 [March 8th, 2024]
Analysis | House AI task force leaders take long view on regulating the tools - The Washington Post - March 8th, 2024 [March 8th, 2024]
Don't Give Your Business Data to AI Companies - Dark Reading - March 8th, 2024 [March 8th, 2024]
NIST, the lab at the center of Bidens AI safety push, is decaying - The Washington Post - March 8th, 2024 [March 8th, 2024]
Essay | AI is Coming! Tips for Staying Calm and Carrying On - The Wall Street Journal - March 8th, 2024 [March 8th, 2024]
AI can be easily used to make fake election photos - report - BBC.com - March 8th, 2024 [March 8th, 2024]
5 Artificial Intelligence (AI) Stocks That Could Make You a Millionaire - Yahoo Finance - March 8th, 2024 [March 8th, 2024]
AI could be an extraordinary force for good. So why do our politicians still not have a plan? - The Guardian - March 8th, 2024 [March 8th, 2024]
Mapping Disease Trajectories from Birth to Death with AI - Neuroscience News - March 8th, 2024 [March 8th, 2024]
India plans 10,000-GPU sovereign AI supercomputer - The Register - March 8th, 2024 [March 8th, 2024]
SAP enhances Datasphere and SAC for AI-driven transformation - CIO - March 8th, 2024 [March 8th, 2024]
Jim Cramer names companies and sectors poised to rally on the AI wave - CNBC - March 8th, 2024 [March 8th, 2024]
The job applicants shut out by AI: The interviewer sounded like Siri - The Guardian - March 8th, 2024 [March 8th, 2024]
Microsoft confirms Surface and Windows AI event for March 21st - The Verge - March 8th, 2024 [March 8th, 2024]
Adobes new Express app brings Firefly AI tools to iOS and Android - The Verge - March 8th, 2024 [March 8th, 2024]
A Google AI Watched 30,000 Hours of Video GamesNow It Makes Its Own - Singularity Hub - March 8th, 2024 [March 8th, 2024]
Palantir CEO Karp on TITAN, AI Warfare Technology - Bloomberg - March 8th, 2024 [March 8th, 2024]
Elliptic Curve Murmurations Found With AI Take Flight - Quanta Magazine - March 8th, 2024 [March 8th, 2024]
5 AI Stocks to Buy in March 2024, According to Analysts - TipRanks.com - TipRanks - March 8th, 2024 [March 8th, 2024]
Wix's new AI chatbot builds websites in seconds based on prompts - The Verge - March 8th, 2024 [March 8th, 2024]
Amid record high energy demand, America is running out of electricity - The Washington Post - March 8th, 2024 [March 8th, 2024]
AI Crypto Tokens in 5 Minutes: What to Know and Where to Start - Inc. - February 26th, 2024 [February 26th, 2024]
'The Worlds I See' by AI visionary Fei-Fei Li '99 selected as Princeton Pre-read - Princeton University - February 26th, 2024 [February 26th, 2024]
AI is having a 1995 moment, analyst says - Business Insider - February 26th, 2024 [February 26th, 2024]
Vatican research group's book outlines AI's 'brave new world' - National Catholic Reporter - February 26th, 2024 [February 26th, 2024]
Honor's Magic 6 Pro launches internationally with AI-powered eye tracking on the way - The Verge - February 26th, 2024 [February 26th, 2024]
Google explains Gemini's embarrassing AI pictures of diverse Nazis - The Verge - February 26th, 2024 [February 26th, 2024]
Google cut a deal with Reddit for AI training data - The Verge - February 26th, 2024 [February 26th, 2024]
What's the point of Elon Musk's AI company? - The Verge - February 26th, 2024 [February 26th, 2024]
AI agents like Rabbit aim to book your vacation and order your Uber - NPR - February 26th, 2024 [February 26th, 2024]
Announcing Microsofts open automation framework to red team generative AI Systems - Microsoft - February 26th, 2024 [February 26th, 2024]
After Nvidia's latest blowout, here are 20 AI stocks expected to rise as much as 44% - Yahoo Finance - February 26th, 2024 [February 26th, 2024]
1 Exceptional AI Chip Stock Investors Need to Know About in 2024 - The Motley Fool - February 26th, 2024 [February 26th, 2024]
Nvidia briefly hits $2 trillion valuation as AI frenzy grips Wall Street - Reuters - February 26th, 2024 [February 26th, 2024]
AI Chatbots Can Guess Your Personal Information From What You ... - WIRED - October 18th, 2023 [October 18th, 2023]
Harvard IT Launches Pilot of AI Sandbox to Enable Walled-Off Use ... - Harvard Crimson - October 18th, 2023 [October 18th, 2023]
Advancing policing through AI: Insights from the global law ... - Police News - October 18th, 2023 [October 18th, 2023]
Hochul announces new SUNY, IBM investments in AI - Olean Times Herald - October 18th, 2023 [October 18th, 2023]
Nvidia's banking on TensorRT to expand its generative AI dominance - The Verge - October 18th, 2023 [October 18th, 2023]
AI expands from MRFs to vehicles - Plastics Recycling Update - October 18th, 2023 [October 18th, 2023]
AI Reads Ancient Scroll Charred by Mount Vesuvius in Tech First - Scientific American - October 18th, 2023 [October 18th, 2023]
A DEEPer (squared) dive into AI Harvard Gazette - Harvard Gazette - October 18th, 2023 [October 18th, 2023]
Florida bar weighs whether lawyers using AI need client consent - Reuters - October 18th, 2023 [October 18th, 2023]
Cognizant and Vianai Systems Announce Strategic Partnership to ... - PR Newswire - October 18th, 2023 [October 18th, 2023]
How AI could speed up scientific discoveries, from proteins to ... - NPR - October 18th, 2023 [October 18th, 2023]
AI challenge to deliver better healthcare | Western Australian ... - Government of Western Australia - October 18th, 2023 [October 18th, 2023]
Henry Kissinger: The Path to AI Arms Control - Foreign Affairs Magazine - October 18th, 2023 [October 18th, 2023]
Stability AI releases StableStudio in latest push for open-source AI - The Verge - May 18th, 2023 [May 18th, 2023]
Google CEO Sundar Pichai Predicts That This Profession Will Be ... - The Motley Fool - May 18th, 2023 [May 18th, 2023]
Frances privacy watchdog eyes protection against data scraping in AI action plan - TechCrunch - May 18th, 2023 [May 18th, 2023]
Investing in Hippocratic AI - Andreessen Horowitz - May 18th, 2023 [May 18th, 2023]
As Alphabet flexes its AI prowess, there's a 'new elephant in the room' for Google - MarketWatch - May 18th, 2023 [May 18th, 2023]
The Boring Future of Generative AI | WIRED - WIRED - May 18th, 2023 [May 18th, 2023]
OpenAI readies new open-source AI model, The Information reports - Reuters.com - May 18th, 2023 [May 18th, 2023]
What every CEO should know about generative AI - McKinsey - May 18th, 2023 [May 18th, 2023]
AI creates images of the 'perfect' man and woman - Sky News - May 18th, 2023 [May 18th, 2023]
Audit AI search tools now, before they skew research - Nature.com - May 18th, 2023 [May 18th, 2023]
3 Reasons C3.ai Stock Could Be Your Golden Ticket to the AI ... - InvestorPlace - May 18th, 2023 [May 18th, 2023]
Zoom makes a big bet on AI with investment in Anthropic - VentureBeat - May 18th, 2023 [May 18th, 2023]
AI voice phone scams are on the rise. Here's how to avoid them - USA TODAY - May 18th, 2023 [May 18th, 2023]
Amazon is building an AI-powered conversational experience for ... - The Verge - May 18th, 2023 [May 18th, 2023]
AI speculators need to 'differentiate between actual spending and investment' and hype: Strategist - Yahoo Finance - May 18th, 2023 [May 18th, 2023]
AI Can Be Both Accurate and Transparent - HBR.org Daily - May 18th, 2023 [May 18th, 2023]
You're Probably Underestimating AI Chatbots | WIRED - WIRED - May 18th, 2023 [May 18th, 2023]
AI presents political peril for 2024 with threat to mislead voters - The Associated Press - May 18th, 2023 [May 18th, 2023]
We need AI to help us face the challenges of the future - The Guardian - May 18th, 2023 [May 18th, 2023]
End Of Googles Dominance? Stock Gets Rare Analyst Downgrade Over AI Fears - Forbes - May 18th, 2023 [May 18th, 2023]
Watch 44 million atoms simulated using AI and a supercomputer - New Scientist - May 18th, 2023 [May 18th, 2023]
AI Is The New Electricity: Bank Of America Picks 20 Stocks To Cash In On ChatGPT Hype - Forbes - March 2nd, 2023 [March 2nd, 2023]
Tech Giants Are Barreling Headfirst Into an AI Arms Race - February 20th, 2023 [February 20th, 2023]
Bing's AI Is Threatening Users. That's No Laughing Matter - TIME - February 20th, 2023 [February 20th, 2023]

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Feeding the machine: We give an AI some headlines and see what it does – Ars Technica

The Prometheus League

Breaking News and Updates

Prometheism

Forbidden Fruit

The Evolutionary Perspective

Transtopia Menu

Library Updates

Library Books

Future Euvolution

Lucid Dreams from Childhood

Genetic Revolution

Speciation + Self-Directed Evolution