How AI Platforms Collect and Use Your Data: Understanding AI Data Collection, Privacy Risks, and Big Tech's Influence

April 8, 2025 • Ubik Team

The Onliness of Y2K Babies

The year 2000. Y2K, my birth year. It was a year loaded with cultural baggage and meaning that I and all the other Y2K babies will never fully understand. We grew up after the panic and under the influence: the cultural shifts and novel technologies of that moment shaped and changed the process of growing up itself. We grew up on the "internet of everything"; the Internet stretches back to my earliest memory. I remember being tiny, wearing only a diaper, standing on my dad's desk and dancing to "Hash Pipe" by Weezer while staring at a computer monitor. My dad, a computer programmer (and video game lover (addict)), would spend late nights and long days at his desktop. It wasn't long before I had my own personal computer, laptop, and smartphone. Technology has been a constant, natural part of my life.

Now 24 years old, I realize it wasn't until around 18 or 19 that I understood my entire life, the good and the bad, was online: embarrassing moments, tagged posts I never knew existed. A breadcrumb trail, or, let's say, a cookie-crumb trail, running parallel to my digital footprint. Big-tech companies like Google, Twitter, Meta (Facebook and Instagram), and so many more have made a priority of commodifying people like me: the Gen Z and Gen Alpha kids who grew up on these platforms and, more often than not, became addicted and socially dependent on them. Children are turned into data points to sell ads against, build data profiles from, and capitalize on. Digital footprints have real-world consequences that often go unnoticed by users, so with the arrival of ChatGPT, Gemini, Claude, and all the other new AI chatbot platforms, understanding how user data flows through and gets used by these platforms is integral to user safety and to future regulation and safe standard practices.

Bad Data Practice

Platforms like Google and Meta have a long-standing history of privacy issues and intrusive data collection. In 2018, BM (Before Meta), Facebook's Mark Zuckerberg took the stand before Congress to explain how the data of 87 million Americans became part of the 2016 Cambridge Analytica scandal. The hearings weren't just informational; they revealed how Facebook collected data on users beyond their activity inside the app, following them across other apps and websites (Singer). Most alarming was the collection of biometric data (face scans) from any user who had not specifically opted out in their preferences. When a product is free, the user is the product, and Google is no different. Users who passively accept cookies and "surf" their favorite websites unknowingly generate a seemingly unlimited stream of data, and passive income, for Google. The "bad practice" occurs when this passive, organic data is sold to companies to create targeted ads without your knowledge. Google, for example, tracks users' geolocation through Google Maps, so it knows your daily routes, which stores you might spend money in along the way, and what to advertise to you at home based on where you go and what you do (Bensinger). Big-tech platforms want to learn as much about you as possible, leaving little escape from their eyes. Tech platforms should be overly communicative and transparent with users about the data they collect, where it goes, and how much it is worth. Instead, users remain in the dark, with little to no explanation that their organic, user-made data is the platforms' most valuable asset and the real "price" of access.
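To make the cookie-trail mechanism concrete, here is a toy simulation of how a single third-party identifier lets a tracker stitch visits to unrelated sites into one profile. Everything here (the `Tracker` class, the example URLs) is a hypothetical sketch; real ad networks do this with HTTP `Set-Cookie` headers, tracking pixels, and mobile SDKs rather than a Python dictionary.

```python
# Toy model of third-party cookie tracking (illustrative only).
import uuid

class Tracker:
    """Stands in for an ad network embedded on many different sites."""
    def __init__(self):
        self.profiles = {}  # cookie id -> list of pages visited

    def on_page_load(self, cookie_jar, url):
        # First visit anywhere: plant a unique identifier in the "browser".
        if "tracker_id" not in cookie_jar:
            cookie_jar["tracker_id"] = str(uuid.uuid4())
        # Every later visit to ANY participating site sends back the same
        # identifier, letting the tracker merge the visits into one profile.
        self.profiles.setdefault(cookie_jar["tracker_id"], []).append(url)

tracker = Tracker()
browser_cookies = {}  # the user's cookie jar

for page in ["news.example/politics", "shop.example/shoes", "maps.example/home"]:
    tracker.on_page_load(browser_cookies, page)

profile = tracker.profiles[browser_cookies["tracker_id"]]
print(profile)  # one identifier now links all three "sites"
```

The user never logged in anywhere; the profile exists solely because the same cookie rode along on every request.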

Pushing the Limit

By law, companies must have users' consent to collect their data. That consent is usually granted by clicking "I agree" and ticking a checkbox, most likely located beneath an extensively long, small-font terms-of-service agreement. Companies design these forms to be awful and unenjoyable to read, yet skippable with one quick click of "I agree." Sadly, these practices are legal in the US, and almost all online platforms intrusively collect, analyze, and sell user data. Regulatory standards and laws like the Children's Online Privacy Protection Act (COPPA) exist, but from personal (shared) experience, they are easy to sidestep: signing up and posting on social media before the "required minimum age" of 13 isn't hard for a tech-savvy 8-year-old. Meta requires no ID during signup, just an email address (often a Gmail address, i.e., Google), and Google likewise requires no identification. And since user data is valuable to government agencies, state and federal governments minimally enforce these standards and laws. Platforms like Google, Meta, and now ChatGPT hand over user data upon government request for surveillance or criminal investigations (Thorbecke). The government tacitly encourages intrusive data collection by these companies because, in the long run, extensive amounts of organic data on American users benefit the government.

AI: What's New in 2024?

A stark difference between how users interact with social media platforms and AI chatbots is the authenticity behind each user interaction. On social media, users are passive voyeurs, "liking" the pictures of loved ones, glorifying the lives of celebrities, and scrolling memes, maybe leaving a comment here and there, but mostly spending their time intentionally "brain rotting." That doom-scrolling is the polar opposite of the authentic, human-written prompts composed on platforms like ChatGPT. AI chatbots have undoubtedly changed how humans interact with information online and complete daily tasks. With a fresh take on search interaction, users now chat back and forth with an anthropomorphized search engine instead of typing a query and scanning a list of results. The user benefits from quickly completed work tasks and plain-language summaries of hard-to-understand concepts that Google lacked. OpenAI (the creator of ChatGPT) benefits from a new treasure trove of user-made data: the chats themselves, layered on top of the extensive public resources of the Internet.

Copyrights and AI

When OpenAI launched ChatGPT in late 2022, legal battles quickly followed. The New York Times (NYT) accused OpenAI of illegally using its archive and current articles to train ChatGPT's models. A key issue is OpenAI's use of "retrieval augmented generation" (RAG), which allegedly reproduces copyrighted content verbatim, allowing users to "free-ride" on articles without engaging with the sources (Allyn).
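For readers unfamiliar with the term, a minimal sketch of what retrieval-augmented generation means: fetch the most relevant stored document, then paste it into the model's prompt as context. The tiny "archive," the word-overlap scoring, and the prompt format below are all hypothetical stand-ins, not OpenAI's actual pipeline; the point is only to show why a RAG system can surface source text near-verbatim.

```python
# Minimal RAG sketch: retrieve a document, then build a prompt around it.
# The archive and scoring are toy stand-ins for a real vector database.
archive = {
    "article-1": "The city council voted to expand the park on Tuesday.",
    "article-2": "Researchers reported a breakthrough in battery storage.",
}

def retrieve(query, corpus):
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(corpus.values(), key=lambda doc: len(q & set(doc.lower().split())))

def answer(query):
    context = retrieve(query, archive)
    # A real system would feed this prompt to a language model. A model that
    # leans too heavily on the retrieved context can emit it near-verbatim,
    # which is the kind of copying the NYT complaint alleges.
    return f"Context: {context}\nAnswer the question: {query}"

print(answer("What did the city council vote on?"))
```

Note that the retrieved article text travels into the output path untouched, which is exactly where verbatim reproduction can occur.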

Similarly, Scarlett Johansson publicly criticized OpenAI for giving ChatGPT's voice assistant a voice eerily similar to her character in Her, despite her refusal to license her voice. Many fans, including OpenAI CEO Sam Altman, had pushed for a Johansson-voiced ChatGPT, and its unauthorized inclusion sparked concerns over personal likeness rights and AI's ethical boundaries (Allyn). Both cases highlight the broader issue of AI companies exploiting intellectual property without consent, and they underscore the urgent need for regulation and transparency around how AI companies train their models and whose work and data those models are built upon.

Becoming Data-Conscious

Copyright infringement is not a minor crime in the US; federal statutory damages range from $200 to $150,000 per work (Purdue University Libraries). The NYT accuses OpenAI of copying millions of works, so financial penalties in the hundreds of millions could kill OpenAI and AI companies like it. Suppose ChatGPT really has been trained illegally on the works of millions. That means public posts, public digital footprints, and the cookie crumbs unknowingly left behind have all been collected and pieced together by companies like OpenAI. AI chatbots are now rewriting blog posts and independent works by small creators and hobbyists and presenting them as generated work. Your data is yours; companies should not steal it, sell it, or use it for anyone's benefit without your knowledge. AI companies are no different. They may be ad-free, but when a service is free, the user is the product.

Works Cited

Allyn, Bobby. "Scarlett Johansson Says She Is 'Shocked, Angered' over New ChatGPT Voice." NPR, 20 May 2024, www.npr.org/2024/05/20/1252495087/openai-pulls-ai-voice-that-was-compared-to-scarlett-johansson-in-the-movie-her.

Allyn, Bobby. "'The New York Times' Takes OpenAI to Court. ChatGPT's Future Could Be on the Line." NPR, 14 Jan. 2025, www.npr.org/2025/01/14/nx-s1-5258952/new-york-times-openai-microsoft.

Bensinger, Greg. "Google's Privacy Backpedal Shows Why It's So Hard Not to Be Evil." The New York Times, 14 June 2021, www.nytimes.com/2021/06/14/opinion/google-privacy-big-tech.html.

Singer, Natasha. "What You Don't Know About How Facebook Uses Your Data." The New York Times, 11 Apr. 2018, www.nytimes.com/2018/04/11/technology/facebook-privacy-hearings.html.

Thorbecke, Catherine. "Facebook Says Government Requests for User Data Have Reached All-Time High." ABC News, 13 Nov. 2019, abcnews.go.com/Business/facebook-government-requests-user-data-reached-time-high/story?id=66981424.

Purdue University Libraries, www.lib.purdue.edu/uco/infringement.