*TOPIC: Intro to Hadoop for big data.*
*Presenter: Steven Lembark*
Steven Lembark will give us an introduction to Hadoop. It's a
framework/tool to deal with huge and dissimilar data spread across multiple
machines. It grew out of Google's early efforts to organize what they
gathered "surfing the net".
Apache Hadoop is an open source, Java-based software platform/ecosystem
that manages processing and storage for big data apps. It handles datasets
ranging in size from gigabytes to petabytes. The data is spread over many
distributed disk farms, where it can be loaded and analyzed in parallel by
many distributed computers.
Hadoop is an ecosystem of open source components that fundamentally changes
the way enterprises store, process, and analyze data.
In the infancy of the Internet, there was the quest to "find stuff".
"Search engines" were needed. Google, AltaVista, Yahoo, Ask Jeeves... all
had ideas about how to do it.
Hadoop's roots go back to 2002, when Doug Cutting and Mike Cafarella began
work on Apache Nutch, an open source web crawler. In 2003 Google published
an academic paper describing the "Google File System", followed in 2004 by
a paper on MapReduce, a programming model that divides an application into
small tasks that run on different nodes. Inspired by these papers, the
Nutch developers built their own versions of both. In 2006 that work was
split out of Nutch as Hadoop, an open source project of the Apache Software
Foundation.
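For the curious, the MapReduce idea mentioned above can be sketched in a
few lines of plain Python. This is only a toy, single-process illustration
of the model (map, group by key, reduce), not Hadoop's actual API; the
function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are
made up for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different node;
    # here the "nodes" are simply iterations of a loop (an assumption
    # made to keep the sketch self-contained).
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    chunks = ["big data is big", "data is everywhere"]
    print(word_count(chunks))
```

Each chunk stands in for a block of data living on a different machine. The
point of the model is that the map and reduce steps never need to see the
whole dataset at once, which is what lets the work be spread across nodes.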
Although there are now other tools and platforms used for such large data
(e.g. Apache Hive, Apache Spark, Amazon EMR, Azure Data Lake Storage, IBM
Analytics Engine, Hortonworks Data Platform, Apache Pig, Clarissa, ...),
there are still those depending on Hadoop, including Netflix.
*So, Steven will tell us…*
An Overview of the Apache Hadoop Ecosystem:
There is stuff that's growing on your data warehouse hard disks. In the
beginning was Hadoop, and it was, well, Google's. And everyone tried it.
But by the time Google dropped the approach as ineffective, lots of other
folks had found ways to make pieces of it work and added new pieces to it,
and out of the ashes of single-purpose Hadoop grew the Apache Hadoop
ecosystem.
Today this includes a variety of software for intake, querying, mapping SQL
to key:value stores, and a few other cute tricks.
This talk will look at the pieces of this ecosystem, a bit about how they
fit together, and how they can be used for Really Truly HUUUUUUGE data
processing.