hRaven v0.9.8

The @twitterhadoop team just released hRaven v0.9.8.

hRaven collects run-time data and statistics from MapReduce jobs running on Hadoop clusters and stores the collected job history in an easily queryable format. For jobs that are run through frameworks (Pig or Scalding/Cascading) that decompose a script or application into a DAG of MapReduce jobs for actual execution, hRaven groups the job history data together under an application construct. This allows for easier visualization of the execution of all of an application’s component jobs, and for more comprehensive trending and analysis over time.
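As a rough illustration of what “easily queryable” means here, the sketch below scans hRaven’s HBase storage with the generic happybase Python client rather than hRaven’s own client. The table name, row-key layout, and column names are illustrative assumptions, not hRaven’s actual schema.

import happybase

# Connect to the HBase cluster backing hRaven (host is a placeholder).
connection = happybase.Connection("hbase-master.example.com")

# Hypothetical table and row-key layout: one row per job, keyed by
# cluster!user!application so all jobs of one application group together.
table = connection.table("job_history")
for row_key, columns in table.scan(row_prefix=b"cluster1!alice!daily_rollup"):
    print(row_key, columns.get(b"info:jobid"), columns.get(b"info:status"))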

 
Requirements
  • Apache HBase (0.94+) – a running HBase cluster is required for hRaven’s data storage
  • Apache Hadoop – hRaven currently supports collection of job data on specific versions of Hadoop:
    • CDH up to CDH3u5, and Hadoop 1.x up to MAPREDUCE-1016
    • Hadoop 1.x post MAPREDUCE-1016 and Hadoop 2.0, supported from version 0.9.4 onwards

https://github.com/twitter/hraven

Twemproxy v0.3.0 has been released

twemproxy v0.3.0 is out: bug fixes and support for SmartOS (Solaris) / BSD (OS X)

twemproxy (pronounced “two-em-proxy”), aka nutcracker, is a fast and lightweight proxy for the memcached and redis protocols. It was primarily built to reduce the connection count on the backend caching servers.

Features

  • Fast.
  • Lightweight.
  • Maintains persistent server connections.
  • Keeps connection count on the backend caching servers low.
  • Enables pipelining of requests and responses.
  • Supports proxying to multiple servers.
  • Supports multiple server pools simultaneously.
  • Shards data automatically across multiple servers.
  • Implements the complete memcached ASCII and redis protocols.
  • Easy configuration of server pools through a YAML file (see the sketch after this list).
  • Supports multiple hashing modes, including consistent hashing and distribution.
  • Can be configured to disable nodes on failures.
  • Observability through stats exposed on a stats monitoring port.
  • Works with Linux, *BSD, OS X and Solaris (SmartOS).
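As a sketch of that YAML configuration, the snippet below writes out a single server pool using Python’s yaml module purely for illustration; the pool name, listen address, and server list are made up, while the key names follow twemproxy’s documented configuration options.

import yaml

# One hypothetical pool named "alpha", proxying to two local redis instances.
config = {
    "alpha": {
        "listen": "127.0.0.1:22121",
        "hash": "fnv1a_64",
        "distribution": "ketama",      # consistent hashing
        "auto_eject_hosts": True,      # disable nodes on failures
        "server_retry_timeout": 2000,
        "server_failure_limit": 1,
        "redis": True,
        "servers": [
            "127.0.0.1:6379:1",        # host:port:weight
            "127.0.0.1:6380:1",
        ],
    }
}

with open("nutcracker.yml", "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)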

More details and source code available here: https://github.com/twitter/twemproxy

Open-sourcing Parquet

Twitter and Cloudera are open-sourcing Parquet: a columnar storage format for Hadoop

Parquet is a columnar storage format for Hadoop.

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind, and uses the repetition/definition level approach to encoding such data structures, as popularized by Google Dremel. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified at a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty when possible.
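As a small, present-day illustration of these two properties (per-column compression and reading only the columns you need), here is a sketch using the pyarrow bindings; pyarrow is not part of the initial code drop described below, and the column names are made up.

import pyarrow as pa
import pyarrow.parquet as pq

# Build a tiny two-column table and write it with a different codec per column.
table = pa.table({
    "user_id": [1, 2, 3],
    "score":   [0.5, 0.9, 0.1],
})
pq.write_table(table, "events.parquet",
               compression={"user_id": "snappy", "score": "gzip"})

# A columnar reader can then fetch a single column without touching the rest.
scores = pq.read_table("events.parquet", columns=["score"])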

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies.

The initial code, available at https://github.com/Parquet, defines the file format (parquet-format), provides Java building blocks for processing columnar data, and implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example of a complex integration — Input/Output formats that can convert Parquet-stored data directly to and from Thrift objects (parquet-mr).

A preview version of Parquet support will be available in Cloudera’s Impala 0.7.

With Impala’s current preview implementation, we see a roughly 10x performance improvement compared to the other supported formats. We observe this performance benefit across multiple scale factors (10GB/node, 100GB/node, 1TB/node). We believe there is still a lot of room for improvement in the implementation and we’ll share more thorough results following the 0.7 release.

Twitter is starting to convert some of its major data sources to Parquet in order to take advantage of the compression and deserialization savings.

Parquet is currently under heavy development. Parquet’s near-term roadmap includes:

  • Hive SerDes (Criteo)
  • Cascading Taps (Criteo)
  • Support for dictionary encoding, zigzag encoding, and RLE encoding of data (Cloudera and Twitter)
  • Further improvements to Pig support (Twitter)

We’ve also heard requests to provide an Avro container layer, similar to what we do with Thrift. Seeking volunteers!

We welcome all feedback, patches, and ideas. We plan to contribute Parquet to the Apache Incubator when the development is farther along.

Parquet is Copyright 2013 Twitter, Cloudera and other contributors.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Tweetping

http://tweetping.net/ provides real-time data on Twitter activity all around the world.

Twitter's fatcache available on GitHub

fatcache is memcache on SSD. Think of fatcache as a cache for your big data.

Overview

There are two ways to think of SSDs in system design. One is to think of the SSD as an extension of disk, where it plays the role of making disks fast; the other is to think of it as an extension of memory, where it plays the role of making memory fat. The latter makes sense when persistence (non-volatility) is unnecessary and data is accessed over the network. Even though memory is a thousand times faster than SSD, network-connected SSD-backed memory makes sense if we design the system so that network latencies dominate the SSD latencies by a large factor.

To understand why network connected SSD makes sense, it is important to understand the role distributed memory plays in large-scale web architecture. In recent years, terabyte-scale, distributed, in-memory caches have become a fundamental building block of any web architecture. In-memory indexes, hash tables, key-value stores and caches are increasingly incorporated for scaling throughput and reducing latency of persistent storage systems. However, power consumption, operational complexity and single node DRAM cost make horizontally scaling this architecture challenging. The current cost of DRAM per server increases dramatically beyond approximately 150 GB, and power cost scales similarly as DRAM density increases.

Fatcache extends a volatile, in-memory cache by incorporating SSD-backed storage.

SSD-backed memory presents a viable alternative for applications with large workloads that need to maintain high hit rate for high performance. SSDs have higher capacity per dollar and lower power consumption per byte, without degrading random read latency beyond network latency.

Fatcache achieves performance comparable to an in-memory cache by focusing on two design criteria:

  • Minimize disk reads on cache hit
  • Eliminate small, random disk writes

The latter is important due to SSDs’ unique write characteristics. Writes and in-place updates to SSDs degrade performance due to an erase-and-rewrite penalty and garbage collection of dead blocks. Fatcache batches small writes to obtain consistent performance and increased disk lifetime.

SSD reads happen at a page-size granularity, usually 4 KB. Single page read access times are approximately 50 to 70 usec and a single commodity SSD can sustain nearly 40K read IOPS at a 4 KB page size. 70 usec read latency dictates that disk latency will overtake typical network latency after a small number of reads. Fatcache reduces disk reads by maintaining an in-memory index for all on-disk data.
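The sketch below illustrates those two ideas, an in-memory index of on-disk locations plus batched sequential writes, in Python. It is a conceptual toy, not fatcache’s actual C implementation or on-SSD layout.

import os

class SsdBackedCache:
    """Toy cache: values live in one append-only file, the index lives in RAM."""

    def __init__(self, path, batch_bytes=1024 * 1024):
        self.log = open(path, "ab+")
        self.index = {}                 # key -> (offset, length)
        self.pending = []               # buffered (key, value) pairs
        self.pending_bytes = 0
        self.batch_bytes = batch_bytes

    def set(self, key, value):
        self.pending.append((key, value))
        self.pending_bytes += len(value)
        if self.pending_bytes >= self.batch_bytes:
            self.flush()                # one large sequential write

    def flush(self):
        offset = self.log.seek(0, os.SEEK_END)
        for key, value in self.pending:
            self.index[key] = (offset, len(value))
            self.log.write(value)
            offset += len(value)
        self.log.flush()
        self.pending, self.pending_bytes = [], 0

    def get(self, key):
        for k, v in reversed(self.pending):    # not yet flushed?
            if k == key:
                return v
        if key not in self.index:
            return None                        # cache miss
        offset, length = self.index[key]
        self.log.seek(offset)                  # at most one disk read on a hit
        return self.log.read(length)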

https://github.com/twitter/fatcache

Data generated through social media tools

A few figures about data generated through social media:

  • People send more than 144.8 billion email messages a day.
  • People and brands on Twitter send more than 340 million tweets a day.
  • People on Facebook share more than 684,000 bits of content a day.
  • People upload 72 hours (259,200 seconds) of new video to YouTube a minute.
  • Consumers spend $272,000 on Web shopping a day.
  • Google receives over 2 million search queries a minute.
  • Apple receives around 47,000 app downloads a minute.
  • Brands receive more than 34,000 Facebook ‘likes’ a minute.
  • Tumblr blog owners publish 27,000 new posts a minute.
  • Instagram photographers share 3,600 new photos a minute.
  • Flickr photographers upload 3,125 new photos a minute.
  • People perform over 2,000 Foursquare check-ins a minute.
  • Individuals and organizations launch 571 new websites a minute.
  • WordPress bloggers publish close to 350 new blog posts a minute.
  • The Mobile Web receives 217 new participants a minute.
    (The most updated numbers are available from the sites themselves.)

Top Twitter Influencers on #BigData

#BigData – Twitter Influencers

Here is the list of Top 50 Big Data Influencers on Twitter.

Note: Big Data Twitter influencers were determined based on tweeted topics, influence as measured by Klout, number of followers, and number of tweets. Below are the “top” influencers at this time, based on a combination of these factors.

@bigdata – Ben Lorica
Big Data, Analytics, Cloud Computing resources from Ben Lorica, Chief Data Scientist @OReillyMedia – San Francisco, CA · http://www.bglorica.com
Klout – 49

@BigDataAnalysis – John Akred
Big Data R&D Lead at Accenture Technology Labs, Musician, Engineer, Technologist, Analog Audio and Vacuum Tube lover, Provocateur. These thoughts are my own. – Chicago / San Jose · http://www.linkedin.com/pub/john-akred/1/b0a/320
Klout – 44

@imbigdata – Manish Bhatt
News and Updates about BigData, NoSQL, Hadoop, BI and other Big Data related technologies from BigData enthusiast Manish Bhatt
Klout – 24

@ibmbigdata – IBM Big Data
Talking about the challenges and approaches to handling Big Data. Primarily managed by @TheSocialPitt – http://www.ibm.com/bigdata ·http://www.smartercomputingblog.com/category/big-data/
Klout – 37

@bobgourley – Bob Gourley
A CTO. Also find me @CTOvision and @CTOlist. National Security, Cyber Security, Enterprise IT and tech fun are key topics of interest.
Washington, DC · http://ctovision.com
Klout – 50

@klintron – Klint Finley
I write for SiliconAngle. I also run Technoccult and build strange soundscapes.
Portland, OR · http://klintfinley.com
Klout – 45

@KristenNicole2 – Kristen Nicole
News editor at SiliconANGLE, writer at Appolicious, recovering social media addict – http://kristennicole.com
Klout – 40

@dhinchcliffe – Dion Hinchcliffe
Business strategist, enterprise architect, keynote speaker, book author, blogger, & consultant on social business and next-gen enterprises. – Washington, D.C. ·http://dachisgroup.com
Klout – 52

@dmkimball05 – Dan Kimball
CMO at Kontagent, helping companies interpret patterns in social & mobile data to optimize their customer economics – San Francisco, CA ·http://www.linkedin.com/in/danielkimball
Klout – 18

@HadoopNews – John Ching
Latest news about Hadoop, NoSQL & BigData from John Ching, Big Data Guru, Consultant, and Evangelist for BI, Machine Learning, and Predictive Analytics
Klout – 44

@medriscoll – Michael E. Driscoll
CEO @Metamarkets. I ♥ Big Data, analytics, and visualization. – San Francisco, CA ·http://medriscoll.com/
Klout – 48

@peteskomoroch – Pete Skomoroch
My mission is to create intelligent systems that help people make better decisions. Principal Data Scientist @LinkedIn. Machine Learning, Hadoop, Big Data.
Silicon Valley · http://datawrangling.com
Klout – 56

@hmason – Hilary Mason
chief scientist @bitly. Machine learning; I ♥ data and cheeseburgers.
NYC · http://www.hilarymason.com
Klout – 56

@TimGasper – Tim Gasper
@Infochimps product manager, @Keepstream co-founder, techie, app addict, music writer/lover, #BigData, #Cloud – Austin, TX · http://timgasper.com
Klout – 43

@flowingdata – Nathan Yau
Data, visualization, and statistics. Author of ‘Visualize This.’ Background in eating. – California · http://flowingdata.com
Klout – 58

@bradfordcross – Bradford Cross
design and data @prismatic – San Francisco · http://getprismatic.com/
Klout – 44

@CityAge – City Age
We amplify good ideas through unique dialogues and their associated campaigns. We’re now organizing The Data Effect, amid other projects. – Vancouver, BC ·http://www.thedataeffect.org
Klout – 32

@BigDataExpo – Big Data Expo
Join 30,000+ Delegates in 2012 at World’s Largest #Cloud Events! New York [June 11-14] Silicon Valley [Nov 5-8] Register & Save! ▸ http://bit.ly/tucY2B – New York/Silicon Valley · http://BigDataExpo.net
Klout – 34

@acmurthy – Arun C Murthy
Founder & Architect, Hortonworks. VP, Apache Hadoop, Apache Software Foundation i.e. Chair, Hadoop PMC. Moving Hadoop forward since day one, since 2006. – online · http://people.apache.org/
Klout – 43

@infoarbitrage – Roger Ehrenberg
Big Data VC at IA Ventures. Data junkie. Quant dude. Baseball coach. – ÜT: 40.76136,-73.980129 · http://www.iaventures.com
Klout – 56

@jeffreyfkelly – Jeff Kelly
I am an Industry Analyst covering Big Data and Business Analytics at The Wikibon Project and SiliconANGLE – Boston · http://wikibon.org
Klout – 45

@timoreilly – Tim O’Reilly
Founder and CEO, O’Reilly Media. Watching the alpha geeks, sharing their stories, helping the future unfold. – Sebastopol, CA · http://radar.oreilly.com
Klout – 69

@digiphile – Alex Howard
Gov 2.0 @Radar Correspondent, @OReillyMedia: alex@oreilly.com. Intrigued by technological change, taken with ideas, cooking, the outdoors, books, dogs and media – Washington, DC · http://radar.oreilly.com/alexh
Klout – 74

@band – William L. Anderson
Sociotechnical systems developer, open access advocate, and editor at CODATA Data Science Journal – austin texas
Klout – N/A

@HenryR – Henry Robinson
Engineer @ Cloudera, Zookeeper committer / PMC member, professional dilettante – San Francisco, CA · http://the-paper-trail.org/
Klout – 43

@furrier – John Furrier
Silicon Valley entrepreneur Founder SiliconANGLE Network. Inventing New Things, Blogging, Tweeting Social Media – Palo Alto, California · http://SiliconAngle.com
Klout – 49

@mikeolson – Mike Olson
Cloudera CEO – Berkeley, California · http://www.cloudera.com/
Klout – 48

@davenielsen – Dave Nielsen
Co-founder of CloudCamp & Silicon Valley Cloud Center – Mountain View, Ca ·http://www.platformd.com
Klout – 44

@znmeb – M. Edward Borasky
Media Inactivist, Thought Follower, Sit-Down Comic, Former Boy Genius, Real-Time Data Journalism Researcher, Open Source Appliance Maker And Mathematician – Portland, OR · http://j.mp/compjournoserver
Klout – N/A

@rizzn – Mark ‘Rizzn’ Hopkins
I’m the editor in chief for SiliconANGLE and the purveyor of fine content at rizzn.com. ·http://rizzn.com
Klout – N/A

@edd – Edd Dumbill
Telling the story of our future, where technology is headed, and what we need to know now. O’Reilly Strata and OSCON program chair. Incurably curious – California · http://eddology.com/
Klout – 57

@kellan – Kellan E
Technological solutions for social problems. CTO, Etsy. (if you follow me, consider introducing yourself with @kellan message) #47 – Brooklyn, NY ·http://laughingmeme.org
Klout – 58

@mikeloukides – Mike Loukides
VP Content Strategy, O’Reilly Media, pianist, ham radio op usually in Connecticut
Klout – 54

@laurelatoreilly – Laurel Ruma
Director of Talent (speaker and author relations) at O’Reilly Media. Homebrewer, foodie, farmer in the city – Cambridge, MA · http://www.oreilly.com
Klout – 45

@neilraden – Neil Raden
VP/Principal Analyst,Constellation Research;Analytics, BigData, DecisionManagement. Author/Writer,Blogger,Speaker.Husband/(Grand)Father
Santa Fe, NM · http://www.constellationrg.com/users/nraden
Klout – 53

@greenplum – Greenplum
Greenplum, a division of EMC is driving the future of big data analytics.
San Mateo, California · http://www.greenplum.com/
Klout – 48

@squarecog – Dmitriy Ryaboy
Analytics Tech Lead at Twitter. Apache Pig committer.
San Francisco
Klout – 48

@BigData_paulz – Paul Zikopoulos
Director of Technical Professionals for IBM’s Information Management, BigData, and Competitive Database divisions. Published 15 books and over 350 articles.
Klout – 38

@moorejh – Jason H. Moore
Third century professor, Director of the Institute for Quantitative Biomedical Sciences at Dartmouth College, Editor-in-Chief of BioData Mining – Lebanon, NH, USA · http://www.epistasis.org
Klout – 47

@GilPress – Gil Press
I launched the #BigData conversation; Writing, research, marketing services;http://whatsthebigdata.com/ & http://infostory.wordpress.com/
Boston
Klout – 41

@ToddeNet – Todd E. Johnson PhD
Educational Access and Academic Sustainability • STEM •Data Informed Decisions (DID) • Always Dreaming and Learning(ADL)….Tweets here are my own!!
Olympia, WA · http://www.linkedin.com/in/toddenet
Klout – 27

@digimindci – Orlaith Finnegan
Provider of Competitive Intelligence & Market Intelligence Software. Online Reputation, Real-time Web Monitoring and Analysis, Social Media Monitoring.
Boston, Paris, Singapore · http://www.digimind.com
Klout – 36

@SmartDataCo – Smart Data Collective
Expert writers on analytics, BI and big data brought to you by the folks at Social Media Today.com · http://smartdatacollective.com 
Klout – 42

@al3xandru – Alex Popescu
NOSQL Dreamer http://mynosql.tv, Software architect, Founder/CTO InfoQ.com, Web aficionado, Speaker, iPhone: 44.441881,26.139629 · http://mynosql.tv
Klout – 49

@marksmithvr – Mark Smith
CEO & Chief Research Officer at Ventana Research – http://www.ventanaresearch.com & follow @ventanaresearch – San Ramon, CA · http://marksmith.ventanaresearch.com/
Klout – 51

@BernardMarr – Bernard Marr
Leading global authority and best-selling author on delivering, managing and measuring enterprise performance – London
Klout – 50

@johnlmyers44 – John L Myers
Senior Analyst for EMA Business Intelligence and Data Warehousing practice specializing in telecom analytics and business process management – Boulder, Colorado · http://www.enterprisemanagement.com/about/team/John_Myers.php
Klout – 54

@leenarao – Leena Rao
Tech Writer (tech Crunch), dog-lover, foodie, quirky – Chicago
Klout – 68

@HKotadia – Harish Kotadia Ph.D
Big Data, Predictive Analytics, Social CRM and CRM. Work for Infosys (NASDAQ: INFY). Views and opinion expressed are my own. – Dallas, Texas, USA ·http://HKotadia.com/
Klout – 42

@chuckhollis – Chuck Hollis
technologist, marketeer, blogger and musician working for EMC.
Holliston, MA · http://chucksblog.emc.com
Klout – 49

MIT: Looking At The World Through Twitter Data

Great article from MIT

My friend Kang, also an MIT CS undergrad, and I started playing with some data from Twitter a little while ago. I wrote this post to summarize some of the challenges we faced, some things we learned along the way, and some of our results so far. I hope they’ll show you, as they did us, how valuable social data can be.

Scaling With Limited Resources

Thanks to AWS, companies these days are not as hard to scale as they used to be. However, college dorm-room projects still are. Paying for servers is one of the less important things a company has to worry about. That’s not the case for us, though, since we’ve been short on money and pretty greedy with data.

Here are some rough numbers about the volume of the data that we analyzed and got our results from.

User data: ~ 70 GB
Tweets: > 1 TB and growing
Analysis results: ~ 300 GB

> 10 billion rows in our databases.

Given the fact that we use almost all of this data to run experiments everyday, there was no way we could possibly afford putting it up on Amazon on a student budget. So we had to take a trip down to the closest hardware store and put together two desktops. That’s what we’re still using now.

We did lots of research about choosing the right database. Since we are only working on two nodes and are mainly bottlenecked by insertion speed, we decided to go with good old MySQL. All other solutions were either too slow on a couple of nodes or too restrictive for running diverse queries. We wanted flexibility so we could experiment more easily.

Dealing with the I/O limitations of a single node is not easy. We had to use all kinds of different tricks to get around our limitations: SSD caching, bulk insertions, MySQL partitioning, dumping to files, extensive use of bitmaps, and the list goes on.
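For instance, here is a minimal sketch of the bulk-insertion trick (one large INSERT instead of thousands of tiny ones), using the mysql-connector-python client; the table and column names are made up for illustration, not our actual schema.

import mysql.connector

conn = mysql.connector.connect(user="twitter", password="secret", database="tweets")
cur = conn.cursor()

BATCH_SIZE = 10_000

def bulk_insert(rows):
    """rows: iterable of (tweet_id, user_id, created_at, text) tuples."""
    sql = ("INSERT INTO tweets (id, user_id, created_at, text) "
           "VALUES (%s, %s, %s, %s)")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            cur.executemany(sql, batch)   # one round trip for the whole batch
            conn.commit()
            batch.clear()
    if batch:
        cur.executemany(sql, batch)
        conn.commit()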

If you have advice or questions regarding all that fun stuff, we’re all ears. : )

Now on to the more interesting stuff, our results.

Word-based Signal Detection

Our first step towards signal and event detection was counting single words. We counted the number of occurrences of every word in our tweets during every hour. Sort of like Google’s Ngrams, but for tweets.
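In code terms, the idea is roughly the following (field names are illustrative, not our actual schema):

from collections import Counter, defaultdict

hourly_counts = defaultdict(Counter)      # hour bucket -> word -> count

def count_tweet(created_at, text):
    hour = created_at.replace(minute=0, second=0, microsecond=0)
    hourly_counts[hour].update(text.lower().split())

# hourly_counts[hour]["superbowl"] then gives the number of occurrences
# of 'superbowl' during that hour, which is what the plots below show.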

Make sure you play around with some of our experimental results here. The times are EST/EDT. *

Here is some cool stuff we found from looking at the counts. If you find anything interesting from looking at different words, please share it with us!

Daily and Weekly Fluctuations

If you look for a very common word like ‘a’ to get an estimate of the total volume of tweets, you clearly see a daily and weekly pattern. Tweeting peaks at around 7 pm PST (10 pm EST) and hits a low at around 3 am PST every day. There’s also generally less tweeting on Fridays and Saturdays, probably because people have better things to do with their lives than to tweet!

 

A side note:

We’ve tried to focus on English-speaking tweeters within the States. Note that the percentage of tweets containing ‘a’ also fluctuates during the day, which is surprising at first. But this is because the non-English tweets that we discarded are much more frequent during the night in our time zone, and they don’t contain the word ‘a’ as often as English tweets do.

Sleep

I’m a night owl myself, and I had always been curious to know exactly what time the average person goes to sleep, or at least thinks about it! I looked for the words “sleep”, “sleeping”, and “bed”. You can do this yourself, but the only problem you’ll see is that not all the tweets share the same time zone. To solve this issue, we isolated several million tweets from users who had set their location to Boston or Cambridge, MA. Then we created a histogram of their average sleeping hours. Here’s the result:

 

It seems like the average Bostonian goes to sleep at around midnight! Of course, that’s probably not the average everywhere. After all, a fourth of our city is nocturnal college students!
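For reference, here is a rough sketch of the filtering and bucketing behind that histogram, assuming tweets are available as (created_at, text, location) tuples already normalized to one time zone; the field names are illustrative.

from collections import Counter

SLEEP_WORDS = {"sleep", "sleeping", "bed"}

def sleep_histogram(tweets):
    hist = Counter()
    for created_at, text, location in tweets:
        if location in ("Boston, MA", "Cambridge, MA"):
            if SLEEP_WORDS & set(text.lower().split()):
                hist[created_at.hour] += 1
    return hist       # hour of day -> number of sleep-related tweets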

You can look at all kinds of words relating to recurring events like ‘lunch’, ‘class’, ‘work’, ‘hungry’ and whatever you can imagine. I promise you, you’ll be fascinated.

Here are some suggestions:
Coke, Valentine, Hugo and other Oscars-related words, IPO.
(please suggest other interesting things I should add to this list)

I’m Obsessed With Linguistics

As we were looking at different words, we noticed that the words Monday, Tuesday, etc. show very interesting weekly patterns. They reflect a signal that has its peak on the respective day, as you’d expect, and which rises as you get closer to that day. This means that people have more anticipation for days that are closer, more or less linearly. But if you pay closer attention, you’ll see that the day immediately before the search term corresponds to a clear valley in the curve. This points to a very interesting linguistic phenomenon: in English, we never refer to the next day by the name of that weekday, and instead use the word ‘tomorrow’.

 

On a Wednesday, people don’t say ‘Thursday’

Events

We tried to find events that we thought would have a strong reflection in the Twitter sphere. ‘superbowl’, ‘sopa’, and ‘goldman’ were pretty interesting. Here are the graphs for those three, which you can also recreate yourself.

Tweets about ‘Superbowl’ during each hour

Tweets about ‘Sopa’ during each hour

Tweets about ‘Goldman Sachs’ during each hour

We’ll post more about our attempts to dissect exactly what happened on Twitter as these events progressed. In the Goldman Sachs case, for example, the peak happens on the day of the controversial public exit of a GS employee, which was covered in the NYTimes. The earliest news release was at 7 am GMT, which coincides with the first signs of a rise in our signal.

Politics and Public Sentiment

If you query the word Obama, this is what you’ll see:

 

When we first saw this spike, we were very suspicious. The spike seemed way too prominent to be associated with a single event (~25 times the average signal amplitude). But guess what: the spike was at 9 PM on Jan 24th, when the State of the Union speech happened!

We were curious to see some of the 250K sample tweets containing ‘obama’ from that hour. Here are a few of them, along with some self-declared descriptions from users:

ok. the obama / giffords embrace made me choke up a little.
A teacher from Killeen, Texas. ~200 followers.

I love that Obama starts out with a tribute to our military. #SOTU
A liker of progressive politics from Utah. ~100 followers.

Great Speech Obama #SOTU
A CEO from NYC. ~700 followers.

There were both positive and negative tweets. But we wanted to quantify whether the tweets were positive or negative, because that’s what really matters in a context like this. Here are some results and an explanation of how our sentiment analysis works in general.

Sentiment Analysis

Our sentiment analysis is done by training a model on several thousand manually classified sample tweets. It reaches very good prediction accuracy according to our tests, as I’ll explain below.
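As a rough sketch of the approach (train on manually labeled tweets, then score new ones), a simple bag-of-words classifier built with scikit-learn would look like this; it is an illustration, not necessarily the exact model we use.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_sentiment_model(texts, labels):
    """texts: manually classified tweets; labels: 1 = positive, 0 = negative."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # unigrams and bigrams
        LogisticRegression(),
    )
    model.fit(texts, labels)
    return model

# model.predict_proba([tweet])[0, 1] then gives a positivity score for a new tweet.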

The graph below shows the normalized sentiments of several different sets of tweets for each day during a 3 month period.

The two graphs here are the sentiments of millions of independently and randomly chosen tweets from our set. The fact that they follow each other so closely is the important achievement of our system: it means the signal-to-noise ratio is high enough that the sentiment is clearly measurable.

You can also observe a weekly periodicity in the general sentiment. Interestingly, it shows that people post happier and more positive tweets during the weekends compared to the rest of the week! In addition, the signal acts somewhat unusually around January 1st and February 14th!

The graphs below, on the other hand, are indicative of the sentiments of tweets in which some combination of keywords related to ‘economy’ or ‘energy’ appears. As you can see, the patterns in the graphs are fairly stable except for a single day, January 24th, where the sentiment significantly drops. That’s when Obama’s State of the Union speech took place, and it looks like his speech triggered a lot of negative tweets related to energy and the economy.

We were curious to see whether the sentiment was only negative for this bag of tweets (those containing ‘energy’ or ‘economy’), or whether tweets about Obama during the State of the Union speech were negative in general. Here are our results of running sentiment analysis on all the data containing the word ‘obama’ in the past 4 months.

Here, the blue curve is the average sentiment of tweets for each day, and the red curve shows the variance in the sentiment. (If it’s high, there were both very happy and very sad tweets; when it’s low, tweets were more neutral.)
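Concretely, the two curves are just the per-day mean and variance of the individual tweet scores, roughly:

import statistics

def daily_curves(scores_by_day):
    """scores_by_day: dict mapping a date to the list of tweet sentiment scores."""
    means = {day: statistics.mean(s) for day, s in scores_by_day.items()}
    variances = {day: statistics.pvariance(s) for day, s in scores_by_day.items()}
    return means, variances      # blue curve, red curve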

The graph clearly shows that there was some heated debate about Obama on exactly January 24th. On the same day, the sentiment dropped, so we can tell there was more negative tweeting than positive. It looks like people weren’t too happy during the #sotu speech.

As we looked at the graph, it was also hard to ignore the other peak in variance (polarity) that appears around the 15th to 17th of January. It seems like sentiment about Obama fell and rose rapidly over only a few days in that time span. This is likely a result of tweets about SOPA/PIPA and Obama’s disagreement with the bill, which happened during those days.

The Growth of Twitter

Twitter has had periods of slow and rapid growth since its inception on March 21st, 2006. We tried to capture the growth of Twitter from its very first user until earlier this year. Here is the result, showing the number of people joining Twitter during every hour since day one:

As you can see, there are some abnormally large numbers of people joining during specific hours throughout Twitter’s lifetime. One is apparently in April 2009, when they released their search feature. Another is probably when they rolled out the then-new Twitter for everyone.

Summing up…

As we started looking at the data for the first time, we were absolutely blown away by all the cool insights you can extract from it. It makes you wonder why people aren’t doing these sorts of things more often with all the cheap data and computing power they have these days.

This was our first blog post and we hope you liked what you saw. You’re awesome if you’re still reading. Stay tuned for more and please give us any feedback you may have!

*The count results aren’t perfectly accurate. There is a general upward trend because Twitter only exposes a user’s most recent 3,200 tweets. There is also a discontinuity around Feb 11th, which is due to a temporary glitch we had.

55,000+ Twitter accounts leaked

Today, Anonymous hackers leaked the usernames and passwords of more than 55,000 hacked Twitter accounts through Pastebin. It is shocking to see such a massive number of Twitter accounts compromised, including celebrity accounts.

‘The micro-blogging platform is aware of this hack and is taking the necessary actions to protect those users’ accounts from malicious activity,’ said a Twitter insider.

It was huge: 55,000+ accounts were compromised, and it wasn’t possible to share such a huge pile of usernames and passwords in a single paste, so it took the hackers five Pastebin pages to leak the data. This hack is just an alert to millions of other Twitter users that they could be hacked at any time.

It is unbelievable that Twitter isn’t taking the necessary steps to keep its users’ data safe, even after encountering a large number of hacks in the past, including celebrity accounts. All it needs to do is add a password-strength checker during signup and password changes, and guide users toward creating strong passwords. That could save users a lot of frustration.

To check whether your account was hacked, go through these five Pastebin pages (page 1 | page 2 | page 3 | page 4 | page 5) and find your account by using your browser’s find feature (Ctrl+F) to search for your email address.

 

Or you can download the whole list as a single file, created using the following commands:

# append the five raw Pastebin pages into one local file
curl http://pastebin.com/raw.php?i=Kc9ng18h > twitterpw.txt
curl http://pastebin.com/raw.php?i=vCMndK2L >> twitterpw.txt
curl http://pastebin.com/raw.php?i=JdQkuYwG >> twitterpw.txt
curl http://pastebin.com/raw.php?i=fw43srjY >> twitterpw.txt
curl http://pastebin.com/raw.php?i=jv4LBjPX >> twitterpw.txt

Sword of the data, a step behind [part 2]

This article follows “Sword of the data, a brief history [part 1]”, available here.

The Sword of the data, part #2

A step behind "the data issue" 

1/ A horizon of new possibilities

We’re living in a new era, one which can probably be called the big data era. In one decade, data has driven revolutionary changes in computing and the Internet, including new opportunities for generating revenue and for improving the efficiency of business processes; any organization, whatever its size, can now operate across the world in real time, and so on.

2/ The data-centric enterprise

As organizations become more and more data-centric every day, they are capturing data at deeper levels of detail and keeping more history than they ever have before, and the volume is still increasing at an unprecedented rate. Data is now the most important and valuable component of modern applications, websites, and the business intelligence organizations use for decision-making.

3/ In this new era, “everything lives or dies strictly by the sword of data”

When competing for the same ground, the keys to success will be linked to the way you handle your data:

  • Safety: information security has become a requirement.
  • Quality, integrity, consistency: erroneous or conflicting data has a major cost to a business’s bottom line, impacting customers, reputation and revenue.
  • Accessibility: ease of search and the number of clicks required to reach a wanted piece of data make a difference.
  • Availability: downtime now has a cost.
  • Velocity: poor performance reduces productivity.