An Example in How Data Mining Really Works

Data mining has connotations of unmarked white vans and NSA agents listening in on your phone call to your mum. In other words, it’s creepy. But unlike that weird man on the train who just won’t stop staring at you, data mining is wonderfully misunderstood.

The practice is more about cutting out noise than eavesdropping. Data mining seeks to take a large amount of publicly available data (can’t stress the publicly part more) and sift through it to learn more about what your customers like.

Since talking about it only gets me so far, here are some real-world results. Before I go on, a quick note: there are numerous methods of conducting this sort of data mining research, and I’m using a version that’s laid out in this Moz post.

It uses a Python-based script to sift through Twitter profiles and semantically analyze each one. The results show what websites they are tweeting about and the content of those web pages. By homing in on the topics a person tweets about, you’ll be able to reverse engineer a model of the content they like or are interested in.
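To make the first half of that concrete, here is a minimal sketch of the link-counting step. It is not the script from the Moz post, just an illustration using Python’s standard library. The function name `count_linked_domains` and the sample links are my own placeholders, and it assumes you have already collected the URLs from each profile’s tweets.

```python
from collections import Counter
from urllib.parse import urlparse


def count_linked_domains(urls):
    """Count how many links point at each domain (hypothetical helper)."""
    domains = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Treat www.example.com and example.com as the same site.
        if host.startswith("www."):
            host = host[4:]
        # Skip anything that did not parse as a full URL.
        if host:
            domains[host] += 1
    return domains


# Made-up links for illustration only; these are not the mined data.
sample_links = [
    "https://www.example-hotel.com/offers",
    "http://example-hotel.com/rooms",
    "https://example-news.com/events/market",
]

link_counts = count_linked_domains(sample_links)
for domain, count in link_counts.most_common():
    print(domain, count)
```

Note that this sketch does not resolve shortened links, so anything shared through a shortener such as t.co would be counted under the shortener’s domain rather than the page it actually points to.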

Okay, now that that’s done, let’s move on.

Here, let’s look at a selection of hotels right here in Vancouver. I arbitrarily chose @FairmontPacific, @FSVancouver, @TheBurrard, @panpacificvan, @SandmanHotels, @WallCentreHotel and @TheWestinGrand as my starting point, and here is the data I mined:

Website Number of Links
sandmanhotels.ca 11
theburrard.com 11
vancitybuzz.com 9
t.co 7
insidevancouver.ca 6
culturecrawl.ca 3
modernmixvancouver.com 3
gov.bc.ca 2
facebook.com 2
vancouverchristmasmarket.com 1
giovanecafe.com 1
vancouvereconomic.com 1

This data represents the first 16 rows. I chose that number because after row 16 the quality of the links drops. I followed the same logic for the concepts table below.

After only a quick look, it’s easy to tell what these different brands talk about: themselves and Vancouver; apparently Sandman Hotels and The Burrard are very fond of themselves.

But beyond that point, a lot of these tweets are related to news, information and cultural events going on in Vancouver. If I were to draw any conclusions, it’s that these hotel brands are interested in events throughout the city that their guests might find interesting. In other words, guests are following their accounts because they are in touch with what’s going on in the city.

But that’s only half of it. So far I’m just making assumptions about the content of all these linked websites. That’s where the second element comes into play. Below you can see the concepts pulled from the linked web pages.

Concept Number of Mentions
Vancouver 12
Christmas 11
English-language films 7
Christmas tree 7
Granville Island 6
Vancouver International Airport 6
British Columbia 6
Downtown Vancouver 6
Hotel 5
Yule 5
Sandman 4
Chinatown, Vancouver 4
2010 Winter Olympics 4
2003 singles 4
Stanley Park 4
Christmas Eve 4

My initial expectation was to see the brand names mentioned more often, as the hotels’ own domains ranked so highly in the previous table. However, only Sandman’s brand name made it onto this table. Instead, for the most part, these hotels are tweeting about Vancouver, Christmas and tourist attractions like Granville Island.

Pretty close to what the linked domains implied.

However, as you can probably tell, there are some weird results in the concepts table. Things like “2003 singles” and “English-language films” feel out of place. Which brings me to the most important point about data mining:

Cutting Out The Noise

The above is just, well, an example. I’ve oversimplified it only to show you the basic procedure, results and analysis. The reality is that we have to make a concerted effort to filter out ancillary data when mining.

For example, I only chose a few accounts. In reality, we should have a double- if not triple-digit number of accounts. By having a large sample size, we’ll be able to reduce the impact of noisy data. It could be that one hotel had an English-language film screening (which to me is really just a movie screening, but whatever), and was very fond of the event. Thus, it ranked higher in my data than it would in reality.
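One way to picture why a single enthusiastic account skews things: compare raw mention counts with the number of distinct accounts that mention each concept. The sketch below is only an illustration of that idea, not part of the analysis above. The account handles, concepts and the `concept_counts` helper are all made up.

```python
from collections import Counter


def concept_counts(mentions_by_account):
    """Return (raw mention counts, number of accounts mentioning each concept)."""
    raw = Counter()
    spread = Counter()
    for concepts in mentions_by_account.values():
        raw.update(concepts)          # every mention counts
        spread.update(set(concepts))  # each account counts once per concept
    return raw, spread


# Hypothetical data: one account tweets about a film screening five times.
sample = {
    "@hotel_a": ["Film screening"] * 5 + ["Vancouver"],
    "@hotel_b": ["Vancouver", "Christmas"],
    "@hotel_c": ["Vancouver", "Christmas"],
}

raw, spread = concept_counts(sample)
print("Top by raw mentions:", raw.most_common(1))
print("Top by account spread:", spread.most_common(1))
```

In this made-up sample the film screening tops the raw count even though only one account cares about it, while counting by account puts Vancouver on top. The more accounts you include, the less any one of them can dominate.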

I also only chose brands in order to make the analysis simpler. In an ideal world, we’d take a cross-section of the population. That way, we could get an overall impression of what the average person is tweeting about.

And a final thing to consider when cutting out the noise is junk accounts. This example didn’t suffer from that problem, but if you are dealing with hundreds of profiles, cutting out bots and/or spam accounts will help to clean up the data.
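As a rough illustration of what that clean-up could look like, here is a sketch that drops accounts whose tweets are mostly exact repeats of each other. This is just one simple heuristic I’m assuming for the example. The 0.5 threshold, the function names and the sample accounts are all hypothetical, and a real spam filter would need more signals than this.

```python
def duplicate_share(tweets):
    """Fraction of an account's tweets that repeat an earlier tweet verbatim."""
    if not tweets:
        return 0.0
    return 1 - len(set(tweets)) / len(tweets)


def drop_junk_accounts(tweets_by_account, max_duplicate_share=0.5):
    """Keep accounts that have tweets and are not mostly duplicates.

    The 0.5 default is an arbitrary placeholder, not a tested value.
    """
    return {
        account: tweets
        for account, tweets in tweets_by_account.items()
        if tweets and duplicate_share(tweets) <= max_duplicate_share
    }


# Made-up accounts for illustration only.
sample_accounts = {
    "@hotel_a": ["Market opens today", "New menu", "Skating downtown"],
    "@spammy": ["Buy now!"] * 4,
    "@empty": [],
}

clean_accounts = drop_junk_accounts(sample_accounts)
print(sorted(clean_accounts))
```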

See, It’s Not That Evil

All I did here was take the data available on each hotel’s own account, information they readily give out. The actual trick to data mining is finding a creative way to analyze it. If you know of a good method to do this, then the data you pull will leave you with a more in-depth understanding of your audience.

What do you think? Does data mining still have connotations of spying or dodgy practices? Also, what do you think about our analysis of these accounts? Is it what you’d expect from a hotel brand?

Finally, if you’d like to learn more about how we conducted this analysis, we’d be happy to go over it in depth. Just contact us.