Lead Generation and Prediction in Residential Real Estate

Chen, Stefanos, “Thinking of Selling? They Already Know.” WSJ, 5/15/2015

Questions:

[1] What data is included in a residential real estate listing?

[2] What additional data does a residential real estate agent use in pricing a listing?

[3] Where does that data come from?

[4] For what reasons are residential homes sold?

[5] What data is available that is correlated with the reasons that people sell homes?

This article is a terrific case study for the data acquisition stage of a data analytics project.  It is also useful for the “framing” question.  There are many different questions that one could ask based upon the available data.  One could attempt to predict “selling price,” “cost recovery of particular home improvement projects,” “estimated number of bids,” “estimated days on market” (i.e., time to a sale), etc.  This is essentially a dating game – a match between potential sellers (homes) and potential buyers.  Given a home, one could also predict the optimal buyer or buyer type.  How one frames the question and the different sources of data are all covered in this article.  Little is said about the types of models that are actually constructed.
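As a toy illustration of one possible framing, here is a minimal sketch of lead scoring: rank homeowners by a model’s predicted probability of listing. The features, labels, and data are all fabricated stand-ins for the life-event data the article describes; this is not the method any of the firms in the article actually use.

```python
# Minimal lead-scoring sketch: which homeowners are likely to list?
# All features and labels are synthetic, purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1_000
X = np.column_stack([
    rng.integers(1, 40, n),   # years_in_home (hypothetical feature)
    rng.uniform(0, 1, n),     # estimated_equity_fraction (hypothetical)
    rng.integers(0, 2, n),    # recent_life_event flag (hypothetical)
])
# Synthetic labels: long tenure plus a life event -> listed the home
y = ((X[:, 0] > 15) & (X[:, 2] == 1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Rank homeowners by predicted probability of listing -> a lead list
scores = model.predict_proba(X_te)[:, 1]
top_leads = np.argsort(scores)[::-1][:10]
print(top_leads, scores[top_leads])
```

The same scaffolding works for the other framings: swap the label for “days on market” or “number of bids” and the classifier for a regressor.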

Improving Education: Issues-Based Strategic Design vs. Data-Driven Analytics

Rich, Motoko, “Some Schools Embrace Demands for Education Data,” NYT, 5/11/2015.

Questions:

[1] Who is the customer for public education in the United States?

[2] What are the fundamental needs driving education reform?

[3] What are the metrics (quantifiable, actionable, linked to the needs) for educational reform?

[4] What data do we have about the education process and its customers?

[5] What would an explanatory analysis of the data in education reveal about the education process and opportunities for reform?

[6] How does one prioritize among the many needs and metrics?

This article neatly captures the opportunity for integration between a data-driven approach to innovation and a strategic, issues-based approach to innovation.  The problem of education reform is a great test case.  The writer describes how the data revolution has extended to education reform and documents the possibilities.  However, the author also quotes doubters who note that “data is not everything” and that issues like creativity, “inspiring students to learn,” and “instilling wonder and curiosity” are perhaps not conducive to data-driven analysis:

Just as doctors need to observe more than blood pressure or cholesterol readings when treating patients, “the same is true in education,” said Pedro Noguera, a professor of education at New York University. “If you only look at the numbers, and you don’t probe and look at the learning environment, the culture of the school or the relationships between teachers and students, you’re going to miss out on a lot.”

The professor of education is speaking only to academic performance numbers. While I would be the first to concede that data is not everything, the professor misses the opportunity to gather data about the broader problem context – the very things that he says are missing:  the learning environment, school culture, and the relationships between teachers and students.  Those are all measurable factors – perhaps not perfectly observable, but even subjective assessments are measurable.

I think that the real unspoken danger of the data-driven approach (and the reason that the issues-based approach is such an important complement) is that we focus on a metric (how much dirt is on the cafeteria floor) simply because it is a metric:  it is easily observed and it is actionable.  Whether it is linked to the underlying needs is debatable (although an argument could be made).  More importantly, the question is how one prioritizes this measure (is it “the one metric that matters,” to use a term from Lean Analytics?) relative to all other measures (and the attendant goals).

The problem of education reform has long been tackled from an issues-based, traditional “strategic” approach.  Somewhat newer to the game is the data-driven approach.  And while many types of soft and hard metrics already exist, the application of this data to performance-based policies is fraught with political land mines.  Strictly from an operational standpoint, the data-driven approach highlights the questions “what data do you need,” “what can you do with the data that you have,” and “how do you get the data that you are missing?”  The first and last are particularly important because, historically, people focus on the data they have and then answer only the questions that data can address.  This is akin to the drunk searching for his keys under the lamp post … because that is where the light is.  We may solve problems linked to the data that we have, but those may not be the problems that we care about, and they may not be linked to the root causes of the problems that we *do* care about.  Worse yet, because of interdependencies, solving problems captured by the data that we can observe may lead to sub-optimal outcomes in the problems that we really care about.

This leads to the bigger question (also alluded to in an earlier post on exploring causes of the Flash Crash):  what data do you need to consider to begin with?  It is easy to say that measuring the amount of dirt on the floor is irrelevant and a waste of time.  But as an extreme, it raises the question:  we need more data – but what data do we actually need?  If we are not going to go looking under the lamp post and instead are going to bring in portable lighting, what is the new area that we need to illuminate?

Constructing proxy variables for the cost of housing

Lahart, Justin, “For the Fed, Nothing Going on but the Rent,” Heard on the Street column, WSJ, 4/27/2015

1. How do you track inflation?

2. How do you track inflation in the cost of housing?

3. How do you track inflation in the cost of rent?

The article is a great mini-case for discussing how economists (and, more generally, how data scientists can and should) create variables to assist in their analysis.  In this case, the article briefly reviews how one measures inflation.  To put it loosely, one defines a basket of goods and tracks the price of those goods over time.  However, people can disagree on what to put in the basket (among other things); hence the Bureau of Economic Analysis (Commerce Department) might come up with a different measure than the Labor Department does (for example).

The cost of housing turns out to be tricky because housing costs are divisible into renters vs. homeowners.  Homeowners are complex, notes the article, because the “price” is anchored to a particular time (date).  People buy their houses at different times, so what “price” do you use in measuring inflation?   A house purchased on California’s Santa Monica coast in 1960 has a very different price than the house next door purchased in 2015.  Moreover, insurance payments and mortgage costs (both then and now) are different.

One proxy for the cost of housing is rent.  “Government statisticians have calculated [housing costs] by estimating what homeowners would pay in rent for comparable living spaces or owner’s-equivalent rent.  So when rents go up, they drive overall housing costs higher.”

What is the problem with using rent?  Consider the sensitivity of rent prices to housing supply and demand.  Indeed, it is sometimes (often?) argued that house prices and rents are inversely related.  When house prices are high, people who cannot afford to buy are forced to rent.  Holding the housing stock constant, the demand for rental properties increases and, given the assumed fixed supply, the cost of rent goes up.  The problem is more complicated than that; the article says more.  But this is also a nice illustration of endogeneity and of the general question of how to construct variables as proxies for unobservable or difficult-to-measure concepts.

Acquiring data or processing data sets for analysis sometimes involves constructing variables for data that you are not able to acquire, or aggregating/processing data into more usable forms.
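To make the proxy idea concrete, here is a minimal sketch of a basket-based price index in which the unobservable cost of owner-occupied housing is replaced by an owner’s-equivalent-rent entry. The weights and prices are invented; actual CPI methodology is far more involved.

```python
# Minimal sketch of a basket-based price index. There is no market
# price for living in a home you own, so we proxy it with owner's-
# equivalent rent (what the home would rent for). All numbers invented.
basket = {
    # item: (weight, price_last_year, price_this_year)
    "food":              (0.15, 100.0, 102.0),
    "energy":            (0.10, 100.0,  95.0),
    "rent":              (0.08, 100.0, 103.5),
    "owners_equiv_rent": (0.24, 100.0, 103.5),  # the proxy variable
    "everything_else":   (0.43, 100.0, 101.0),
}

def index_change(basket):
    """Weighted average of item-level price relatives."""
    return sum(w * (new / old) for w, old, new in basket.values())

print(f"Index change: {index_change(basket):.4f}")  # 1.0135 -> ~1.35% inflation
```

Note how a rise in market rents moves both the “rent” and “owners_equiv_rent” entries, which is exactly the article’s point about rents driving overall housing costs.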

Evolution of the scheduling problem in media and entertainment

Flint, Joe, “Equations Change for TV-Show Schedulers,” WSJ, 5/11/2015

1. What data do over-the-air broadcasters have about viewers and their viewing habits?

2. How do they acquire this data?

3. What data do entertainment media content producers have about viewers and their viewing habits?

4. How do they acquire this data?

5. Creating the broadcast schedule is a classic optimization question.  What are the decision variables in this optimization?

6. What business objective is the broadcaster trying to optimize in this problem?

7. What types of constraints does the scheduling assignment problem face?

The article discusses the problem of creating the broadcast schedule for the major television networks (ABC, CBS, NBC, and Fox).  This problem is separable into at least two pieces:  how to select shows and how to assign shows to specific slots in the broadcast schedule.  The optimization has traditional constraints such as show length, the number of days in the week, the time slots in a day, etc.
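Setting show selection aside, the slot-assignment piece can be sketched as a textbook assignment problem. This is a toy formulation with made-up viewership predictions, not how networks actually schedule, and it ignores the game-theoretic wrinkles discussed next.

```python
# Minimal sketch: assign shows to slots to maximize predicted viewers.
# Predicted-viewership numbers are fabricated for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

shows = ["sitcom_A", "drama_B", "reality_C", "news_D"]
slots = ["Mon 8pm", "Mon 9pm", "Thu 8pm", "Thu 9pm"]

# predicted_viewers[i, j] = expected audience (millions) if show i airs in slot j
predicted_viewers = np.array([
    [6.1, 4.2, 5.5, 3.9],
    [3.8, 5.9, 4.1, 6.3],
    [7.2, 6.5, 6.8, 5.1],
    [4.0, 3.1, 3.5, 2.8],
])

# linear_sum_assignment minimizes cost, so negate to maximize viewers
rows, cols = linear_sum_assignment(-predicted_viewers)
for i, j in zip(rows, cols):
    print(f"{shows[i]:10s} -> {slots[j]}  ({predicted_viewers[i, j]:.1f}M viewers)")
print("Total:", predicted_viewers[rows, cols].sum(), "million")
```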

However, there is a game-theory dimension in that every network must account for the actions of the others.  “When the sitcom ‘Seinfeld’ was a massive hit for NBC in the 1990s, CBS, Fox and ABC typically tried to counterpunch with dramas hoping to appeal to a different audience.”  There are even more subtle issues of gamesmanship:  “[F]ormer NBC scheduler Preston Beckman says he and his bosses wanted Jennifer Aniston so badly for ‘Friends’ that the network scheduled popular made-for-TV movies against the CBS series she was on, ‘Muddling Through,’ hoping to kill it.  The gambit worked, he said.”

The business objective is not trivial.  Is the goal to maximize viewers?  Is the goal to maximize advertising dollars (which is a function of viewers)?  In some cases, the goal includes building an audience for syndication and/or an afterlife in reruns:  “Hit prime-time shows also have a strong shot at a cash-filled afterlife in reruns, especially on the Web, ready for the binge-watcher.  Witness the recent licensing of ‘Friends’ to Netflix.”

“Another goal, schedulers say, is to keep viewers from surfing between shows once the TV is on.”  Reduced to cliché, this is the idea that it is easier to retain an existing customer than to acquire a new one.  In the microcosm of the viewing day, how do I generate a schedule that retains viewers?  “Exploiting so-called launch pads remains a key to scheduling.  NBC’s singing-competition show ‘The Voice’ draws a huge audience and the network was able to use that last season to get its drama ‘The Blacklist’ off the ground.  That worked so well NBC figured ‘The Blacklist’ could become its own launch pad on Thursday night at 9 p.m.”  But there are interesting “association rule mining” implications for this strategy.  When ‘The Blacklist’ moved, its viewership declined, and the show that was scheduled after ‘The Voice’ also failed.
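The launch-pad logic can be phrased in association-rule terms. Here is a minimal sketch, on fabricated viewing sessions, of the support, confidence, and lift of the rule “watched ‘The Voice’ → stayed for ‘The Blacklist’.”

```python
# Minimal association-rule sketch of the launch-pad effect.
# Each session is the set of shows one household watched that night.
# Sessions are fabricated for illustration.
sessions = [
    {"The Voice", "The Blacklist"},
    {"The Voice", "The Blacklist"},
    {"The Voice"},
    {"The Blacklist"},
    {"The Voice", "The Blacklist"},
    {"local news"},
]

def rule_stats(sessions, a, b):
    n = len(sessions)
    n_a = sum(a in s for s in sessions)
    n_b = sum(b in s for s in sessions)
    n_ab = sum(a in s and b in s for s in sessions)
    support = n_ab / n
    confidence = n_ab / n_a          # P(B | A)
    lift = confidence / (n_b / n)    # > 1 means A's audience stays for B
    return support, confidence, lift

print(rule_stats(sessions, "The Voice", "The Blacklist"))
```

If the lift is driven by the pairing rather than by ‘The Blacklist’ itself, moving the show breaks the rule – consistent with what happened when it moved to Thursday.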

Where does data come from?  “These days, schedulers … are often looking at how shows perform on other platforms to determine the best time for a show.  One aim is to convert delayed viewers into live viewers.”  (How would you measure that?  How would you create proxies for that?)

Measuring the objective is further complicated by the growth in time-shifting and alternative viewing channels.  Finally, the influence of social media cannot be discounted.  “While time shifting will increase, schedulers say a show’s success is still indelibly linked to strong viewing during initial air times.”

Knowledge discovery vs. fraud analysis: How problem framing governs the data subset(s) you analyze and where/how you look.

Hope, Bradley and Ackerman, Andrew, “Clues Overlooked in ‘Flash Crash’ Probe,” WSJ, 4/27/2015

Questions:

[1] Hindsight is twenty-twenty, but how are you supposed to know what clues to look for in advance?

[2] The proverbial joke is about the drunk who looks for his lost keys under a lamp post, not because he lost the keys there, but because that is where the light is.  In a large search space (a sea of data), how do you know where to look (which data to include in the exploration)?  Answer:  How you frame the problem governs where you look and, as a consequence, colors the nature of your analysis and conclusions.

There are theses and best-sellers still to be written about the Flash Crash.  However, this particular article highlights a particularly salient detail for general data analytics projects.  Where people tend to think of data analytics as a tool to solve specific problems, data mining is also about “knowledge discovery” – put more generally and colloquially, simply sifting through the data to see what interesting things might surface.  With 20/20 hindsight, it seems obvious that investigators should have considered not only the influence of actual trades on market stability, but of orders as well.  The article cites Maureen O’Hara, a former member of the joint committee (commissioned by the Securities and Exchange Commission and the Commodity Futures Trading Commission to investigate the causes of the crash) and a professor specializing in market structure at Cornell University, who said investigators “should have seen this.”  “Nowadays, market manipulation doesn’t just involve the trades.  It’s about the orders,” she added.

Whether it is true or not, at least one particular narrative of this incident highlights the reality that how you frame the question affects what data you include in your analysis and can/will color your results.  The article notes that “[w]hile investigators had access to the full set of data from that day (of the flash crash), they focused on a subset related to actual trades, the committee members said.  Had investigators delved deeper into the bigger set that included all the bids and offers entered, they said, they likely would have noticed that Mr. Sarao single-handedly put enormous pressure on a key futures contract tied to the S&P stock index by making bids and quickly canceling them in a bluffing tactic known as ‘spoofing.’”  In other words, how the problem was framed affected where (in what data) investigators searched for evidence of fraud and manipulative behavior.
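To see why the framing matters operationally, consider a toy screen that only works if you look at the order data rather than executed trades: flag accounts with unusually high cancellation ratios. This is purely illustrative – it is not the investigators’ method, and the records and threshold are invented.

```python
# Minimal sketch: a spoofing-style screen that requires ORDER data,
# not just trades. Records and threshold are hypothetical.
from collections import defaultdict

orders = [
    # (account, outcome) where outcome is "filled" or "cancelled"
    ("acct_1", "filled"), ("acct_1", "cancelled"), ("acct_1", "filled"),
    ("acct_2", "cancelled"), ("acct_2", "cancelled"), ("acct_2", "cancelled"),
    ("acct_2", "cancelled"), ("acct_2", "filled"),
]

counts = defaultdict(lambda: {"filled": 0, "cancelled": 0})
for acct, outcome in orders:
    counts[acct][outcome] += 1

CANCEL_RATIO_THRESHOLD = 0.75  # arbitrary cutoff for illustration
for acct, c in counts.items():
    ratio = c["cancelled"] / (c["cancelled"] + c["filled"])
    if ratio >= CANCEL_RATIO_THRESHOLD:
        print(f"{acct}: cancel ratio {ratio:.0%} -- worth a closer look")
```

An analysis restricted to executed trades never sees the cancelled bids at all, which is exactly O’Hara’s point.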

Note that I emphasize that this is just one narrative.  The article also cites Andrei Kirilenko, a former CFTC chief economist who also worked on the investigation.  “[He] disputes the contention that investigators overlooked or gave short shrift to key evidence of manipulation.  ‘All data [were] being used,’ he said.  ‘We were looking for statistical evidence of something that explains this enormous systemic event.’”  Mr. Kirilenko says they did that by pinpointing the Waddell & Reed trade that triggered the crash.  Mr. Kirilenko, who led the CFTC’s portion of the investigation, disputed that his group missed Mr. Sarao’s alleged misconduct, arguing that his reported activity was statistically insignificant.

Having stats is not enough … the “right” stats; Football and moneyball

Football and “moneyball” innovation …

It was said that “practice makes perfect.”

Then, “perfect practice makes perfect.”

Used to be … statistics and big data were it.

Now … the right statistics make it work.

But not just any “right statistics,” because what makes statistics the “right” statistics?  The model and the numbers have to match … you have to have the right kind of measurables for the models … but the models and measurables also have to match your business strategy.  See the NFL.

Clark, Kevin, “Why the Draft is More Awkward Than Ever,” WSJ, 4/28/2015

With all this information, you might think that it would be easier than ever to pick the right players.

Problem: … players are changing.  The data that people collect does not align well with what they want to use it for.  The NFL has data metrics (and knows how to use them) for its current strategies.

But the new players are different, and the old metrics do not measure the performance of new players in new strategies well.  The old stats do not do a good job of predicting how well these new types of players will perform.

The training is different … players who are entering the league today aren’t necessarily trained to do what the NFL teams want to do.

Example:  how to find a player (from basketball) who can perform well as a tight end (see the WSJ article on the basketball player who could be a tight end).

“… convergence of factors in college and high school … that change the way that young players learn skills – a new kind of draft that team executives don’t quite know what to do with …”

“The heart of the matter is that teams haven’t yet figured out which positions are valuable in 2015.”

PROBLEM 1.  How to find “great” in non-traditional places, because the statistics are not good measures – these people aren’t trained in the right way to perform well on the statistics, but they could learn fast … OR these people have not played the sport, and so you do not have any statistics on them.

PROBLEM 2.  This is a game.  The game is adapting to the skills of the new players … you don’t know which statistics matter … and the statistics you have are optimized for old strategies … and you have new strategies.  You don’t know how your old stats fit the new strategies, and you don’t have models (and stats) for the new strategies.

“These developments, draft analysts say, have created a world where teams are increasingly reliant on combine statistics to gauge what a prospect can do.”

“That’s where people look at measurables,” NFL Films senior producer Greg Cosell said.  “They have to say, ‘We think he can do this,’ but they don’t really know … it’s all projections.”

Sports analytics

How to innovate – new strategies

Data framing – designing new strategies, adapting

Data acquisition – what are the variables, what is the data that you have?  How to match that with what you already know?

Marketing Analytics for Forecasting


Binkley, Christina, “How Fashion Retailers Know Exactly What You Want,” WSJ, 4/30/2015

Questions:

[1] What types of questions would bricks-and-mortar retailers like to ask in order to improve their financial performance?

[2] What types of data do bricks-and-mortar retailers have about their customers and customer behavior?

[3] How do bricks-and-mortar retailers collect this data?

[4] What types of questions can retailers answer with the data that they have?

[5] What experiments can retailers run to answer questions that they cannot answer with the data that they have?

This article is about the application of data analytics in the retail fashion industry.  It is an excellent article for initiating a discussion about the entire data analytics process and highlighting the collaboration required between business people and data analysts to provide value.  The article highlights a company called APT, “a specialist in cause-and-effect analytics” in the role of data analyst.  Firms in the retail fashion industry, including those who both design and operate their own stores as well as retailers (e.g. Lane Bryant, Chico’s) are cast in the role of business people.  Questions that the business people would like to answer include:

– which types of promotions (percentage discounts or absolute dollar values) yield the greatest lift in sales?  One firm found that certain customers (personas) responded better to percentage discounts while others responded best to an absolute discount.

– what level of discounting draws new customers versus cannibalizes sales from existing customers (see the sketch after this list)?  For example, when Chico’s offered steep discounts, it discovered that the discounts caused sales to spike, driven by purchases from its most loyal customers, who would have purchased anyway.  When the discounts ended, sales dropped back to normal levels.

– are there “key products” that drive demand for complementary products, in the sense that the complementary product has a negative cross-price elasticity of demand?  Note:  market basket analysis might reveal that certain items are often purchased together (correlation), but that is different from a causal relationship.  For example, one retailer discovered that introducing golf apparel into its product mix increased the sales of other products, so the retailer introduced golf apparel across a large percentage of its stores.
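Here is a minimal sketch of the cannibalization check from the second bullet, on fabricated transactions: compare promotion-period lift for loyalty members (who would likely buy anyway) against new customers.

```python
# Minimal cannibalization check on fabricated sales records.
import pandas as pd

sales = pd.DataFrame({
    "week":    ["baseline"] * 4 + ["promo"] * 4,
    "segment": ["loyal", "loyal", "new", "new"] * 2,
    "revenue": [100, 110, 40, 35, 180, 175, 45, 42],
})

avg = sales.groupby(["segment", "week"])["revenue"].mean().unstack()
avg["lift"] = avg["promo"] / avg["baseline"] - 1
print(avg)
# If nearly all of the lift sits in the "loyal" row, the discount is
# mostly cannibalizing full-price purchases from existing customers.
```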

Part of the general analytics framework, however, also recognizes that certain questions that one would like to answer cannot be answered from the data on-hand.  In these circumstances, one needs to design an experiment to gather the necessary information.  The article cites the retail analytics specialist, APT, noting that “the amount of testing by bricks and mortar retailers has increased by 10% each year.”

For example, retailers have run tests on consumer responses to new fabrics and new styles.  Interestingly, some stores even test apparel displays.   Lane Bryant tested the layout of active wear in their stores:  how to arrange tops, bottoms, and sports bras.  This idea of testing consumer response to layouts parallels the A/B testing of page layouts in the digital online space.
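For a flavor of how such a test might be read, here is a minimal sketch of a two-proportion z-test comparing conversion under two layouts. The counts are invented; APT’s actual methodology is not described in the article.

```python
# Minimal sketch: two-proportion z-test for a layout A/B test.
# Counts are fabricated for illustration.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 310, 5_000   # control layout: purchases, visitors
conv_b, n_b = 362, 5_000   # new layout

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))    # two-sided

print(f"lift: {p_b - p_a:+.2%}, z = {z:.2f}, p = {p_value:.3f}")
```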

One significant element of experimental design, highlighted implicitly in the article, is the question of the sample population.  In the digital space, a large firm like Google or Amazon has sufficient scale to run dozens or hundreds of 1% tests simultaneously and be statistically certain of a truly random sample.  This is much more difficult for start-ups as well as for bricks-and-mortar retailers.  Firms that rely upon loyalty programs are vulnerable to sampling bias and must think consciously about how to move beyond local maxima.  The article cites examples of experiments where firms discover significantly different behavior in one region vs. another (e.g., Florida), in one customer segment (females vs. males), or across personas (customers who respond to different types of discounting).  The article states that “Chico’s captures data on 90% of sales through its loyalty program.”  In context, it is not clear whether this means that 90% of all sales are to customers who are in the loyalty program, or that, of sales to loyalty-program customers (whatever percentage of sales that happens to be), 90% are documented.  What difference does that make in decision-making?  For example, Nike is both a retailer and a manufacturer.  While the majority of Nike product sales flow through third-party retailers, most of Nike’s data on consumers comes from its Nike outlet and Nike retail stores.  What types of questions can Nike answer given the limited data that it has?  What types of experiments can it run given so biased a sample population?

Managing by data

Discuss this as a framing question.

Discuss this (in part) as a data acquisition problem.  The point is to merge information in the organization.

Point out that this is what being an MBA and doing data analytics is all about.

Note that someone else creates the dashboards.  Your goal is to ask the questions, get the data, and make the decision.

Note also the quote in the article – this is not big data.  This is just data.

Also point out the industrial-organization issues here:  markets vs. hierarchies, and how, in this context, you are co-locating information with the actual people who need to execute and, as a result, you are also moving decision authority to the outer edges of the organization.  NOTE that in the original markets vs. hierarchies work, the idea was that IT would allow you to combat agency costs by decreasing the information asymmetry between the decision-makers (at the top of the food chain) and the people who executed actions and generated the data (at the leaves of the organizational tree).  Instead, the effect of this information transparency is to move decision rights to the edges of the tree, flattening the organization.  You still end up with a flatter organization, but the big difference is where the decision rights move – to the center or to the leaves (is that right?  I may well have mis-remembered the markets vs. hierarchies paper … I probably did misremember it 🙂 ).

Data is Now the New Middle Manager
by: Christopher Mims
Apr 20, 2015

TOPICS: Data

SUMMARY: A curious thing is happening at many startups. Firms are keeping head counts low, and even eliminating management positions and replacing them with something you wouldn’t immediately think of: data. The key is turning a company’s data into a dashboard that anyone in the firm can use. Startups are nimbler than they have ever been, thanks to a fundamentally different management structure, one that pushes decision-making out to the periphery of the organization. And front-line employees are able to make decisions that were once made by managers, because they have essentially unlimited access to data. Several concrete examples are provided in the article that illustrate the article’s most important points.

CLASSROOM APPLICATION: This is a fascinating article. It illustrates how data and in particular cloud-based data is improving and changing how decisions are made in corporations. Talk to your students about the evolution of data-usage in corporations. Talk specifically about the applications of cloud-based data in the article and how it is replacing middle managers in business startups.

QUESTIONS: 
1. (Introductory) Briefly explain what “cloud-based” data means.

2. (Advanced) Briefly relate how cloud-based data is replacing middle managers in business startups.

3. (Advanced) On a scale of 1-10 (10 is high), how important is cloud-based data to business organizations?

Applying analytics to match innovations to firms

Boulton, Clint, “Venture Capitalists Play Matchmaker,” WSJ, 4/16/2015

Questions.

[1] How do established firms identify threats to their existing market positions?

[2] How do new start-ups create leads (lead generation) for their products?

[3] How do new start-ups who are pivoting, re-visiting product-market fit, or considering expansion into adjacent markets (see: Horizon 2 Innovation in Terwiesch and Ulrich’s Innovation Tournaments) identify likely target markets?

[4] How do established firms with known problems identify emerging technologies that may solve those problems?
This is a short note in CIO Journal discussing how emerging mobile, security, and analytics trends are pressuring CIOs to look beyond traditional IT suppliers.  The problem is, “how to learn about and vet these new firms?”

Research at Haas on the next generation of CIOs reveals that some CIOs participate on VC (venture capital) boards precisely for this reason – to learn about new technologies and to vet firms.  (See research by Karsten Zimmerman and Jim Spitze.)

Today, it seems that VCs are creating custom forums where the program is constructed from a single VC firm’s portfolio companies.  CIOs are invited to attend and learn not only about the problem domain but also about possible solutions in the form of venture-backed start-ups.  This is essentially the VC providing lead generation for its firms.  The WSJ labels this “Sand Hill Road Speed Dating.”  Throughout a full-day session, CIOs meet with a founder/team for 30 minutes and then move on to the next team.  The article cites Andreessen Horowitz and Sequoia Capital as engaging in such exercises.  As a success, the article cites Equinix CIO Brian Lillie, who was introduced to the then-unknown software container technology and a company called Docker.  Lillie is quoted:  “We can’t just invest things and lose money, so the VCs really help with the vetting of startups.”  From an analytics perspective, perhaps the dating metaphor is apt.  If one can generate the appropriate data representation for a problem and for a start-up’s solution, then the issue reduces to one of matching and ranking.
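Here is a minimal sketch of that matching-and-ranking idea, assuming one simply embeds problem statements and startup pitches as TF-IDF vectors and ranks by cosine similarity. All of the text, including the company names, is invented.

```python
# Minimal sketch: rank startup pitches against a CIO's problem statement.
# All text and names are fabricated for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

problem = ["we need lightweight application packaging and deployment across clouds"]
pitches = {
    "ContainerCo":  "container platform for packaging and deploying applications",
    "SecurityCo":   "endpoint threat detection and incident response",
    "AnalyticsInc": "real-time analytics dashboards for marketing teams",
}

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(problem + list(pitches.values()))
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

for name, score in sorted(zip(pitches, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {name}")
```

In practice the hard part is the representation itself – bag-of-words over a two-line pitch is a crude stand-in for what a VC learns through diligence.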

At some level, VCs have long provided lead generation in one form or another, and incubators and accelerators have adopted a similar model; rather than taking an equity stake in a start-up, they might charge a monthly fee for rent and overhead and open a separate revenue stream by building partnerships with corporate interests who “subscribe” to the incubator for the right to learn about the newest technologies – as potential suppliers, customers, partners, disrupters, or even acquisition targets.  My understanding is that Duncan Logan (founder of RocketSpace), is a pioneer in this model (see: RocketX).

Fastest recreational runners?

Helliker, Kevin, “The State of U.S. Running: Strong, but Slow,” WSJ, 4/20/15

Questions.
[1] Where would you find data to answer the question, “what nation has the fastest recreational runners in the world?”
[2] What parameters of the question would you choose to narrow down and/or define precisely in order to answer the question?
[3] How would you analyze the data?
[4] How might you challenge or question the conclusions in the article?

The WSJ has a column titled ‘The Count’ where they publish fun, data-driven analyses. Here, in the context of the Boston Marathon, they wanted to ask which nation had the fastest recreational runners. This is a great example of exploratory data analysis to help answer a question. They had to make a number of different assumptions to narrow the scope of the question and to answer the question in a precise way. There is no sense of a “right” or “wrong” answer here. This is just a fun analysis, and it is a useful question to analyze in a managerial context because, if you only see a ranked list as an answer, without understanding the data or the analysis, you might not understand the impact on decision-making.

For example, although one could talk about recreational runners in general, the “fun” study looked only at marathon runners. They separated male versus female runners, but could also have broken things down by age range. There are two levels of possible sampling that immediately arise. They only drew data from 12 marathons – three in the US and nine in Europe. What types of bias does this introduce into the sample? Only 47 nations were covered because they required that a nation field at least 10 runners of each gender per year, and 100 or more total runners over the years studied (2009 to 2014).
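Here is a minimal sketch of that eligibility filter and ranking, assuming a hypothetical finisher-level table with columns nation, year, gender, and finish_minutes. The column does not describe its exact computation; this is one plausible reading.

```python
# Minimal sketch of the eligibility filter and ranking, on a
# hypothetical DataFrame of finishers:
# columns = nation, year, gender, finish_minutes.
import pandas as pd

def rank_nations(df: pd.DataFrame) -> pd.Series:
    # Keep nations with >= 10 runners per gender per year...
    per_cell = df.groupby(["nation", "year", "gender"]).size().unstack(fill_value=0)
    ok_yearly = (per_cell >= 10).all(axis=1).groupby("nation").all()
    # ...and >= 100 runners overall.
    ok_total = df.groupby("nation").size() >= 100
    keep = ok_yearly & ok_total
    eligible = df[df["nation"].isin(keep[keep].index)]
    # Rank by median finish time (the mean, which the column seems to
    # use, is more sensitive to a long tail of slow finishers).
    return eligible.groupby("nation")["finish_minutes"].median().sort_values()
```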

This is not meant to be an exhaustive critique (not criticisms, just fun things to think about):
– For smaller countries, how many times is the same runner counted vs. for larger countries?
– Given the international nature of the races, are the samples from European vs. US races larger or smaller?
– What about the prominence of the race? More prominent races are likely to have more “scrubs,” thereby bringing down an average.
– What about the age of the runners? Are certain populations older or younger, thereby skewing the numbers one way or another?
– How is “recreational” runner defined? Anyone who is not a sponsored runner?