Key to a Successful Crowdfunding Campaign



Crowdfunding has become one of the main sources of initial capital for small businesses and startups that are looking to launch their first products. Websites like Kickstarter and Indiegogo provide a platform for millions of creators to present their innovative ideas to the public. This is a win-win situation: creators can raise initial funds, while the public gets access to cutting-edge prototype products that are not yet available on the market.

At any given point, Indiegogo has around 10,000 live campaigns while Kickstarter has around 6,000, so it has become increasingly difficult for a project to stand out from the crowd. Of course, advertising through various channels is by far the most important factor in a successful campaign. However, creators with a smaller budget are left wondering:

"How do we increase the probability of success of our campaign starting from the very moment we create our project on these websites?"

In particular, here are some of the questions we want to answer:

  • How do we write a better project title or blurb?
    • Does the length matter?
    • What are the keywords we should include?
  • How do we set our reward tiers so that people are more likely to pledge?
  • Does creating the project under a company rather than a person make a difference?
  • Does "aim high, shoot higher" work?
    • Is there a correlation between overall percentage funded and initial pledge goal?
  • ... and many more

Project Proposal

The project is divided into two stages. In the first stage, I will accumulate as much data as possible by scraping the aforementioned websites. I will then perform a preliminary analysis of the data to identify the key variables and factors to include in the supervised learning process. In the second stage, I will train a supervised learning model on data from past projects and use it to predict the success or failure of a set of ongoing projects, which serve as testing data.

Part 1: Data Mining

  1. Scrape data from all 300,000+ successful projects
  2. Attempt to obtain data from previous failed projects
  3. Follow the website for a week or two to keep track of new projects every day
  4. For the next 30-40 days (typical Kickstarter campaign duration), record daily updates of these projects
  5. Clean this dataset up to be used as training data for supervised machine learning
  6. Follow the website for another week or two to obtain new projects as testing dataset

Part 2: Machine Learning

  1. Using the training dataset, apply machine learning techniques such as Random Forests
  2. Test our model on the testing dataset

Possible factors to be included in the model:

  • length and content of project title
  • length and content of project blurb
  • combination of pledge tiers
  • overall campaign goal
  • project category
  • location
  • etc.
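As a rough illustration of how some of these factors could be turned into model inputs, here is a minimal sketch. The function name `extract_features`, the chosen features, and the example record are all hypothetical; only the field names (`title`, `blurb`, `goal`, `pledge.tier`) come from the datasets described under Data Sources. Categorical factors such as category and location would need their own encoding and are left out.

```python
def extract_features(project):
    """Turn one scraped project record (a dict) into simple numeric features.

    Hypothetical helper: the feature set is an illustrative assumption,
    not the final list of variables for the model.
    """
    title = project.get("title", "")
    blurb = project.get("blurb", "")
    tiers = project.get("pledge.tier", [])
    has_non_alpha = any(not (c.isalpha() or c.isspace()) for c in title)
    return {
        "title_words": len(title.split()),
        "blurb_words": len(blurb.split()),
        "title_has_non_alpha": int(has_non_alpha),
        "num_tiers": len(tiers),
        "min_tier": min(tiers) if tiers else 0.0,
        "goal": float(project.get("goal", 0.0)),
    }


# Made-up record, used only to show the shape of the output.
example_project = {
    "title": "Example Gadget 2.0: A Smarter Mug",
    "blurb": "A mug that keeps your coffee warm all day.",
    "goal": 5000.0,
    "pledge.tier": [15.0, 35.0, 95.0],
}
features = extract_features(example_project)
```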

Data Sources

All of my raw data are scraped from Kickstarter.

There are currently three raw datasets:

  1. First 4000 live projects that are currently campaigning on Kickstarter (scraped with Python code using BeautifulSoup)

    • Last updated: 2016-10-29 5pm PDT
    • amt.pledged: amount pledged (float)
    • blurb: project blurb (string)
    • by: project creator (string)
    • country: abbreviated country code (string of length 2)
    • currency: currency type of amt.pledged (string of length 3)
    • end.time: campaign end time (string "YYYY-MM-DDThh:mm:ss-TZD")
    • location: mostly city (string)
    • pecentage.funded: unit % (int)
    • state: mostly US states (string of length 2) and others (string)
    • title: project title (string)
    • type: type of location (string: County/Island/LocalAdmin/Suburb/Town/Zip)
    • url: project url after domain (string)
  2. Top 4000 most backed projects ever on Kickstarter (scraped with this code)

    • Last updated: 2016-10-29 6pm PDT
    • amt.pledged
    • blurb
    • category: project category (string)
    • currency
    • goal: original pledge goal (float)
    • location
    • num.backers: total number of backers (int)
    • num.backers.tier: number of backers for each pledge amount in pledge.tier (int[len(pledge.tier)])
    • pledge.tier: pledge tiers in USD (float[])
    • title
    • url
  3. Top 4000 most backed projects on Kickstarter with creators
    • Last updated: 2016-10-30 10pm PDT
    • was not used in the data analysis of this proposal
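The end.time field can be converted into a timezone-aware timestamp, for example to compute how many days a live campaign has left. The sketch below assumes the "TZD" part is written as a numeric offset such as `-08:00`; the function names and the example values are hypothetical.

```python
from datetime import datetime, timezone


def parse_end_time(text):
    """Parse an end.time string such as '2016-11-28T17:00:00-08:00'.

    Assumes the time zone designator is a +hh:mm or -hh:mm offset.
    """
    return datetime.fromisoformat(text)


def days_remaining(end_time_text, now):
    """Days (possibly fractional) between `now` and the campaign end."""
    end = parse_end_time(end_time_text)
    return (end - now).total_seconds() / 86400


# Made-up example: a scrape taken at 2016-10-29 5pm PDT, i.e. midnight UTC.
scrape_time = datetime(2016, 10, 30, 0, 0, tzinfo=timezone.utc)
remaining = days_remaining("2016-11-28T17:00:00-08:00", scrape_time)
```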



The website has certain rules for generating URLs. For each seed value, the page number goes up to at most 200. Given that there are 20 projects per page, we can get at most 200 × 20 = 4000 unique data entries per iteration. Therefore, if we want to analyze all of the 300,000+ projects on Kickstarter, we would need to generate many more seed values, with a mix of different sorting orders, to obtain a much larger sample before removing duplicate entries.
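The arithmetic and the de-duplication step can be sketched as follows. This is an illustration only: it assumes each scraped record is a dict whose `url` field uniquely identifies a project, and the helper names are hypothetical. It does not reproduce the actual scraping code.

```python
import math

PAGES_PER_SEED = 200      # page number limit per seed value
PROJECTS_PER_PAGE = 20
MAX_PER_SEED = PAGES_PER_SEED * PROJECTS_PER_PAGE  # 4000 entries per iteration


def min_iterations(total_projects):
    """Lower bound on iterations needed if no two iterations overlapped."""
    return math.ceil(total_projects / MAX_PER_SEED)


def merge_batches(batches):
    """Combine several scraped batches, keeping the first record per url."""
    seen = {}
    for batch in batches:
        for project in batch:
            seen.setdefault(project["url"], project)
    return list(seen.values())


# Made-up batches from two different seeds, with one overlapping project.
batch_a = [{"url": "/projects/a"}, {"url": "/projects/b"}]
batch_b = [{"url": "/projects/b"}, {"url": "/projects/c"}]
merged = merge_batches([batch_a, batch_b])
```

In practice the iterations will overlap, so the real number of seeds needed is larger than this lower bound.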

Another downside of the data scraped from Kickstarter is that past unsuccessful projects are not indexed. Failed projects can only be accessed if one has the original link (e.g. this laser razor project that was successfully funded but suspended by Kickstarter administration). Hence, if we were to use our current "Most Backed" data for machine learning, the training sample would contain almost no failures, so the result would be skewed and might lead to false positives.

One way to get around this issue is to follow the current live projects until they end (the average duration is 30 days). In addition to a relatively fair sample, we would also gain meaningful additional data, such as the trend in pledge amount or number of backers over time.
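A minimal sketch of this tracking idea is shown below. It assumes one scrape per day, that `url` identifies a project, and that a campaign counts as successful once it reaches 100% of its goal at the end; the helper names and the sample numbers are hypothetical.

```python
from collections import defaultdict


def record_snapshot(history, day, projects):
    """Append today's pledged amount for every live project."""
    for p in projects:
        history[p["url"]].append((day, p["amt.pledged"]))


def daily_gains(series):
    """Change in pledged amount between consecutive snapshots."""
    ordered = sorted(series)
    return [later[1] - earlier[1] for earlier, later in zip(ordered, ordered[1:])]


def is_funded(percent_funded):
    """Label for supervised learning: did the campaign reach its goal?"""
    return percent_funded >= 100


# Made-up snapshots for a single project over three days.
history = defaultdict(list)
record_snapshot(history, 1, [{"url": "/projects/a", "amt.pledged": 100.0}])
record_snapshot(history, 2, [{"url": "/projects/a", "amt.pledged": 250.0}])
record_snapshot(history, 3, [{"url": "/projects/a", "amt.pledged": 400.0}])
gains = daily_gains(history["/projects/a"])
```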


Indiegogo has a unique feature that Kickstarter does not: flexible funding. This feature makes it difficult to define a performance benchmark. Moreover, the URLs generated by the website do not have a generating parameter, which makes my current scraping method inapplicable. However, Indiegogo does have a much larger database than Kickstarter, so I would have to investigate how to extract data from their website.

Key Plot 1

Word Count of Project Title and Blurb - Does it matter?


  • Most Backed projects tend to have slightly longer titles than live projects.
  • Most Backed projects tend to have slightly (insignificantly) shorter blurbs than live projects.
  • We may further investigate whether longer project titles attract more backers.
  • In addition to the length, content should matter, too. We should investigate whether using invented words as product names correlates with the number of backers. We should also investigate whether non-alphabetical characters play any significant role.
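The length comparison and the non-alphabetical character check could be computed along these lines. This is only a sketch: words are split on whitespace, the helper names are hypothetical, and the two tiny samples are made up to show the calls, not taken from the scraped data.

```python
from statistics import mean


def word_count(text):
    """Number of whitespace-separated words."""
    return len(text.split())


def non_alpha_chars(text):
    """Characters that are neither letters nor whitespace."""
    return [c for c in text if not (c.isalpha() or c.isspace())]


def mean_word_count(projects, field):
    """Average word count of `field` ('title' or 'blurb') over a dataset."""
    return mean(word_count(p[field]) for p in projects)


# Made-up samples standing in for the two datasets.
most_backed_sample = [{"title": "A Smarter Mug 2.0"}, {"title": "The Tiny Desk Lamp For Makers"}]
live_sample = [{"title": "Smart Mug"}, {"title": "Desk Lamp For Makers"}]

title_gap = mean_word_count(most_backed_sample, "title") - mean_word_count(live_sample, "title")
symbols = non_alpha_chars("A Smarter Mug 2.0")
```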

Key Plot 2

Difference in % Frequency between Most Backed and Live Projects by State


  • Projects created in California are overwhelmingly more successful than those created in other states.
  • Projects created in Florida are significantly less backed.
  • Projects created on the West Coast are generally more backed than those created on the East Coast.
  • It is worth noting that, although location seems to be quite important based on the above observations, it is impractical to relocate a company or manufacturer just for that purpose. Also, there might be lurking variables behind the geographical distinctions. For instance, Silicon Valley in California might account for most of the success in the technology category.
  • Upon identifying these lurking variables by further investigating differences between projects in California and the other states, we may be able to significantly increase the probability of success for a Kickstarter campaign.
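The quantity plotted here, the difference in percentage frequency per state, could be computed roughly as follows. The sketch assumes every record carries a `state` field as in the live dataset; the function names and the sample records are made up for illustration.

```python
from collections import Counter


def percent_frequency(projects):
    """Share of projects per state, in percent."""
    counts = Counter(p["state"] for p in projects)
    total = sum(counts.values())
    return {state: 100.0 * n / total for state, n in counts.items()}


def frequency_difference(most_backed, live):
    """Most Backed % frequency minus Live % frequency, per state."""
    a = percent_frequency(most_backed)
    b = percent_frequency(live)
    return {s: a.get(s, 0.0) - b.get(s, 0.0) for s in sorted(set(a) | set(b))}


# Made-up records; a positive value means the state is over-represented
# among the most backed projects relative to live projects.
most_backed_sample = [{"state": s} for s in ["CA", "CA", "NY", "FL"]]
live_sample = [{"state": s} for s in ["CA", "NY", "FL", "FL"]]
diff = frequency_difference(most_backed_sample, live_sample)
```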

Key Plot 3

Pledge Amount Tiers that Attract Most Backers


  • From raw data, backers tend to choose smaller pledge amounts, e.g. $15-20, $25-30, $35-40.
  • Upon normalization, $15-20 and $35-40 are still the preferred amounts by most backers. However, several new peaks ($95-100, $115-120, $185-190) emerged. This shows that people are in fact willing to pay more, but most campaigns do not have such tiers.
  • The above observations are further confirmed by the exponential decay of frequency in the tiers as pledge amount increases.
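One plausible way to normalize is to divide the total backers in each $5 bin by the number of tiers offered in that bin, so that rarely offered amounts are not penalized. This is an assumption about the normalization, not a description of how the plot was actually produced; the bin edges (lower bound inclusive), the helper names, and the sample data are likewise hypothetical.

```python
from collections import defaultdict


def bin_start(amount, width=5):
    """Lower edge of the bin containing `amount`, e.g. 17.0 -> 15."""
    return int(amount // width) * width


def backers_per_bin(projects, width=5):
    """Return (raw totals, backers per offered tier) keyed by bin start."""
    raw = defaultdict(int)      # total backers per bin
    offered = defaultdict(int)  # number of tiers offered per bin
    for p in projects:
        for amount, backers in zip(p["pledge.tier"], p["num.backers.tier"]):
            b = bin_start(amount, width)
            raw[b] += backers
            offered[b] += 1
    normalized = {b: raw[b] / offered[b] for b in raw}
    return dict(raw), normalized


# Made-up projects: the $95 tier is rare but well backed where it exists.
sample = [
    {"pledge.tier": [15.0, 95.0], "num.backers.tier": [100, 40]},
    {"pledge.tier": [15.0], "num.backers.tier": [50]},
    {"pledge.tier": [17.0], "num.backers.tier": [30]},
]
raw, normalized = backers_per_bin(sample)
```

In this toy sample the raw totals favor the $15-20 bin far more strongly than the normalized values do, which is the kind of effect described in the bullets above.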

Key Plot 4

Aim High, Shoot Higher?


  • Although the data does not appear to be entirely linear, if we were to fit a linear trendline, the slope would be around 2.25, which is significantly higher than 1. This suggests that higher campaign goals may correlate with even higher final pledged amounts.
  • If we limit the data to initial campaign goals under $1 million, the increasing trend is even more apparent. In fact, an exponential trendline fits better than a linear trendline in this case.
  • Among the most backed projects, "aim high, shoot higher" is generally true.
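For reference, a linear trendline of pledged amount against goal can be fitted with ordinary least squares. The sketch below is a generic implementation with a hypothetical function name and made-up, perfectly linear data; it is not the code behind the plot and does not reproduce the 2.25 slope.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: pledged = 2 * goal + 100.
goals = [1000.0, 2000.0, 3000.0]
pledged = [2100.0, 4100.0, 6100.0]
slope, intercept = fit_line(goals, pledged)
```

A slope above 1 means the pledged amount grows faster than the goal. An exponential trendline can be fitted with the same function by passing the logarithm of the pledged amounts as `ys`.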

Future Directions

  1. Of course, the first step is to follow the project proposal above.
  2. If the resulting model can successfully predict most of the campaign outcomes in the testing dataset, we can develop an application to rate potential Kickstarter campaigns before the creators actually launch them.
    • The application will give a score (probability of success, out of 100) based on the information the potential creator provides.
    • The application will also provide feedback on how the creator can modify their project (e.g. title, blurb, pledge tiers, etc.) to improve their score.

More Plots

Ratio of Amount Pledged to Initial Goal v. Initial Campaign Goal

% Frequencies of Most Backed and Live Projects by State and by Country