Crowdfunding has become one of the main sources of initial capital for small businesses and start-up companies looking to launch their first products. Websites like Kickstarter and Indiegogo provide a platform for millions of creators to present their innovative ideas to the public. This is a win-win situation: creators can accumulate initial funds while the public gets access to cutting-edge prototype products that are not yet available on the market.
At any given point, Indiegogo has around 10,000 live campaigns while Kickstarter has about 6,000, so it has become increasingly difficult for a project to stand out from the crowd. Of course, advertising through various channels is by far the most important factor in a successful campaign. For creators with smaller budgets, however, this leaves them wondering:
"How do we increase the probability of success of our campaign starting from the very moment we create our project on these websites?"
In particular, here are some of the questions we want to answer:
The project is divided into two stages. In the first stage, I will accumulate as much data as possible by scraping the aforementioned websites, then perform preliminary analysis on the data to identify the key variables and factors to include in the supervised learning process. In the second stage, I will train a supervised model on data from past projects and use it to predict the success or failure of a set of ongoing projects as test data.
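As a minimal sketch of the bridge between the two stages, the snippet below turns hypothetical scraped campaign records into (features, label) pairs for supervised learning. The field names (`goal`, `pledged`, `duration_days`, `n_rewards`) are assumptions for illustration, not Kickstarter's actual schema.

```python
# Hypothetical example: converting scraped campaign records into
# (features, label) pairs for the supervised-learning stage.
# Field names are assumed, not Kickstarter's real schema.

def to_training_example(record):
    """Convert one scraped campaign into (features, label).

    The label is 1 if the campaign met its funding goal, else 0.
    """
    features = {
        "goal": record["goal"],
        "duration_days": record["duration_days"],
        "n_rewards": record["n_rewards"],
    }
    label = 1 if record["pledged"] >= record["goal"] else 0
    return features, label

campaigns = [
    {"goal": 5000, "pledged": 7200, "duration_days": 30, "n_rewards": 8},
    {"goal": 20000, "pledged": 3100, "duration_days": 45, "n_rewards": 5},
]

examples = [to_training_example(c) for c in campaigns]
```

Any off-the-shelf classifier can then be fit on such pairs once the full sample is scraped.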
Possible factors to be included in the model:
All of my raw data are scraped from Kickstarter.com.
There are currently two raw datasets:
The website follows certain rules for generating URLs. For each seed value, the page number goes up to at most 200. With 20 projects per page, we can get at most 4,000 non-duplicated data entries per iteration. Therefore, to analyze all of the 300,000+ projects on Kickstarter, we would need to generate many more seed values with a mix of different sorting orders to obtain a much larger sample before removing duplicate entries.
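The URL-generation rule above can be sketched as follows. The query-parameter names (`seed`, `page`, `sort`) and the base URL are assumptions based on the observed pattern, so they may need adjusting to match what the site actually serves.

```python
# Sketch of generating discover-page URLs from one seed value.
# Parameter names ("seed", "page", "sort") are assumed from the
# observed URL pattern, not a documented API.

BASE = "https://www.kickstarter.com/discover/advanced"
PROJECTS_PER_PAGE = 20
MAX_PAGE = 200  # the site stops serving pages beyond this

def discover_urls(seed, sort="magic", max_page=MAX_PAGE):
    """Yield one URL per results page for a given seed and sort order."""
    for page in range(1, max_page + 1):
        yield f"{BASE}?sort={sort}&seed={seed}&page={page}"

urls = list(discover_urls(seed=2681111))
# 200 pages x 20 projects/page -> at most 4,000 entries per seed
```

Iterating this over many seeds and sort orders, then de-duplicating by project ID, gives the larger sample described above.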
Another downside of the data scraped from Kickstarter.com is that past unsuccessful projects are not indexed. A failed project can only be accessed if one has the original link (e.g. this laser razor project, which was successfully funded but then suspended by Kickstarter administration). Hence, if we were to use our current "Most Backed" data for machine learning, the result would be skewed and might lead to false positives.
One way around this issue is to follow the current live projects until they end (the average duration is 30 days). Besides a fairer sample, we would also gain additional meaningful data, such as the pledge-amount trend or the number of backers over time.
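A minimal sketch of that follow-until-the-end idea: take a periodic snapshot of each live project's pledge total and keep the resulting time series. The record layout here is an assumption; the actual scraper supplying the pledge totals is left out.

```python
# Minimal sketch: build a pledge-over-time history for live projects
# by appending a dated snapshot on each scraping run.
# The (day, pledged) tuple layout is an illustrative assumption.

from datetime import date

def record_snapshot(history, project_id, pledged, day=None):
    """Append one day's pledge total to the project's time series."""
    day = day or date.today().isoformat()
    history.setdefault(project_id, []).append((day, pledged))
    return history

history = {}
record_snapshot(history, "proj-1", 1200, day="2016-03-01")
record_snapshot(history, "proj-1", 1850, day="2016-03-02")
```

Running this daily over a campaign's ~30-day life yields both the final success/failure label and the trend features mentioned above.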
Indiegogo has a unique feature that Kickstarter does not: flexible funding. This makes it difficult to define a performance benchmark. Moreover, the URLs generated by the website do not expose a generating parameter, which makes my current scraping method inapplicable. However, Indiegogo has a much larger database than Kickstarter, so I will have to investigate how to extract data from their site.