catalini.com - Christian Catalini


Beware of STATA's insheet command

I've come across a bug in STATA's insheet command that is quite worrisome.

Many power users will have their original datasets in raw text format. This is often the case for data coming from a variety of sources (public datasets, data downloaded or scraped from the Internet, etc.). Not only is it a good habit to store data as raw text, it also reduces lock-in with a specific platform or software. You are more likely to be able to open that file again in 5-10 years if it's in a pure text format.

A common format for raw text files is the tab-delimited one, as tabs rarely appear in data. Each column is separated by a tab symbol (\t) and when you import the data you use the tab to recognize when one variable ends and the next one begins.

Sometimes strings are enclosed within double quotes (" "), but this can create issues if any of the quotes are missing in the original data. Personally, I prefer to avoid double quotes and rely on tabs to isolate one variable from another.

The STATA insheet command has a tab option that is supposed to do just that, i.e. take a raw text file and import it into memory using tabs as delimiters.

Now the problem is that insheet still relies on quotes if it finds them, even if most of the strings in a file are not enclosed in double quotes. This is a serious bug and can lead to some disastrous consequences.

Here is an example:

v1   v2            v3
1    Testing 123   This is a "string" of text
2    Testing 456   This is another "string of text
3    Testing 789   One more "string" of text

 

If you're working on data parsed from a website, data that contains user-generated content, or data that has been entered by a human being, a missing double quote is very likely to occur, as in the v3 column above (row 1 is correct, but in rows 2 and 3 the quotes are not paired). The raw file can be downloaded from here.

This will confuse insheet, which will then import the data incorrectly. Moreover, you won't receive any warning from STATA, and it will look as if the file was imported correctly!

If you type:

   insheet using test.txt, tab clear

Stata will return no errors and output:

   (3 vars, 2 obs)

Your dataset will look like this:

v1 v2 v3 
Testing 123  string of text 
2Testing 456     of text 

 

As you can see, we lost 1 row of data and v3 contains the wrong information. This is because insheet saw the first quote and looked ahead until it could find another quote to close the string.

Now you may think that this is a minor issue and that it's easy to spot while looking at the dataset, but imagine a dataset with 500,000 rows and 50 variables. Unless you know your exact row count, you are very likely to miss a lot of data. What makes the bug even more creepy is that you may lose data not just at the bottom of the file but at any intermediate row, which makes it harder to detect.
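To make the look-ahead concrete, here is a small Python sketch that parses the example two ways: once splitting on tabs and newlines only, and once with a toy rule where a double quote opens a string that runs until the next double quote. The toy rule is my assumption about the kind of logic involved, not Stata's actual code, and the sample text assumes the columns of the example are separated by single tabs.

```python
# Example data from the post, assuming single tabs between columns.
SAMPLE = (
    'v1\tv2\tv3\n'
    '1\tTesting 123\tThis is a "string" of text\n'
    '2\tTesting 456\tThis is another "string of text\n'
    '3\tTesting 789\tOne more "string" of text\n'
)


def split_on_tabs_only(text):
    """Quote-blind parse: every newline ends a row, every tab ends a field."""
    return [line.split("\t") for line in text.splitlines()]


def split_honouring_quotes(text):
    """Toy model (an assumption, not Stata's implementation): a double quote
    toggles 'inside a string', and tabs or newlines inside a string are data."""
    rows, row, field, in_quotes = [], [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "\t" and not in_quotes:
            row.append("".join(field))
            field = []
        elif ch == "\n" and not in_quotes:
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        else:
            field.append(ch)
    if field or row:  # flush a final row that never saw an unquoted newline
        row.append("".join(field))
        rows.append(row)
    return rows


blind = split_on_tabs_only(SAMPLE)
toy = split_honouring_quotes(SAMPLE)

# Subtract 1 for the header row in both cases.
print(len(blind) - 1, "data rows when quotes are ignored")
print(len(toy) - 1, "data rows when a quote binds text until the next quote")
```

On this sample the quote-blind parse keeps 3 data rows, while the toy rule keeps only 2, because the unpaired quote in row 2 swallows the line break and the start of row 3. That is the same row count insheet reported, although the exact cell contents it produced differ from this toy version.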

 

Here's a workaround:

1) MAC or LINUX USERS

Before using insheet, always type

shell wc -l nameofyourfile.txt

In the case above, this would return:

  3 test.txt

Here 3 is the number of rows in your raw text file. If you notice that insheet imports fewer than 3 rows, you can search and replace any quotes (") in the raw file with your favourite text editor before importing.
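If you would rather not depend on wc, the same check can be sketched in a few lines of Python. This is only an illustration: the function names are mine, the demo file is a temporary copy of the example above (columns assumed to be tab-separated), and the file is read as bytes so the encoding does not matter as long as it is ASCII-compatible.

```python
import os
import tempfile


def count_rows(path):
    """Count non-empty lines; a last line without a trailing newline still counts."""
    with open(path, "rb") as handle:
        return sum(1 for line in handle if line.strip(b"\r\n"))


def lines_with_odd_quotes(path):
    """Return 1-based line numbers that contain an odd number of double quotes."""
    with open(path, "rb") as handle:
        return [number for number, line in enumerate(handle, start=1)
                if line.count(b'"') % 2 == 1]


# Demo on a temporary copy of the example data (header row plus three rows).
sample = (
    b'v1\tv2\tv3\n'
    b'1\tTesting 123\tThis is a "string" of text\n'
    b'2\tTesting 456\tThis is another "string of text\n'
    b'3\tTesting 789\tOne more "string" of text\n'
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "test.txt")
    with open(path, "wb") as handle:
        handle.write(sample)
    total = count_rows(path)
    flagged = lines_with_odd_quotes(path)

print("lines in file:", total)            # header row included
print("lines with an odd number of quotes:", flagged)
```

Remember to subtract one for the header row before comparing the line count with the number of observations Stata reports. The list of flagged lines is only a hint about where to start looking; a line with paired quotes can still be affected if an earlier quote was left open.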

 

2) WINDOWS USERS

Before using insheet, always type

shell find /v /c "&randomtext&*" nameofyourfile.txt & pause

This will open a command shell and count the number of rows in your file that do not contain "&randomtext&", i.e. usually every row (feel free to make the string more complex if you want!). On large files, this may take a few seconds, so be patient.

As in the Mac case, the number you see next to the filename is the real number of rows.
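On either platform, once the counts disagree, the quotes have to be replaced in the raw file before import. A text editor works, but for large files a short script may be easier. The sketch below is one possible way to do it in Python; the function name, the choice of a single quote as the replacement, and the demo file names are my assumptions. It writes a cleaned copy and leaves the original untouched, and it works on bytes, which is safe for ASCII-compatible encodings such as UTF-8 or Latin-1 but not for UTF-16.

```python
import os
import tempfile


def replace_double_quotes(source_path, target_path, replacement=b"'"):
    """Copy source to target, swapping each double quote for `replacement`.

    Returns the number of double quotes that were replaced.
    """
    replaced = 0
    with open(source_path, "rb") as source, open(target_path, "wb") as target:
        for line in source:
            replaced += line.count(b'"')
            target.write(line.replace(b'"', replacement))
    return replaced


# Demo on a temporary copy of the example data (tab-separated columns assumed).
sample = (
    b'v1\tv2\tv3\n'
    b'1\tTesting 123\tThis is a "string" of text\n'
    b'2\tTesting 456\tThis is another "string of text\n'
    b'3\tTesting 789\tOne more "string" of text\n'
)

with tempfile.TemporaryDirectory() as folder:
    raw_path = os.path.join(folder, "test.txt")
    clean_path = os.path.join(folder, "test_clean.txt")
    with open(raw_path, "wb") as handle:
        handle.write(sample)
    replaced = replace_double_quotes(raw_path, clean_path)
    with open(clean_path, "rb") as handle:
        cleaned = handle.read()

print("double quotes replaced:", replaced)
print("double quotes left in the cleaned copy:", cleaned.count(b'"'))
```

You would then point insheet at the cleaned copy instead of the original file and check the row count once more.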

 

Enjoy!

Christian 

Last Updated on Monday, 26 September 2011 16:15
 

The Next 36 first cohort: Tradyo wins Best Venture Award

 

From Tradyo.com: "We have cool stuff; you have cool stuff; everyone has cool stuff. The problem is, half of our stuff goes unused when it could be super valuable to someone else. Tradyo enables people to buy and barter the things they don't use with their neighbours in a simple, convenient, and downright fun way. Tradyo uses the GPS function on your smartphone to reveal the cool stuff available around you. The app is curiosity driven - who knows what kind of treasure you'll stumble upon?"

 http://www.tradyo.com 

Last Updated on Thursday, 01 September 2011 17:17
 

Our crowd-funding paper wins summer grant from NET institute

The working paper on the geography of crowd-funding that I've been working on with Ajay Agrawal and Avi Goldfarb received a summer grant from the NET institute.

 About the NET institute: "The Networks, Electronic Commerce and Telecommunications ("NET") Institute is a non-profit institution devoted to research on network industries, electronic commerce, telecommunications, the Internet, "virtual networks" comprised of computers that share the same technical standard or operating system, and on network issues in general. Of particular interest is research on innovation and introduction of new technology in network industries. The NET Institute functions as a world-wide focal point for research and open exchange and dissemination of ideas in these areas. The NET Institute competitively funds cutting edge research projects in these areas of research. It organizes conferences and seminars on these issues." (Source: http://www.netinst.org/)

 

Last Updated on Saturday, 26 March 2011 18:49
 

The Geography of Crowdfunding


Ajay K. Agrawal, Christian Catalini, Avi Goldfarb

NBER Working Paper No. 16820
Issued in February 2011
NBER Program(s):   PR 

Perhaps the most striking feature of "crowdfunding" is the broad geographic dispersion of investors in small, early-stage projects. This contrasts with existing theories that predict entrepreneurs and investors will be co-located due to distance-sensitive costs. We examine a crowdfunding setting that connects artist-entrepreneurs with investors over the internet for financing musical projects. The average distance between artists and investors is about 3,000 miles, suggesting a reduced role for spatial proximity. Still, distance does play a role. Within a single round of financing, local investors invest relatively early, and they appear less responsive to decisions by other investors. We show this geography effect is driven by investors who likely have a personal connection with the artist-entrepreneur ("family and friends"). Although the online platform seems to eliminate most distance-related economic frictions such as monitoring progress, providing input, and gathering information, it does not eliminate social-related frictions. 

http://www.nber.org/papers/w16820 

Last Updated on Saturday, 26 March 2011 18:48
 

Does Distance Matter in Online Entrepreneurial Finance? Evidence from Crowd-Funding in the Arts

Ajay Agrawal, Christian Catalini, Avi Goldfarb

Abstract

The most striking feature of “crowd-funding” for early stage entrepreneurial projects is the broad geographic dispersion of investors. This stands in stark contrast to existing theories that predict entrepreneurs and investors will be co-located due to distance-sensitive costs. We examine a crowd-funding setting that connects artist-entrepreneurs to investors over the internet for financing early stage musical projects where the average distance between entrepreneur and investor is about 3,000 miles, suggesting a reduced role for spatial proximity. Still, distance does play a role. Local investors are more likely to invest in the very early stages of a single round of financing and are less responsive to decisions by other investors. We show this geography effect is driven by investors who likely have a personal connection with the entrepreneur (“family and friends”). Although the online market platform eliminates most distance-related economic frictions such as monitoring progress, providing input, and gathering information (e.g., local reputation, stage presence), it does not eliminate social-related frictions such as information more likely to be held by personally-connected individuals (e.g., entrepreneur’s tendency to persevere, recover from setbacks, succeed in other endeavors).

Download Working Paper from SSRN 

Agrawal, Ajay, Catalini, Christian and Goldfarb, Avi, Does Distance Matter in Online Entrepreneurial Finance? Evidence from Crowd-Funding in the Arts (October 29, 2010). NET Institute Working Paper No. 10-08. Available at SSRN: http://ssrn.com/abstract=1692661

Last Updated on Saturday, 26 March 2011 18:48
 

About

Christian Catalini

PhD Candidate in Strategy at the Rotman School of Management and technology enthusiast, I wrote my undergraduate degree thesis on the economics of open source development and my MSc final dissertation on "The link between science and technology: exploring the network of inventors and scientific authors in the semiconductor industry". After working at KITES-CESPRI Bocconi on the European research project “Highly cited patent”, I started my PhD in Strategic Management at Rotman. Current projects include "Markets Making Music", with Ajay Agrawal; "Intellectual Property and the Diffusion of Formal Standards", with Timothy Simcoe; and "Authors-inventors: life on the boundary between science and technology", with Stefano Breschi.

Areas of interest: economics of innovation, the market for ideas, knowledge flows between science and technology, open source, distributed innovation, creative industries, entrepreneurship.

