Invalid Multibyte string

December 07, 2015

The "invalid multibyte string" error is quite an annoying problem, commonly found when you receive data from systems outside your sphere of influence or from a variety of different sources. Some basic strategies for dealing with it:

First, try to read the file in as UTF-8. The error message identifies the first offending string, which gives you something concrete to experiment with.

d <- read.csv(
   file = "path/to/file.csv",
   encoding = "UTF-8"
)
sapply(d, tolower) # stops at the first invalid string

# > Error in tolower(d$column) : invalid input 'H�llo' 
# >  in 'utf8towcs'

So now you know there's an encoding issue. If your file is small, you can rerun the load over and over trying different encodings. In particular, try Windows-1252 and latin1.
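For a small file, the reload loop can be sketched roughly as below. This is only an illustration under assumptions: the temporary file, the `try_encoding` helper and the `loads` list are hypothetical names, and the sketch uses `fileEncoding` (which re-encodes the file while reading) rather than `encoding` (which only marks the strings). Whether "Windows-1252" is a recognized encoding name depends on your platform's iconv.

```r
# Self-contained demo: a tiny CSV whose only value is "Héllo" stored as
# latin1 bytes (0xE9 for the accented letter), so it is not valid UTF-8.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("column\nH"), as.raw(0xE9), charToRaw("llo\n")), path)

candidates <- c("UTF-8", "latin1", "Windows-1252")

# Hypothetical helper: reload the file with one candidate encoding.
# Any warning (e.g. invalid input on the connection) or error counts as a failure.
try_encoding <- function(path, enc) {
  tryCatch(
    read.csv(path, fileEncoding = enc, stringsAsFactors = FALSE),
    warning = function(w) NULL,
    error = function(e) NULL
  )
}

loads <- lapply(candidates, function(enc) try_encoding(path, enc))
names(loads) <- candidates

# Inspect each clean load by eye; a wrong encoding can still load without complaint.
for (enc in candidates) {
  cat(enc, ": ", sep = "")
  if (is.null(loads[[enc]])) {
    cat("did not load cleanly\n")
  } else {
    cat(loads[[enc]]$column, "\n")
  }
}
```

A load that finishes without warnings is not proof of the right encoding, so the printed values still need a visual check.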

If it's a large file, reloading it repeatedly is too slow, so load it as UTF-8 once to get an error that points you to an offending value. Once you've found an example, either by grepping or by visual inspection, you can test encodings on that single value with iconv.

# Load the file once, marking strings as UTF-8 (no re-encoding happens here)
d <- read.csv(
   file = "path/to/file.csv",
   encoding = "UTF-8"
)

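# Identify a record that's causing the issue
sapply(d, function(x) {
   # look at the data, to identify the column; one option is to flag
   # columns holding invalid UTF-8 (validUTF8() needs R >= 3.3.0)
   any(!validUTF8(as.character(x)))
})

# Now grep for a specific record
# (useBytes = TRUE so grep doesn't choke on the invalid string)
grep("llo$", d$column, useBytes = TRUE) # 7, for example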

# Now we can try converting to see which one works
# Again we use the typical culprits:
# latin1, UTF-8, Windows-1252, macintosh
iconv(as.character(d$column[7]), from = "latin1", to = "UTF-8")

# You'll know you've hit the jackpot when the characters
# aren't corrupted and your special character is complete
# > Héllo
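
To compare the usual culprits side by side rather than one call at a time, something like the following sketch may help. The sample value `bad` and the names `candidates` and `converted` are made up for illustration; the bytes stand in for a value read from a latin1 file. Encoding names are platform dependent, so an unsupported name is simply reported as NA here.

```r
# Hypothetical sample: the bytes of "Héllo" in latin1 (0xE9 is the accented e),
# as they would sit in memory after a load that did not re-encode the file.
bad <- rawToChar(as.raw(c(0x48, 0xE9, 0x6C, 0x6C, 0x6F)))

candidates <- c("latin1", "UTF-8", "Windows-1252", "macintosh")

# Convert the same bytes from each candidate encoding to UTF-8.
# iconv() gives NA when the bytes are not valid in the source encoding;
# an encoding name the platform does not know is also mapped to NA.
converted <- vapply(
  candidates,
  function(enc) {
    tryCatch(
      iconv(bad, from = enc, to = "UTF-8"),
      error = function(e) NA_character_
    )
  },
  character(1)
)

print(converted)
```

More than one candidate can convert without an error and still produce the wrong character, so the final judgment remains the visual one: pick the encoding whose output reads correctly.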