I have a data frame and I want to add a column to it that contains unique (non-duplicated) alphanumeric values.
Firstly, I adapted a function that I found on a blog (https://ryouready.wordpress.com/2008/12/18/generate-random-string-name/):
Code:
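idGenerator <- function(n, lengthId) {
  # Alphabet: 10 digits + 26 lowercase + 26 uppercase = 62 characters
  alphaNum <- c(0:9, letters, LETTERS)
  # Guard: asking for more ids than distinct strings exist would loop forever
  if (n > length(alphaNum)^lengthId) {
    stop("n > perms: infinite loop")
  }
  idList <- character(n)
  for (i in seq_len(n)) {
    idList[i] <- paste(sample(alphaNum, lengthId, replace = TRUE), collapse = "")
    # Redraw until id i differs from every other element of the vector
    while (idList[i] %in% idList[-i]) {
      idList[i] <- paste(sample(alphaNum, lengthId, replace = TRUE), collapse = "")
    }
  }
  return(idList)
}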
My problem is that my data frame has about 250k rows, so with n = 250k this function runs forever.
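I know that with n = 250k, if I increase the length of the id string (lengthId), the odds of getting the same string twice are negligible, so the while loop is a waste of resources. But I really need to be sure that it will not happen, and by "sure" I mean enforced by control structures.
So I found a more efficient way to do it: instead of running a while loop that checks the whole vector for each i in the for loop, I check for duplicates in the final vector: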
Code:
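idGenerator <- function(n, lengthId) {
  alphaNum <- c(0:9, letters, LETTERS)
  if (n > length(alphaNum)^lengthId) {
    stop("n > perms: infinite loop")
  }
  idList <- character(n)
  # First pass: generate n ids without checking for duplicates
  for (i in seq_len(n)) {
    idList[i] <- paste(sample(alphaNum, lengthId, replace = TRUE), collapse = "")
  }
  # Second pass: redraw while the vector still contains duplicates.
  # Note: one new string is generated per pass and assigned to every
  # duplicated position at once.
  while (any(duplicated(idList))) {
    idList[which(duplicated(idList))] <- paste(sample(alphaNum, lengthId,
                                                      replace = TRUE), collapse = "")
  }
  return(idList)
}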
It's very slow if the while loop has to run many times, i.e. when n is very close to the number of possible ids (62^lengthId):
Code:
> system.time(idGenerator(62^2, 2))
   user  system elapsed
   8.00    0.00    8.02
> system.time(idGenerator(62^3, 3))
Timing stopped at: 584.35 16.66 602.46
But it's quite acceptable for a long id string:
Code:
> system.time(idGenerator(250000, 12))
   user  system elapsed
    3.2     0.0     3.2
However, 3+ seconds to create a single column is still slow, so I'm looking for a faster way.
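To put a rough number on the odds of a duplicate with long ids, here is a small sketch based on the standard birthday-problem approximation, p ≈ 1 - exp(-n(n-1)/(2N)), where N = 62^lengthId is the number of possible ids. The helper name collisionProb and its alphabetSize argument are made up for this illustration. It only gives an estimate, not a guarantee, which is why an explicit duplicate check is still wanted.

```r
# Approximate probability that n random ids of length lengthId, drawn
# uniformly from an alphabet of alphabetSize characters, contain at least
# one duplicate (birthday-problem approximation).
# collisionProb is a hypothetical helper name, not part of any package.
collisionProb <- function(n, lengthId, alphabetSize = 62) {
  n <- as.numeric(n)               # avoid integer overflow in n * (n - 1)
  perms <- alphabetSize^lengthId   # number of possible ids
  # 1 - exp(-x), written with expm1() so tiny probabilities keep precision
  -expm1(-n * (n - 1) / (2 * perms))
}

collisionProb(250000, 12)  # about 1e-11
collisionProb(250000, 6)   # about 0.42
```

So with lengthId = 12 a duplicate is extremely unlikely, while with lengthId = 6 it would happen in a large share of runs.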
I know that the loop isn't ideal and that I should prefer vectorization, but I'm not really a master of code optimization. So if you have any ideas, thank you in advance.
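For illustration, here is a minimal sketch of what a vectorized version might look like. It has not been benchmarked here. It assumes the same 62-character alphabet, draws all characters with a few sample() calls instead of one call per id, and redraws only the duplicated positions, each with its own fresh id. The names idGeneratorVec and drawIds are hypothetical.

```r
# Sketch of a vectorized id generator (hypothetical names, same alphabet
# as above). Uniqueness is still enforced by an explicit duplicated() check.
idGeneratorVec <- function(n, lengthId) {
  alphaNum <- c(0:9, letters, LETTERS)
  stopifnot(lengthId >= 1)
  if (n > length(alphaNum)^lengthId) {
    stop("n exceeds the number of possible ids")
  }
  # Draw k ids at once: one vector of k random characters per position,
  # pasted together element by element.
  drawIds <- function(k) {
    cols <- replicate(lengthId, sample(alphaNum, k, replace = TRUE),
                      simplify = FALSE)
    do.call(paste0, cols)
  }
  idList <- drawIds(n)
  # Redraw only the duplicated positions until none are left
  dup <- duplicated(idList)
  while (any(dup)) {
    idList[dup] <- drawIds(sum(dup))
    dup <- duplicated(idList)
  }
  idList
}

# Usage, assuming a data frame called df (hypothetical):
# df$id <- idGeneratorVec(nrow(df), 12)
```

When n is close to the number of possible ids, this rejection approach would still need many passes, so it is mainly meant for the long-id case.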