String Manipulation in R

2022. 3. 4. 13:57

This post walks through several ways to manipulate strings in R. The examples use the data created by the code below.

data <- c("Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.",
"",
"Data science is a \"concept to unify statistics, data analysis, informatics, and their related methods\" in order to \"understand and analyze actual phenomena\" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a \"fourth paradigm\" of science (empirical, theoretical, computational, and now data-driven) and asserted that \"everything about science is changing because of the impact of information technology\" and the data deluge.[4][5]",
"",
"A data scientist is someone who creates programming code, and combines it with statistical knowledge to create insights from data.[6]")

head(data)

The data vector holds five lines of a document, one line per element (two of the lines are empty strings). The contents are shown below.

[1] "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data."                                                                                  
[2] ""                                                                                                             
[3] "Data science is a \"concept to unify statistics, data analysis, informatics, and their related methods\" in order to \"understand and analyze actual phenomena\" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a \"fourth paradigm\" of science (empirical, theoretical, computational, and now data-driven) and asserted that \"everything about science is changing because of the impact of information technology\" and the data deluge.[4][5]"
[4] ""                                                                                                             
[5] "A data scientist is someone who creates programming code, and combines it with statistical knowledge to create insights from data.[6]"

String manipulation in R

nchar()

The nchar() function takes a string as its argument and returns the number of characters in it.

nchar(data[1])

[1] 362

As the result shows, the first element of the data vector is a string of 362 characters.
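nchar() is also vectorized, so it can be applied to a whole character vector at once. The sketch below illustrates this with a small made-up vector (`words` is a hypothetical name, not part of the data above).

```r
# Hypothetical example vector; not part of the post's `data`.
words <- c("apple", "", "kiwi")

# nchar() returns one count per element; an empty string has 0 characters.
counts <- nchar(words)
print(counts)     # 5 0 4

# Indices of the non-empty elements
non_empty <- which(nchar(words) > 0)
print(non_empty)  # 1 3
```

The same idea could be used to skip the empty lines in a vector such as data before processing it further.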

toupper()

The toupper() function converts every character in a string to uppercase.

toupper(data[1])

[1] "DATA SCIENCE IS AN INTERDISCIPLINARY FIELD THAT USES SCIENTIFIC METHODS, PROCESSES, ALGORITHMS AND SYSTEMS TO EXTRACT KNOWLEDGE AND INSIGHTS FROM NOISY, STRUCTURED AND UNSTRUCTURED DATA,[1][2] AND APPLY KNOWLEDGE AND ACTIONABLE INSIGHTS FROM DATA ACROSS A BROAD RANGE OF APPLICATION DOMAINS. DATA SCIENCE IS RELATED TO DATA MINING, MACHINE LEARNING AND BIG DATA."

tolower()

The output above shows the uppercase result. Likewise, to convert every character in a string to lowercase, use the tolower() function.

tolower(data[1])

[1] "data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. data science is related to data mining, machine learning and big data."
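One common use of case conversion is comparing strings without regard to case. The following is a minimal sketch of that idea; the variables `a` and `b` are hypothetical inputs chosen for illustration.

```r
# Hypothetical inputs used only for illustration.
a <- "Data Science"
b <- "data science"

print(a == b)   # FALSE: string comparison is case-sensitive

# Normalizing both sides to lowercase makes the comparison case-insensitive.
same <- tolower(a) == tolower(b)
print(same)     # TRUE

# Both functions are vectorized over character vectors.
upper <- toupper(c("big", "data"))
print(upper)    # "BIG" "DATA"
```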

chartr()

The chartr() function replaces a given set of characters in a string with other characters.

chartr(" ","-",data[1])

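The first argument is a string containing the characters to be replaced. The second argument is a string containing the replacement characters; each character in the first string is mapped to the character at the same position in the second.

The last argument is the string to operate on. In the output, every space character has been replaced with a hyphen (-).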

[1] "Data-science-is-an-interdisciplinary-field-that-uses-scientific-methods,-processes,-algorithms-and-systems-to-extract-knowledge-and-insights-from-noisy,-structured-and-unstructured-data,[1][2]-and-apply-knowledge-and-actionable-insights-from-data-across-a-broad-range-of-application-domains.-Data-science-is-related-to-data-mining,-machine-learning-and-big-data."
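chartr() can translate several characters in one call, because the mapping is positional. The sketch below uses a made-up input string (`x` is hypothetical) to show three characters being translated at once.

```r
# Hypothetical input string for illustration.
x <- "big data, big ideas"

# Each character in the first argument is replaced by the character at the
# same position in the second: "a" -> "4", "i" -> "1", " " -> "_".
translated <- chartr("ai ", "41_", x)
print(translated)  # "b1g_d4t4,_b1g_1de4s"
```

Note that chartr() works character by character; it is not meant for replacing whole words or substrings.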

strsplit()

The strsplit() function splits a string into pieces wherever a given regular expression matches.

mylist <- strsplit(data[1]," ")
mylist

[[1]]
 [1] "Data"              "science"           "is"                "an"               
 [5] "interdisciplinary" "field"             "that"              "uses"             
 [9] "scientific"        "methods,"          "processes,"        "algorithms"       
[13] "and"               "systems"           "to"                "extract"          
[17] "knowledge"         "and"               "insights"          "from"             
[21] "noisy,"            "structured"        "and"               "unstructured"     
[25] "data,[1][2]"       "and"               "apply"             "knowledge"        
[29] "and"               "actionable"        "insights"          "from"             
[33] "data"              "across"            "a"                 "broad"            
[37] "range"             "of"                "application"       "domains."         
[41] "Data"              "science"           "is"                "related"          
[45] "to"                "data"              "mining,"           "machine"          
[49] "learning"          "and"               "big"               "data."

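The first argument is the string to split, and the second is the regular expression to split on.

Here a space character is used to split the string. strsplit() returns a list, so use the unlist() function to turn the result into a character vector.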

mylist1 <- unlist(mylist)
mylist1

Because the words in the original string were separated by space characters, the output vector contains one element per word.

 [1] "Data"              "science"           "is"                "an"               
 [5] "interdisciplinary" "field"             "that"              "uses"             
 [9] "scientific"        "methods,"          "processes,"        "algorithms"       
[13] "and"               "systems"           "to"                "extract"          
[17] "knowledge"         "and"               "insights"          "from"             
[21] "noisy,"            "structured"        "and"               "unstructured"     
[25] "data,[1][2]"       "and"               "apply"             "knowledge"        
[29] "and"               "actionable"        "insights"          "from"             
[33] "data"              "across"            "a"                 "broad"            
[37] "range"             "of"                "application"       "domains."         
[41] "Data"              "science"           "is"                "related"          
[45] "to"                "data"              "mining,"           "machine"          
[49] "learning"          "and"               "big"               "data."
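Because the split pattern is treated as a regular expression by default, characters such as "." have a special meaning. The sketch below, using made-up input strings, shows a regular-expression split, a literal split with fixed = TRUE, and why the result is a list.

```r
# Hypothetical inputs for illustration.
# Split on a comma or semicolon followed by any number of spaces.
parts <- strsplit("one,  two;three", "[,;] *")[[1]]
print(parts)            # "one" "two" "three"

# "." matches any character in a regular expression, so use fixed = TRUE
# to split on a literal dot.
fields <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
print(fields)           # "a" "b" "c"

# strsplit() returns one list element per input string.
pieces <- strsplit(c("a b", "c d e"), " ")
print(lengths(pieces))  # 2 3
```

When only one string is split, indexing with [[1]] is an alternative to unlist() for getting the character vector.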

sort()

The mylist1 vector created above can also be sorted by passing it to the sort() function.

sorting <- sort(mylist1)
sorting

 [1] "a"                 "across"            "actionable"        "algorithms"       
 [5] "an"                "and"               "and"               "and"              
 [9] "and"               "and"               "and"               "application"      
[13] "apply"             "big"               "broad"             "data"             
[17] "data"              "Data"              "Data"              "data,[1][2]"      
[21] "data."             "domains."          "extract"           "field"            
[25] "from"              "from"              "insights"          "insights"         
[29] "interdisciplinary" "is"                "is"                "knowledge"        
[33] "knowledge"         "learning"          "machine"           "methods,"         
[37] "mining,"           "noisy,"            "of"                "processes,"       
[41] "range"             "related"           "science"           "science"          
[45] "scientific"        "structured"        "systems"           "that"             
[49] "to"                "to"                "unstructured"      "uses"

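As a result, the elements are sorted alphabetically. How uppercase and lowercase letters are ordered relative to each other depends on the locale.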
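If a locale-independent or case-insensitive order is needed, something like the following sketch may help. The vector `words` is a hypothetical example, and method = "radix" is used here because it sorts character vectors in the C locale.

```r
# Hypothetical example vector.
words <- c("data", "Data", "apple")

# method = "radix" uses C-locale ordering, where uppercase letters
# sort before lowercase ones regardless of the session locale.
c_order <- sort(words, method = "radix")
print(c_order)   # "Data" "apple" "data"

# Case-insensitive order: sort by the lowercase form, keep original spelling.
# order() leaves ties in their original order.
ci_order <- words[order(tolower(words))]
print(ci_order)  # "apple" "data" "Data"
```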

paste()

The paste() function can be used to join the elements of a character vector.

paste(sorting, collapse = " ")

The collapse= argument sets the string that is placed between the individual elements.

[1] "a across actionable algorithms an and and and and and and application apply big broad data data Data Data data,[1][2] data. domains. extract field from from insights insights interdisciplinary is is knowledge knowledge learning machine methods, mining, noisy, of processes, range related science science scientific structured systems that to to unstructured uses"

Here a single space character is used as the separator. The alphabetically sorted words appear in the output as a single string.
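paste() also has a sep= argument, which is easy to confuse with collapse=. The sketch below uses made-up vectors (`first` and `second` are hypothetical names) to show the difference: sep= joins corresponding elements of several vectors, while collapse= joins the resulting elements into one string.

```r
# Hypothetical example vectors.
first  <- c("big", "machine")
second <- c("data", "learning")

# sep= is placed between corresponding elements; the result has two elements.
pairs <- paste(first, second, sep = " ")
print(pairs)   # "big data" "machine learning"

# collapse= then joins those elements into a single string.
joined <- paste(first, second, sep = " ", collapse = ", ")
print(joined)  # "big data, machine learning"

# paste0() is shorthand for paste(..., sep = "").
ids <- paste0("v", 1:3)
print(ids)     # "v1" "v2" "v3"
```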

substr()

The substr() function extracts a specified part of a string.

subs <- substr(data[1], start = 3, stop = 30)
subs

Simply supply the start and stop indices of the segment, and that contiguous section is returned.

[1] "ta science is an interdiscip"

Depending on the indices chosen, a substring can begin or end with a space character, although this particular one does not.
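To see how the indices map to characters, here is a small sketch with a made-up string (`x` is hypothetical). Positions are 1-based and both ends are inclusive, so starting on a space keeps that space in the result.

```r
# Hypothetical input string; position 5 is the space between the words.
x <- "Data science"

piece <- substr(x, start = 5, stop = 12)
print(piece)         # " science" (note the leading space)
print(nchar(piece))  # 8

# substr() is vectorized over its first argument.
heads <- substr(c("statistics", "informatics"), 1, 4)
print(heads)         # "stat" "info"
```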

trimws()

Leading and trailing whitespace can be removed with the trimws() function, which strips whitespace from the start and end of a string.
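The post does not show trimws() in action, so here is a minimal sketch with a made-up padded string (`padded` is a hypothetical name). The which= argument selects which side to trim and defaults to both.

```r
# Hypothetical input with two spaces on each side.
padded <- "  data science  "

both  <- trimws(padded)                   # trim both sides (default)
left  <- trimws(padded, which = "left")   # trim only leading whitespace
right <- trimws(padded, which = "right")  # trim only trailing whitespace

print(both)   # "data science"
print(left)   # "data science  "
print(right)  # "  data science"
```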

A substring can also be counted backward from the end of the string, for example to get the last five characters. For this, use the str_sub() function from the stringr package.

library(stringr)
str_sub(data[1], -5, -1)

Note that both the start and end arguments are negative here. The start is therefore the fifth character from the end of the string, and the end is the last character.

[1] "data."

The output shows that the last five characters were returned successfully.
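If stringr is not available, the same result can be obtained in base R by computing the indices from nchar(). This is an alternative sketch, not the approach used above; the string `x` is a hypothetical input.

```r
# Hypothetical input string.
x <- "machine learning and big data."

# The last five characters run from position n - 4 to position n.
n <- nchar(x)
last_five <- substr(x, n - 4, n)
print(last_five)  # "data."
```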

You should now be able to change the characters in a string, split a string into a vector, and extract specific substrings.
