Data preparation

Importing the required libraries

In [1]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\dariu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dariu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[1]:
True

Loading the datasets

In [3]:
df_1 = pd.read_csv('./../data/datasets/dataset_1.csv')
df_2 = pd.read_csv('./../data/datasets/dataset_2.csv')

df = pd.concat([df_1, df_2], ignore_index=True)
df.head()
Out[3]:
   Category      Resume
0  Data Science  Skills * Programming Languages: Python (pandas...
1  Data Science  Education Details \r\nMay 2013 to May 2017 B.E...
2  Data Science  Areas of Interest Deep Learning, Control Syste...
3  Data Science  Skills • R • Python • SAP HANA • Table...
4  Data Science  Education Details \r\n MCA YMCAUST, Faridab...

Basic information about the data

Data size:

In [4]:
df.shape
Out[4]:
(1362, 2)

Data types:

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1362 entries, 0 to 1361
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  1362 non-null   object
 1   Resume    1362 non-null   object
dtypes: object(2)
memory usage: 21.4+ KB

Data cleaning

Removing empty cells

In [6]:
df = df.dropna()
df = df[(df['Category'].str.strip() != '') & (df['Resume'].str.strip() != '')]
df.shape
Out[6]:
(1362, 2)
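
The shape is unchanged here because neither dataset contains missing or blank cells. As a minimal sketch of what the two filters do when such rows exist, the made-up `toy` frame below (hypothetical data, not taken from the datasets) has one missing category and one whitespace-only resume:

```python
import pandas as pd

# Hypothetical toy frame: one valid row, one missing value, one whitespace-only resume
toy = pd.DataFrame({
    "Category": ["HR", None, "Testing"],
    "Resume": ["Recruiting experience", "Some text", "   "],
})

# dropna() removes the row whose Category is missing
toy = toy.dropna()

# The strip() comparison removes rows whose text is empty or whitespace only
toy = toy[(toy["Category"].str.strip() != "") & (toy["Resume"].str.strip() != "")]

remaining = toy["Category"].tolist()
print(remaining)
```

Only the fully populated row should remain after both steps.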

Unifying categories across the datasets

In [7]:
category_mapping = {
    'Data': [
        'Data Science', 'Data Scientist', 'Machine Learning Engineer', 'AI', 'Hadoop'
    ],
    'Development - Backend': [
        'Backend Developer', 'Python Developer', 'Java Developer', 'DotNet Developer', 'Full Stack Developer'
    ],
    'Development - Frontend': [
        'Frontend Developer', 'Web Designing'
    ],
    'Mobile Development': [
        'Mobile App Developer (iOS/Android)'
    ],
    'DevOps & Cloud': [
        'DevOps Engineer', 'Cloud Engineer'
    ],
    'Testing': [
        'Testing', 'Automation Testing', 'ETL Developer'
    ],
    'Engineering': [
        'Mechanical Engineer', 'Civil Engineer', 'Electrical Engineering'
    ],
    'Security': [
        'Network Security Engineer'
    ],
    'Business': [
        'Business Analyst', 'PMO', 'Operations Manager', 'Sales'
    ],
    'HR': ['HR'],
    'Design': ['Arts', 'Health and fitness'],
    'Blockchain': ['Blockchain'],
    'SAP': ['SAP Developer'],
    'Database': ['Database'],
    'Legal': ['Advocate']
}

def simplify_category(category):
    for group, values in category_mapping.items():
        if category in values:
            return group
    return category

df['Category'] = df['Category'].apply(simplify_category)
print(df['Category'].unique())

df.iloc[960:965]
['Data' 'HR' 'Legal' 'Design' 'Development - Frontend' 'Engineering'
 'Business' 'Development - Backend' 'SAP' 'Testing' 'DevOps & Cloud'
 'Security' 'Database' 'Blockchain' 'Mobile Development']
Out[7]:
     Category                Resume
960  Testing                 COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...
961  Testing                 Skill Set OS Windows XP/7/8/8.1/10 Database MY...
962  Development - Frontend  As a seasoned Frontend Developer, I have a pro...
963  Development - Backend   With a solid background in Backend Development...
964  Development - Backend   As a Python Developer, I leverage my expertise...
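
The loop in `simplify_category` scans every group for each row. One possible alternative, sketched here under the assumption that each original category belongs to exactly one group, is to invert the mapping once so that each lookup becomes a single dictionary access. The names `mapping_subset`, `category_to_group` and `lookup_category` are hypothetical, and the mapping is abbreviated for illustration:

```python
# Abbreviated, illustrative subset of the category mapping
mapping_subset = {
    "Data": ["Data Science", "Hadoop"],
    "Testing": ["Testing", "Automation Testing"],
}

# Invert to {original_category: group} so each lookup is one dict access
category_to_group = {
    original: group
    for group, originals in mapping_subset.items()
    for original in originals
}

def lookup_category(category):
    # Categories missing from the mapping are returned unchanged
    return category_to_group.get(category, category)

simplified = [lookup_category(c) for c in ["Hadoop", "Automation Testing", "Blockchain"]]
print(simplified)
```

With only a few hundred rows the speed difference is negligible; the inverted dictionary mainly makes it easy to spot a category that was accidentally listed under two groups.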

Cleaning the resume text

In [8]:
stop_words = set(stopwords.words('english'))

def clean_resume(text):
    text = text.lower()  # Lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # Remove links
    text = re.sub(r'\S+@\S+', '', text)  # Remove email addresses
    text = re.sub(r'\+?\d[\d -]{8,}\d', '', text)  # Remove phone numbers
    text = re.sub(r'\n', ' ', text)  # Replace newlines with spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse runs of whitespace

    # Remove English stopwords
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(tokens)

df['Resume'] = df['Resume'].apply(clean_resume)
df.head()
Out[8]:
   Category  Resume
0  Data      skills programming languages python pandas num...
1  Data      education details may may uitrgpv data scienti...
2  Data      areas interest deep learning control system de...
3  Data      skills r python sap hana tableau sap hana sql ...
4  Data      education details mca ymcaust faridabad haryan...
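
To see what each regular expression contributes, the sketch below applies the same substitutions, in the same order, to a made-up sample string. The stopword step is left out so that the example needs no NLTK data, and the sample text is hypothetical:

```python
import re

# Hypothetical sample containing an email, a phone number, a link and punctuation
sample = (
    "Contact: jane.doe@example.com, +48 123 456 789\n"
    "See https://example.com for C++/Python work (2019-2021)."
)

text = sample.lower()
text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # links
text = re.sub(r'\S+@\S+', '', text)                  # email addresses
text = re.sub(r'\+?\d[\d -]{8,}\d', '', text)        # phone numbers
text = re.sub(r'\n', ' ', text)                      # newlines
text = re.sub(r'[^a-zA-Z\s]', '', text)              # everything except letters and whitespace
text = re.sub(r'\s+', ' ', text).strip()             # repeated whitespace

print(text)  # contact see for cpython work
```

Note that `c++/python` collapses into the single token `cpython`, because non-letter characters are deleted rather than replaced with a space. Terms joined only by punctuation are therefore merged by this cleaning step.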

Presenting the cleaned data

Category chart

In [9]:
category_counts = df['Category'].value_counts()
category_percentages = category_counts / category_counts.sum() * 100

plt.figure(figsize=(12, 6))
plt.bar(category_percentages.index, category_percentages.values)

plt.xticks(rotation=45, ha='right')
plt.ylabel('Percent (%)')
plt.title('Percentage share of categories in resumes')
plt.tight_layout()
plt.show()

Resume lengths

In [10]:
resume_lengths = df['Resume'].str.len()

print("Mean length:", resume_lengths.mean())
print("Minimum length:", resume_lengths.min())
print("Maximum length:", resume_lengths.max())
Mean length: 1970.240088105727
Minimum length: 115
Maximum length: 11500
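
These lengths are measured in characters, while the `max_length=512` limit used during tokenization later in the notebook is measured in tokens. A rough proxy, sketched here on a hypothetical three-row `Series`, is the whitespace-separated word count; a subword tokenizer will usually produce at least that many tokens:

```python
import pandas as pd

# Hypothetical cleaned resumes
resumes = pd.Series([
    "python sql tableau",
    "java spring",
    "deep learning control system design",
])

char_lengths = resumes.str.len()            # characters per resume
word_counts = resumes.str.split().str.len() # whitespace-separated words per resume

print(char_lengths.tolist())
print(word_counts.tolist())
```

Comparing the word-count distribution with the token limit gives an early hint of how many resumes are likely to be truncated.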

Word cloud

In [11]:
all_text = " ".join(df['Resume'].tolist())

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)

plt.figure(figsize=(15, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Encoding the categories

In [12]:
label_encoder = LabelEncoder()
df['labels'] = label_encoder.fit_transform(df['Category'])
df
Out[12]:
      Category               Resume                                             labels
0     Data                   skills programming languages python pandas num...  2
1     Data                   education details may may uitrgpv data scienti...  2
2     Data                   areas interest deep learning control system de...  2
3     Data                   skills r python sap hana tableau sap hana sql ...  2
4     Data                   education details mca ymcaust faridabad haryan...  2
...   ...                    ...                                                ...
1357  DevOps & Cloud         dynamic cloud engineer specializing architecti...  5
1358  DevOps & Cloud         skilled cloud engineer focus building secure s...  5
1359  Development - Backend  experienced full stack developer expertise dev...  6
1360  Development - Backend  innovative full stack developer passion buildi...  6
1361  Development - Backend  dynamic full stack developer specializing buil...  6

1362 rows × 3 columns
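
`LabelEncoder` assigns integer ids in sorted order of the class names, which is why `Data` maps to 2 above. The sketch below, on a short hypothetical list of categories, shows how the fitted encoder can be used to recover the id-to-name mapping and to decode predictions back into category names; `id2label` is an illustrative name:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category column
categories = ["Data", "HR", "Testing", "Data", "Business"]

encoder = LabelEncoder()
labels = encoder.fit_transform(categories)

# classes_ holds the sorted class names; the position is the integer id
id2label = {int(i): str(name) for i, name in enumerate(encoder.classes_)}

# inverse_transform turns integer ids back into category names
decoded = encoder.inverse_transform([0, 3]).tolist()

print(labels.tolist())
print(id2label)
print(decoded)
```

Keeping such a mapping next to the trained model makes its integer predictions readable later.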

DataFrame to Dataset

In [13]:
dataset = Dataset.from_pandas(df[['Resume', 'labels']])
dataset = dataset.train_test_split(test_size=0.2)
dataset
Out[13]:
DatasetDict({
    train: Dataset({
        features: ['Resume', 'labels'],
        num_rows: 1089
    })
    test: Dataset({
        features: ['Resume', 'labels'],
        num_rows: 273
    })
})
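
The split above passes no seed, so the train and test membership will generally differ between runs, and nothing keeps the category proportions equal in both parts. `Dataset.train_test_split` accepts a `seed` argument for reproducibility. As a swapped-in illustration of a seeded, stratified split, the sketch below uses scikit-learn's `train_test_split` on hypothetical toy data rather than the `datasets` API used in the notebook:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 resumes, two balanced classes
texts = [f"resume {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# random_state fixes the shuffle; stratify keeps class proportions in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(train_labels))
print(Counter(test_labels))
```

Stratification requires at least two examples per class, so very rare categories would need to be merged or handled separately before a split like this.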

Tokenizing the resumes

In [14]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = dataset.map(
    lambda e: tokenizer(e["Resume"], truncation=True, padding="max_length", max_length=512)
)
tokenized_dataset
Map: 100%|██████████| 1089/1089 [00:01<00:00, 922.04 examples/s]
Map: 100%|██████████| 273/273 [00:00<00:00, 956.20 examples/s] 
Out[14]:
DatasetDict({
    train: Dataset({
        features: ['Resume', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 1089
    })
    test: Dataset({
        features: ['Resume', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 273
    })
})
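
With `truncation=True` and `padding="max_length"`, every example ends up with exactly `max_length` token ids, and `attention_mask` marks which positions hold real tokens. The toy function below only illustrates that idea on plain lists. It is not the tokenizer's implementation: `pad_or_truncate` and `pad_id=0` are assumptions made for the sketch, and the real tokenizer also keeps its special tokens in place when truncating:

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration of fixed-length padding and truncation."""
    ids = token_ids[:max_length]          # cut off anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # number of pad positions needed
    return ids + [pad_id] * padding, mask + [0] * padding

# A short sequence is padded, a long one is truncated
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every resume to 512 tokens is simple but spends memory on pad positions for short texts; padding per batch is a common alternative.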