Making Exploratory Data Analysis Sweeter With Sweetviz 2.0


This article was published as a part of the Data Science Blogathon.

What is Exploratory Data Analysis?

The very first step in data science is exploratory data analysis, aka EDA. No single type of data model fits every type of data, so it is better to analyze the data thoroughly before proceeding further. For example, mathematical operations cannot be applied to categorical data, and issues such as missing values in the dataset must be addressed.

The accuracy of the data model directly depends on the quality of the data. However, in the real world, data is collected from various sources and must be handled accordingly to reduce the repercussions.

EDA, being the initial step of data mining, helps in getting insight into the data without any assumptions, which in turn helps in forming hypotheses. The fundamental ingredients of EDA are data summarization, data description and inference, and data visualization.

The traditional way of doing EDA in Python involves tools such as NumPy, pandas, SciPy, and Matplotlib. However, a quick sneak peek into data can be done using Sweetviz.

Introduction to Sweetviz 2.0

Sweetviz 2.0 is an open-source, pandas-based library that performs the primary EDA tasks without much hassle, in just two lines of code. It also generates a summarized report with great visualizations.

Installation using pip

The basic command to install a package using pip is:

pip install sweetviz

Alternatively, use the following command inside Notebook/Colab

!pip install sweetviz


How to Use?

Let us get our hands dirty and start writing the code… (Feel free to use the source code here)

For this purpose, we will be using the Student Performance dataset (get it here). It is mixed data; that is, both numerical and categorical data are present. The dataset consists of 1,000 student records with a total of eight features, viz.,

gender: categorical

race/ethnicity: categorical

parental level of education: categorical

lunch: categorical

test preparation course: categorical

math score: numerical

reading score: numerical; and

writing score: numerical

We will be using pandas for reading the csv (Comma Separated Values) file.
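A self-contained sketch of this loading step (the two inline rows are made-up stand-ins for the real 1,000-row file, which you would read from its path on disk instead):

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the Students Performance CSV so the
# snippet runs without the file; with the real file you would pass its
# path to pd.read_csv instead.
csv_text = """gender,lunch,math score,reading score,writing score
female,standard,72,72,74
male,standard,69,90,88
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)
```

With the actual file, the same call against its path yields the full 1,000-row, eight-column frame.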

Sweetviz has a powerful function called analyze() that helps in analyzing the data at a glance.

import sweetviz as sv

# Analyzing data
report = sv.analyze(data)

# Generating report
report.show_html('eda_report.html')

Bang! Our report is ready in a split second.

The function show_html() generates a detailed report consisting of the following details:


If show_html() function is not supplied with any parameter, by default, it generates a file named ‘SWEETVIZ_REPORT.html’.

Apart from this, we can compare two datasets side-by-side. To have a glance at this, we would split the dataset into two halves.

# Splitting the data into two datasets
data1 = data[0:400]
data2 = data[400:]

Now, let us compare both of them side-by-side using the compare() function:

report_comp = sv.compare(data1, data2)
If compare() is left with its default parameters, it refers to the two datasets as Dataframe and Compared, respectively.

We can also perform target analysis, but currently it only supports numerical or binary targets, not categorical ones. Let us consider the math score as a target:

report = sv.analyze(data, target_feat='math score')
Version 1.0 vs 2.0

Although working with Sweetviz makes EDA hassle-free, version 1.0 introduced some difficulties as well.

Firstly, reports were generated using the base OS module; hence Sweetviz was incompatible with custom environments such as Google Colab. Secondly, the reports are in HTML format; therefore, the graphs cannot be plotted inline.

However, in version 2.0, these issues have been taken care of by a new feature show_notebook() which embeds the visualizations in the notebooks using an iframe. Also, you’ll be able to save a report in HTML format that could be accessed later.

Playing with Reports

The additional yet optional parameters make this task easy.

Version 2.0 allows the user to manipulate the appearance of the report with some parameters as –

report_comp.show_html(filepath='report.html', open_browser=True, layout='vertical', scale=0.7)

Another feature introduced in version 2.0 called show_notebook() displays the report within the notebook environment rather than any browser-based external environment using an IFRAME HTML element. Typical usage of this feature is demonstrated by the following lines of code –

report_comp.show_notebook(w=None, h=None, scale=None, layout='vertical', filepath='E:/sweetviz_report.html')


w and h: signify the width and height of the window. They can be defined in absolute pixels (e.g. 400) or relatively (e.g. 70%).

filepath: saves the report in the specified directory under the given name, so it can be accessed later

layout: there are two layout modes available, viz., widescreen and vertical

scale: this defines the scale of the report within the window. It is a fractional number ranging from 0 to 1.


In this short article, I outlined how to load data using pandas and get quick insights into the data using Sweetviz 2.0 in just a couple of lines of code.



The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.



What Is Data Wrangling And How Does It Improve Data Analysis?



In 2023, most major organizations are led by data while making business decisions. As a result, data professionals are essential for businesses to function. At the same time, Gartner research recently found that organizations believe that poor-quality data is costing them an average of $15 million in losses annually. This combination of high dependency on data as well as uncertainty about data quality is making practices like data wrangling vital for businesses to function efficiently. Which brings us to the question: what is data wrangling and how does it help? Let’s explore. 

What is Data Wrangling?

It is the practice of removing errors from data sets or restructuring complex datasets to make them more suitable for analysis. Wrangling consists of cleaning, organizing, and transforming raw data into the desired format for analysts to use. It helps businesses use more complex data, faster, and more accurately. 

What are its Benefits?

It transforms raw data and makes it usable for businesses. Here are the key benefits of data wrangling: 

Data Consistency

It helps turn raw data into consistent data sets which businesses can use. For example, data collected from consumers is usually error-ridden. Data wrangling can help eliminate these human errors and make the data more uniform. 

Improved Insights

The consistency brought through wrangling often provides better insights about metadata. 

Cost Efficiency

Cleaning up and organizing data through wrangling reduces errors in the data, saves time for the person who will be using the data, and thus reduces costs for the company.

Importance of Data Wrangling

McKinsey has estimated that big data projects could account for a reduction of $300-450 billion in US healthcare spending. It is clear that data analysis has a significant impact on business practices. However, any analyses that businesses perform will only be as effective as the data informing them. To ensure accurate results, consistent, reliable data is necessary. Data wrangling proves to be essential to achieve this accuracy.

Best Practices for Data Wrangling

To ensure effective results, there are certain practices one should be aware of: 

Remember Your Objective

Think about the objective of the person who needs the data you are working with. By doing this, you will be focused on the data that they need. 

Choosing the Right Data

Selecting the right data is necessary. To ensure quality: 

Avoid duplicate data

Use the original source

Use recent data

Double Check

Humans are always capable of errors, even data wranglers. It is necessary to re-check the data once wrangling is complete. 

Steps to Perform Data Wrangling

Step 1: Discovery

This process involves thinking about the desired results, understanding what kind of data is necessary to achieve the objectives, and collecting the desired data. 

Step 2: Organization

After the raw data is gathered, it needs to be structured into a less overwhelming and more organized form.

Step 3: Cleaning

After the data is structured, you can start cleaning it. This involves removing outliers, null, and duplicate data. 

Step 4: Enrichment

In this step, you review if you have gathered enough data. If a data set is too small, it may compromise the results of the analysis.

Step 5: Validation

Once enrichment is complete, you can apply validation rules to your data. Validation rules applied in iterations can confirm if your data is consistent.

Step 6: Publishing

The last step is data publishing. Here you prepare the data for future use. This includes making notes and documenting the entire process. 
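The steps above can be sketched in miniature with pandas; the article names no specific tool, so this is only an illustration with made-up column names and records:

```python
import pandas as pd

# Made-up raw records containing a duplicate row, a null, and an
# implausible age, to exercise the cleaning and validation steps.
raw = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", None],
    "age": [34, 34, 29, 212],
})

# Step 3 (cleaning): drop duplicate rows and rows with null values
clean = raw.drop_duplicates().dropna()

# Step 5 (validation): a simple rule that ages must fall in a plausible range
valid = clean[clean["age"].between(0, 120)]
print(valid.to_dict("records"))
```

Real pipelines apply many such validation rules in iterations, but the shape of the work is the same: structure, clean, validate, then publish.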

Data Wrangling Examples

Financial Insights

Data wrangling can be used to discover insights hidden in data, predict trends, and forecast markets. It helps in making informed investment decisions.

Improved Reporting

Creating reports with unstructured data can be a challenge. Data wrangling improves data quality and helps in reporting.

Understanding Customer Base

Customers exhibit different behaviors which can be reflected in the data they generate. Data wrangling can help identify common behavioral patterns.

Who Uses Data Wrangling?

Data analysts spend most of their time conducting data wrangling rather than data analysis. This is to ensure that the best results are delivered for businesses using the most accurate data. It is essential for businesses in nearly every industry. 

Frequently Asked Questions

1. What are Popular Data Wrangling Tools?



Google DataPrep

Data Wrangler

2. What’s the Difference Between Data Wrangling and Data Cleaning?

The objective of data cleaning is to remove inaccurate data from the data set, whereas the objective of wrangling is to transform the data into a more usable format. 

3. How can Data Wrangling Improve Data Quality?

Data wrangling helps remove errors from the data set and also structures it in a more usable format. When the data is well structured and error-free, the subsequent data analysis is able to yield more accurate results which in turn end up in better business outcomes. 

As big data finds even greater acceptance in business, the need for data professionals is only going to be on the rise. Having learned what is data wrangling, if you are interested in going deeper into this field, explore the courses on data science and analytics on Emeritus. These are offered in collaboration with top universities and will help you in your career as a data professional.

By Tanish Pradhan 

Write to us at [email protected]

Celebrate PyTorch 2.0 With New AI Developer Performance Features

PyTorch 2.0 arrives with new AI developer performance features, and there is more exciting news inside.

TorchInductor CPU FP32 Inference Optimized

In this article, we discuss insights on PyTorch 2.0 and its new AI developer performance features.

As a component of the PyTorch 2.0 compilation stack, the TorchInductor CPU backend optimization significantly boosts performance over PyTorch eager mode through graph compilation.

The TorchInductor CPU backend is sped up by utilizing the PyTorch ATen CPU kernels for memory-bound operations, with explicit vectorization on top of OpenMP*-based thread parallelization, and by the Intel Extension for PyTorch for Conv/GEMM ops with post-op fusion and weight pre-packing.

These improvements, combined with the potent loop fusions in TorchInductor code generation, allowed FP32 inference performance to improve by up to 1.7x on three sample deep learning benchmark suites: TorchBench, HuggingFace, and timm. Low-precision support and training are under development.

See the Improvements

The TorchInductor CPU Performance Dashboard tracks these performance enhancements on multiple backends.

Make Graph Neural Network (GNN) in PyG Perform Better for Inference and Training on CPU

GNN is an effective method for analyzing data with a graph structure. On Intel® CPUs, including the brand-new 4th Gen Intel® Xeon® Scalable processors, this capability is intended to enhance GNN inference and training performance.

The popular library PyTorch Geometric (PyG) was developed using PyTorch to carry out GNN operations. Currently, PyG’s GNN models perform poorly on the CPU because of the absence of SpMM_reduce, a crucial kernel-level optimization, and other GNN-related sparse matrix multiplication operations (scatter/gather, etc.).

Message passing optimizations between nearby neural network nodes are offered to overcome this:

scatter_reduce: a performance hotspot in message passing when the edge index is stored in coordinate format (COO).

gather: a variant of scatter_reduce, tailored specifically for the GNN computation when the index is an expanded tensor.

sparse.mm with the reduce flag: a performance hotspot in message passing when the edge index is stored in compressed sparse row (CSR) format. The reduce flags sum, mean, amax, and amin are supported.

Accelerating PyG on Intel CPUs discusses the end-to-end performance benchmark results for both inference and training on the 3rd Gen Intel® Xeon® Scalable processor 8380 platform and the 4th Gen 8480+ platform.

Unified Quantization Backend to Improve INT8 Inference for X86 CPU Platforms

The new X86 quantization backend, which replaces FBGEMM as the default quantization backend for X86 platforms, combines the FBGEMM (Facebook General Matrix-Matrix Multiplication) and oneAPI Deep Neural Network Library (oneDNN) backends. The result is better end-to-end INT8 inference performance compared to FBGEMM alone.

Accordingly, the X86 backend takes the role of FBGEMM and, depending on the use case, may provide higher performance.

The criteria for selection are:

FBGEMM is always employed on platforms lacking VNNI (such as those with Intel® Core™ i7 CPUs).

For linear ops, FBGEMM is usually utilized on platforms supporting VNNI (such as those running 2nd-4th generation Intel® Xeon® Scalable processors and future platforms).

For depth-wise convolution with more than 100 layers, FBGEMM is used; otherwise, oneDNN is utilized.

Use the oneDNN Graph API to Speed Up CPU Inference

The oneDNN Graph API extends oneDNN with a flexible graph API to increase the opportunities for optimized code generation on Intel® AI hardware. It automatically identifies the graph partitions that should be accelerated via fusion. For both inference and training use cases, the fusion patterns concentrate on fusing compute-intensive operations such as convolution, matmul, and their neighbor operations.

PyTorch requires little to no change to enable the more recent oneDNN Graph fusions and optimized kernels. User options for oneDNN Graph include:

Before JIT tracing a model, either use the API torch.jit.enable_onednn_fusion(True), OR…

H1B Visa Data Analysis: Unveiling Patterns Of H1B Visa Approval


The H1B visa program opens doors for skilled individuals worldwide to bring their expertise to the United States. Thousands of talented professionals enter the US through this program each year, contributing to various industries and driving innovation. Let’s dive into the fascinating world of H1B visa data from the Office of Foreign Labor Certification (OFLC) and explore the stories behind the numbers. This article presents an H1B visa data analysis, drawing insights and interesting stories from the data. Through feature engineering, we enhance the dataset with additional information from external sources. Meticulous data wrangling organizes the data carefully so that we can better understand and analyze it. Finally, data visualizations unveil fascinating trends and untold insights about skilled workers in the US between 2014 and 2023.

Explore and analyze H1B visa data from the Office of Foreign Labor Certification (OFLC) and understand its significance in attracting skilled foreign workers to the United States.

Learn about the process of data preprocessing, including data cleaning, feature engineering, and data transformation techniques.

Examine and analyze the acceptance and rejection rates of H1B visa applications and the factors that potentially influence these rates.

Gain familiarity with data visualization techniques to present and communicate the findings effectively.

Note:🔗 Please find the complete code and dataset for this analysis on Kaggle to explore the whole process and code behind the analysis: H1B Analysis on Kaggle

This article was published as a part of the Data Science Blogathon.

What is H1B Visa?

The H1B visa program is a key component of U.S. immigration policy, aimed at attracting highly skilled foreign workers to fill specialized positions in various industries. It addresses skill shortages, promotes innovation, and drives economic growth.

To obtain an H1B visa, a person must follow these key steps:

Find a U.S. employer willing to sponsor the visa.

The employer files an H1B petition with the USCIS on behalf of the foreign worker.

The petition is subject to an annual cap and may go through a lottery if there are more applications than available spots.

If selected, the USCIS reviews the petition for eligibility and compliance.

If approved, the foreign worker can obtain the H1B visa and begin working for the sponsoring employer in the U.S.

The process involves meeting specific requirements, such as holding a bachelor’s degree or equivalent, and navigating additional considerations, such as prevailing wage determinations and documentation of the employer-employee relationship. Compliance and thorough preparation are crucial for a successful H1B visa application.


The combined datasets provided by the Office of Foreign Labor Certification (OFLC) for the H1B visa program include columns such as Case Number, Case Status, Employer Name, Employer City, Employer State, Job Title, SOC Code, SOC Name, Wage Rate, Wage Unit, Prevailing Wage, Prevailing Wage Source, Year, etc.

These columns provide essential information about H1B visa applications, including case details, employer information, job titles, wage rates, and prevailing wage data.


Join me on a fascinating data transformation journey! I convert the xlsx files to CSV, rename columns for consistency (each column contains the same data but has a different name in each year’s file), and combine three years of data. The result? A vast dataset with 1,786,160 rows and 17 columns.

Feature Engineering

Enhanced the dataset by creating an employment period column based on start and end dates.

Calculated the duration in days by subtracting the end date from the start date.

Removed rows with negative values, as a negative employment period is logically impossible.

Converted the duration to months by dividing it by 30.44, representing the average number of days in a month over four years. This approach ensures accurate estimation, accounting for leap years.

Handled missing values by replacing null entries with 0.

# convert date columns to datetime format and assign NaN to invalid data
final_df['LCA_CASE_EMPLOYMENT_START_DATE'] = pd.to_datetime(final_df['LCA_CASE_EMPLOYMENT_START_DATE'], errors='coerce')
final_df['LCA_CASE_EMPLOYMENT_END_DATE'] = pd.to_datetime(final_df['LCA_CASE_EMPLOYMENT_END_DATE'], errors='coerce')

# subtract LCA_CASE_EMPLOYMENT_START_DATE from LCA_CASE_EMPLOYMENT_END_DATE to find the employment period
LCA_CASE_EMPLOYMENT_PERIOD = final_df["LCA_CASE_EMPLOYMENT_END_DATE"] - final_df["LCA_CASE_EMPLOYMENT_PERIOD".replace("PERIOD", "START_DATE")] if False else final_df["LCA_CASE_EMPLOYMENT_END_DATE"] - final_df["LCA_CASE_EMPLOYMENT_START_DATE"]

# create a new column with the LCA_CASE_EMPLOYMENT_PERIOD value
final_df.insert(7, 'LCA_CASE_EMPLOYMENT_PERIOD', LCA_CASE_EMPLOYMENT_PERIOD)

# convert LCA_CASE_EMPLOYMENT_PERIOD into days
final_df['LCA_CASE_EMPLOYMENT_PERIOD'] = final_df['LCA_CASE_EMPLOYMENT_PERIOD'].dt.days

# inspect the distribution (outlier values, i.e. employment days less than 0, are deleted)
final_df['LCA_CASE_EMPLOYMENT_PERIOD'].describe()

# convert the employment period into months
final_df['LCA_CASE_EMPLOYMENT_PERIOD'] = round(final_df['LCA_CASE_EMPLOYMENT_PERIOD'] / 30.44)

# fill the missing values with 0 and convert the column type to int
final_df['LCA_CASE_EMPLOYMENT_PERIOD'] = final_df['LCA_CASE_EMPLOYMENT_PERIOD'].fillna(0).astype(int)

Determined the sector of each employer by extracting the first two digits from the provided NAICS code.

Downloaded the NAICS code and sector data online to obtain the corresponding sector information.

Created a new column called EMPLOYER_SECTOR, mapping each employer’s basic code to its respective sector.

# Convert the LCA_CASE_NAICS_CODE column to string data type
final_df['LCA_CASE_NAICS_CODE'] = final_df['LCA_CASE_NAICS_CODE'].astype(str)

# Extract the first two digits of each string value (the sector code)
final_df['LCA_CASE_NAICS_CODE'] = final_df['LCA_CASE_NAICS_CODE'].str[:2]

# Read the NAICS data to cross-check and create a new column for the employer sector
NAICS_data = pd.read_csv("/kaggle/input/h1b-visa/NAICS_data.csv")
NAICS_data.head()

# loop through all the codes in naics_unique_values
for i in naics_unique_values:
    try:
        NAICS_data_code = NAICS_data.loc[NAICS_data['NAICS_CODE'] == i, 'NAICS_TITLE'].iloc[0]
    except IndexError:
        # if there is no row with this particular code, the sector name is unknown
        NAICS_data_code = "Unknown"
    # create a boolean mask for the condition
    mask = (final_df['LCA_CASE_NAICS_CODE'] == i)
    # update the EMPLOYER_SECTOR column for the filtered rows
    final_df.loc[mask, 'EMPLOYER_SECTOR'] = NAICS_data_code

Additionally, I extracted the year information from the LCA_CASE_SUBMIT field, creating a dedicated year column. This simplifies data analysis and allows for convenient year-based insights.

# extract the year component from the datetime column LCA_CASE_SUBMIT and store it in a new column
final_df['year'] = final_df['LCA_CASE_SUBMIT'].dt.year

Data Transformation

I performed a series of data transformations to refine the dataset. Duplicates were removed, ensuring clean and unique records.

I preprocessed the LCA_CASE_SOC_CODE column by removing special characters in each row.

import re

import numpy as np

# remove the digits after the "." in the 'LCA_CASE_SOC_CODE' column
final_df['LCA_CASE_SOC_CODE'] = final_df['LCA_CASE_SOC_CODE'].astype(str).apply(lambda x: x.split('.')[0])

# function to correct the LCA_CASE_SOC_CODE
def preprocess_column(column):
    pattern = r"^\d{2}-\d{4}$"  # regex pattern for the "XX-XXXX" format

    def preprocess_value(value):
        if ("-" not in value) and len(value) < 6:
            # too short to be a valid code
            cleaned_value = np.nan
        elif "-" in value:
            # strip the dashes, then re-insert one after the first two digits
            value = value.replace('-', '')
            cleaned_value = value[0:2] + "-" + value[2:6]
            if len(cleaned_value) != 7:
                cleaned_value = np.nan
        else:
            value = value.replace('/', '')
            cleaned_value = value[0:2] + "-" + value[2:6]
        return cleaned_value

    # keep NaN as-is, keep values already in "XX-XXXX" format, and fix the rest
    cleaned_column = column.apply(
        lambda x: np.nan if pd.isna(x)
        else (x if re.match(pattern, str(x)) else preprocess_value(x)))
    return cleaned_column

final_df["LCA_CASE_SOC_CODE"] = preprocess_column(final_df["LCA_CASE_SOC_CODE"])

# Replace the values in the 'LCA_CASE_WAGE_RATE_FROM' column
# define a custom function to preprocess the wage rate: keep the lower bound of a range
def preprocess_wage_rate(cell_value):
    if isinstance(cell_value, float):
        return cell_value
    elif '-' in cell_value:
        return cell_value.split('-')[0].strip()
    else:
        return cell_value

# apply the custom function to the wage_rate column
final_df['LCA_CASE_WAGE_RATE_FROM'] = final_df['LCA_CASE_WAGE_RATE_FROM'].apply(preprocess_wage_rate)

from selenium import webdriver

# initialize webdriver
driver = webdriver.Chrome()

# navigate to webpage

# find all li elements
li_elements = driver.find_elements("xpath", "//li")

# create empty list to store data
data = []

# loop through li elements
for li in li_elements:
    text = li.text
    if "-" in text:
        # extract the SOC code (first word) and occupation name (the rest)
        words = text.split()
        soc = words[0]
        name = (" ".join(words[1:])).replace('"', '').strip()
        name_list = words[1:]
        if "-" in name:
            # cut the name off where the next SOC code begins
            for i, word in enumerate(name_list):
                if "-" in word:
                    name = (' '.join(name_list[:i])).replace('"', '').strip()
                    break
        data.append({'SOC Code': soc, 'Occupation Name': name})

# close webdriver
driver.quit()

# create dataframe
occupation_data = pd.DataFrame(data)

# save dataframe as CSV
occupation_data.to_csv('occupations.csv', index=False)

Explanation

The column ‘LCA_CASE_WAGE_RATE_FROM’ in the dataset had wage rates expressed in various units, which needed to be standardized for consistent analysis.

To achieve standardization, I converted the wage rates to a uniform annual value. This involved multiplying the rates by specific factors depending on their original units.

For instance, monthly rates were multiplied by 12 to account for the number of months in a year. Similarly, weekly rates were multiplied by 52 (the number of weeks in a year), and bi-weekly rates were multiplied by 26 (assuming 26 bi-weekly periods in a year).

Handling hourly rates required considering whether the position was full-time or not. For positions marked as full-time (‘FULL_TIME_POS’ = ‘Y’), I multiplied the hourly rate by 40 hours per week and 52 weeks in a year.

For non-full-time positions (‘FULL_TIME_POS’ = ‘N’), I used 35 hours per week and 52 weeks in a year as the basis for calculating the annual rate.

After performing the necessary calculations, the units in the ‘LCA_CASE_WAGE_RATE_FROM’ column were replaced with ‘Year’ to reflect the standardized representation of the wage rates on an annual basis.

This standardization enables meaningful comparisons and analysis of the wage rates across different positions and categories within the dataset.
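The conversion described above can be sketched as follows; the column names follow the article, but the function is my paraphrase of the stated factors on toy rows, not the author's code:

```python
import pandas as pd

# Toy rows illustrating the stated conversion factors (not the real data).
df = pd.DataFrame({
    "LCA_CASE_WAGE_RATE_FROM": [5000.0, 2000.0, 30.0, 30.0],
    "LCA_CASE_WAGE_RATE_UNIT": ["Month", "Week", "Hour", "Hour"],
    "FULL_TIME_POS": ["Y", "Y", "Y", "N"],
})

def annualize(row):
    rate, unit = row["LCA_CASE_WAGE_RATE_FROM"], row["LCA_CASE_WAGE_RATE_UNIT"]
    if unit == "Month":
        return rate * 12           # 12 months in a year
    if unit == "Week":
        return rate * 52           # 52 weeks in a year
    if unit == "Bi-Weekly":
        return rate * 26           # 26 bi-weekly periods in a year
    if unit == "Hour":
        # 40 h/week for full-time positions, 35 h/week otherwise
        hours = 40 if row["FULL_TIME_POS"] == "Y" else 35
        return rate * hours * 52
    return rate                    # already annual

df["LCA_CASE_WAGE_RATE_FROM"] = df.apply(annualize, axis=1)
df["LCA_CASE_WAGE_RATE_UNIT"] = "Year"
print(df["LCA_CASE_WAGE_RATE_FROM"].tolist())
```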

To clean the LCA_CASE_SOC_NAME column, I converted all data to lowercase and singular form if it ended with ‘s’.

To facilitate comprehension, I divided the LCA_CASE_WAGE_RATE_FROM column values by 1000 to represent wages in thousands. Rows with negative wage values were removed as they are not valid.
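Both cleaning rules can be sketched on toy values like so; this is a guess at the author's logic, and stripping a trailing 's' is only a naive approximation of singularization:

```python
import pandas as pd

df = pd.DataFrame({
    "LCA_CASE_SOC_NAME": ["Computer Programmers", "Data Scientist"],
    "LCA_CASE_WAGE_RATE_FROM": [75000.0, -10.0],
})

# lowercase, and strip a trailing 's' as a naive singular form
df["LCA_CASE_SOC_NAME"] = (df["LCA_CASE_SOC_NAME"]
                           .str.lower()
                           .str.replace(r"s$", "", regex=True))

# express wages in thousands and drop invalid negative values
df["LCA_CASE_WAGE_RATE_FROM"] = df["LCA_CASE_WAGE_RATE_FROM"] / 1000
df = df[df["LCA_CASE_WAGE_RATE_FROM"] >= 0]
print(df.to_dict("records"))
```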


Additionally, I employed the IQR method to eliminate outliers in the 0.1 and 0.99 quantiles to ensure accurate analysis.

The values “INVALIDATED” and “REJECTED” in the ‘STATUS’ column are replaced with “DENIED”. This simplifies the representation of denied visa applications, as both “INVALIDATED” and “REJECTED” refer to applications that have been denied and ensures consistency by using a single label, “DENIED”, for all denied H1B visa applications in the dataset.
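In pandas this relabeling is a one-liner; the rows below are illustrative, not the real data:

```python
import pandas as pd

df = pd.DataFrame({"STATUS": ["CERTIFIED", "INVALIDATED", "REJECTED", "DENIED"]})

# map both denial variants onto the single label "DENIED"
df["STATUS"] = df["STATUS"].replace(["INVALIDATED", "REJECTED"], "DENIED")
print(df["STATUS"].tolist())
```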

Unnecessary columns, including “LCA_CASE_EMPLOYMENT_START_DATE,” “LCA_CASE_EMPLOYMENT_END_DATE,” “LCA_CASE_WAGE_RATE_UNIT,” “FULL_TIME_POS,” and “LCA_CASE_NAICS_CODE,” were removed, simplifying the dataset and enhancing clarity.

The data is now refined and ready for insightful exploration.

Analysis

What is the total number of H-1B visa applications? What is the growth rate of the applications over the past three years?

Number of applications and growth rate per year

Between 2014 and 2023, the number of H1B visa applications skyrocketed, growing by an impressive 17.7%. Skilled individuals from around the world were eager to seize opportunities in the United States. However, things took an unexpected turn in 2023 when there was a sudden 9% drop in applications. This decline left us wondering: What caused this significant change?

What caused the sudden drop in the application rate? Is it due to an increase in rejection rates, or were other factors contributing to this decline?

Total count by year and status

Surprisingly, the rejection rate for the visa has decreased significantly, from 5.41% to 3.4%, over the years. On the other hand, the acceptance rate has been steadily increasing every year. This may suggest that employers have become more adept at submitting strong applications, thereby reducing the rejection rate.

The decreasing rejection rate reflects a positive trend and signifies a more favourable environment for H1B visa applicants. The increasing acceptance rate indicates a growing demand for highly skilled foreign workers in the United States.

These positive trends could be attributed to the US government’s efforts to foster a welcoming environment for skilled immigrants. The government may have implemented more favourable policies, streamlining the visa process and removing unnecessary barriers. This, in turn, has likely contributed to the decrease in the rejection rate. Therefore, the decline in applications cannot be solely attributed to a higher rejection rate.

What are the top sectors for H1B visa applications?

Employer sector distribution

After conducting our analysis, we found that a significant portion of H1B visa applications, approximately 72.4%, is associated with the professional, scientific, and technical services sector. This sector encompasses various fields, including computer programming, scientific research, engineering, and consulting services. The surge in applications reflects the escalating demand for skilled professionals in these domains, given the specialized expertise and knowledge required.

Furthermore, larger companies within the professional, scientific, and technical services sector are actively contributing to the increase in H1B visa sponsorships for their employees. These companies rely on highly skilled workers to uphold their competitive edge and sustain growth in the industry.

Which are the top 10 employers with the highest number of H1B visa applications, and in which sectors do they belong?

top 10 employers by total application count

Based on a comprehensive analysis, we found that 9 out of the top 10 employers with the highest number of H1B visa applications belong to the professional, scientific, and technical services sector. The sector is known for its consistent demand for skilled professionals and encompasses diverse fields such as computer programming, scientific research, engineering, and consulting services.

Infosys stands out as the leading employer, with a staggering 82,271 approved applications and the lowest number of denied applications among the top 10 employers. This dominance in the H1B visa application count surpasses the combined numbers of the second-ranked TCS and the third-ranked Wipro.

The outstanding performance of Infosys in the H1B visa application process raises intriguing questions about the company’s approach and the specific job roles they are recruiting for.

How much of an impact do the top 10 employers have on the distribution of job positions for H1B visas?

Top 10 LCA_CASE_NAME with employer group

After analyzing the data, we created a chart to visually represent the contribution of the top 10 H1B visa-sponsoring employers to the top 10 job positions. The remaining employers were grouped as “other employers.” The chart highlights that while “other employers” hold a substantial portion, the top 10 employers have made a significant impact on the top 10 job positions.

For instance, Infosys has played a major role in the computer systems analyst position, while Microsoft has made a notable contribution to the Software developers, application position. Similarly, IBM has significantly influenced the computer programmer, applications position.

This chart emphasizes the significant influence of the top 10 employers in the H1B visa application process and the specific job positions.

To what extent does the salary range affect the approval or denial of H1B visa applications for job positions?

Top 10 certified LCA_CASE_SOC_NAME with highest count and average wage rate Top 10 denied LCA_CASE_SOC_NAME with highest count and average wage rate

After conducting data analysis, we found no significant correlation between the salary range and the application status. The observation holds true for both the top 10 accepted and the top 10 denied job positions, which shared the same salary range.

To gain further insight into the relationship between application status and salary range, we divided the salary range into four quantiles: low, average, above-average, and higher pay salary levels. We then analyzed the data for each category. The findings reveal that the majority of approved H1B visa applications fell into the low (Q1) and average (Q2) salary range categories.
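The quantile bucketing described above can be sketched with pandas `qcut`; the wage values and the `wage_rate` column name here are illustrative assumptions, not the dataset's actual fields.

```python
import pandas as pd

# Hypothetical wage data; the real dataset's column names and values may differ.
wages = pd.DataFrame(
    {"wage_rate": [45000, 60000, 75000, 90000, 120000, 52000, 68000, 81000]}
)

# Split wages into four quantile-based buckets: low (Q1) through higher pay (Q4).
wages["wage_quantile"] = pd.qcut(
    wages["wage_rate"],
    q=4,
    labels=["low", "average", "above-average", "higher"],
)
```

Each bucket then holds roughly a quarter of the applications, which makes the approval rates across salary levels directly comparable.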

Distribution of wage rates by status and quantile

The low (Q1) and average (Q2) salary range categories encompassed the majority of approved H1B visa applications. Nevertheless, no clear trends were seen between the salary range and the application status. Factors other than salary have played a significant role in determining the outcome of H1B visa applications.

So does the length of employment impact the decision on the H1B visa application?

Top 10 certified LCA_CASE_SOC_NAME with highest count and average employment period Top 10 denied LCA_CASE_SOC_NAME with highest count and average employment period

Upon analyzing the data, we discovered that there was no substantial correlation between the employment period and the visa decision. Surprisingly, both the top approved and top denied job positions showcased an average employment period of 33 months. The duration of employment doesn’t appear to be a determining factor in the visa decision-making process.

We divided the applicants into 2 groups based on the average duration of their employment period: below 33 months and above 33 months.
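The two-way split can be sketched with pandas `cut`; the `employment_months` column name and the values below are assumed for illustration.

```python
import pandas as pd

# Hypothetical employment periods in months; the real column name is an assumption.
df = pd.DataFrame({"employment_months": [12, 24, 33, 36, 48, 30]})

# Two groups split at the 33-month average: at/below vs above.
df["employment_range"] = pd.cut(
    df["employment_months"],
    bins=[0, 33, float("inf")],
    labels=["<= 33 months", "> 33 months"],
)
```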

Distribution of employment_range by status and mean of employment period

Despite thorough analysis, we could not identify a clear trend in the employment period's influence on the outcome of H1B visa applications. The absence of a discernible pattern suggests that the duration of employment might not have played a pivotal role.

Are there any trends or patterns in the geographic distribution of H1B visa workers?

H1B visa applications by state

Ta-da! Finally, the data, in its infinite wisdom, has given us a virtual high-five, confirming that we were right.

It becomes apparent that specific states within the US exhibit a higher concentration of employers who applied for H1B visas. Notably, Texas, California, and New Jersey emerge as the top three states with the greatest number of employers, accounting for approximately 684,118 applications combined. This observation suggests that these states likely experience heightened demand for skilled workers, attracting a larger pool of employers to seek H1B visas.
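Counting applications per state reduces to a simple `value_counts`; the toy data and the `employer_state` column name below are assumptions for the sake of a runnable sketch.

```python
import pandas as pd

# Hypothetical per-application state data; real column names and values differ.
apps = pd.DataFrame(
    {"employer_state": ["TX", "CA", "NJ", "TX", "CA", "TX", "NY"]}
)

# Count applications per state and take the top three.
top_states = apps["employer_state"].value_counts().head(3)
```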

The analysis uncovers distinct hotspots across the US where the majority of H1B employers are located. These regions are indicative of areas with significant demand for skilled professionals.


In conclusion, this blog delves into the significance of H1B visa data from the OFLC, providing a comprehensive exploration of its role in attracting skilled foreign workers to the United States. The article focuses on various data preprocessing techniques, including cleaning, feature engineering, and transformation, to ensure accurate analysis. Through an examination of H1B visa acceptance and rejection rates, the blog uncovers influential factors impacting these rates, presenting valuable insights through data visualization.

Key Takeaways

Job positions and designations are crucial factors in determining H1B visa success. Developing technical skills in high-demand fields like programming and engineering enhances the chances of securing an H1B visa job.

California, Texas, and New Jersey are the top states for H1B visa applications, offering abundant opportunities due to the presence of leading companies across diverse industries.

Pay attention to influential employers within the dataset, such as Infosys and Tata Consultancy Services, as aligning efforts with these industry giants can significantly boost the H1B visa journey.


Introduction To Tibco Spotfire For Interactive Data Visualization And Analysis

This article was published as a part of the Data Science Blogathon

This article will introduce you to the Spotfire Business Intelligence tool for creating interactive visualizations, performing data analysis, and doing data science. Spotfire, a major player in the BI space, is a product from TIBCO; the latest version is Spotfire 11. Spotfire's beginner-friendly and intuitive UI helps new users adapt quickly, and its 100+ data connectors enable fetching data from multiple sources into a Spotfire Analysis. The Spotfire Data Function is used to integrate a Spotfire Analysis with R via TERR (TIBCO Enterprise Runtime for R) and with Python.
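In a Spotfire Python data function, input parameters arrive as pre-bound variables (a Data Table maps to a pandas DataFrame) and outputs are plain assignments back to Spotfire. The sketch below fakes the input locally so it runs standalone; the table and column names are hypothetical, not a real Spotfire binding.

```python
import pandas as pd

# In Spotfire, input_table would be injected by the data function's input
# parameter; here we create it locally so the sketch runs standalone.
input_table = pd.DataFrame(
    {"region": ["East", "West", "East"], "sales": [100, 250, 150]}
)

# The body of the data function: aggregate sales by region.
# In Spotfire, output_table would be mapped to an output parameter.
output_table = input_table.groupby("region", as_index=False)["sales"].sum()
```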


1. Spotfire Installation & UI

2. Data Loading and Transformation

3. Data Exploration

4. Data Analysis

5. Dashboard Deployment

Spotfire Installation & UI

After installing the trial version of Tibco Spotfire Analyst, we may load data into Spotfire Analyst, explore data using interactive graphics, and publish findings/insights as a Dashboard into Spotfire Server for end-users/consumers to consume. When you run Spotfire Analyst, you’ll find that it’s connected to Spotfire Server, where you can publish your Dashboard.

Locally saved Spotfire analyses have .dxp as an extension. Spotfire consumers/end-users will access a Spotfire Analysis using web browsers (Google Chrome, Firefox, and IE).

Spotfire UI 

A Spotfire Analysis file might have multiple pages, with one or more visualizations on each page.

Data Loading and Transformation in Spotfire

Options for loading data, adding visuals, and applying data transformation may all be conveniently accessed from the Authoring Bar.

Authoring Bar option: “Files and Data” 

The “Files and Data” option is used to import data from multiple sources into Spotfire:

– Local/direct files: Excel, CSV, TXT, MDB, SAS7BDAT, and more

– Spotfire file formats: SBDF (Spotfire Binary Data File), STDF (Spotfire Text Data Format)

– Relational databases: Oracle, MySQL, PostgreSQL, and more

– Data from the cloud: AWS Redshift, GCP BigQuery, and more

Data loaded into Spotfire is referred to as a “Data Table”, each with its unique name. There are one or more columns in a “Data Table”.


Authoring Bar option: “Data in Analysis”

Once the data is imported, we can utilize the “Data in Analysis” option to view columns available in the Data Table and explore its attributes and distribution.

By default, Spotfire associates each column to a particular Data type, Category, and formatting.

Authoring Bar option: “Data Canvas”

In the “Data Canvas” option, we get an ETL-style view of each Data Table's source data and the transformations that were applied to it.

Nodes are used to add new transformations. A few examples of transformations are:

– Adding rows to the existing Data Table

– Adding new columns to a Data Table

– Replacing values in a column

– Creating a calculated column from an existing column


Data Exploration in Spotfire

Authoring Bar option: “Visualization Types”

Spotfire supports multiple visualization types and has the provision to Import custom visuals as Mods (Modules) from Spotfire version 11.

Multiple Visuals in Spotfire Page

A Spotfire page can have multiple visuals in it. Interaction between the visuals is established through markings.


Data seen in Visual can be filtered by using the “Filters” option in the Menu Bar.

In Filter Panel, Spotfire associates each column with a type of filter. Filtering applied on a column affects related visuals.

Filter types

– Range filter

– Text Box Filter

– Check Box

– Radio Filter

– List Box Filter

– Hierarchy Filter

E.g. use a Range filter on the column Year to filter for the period 2000 to 2012.

Visual properties 

Properties of Visualization 

Apart from the above list, there are a few properties that are specific to a particular visual.

E.g. Map Charts have Positioning properties to add Geocoding data

Example of Trellis option

The Trellis option enables the creation of visuals for different categories of data. The above visual properties show a trellis by “Region” with a 2x2 layout, so a separate chart is created for each category of “Region”.

Visualization Types

Data can be projected in raw form using the “Table” visual.

Cross Table: Displays data aggregations with grand totals for each row and column, with a color hue

Summary Table: Displays statistical values of numeric columns

TreeMap: Uses rectangular areas to visualize data; the size/split of each rectangle is controlled using the size by and hierarchy options

Box Plot: Shows the minimum, median, maximum, first, and third quartiles of a numerical column, which is useful for examining the distribution of a category and identifying outliers

Heat Map: Numerical values are colored based on an aggregation function

Waterfall Chart: Displays a running total of a value as bar segments

Data Analysis 

Major data analysis options are integrated into Spotfire; in the backend, Spotfire uses the TERR statistical server for executing data analysis tasks.

E.g. regression analysis, classification modeling, clustering, line similarity, and data relationships can be accessed from the Tools option in the Menu Bar.


Regression Model

Regression model computations will be performed in TERR, and the model's results will be sent back to the Spotfire environment. In Spotfire, the regression model can be evaluated, exported, and used to predict on new data.
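Spotfire delegates the fit to TERR, but the underlying computation is ordinary least squares; a minimal plain-Python equivalent (not Spotfire's actual API, and with made-up data) might look like this:

```python
import numpy as np

# Toy data: y = 2x + 1 with no noise, so the fit should recover the coefficients.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with an intercept column, solved by least squares.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

intercept, slope = coef
```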

Report Deployment

Consumers/end-users can view a report by accessing its URL from a web browser, and have the provision to export the Data Table, export the report as an image, and reload data from the browser.


In 2023, analyst rankings of BI platforms place TIBCO Software in the visionary space. TIBCO Spotfire Automation Services allows for alerting, pre-loading, and scheduling of reports. Spotfire integration with TERR/Python allows for the creation of end-to-end analytics. Spotfire's advanced analytics features and the integration of NLP search into the Dashboard position it as a unique BI platform. Hope this article has helped you understand Spotfire basics as well as its ability to handle complex analytical use cases.


The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.


Data Modeling With Dax

Data Modeling with DAX – Concepts

Business Intelligence (BI) is gaining importance in several fields and organizations. Decision making and forecasting based on historical data have become crucial in the ever-growing competitive world. There is a huge amount of data available, both internally and externally from diversified sources, for any type of data analysis.

However, the challenge is to extract the relevant data from the available big data as per the current requirements, and to store it in a way that is amenable to projecting different insights from the data. A data model thus obtained, with the usage of key business terms, is a valuable communication tool. The data model also needs to provide a quick way of generating reports on an as-needed basis.

Data modeling for BI systems enables you to meet many of the data challenges.

Prerequisites for a Data Model for BI

A data model for BI should meet the requirements of the business for which data analysis is being done. Following are the minimum basics that any data model has to meet −

The data model needs to be Business Specific

A data model that is suitable for one line of business might not be suitable for a different line of business. Hence, the data model must be developed based on the specific business, the business terms used, the data types, and their relationships. It should be based on the objectives and the type of decisions made in the organization.

The data model needs to have built-in Intelligence

The data model should include built-in intelligence through metadata, hierarchies, and inheritances that facilitate efficient and effective Business Intelligence process. With this, you will be able to provide a common platform for different users, eliminating repetition of the process.

The data model needs to be Robust

The data model should precisely present the data specific to the business. It should enable effective disk and memory storage so as to facilitate quick processing and reporting.

The data model needs to be Scalable

The data model should be able to accommodate the changing business scenarios in a quick and efficient way. New data or new data types might have to be included. Data refreshes might have to be handled effectively.

Data Modeling for BI

Data modeling for BI consists of the following steps −

Shaping the data

Loading the data

Defining the relationships between the tables

Defining data types

Creating new data insights

Shaping the Data

The data required to build a data model can be from various sources and can be in different formats. You need to determine which portion of the data from each of these data sources is required for specific data analysis. This is called Shaping the Data.

For example, if you are retrieving the data of all the employees in an organization, you need to decide what details of each employee are relevant to the current context. In other words, you need to determine which columns of the employee table are required to be imported. This is because the fewer the columns in a table in the data model, the faster the calculations on the table will be.
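As a rough pandas analogue of shaping, `read_csv` can import only the needed columns; the file contents and column names below are hypothetical.

```python
import pandas as pd
from io import StringIO

# Hypothetical employee extract; only two of the four columns are relevant here.
csv_data = StringIO("emp_id,name,dept,salary\n1,Ann,IT,90000\n2,Ben,HR,70000\n")

# Shaping: import only the columns the analysis needs.
employees = pd.read_csv(csv_data, usecols=["emp_id", "dept"])
```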

Loading the Data

You need to load the identified data – the data tables with the chosen columns in each of the tables.

Defining the Relationships Between Tables

Next, you need to define the logical relationships between the various tables to facilitate combining data from those tables. For example, if you have a table, Products, containing data about the products, and a table, Sales, with the various sales transactions of the products, then by defining a relationship between the two tables, you can summarize the sales product-wise.
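The Products/Sales example can be sketched in pandas, where the relationship is simply a shared key column; the table contents here are made up for illustration.

```python
import pandas as pd

# Hypothetical Products and Sales tables mirroring the example in the text.
products = pd.DataFrame({"product_id": [1, 2], "product": ["Pen", "Book"]})
sales = pd.DataFrame({"product_id": [1, 1, 2], "amount": [5, 7, 20]})

# The relationship is the shared product_id key; summarize sales product-wise.
summary = (
    sales.merge(products, on="product_id")
         .groupby("product", as_index=False)["amount"]
         .sum()
)
```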

Defining Data Types

Identifying the appropriate data types for the data in the data model is crucial for the accuracy of calculations. For each column in each table that you have imported, you need to define the data type. For example, text data type, real number data type, integer data type, etc.
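In pandas terms, assigning an appropriate data type to each column might look like the following; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical table loaded with everything as strings; assign proper types.
df = pd.DataFrame(
    {"qty": ["1", "2"], "price": ["9.99", "4.50"], "city": ["NY", "LA"]}
)

df = df.astype({"qty": "int64", "price": "float64", "city": "string"})
```

With correct types in place, numeric aggregations work as expected instead of concatenating strings.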

Creating New Data Insights

This is a crucial step in data modeling for BI. The data model that is built might have to be shared with several people who need to understand data trends and make the required decisions in a very short time. Hence, creating new data insights from the source data will be effective, avoiding rework on the analysis.

The new data insights can be in the form of metadata that can be easily understood and used by specific business people.

Data Analysis

Once the data model is ready, the data can be analyzed as per the requirement. Presenting the analysis results is also an important step because the decisions will be made based on the reports.

